Finite difference Hessian

An obvious choice for parallelisation is the calculation of a finite difference Hessian. Each entry in the Hessian matrix is calculated from the difference of two gradients. With a forward difference algorithm, an $N$-atom system requires $3N+1$ independent gradient evaluations: one at the reference geometry and one for a displacement along each of the $3N$ Cartesian coordinates. With a central difference algorithm this rises to $6N$ evaluations, as each coordinate is displaced in both directions.
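As a minimal sketch of the forward difference scheme described above (not the ChemShell implementation), the following Python code builds a Hessian from a reference gradient and one displaced gradient per coordinate. The function names, the step size and the toy quadratic energy are assumptions made purely for illustration.

```python
import numpy as np

def forward_difference_hessian(gradient, x0, step=1e-4):
    """Build a Hessian from one reference gradient plus one displaced
    gradient per coordinate (forward differences)."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size                       # n = 3N Cartesian coordinates
    g0 = gradient(x0)                 # reference gradient
    hessian = np.empty((n, n))
    for i in range(n):                # each iteration is an independent task
        displaced = x0.copy()
        displaced[i] += step
        hessian[:, i] = (gradient(displaced) - g0) / step
    return 0.5 * (hessian + hessian.T)  # symmetrise to reduce numerical noise

def gradient_count(n_atoms, scheme="forward"):
    """Number of independent gradient evaluations for an N-atom system."""
    return 3 * n_atoms + 1 if scheme == "forward" else 6 * n_atoms

# Toy check: for E = 0.5 x^T A x the gradient is A x and the Hessian is A.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 3.0]])
hess = forward_difference_hessian(lambda x: A @ x, np.zeros(3))
```

The loop body does not depend on any other iteration, which is what makes the displaced gradients suitable for distribution across workgroups.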

In the original ChemShell implementation the gradient calculations and Hessian evaluation are performed using a single Tcl command (force). In the task-farmed version this command has been split up into three stages to facilitate parallelisation. In the first stage (force_precalc), the required set of gradients is calculated and stored on disk as ChemShell objects. This work can be divided up among the workgroups to be carried out in parallel using the option task_atoms with a list of atoms. In the second stage the ChemShell gradient objects are made available to all workgroups using the command taskfarm_globalise_forcegradients. Finally, the Hessian matrix is evaluated using the pre-calculated gradients (using force with the option precalc=yes). The Hessian calculation can be restricted to a single workgroup if desired by a conditional test on the workgroup ID.

Figure 3: The silicate-VO$_3$ cluster used for the Hessian benchmark calculations.
The 57-atom silicate-VO$_3$ cluster shown in Figure 3 was used to assess the performance of the task-farmed implementation. Energies and gradients were calculated using GAMESS-UK with the B3LYP functional [11]. Two basis sets were used: the LANL2 effective core potential basis [12] (giving 413 basis functions) and the TZVP [13,14] all-electron basis (1032 basis functions). The larger basis set was included to fully assess the PeIGS build of GAMESS-UK, as PeIGS is only used to diagonalise matrices of dimension larger than the total number of processors.

Figure 4: Calculation time in wall clock seconds for a single point energy and gradient evaluation of a 57-atom silicate-VO$_3$ system using the LANL2 ECP and TZVP basis sets.
To give an indication of the time required for the full Hessian calculation, single-point calculations were carried out with differing numbers of processors. The results are shown in Figure 4. If perfect scaling were achieved the calculation time would halve with each doubling of the number of processors (and a second level of parallelism would be unnecessary). For both basis sets reasonable scaling is achieved up to approximately 128 processors, but for larger processor counts the gains are very small. This suggests that large efficiency gains should be possible using task-farmed calculations.

The full forward difference Hessian was evaluated using a set of 1024-processor calculations with differing numbers of workgroups. The tasks were parallelised using a simple static load-balancing scheme in which, as far as possible, an equal number of gradient calculations was assigned to each workgroup. As each gradient calculation should take approximately the same amount of time (apart from the first, for which no wavefunction guess is provided), no major gains would be expected from a more sophisticated load-balancing mechanism.
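A static scheme of this kind can be sketched in a few lines of Python. This is an illustration only: it distributes individual gradient evaluations and ignores how ChemShell's task_atoms option groups displacements by atom, and the function name is hypothetical.

```python
def static_partition(n_tasks, n_workgroups):
    """Split task indices 0..n_tasks-1 into contiguous blocks whose
    sizes differ by at most one."""
    base, extra = divmod(n_tasks, n_workgroups)
    blocks, start = [], 0
    for wg in range(n_workgroups):
        size = base + (1 if wg < extra else 0)  # first `extra` groups get one more
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

# 172 gradient evaluations (3N+1 with N = 57) shared among 64 workgroups.
blocks = static_partition(172, 64)
sizes = [len(b) for b in blocks]
```

Because the assignment is fixed before any gradient is computed, no communication between workgroups is needed while the gradients are being evaluated.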

Table 1: Calculation time in wall clock seconds for a forward difference Hessian matrix evaluation using 1024 processors divided into workgroups. The speed-up factor is relative to the single workgroup calculation.

LANL2 ECP basis

Workgroups   Procs/workgroup   Time / s   Speed-up
         1              1024       7896   (reference)
         2               512       4354   1.8
         4               256       2444   3.2
         8               128       1665   4.7
        16                64       1290   6.1
        32                32       1176   6.7
        64                16       1151   6.9
       128                 8       2165   3.7

TZVP basis

Workgroups   Procs/workgroup   Time / s   Speed-up
         1              1024      52762   (reference)
        64                16       7812   6.8

The results are shown as wall clock times in Table 1. To interpret the results correctly, it is important to keep in mind that all calculations were run with the same total number of processors and only the division into workgroups was changed. As the number of workgroups increases, the number of processors in each workgroup falls proportionally. The calculation with the highest speed-up factor therefore gives the best balance between parallelisation of individual gradient evaluations and parallelisation of the Hessian as a whole. This differs from the benchmarking of single-level parallelism, where the scaling of calculation time with the number of processors is used to measure efficiency. There is no scaling in this sense in Table 1, as the change in the speed-up with the number of workgroups approaches zero at peak efficiency. If the number of workgroups is too high, the speed-up begins to fall again.
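The speed-up factors for the LANL2 ECP basis can be reproduced from the wall clock times in Table 1 with a short Python sketch. The times are taken from the table; the dictionary and function names are illustrative and not part of ChemShell.

```python
# Wall clock times in seconds from Table 1 (LANL2 ECP basis),
# keyed by the number of workgroups.
lanl2_times = {1: 7896, 2: 4354, 4: 2444, 8: 1665,
               16: 1290, 32: 1176, 64: 1151, 128: 2165}

def speed_ups(times):
    """Speed-up of each run relative to the single workgroup run."""
    reference = times[1]
    return {wg: reference / t for wg, t in times.items()}

factors = speed_ups(lanl2_times)
best = max(factors, key=factors.get)  # workgroup count with the highest speed-up
```

Here `best` identifies the division into workgroups that gives the shortest wall clock time for the fixed total of 1024 processors.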

Speed-up factors are calculated by comparison with the single workgroup calculation, as it is the slowest. For the LANL2 ECP basis set, substantial speed-ups are seen up to a maximum at 64 workgroups (with 16 processors per workgroup), where a speed-up factor of almost 7 is achieved. The task-farmed approach is therefore considerably more efficient than using the parallel routines in GAMESS-UK alone. No further gains were achieved by going beyond 64 workgroups. This is firstly because a larger number of workgroups means that a larger proportion of the calculations do not benefit from an initial wavefunction guess (although for a non-benchmark calculation this could be provided using a preliminary single point evaluation step). Secondly, only 172 gradient evaluations in total are required ($3N+1$ with $N=57$), so the load is not efficiently balanced in the 128 workgroup calculation: with an even distribution, 44 workgroups must carry out two gradient evaluations while the remaining 84 carry out only one. For larger systems it may be advantageous to use 128 or more workgroups.
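The load-balancing argument can be made quantitative with a simple estimate, sketched here in Python. It assumes that every gradient evaluation takes the same time within a given run and ignores the cost of the missing wavefunction guess; the function name and the efficiency measure are assumptions for illustration.

```python
import math

def load_balance(n_tasks, n_workgroups):
    """Return (rounds, efficiency) for an even static distribution.

    rounds:     largest number of tasks any workgroup runs in sequence
    efficiency: fraction of workgroup time slots doing useful work,
                assuming equal time per task
    """
    rounds = math.ceil(n_tasks / n_workgroups)
    efficiency = n_tasks / (rounds * n_workgroups)
    return rounds, efficiency

rounds_64, eff_64 = load_balance(172, 64)     # 3 rounds, about 90% of slots used
rounds_128, eff_128 = load_balance(172, 128)  # 2 rounds, about 67% of slots used
```

Under these assumptions, doubling the number of workgroups from 64 to 128 reduces the number of sequential gradient evaluations only from three to two, while each evaluation runs on half as many processors.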

For the TZVP basis set, calculations were performed using a single workgroup and 64 workgroups. Similar results are seen, with the 64 workgroup calculation again achieving a speed-up of approximately 7 (6.8 in Table 1). This indicates that the efficiency gains remain even when large matrix diagonalisations are involved.

Tom Keal 2010-06-29