An obvious choice for parallelisation is the calculation of a finite difference Hessian. Each entry in the Hessian matrix is calculated using the difference of two gradients. Using a forward difference algorithm, an N-atom system requires 3N+1 independent gradient evaluations (one at the reference geometry plus one per displaced Cartesian coordinate). With a central difference algorithm this rises to 6N evaluations.
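These counts can be checked with a few lines of arithmetic (a minimal sketch; the function names are illustrative and not part of ChemShell):

```python
def forward_diff_gradients(n_atoms: int) -> int:
    """Forward difference: one reference gradient plus one per Cartesian coordinate."""
    return 3 * n_atoms + 1

def central_diff_gradients(n_atoms: int) -> int:
    """Central difference: two displaced gradients per Cartesian coordinate."""
    return 6 * n_atoms

# The 57-atom cluster used in the benchmark below:
print(forward_diff_gradients(57))  # → 172
print(central_diff_gradients(57))  # → 342
```

The forward difference count of 172 for a 57-atom system matches the total number of gradient evaluations quoted in the benchmark discussion.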
In the original ChemShell implementation the gradient calculations and Hessian evaluation are performed using a single Tcl command (force). In the task-farmed version this command has been split into three stages to facilitate parallelisation. In the first stage (force_precalc), the required set of gradients is calculated and stored on disk as ChemShell objects. This work can be divided among the workgroups and carried out in parallel using the option task_atoms with a list of atoms. In the second stage the ChemShell gradient objects are made available to all workgroups using the command taskfarm_globalise_forcegradients. Finally, the Hessian matrix is evaluated from the precalculated gradients (using force with the option precalc=yes). The Hessian calculation can be restricted to a single workgroup if desired by a conditional test on the workgroup ID.
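The three stages might be combined in a ChemShell input script along the following lines. This is a hedged sketch: only force_precalc, taskfarm_globalise_forcegradients, force, task_atoms and precalc=yes are taken from the text, while the theory argument and the variables $my_atom_list and $workgroup_id are illustrative placeholders.

```tcl
# Stage 1: each workgroup computes the gradients for its share of the atoms.
# $my_atom_list is a placeholder for this workgroup's portion of the atom list.
force_precalc coords=c theory=gamess task_atoms=$my_atom_list

# Stage 2: make the stored gradient objects available to every workgroup.
taskfarm_globalise_forcegradients

# Stage 3: assemble the Hessian from the precalculated gradients,
# restricted here to a single workgroup by a test on its ID ($workgroup_id
# is a placeholder for however the ID is obtained).
if { $workgroup_id == 0 } {
    force coords=c theory=gamess precalc=yes
}
```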
The 57-atom silicate-VO cluster shown in Figure 3 was used to assess the performance of the task-farmed implementation. Energies and gradients were calculated using GAMESS-UK with the B3LYP functional [11]. Two basis sets were used: the LANL2 effective core potential basis [12] (giving 413 basis functions) and the TZVP [13,14] all-electron basis (1032 basis functions). The larger basis set was included to fully assess the PeIGS build of GAMESS-UK, as PeIGS is only used to diagonalise matrices larger than the total number of processors.

The full forward difference Hessian was evaluated using a set of 1024-processor calculations with differing numbers of workgroups. The tasks were parallelised using a simple static load-balancing scheme in which, as far as possible, an equal number of gradient calculations was assigned to each workgroup. As each gradient calculation should take approximately the same amount of time (apart from the first, where no wavefunction guess is provided), no major gains would be expected from a more sophisticated load-balancing mechanism.
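The static scheme amounts to splitting the 172 gradient tasks as evenly as possible across the workgroups, for example by giving each workgroup a contiguous block of task indices (a minimal sketch; the function name is illustrative):

```python
def static_partition(n_tasks: int, n_workgroups: int) -> list[range]:
    """Assign contiguous, near-equal blocks of task indices to each workgroup."""
    base, extra = divmod(n_tasks, n_workgroups)
    blocks, start = [], 0
    for w in range(n_workgroups):
        size = base + (1 if w < extra else 0)  # first `extra` groups get one extra task
        blocks.append(range(start, start + size))
        start += size
    return blocks

# 172 gradient tasks over 64 workgroups: 44 groups of 3 tasks and 20 groups of 2.
sizes = [len(b) for b in static_partition(172, 64)]
print(min(sizes), max(sizes))  # → 2 3
```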
LANL2 ECP basis

Workgroups   Procs/workgroup   Time / s   Speedup
     1             1024          7896
     2              512          4354       1.8
     4              256          2444       3.2
     8              128          1665       4.7
    16               64          1290       6.1
    32               32          1176       6.7
    64               16          1151       6.9
   128                8          2165       3.7
TZVP basis

Workgroups   Procs/workgroup   Time / s   Speedup
     1             1024         52762
    64               16          7812       6.8
Speedup factors are calculated by comparison with the single-workgroup calculation, as it is the slowest. For the LANL2 ECP basis set substantial speedups are seen up to a maximum of 64 workgroups (with 16 processors per workgroup), where a speedup factor of almost 7 is achieved. The task-farmed approach is therefore considerably more efficient than using the parallel routines in GAMESS-UK alone. Further gains were not achieved by going beyond 64 workgroups. This is firstly because a larger number of workgroups means that a larger proportion of the calculations do not benefit from an initial wavefunction guess (although for a non-benchmark calculation this could be provided using a preliminary single point evaluation step). Secondly, only 172 gradient evaluations in total are required, and therefore the load is not efficiently balanced in the 128-workgroup calculation. For larger systems it may be advantageous to use 128 or more workgroups.
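The load imbalance at 128 workgroups follows directly from the task counts: under static balancing the wall time is set by the busiest workgroup, and with 172 tasks that workgroup must still run two sequential gradient evaluations whether 86 or 128 workgroups are used (illustrative arithmetic only):

```python
import math

def max_tasks_per_workgroup(n_tasks: int, n_workgroups: int) -> int:
    """Wall time under static balancing is governed by the busiest workgroup."""
    return math.ceil(n_tasks / n_workgroups)

# 172 forward difference gradient tasks for the 57-atom cluster:
print(max_tasks_per_workgroup(172, 64))   # → 3
print(max_tasks_per_workgroup(172, 128))  # → 2 (44 workgroups get 2 tasks, 84 get 1)
```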
For the TZVP basis set, calculations were performed using a single workgroup and 64 workgroups. Similar results are seen, with the 64-workgroup calculation again achieving a speedup of approximately 7. This indicates that the efficiency gains remain even when large matrix diagonalisations are involved.
Tom Keal, 2010-06-29