An obvious choice for parallelisation is the calculation of a finite difference
Hessian. Each entry in the Hessian matrix is calculated using the difference of
two gradients. Using a forward difference algorithm, an $N$-atom system requires
$3N+1$ independent gradient evaluations. With a central difference algorithm this rises to
$6N$ evaluations.
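The gradient counts above follow from displacing each of the $3N$ Cartesian coordinates once (forward differences, plus one gradient at the reference geometry) or twice (central differences). The following is a minimal NumPy sketch, assuming a generic gradient callable and a uniform step `h`. Both are illustrative choices, and this is not ChemShell's implementation.

```python
import numpy as np

def fd_hessian(grad, x0, h=1e-4, central=False):
    """Finite difference Hessian built from gradient evaluations (illustrative sketch).

    grad    -- callable returning the gradient (length 3N) at a flat coordinate vector
    x0      -- reference geometry as a flat array of 3N Cartesian coordinates
    h       -- displacement step (assumed value, not taken from ChemShell)
    central -- False: forward differences (3N + 1 gradients); True: central (6N)
    Returns (symmetrised Hessian, number of gradient evaluations).
    """
    x0 = np.asarray(x0, dtype=float)
    n = x0.size                                  # n = 3N coordinates
    H = np.empty((n, n))
    calls = 0
    if not central:
        g0 = grad(x0)                            # gradient at the reference geometry
        calls += 1
    for i in range(n):
        step = np.zeros(n)
        step[i] = h                              # displace coordinate i only
        if central:
            H[i] = (grad(x0 + step) - grad(x0 - step)) / (2 * h)
            calls += 2
        else:
            H[i] = (grad(x0 + step) - g0) / h
            calls += 1
    return 0.5 * (H + H.T), calls                # symmetrise to damp numerical noise
```

For a 57-atom system, `x0` has 171 entries, so the forward scheme makes 172 gradient calls and the central scheme 342. Every call in the loop is independent of the others, which is what makes the calculation a natural candidate for task farming.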
In the original ChemShell implementation the gradient calculations and the Hessian evaluation
are performed using a single Tcl command (`force`).
In the task-farmed version this command has been
split up into three stages to facilitate parallelisation. In the first stage (`force_precalc`), the
required set of gradients is calculated and stored on disk as ChemShell objects. This
work can be divided up among the workgroups to be carried out in parallel using
the option `task_atoms` with a list of atoms. In the second
stage the ChemShell gradient objects are made available to all workgroups using
the command `taskfarm_globalise_forcegradients`.
Finally, the Hessian matrix is evaluated using the
pre-calculated gradients (using `force` with the option `precalc=yes`).
If desired, the Hessian evaluation can be restricted to a single workgroup by a conditional
test on the workgroup ID.
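The three-stage split can be mimicked in a few lines. The sketch below is a serial Python stand-in, not ChemShell code. `precalc`, `globalise`, `hessian_from_precalc` and `run` are hypothetical names loosely mirroring `force_precalc`, `taskfarm_globalise_forcegradients` and `force` with `precalc=yes`, and an in-memory dictionary replaces the gradient objects stored on disk. For simplicity it assumes the unit of work is a single displaced gradient handed out round-robin, whereas the real `task_atoms` option takes a list of atoms.

```python
import numpy as np

def precalc(wg_id, n_wg, grad, x0, h):
    """Stage 1 (hypothetical stand-in for force_precalc): this workgroup's share of gradients."""
    jobs = [-1] + list(range(x0.size))   # -1 = reference geometry, i = displace coordinate i
    store = {}
    for job in jobs[wg_id::n_wg]:        # static round-robin share for workgroup wg_id
        x = x0.copy()
        if job >= 0:
            x[job] += h
        store[job] = grad(x)             # ChemShell would write a gradient object to disk here
    return store

def globalise(stores):
    """Stage 2 (hypothetical stand-in for taskfarm_globalise_forcegradients): merge all shares."""
    merged = {}
    for store in stores:
        merged.update(store)
    return merged

def hessian_from_precalc(merged, h):
    """Stage 3 (hypothetical stand-in for force precalc=yes): forward-difference Hessian."""
    g0 = merged[-1]
    H = np.array([(merged[i] - g0) / h for i in range(g0.size)])
    return 0.5 * (H + H.T)

def run(n_wg, grad, x0, h=1e-4):
    x0 = np.asarray(x0, dtype=float)
    # Stage 1: executed concurrently by the workgroups in a real task farm.
    stores = [precalc(wg, n_wg, grad, x0, h) for wg in range(n_wg)]
    merged = globalise(stores)
    H = None
    for wg in range(n_wg):               # each iteration stands in for one workgroup
        if wg == 0:                      # conditional test on the workgroup ID
            H = hessian_from_precalc(merged, h)
    return H
```

Separating the stages means that only stage 1, which contains all of the expensive electronic structure work, needs to be distributed. Stage 3 is cheap once every gradient is visible to the workgroup that performs it.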
Figure: The 57-atom silicate-VO system.
The full forward difference Hessian was evaluated using a set of 1024-processor calculations with differing numbers of workgroups. The tasks were parallelised using a simple static load-balancing scheme in which, as far as possible, an equal number of gradient calculations was assigned to each workgroup. As each gradient calculation should take approximately the same amount of time (apart from the first in each workgroup, for which no initial wavefunction guess is available), no major gains would be expected from a more sophisticated load-balancing mechanism.
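The static scheme described above can be sketched as a contiguous block partition in which block sizes differ by at most one. This is a minimal sketch, assuming the unit of work is a single gradient. The function name `static_partition` is hypothetical, and the actual assignment order used in ChemShell is not specified here.

```python
def static_partition(n_tasks, n_workgroups):
    """Split task indices 0..n_tasks-1 into contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n_tasks, n_workgroups)
    blocks, start = [], 0
    for wg in range(n_workgroups):
        size = base + (1 if wg < extra else 0)   # the first `extra` workgroups take one more task
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

# Example: 172 gradients over 64 workgroups.
sizes = [len(block) for block in static_partition(172, 64)]
print(max(sizes), min(sizes))
```

Because no task is reassigned at run time, the wall time is set by the workgroup with the largest block. This is harmless while the gradients cost roughly the same, which is the assumption behind the statement above.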
LANL2 ECP basis set:

| Workgroups | Processors per workgroup | Time (s) | Speed-up |
|---|---|---|---|
| 1 | 1024 | 7896 | 1.0 |
| 2 | 512 | 4354 | 1.8 |
| 4 | 256 | 2444 | 3.2 |
| 8 | 128 | 1665 | 4.7 |
| 16 | 64 | 1290 | 6.1 |
| 32 | 32 | 1176 | 6.7 |
| 64 | 16 | 1151 | 6.9 |
| 128 | 8 | 2165 | 3.7 |

TZVP basis set:

| Workgroups | Processors per workgroup | Time (s) | Speed-up |
|---|---|---|---|
| 1 | 1024 | 52762 | 1.0 |
| 64 | 16 | 7812 | 6.8 |
Speed-up factors are calculated relative to the single-workgroup calculation, which is the slowest. For the LANL2 ECP basis set, substantial speed-ups are seen up to 64 workgroups (with 16 processors per workgroup), where a speed-up factor of almost 7 is achieved. The task-farmed approach is therefore considerably more efficient than using the parallel routines in GAMESS-UK alone. Going beyond 64 workgroups brought no further gain, and the 128-workgroup calculation is considerably slower. There are two reasons for this. Firstly, the first gradient calculation in each workgroup has no initial wavefunction guess, so a larger number of workgroups means that a larger proportion of the calculations do not benefit from one (although for a non-benchmark calculation a guess could be provided by a preliminary single point evaluation step). Secondly, only 172 gradient evaluations are required in total (three displaced gradients for each of the 57 atoms, plus the reference gradient), so the load cannot be balanced efficiently across 128 workgroups: even the best possible distribution leaves 44 workgroups performing two gradient evaluations while the remaining 84 perform only one. For larger systems it may be advantageous to use 128 or more workgroups.
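Both effects can be quantified without any timing data. The following is a small sketch, assuming one gradient per task, an equal static split, and that exactly the first gradient in each active workgroup lacks a wavefunction guess. The function name `task_farm_metrics` is hypothetical. The model ignores how GAMESS-UK itself scales within a workgroup, so it indicates trends rather than predicting the measured times.

```python
import math

def task_farm_metrics(n_grad, n_wg):
    """Return (cold-start fraction, load-balance efficiency) for an equal static split.

    cold-start fraction     -- share of gradients computed without an initial
                               wavefunction guess (one per active workgroup)
    load-balance efficiency -- useful gradient slots divided by the slots paid for,
                               since every workgroup waits for the busiest one
    """
    active = min(n_wg, n_grad)
    cold_fraction = active / n_grad
    busiest = math.ceil(n_grad / n_wg)       # gradients on the most loaded workgroup
    efficiency = n_grad / (n_wg * busiest)
    return cold_fraction, efficiency

N_ATOMS = 57
n_grad = 3 * N_ATOMS + 1                     # forward difference: 172 gradients
for n_wg in (1, 2, 4, 8, 16, 32, 64, 128):
    cold, eff = task_farm_metrics(n_grad, n_wg)
    print(f"{n_wg:4d} workgroups: cold starts {cold:6.1%}, load balance {eff:6.1%}")
```

Going from 64 to 128 workgroups doubles the cold-start fraction and lowers the load-balance efficiency from about 90% to about 67%, consistent with the two reasons given above.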
For the TZVP basis set, calculations were performed using a single workgroup and 64 workgroups. The results are similar, with the 64-workgroup calculation again achieving a speed-up of approximately 7 (6.8). This indicates that the efficiency gains persist even when large matrix diagonalisations are involved.
Tom Keal 2010-06-29