The nudged elastic band (NEB) algorithm is a method for finding a minimum energy path between two structures. Typically it is used to characterise a reaction path, including its energy barrier. The improved-tangent variant of NEB [15] is available in DL-FIND [6].
Each NEB optimisation cycle consists of energy and gradient evaluations for a sequence of structures (images) whose geometries lie along a path between the two endpoints. Although the final NEB gradient is constructed using spring forces that connect the images, the energy and gradient calculations for the individual images are independent of one another and can therefore be evaluated in parallel.
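To illustrate why the per-image gradients are independent while the final NEB force is not, the following is a minimal sketch of the standard NEB force projection for one intermediate image. It is not the DL-FIND implementation: it uses the simple tangent (the normalised vector between the neighbouring images) rather than the improved tangent of [15], and the function name `neb_force` and the spring constant `k_spring` are hypothetical names chosen for this example.

```python
import math


def neb_force(prev_pos, pos, next_pos, gradient, k_spring=1.0):
    """Return the NEB force on one intermediate image.

    All arguments are flat lists of coordinates. `gradient` is the true
    energy gradient of this image, which can be computed independently
    of every other image. Assumes the neighbouring images do not coincide.
    """
    # Simple tangent estimate (an assumption of this sketch): the
    # normalised vector from the previous image to the next image.
    tangent = [n - p for n, p in zip(next_pos, prev_pos)]
    norm = math.sqrt(sum(t * t for t in tangent))
    tangent = [t / norm for t in tangent]

    # True force is minus the gradient; keep only the component
    # perpendicular to the path.
    true_force = [-g for g in gradient]
    f_parallel = sum(f * t for f, t in zip(true_force, tangent))
    perpendicular = [f - f_parallel * t for f, t in zip(true_force, tangent)]

    # The spring force acts only along the tangent and depends on the
    # distances to both neighbours, which couples the images together.
    spring = k_spring * (math.dist(next_pos, pos) - math.dist(pos, prev_pos))

    return [fp + spring * t for fp, t in zip(perpendicular, tangent)]
```

Only the `gradient` argument requires an expensive calculation; the projection and the spring term are cheap and can be applied once all image gradients have been collected.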
The NEB algorithm in DL-FIND has been parallelised using static load-balancing. An image is assigned to a workgroup if the image number modulo the number of workgroups is equal to the workgroup ID. This method ensures that a particular image is always assigned to the same workgroup, which is important if the external program uses restart files (as is the case for a GAMESS-UK calculation). At the end of each cycle the energies and gradients are shared between all workgroups by an MPI call within DL-FIND.
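The assignment rule can be sketched in a few lines. This is an illustration only, not DL-FIND code: the function name is hypothetical, and the numbering conventions (workgroup IDs starting at 0, images identified by arbitrary integers) are assumptions of this example.

```python
def images_for_workgroup(workgroup_id, n_workgroups, image_numbers):
    """Return the images that a given workgroup evaluates.

    An image belongs to the workgroup when
    image number mod number of workgroups == workgroup ID.
    """
    return [i for i in image_numbers if i % n_workgroups == workgroup_id]


# Hypothetical example: 8 movable images numbered 1..8, 4 workgroups.
if __name__ == "__main__":
    for wg in range(4):
        print(wg, images_for_workgroup(wg, 4, range(1, 9)))
```

Because the rule depends only on the image number and the number of workgroups, the mapping is identical in every cycle, so any restart file written for an image is found again by the same workgroup.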
The first NEB cycle is different from the others in that the workgroups each calculate the gradients in serial along the whole path. This is to help convergence of the external QM program, as the wavefunction of the previous image can be used as a guess for the next. In subsequent cycles the corresponding image from the previous iteration can be used as the guess.
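The difference between the first and later cycles can be sketched as follows. The function names, the zero-based cycle counter, the image numbering from 1 and the use of `None` for "no previous wavefunction available" are all assumptions made for this illustration; they are not taken from DL-FIND.

```python
def images_to_evaluate(cycle, workgroup_id, n_workgroups, n_images):
    """Images one workgroup evaluates in a given cycle (cycle 0 is the first)."""
    images = range(1, n_images + 1)
    if cycle == 0:
        # First cycle: every workgroup walks the whole path in serial.
        return list(images)
    # Later cycles: static assignment by image number modulo workgroups.
    return [i for i in images if i % n_workgroups == workgroup_id]


def guess_source(cycle, image):
    """Return (cycle, image) of the wavefunction used as the initial guess.

    Returns None when no earlier wavefunction exists, i.e. for the first
    image of the first cycle (an assumption of this sketch).
    """
    if cycle == 0:
        return None if image == 1 else (0, image - 1)
    return (cycle - 1, image)
```

Under this scheme each workgroup ends the first cycle holding a converged wavefunction for every image, so whichever images it is assigned afterwards already have a local guess available.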

GAMESS-UK was used for the QM calculations with the B97-1 functional. GULP was used to provide MM energies and gradients using the shell model interatomic potential of Ref. [17]. Ten images are used to describe the path; with the two endpoints frozen, this gives 8 gradient evaluations in total per cycle.
A single-point energy and gradient evaluation for the test system is actually an iterative cycle of QM and MM calculations. This is because the QM region is polarised by the MM atoms as point charges and the shells of the MM system are polarised in turn by the QM region. The QM/MM gradient must therefore be iterated until converged each time it is calculated.
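The structure of this inner iteration can be sketched as a fixed-point loop. Everything named here is hypothetical: `qm_step` and `relax_shells` stand in for the QM and MM programs, the convergence test on the change in gradient and the tolerance are assumptions, and the toy model at the end exists only to make the sketch runnable.

```python
def converge_qmmm_gradient(qm_step, relax_shells, shells, tol=1e-8, max_iter=100):
    """Alternate QM and MM shell relaxation until the gradient stops changing.

    qm_step(shells)     -> (gradient, field): QM calculation polarised by the
                           MM point charges, including the current shells.
    relax_shells(field) -> new shell positions, polarised by the QM region.
    """
    previous = None
    for iteration in range(1, max_iter + 1):
        gradient, field = qm_step(shells)
        if previous is not None:
            change = max(abs(g - p) for g, p in zip(gradient, previous))
            if change < tol:
                return gradient, shells, iteration
        shells = relax_shells(field)
        previous = gradient
    raise RuntimeError("QM/MM gradient did not converge")


# Toy stand-ins (not real QM or MM): a linear mutual-polarisation model.
def toy_qm_step(shells):
    field = 1.0 + 0.5 * shells[0]
    return [field], field


def toy_relax_shells(field):
    return [0.2 * field]


if __name__ == "__main__":
    print(converge_qmmm_gradient(toy_qm_step, toy_relax_shells, [0.0]))
```

Each pass through the loop corresponds to one QM calculation plus one MM shell relaxation, which is why a single NEB image evaluation is considerably more expensive than one QM gradient.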

The NEB benchmark calculations were performed over 50 cycles. This results in an effectively converged path. Continuing on to full convergence was not desired as small numerical differences can lead to variation in the total number of cycles for the optimisation and this would not reflect the intrinsic performance of the parallelisation. The most basic form of the NEB method was used, with no climbing image [18] and no freezing of intermediate images during the optimisation.
Procs  Workgroups  Procs/workgroup  Time / s  Speedup vs 1024  Speedup vs 256
1024   1           1024             26404     -                -
256    1           256              23536     -                -
1024   2           512              14673     1.8              1.6
1024   4           256              7089      3.7              3.3
1024   8           128              3110      8.5              7.6
The improvement offered by the task-farming approach is significant and, as expected, using the maximum number of workgroups gives the highest performance. When compared to a single-workgroup 1024-processor run, a speedup of over 8 is found with 8 workgroups. This superlinear result is only possible because the single-point 128-processor calculation is faster than the 1024-processor equivalent. When compared to the 256-processor NEB run, the speedup is lower than 8 but still very substantial.
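The speedup columns follow directly from the reported timings, as the short check below shows. The dictionary layout and variable names are choices made for this example; the numbers are those given in the table.

```python
# Wall-clock times in seconds, keyed by (processors, workgroups).
timings = {
    (1024, 1): 26404,
    (256, 1): 23536,
    (1024, 2): 14673,
    (1024, 4): 7089,
    (1024, 8): 3110,
}

ref_1024 = timings[(1024, 1)]  # single workgroup on 1024 processors
ref_256 = timings[(256, 1)]    # single workgroup on 256 processors

# Speedup = reference time / task-farmed time, rounded to one decimal.
speedups = {
    workgroups: (
        round(ref_1024 / timings[(1024, workgroups)], 1),
        round(ref_256 / timings[(1024, workgroups)], 1),
    )
    for workgroups in (2, 4, 8)
}

if __name__ == "__main__":
    for workgroups, (vs_1024, vs_256) in speedups.items():
        print(workgroups, vs_1024, vs_256)
```

The single-workgroup timings also show that the 1024-processor run (26404 s) is slower than the 256-processor run (23536 s), which is consistent with the speedup relative to 1024 processors exceeding the number of workgroups.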
Tom Keal 2010-06-29