Nudged elastic band optimisation

The nudged elastic band (NEB) algorithm is a method for finding a minimum energy path between two structures. Typically it is used to characterise a reaction path including an energy barrier. The improved-tangent variant of NEB [15] is available in DL-FIND [6].

Each NEB optimisation cycle consists of energy and gradient evaluations for a sequence of structures (images) with geometries that sit along a path between the two endpoints. The final NEB gradient is constructed using spring forces that connect the images. However, the gradient calculations for the images are independent and therefore can be evaluated in parallel.

The NEB algorithm in DL-FIND has been parallelised using static load-balancing. An image is assigned to a workgroup if the image number modulo the number of workgroups is equal to the workgroup ID. This method ensures that a particular image is always assigned to the same workgroup, which is important if the external program uses restart files (as is the case for a GAMESS-UK calculation). At the end of each cycle the energies and gradients are shared between all workgroups by an MPI call within DL-FIND.

The first NEB cycle is different from the others in that the workgroups each calculate the gradients in serial along the whole path. This is to help convergence of the external QM program, as the wavefunction of the previous image can be used as a guess for the next. In subsequent cycles the corresponding image from the previous iteration can be used as the guess.

**Figure 5:** The ZnO cluster used for the nudged elastic band benchmark calculations. The small spheres on the outside of the cluster are the point charges used for approximating the electrostatic effect of the bulk crystal.
$\includegraphics[width=15cm,clip]{figures/neb-zno.eps}$

The benchmark system for the NEB method is shown in Figure 5. It is a 3207-atom QM/MM cluster consisting of CO

and 2H

adsorbed onto an Al-doped zinc oxide surface. The NEB method was used to find the barrier for the interchange of H in H-CO

between two different sites. The QM region consisted of 32 atoms with a PVDZ basis (except for Zn, where a Stuttgart ECP was used [16]). This gives a total of 194 basis functions. Following Ref. [4], this region is partitioned into an inner 19 atoms treated fully by quantum mechanics and a surrounding QM/MM interface region with pseudopotentials placed on 13 Zn atoms. The interface region provides a localised embedding potential that prevents the electrons from the inner region spilling out onto the positively charged centres in the MM region.

GAMESS-UK was used for the QM calculations with the B97-1 functional. GULP was used to provide MM energies and gradients using the shell model interatomic potential of Ref. [17]. 10 images are used to describe the path, with the two endpoints frozen, giving 8 gradient evaluations in total per cycle.

A single-point energy and gradient evaluation for the test system is actually an iterative cycle of QM and MM calculations. This is because the QM region is polarised by the MM atoms as point charges and the shells of the MM system are polarised in turn by the QM region. The QM/MM gradient must therefore be iterated until converged each time it is calculated.

**Figure 6:** Calculation time in wall clock seconds for a single point energy and gradient evaluation of an embedded cluster model of a ZnO surface reaction.
$\includegraphics[width=15cm,clip]{figures/neb-eandg.eps}$

Timings for the single point calculation are shown in Figure 6. For large processor counts the scaling is very poor, with a 1024-processor calculation actually taking a longer time to complete than a 256-processor calculation. This means that the overhead of message passing between so many processors outweighs any advantage from the extra computing power. Although the full iterated single-point calculation takes a significant amount of time, the individual QM and MM calculations are quite modest and do not benefit from such large processor counts. Increasing the size of the QM region would make larger processor counts useful, but this would have made running a full benchmark too computationally expensive. The results for this system indicate that the number of workgroups should be set as high as possible (i.e. to 8).

The NEB benchmark calculations were performed over 50 cycles. This results in an effectively converged path. Continuing on to full convergence was not desired as small numerical differences can lead to variation in the total number of cycles for the optimisation and this would not reflect the intrinsic performance of the parallelisation. The most basic form of the NEB method was used, with no climbing image [18] and no freezing of intermediate images during the optimisation.

**Table 2:** Calculation time in wall clock seconds for 50 cycles of a nudged elastic band optimisation. Speed-up factors are compared to single workgroup calculations.
Procs	Workgroups	Procs/	Time / s	Speed-up	Speed-up
		workgroup		vs 1024	vs 256
1024	1	1024	26404
256	1	256	23536
1024	2	512	14673	1.8	1.6
1024	4	256	7089	3.7	3.3
1024	8	128	3110	8.5	7.6

The results are shown in Table 2. Speed-up factors are calculated relative to the full single workgroup run of 1024 processors and also a run of 256 processors. The 256 processor run represents the best performance achievable using a single workgroup.

The improvement offered by the task-forming approach is significant, and as expected using the maximum number of workgroups gives the highest performance. When compared to a single workgroup 1024-processor run a speed-up of over 8 is found, which is only possible because the single-point 128-processor calculation is faster than the 1024-processor equivalent. When compared to the 256-processor NEB run, the speed-up is lower than 8 but still very substantial.

Tom Keal 2010-06-29