Global optimisation methods

Two intrinsically parallel optimisation methods are available in DL-FIND [6], namely a genetic algorithm and stochastic search. These methods are typically used to find the global minimum on a potential energy surface. Both methods evaluate a population of structures in each cycle, and these evaluations can be run in parallel.
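
To illustrate why these population-based methods are intrinsically parallel, the following minimal Python sketch task-farms the evaluations of one cycle across worker processes. It is not ChemShell or DL-FIND code: evaluate_energy and random_structure are hypothetical stand-ins for the real energy call and structure generator.

    from multiprocessing import Pool
    import random

    def evaluate_energy(structure):
        # Hypothetical stand-in for the real QM or MM energy call; in the
        # benchmarks below this is a full GAMESS-UK energy and gradient.
        return sum(x*x + y*y + z*z for x, y, z in structure)

    def random_structure(n_atoms=56):
        # Random trial geometry; 56 atoms corresponds to a (ZnO)28 cluster.
        return [tuple(random.uniform(-5.0, 5.0) for _ in range(3))
                for _ in range(n_atoms)]

    if __name__ == "__main__":
        population = [random_structure() for _ in range(32)]
        # Each structure is independent of the others, so one cycle's
        # evaluations can be farmed out to parallel workers.
        with Pool(processes=8) as pool:
            energies = pool.map(evaluate_energy, population)
        best = min(range(len(energies)), key=energies.__getitem__)
        print("best structure:", best, "energy:", energies[best])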

The genetic algorithm and stochastic search methods use the same parallel DL-FIND interface that served as the basis for parallelising the NEB method. To support the new methods, the interface was modified on the ChemShell side to pass the relevant input options to DL-FIND.

A change to the handling of shell model systems in ChemShell was also required to ensure that the parallel optimisations would work with these systems if desired. The DL-FIND optimiser uses only atom positions and ignores shells, which are a feature specific to ChemShell. Previously, shell positions were stored in ChemShell as absolute coordinates and relaxed starting from the old positions whenever a new geometry was created. This relaxation can be slow or difficult to converge if the geometry changes radically, which is often the case for the global optimisation routines, where a wide variety of geometries is considered at every step. The shell handling in ChemShell was therefore improved by storing shell positions as coordinates relative to their parent atoms, so that the shells stay close to the atoms even under a large change of geometry. This change benefits all the other optimisation methods in DL-FIND as well, but is most important for the global optimisers.
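
The idea can be sketched in a few lines of Python. This is an illustration only, not the ChemShell source: shells stored as offsets from their parent atoms automatically follow the atoms when the optimiser supplies a new geometry, so shell relaxation starts from sensible positions.

    import numpy as np

    class ShellModel:
        # Toy shell model (hypothetical; ChemShell's internals differ).
        # 'atoms' holds core positions, 'offsets' the shell displacement
        # relative to each parent atom.
        def __init__(self, atoms, offsets):
            self.atoms = np.asarray(atoms, dtype=float)
            self.offsets = np.asarray(offsets, dtype=float)

        def update_geometry(self, new_atoms):
            # The optimiser moves only the atoms; the relative offsets
            # are kept, so each shell stays next to its parent atom even
            # under a radical change of geometry.
            self.atoms = np.asarray(new_atoms, dtype=float)

        def shell_positions(self):
            # Absolute shell positions are reconstructed on demand and
            # serve as the starting point for shell relaxation.
            return self.atoms + self.offsets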

Following Ref. [19], benchmark calculations were performed on ZnO nanoclusters. In Ref. [19] rigid-ion MM calculations were used, but for benchmarking purposes a much more demanding QM calculation was set up using GAMESS-UK. Timing calculations were performed on a (ZnO)$_{28}$ cluster. The B97-1 functional was again used, with a PVDZ basis (560 basis functions) and the Stuttgart ECP for the Zn atoms. A population of 32 structures was used, which is a typical size for these methods and an efficient number for task-farm parallelisation.

Figure 7: Calculation time in wall clock seconds for a single stochastic search cycle (32 energy and gradient evaluations) of a (ZnO)$_{28}$ cluster.
[Figure: figures/stoch-eandg.eps]

The scaling properties of the genetic algorithm and stochastic search methods are expected to be identical, so stochastic search was used for the benchmark tests. A single energy and gradient evaluation is very quick, and so to obtain reliable preliminary timings a full cycle of 32 evaluations was timed. The results are shown in Figure 7. Again, the best performance is given by 256 processors, with a slowdown observed at higher processor counts. This again suggests that substantial gains can be made by splitting the processors into workgroups, as sketched below.
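
A minimal sketch of the workgroup idea, assuming mpi4py is available; the workgroup count and distribution scheme here are illustrative choices, not the actual ChemShell implementation:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    n_workgroups = 4                     # illustrative choice
    group = rank * n_workgroups // size  # contiguous ranks share a group

    # Split the global communicator so that each workgroup can evaluate
    # its share of the population independently of the others.
    wg_comm = comm.Split(color=group, key=rank)

    # Deal a 32-member population out to the workgroups round-robin
    # (hypothetical distribution scheme).
    my_structures = [i for i in range(32) if i % n_workgroups == group]
    if wg_comm.Get_rank() == 0:
        print(f"workgroup {group}: structures {my_structures}")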

Normally a genetic algorithm or stochastic search optimisation would be run for hundreds or thousands of cycles in order to have a good chance of finding the global minimum. However, due to the computational expense it was not feasible to run a baseline calculation of this length using a single workgroup. To obtain a benchmark, the stochastic search algorithm was instead run for 20 cycles. This is expected to give a good representation of the scaling behaviour, as each cycle contains the same number of evaluations and should therefore take approximately the same amount of time.


Table 3: Calculation time in wall clock seconds for 20 cycles of a stochastic search optimisation. Speed-up factors are given relative to the 1024-processor single-workgroup calculation.

Procs   Workgroups   Procs/workgroup   Time / s   Speed-up vs 1024
 1024            1              1024      23535                  -
  256            1               256      26560                  -
 1024            2               512      14881                1.6
 1024            4               256       8930                2.6
 1024            8               128       5819                4.0
 1024           16                64       4270                5.5
 1024           32                32       4197                5.6

The results are shown in Table 3. Surprisingly, of the two single-workgroup runs, the 1024-processor run is faster than the 256-processor one. This could have been due to natural variation in SCF convergence arising from the random element in the generated geometries, but the benchmark was run twice to test this and the same trend was found in both runs. This implies that there is some overhead at the large processor count in the first cycle that is not present in subsequent cycles.

Speed-up factors for the task-farmed calculations are therefore given relative to the 1024-processor single-workgroup run. The gains in performance are again substantial. As expected, the best performance is given by maximising the number of workgroups, although the 16-workgroup calculation is very nearly as good, with speed-ups of over 5 achieved in both cases.
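
As a check on the quoted factors, each speed-up in Table 3 is simply the 1024-processor single-workgroup time divided by the task-farmed time; a few lines of Python reproduce the tabulated values:

    baseline = 23535.0  # 1024 procs, 1 workgroup (Table 3)
    runs = {2: 14881, 4: 8930, 8: 5819, 16: 4270, 32: 4197}

    for workgroups, t in sorted(runs.items()):
        # Speed-up relative to the 1024-processor single-workgroup run.
        print(f"{workgroups:2d} workgroups: {baseline / t:.1f}x speed-up")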
