Two intrinsically parallel optimisation methods are available in DL-FIND [6], namely a genetic algorithm and stochastic search. These methods are typically used to find the global minimum on a potential energy surface. Both methods operate on a population of structures in each cycle, and the energy evaluations for the population can be run in parallel.
The genetic algorithm and stochastic search methods use the same parallel DL-FIND interface that was used as the basis for parallelising the NEB method. To support the new methods, the interface was modified on the ChemShell side to pass the relevant input options to DL-FIND.
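The division of a population over task-farm workgroups can be illustrated with a short sketch. This is not DL-FIND code; the function name and round-robin scheme are illustrative assumptions about how population members might be shared out for one optimisation cycle.

```python
# Hypothetical sketch of dividing a population of structures over
# task-farm workgroups; names are illustrative, not DL-FIND's API.

def assign_to_workgroups(n_structures, n_workgroups):
    """Round-robin assignment of population members to workgroups.

    Returns a list where entry i holds the indices of the structures
    that workgroup i evaluates during one optimisation cycle.
    """
    groups = [[] for _ in range(n_workgroups)]
    for i in range(n_structures):
        groups[i % n_workgroups].append(i)
    return groups

# Example: a population of 32 structures on 16 workgroups gives each
# workgroup 2 energy evaluations per cycle.
assignment = assign_to_workgroups(32, 16)
```

With 32 workgroups each group evaluates exactly one structure per cycle, which is why a population size that matches (or is a multiple of) the workgroup count parallelises efficiently.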
A change to the handling of shell-model systems in ChemShell was also required so that the parallel optimisations work with such systems if desired. The DL-FIND optimiser uses only atom positions and ignores shells, which are a feature specific to ChemShell. Previously, shell positions were stored in ChemShell as absolute coordinates and relaxed starting from the old positions when a new geometry was created. This relaxation can be slow or difficult to converge if the geometry changes radically, which is often the case for the global optimisation routines, where a wide variety of geometries is considered at every step. The shell handling in ChemShell was therefore improved by storing shell positions as coordinates relative to their parent atoms, so that the shells stay near the atoms even under a large change of geometry. This change benefits all the other optimisation methods in DL-FIND as well, but is most important for the global optimisers.
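The relative-coordinate scheme can be sketched as follows. This is a minimal illustration, not ChemShell code: the function names and plain-tuple representation are assumptions, and only the core idea (store shell offsets, rebuild absolute positions next to the moved atoms) reflects the text above.

```python
# Illustrative sketch (not ChemShell code) of storing shell positions
# relative to their parent atoms, so shells track the atoms even under
# a large change of geometry.

def shells_to_relative(atoms, shells):
    """Convert absolute shell coordinates into offsets from parent atoms."""
    return [tuple(s - a for s, a in zip(shell, atom))
            for atom, shell in zip(atoms, shells)]

def shells_to_absolute(atoms, offsets):
    """Rebuild absolute shell coordinates next to the (moved) parent atoms."""
    return [tuple(a + d for a, d in zip(atom, off))
            for atom, off in zip(atoms, offsets)]

# A shell sits slightly displaced from its core along x. After a drastic
# geometry change the relative scheme places the shell right next to the
# new core position, giving a good starting point for shell relaxation.
atoms_old = [(0.0, 0.0, 0.0)]
shells_old = [(0.1, 0.0, 0.0)]
offsets = shells_to_relative(atoms_old, shells_old)

atoms_new = [(5.0, 5.0, 5.0)]
shells_new = shells_to_absolute(atoms_new, offsets)
```

Under the old absolute-coordinate scheme the shell would instead start its relaxation at the origin, far from the moved core, which is exactly the situation that converges slowly.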
Following Ref. [19], benchmark calculations were performed on ZnO nanoclusters. Rigid-ion MM calculations were used in Ref. [19], but for the purpose of benchmarking a much more demanding QM calculation was set up using GAMESS-UK. Timing calculations were performed on a (ZnO) cluster. The B97-1 functional was again used with a PVDZ basis (560 basis functions) and the Stuttgart ECP for the Zn atoms. A population of 32 structures was used, which is a typical size for these methods and an efficient number for task-farm parallelisation.

Normally a genetic algorithm or stochastic search optimisation would be run for hundreds or thousands of cycles in order to have a good chance of finding the global minimum. However, due to the computational expense it was not feasible to run a baseline calculation of this length using a single workgroup. To obtain a benchmark, the stochastic search algorithm was instead run for 20 cycles. This is expected to give a good representation of the scaling behaviour, as each cycle contains the same number of evaluations and should therefore take approximately the same amount of time.
Procs  Workgroups  Procs/workgroup  Time / s  Speedup vs 1024
1024       1            1024          23535        -
 256       1             256          26560        -
1024       2             512          14881        1.6
1024       4             256           8930        2.6
1024       8             128           5819        4.0
1024      16              64           4270        5.5
1024      32              32           4197        5.6
Speedup factors for the task-farmed calculations are therefore given relative to the 1024-processor, single-workgroup run. Gains in performance are again substantial. As expected, the best performance is obtained by maximising the number of workgroups, although the 16-workgroup calculation is very nearly as good; speedups of over 5 are achieved in both cases.
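The speedup column in the table can be recomputed directly from the measured wall times. The short sketch below uses only the timings quoted above; the variable names are illustrative.

```python
# Recompute the speedup column of the table above from the measured
# wall times, relative to the single-workgroup run on 1024 processors.

baseline = 23535.0  # time in seconds: 1 workgroup, 1024 processors

# (workgroups, time in s) for the task-farmed 1024-processor runs
runs = [(2, 14881.0), (4, 8930.0), (8, 5819.0), (16, 4270.0), (32, 4197.0)]

speedups = [round(baseline / t, 1) for _, t in runs]
# speedups == [1.6, 2.6, 4.0, 5.5, 5.6], matching the table
```

The near-identical timings for 16 and 32 workgroups (4270 s vs 4197 s) show the scaling flattening off, consistent with the observation that the 16-workgroup run is very nearly as good as the best case.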
Tom Keal, 2010-06-29