The HECToR Service is now closed and has been superseded by ARCHER.

Improvements for multi-core performance and domain choice within DL_POLY_4

DL_POLY is a general-purpose package for classical molecular dynamics (MD) simulations developed by I.T. Todorov and W. Smith at STFC Daresbury Laboratory. The package is used to simulate the atomistic evolution of the full spectrum of models commonly employed in the materials science, solid state chemistry, biological simulation and soft condensed-matter communities. The main purpose of the software is to enable the exploitation of large-scale MD simulations on multi-processor platforms. When this work was proposed, DL_POLY_3 was the current version; during the course of this dCSE project it was superseded by DL_POLY_4.

The overall aims of this project were to:

  • Optimise the multi-core performance of DL_POLY_4 on HECToR by updating the routines for the link cell, Ewald and constraint force calculations.
  • Relax the restriction on processor counts. The parallelisation of DL_POLY_4 is based upon domain decomposition, and in earlier versions of the code the parallel FFT, which maps directly onto this decomposition, restricted processor counts to powers of 2. The aim was to lessen this restriction by also allowing factors of 3 and 5.
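
As a rough illustration of the FFT aim above, the following Python sketch checks whether a process count factorises entirely into the radices 2, 3 and 5. It is not taken from DL_POLY_4 or DaFT; the function names radix_exponents and is_radix_friendly are hypothetical and exist only for this example.

```python
def radix_exponents(n, radices=(2, 3, 5)):
    """Divide n by each radix as often as possible.

    Returns (exponents, remainder), where exponents maps each radix to
    the number of times it divides n, and remainder is what is left.
    Hypothetical helper for illustration; not part of DL_POLY_4.
    """
    if n < 1:
        raise ValueError("process count must be a positive integer")
    exponents = {}
    for r in radices:
        count = 0
        while n % r == 0:
            n //= r
            count += 1
        exponents[r] = count
    return exponents, n


def is_radix_friendly(n, radices=(2, 3, 5)):
    """True if n has no prime factors other than the given radices."""
    _, remainder = radix_exponents(n, radices)
    return remainder == 1
```

For example, 360 = 2^3 x 3^2 x 5 passes the check, whereas 14 = 2 x 7 leaves a remainder of 7 and does not.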

The outcomes of the project are:

  • All performance figures below were measured with DL_POLY TEST8. This consists of a complex biomolecule (gramicidin A) in water and is a representative test, as it both exercises many routines in the code and is sufficiently large to exhibit good scalability.
  • For the link cell optimisation, an If condition was relocated, which enables the code to run 11.9% faster on the XE6.
  • An optional manual unroll of the main loop in ewald_spme_forces was implemented, which makes runs 13% faster on 256 XE6 cores.
  • The number of If conditions in the constraints_shake_vv routine was reduced, giving an improvement of 11% for the case with no frozen atoms and 5-7% on average for generic cases on the XE6.
  • A multi-radix, domain-decomposed parallel FFT was implemented in DaFT. Other routines in DL_POLY_4 were also updated, including the main algorithm for determining the domain decomposition.
  • There is a small performance penalty in using decompositions other than powers of 2, although it is not significant when compared to the flexibility and overall efficiency gain in being able to utilise more cores per node. Performance is best when the prime factors of the number of processes are small integers.
  • The multi-core optimisation work has resulted in roughly a 10-25% increase in performance on the XE6, depending upon the number of cores.
  • The new implementation allows any processor count to be used (though if there is a large prime factor the performance may be poor).
  • This work is now available in DL_POLY_4 which is the direct successor of DL_POLY_3.
  • DL_POLY_4 is no longer restricted to power-of-2 core counts, which also allows scientists to study the system of interest without having to artificially inflate it to fit the restrictions of the code.
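
To illustrate what domain choice involves, the sketch below picks a three-dimensional process grid for a given process count by minimising px + py + pz, which for a cubic simulation cell minimises the surface area of each domain. This is a simplified stand-in, not the algorithm used in DL_POLY_4; the function name process_grid and the selection criterion are assumptions made for this example.

```python
def process_grid(nproc):
    """Return (px, py, pz) with px * py * pz == nproc and px <= py <= pz.

    Among all such triples, the one with the smallest px + py + pz is
    chosen. For a cubic cell split into equal domains, the surface area
    of one domain is proportional to (px + py + pz) / nproc, so this
    picks the most compact domains. Illustrative only; not DL_POLY_4 code.
    """
    if nproc < 1:
        raise ValueError("process count must be a positive integer")
    best = None
    px = 1
    while px * px * px <= nproc:
        if nproc % px == 0:
            rest = nproc // px
            py = px
            while py * py <= rest:
                if rest % py == 0:
                    candidate = (px, py, rest // py)
                    if best is None or sum(candidate) < sum(best):
                        best = candidate
                py += 1
        px += 1
    return best
```

With 24 processes this gives a 2 x 3 x 4 grid, while a prime count such as 7 can only be arranged as 1 x 1 x 7 slabs, which is consistent with the note above that a large prime factor may lead to poor performance.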

Please see PDF or HTML for a report which summarises the multi-core optimisation work and PDF or HTML for a report which summarises the work on domain choice.