The HECToR Service is now closed and has been superseded by ARCHER.

Tight binding molecular dynamics on CPU and GPU clusters

The tight binding approach is a simplified electronic structure method that is significantly faster than density functional calculations. The TBE code is unique in that it accounts for self-consistent charge transfer by developing multipole moments of charge on the atoms, and it solves the associated electrostatic problem using multipole Ewald methods. The original parallelisation choices in TBE were over the k-points and, via ScaLAPACK, over the matrix diagonalisation.
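To illustrate why tight binding is cheap and why a k-point decomposition parallelises so naturally, consider a toy dimerised chain (purely illustrative, not TBE's actual model): each k-point yields a small, independent Hermitian eigenproblem, here a 2x2 whose eigenvalues are available in closed form.

```python
import cmath
import math

def bands(k, t1=1.0, t2=0.6, a=1.0):
    """Eigenvalues of the 2x2 Bloch Hamiltonian of a dimerised chain.

    H(k) = [[0, h(k)], [conj(h(k)), 0]] with h(k) = t1 + t2*exp(-1j*k*a),
    so the two bands are simply -|h(k)| and +|h(k)|.
    """
    h = t1 + t2 * cmath.exp(-1j * k * a)
    return -abs(h), abs(h)

def band_energy(nk=64, t1=1.0, t2=0.6):
    """Energy of the filled lower band, summed over a uniform k-mesh.

    Each k-point is an independent eigenproblem, which is why
    distributing the k-points across processes needs no communication
    until the final sum.
    """
    ks = [2 * math.pi * i / nk for i in range(nk)]
    return sum(bands(k, t1, t2)[0] for k in ks) / nk
```

At k = pi the two hoppings interfere destructively and the band gap is 2|t1 - t2|, a standard check on such a model.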

This project will build upon the original parallel decompositions in TBE by:

  • Developing parallel Hamiltonian, charge and force calculations.
  • Upgrading the parallel diagonalisers.
  • Developing a CUDA interface for MAGMA to replace ScaLAPACK.
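As background to these decompositions, the existing k-point parallelism amounts to dealing independent Bloch eigenproblems out to processes. A minimal sketch of a cyclic (round-robin) deal, assuming a simple rank/size model rather than TBE's actual bookkeeping:

```python
def my_kpoints(rank, nprocs, nk):
    """Indices of the k-points owned by `rank` under a cyclic deal.

    A round-robin assignment keeps the load balanced even when nk is
    not a multiple of nprocs.  (Illustrative only; the decomposition
    actually used in TBE is not documented here.)
    """
    return list(range(rank, nk, nprocs))
```

Each process diagonalises only its own k-points, and the band-summed quantities are combined with a single reduction at the end.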

On completion of this project the main achievements may be summarised as follows:

  • Parallel implementations were developed for the electrostatic potential routines and the band-structure-derived quantities. The size of the matrix of structure-dependent constants, B, was previously a significant barrier to the number of atoms that could be simulated. This matrix is now fully distributed and therefore significantly smaller per process, particularly when there are only a few heavy atoms in a system of predominantly light ones.
  • The k-point decomposition was also updated with a new 3D Cartesian topology, which now links to the 2D and 1D decompositions used for the linear algebra and the integrated quantities.
  • The Hamiltonian setup routine was rewritten to exploit sparsity and improve memory access, turning an O(N^3) problem into a linear one. This yields significant memory savings, superlinear scaling in the generating routine and a 10x speedup over the original implementation.
  • To update the parallel diagonalisers, a generic wrapper was written for the calls to ScaLAPACK.
  • The routines for conversion between the global actual and virtual arrays were also updated to use global collectives, giving a 3x speedup. Furthermore, a local method of matrix assembly was developed for the Bloch transformation routines, and the neighbour-table walk for the off-diagonal density-matrix derived quantities was parallelised over the atoms.
  • Performance of the diagonalisers was benchmarked on the QUB cluster (900 cores, configured as dual-socket nodes with quad-core Xeon E5530 2.40 GHz CPUs and up to 24 GB RAM per node).
  • For a benchmark case of 1024 water molecules, a 10x speedup can now be achieved with 16 cores.
  • TBE was ported to GPUs (NVIDIA Tesla K20c accelerators). Benchmarking showed overall performance to be slower than on the CPU, owing to the block size required by TBE; however, matrix multiplication was 2-4x faster for cases with more than 1000 elements.
  • These developments are already being used with TBE on the QUB cluster.
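Both the fully distributed B matrix and the generic ScaLAPACK wrapper rest on the 2D block-cyclic layout that ScaLAPACK assumes. A sketch of the standard index maps along one dimension of the process grid (block size and process count are illustrative; TBE's actual array descriptors are not shown here):

```python
def owner(i, nb, p):
    """Process (row or column) owning global index i for block size nb."""
    return (i // nb) % p

def local_index(i, nb, p):
    """Local index of global index i on its owning process."""
    full_cycles = i // (nb * p)      # complete block-cycles before i's block
    return full_cycles * nb + i % nb

def global_index(l, r, nb, p):
    """Inverse map: global index of local index l held on process r."""
    return (l // nb) * nb * p + r * nb + l % nb
```

Distributing both dimensions this way is what makes the per-process share of B small, and shrink further as processes are added.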
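The rewritten Hamiltonian setup exploits the short range of tight binding hoppings: if only atom pairs within a cutoff are stored, assembly walks a neighbour table whose length is linear in the atom count. A minimal 1-D sketch of building such a table, using cell binning so the build itself is also linear (a hypothetical helper, not TBE's routine):

```python
def neighbour_table(positions, cutoff):
    """Half neighbour list (i < j) for 1-D atomic positions.

    Atoms are binned into cells of width `cutoff`, so each atom only
    searches its own and adjacent cells: O(N) work and O(N) stored
    pairs for bounded density, instead of the O(N^2) of a dense scan.
    """
    cells = {}
    for idx, x in enumerate(positions):
        cells.setdefault(int(x // cutoff), []).append(idx)
    pairs = []
    for c, members in cells.items():
        for i in members:
            for cc in (c - 1, c, c + 1):
                for j in cells.get(cc, []):
                    if j > i and abs(positions[j] - positions[i]) <= cutoff:
                        pairs.append((i, j))
    return pairs
```

A Hamiltonian assembled by walking only these pairs stores O(N) nonzero blocks, which is the source of the memory savings quoted above.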
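The local matrix assembly for the Bloch transformation can be pictured as building H(k) = sum over T of H(T) exp(ikT) entirely from data held on the local process, one k-point at a time. A 1-D sketch under that assumption (the dictionary-of-blocks storage is illustrative):

```python
import cmath

def bloch_hamiltonian(hops, k, n):
    """Assemble H(k) from real-space hopping blocks.

    `hops` maps a lattice translation T (an integer here, for a 1-D
    lattice) to an n-by-n real-space block stored as nested lists.
    Each k-point's matrix is assembled locally, so no communication is
    needed until k-summed quantities are reduced.
    """
    H = [[0j] * n for _ in range(n)]
    for T, block in hops.items():
        phase = cmath.exp(1j * k * T)
        for a in range(n):
            for b in range(n):
                H[a][b] += phase * block[a][b]
    return H
```

For a single orbital with unit hops to T = +1 and T = -1 this reduces to the familiar 2 cos(k) band, a quick sanity check on the phase convention.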

Please see the PDF or HTML version of the report summarising this project.