
Project overview

The TBE code originally had a $k$-point MPI parallelisation, which worked excellently for systems requiring many $k$-points. There was also an atom-based MPI parallelisation for the second (last) step of the electrostatic potential calculation. Linking with threaded BLAS and LAPACK libraries benefited the parts using their routines on multiprocessor/multicore shared-memory machines. The aim of this project is therefore to modernise and speed up the code for larger systems with few $k$-points, by making effective use of clusters with fast interconnects and of modern accelerators such as graphics cards.

The following milestones were set out at the beginning of the project:

At the start of the project, the existing independent parallelisations over $k$-points and over atoms in the electrostatic potential were overhauled. Firstly, mpi_comm_world was replaced with a communicator with a 3D Cartesian topology in row-major order. The first dimension now defines how many 2D process arrays deal with the separate $k$-points and spins; each array is then allocated a contiguous range of $k$-points. The dimensioning is performed by an algorithm designed to use the largest possible number of process arrays while keeping the size of these arrays to the minimum necessary and their dimensions as close to square as possible, for efficiency and simplicity. When square 2D arrays cannot be achieved, the first dimension is set smaller than the second for greater efficiency. The parallel linear algebra routines dealing with $H^k$, $S^k$ and $\rho^k$ execute within the 2D process arrays; a small number of vectors and scalars is then reduced across 1D arrays of processes, perpendicular to the 2D arrays, to obtain the integrated quantities. The 2D arrays use a block distribution for the matrix quantities and an independent atom distribution for the density-matrix-related vector and scalar observables. Outside the $k$-point/spin loop, the electrostatic potential is handled by the 3D communicator containing all active processes. A sketch of this communicator layout is given below.
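Assuming the standard MPI Cartesian-topology routines are used, a minimal Fortran sketch of such a layout might look as follows. The names comm3d, comm2d and comm1d, and the use of mpi_dims_create for the factorisation, are illustrative and not taken from TBE, which applies its own dimensioning algorithm as described above.

   program cart_sketch
      use mpi
      implicit none
      integer :: ierr, nprocs, comm3d, comm2d, comm1d
      integer :: dims(3)
      logical :: periods(3), remain(3)

      call mpi_init(ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)

      ! Let MPI suggest a balanced 3D factorisation.  The actual code instead
      ! chooses dims(1), the number of 2D process arrays, from the k-point/spin
      ! workload and keeps dims(2) <= dims(3) and as close to square as possible.
      dims = 0
      periods = .false.
      call mpi_dims_create(nprocs, 3, dims, ierr)

      ! Row-major 3D Cartesian topology over all active processes; this
      ! communicator also handles the electrostatic potential.
      call mpi_cart_create(mpi_comm_world, 3, dims, periods, .true., comm3d, ierr)

      ! 2D process arrays (first coordinate fixed): the parallel linear algebra
      ! on H^k, S^k and rho^k runs within these.
      remain = (/ .false., .true., .true. /)
      call mpi_cart_sub(comm3d, remain, comm2d, ierr)

      ! 1D communicators perpendicular to the 2D arrays: used to reduce the
      ! small vectors and scalars integrated over k-points and spin.
      remain = (/ .true., .false., .false. /)
      call mpi_cart_sub(comm3d, remain, comm1d, ierr)

      call mpi_finalize(ierr)
   end program cart_sketch

Splitting the 3D communicator with mpi_cart_sub in this way yields one 2D sub-communicator per $k$-point/spin group and, orthogonally, 1D sub-communicators linking corresponding processes of different groups.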

Since the effort for every $k$-point is exactly the same in TB, the $k$-point blocks are allocated evenly by number. The atom-parallelised regions use a more sophisticated, but still static, allocation, because the cost remains the same across different iterations and $k$-points. All communicators (3D, 2D and 1D) and the associated offset, count and blocking arrays are set up once and reused throughout. To enable the reuse of a single offsets-and-counts array in mpi_gatherv for block vectors of differing first dimension, type-specific contiguous MPI derived types are defined. Although mpi_gatherv is generally slower than mpi_gather, the data consists of comparatively thin vectors, so the effort of packing and repacking would outweigh any time difference between the two routines. The sketch below illustrates the derived-type idea.
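A minimal sketch of this trick, with an assumed interface and names that are not taken from TBE, could look like the following for double-precision block vectors gathered onto a root process.

   subroutine gather_blocks(vloc, vglob, blen, nloc, counts, displs, comm)
      ! Hypothetical helper: gather per-atom block vectors of first dimension
      ! blen onto the root of comm, reusing counts/displs arrays expressed in
      ! whole blocks (atoms) rather than in elements.
      use mpi
      implicit none
      integer, intent(in) :: blen, nloc, comm
      integer, intent(in) :: counts(:), displs(:)
      real(8), intent(in)  :: vloc(blen, nloc)   ! locally owned atoms
      real(8), intent(out) :: vglob(blen, *)     ! all atoms, significant on root
      integer :: blocktype, ierr

      ! A contiguous derived type whose extent is one block makes counts and
      ! displs independent of blen; in practice one such type per element kind
      ! and first dimension would be created once and cached.
      call mpi_type_contiguous(blen, mpi_double_precision, blocktype, ierr)
      call mpi_type_commit(blocktype, ierr)

      call mpi_gatherv(vloc, nloc, blocktype, vglob, counts, displs, &
                       blocktype, 0, comm, ierr)

      call mpi_type_free(blocktype, ierr)
   end subroutine gather_blocks

Because counts and displs are measured in blocks rather than individual elements, the same pair of arrays serves every block vector regardless of its first dimension.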


