**Dimitar Pashov**

*Department of Physics,*
*King's College London, WC2R 2LS, UK*

**Date:** August 1, 2013

The aim of this dCSE project was to improve the TBE code, which is based on the tight-binding model with self-consistent multipole charge transfer. Given an appropriate parameterisation, the code is general and can be used to simulate a wide variety of systems and phenomena, such as bond breaking and charge and magnetic polarisation.

The first goal was to achieve better performance by parallelising all suitable routines with MPI. The next step was to integrate ScaLAPACK's parallel diagonalisation routines transparently and with minimal communication, allowing the code to run across multiple nodes; previously it was limited to a single node, where it relied on threaded LAPACK/BLAS libraries. The third and last task was to utilise GPUs as accelerators for the heavy linear algebra calculations and subsequently to integrate this with the MPI parallelisation.

The goals of the first two work packages were achieved mostly as planned, with significant benefit gained from exploiting the sparsity of the tight-binding Hamiltonian and reformulating the algorithms for calculating the density matrix elements and related quantities. The electrostatics routines now use significantly less memory and also show a parallel speedup. A generic interface to all required diagonalisation routines was developed, together with the related transparent communication routines. It is now available to all programs in the LMTO suite, of which TBE is a part. Nearly all of the code has been updated to Fortran 90 and later standards, making it easier and much safer to work with.

The third task, GPU porting of TBE, was the more exciting and riskier part of the project, and it did not disappoint in terms of the challenges it provided. The original intention was to minimise risk by relying on established libraries and avoiding native development as much as possible. Unexpectedly, the diagonalisation routines were nowhere near as fast as expected, and this steered us into slightly uncharted territory: writing CUDA code for a number of matrix operations and researching completely different algorithms for obtaining the density matrix. Eventually the goals were accomplished, even though the acceleration is still far from what had been hoped for.

- Introduction

- Project overview

- Parallelise serial segments of the TBE code
- Parallel structure constants matrix with MPI
- Density matrix from eigenvectors and occupancies
- Direct space Hamiltonian
- Parallelisation

- Diagonalisation with ScaLAPACK and ELPA

- Porting to GPU(s)

- Conclusion and future work
- Acknowledgements
- Bibliography