## Adapting QSGW to large multi-core systems

The quasiparticle self-consistent GW (QSGW) approximation is an improved method for ab initio electronic structure calculations that overcomes several shortcomings of comparable approaches, e.g. the local density approximation (LDA) and dynamical mean-field theory (DMFT). QSGW has been implemented in the full-potential LMTO code; however, it is 100-1000 times more expensive than LDA+DMFT. Before this project, the QSGW code had been developed for serial computation with up to 16 atoms per unit cell. There is scope for parallelism, and this project set out to implement a parallel version to enable calculations with 100 atoms.

The overall aims of this project were:

- Implement MPI parallelisation for the self-energy and the polarisability calculations in QSGW.
- Distribute the matrices used to store representations of the basis functions through a further, second level of MPI parallelisation.
- Implement OpenMP to parallelise the intra-node sub-matrix operations, namely the polarisation function, Coulomb interactions and the self-energy.
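
The two MPI levels in these aims can be pictured with a small, MPI-free sketch. It assumes, purely for illustration, that the outer level shares q-points among groups of ranks and the inner level shares matrix rows among the ranks within each group. The function names, the contiguous block partitioning and the sizes used are hypothetical and are not taken from the QSGW code; in a real MPI program the groups would typically be created with `MPI_Comm_split`.

```python
def block_range(n_items, n_parts, part):
    """Half-open range [start, stop) owned by `part` when n_items are
    split into n_parts contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def two_level_layout(world_size, n_groups, n_qpoints, n_rows):
    """Map each global rank to its group, its rank within the group,
    and the q-point and matrix-row ranges it would work on."""
    if world_size % n_groups != 0:
        raise ValueError("world_size must be divisible by n_groups")
    group_size = world_size // n_groups
    layout = {}
    for rank in range(world_size):
        # Outer level: which group; inner level: position inside the group.
        group, local_rank = divmod(rank, group_size)
        layout[rank] = {
            "group": group,
            "local_rank": local_rank,
            "qpoints": block_range(n_qpoints, n_groups, group),
            "rows": block_range(n_rows, group_size, local_rank),
        }
    return layout


# Illustrative sizes only: 20 ranks in 10 groups, 10 q-points, 7 matrix rows.
layout = two_level_layout(world_size=20, n_groups=10, n_qpoints=10, n_rows=7)
for rank in (2, 3):
    print(rank, layout[rank])
```

With this layout every (q-point, row) pair is owned by exactly one rank, so the two levels multiply the available parallelism rather than competing for the same loop.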

The individual achievements of the project are summarised below:

- MPI parallelism was implemented for the nested loops which calculate the central quantities; this was achieved by using a different processor group for each nesting level.
- OpenMP parallelism was implemented for the simple loops at a lower level of the code, e.g. the matrix multiplications which are used in the calculation of matrix elements between basis functions and the Bloch wave functions.
- The two most costly steps in a QSGW self-consistent calculation were benchmarked on HECToR, i.e. the calculation of the susceptibility and screened Coulomb interaction (hx0fp0), and the calculation of the self-energy contributions (hsfp0).
- Calculations for hsfp0 and hx0fp0 were performed for a supercell of 16 Fe atoms with a q-mesh of 5x5x5 q-points, resulting in 10 irreducible points.
- For the self-energy calculation (hsfp0), a total speedup of 74 times relative to a sequential run of the code was demonstrated using around 1600 cores on HECToR.
- The most severe bottleneck in QSGW was the assembly of the screened Coulomb interaction (hx0fp0), which in serial would take on the order of days for one cycle; this calculation can now be performed in 21 minutes using 2640 HECToR cores.
- For the screened Coulomb interaction an absolute speedup of more than 2000 was also demonstrated by using about 4500 cores on HECToR.
- These developments have enabled a whole QSGW self-consistent calculation to be performed within several hours (or days), rather than several weeks.
- The new parallel LMF/QSGW code will be introduced to the community through a dedicated hands-on workshop, to be held under CCP9 (Collaborative Computational Project for the Study of the Electronic Structure of Condensed Matter).
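
The speedup figures above can be put on a common footing as parallel efficiency (speedup divided by core count) and as core-hours. The short sketch below uses only the numbers quoted in the list; since the core counts are given as "around" and "about", and the hx0fp0 speedup as "more than 2000", the results are rough estimates. The helper names are illustrative.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved."""
    return speedup / cores


def core_hours(cores, minutes):
    """Total resource consumed by a run, in core-hours."""
    return cores * minutes / 60


# Self-energy (hsfp0): speedup of 74 on around 1600 cores.
hsfp0_eff = parallel_efficiency(74, 1600)

# Screened Coulomb interaction (hx0fp0): speedup of more than 2000
# on about 4500 cores, so this value is a lower bound.
hx0fp0_eff = parallel_efficiency(2000, 4500)

# hx0fp0 run of 21 minutes on 2640 cores.
hx0fp0_cost = core_hours(2640, 21)

print(f"hsfp0 efficiency:  {hsfp0_eff:.1%}")
print(f"hx0fp0 efficiency: at least {hx0fp0_eff:.1%}")
print(f"hx0fp0 cost:       {hx0fp0_cost:.0f} core-hours")
```

On these figures hx0fp0 scales far more efficiently (at least about 44%) than hsfp0 (roughly 5%), which is consistent with hx0fp0 having been the dominant bottleneck and the main target of the work.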

Please see PDF or HTML for a report which summarises this project.