
Parallel structure constants matrix with MPI

The structure constants were stored in a four-index Fortran array indexed by $(L, L', R, R')$. In this layout the atom pair with the largest $(L, L')$ block dictates the memory requirement, which can be substantial: the maximal $L$ is $(\ell+1)^2$, where the $\ell$ cutoff reaches $2\ell$ of the basis functions for the operators, and $2\ell+1$ when forces or pressure also need to be calculated at the end of the last self-consistency iteration. For example, a $d$ element like Ti is given a basis up to a $d$ orbital, $\ell = 2$, resulting in a structure constant block for the Ti-Ti pair of up to $((2\times2+1)+1)^2 \times (2\times2+1)^2 = 36\times25$, compared with $(2+1)^2\times(2+1)^2 = 9\times9$ for a Hamiltonian block. To reduce the memory requirements the original implementation exploited two symmetry properties: $B_{RLR'L'} = B_{R'L'RL}$ and $B_{RLR'L'} = (-1)^{\ell+\ell'}B_{R'LRL'}$.
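
As a rough illustration of this block dimensioning, the following Fortran sketch computes the worst-case block dimensions and the fixed-block allocation. The names lmax, forces, nbas and b are placeholders for illustration, not the actual variables of the code.

    program block_dims_sketch
      ! Sketch only: worst-case (L,L') dimensions of the fixed-block storage.
      ! lmax   - highest basis-function l (2 for a d element such as Ti)
      ! forces - whether forces/pressure are required, which needs one extra l
      implicit none
      integer, parameter :: lmax = 2, nbas = 2
      logical, parameter :: forces = .true.
      integer :: nlm1, nlm2
      real(8), allocatable :: b(:,:,:,:)

      nlm2 = (2*lmax + 1)**2               ! operator cutoff: (2*2+1)^2 = 25 for Ti
      nlm1 = nlm2
      if (forces) nlm1 = (2*lmax + 2)**2   ! force/pressure cutoff: (2*2+2)^2 = 36
      allocate(b(nlm1, nlm2, nbas, nbas))  ! every pair padded to the largest block
      print '(2i4)', nlm1, nlm2            ! prints 36 25
    end program block_dims_sketch

For Ti this reproduces the $36\times25$ block quoted above; in the fixed format every pair, whatever its own $\ell$, is padded to these dimensions.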

Figure 3: Example $B$ block with the first symmetry rule applied. [Image: strxll]

The second rule means that the extra $L+1$ part needs to be calculated only once and can then be obtained from it for the reverse pair. The calculation of a single block consists of first evaluating the Hankel functions for $R-R'$ up to $L'' = L+L'$, then forming a dot product over $L''$ with the Gaunt coefficients, which are stored in packed format. The Hankel functions are computed directly in the non-periodic case, while Ewald summation is performed in the periodic case; this is the slowest part of the calculation.
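
The following Fortran sketch outlines the per-block assembly just described. The packed Gaunt tables cg, indxcg and jcg, the triangular packing rule and the routine name are assumptions for illustration, not the actual interfaces of the code; the Hankel values hl are taken as already evaluated (directly or by Ewald summation) for the connecting vector $R-R'$.

    ! Sketch: assemble one (L,L') structure constant block for a pair (R,R').
    subroutine strx_block(nlm1, nlm2, hl, cg, indxcg, jcg, b)
      integer, intent(in)  :: nlm1, nlm2
      real(8), intent(in)  :: hl(:)          ! Hankel functions, indexed by L''
      real(8), intent(in)  :: cg(:)          ! packed nonzero Gaunt coefficients
      integer, intent(in)  :: indxcg(:)      ! start of each (L,L') run in cg/jcg
      integer, intent(in)  :: jcg(:)         ! L'' index of each packed coefficient
      real(8), intent(out) :: b(nlm1, nlm2)
      integer :: ilm, jlm, ii, indx, i
      do jlm = 1, nlm2
        do ilm = 1, nlm1
          ii   = max(ilm, jlm)
          indx = (ii*(ii-1))/2 + min(ilm, jlm)   ! triangular packing over (L,L')
          b(ilm, jlm) = 0d0
          do i = indxcg(indx), indxcg(indx+1) - 1
            ! dot product over L'' of Gaunt coefficients with the Hankels
            b(ilm, jlm) = b(ilm, jlm) + cg(i)*hl(jcg(i))
          end do
        end do
      end do
    end subroutine strx_block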

The electrostatic potential calculation consists of a matrix by block-vector product between the structure constants and the charge multipoles along the $R'L'$ indices. A convenient optimisation is to swap the indices, giving $(L',L,R',R)$, and then spread the $R$ atoms over all available processes as evenly as possible without splitting a block, so that the effort for the Hankel functions is not duplicated. In this way the storage is distributed over the processes and only a small gather operation is needed for the final potential block vector. This strategy was implemented and showed superlinear speedup, most likely due to the decreased memory requirement per process, see Fig. 4.
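
A schematic of this distribution strategy is given below as a self-contained MPI Fortran program with placeholder sizes and a dummy contraction; a simple mpi_allreduce stands in here for the small gather of the distributed potential pieces.

    program strx_mpi_sketch
      ! Sketch: spread the R atoms over ranks in whole blocks, form the local
      ! part of the structure-constant times multipole product, then combine.
      use mpi
      implicit none
      integer, parameter :: nbas = 64, nlm = 25      ! placeholder sizes
      integer :: ierr, rank, nproc, ir, r_begin, r_end
      real(8) :: vloc(nlm, nbas), v(nlm, nbas)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)

      ! Even split of whole atoms; a block is never split across ranks, so the
      ! Hankel functions for a given pair are evaluated on exactly one rank.
      r_begin = rank*nbas/nproc + 1
      r_end   = (rank + 1)*nbas/nproc

      vloc = 0d0
      do ir = r_begin, r_end
        ! Here the real code would generate the blocks B(L',L,R',ir) and
        ! accumulate vloc(:,ir) = sum over R'L' of B(L',L,R',ir)*q(L',R').
        vloc(:, ir) = 1d0                            ! stand-in for the contraction
      end do

      ! Combine the distributed pieces of the potential block vector.
      call mpi_allreduce(vloc, v, nlm*nbas, mpi_real8, mpi_sum, &
                         mpi_comm_world, ierr)

      call mpi_finalize(ierr)
    end program strx_mpi_sketch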

Figure 4: Parallel structure constants generation times. [Image: mkstrx-times]

Since one often needs to study systems containing atoms of different types, e.g. H2O, Ti2O, etc., the pairs with lower $L$ requirements waste a significant portion of the allocated block size. For example, for a H2O molecule we would allocate $(3\times9\times3\times16)\times8/2^{10} = 10.125$ KB but only use $((9+1+1)\times(16+4+4))\times8/2^{10} = 2.0625$ KB. The difference is striking, and for this reason the fixed block format was abandoned. Having the block structure is, however, more convenient and cache friendly than dealing with a plain array, so the solution was to create an index of pointers while preserving the storage format, achieving a fivefold decrease in the total memory requirements. This was in fact a second revision, because the first implementation, based on a 2D Fortran pointer array, proved difficult to interface with C/CUDA.
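
A minimal sketch of the pointer-index idea follows, with hypothetical names and the H2O block sizes from the example above: blocks of different sizes are packed contiguously into one array, and an offset table records where each $(R,R')$ block starts, so the block view is kept without padding every pair to the largest size.

    program packed_blocks_sketch
      ! Sketch: variable-size blocks packed into one array with an offset index.
      implicit none
      integer, parameter :: nbas = 3
      integer :: nlm1(nbas), nlm2(nbas)   ! per-atom block dimensions
      integer :: offs(nbas, nbas)         ! start of block (R,R') in bpack
      real(8), allocatable :: bpack(:)
      integer :: ir, jr, pos

      nlm1 = (/ 1, 1, 9 /)                ! H, H, O without the force extension
      nlm2 = (/ 4, 4, 16 /)               ! H, H, O with the force extension
      pos = 1
      do jr = 1, nbas
        do ir = 1, nbas
          offs(ir, jr) = pos              ! "pointer" to the (ir,jr) block
          pos = pos + nlm1(ir)*nlm2(jr)   ! each block takes only the space it needs
        end do
      end do
      allocate(bpack(pos-1))
      ! Block (ir,jr) is the slice bpack(offs(ir,jr) : offs(ir,jr)+nlm1(ir)*nlm2(jr)-1),
      ! viewed as an nlm1(ir) x nlm2(jr) matrix when convenient.
      print '(i6)', pos - 1               ! prints 264 (vs 1296 in the fixed format)
    end program packed_blocks_sketch

With these dimensions the packed store holds 264 elements, i.e. the 2.0625 KB of the example, versus $27\times48 = 1296$ elements (10.125 KB) in the fixed-block format.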

