During later testing, it was discovered that a large proportion of both the memory and the computational time of Castep calculations was consumed by the subroutine ion_beta_beta_recip. Both costs arise because this subroutine has to construct a modified overlap matrix between the so-called non-local projectors, often referred to as the β-projectors. These projectors are arrays of plane-wave coefficients, just like a wavefunction, but because they are independent of the bands they are distributed only by plane-wave and k-point. The precise operation this subroutine performs is the construction of the following matrix:
[Equations (4.1) and (4.2), defining this matrix, are missing from the source.]
Our solution to this problem was to distribute the β-projectors over the band-group. In the first phase the local β-projectors are constructed, and a call to ZHERK computes the purely local contribution to the matrix. The second phase requires the overlap of the local projectors with the projectors held on the other nodes, which is computed via a call to ZGEMM. Rather than fetch the relevant data via communications, we instead redefined the β-projectors at this point so that they contain the full effect of the diagonal matrix […], i.e. […]. This allows us to compute the second contribution to the matrix by simply computing the overlap between the local node's β-projectors and the non-distributed β-projectors.
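The two-phase scheme can be illustrated with a small, serial sketch. This is not the Castep implementation: it uses pure Python in place of the BLAS calls, it simulates the nodes of a band-group with a loop, and it omits the diagonal-matrix scaling because that matrix is not defined in this section. The function names `herk_like`, `overlap` and `distributed_overlap`, and the round-robin assignment of projectors to nodes, are assumptions made for illustration only.

```python
import random


def overlap(a, b):
    """General product a^H b (the ZGEMM-style step).

    a and b are lists of columns; each column is a list of complex
    plane-wave coefficients.
    """
    return [[sum(x.conjugate() * y for x, y in zip(ca, cb)) for cb in b]
            for ca in a]


def herk_like(a):
    """Hermitian product a^H a (the ZHERK-style step).

    Only the upper triangle is computed; the lower triangle is filled in
    by conjugate symmetry, which is the saving a Hermitian rank-k update
    offers over a general matrix product.
    """
    n = len(a)
    out = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            v = sum(x.conjugate() * y for x, y in zip(a[i], a[j]))
            out[i][j] = v
            out[j][i] = v.conjugate()
    return out


def distributed_overlap(columns, n_nodes):
    """Assemble the full overlap matrix from per-node contributions."""
    n = len(columns)
    # Hypothetical round-robin distribution of projectors over the nodes.
    owners = [list(range(rank, n, n_nodes)) for rank in range(n_nodes)]
    full = [[0j] * n for _ in range(n)]
    for mine in owners:
        local = [columns[i] for i in mine]
        # Phase 1: purely local contribution (diagonal block).
        diag = herk_like(local)
        for a, i in enumerate(mine):
            for b, j in enumerate(mine):
                full[i][j] = diag[a][b]
        # Phase 2: local projectors against a non-distributed copy of
        # the projectors owned by the other nodes (off-diagonal blocks).
        others = [j for j in range(n) if j not in mine]
        off = overlap(local, [columns[j] for j in others])
        for a, i in enumerate(mine):
            for b, j in enumerate(others):
                full[i][j] = off[a][b]
    return full


if __name__ == "__main__":
    rng = random.Random(0)
    projectors = [[complex(rng.random(), rng.random()) for _ in range(8)]
                  for _ in range(5)]
    reference = overlap(projectors, projectors)
    result = distributed_overlap(projectors, n_nodes=2)
    worst = max(abs(result[i][j] - reference[i][j])
                for i in range(5) for j in range(5))
    print("largest deviation from the undistributed result:", worst)
```

In this sketch each simulated node fills only the rows of the matrix that belong to its own projectors, so no node needs the distributed data of another node; it only needs the non-distributed copy used in the second phase.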
Work completed, tested and working.