

Ion

In the density_augment subroutines, the augmentation charge and spin densities now have to be reduced over the band-group.

During later testing, it was discovered that a large proportion of both the memory and the computational time of Castep calculations was consumed by the subroutine ion_beta_beta_recip. Both costs arise because this subroutine has to construct a modified overlap matrix between the so-called non-local projectors, often referred to as the $\beta $-projectors. These projectors are arrays of plane-wave coefficients, just like a wavefunction, but because they are independent of the bands they are distributed only by plane-wave and k-point. Specifically, the subroutine constructs the matrix $B$:


\begin{displaymath}
B_{ij} = \sum_{p=1}^{N_p}\beta_{pi}^{*}K_{pp}\beta_{pj}
\end{displaymath} (4.1)

where $K$, with elements $K_{pp}$, is a diagonal positive-definite matrix used for preconditioning. The memory problem arises because, in order to exploit the optimised BLAS most efficiently, this expression is rewritten as
\begin{displaymath}
B_{ij} = \sum_{p=1}^{N_p}\gamma_{pi}^{*}\gamma_{pj}
\end{displaymath} (4.2)

where $\gamma_{pi}=\sqrt{K_{pp}}\beta_{pi}$. This allows the use of the BLAS subroutine ZHERK, but at the cost of an extra copy of the non-local projectors.
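The equivalence of equations (4.1) and (4.2) can be checked numerically on a small example. The following is a minimal pure-Python sketch, not the Castep Fortran code: the function names, the list-of-rows storage (plane waves by projectors) and the random test data are all assumptions made for illustration. The second function also makes the memory cost visible, since it builds a full scaled copy of the projectors before forming the product.

```python
import math
import random


def overlap_direct(beta, k):
    """B_ij = sum_p conj(beta[p][i]) * k[p] * beta[p][j], as in Eq. (4.1)."""
    n_p, n_proj = len(beta), len(beta[0])
    return [[sum(beta[p][i].conjugate() * k[p] * beta[p][j] for p in range(n_p))
             for j in range(n_proj)]
            for i in range(n_proj)]


def overlap_scaled(beta, k):
    """Same matrix via gamma = sqrt(K) beta and B = gamma^H gamma, as in Eq. (4.2)."""
    # This scaled copy is the extra projector-sized array mentioned in the text.
    gamma = [[math.sqrt(k[p]) * b for b in row] for p, row in enumerate(beta)]
    n_p, n_proj = len(gamma), len(gamma[0])
    return [[sum(gamma[p][i].conjugate() * gamma[p][j] for p in range(n_p))
             for j in range(n_proj)]
            for i in range(n_proj)]


def max_abs_diff(a, b):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def random_projectors(n_p, n_proj, seed=0):
    """Hypothetical test data: complex projectors and a positive diagonal K."""
    rng = random.Random(seed)
    beta = [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_proj)]
            for _ in range(n_p)]
    k = [rng.uniform(0.1, 2.0) for _ in range(n_p)]
    return beta, k
```

Because $K$ is real and positive, the square root can be split symmetrically between the two factors, which is what makes the product Hermitian rank-k in form and hence suitable for ZHERK.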

Our solution to this problem was to distribute the $\gamma$-projectors over the band-group. In the first phase the local $\gamma$-projectors are constructed, and a call to ZHERK computes the purely local contribution to the $B$ matrix. The second phase requires the overlap of the local projectors with the projectors on the other nodes, computed via a call to ZGEMM. Rather than fetch the relevant data via comms, we instead redefined the $\gamma$-projectors at this point so that they now contained the full effect of the diagonal matrix $K$, i.e. $\gamma_{pi}=K_{pp}\beta_{pi}$. Because every node in the band-group already holds the non-distributed $\beta $-projectors, this allows us to compute the second contribution to $B$ by simply computing the overlap between the local node's $\gamma$-projectors and the $\beta $-projectors.
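The two-phase scheme can be sketched serially as follows. This is an illustrative pure-Python model under stated assumptions, not the Castep implementation: the "nodes" are simulated by a loop, the round-robin assignment of projectors to nodes is hypothetical, the helper names are invented, and the reduction over the plane-wave distribution is omitted. The plain products stand in for the ZHERK and ZGEMM calls.

```python
import math


def hermitian_product(a, b):
    """Return a^H b for matrices stored as lists of rows (plane waves x projectors)."""
    n_p = len(a)
    return [[sum(a[p][i].conjugate() * b[p][j] for p in range(n_p))
             for j in range(len(b[0]))]
            for i in range(len(a[0]))]


def scale_rows(m, weights):
    """Multiply row p of m by weights[p], i.e. apply a diagonal matrix."""
    return [[w * x for x in row] for w, row in zip(weights, m)]


def take_columns(m, cols):
    """Extract the projector columns owned by one node."""
    return [[row[c] for c in cols] for row in m]


def two_phase_overlap(beta, k, n_nodes):
    """Assemble B = beta^H K beta from per-node contributions."""
    n_proj = len(beta[0])
    b = [[0j] * n_proj for _ in range(n_proj)]
    for node in range(n_nodes):
        cols = list(range(node, n_proj, n_nodes))  # hypothetical round-robin split
        if not cols:
            continue
        local = take_columns(beta, cols)

        # Phase 1: gamma = sqrt(K) beta_local; gamma^H gamma gives the local block.
        gamma = scale_rows(local, [math.sqrt(w) for w in k])
        block = hermitian_product(gamma, gamma)
        for a, i in enumerate(cols):
            for c, j in enumerate(cols):
                b[i][j] = block[a][c]

        # Phase 2: redefine gamma = K beta_local and overlap it with the full,
        # non-distributed beta; keep only entries pairing with other nodes' columns.
        gamma = scale_rows(local, k)
        rows = hermitian_product(gamma, beta)
        owned = set(cols)
        for a, i in enumerate(cols):
            for j in range(n_proj):
                if j not in owned:
                    b[i][j] = rows[a][j]
    return b
```

In this model each node only ever holds a $\gamma$ array for its own share of the projectors, so the extra storage per node shrinks with the size of the band-group, and no projector data has to be communicated for the second phase.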

Work completed, tested and working.



Sarfraz A Nadeem 2008-09-01