Ion

In the density_augment subroutines, the augmentation charge and spin densities now have to be reduced over the band-group.

During later testing, it was discovered that a large proportion of both the memory and the computational time of Castep calculations was spent in the subroutine ion_beta_beta_recip. Both costs arise because this subroutine has to construct a modified overlap matrix between the so-called non-local projectors, often referred to as the $\beta$-projectors. These projectors are arrays of plane-wave coefficients, just like a wavefunction, but because they are independent of the bands they are only distributed by plane-wave and k-point. The precise operation this subroutine performs is the construction of the matrix

$$B_{ij} = \sum_{p=1}^{N_p}\beta_{pi}^{*}K_{pp}\beta_{pj} \tag{4.1}$$

where $K_{pp}$ is a diagonal positive-definite matrix used for preconditioning. Memory becomes a problem because, in order to exploit the optimised BLAS most efficiently, this is rewritten as

$$B_{ij} = \sum_{p=1}^{N_p}\gamma_{pi}^{*}\gamma_{pj} \tag{4.2}$$

where $\gamma_{pi}=\sqrt{K_{pp}}\beta_{pi}$. This allows the use of the BLAS subroutine ZHERK (a Hermitian rank-k update), but at the cost of an extra copy of the non-local projectors.
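The equivalence of Eq. (4.1) and Eq. (4.2) can be checked numerically on a toy problem. The following is a minimal sketch in plain Python, not Castep code: the function names, the nested-list storage and the sample values are all assumptions made for illustration, and the explicit loops merely stand in for the product that ZHERK would compute.

```python
import math

def overlap_direct(beta, k_diag):
    """B_ij = sum_p conj(beta[p][i]) * K_pp * beta[p][j], as in Eq. (4.1)."""
    n_p, n_proj = len(beta), len(beta[0])
    return [[sum(beta[p][i].conjugate() * k_diag[p] * beta[p][j]
                 for p in range(n_p))
             for j in range(n_proj)]
            for i in range(n_proj)]

def overlap_via_gamma(beta, k_diag):
    """Same matrix from gamma = sqrt(K) * beta, as in Eq. (4.2)."""
    # gamma is a full extra copy of the projectors: this is the memory cost.
    gamma = [[math.sqrt(k_diag[p]) * b for b in row]
             for p, row in enumerate(beta)]
    n_p, n_proj = len(gamma), len(gamma[0])
    # Stand-in for the Hermitian rank-k product gamma^H * gamma.
    return [[sum(gamma[p][i].conjugate() * gamma[p][j] for p in range(n_p))
             for j in range(n_proj)]
            for i in range(n_proj)]

def max_abs_diff(a, b):
    """Largest element-wise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Hypothetical toy data: 3 plane waves (rows), 2 projectors (columns).
beta = [[1 + 2j, 0.5 - 1j],
        [0 - 1j, 2 + 0j],
        [3 + 0.5j, -1 + 1j]]
k_diag = [0.5, 2.0, 1.5]  # positive diagonal of K

print(max_abs_diff(overlap_direct(beta, k_diag),
                   overlap_via_gamma(beta, k_diag)))
```

The printed difference should be at the level of floating-point rounding. The rewrite relies on $K_{pp}$ being real and positive, so that $\sqrt{K_{pp}}$ is real and can be split evenly between the two factors.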

Our solution to this problem was to distribute the $\gamma$-projectors over the band-group. In the first phase the local $\gamma$-projectors are constructed, and a call to ZHERK computes the purely local contribution to the matrix. The second phase requires the overlap of the local projectors with the projectors on the other nodes, which is computed via a call to ZGEMM. Rather than get the relevant data via comms, we instead redefined the $\gamma$-projectors at this point so that they now contained the full effect of the diagonal matrix, i.e. $\gamma_{pi}=K_{pp}\beta_{pi}$. This allows us to compute the second contribution to the matrix by simply computing the overlap between the local node's $\gamma$-projectors and the non-distributed $\beta$-projectors.
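The two-phase scheme can be illustrated with a serial toy model in plain Python. This is a sketch under stated assumptions, not the Castep implementation: the "nodes" are simulated by a loop, the contiguous split of projector indices and all function names are hypothetical, and the explicit sums stand in for the ZHERK and ZGEMM calls.

```python
import math

def split_columns(n_proj, n_nodes):
    """Assumed layout: contiguous blocks of projector indices, one per node."""
    base, extra = divmod(n_proj, n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks

def node_rows(beta, k_diag, mine):
    """Rows of B for the projector indices `mine` held by one node."""
    n_p, n_proj = len(beta), len(beta[0])
    rows = {i: [0j] * n_proj for i in mine}
    # Phase 1: local gamma = sqrt(K) * beta, local-local block (ZHERK stand-in).
    g_half = {i: [math.sqrt(k_diag[p]) * beta[p][i] for p in range(n_p)]
              for i in mine}
    for i in mine:
        for j in mine:
            rows[i][j] = sum(g_half[i][p].conjugate() * g_half[j][p]
                             for p in range(n_p))
    # Phase 2: redefine local gamma = K * beta and overlap it with the
    # non-distributed beta for the remaining columns (ZGEMM stand-in).
    g_full = {i: [k_diag[p] * beta[p][i] for p in range(n_p)] for i in mine}
    for i in mine:
        for j in range(n_proj):
            if j not in mine:
                rows[i][j] = sum(g_full[i][p].conjugate() * beta[p][j]
                                 for p in range(n_p))
    return rows

def distributed_overlap(beta, k_diag, n_nodes):
    """Assemble B from the rows computed by each simulated node."""
    n_proj = len(beta[0])
    b = [None] * n_proj
    for mine in split_columns(n_proj, n_nodes):
        for i, row in node_rows(beta, k_diag, mine).items():
            b[i] = row
    return b

def reference_overlap(beta, k_diag):
    """Direct evaluation of B_ij = sum_p conj(beta_pi) K_pp beta_pj."""
    n_p, n_proj = len(beta), len(beta[0])
    return [[sum(beta[p][i].conjugate() * k_diag[p] * beta[p][j]
                 for p in range(n_p))
             for j in range(n_proj)]
            for i in range(n_proj)]
```

In this model each node only ever stores $\gamma$ for its own projectors, and phase 2 needs no data from other nodes because $\beta$ is already available everywhere in the band-group. Calling `distributed_overlap(beta, k_diag, n_nodes)` for any node count should reproduce `reference_overlap(beta, k_diag)` to rounding error.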

Work completed, tested and working.

Sarfraz A Nadeem 2008-09-01