Triangular Matrix Optimisations

Whenever the set of bands has to be orthonormalised, the transformation matrix is in fact triangular. G-vector parallel calculations already exploit this: they call the triangular BLAS subroutines dtrmm and ztrmm rather than the general dgemm and zgemm, which gives a significant boost in performance.
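The saving from a triangular routine can be illustrated with a small operation count. The sketch below is not CASTEP code and does not call BLAS. It is a pure-Python illustration that assumes an upper-triangular matrix applied from the left, and the function names are hypothetical. It shows that skipping the known-zero entries gives the same product with roughly half the multiply-adds.

```python
def general_multiply(t, b):
    """Multiply an n x n matrix t by an n x m matrix b, counting multiply-adds."""
    n, m = len(t), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    ops = 0
    for i in range(n):
        for k in range(n):
            for j in range(m):
                c[i][j] += t[i][k] * b[k][j]
                ops += 1
    return c, ops


def upper_triangular_multiply(t, b):
    """Same product, but skip the entries below the diagonal of t (k < i)."""
    n, m = len(t), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    ops = 0
    for i in range(n):
        for k in range(i, n):  # entries with k < i are zero by assumption
            for j in range(m):
                c[i][j] += t[i][k] * b[k][j]
                ops += 1
    return c, ops


# Illustrative sizes only: 6 "bands", 4 coefficients per band.
n, m = 6, 4
t = [[float(i + k + 1) if k >= i else 0.0 for k in range(n)] for i in range(n)]
b = [[float(k * m + j) for j in range(m)] for k in range(n)]

full, full_ops = general_multiply(t, b)
tri, tri_ops = upper_triangular_multiply(t, b)
```

For these sizes the general loop performs n*n*m multiply-adds and the triangular loop n*(n+1)/2*m, so the ratio approaches one half as n grows.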

Because the bands are distributed in a round-robin fashion amongst the cores, transforming a wavefunction by a (global) triangular matrix means that each distributed transformation in step (2) of the algorithm is also triangular. Optimising this operation is relatively straightforward, provided care is taken when the local number of bands differs from that of the client core; this case is handled with simple zero-padding. The results are shown in Table 3.

Table 3: Comparison between the original and optimised triangular matrix operation for the al3x3 benchmark and CASTEP 6.1.

Total cores   Band parallelism   Original code   Optimised for triangular matrices
512           8                  816 s           807 s
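The claim that a round-robin band distribution keeps every distributed block triangular can be checked with a short sketch. This is an illustration under stated assumptions, not the CASTEP implementation: it assumes an upper-triangular global matrix, 0-based band indices, and that each distributed transformation uses the sub-block of the global matrix whose rows belong to one core's bands and whose columns belong to another's. The helper names are hypothetical. Blocks are zero-padded to a common square size so that cores holding fewer bands are handled uniformly.

```python
def local_bands(rank, n_cores, n_bands):
    """Global indices of the bands held by `rank` under round-robin distribution."""
    return list(range(rank, n_bands, n_cores))


def padded_block(t, row_bands, col_bands, size):
    """Extract the sub-block of t for the given bands, zero-padded to size x size."""
    block = [[0.0] * size for _ in range(size)]
    for i, r in enumerate(row_bands):
        for j, c in enumerate(col_bands):
            block[i][j] = t[r][c]
    return block


def is_upper_triangular(block):
    """True if every entry strictly below the diagonal is zero."""
    return all(block[i][j] == 0.0 for i in range(len(block)) for j in range(i))


# Illustrative sizes only: 10 bands over 4 cores, so cores hold 3, 3, 2 and 2 bands.
n_bands, n_cores = 10, 4
t = [[1.0 if c >= r else 0.0 for c in range(n_bands)] for r in range(n_bands)]

size = -(-n_bands // n_cores)  # ceiling division: largest local band count
owned = [local_bands(p, n_cores, n_bands) for p in range(n_cores)]
blocks = {
    (q, p): padded_block(t, owned[q], owned[p], size)
    for q in range(n_cores)
    for p in range(n_cores)
}
all_triangular = all(is_upper_triangular(blk) for blk in blocks.values())
```

The reason this works is that local row i on core q and local column j on core p correspond to global indices q + i*n_cores and p + j*n_cores. Whenever i > j the global row index exceeds the global column index, so the entry is zero. A contiguous block distribution would not have this property, since blocks below the diagonal would be entirely zero and blocks above it entirely full.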