Whenever the set of bands has to be orthonormalised, the transformation matrix is in fact triangular. G-vector parallel
calculations exploit this by calling the triangular BLAS subroutines dtrmm and ztrmm instead of the general
dgemm and zgemm ones, which gives a significant boost in performance.
Because the bands are distributed in a round-robin fashion amongst the cores, transforming a wavefunction by a (global) triangular matrix means that each distributed transformation in step (2) of the algorithm is also triangular. Optimising this operation is relatively straightforward, provided care is taken when the local number of bands differs from that of the client core; this case is handled easily by zero-padding. The results are shown in Table 3.
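
The triangular structure of the distributed blocks can be checked with a small sketch. It is an illustration only, not the production implementation: the function names (`owned_bands`, `local_block`, `is_upper_triangular`) are hypothetical, the global matrix is assumed to be upper triangular, and the padding rule (pad each block to the larger of the two band counts) is one plausible reading of the zero-padding described above. With round-robin ownership, core p holds global bands p, p + P, p + 2P, ..., so the local element (i, j) of the block coupling cores p and q is the global element (p + iP, q + jP). This vanishes whenever i > j, so every block is upper triangular, and strictly so when p > q.

```python
def owned_bands(core, n_cores, n_bands):
    """Global band indices held by `core` under a round-robin distribution."""
    return list(range(core, n_bands, n_cores))


def local_block(T, row_core, col_core, n_cores):
    """Block of the global matrix T coupling the bands on row_core to those
    on col_core, zero-padded to a square (assumed padding convention)."""
    n_bands = len(T)
    rows = owned_bands(row_core, n_cores, n_bands)
    cols = owned_bands(col_core, n_cores, n_bands)
    size = max(len(rows), len(cols))  # pad when the two band counts differ
    block = [[0.0] * size for _ in range(size)]
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            block[i][j] = T[r][c]
    return block


def is_upper_triangular(block):
    """True if every element below the diagonal is exactly zero."""
    return all(block[i][j] == 0.0
               for i in range(len(block)) for j in range(i))


# 7 bands on 3 cores: core 0 holds 3 bands, cores 1 and 2 hold 2 each.
n_bands, n_cores = 7, 3

# Arbitrary upper-triangular global transformation matrix.
T = [[float(1 + r + c) if r <= c else 0.0 for c in range(n_bands)]
     for r in range(n_bands)]

blocks = {(p, q): local_block(T, p, q, n_cores)
          for p in range(n_cores) for q in range(n_cores)}
all_triangular = all(is_upper_triangular(b) for b in blocks.values())
```

Because every padded block is square and triangular, each one could in principle be passed to a triangular multiply such as dtrmm or ztrmm rather than to a general matrix multiply; the padded rows and columns contribute only zeros.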