Both the Cray PAT and the built-in Castep trace showed that a considerable
amount of the time in ZGEMM, as well as of the non-library time, was
spent in nlpot_apply_precon. The non-library time was attributable
to a packing routine which takes the unpacked array beta_phi, which
contains the projections of the wavefunction bands onto the nonlocal
pseudopotential projectors, and packs it into a temporary
array. Unfortunately this operation was written inefficiently: the
innermost loop ran over the slowest-varying array index, so successive
iterations accessed memory with a large stride rather than contiguously.
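
The loop-ordering problem can be sketched in a few lines. This is only an illustration, not the Castep routine: the function names, the array shape and the use of a flat Python list to mimic a column-major (Fortran-style) array are all assumptions made for the example. Both versions copy the same data; they differ only in the order in which memory offsets are visited.

```python
# Illustrative sketch only: hypothetical names, not Castep code.
# A 2-D array of shape (n_rows, n_cols) is stored column-major in a flat
# list, so element (i, j) lives at offset i + j * n_rows and the row
# index i is the fastest-varying one.

def pack_inner_slowest(src, n_rows, n_cols):
    """Copy src with the innermost loop over the slowest index (j)."""
    dst = [0j] * (n_rows * n_cols)
    offsets = []                      # order in which memory is touched
    for i in range(n_rows):
        for j in range(n_cols):       # inner loop jumps by n_rows each step
            k = i + j * n_rows
            dst[k] = src[k]
            offsets.append(k)
    return dst, offsets

def pack_inner_fastest(src, n_rows, n_cols):
    """Copy src with the innermost loop over the fastest index (i)."""
    dst = [0j] * (n_rows * n_cols)
    offsets = []
    for j in range(n_cols):
        for i in range(n_rows):       # inner loop walks consecutive offsets
            k = i + j * n_rows
            dst[k] = src[k]
            offsets.append(k)
    return dst, offsets

if __name__ == "__main__":
    data = [complex(k, -k) for k in range(6)]   # 3 rows x 2 columns
    _, strided = pack_inner_slowest(data, 3, 2)
    _, contiguous = pack_inner_fastest(data, 3, 2)
    print(strided)      # [0, 3, 1, 4, 2, 5]
    print(contiguous)   # [0, 1, 2, 3, 4, 5]
```

In a compiled language the second ordering lets each cache line be used fully before the next one is fetched, which is why swapping the loops removes most of the cost of such a copy.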
The ZGEMM time in nlpot_apply_precon could also be reduced because
the first matrix in the multiplication was in fact Hermitian, so the
call could be replaced by ZHEMM to do approximately half the work.
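
Why a Hermitian first matrix allows a cheaper call can be seen from a small pure-Python sketch. It is not the BLAS implementation and not Castep code; the function names and the test matrices are hypothetical. The Hermitian-aware version reads only the upper triangle of A and reconstructs the lower triangle by conjugation, in the same way that ZHEMM called with UPLO = 'U' references only the upper triangular part of its Hermitian argument.

```python
# Illustrative sketch only: hypothetical names, not BLAS or Castep code.

def matmul_general(a, b):
    """C = A B using every element of A (what a general multiply reads)."""
    n, m = len(a), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(m)]
            for i in range(n)]

def matmul_hermitian_upper(a, b):
    """C = A B for Hermitian A, reading only elements a[i][k] with i <= k."""
    n, m = len(a), len(b[0])
    c = [[0j] * m for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Below the diagonal, use A[i][k] = conj(A[k][i]) instead of
            # touching the stored lower triangle.
            a_ik = a[i][k] if i <= k else a[k][i].conjugate()
            for j in range(m):
                c[i][j] += a_ik * b[k][j]
    return c

if __name__ == "__main__":
    a_full = [[2 + 0j, 1 + 1j],
              [1 - 1j, 3 + 0j]]           # Hermitian matrix
    a_upper = [[2 + 0j, 1 + 1j],
               [999 + 0j, 3 + 0j]]        # lower triangle deliberately wrong
    b = [[1j, 2 + 0j],
         [3 + 0j, 1 - 1j]]
    print(matmul_hermitian_upper(a_upper, b) == matmul_general(a_full, b))
```

Because the lower triangle is never read, the junk value placed there in `a_upper` has no effect on the product.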
Sarfraz A Nadeem