

Analysis

Both the Cray PAT profile and the built-in Castep trace showed that a considerable amount of the time spent in ZGEMM, as well as the non-library time, was incurred in nlpot_apply_precon. The non-library time was attributable to a packing routine, which takes the unpacked array beta_phi (containing the projections of the wavefunction bands onto the nonlocal pseudopotential projectors) and packs it into a temporary array. Unfortunately this operation was poorly written: the innermost loop ran over the slowest-varying array index, so successive iterations accessed memory with a large stride rather than contiguously.
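The loop-ordering problem can be illustrated with a minimal sketch. This is not Castep's source (Castep is written in Fortran); the C code below emulates Fortran's column-major layout, in which the first index varies fastest, and the function names, the IDX macro, the array shapes and the leading dimension ld are all hypothetical. Both routines copy the same nproj x nbands block into a contiguous array; only the loop order differs.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major index, as in Fortran: the first index i varies fastest. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Poor ordering (hypothetical): the inner loop runs over j, the slowest
 * index, so consecutive iterations are ld elements apart in memory. */
void pack_slow(int nproj, int nbands, int ld,
               const double complex *beta_phi, double complex *packed)
{
    assert(ld >= nproj);
    for (int i = 0; i < nproj; i++)
        for (int j = 0; j < nbands; j++)
            packed[IDX(i, j, nproj)] = beta_phi[IDX(i, j, ld)];
}

/* Interchanged loops: the inner loop runs over i, the fastest index,
 * so both the reads and the writes are unit-stride. */
void pack_fast(int nproj, int nbands, int ld,
               const double complex *beta_phi, double complex *packed)
{
    assert(ld >= nproj);
    for (int j = 0; j < nbands; j++)
        for (int i = 0; i < nproj; i++)
            packed[IDX(i, j, nproj)] = beta_phi[IDX(i, j, ld)];
}
```

The two routines produce identical output, so the interchange is purely a memory-access optimisation; how much it gains depends on the array sizes and the cache hierarchy.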

The ZGEMM time in nlpot_apply_precon could also be reduced: the first matrix in the multiplication is in fact Hermitian, so the ZGEMM call could be replaced by a call to ZHEMM, which does approximately half the work.
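The property that ZHEMM relies on can be shown with a small sketch in plain C. It does not call BLAS and is not Castep's code; the function names, the IDX macro and the column-major storage convention are assumptions made for illustration. The first routine forms C = A*B reading every element of A, as a general multiply must; the second reads only the upper triangle of A and reconstructs the lower triangle by conjugation.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Column-major index, as in Fortran and BLAS. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* General product C = A*B: A is n x n, B and C are n x m.
 * Every element of A is read. */
void matmul_general(int n, int m, const double complex *a,
                    const double complex *b, double complex *c)
{
    assert(n > 0 && m > 0);
    for (int j = 0; j < m; j++) {
        for (int i = 0; i < n; i++)
            c[IDX(i, j, n)] = 0.0;
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                c[IDX(i, j, n)] += a[IDX(i, k, n)] * b[IDX(k, j, n)];
    }
}

/* Hermitian product C = A*B reading only the upper triangle of A
 * (elements with row <= column). The lower triangle is never touched:
 * A(k,i) is obtained as conj(A(i,k)). */
void matmul_hermitian_upper(int n, int m, const double complex *a,
                            const double complex *b, double complex *c)
{
    assert(n > 0 && m > 0);
    for (int j = 0; j < m; j++) {
        for (int i = 0; i < n; i++)
            c[IDX(i, j, n)] = 0.0;
        for (int k = 0; k < n; k++) {
            /* The diagonal of a Hermitian matrix is real. */
            c[IDX(k, j, n)] += creal(a[IDX(k, k, n)]) * b[IDX(k, j, n)];
            for (int i = 0; i < k; i++) {
                double complex aik = a[IDX(i, k, n)];
                c[IDX(i, j, n)] += aik * b[IDX(k, j, n)];       /* A(i,k) B(k,j) */
                c[IDX(k, j, n)] += conj(aik) * b[IDX(i, j, n)]; /* A(k,i) B(i,j) */
            }
        }
    }
}
```

In a production code the second routine would be replaced by a call to the library's ZHEMM, which takes an argument selecting which triangle of the Hermitian matrix is referenced.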



Sarfraz A Nadeem 2008-09-01