In summary, three alternative methods for performing the global communication in the 3D FFT transpose were implemented: SHM Alltoallv, Padded Alltoall and Scatter/Gather Alltoallv. SHM Alltoallv was found to perform well and to scale the best of the three implementations. Speedups of up to 400% (on 128 cores of HECToR Phase 2a) were demonstrated for the 3D FFT in isolation, which delivered benefits in the range of 4-36% in full application benchmarks. Even with these improvements, some large jobs would not fit into the 12 hour queue limit on HECToR, so a checkpoint and restart mechanism was added for non-SCF calculations using the PW.X code.
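To make the trade-off behind a padded exchange concrete, the following is a minimal sketch in plain Python (no MPI) of the general idea the name Padded Alltoall suggests: every block is padded to a common width so that a fixed-size all-to-all can replace a variable-size one, and the receiver trims the padding using the known counts. It is an illustration under stated assumptions, not the code added to Quantum Espresso; the function names `pad_blocks`, `simulated_alltoall` and `unpad`, and the example data, are hypothetical.

```python
# Illustrative sketch only: simulates ranks as list indices, no MPI involved.
# All names here are hypothetical and not taken from Quantum Espresso.


def pad_blocks(send_blocks, pad_value=0.0):
    """Pad every block to the largest block length.

    send_blocks[src][dst] is the list of values rank src sends to rank dst.
    Returns the padded blocks and the common block width.
    """
    width = max(len(block) for row in send_blocks for block in row)
    padded = [
        [block + [pad_value] * (width - len(block)) for block in row]
        for row in send_blocks
    ]
    return padded, width


def simulated_alltoall(blocks):
    """Fixed-size all-to-all: rank dst receives blocks[src][dst] from each src."""
    n = len(blocks)
    return [[blocks[src][dst] for src in range(n)] for dst in range(n)]


def unpad(recv_blocks, counts):
    """Trim padding; counts[src][dst] is the original length sent src -> dst."""
    n = len(recv_blocks)
    return [
        [recv_blocks[dst][src][: counts[src][dst]] for src in range(n)]
        for dst in range(n)
    ]


if __name__ == "__main__":
    # Three simulated ranks with uneven block sizes (made-up data).
    send = [
        [[1.0], [2.0, 3.0], []],
        [[4.0, 5.0, 6.0], [7.0], [8.0]],
        [[], [9.0], [10.0, 11.0]],
    ]
    counts = [[len(block) for block in row] for row in send]
    padded, width = pad_blocks(send)
    recv = unpad(simulated_alltoall(padded), counts)
    payload = sum(sum(row) for row in counts)
    moved = width * len(send) * len(send)
    print("received per rank:", recv)
    print("payload elements:", payload, "elements moved with padding:", moved)
```

The sketch shows the cost of this design choice: a fixed-size exchange moves more data than is strictly needed whenever block sizes are uneven, so any benefit depends on the fixed-size collective being sufficiently faster than the variable-size one on the target machine.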
The performance of the FFT on HECToR Phase 2b (XT6) was found to be disappointing beyond 24 cores (1 node) because of the large number of messages that must cross the shared network interface. However, the forthcoming installation of Cray's new Gemini interconnect in Q4 2010 is expected to address this limitation and bring performance more closely in line with the Phase 2a system (XT4).
When combined with the ongoing work at Sheffield to implement a full 2D domain decomposition, these changes should allow even greater scalability. Nevertheless, the modifications described here will on their own deliver real improvements to the performance and scalability of Quantum Espresso for HECToR users.