next up previous contents
Next: Castep Performance on HECToR Up: Computational Costs Previous: Orthogonalisation and Diagonalisation   Contents

Fourier Transforms

The Fourier Transforms are carried out using conventional Fast Fourier Transform (FFT) libraries, or a built-in FFT written by Clive Temperton according to his Generalised Prime Factor Algorithm (GPFA). Since the data are distributed over G-vectors, Castep transforms the data not as a single 3D FFT but as three 1D FFTs, with a transpose step in between each to reorder the data. Although the cost of the FFTs is only $N_GlnN_GN_bN_k$, this transpose requires all-to-all communications between all processor elements with the same k-points. The data and workload is naturally distributed over k-points, but as the G-vectors are distributed over more and more nodes these all-to-all communications start to dominate.

The length of the FFTs in Castep is typically small, and even for large simulation cells it will usually be O(100). However every band at every k-point requires three 1D FFTs every time the potential is applied or the density constructed so the total number of FFTs required in a simulation can be extremely large. The vast number of messages coupled with their short length means that in Castep the communication stages are usually latency-bound, rather than bandwidth-bound.

In recent years Castep's communication in the transpose stage has been optimised by Martin Plummer (STFC) and Keith Refson (RAL) to allow a two-phase communication pattern. In the first phase PEs within a node exchange data, and then in the second phase the all-to-all communication is performed over the nodes, not the individual PEs. For small numbers of nodes the overhead of performing two communication phases is prohibitive, but as the all-to-all communications start to dominate the improved scaling of this method more than makes up for the additional cost. This optimisation was of great benefit on HPCx, which has 16 cores per node, but is still expected to be useful on HECToR for large numbers of nodes.


next up previous contents
Next: Castep Performance on HECToR Up: Computational Costs Previous: Orthogonalisation and Diagonalisation   Contents
Sarfraz A Nadeem 2008-09-01