The length of the FFTs in Castep is typically small, and even for large simulation cells it will usually be O(100). However every band at every k-point requires three 1D FFTs every time the potential is applied or the density constructed so the total number of FFTs required in a simulation can be extremely large. The vast number of messages coupled with their short length means that in Castep the communication stages are usually latency-bound, rather than bandwidth-bound.
In recent years Castep's communication in the transpose stage has been optimised by Martin Plummer (STFC) and Keith Refson (RAL) to allow a two-phase communication pattern. In the first phase PEs within a node exchange data, and then in the second phase the all-to-all communication is performed over the nodes, not the individual PEs. For small numbers of nodes the overhead of performing two communication phases is prohibitive, but as the all-to-all communications start to dominate the improved scaling of this method more than makes up for the additional cost. This optimisation was of great benefit on HPCx, which has 16 cores per node, but is still expected to be useful on HECToR for large numbers of nodes.