As each column of the Fourier-space grid intersects every plane of the real-space grid, in principle each process must transfer some data to every other process between the two local FFT steps. In practice this is not always the case: if, for example, the number of processes is greater than the number of planes, some processes will receive (or, for the inverse FFT, send) no data. In general, each process may send and receive a different amount of data to and from every other process, since the number of planes or columns may not divide exactly by the number of processes. To illustrate this, consider the CNT40 test case on 16 cores. Here the coarse grid is 30x30x30 elements, so there are 30 planes of data and 900 columns, of which 657 happen to contain non-zero Fourier coefficients.
Dividing these between the 16 cores, we find that 14 processes hold 2 planes each and the remaining 2 hold 1, while 15 processes hold 41 columns each and the final one holds 42. Taking all combinations of domain sizes into account, four different message sizes are used for the Alltoallv: 84 elements (2 planes x 42 columns), 82 elements (2 x 41), 42 elements (1 x 42), and 41 elements (1 x 41). In this case each grid element is a double-precision complex number, i.e. 16 bytes per element.
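The arithmetic above can be reproduced with a short sketch. The block distribution rule (the first few processes receive the extra items) is an assumption consistent with the counts quoted in the text, not necessarily the exact rule used in the code:

```python
# Sketch of the CNT40 decomposition: 30 planes and 657 non-zero columns
# divided over 16 processes. The rule that the first (n mod nprocs)
# processes get one extra item is an assumption for illustration.

def block_sizes(n, nprocs):
    """Distribute n items over nprocs processes as evenly as possible."""
    base, extra = divmod(n, nprocs)
    return [base + 1 if p < extra else base for p in range(nprocs)]

nprocs = 16
planes = block_sizes(30, nprocs)    # 14 processes with 2 planes, 2 with 1
columns = block_sizes(657, nprocs)  # 1 process with 42 columns, 15 with 41

# A message from column-owner p to plane-owner q carries
# columns[p] * planes[q] grid elements (16 bytes each).
sizes = sorted({nc * npl for nc in columns for npl in planes}, reverse=True)
print(sizes)  # -> [84, 82, 42, 41]
```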
The transpose operation is implemented in the function fft_scatter, which packs the data from the grid into a send buffer as contiguous blocks, one per destination process, makes the MPI_Alltoallv call itself, and then performs the corresponding unpacking operation. The existing code already contained two alternative implementations of this function: the MPI_Alltoallv version described above, and a similar version in which the collective communication is replaced by a number of non-blocking point-to-point calls. The choice between them is controlled by preprocessor macros that can be defined in the make.sys file controlling the compilation.
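The pack/exchange/unpack pattern can be sketched in serial Python. The function names and the nested-list "grid" layout here are illustrative assumptions, not the actual Fortran data structures of fft_scatter, and the all-to-all exchange is simulated rather than calling MPI:

```python
# Serial sketch of the pack / all-to-all / unpack pattern. Names and the
# nested-list layout are illustrative, not the real fft_scatter structures.

def pack(local_columns, planes_per_proc):
    """Pack each local column into contiguous per-destination blocks:
    destination q gets the slice of every column lying in q's planes."""
    bufs, start = [], 0
    for nplanes in planes_per_proc:
        bufs.append([col[start:start + nplanes] for col in local_columns])
        start += nplanes
    return bufs

def alltoallv(send_bufs):
    """Simulate MPI_Alltoallv semantics in serial: send_bufs[p][q] is the
    (possibly differently sized) block process p sends to process q."""
    n = len(send_bufs)
    return [[send_bufs[p][q] for p in range(n)] for q in range(n)]

# Two processes, each owning two columns of a grid with 4 planes (2 per proc).
planes_per_proc = [2, 2]
grid = [
    [[0, 1, 2, 3], [4, 5, 6, 7]],       # columns held by process 0
    [[8, 9, 10, 11], [12, 13, 14, 15]], # columns held by process 1
]

send_bufs = [pack(cols, planes_per_proc) for cols in grid]
recv_bufs = alltoallv(send_bufs)

# Unpack: process 0 now holds its 2 planes of *every* column of the grid.
planes_on_proc0 = [col for block in recv_bufs[0] for col in block]
print(planes_on_proc0)  # -> [[0, 1], [4, 5], [8, 9], [12, 13]]
```

After the exchange each process holds a full set of planes rather than a subset of columns, which is exactly the data layout needed for the second local FFT step.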
Following the same scheme (using macros to conditionally compile each specific function), several alternative methods for performing the transpose were implemented; these are detailed below. For clarity, performance results are reported together in section 5.
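Selecting a variant then amounts to adding the corresponding macro to the preprocessor flags in make.sys, roughly as follows. The macro name below is a hypothetical placeholder, not the actual symbol defined in the code:

```make
# Hypothetical make.sys fragment: __FFT_SCATTER_VARIANT is a placeholder
# name, not the real preprocessor symbol used in the source.
DFLAGS = -D__MPI -D__FFT_SCATTER_VARIANT
```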