As part of the 3D FFT algorithm, the data on the plane wave grids has to be transposed so that the correct data is available on each process for the next 1D FFT step. This communication is accomplished using MPI_Alltoallv, which allows each process to send and receive differing amounts of data. In practice, however, the variation between processes is very small, typically only 1 extra row of the grid. For example, on a 125 grid divided over 8 processes, 3 of the processes would have 15 planes and 5 would have 16. Micro-benchmarking using the Intel MPI Benchmarks showed that MPI_Alltoall performs better than MPI_Alltoallv on HECToR for the same volume of data transferred (see figure 3). Typical message sizes for the transposes range from 256KB up to several MB, so we would expect a speedup of around 20-30% if we were able to use Alltoall instead of Alltoallv.
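The uneven split in the example above can be reproduced with a short calculation. The following is a minimal Python sketch, not CP2K code: the function name `planes_per_process` is hypothetical, and the choice of which ranks receive the extra plane (here the lowest-numbered ones) is an assumption, since the text only states how many processes hold 15 or 16 planes.

```python
def planes_per_process(n_planes, n_procs):
    """Split n_planes over n_procs as evenly as possible (block distribution)."""
    base, extra = divmod(n_planes, n_procs)
    # Assumption: the first `extra` ranks each take one additional plane.
    return [base + 1 if rank < extra else base for rank in range(n_procs)]


counts = planes_per_process(125, 8)

# Largest difference in plane count between any two processes.
imbalance = max(counts) - min(counts)
```

For 125 planes on 8 processes the counts differ by at most one plane, which is why the per-process message sizes passed to MPI_Alltoallv are nearly uniform.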
To allow this, the buffer packing and unpacking steps before and after the transpose were modified to add padding to the data, so that each process would send the same amount of data. The padding would then be discarded when the buffer was unpacked at the receiver. In the above case, the 3 processes with only 15 planes would have 1 extra plane's worth of padding added. Since the fraction of additional data sent as padding was smaller than the performance gain of using MPI_Alltoall, a speedup was expected.
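The pack, exchange and unpack steps described above can be sketched as follows. This is an illustrative pure-Python simulation under stated assumptions, not the CP2K implementation: all function names are hypothetical, `simulated_alltoall` stands in for MPI_Alltoall by copying fixed-size blocks between in-memory buffers, each block's true length is taken to be the sender's plane count, and the overhead estimate assumes the volume a process sends scales with the number of planes it holds.

```python
def pack_padded(blocks, block_len, pad=None):
    """Concatenate per-destination blocks, padding each one to block_len."""
    buf = []
    for block in blocks:
        buf.extend(block)
        buf.extend([pad] * (block_len - len(block)))
    return buf


def unpack_padded(buf, true_lens, block_len):
    """Split a received buffer into per-source blocks, discarding the padding."""
    return [buf[i * block_len:i * block_len + n] for i, n in enumerate(true_lens)]


def simulated_alltoall(send_bufs, block_len):
    """Stand-in for MPI_Alltoall: rank dst receives block dst from every sender."""
    n = len(send_bufs)
    return [
        [x for src in range(n)
         for x in send_bufs[src][dst * block_len:(dst + 1) * block_len]]
        for dst in range(n)
    ]


def padding_fraction(counts):
    """Extra volume sent as padding, relative to the unpadded volume."""
    return (max(counts) * len(counts) - sum(counts)) / sum(counts)


# Small example: 5 planes over 3 processes, so one process is a plane short.
counts = [2, 2, 1]
n = len(counts)
block_len = max(counts)  # every block is padded up to the largest size

# send_blocks[src][dst] holds tagged items so that routing can be checked.
send_blocks = [[[(src, dst, k) for k in range(counts[src])] for dst in range(n)]
               for src in range(n)]
send_bufs = [pack_padded(send_blocks[src], block_len) for src in range(n)]
recv_bufs = simulated_alltoall(send_bufs, block_len)

# The receiver knows each sender's true block length and drops the rest.
received = [unpack_padded(recv_bufs[dst], counts, block_len) for dst in range(n)]

# Overhead for the 125-plane, 8-process case discussed in the text.
overhead = padding_fraction([16] * 5 + [15] * 3)
```

Under these assumptions the padded exchange carries 128 planes' worth of data instead of 125, an overhead of about 2.4%, which is small compared with the 20-30% gain suggested by the micro-benchmarks. The cost is that both sender and receiver must track the true block lengths alongside the padded ones.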
Initial results obtained using the FFT libtest were encouraging, with a speedup of 43% shown for a 125 grid on 256 processors. However, this result was not replicated when a full benchmark (e.g. bench_64) was run, which showed only a 2% improvement in the FFT routines. Further investigation suggested that this was the result of poor synchronisation between processes: the performance gain from using Alltoall only occurs when all processes in the communicator are well synchronised, as they are in the libtest or the Intel benchmark.
It was decided that this change should not be included in CP2K, as the extra complexity of the 'book-keeping' code needed to correctly pack and unpack the buffers was too great to justify such a small performance gain.