The second major work item in the project was to profile and optimise CP2K's Fourier Transform routines (steps III and IV in figure 1). CP2K has an FFT library module that provides a consistent interface to a number of popular FFT libraries, including FFTW 2 and 3, ESSL from IBM, ACML from AMD, and a CUDA implementation for GPUs. There is also an in-built FFT library based on work by Goedecker et al. These libraries provide serial 1D and 3D FFTs. As the planewave grids are distributed, CP2K requires a parallel 3D FFT, which involves three sets of 1D FFTs (one along each dimension), with global transpositions of the data using MPI_Alltoallv between the FFTs. For small numbers of processors (fewer than the number of planes in the grids), CP2K uses a `plane', `slice' or `slab' decomposition for the grids, allowing the FFTs to be performed with only a single transpose step. For larger numbers of processors, a `pencil' or `ray' decomposition is used, which requires two transpose steps. This is more expensive, but can scale to larger numbers of processors than the plane decomposition.
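The plane decomposition described above can be illustrated with a small serial sketch. It is not CP2K's code: the function names (`dft1d`, `slab_dft3d`, `dft3d_reference`) are hypothetical, a naive 1D DFT stands in for the serial FFT library call, and an in-memory reshuffle stands in for the MPI_Alltoallv transpose. It assumes the grid dimensions divide evenly among the simulated ranks.

```python
import cmath

def dft1d(v):
    """Naive O(n^2) 1D DFT; a stand-in for a serial FFT library call."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft3d_reference(a):
    """Direct triple-sum 3D DFT of a[z][y][x], used only to check the result."""
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    return [[[sum(a[z][y][x] * cmath.exp(-2j * cmath.pi *
                                         (kz * z / nz + ky * y / ny + kx * x / nx))
                  for z in range(nz) for y in range(ny) for x in range(nx))
              for kx in range(nx)] for ky in range(ny)] for kz in range(nz)]

def slab_dft3d(a, nranks):
    """3D DFT with a simulated slab decomposition over `nranks` ranks."""
    nz, ny, nx = len(a), len(a[0]), len(a[0][0])
    assert nz % nranks == 0 and ny % nranks == 0, "assumes an even split"
    zs, ys = nz // nranks, ny // nranks

    # Each simulated rank owns zs complete xy-planes.
    slabs = [[[row[:] for row in a[z]] for z in range(r * zs, (r + 1) * zs)]
             for r in range(nranks)]

    # Local work: 1D transforms along x, then along y, within each plane.
    for slab in slabs:
        for plane in slab:
            for y in range(ny):
                plane[y] = dft1d(plane[y])
            for x in range(nx):
                col = dft1d([plane[y][x] for y in range(ny)])
                for y in range(ny):
                    plane[y][x] = col[y]

    # Single global transpose (stand-in for MPI_Alltoallv): afterwards
    # rank r holds complete z-columns for its own range of y values.
    cols = [[[[0j] * nz for _ in range(nx)] for _ in range(ys)]
            for _ in range(nranks)]
    for sender in range(nranks):
        for lz, plane in enumerate(slabs[sender]):
            z = sender * zs + lz
            for y in range(ny):
                receiver, ly = divmod(y, ys)
                for x in range(nx):
                    cols[receiver][ly][x][z] = plane[y][x]

    # Local work: 1D transforms along z; gather into one array for checking.
    out = [[[0j] * nx for _ in range(ny)] for _ in range(nz)]
    for r in range(nranks):
        for ly in range(ys):
            for x in range(nx):
                col = dft1d(cols[r][ly][x])
                for z in range(nz):
                    out[z][r * ys + ly][x] = col[z]
    return out

# Small 4x4x4 test grid split over 2 simulated ranks.
grid = [[[complex(x + 2 * y + 3 * z, x * y - z) for x in range(4)]
         for y in range(4)] for z in range(4)]
result = slab_dft3d(grid, nranks=2)
reference = dft3d_reference(grid)
max_err = max(abs(result[z][y][x] - reference[z][y][x])
              for z in range(4) for y in range(4) for x in range(4))
```

In this sketch the x and y transforms need no communication because each rank holds whole planes, so only one transpose is required before the z transforms. A pencil decomposition would split the planes as well, so a rank would hold complete lines along only one dimension at a time and a second transpose would be needed between the x and y transforms.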