Caching to amortize costs

Initially, CrayPAT was used to profile the FFT routines, using the PW_TRANSFER libtest as a micro-benchmark.

 Time % |      Time |Imb. Time |   Imb. |    Calls |Group
        |           |          | Time % |          | Function
        |           |          |        |          |  PE.Thread='HIDE'

 100.0% | 19.588726 |       -- |     -- | 126389.0 |Total
|---------------------------------------------------------------------
|  62.8% | 12.298019 |       -- |     -- | 120362.0 |MPI
||--------------------------------------------------------------------
||  37.1% |  7.270134 | 0.741629 |   9.3% |   4000.0 |mpi_cart_sub_
||  24.4% |  4.782975 | 1.257500 |  20.9% |   4000.0 |mpi_alltoallv_
||   0.7% |  0.144511 | 0.006960 |   4.6% |   2002.0 |mpi_barrier_
||   0.2% |  0.034614 | 0.003197 |   8.5% |  24065.0 |mpi_wtime_
||   0.1% |  0.025250 | 0.002017 |   7.4% |  70001.0 |mpi_cart_rank_
||   0.1% |  0.014001 | 0.001163 |   7.7% |   4002.0 |mpi_comm_free_
||   0.0% |  0.008200 | 0.001827 |  18.3% |   6002.0 |mpi_cart_get_
||   0.0% |  0.007483 | 0.001781 |  19.3% |   6005.0 |mpi_comm_size_
...

The most obvious item to address is the large number of calls to MPI_Cart_sub. This routine partitions the 3D Cartesian communicator containing all the MPI tasks into multiple sub-communicators, which are used to transpose the data in the FFT grids, and it is called once at every transpose step. The operation is collective and blocking, so it causes unnecessary synchronisation between all processes. However, in CP2K the grid layout and the mapping of the grid to MPI tasks remain constant throughout the simulation, so the sub-communicators are identical every time an FFT is performed.

CP2K already provides a data structure, fft_scratch, which is used to cache data relating to the FFT (data buffers, coordinates etc.), and this was ideal for storing and reusing the MPI communicator handles. As a result it was possible to reduce the number of calls to MPI_Cart_sub from 11722 (for a 50 MD step run) to 5. In addition to MPI_Cart_sub, a number of related MPI operations, including rank and coordinate calculations, could also be moved into the FFT scratch cache. These changes gave a further speedup of 12% at 512 cores for the bench_64 test (see Table 4). Moreover, since the calls above were moved from the FFT loop into a one-off initialisation step, the performance gains would increase for longer runs.
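The caching pattern itself is simple. The sketch below illustrates it in C with MPI, under the assumption of a hypothetical fft_scratch_t structure and get_sub_comms helper (CP2K's actual fft_scratch is a Fortran derived type with different contents): the collective MPI_Cart_sub calls are issued only on the first pass through the transpose loop, and every later pass reuses the stored communicator handles.

    /* Minimal sketch, not CP2K's actual code: cache the sub-communicators
     * produced by MPI_Cart_sub so the collective call runs once rather
     * than at every transpose. Struct and helper names are hypothetical. */
    #include <mpi.h>

    typedef struct {
        int      cached;    /* 0 until the sub-communicators are created */
        MPI_Comm row_comm;  /* communicator along the first grid dimension  */
        MPI_Comm col_comm;  /* communicator along the second grid dimension */
    } fft_scratch_t;

    /* Return row/col sub-communicators, creating them only on first use. */
    static void get_sub_comms(MPI_Comm cart, fft_scratch_t *scratch,
                              MPI_Comm *row, MPI_Comm *col)
    {
        if (!scratch->cached) {
            int keep_row[2] = {1, 0};   /* keep dim 0, drop dim 1 */
            int keep_col[2] = {0, 1};   /* keep dim 1, drop dim 0 */
            MPI_Cart_sub(cart, keep_row, &scratch->row_comm);  /* collective */
            MPI_Cart_sub(cart, keep_col, &scratch->col_comm);  /* collective */
            scratch->cached = 1;
        }
        *row = scratch->row_comm;       /* cheap local lookup from here on */
        *col = scratch->col_comm;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Build a 2D process grid; MPI chooses the factorisation. */
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Dims_create(nranks, 2, dims);
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        fft_scratch_t scratch = { .cached = 0 };

        /* Stand-in for the transpose loop: only the first iteration pays
         * for the collective MPI_Cart_sub calls. */
        for (int step = 0; step < 1000; ++step) {
            MPI_Comm row, col;
            get_sub_comms(cart, &scratch, &row, &col);
            /* ... MPI_Alltoallv over row/col would go here ... */
        }

        MPI_Comm_free(&scratch.row_comm);
        MPI_Comm_free(&scratch.col_comm);
        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }

Because the communicators are freed only once, at teardown, the matching calls to mpi_comm_free_ also drop out of the per-transpose cost, which is consistent with their appearance in the profile above.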


Table 4: Comparison of bench_64 runtime before and after the FFT caching optimisation

 Cores       |   64 |  128 |  256 |  512
 Before (s)  |  366 |  264 |  191 |  238
 After (s)   |  363 |  250 |  177 |  213
 Speedup (%) |    1 |    6 |    8 |   12