Halo Swapping

The halo swap step can be quite expensive, particularly as the number of MPI tasks is increased. For a fixed grid size, the amount of local grid space decreases as the grid is divided between more processes. The halo width, however, is fixed by the maximally-sized Gaussian function, so the halos grow proportionally larger compared to the local domain as more processors are added. For example, a 125$^3$ grid on 512 processors has a local domain size of 16x16x16 and a halo width of 18; as a result, the halos make up 97% of the total grid data.
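The 97% figure can be checked with a quick calculation. The following sketch (plain Python, using the sizes quoted above) computes the fraction of each task's allocated grid that is halo data:

```python
# Local domain and halo sizes for a 125^3 grid on 512 (= 8^3) MPI tasks,
# as quoted above.
local_side = 16   # points per side of the local domain
halo_width = 18   # halo width, set by the largest Gaussian

# Each task allocates (local + 2 halos)^3 points in total.
total_side = local_side + 2 * halo_width          # 52
total_points = total_side ** 3                    # 140608
local_points = local_side ** 3                    # 4096

halo_fraction = 1 - local_points / total_points
print(f"halo fraction: {halo_fraction:.1%}")      # -> 97.1%
```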

Figure 2: Realspace grid for a single task, showing the local and halo (shaded) parts of the grid (2D representation)

In the original algorithm, the halos are swapped first with neighbours in the X direction, then the Y direction, then Z; see Figure 2 for a 2-dimensional example. In each case the entire halo was sent (e.g. the region $x_2 < x < x_3, y_0 < y < y_3$, which is passed to the right). The observation was made that since only the data in the local grid is required after the halo swap is complete, it is not necessary to send the entire width of the halo in each direction: any data which will end up in a halo can be discarded before it is sent. For example, in 2D, if the swap in the X direction is done first, then when the swap in the Y direction is performed only the region $x_1 < x < x_2$ needs to be sent. In 3 dimensions this gives substantial savings, shown in Table 2 below, both in the amount of data sent and in the time spent packing buffers.

Table 2: 60 iterations of the rs2pw libtest on 512 cores, before and after optimisation
                             Before    After
  Avg. message size (bytes)  194688    91008
  Time in SendRecv (s)        0.468    0.22
  Time packing X bufs (s)     0.107    0.002
  Time unpacking X bufs (s)   0.189    0.003
  Time packing Y bufs (s)     0.060    0.005
  Time unpacking Y bufs (s)   0.096    0.017
  Time packing Z bufs (s)     0.054    0.054
  Time unpacking Z bufs (s)   0.091    0.091
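The average message sizes in Table 2 can be reproduced from the grid dimensions given earlier. The sketch below assumes 4-byte grid elements (an assumption inferred from the quoted sizes, not stated in the text):

```python
n, h = 16, 18            # local domain side and halo width (from above)
t = n + 2 * h            # full allocated side: 52
elem = 4                 # assumed element size in bytes

# Before: halos swapped in the order X, Y, Z, always sending the full
# halo slab (width h, full extent in the other two dimensions).
before = [h * t * t * elem] * 3

# After: swap order Z, Y, X, trimming away any data that would only
# land in a halo of the receiving task.
after = [
    t * t * h * elem,    # Z swap: nothing trimmed yet
    t * h * n * elem,    # Y swap: Z extent already reduced to n
    h * n * n * elem,    # X swap: Y and Z extents both reduced to n
]

print(sum(before) // 3, sum(after) // 3)   # -> 194688 91008
```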

This data also shows that packing and unpacking the X buffers is more expensive than the Y buffers, which are in turn more expensive than the Z buffers. This is because the halo data is contiguous in memory when swapping in the Z direction, but increasingly fragmented in the Y and X directions. By applying the optimisation above and performing the halo swaps in the order Z, Y, X, a speedup of about 100% (roughly a factor of two) is obtained for this routine.
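The contiguity argument can be illustrated with a small array experiment. Assuming the grids are stored in Fortran (column-major) order, as is natural in CP2K's Fortran code, a Z-direction halo slab is a single contiguous block of memory while an X-direction slab is heavily strided. This sketch uses NumPy purely to inspect the memory layout:

```python
import numpy as np

n, h = 16, 18
t = n + 2 * h                                # 52

# A toy realspace grid in Fortran (column-major) order, so the first
# index varies fastest in memory -- as in a Fortran array.
grid = np.zeros((t, t, t), order="F")

z_slab = grid[:, :, :h]    # halo slab sent in a Z-direction swap
x_slab = grid[:h, :, :]    # halo slab sent in an X-direction swap

# The Z slab is one contiguous run of memory and can be packed in a
# single copy; the X slab is split into t*t short strided pieces, which
# is why packing the X buffers costs the most.
print(z_slab.flags["F_CONTIGUOUS"])   # -> True
print(x_slab.flags["F_CONTIGUOUS"])   # -> False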

Following the observation that swapping halos in the Z direction is cheapest, the heuristic for domain decomposition was modified. Originally, given a certain number of processors, the grid would be decomposed so as to minimise the total size of the halos relative to the local grids, which in practice means keeping the domains as close to cubic as possible. By adding small weightings to the heuristic calculation, CP2K will now favour decompositions which minimise the number of cuts in the Z direction, maximising the halo size where swapping is cheapest. For example, with 128 MPI tasks CP2K will now decompose the grid into 8x4x4 blocks, whereas previously 4x8x4 or 4x4x8 would have been equally likely.
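The modified heuristic can be sketched as follows. The exact weightings used in CP2K are not given here, so the values below (a surface-area cost plus small tie-breaking penalties on Z cuts, then Y cuts) are purely illustrative:

```python
def decompose(ntasks):
    """Pick a (px, py, pz) process grid for ntasks MPI tasks.

    Base cost ~ halo surface per unit volume (1/px + 1/py + 1/pz),
    which favours near-cubic domains.  Small penalties on pz and py
    (illustrative values, not CP2K's actual weightings) break ties in
    favour of fewer cuts in Z, then fewer cuts in Y.
    """
    best, best_cost = None, float("inf")
    for px in range(1, ntasks + 1):
        if ntasks % px:
            continue
        rem = ntasks // px
        for py in range(1, rem + 1):
            if rem % py:
                continue
            pz = rem // py
            cost = 1 / px + 1 / py + 1 / pz + 1e-3 * pz + 1e-6 * py
            if cost < best_cost:
                best, best_cost = (px, py, pz), cost
    return best

print(decompose(128))   # -> (8, 4, 4), not (4, 8, 4) or (4, 4, 8)
```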

Since the buffer packing is relatively expensive, an approach using MPI derived datatypes was also attempted. However, it was found to provide no benefit, mainly because when the halo data is recombined into the realspace grids it is summed, rather than simply being copied into place, so a temporary receive buffer is still required, which negates the main advantage of derived types.
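The sum-on-receive requirement can be shown with a toy 2D example (illustrative sizes only, using NumPy to stand in for the grid arrays). Receiving straight into a derived-type view of the grid would overwrite the existing values, so a separate buffer plus an accumulation pass is unavoidable:

```python
import numpy as np

hw, n = 2, 4                      # hypothetical halo width and local side
local = np.ones((n, n))           # toy local realspace grid
recv = np.full((hw, n), 0.5)      # halo contribution from a neighbour

# The received halo data must be *summed* into the local grid, because
# Gaussians mapped by several tasks can overlap the same grid points.
# An MPI_Recv directly into a derived-type view of `local` would
# overwrite these points instead of accumulating them.
local[0:hw, :] += recv
```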