Final Benchmarks

This section shows the overall performance gains from the dCSE project (figures 7 and 8). To show performance more clearly, the reciprocal of the runtime is plotted, so higher values are better. For bench_64, a speedup of 30% is achieved on 256 cores. An even greater speedup of 300% on 1024 cores is shown for W216, demonstrating the effect of these optimisations on larger problems (and inhomogeneous systems in particular). This exceeds the original aims set out in the project proposal, which were:

"we expect the performance gain on 64-256 processors to be around 10-15%. Far more significantly, at the capability end 512-1024+ processor jobs are expected to increase in performance by around 40-50%."
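The relationship between the plotted quantity (reciprocal runtime) and the quoted percentage speedups can be sketched as below. This is a minimal illustration that assumes "a speedup of X%" means the reciprocal runtime increased by X% relative to the original code; the report does not state the definition explicitly. The function names are hypothetical, and the runtimes are made-up values chosen only to reproduce round percentages, not measurements from these benchmarks.

```python
def throughput(runtime_s):
    """Reciprocal of the runtime: the quantity plotted, higher is better."""
    return 1.0 / runtime_s


def speedup_percent(runtime_before_s, runtime_after_s):
    """Percentage increase in reciprocal runtime after optimisation.

    Assumed definition: (1/t_after) / (1/t_before) - 1, expressed in percent.
    """
    ratio = throughput(runtime_after_s) / throughput(runtime_before_s)
    return (ratio - 1.0) * 100.0


# Hypothetical runtimes in seconds (NOT measured values from the report).
small_gain = speedup_percent(130.0, 100.0)  # original 130 s, optimised 100 s
large_gain = speedup_percent(400.0, 100.0)  # original 400 s, optimised 100 s

print(f"{small_gain:.0f}% and {large_gain:.0f}%")
```

Under this definition, a 300% speedup corresponds to the optimised code running in a quarter of the original time, which is why plotting reciprocal runtime makes the gain easier to read than plotting runtime directly.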

It should be noted that, in addition to the work performed within the dCSE project, other work was undertaken by the CP2K development group which also affects the W216 benchmarks. In particular, the load balancing was modified to allow heavily loaded processes to shift some work to neighbouring processes, provided that the neighbours' realspace grid halos still contain the entirety of the Gaussian to be mapped. This has the effect of reducing the highest peaks of load (see figure 6). However, the FFT and halo swap optimisations also have a significant effect, especially on larger numbers of cores. Because development proceeded concurrently, and because W216 is relatively expensive to run, these changes were not benchmarked at each step in development.
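The idea of shifting work from heavily loaded processes to their neighbours can be illustrated with a toy model. The sketch below is not CP2K's load balancer: it assumes a one-dimensional line of processes, represents the halo condition as a precomputed set of neighbour offsets per task (`allowed_offsets`, a hypothetical name), and uses a simple greedy rule that moves one task at a time from the most loaded process. All names and costs are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Task:
    cost: int
    # Offsets (-1 or +1) of neighbours whose halo fully contains this Gaussian.
    # In this toy model the halo test is assumed to be precomputed.
    allowed_offsets: frozenset = frozenset()


def rebalance(tasks_per_process):
    """Greedily move tasks off the most loaded process until no move helps."""
    procs = [list(tasks) for tasks in tasks_per_process]
    n = len(procs)
    while True:
        loads = [sum(t.cost for t in p) for p in procs]
        src = max(range(n), key=loads.__getitem__)
        best = None  # (peak load of the pair after the move, task index, dst)
        for i, task in enumerate(procs[src]):
            for offset in sorted(task.allowed_offsets):
                dst = src + offset
                if not 0 <= dst < n:
                    continue
                new_dst_load = loads[dst] + task.cost
                if new_dst_load >= loads[src]:
                    continue  # the move would not lower the peak
                peak = max(loads[src] - task.cost, new_dst_load)
                if best is None or peak < best[0]:
                    best = (peak, i, dst)
        if best is None:
            return procs
        _, i, dst = best
        moved = procs[src].pop(i)
        # A moved task is not shifted again, which guarantees termination.
        procs[dst].append(Task(moved.cost))


# Illustrative costs only: the middle process starts as a load peak.
initial = [
    [Task(1)],
    [Task(4, frozenset({-1, 1})), Task(3, frozenset({1})), Task(3)],
    [Task(2)],
]
balanced = rebalance(initial)
loads_before = [sum(t.cost for t in p) for p in initial]
loads_after = [sum(t.cost for t in p) for p in balanced]
print(loads_before, "->", loads_after)
```

In this example the peak load falls from 10 to 5 while the total work is unchanged. Tasks whose Gaussian is not covered by any neighbour's halo (empty `allowed_offsets`) stay where they are, which is why the peaks are reduced rather than eliminated.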

**Figure 7:** Overall performance gains on bench_64
$\includegraphics[width=13cm]{images/scaling_bench_64.ps}$

**Figure 8:** Overall performance gains on W216
$\includegraphics[width=13cm]{images/scaling_W216.ps}$