To transform CP2K effectively from pure MPI to mixed-mode MPI/OpenMP, we took the approach of applying OpenMP selectively to those routines known to dominate the runtime for many types of jobs - in particular the FFT, Realspace to Planewave transfer, and Collocate and Integrate routines optimised in the earlier dCSE project. In addition, the dense matrix algebra would be targeted using the threaded BLAS libraries available on HECToR (Cray LibSci), and the sparse matrix algebra via the new DBCSR library (Section 1.4). This approach allowed effort to be concentrated in the areas that would yield the greatest benefit.
A mixed-mode code would be expected to scale better than pure MPI for several reasons. Firstly, when running on the same total number of cores, the number of MPI processes can be reduced while still harnessing all the cores using OpenMP threads. This reduces the impact of algorithms which scale poorly with the number of processes, for example the MPI_Alltoallv collective operation used in the FFT. Secondly, as HPC systems become increasingly multi-core (HECToR, for example, has been upgraded from a 2-way node, to 4-way, then 24-way since its installation in 2007), a fully populated node using only MPI greatly increases contention for access to the network. This effect was particularly evident on HECToR Phase 2b, and in this case we assert that a hybrid programming model fits the architecture more closely, with resulting performance benefits.
Of course, while some gain comes from reduced time in communication, this is offset by the need for an efficient OpenMP implementation of the computational parts of the code: several cores working on shared data using OpenMP threads should take around the same time as the same cores processing the same amount of distributed data as independent MPI processes. As we shall see, this is not always straightforward, although good performance has ultimately been obtained.
When combining OpenMP and MPI, a certain level of thread safety is required from the MPI implementation. In our case we have adopted a very simple scheme in which all MPI calls are made outside of OpenMP parallel regions. This corresponds to the MPI_THREAD_FUNNELED model, in which MPI calls are guaranteed to be made only by the master thread.