The HECToR Service is now closed and has been superseded by ARCHER.

Improving the scalability of CP2K on multi-core systems

This Distributed Computational Science and Engineering (dCSE) project implemented mixed-mode (MPI plus OpenMP) parallelism in the Density Functional Theory code CP2K, building on the results of an earlier successful dCSE project.

  • The overall performance gains from this project include improved scalability: a small benchmark now scales to up to 8 times as many cores, and a larger, inhomogeneous benchmark was shown to scale to over 9,000 cores. An increase in peak performance of up to 60% was also realised on HECToR Phase 2b. In addition, the performance of the code was studied on three generations of Cray systems (XT4, XT5 and XT6) and with four different compilers.
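For context, "mixed-mode" here means combining MPI processes (one or a few per node) with OpenMP threads inside each process, so that the cores of a node share work through threads rather than through extra MPI ranks. The following is a minimal, illustrative C sketch of that general pattern; it is not CP2K source (CP2K itself is written in Fortran), and the loop is a placeholder for a real computational kernel.

    /* Minimal sketch of the mixed-mode (MPI + OpenMP) pattern: illustrative
     * C only, not CP2K code. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Request FUNNELED support: only the master thread makes MPI calls,
         * while the OpenMP threads share the node-local computation. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local_sum = 0.0, global_sum = 0.0;

        /* Each MPI rank spreads its share of the work over OpenMP threads. */
        #pragma omp parallel for reduction(+ : local_sum)
        for (int i = rank; i < 1000000; i += nranks)
            local_sum += 1.0 / (1.0 + (double)i);

        /* Communication happens between MPI ranks only, funnelled through the
         * master thread, so fewer ranks per node means fewer messages on the
         * interconnect. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f (ranks = %d, threads per rank = %d)\n",
                   global_sum, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }

Built against a thread-aware MPI with OpenMP enabled (for example via an mpicc wrapper with -fopenmp on a GCC toolchain), such an executable is launched with fewer MPI ranks per node and OMP_NUM_THREADS set to the desired thread count.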

The latest versions of the main compilers were evaluated for performance with CP2K.

  • The H2O-64 benchmark, running on 72 cores (6 nodes) of the Cray XT5 'Monte Rosa' at CSCS in Switzerland, was used. Less than 30% of the runtime is spent in communication, so the performance of the compiled code is strongly dependent on the compiler's ability to generate a well-optimised binary.
  • The PGI compiler gives a reasonably well-performing executable, and the PathScale compiler is fairly robust for compiling the MPI-only code. The Cray Fortran compiler has been able to compile CP2K successfully since version 7.2.4, but the resulting performance is much poorer (35% slower than gfortran).
  • Gfortran is now the compiler of choice for CP2K: it is well tested by the developer and user community, and now gives performance on a par with, or exceeding, that of the commercial compilers tested. Furthermore, it was the only compiler capable of producing a working mixed-mode executable.

The new OpenMP implementation was benchmarked extensively on three Cray supercomputers: Monte Rosa (each compute node with two 2.4 GHz hexa-core 'Istanbul' AMD Opteron processors and 16 GB of main memory), HECToR Phase 2a (each compute node with one 2.3 GHz quad-core 'Barcelona' AMD Opteron processor and 8 GB of main memory) and HECToR Phase 2b (each compute node with two 12-core 2.1 GHz 'Magny-Cours' AMD Opteron processors and 32 GB of main memory).

  • As the number of cores in a node increases (4 in HECToR 2a, 12 in Rosa, 24 in HECToR 2b), the scalability of the code decreases, with the maximum performance of the pure MPI code being achieved on 256, 144, and 144 cores respectively.
  • The performance of the MPI-only code has improved by 40-70% at around 1,000 cores. While this does not allow the MPI code to scale any further, it will improve performance at higher core counts for larger problems.
  • Using threads does help to improve the scalability of the code. Suitable thread counts are between 2 and 6 (the number of cores in a single processor), depending on the balance between performance at low core counts and the desired scalability. Mixed-mode OpenMP increased the overall peak performance of the code by about 30% on HECToR Phase 2a and Rosa, and by 60% on HECToR Phase 2b, because it dramatically reduces the number of messages that must pass through each SeaStar network chip (illustrated in the sketch below).
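To see why fewer, larger MPI ranks reduce the load on each SeaStar, consider an all-to-all style exchange among all ranks (a pattern typical of parallel 3D FFTs). With P ranks in total and R ranks on each node, every node pushes roughly R × (P − R) messages off-node per exchange. The short C sketch below works through this arithmetic; the job size, node size and communication pattern are assumptions chosen for illustration, not measurements from the project.

    /* Toy estimate of off-node messages per node for one all-to-all style
     * exchange, comparing pure MPI with mixed-mode layouts.  The job and
     * node sizes are assumptions for illustration only. */
    #include <stdio.h>

    /* Each of the R ranks on a node sends one message to every one of the
     * (P - R) ranks on other nodes. */
    static long messages_per_node(long total_ranks, long ranks_per_node)
    {
        return ranks_per_node * (total_ranks - ranks_per_node);
    }

    int main(void)
    {
        const long cores = 1152;          /* assumed job size            */
        const long cores_per_node = 24;   /* e.g. a HECToR Phase 2b node */
        const long thread_counts[] = {1, 2, 4, 6};

        for (int i = 0; i < 4; ++i) {
            long threads = thread_counts[i];
            long ranks = cores / threads;                 /* total MPI ranks */
            long ranks_per_node = cores_per_node / threads;
            printf("%ld thread(s) per rank: %ld ranks, %ld off-node messages per node\n",
                   threads, ranks, messages_per_node(ranks, ranks_per_node));
        }
        return 0;
    }

In this toy model, moving from 1 to 6 threads per rank cuts the per-node message count by roughly a factor of 36 (6 squared fewer communicating rank pairs), which is the effect referred to in the bullet above.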

Please see the PDF or HTML version of the report which summarises this project.