During the dCSE project, further data on CP2K's performance was gathered from CrayPAT profiles and from CP2K's own timing routines. This highlighted several other regions of the code that were not addressed in this project but remain barriers to improved scalability:
- Dense matrix algebra (ScaLAPACK). CP2K makes use of a number of ScaLAPACK routines, principally PDGEMM (matrix multiplication) and PDSYEVD (solving an eigenvalue problem). These have been measured to have a parallel efficiency of around 20-30% on 2048 cores relative to 512 cores, the worst of all the major regions of the code. Currently these routines scale up to the point where there is about one atom per core. At this point, CrayPAT profiles show that communication costs dominate over computation.
- Sparse matrix algebra. Several of the important matrices in CP2K (Kohn-Sham, overlap, density matrices) are sparse. CP2K has its own routines for multiplication of sparse and full matrices, but these also scale poorly, in the same way as the dense matrix routines above. Work is already underway by the CP2K development group at the University of Zurich to rewrite these routines for better scalability and performance.
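To make the efficiency figure quoted for the ScaLAPACK routines concrete, the following is a minimal sketch of how a relative parallel efficiency can be computed from two timings. The function name and the timings are hypothetical and chosen purely for illustration; they are not measurements from this project.

```python
def relative_efficiency(t_base, p_base, t_scaled, p_scaled):
    """Parallel efficiency of a scaled run relative to a base run.

    t_base, t_scaled: wall-clock times on p_base and p_scaled cores.
    Returns achieved speedup divided by the ideal (linear) speedup.
    """
    speedup = t_base / t_scaled
    ideal_speedup = p_scaled / p_base
    return speedup / ideal_speedup


# Hypothetical timings (NOT measured data): the routine takes the same
# time on 2048 cores as it did on 512 cores.
t_512 = 100.0
t_2048 = 100.0

eff = relative_efficiency(t_512, 512, t_2048, 2048)
print(f"relative efficiency: {eff:.0%}")
```

Under this definition, an efficiency of 25% when going from 512 to 2048 cores corresponds to no speedup at all, so the 20-30% range quoted above implies a speedup of only about 0.8x to 1.2x for four times as many cores.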
It is believed that the scalability of the code can be further improved by introducing OpenMP 'hybrid' parallelism within a multi-core node while retaining the existing MPI communication between nodes. For a fixed number of cores, this reduces the number of MPI processes and hence the number of off-node messages to be sent, and is a better 'fit' to the increasingly wide SMP nodes currently available on the Cray XT series. A proposal for further dCSE funding to implement this has been accepted.
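The effect on message counts can be illustrated with a simple counting model. This is a sketch under stated assumptions, not a model of CP2K's actual communication: it assumes an all-to-all exchange in which every MPI rank sends one message to every other rank, and the core counts and node width are hypothetical.

```python
def off_node_messages(cores, cores_per_node, threads_per_rank):
    """Count off-node messages in one all-to-all exchange.

    Assumption: each MPI rank sends exactly one message to every other
    rank, and only messages between ranks on different nodes are counted.
    threads_per_rank = 1 corresponds to pure MPI.
    """
    if cores % cores_per_node or cores_per_node % threads_per_rank:
        raise ValueError("core counts must divide evenly")
    ranks = cores // threads_per_rank
    ranks_per_node = cores_per_node // threads_per_rank
    # Each rank sends to every rank that is not on its own node.
    return ranks * (ranks - ranks_per_node)


# Hypothetical machine: 2048 cores in 8-core nodes.
pure_mpi = off_node_messages(2048, 8, threads_per_rank=1)
hybrid = off_node_messages(2048, 8, threads_per_rank=8)
print(pure_mpi, hybrid, pure_mpi / hybrid)
```

In this model, running one MPI rank with eight OpenMP threads per node sends 64 times fewer off-node messages than pure MPI on the same cores. The messages are correspondingly larger, so the saving is in per-message overhead and latency rather than in the total volume of data moved.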