## CP2K - Sparse Linear Algebra on 1000s of cores

This was the third Distributed Computational Science and Engineering (dCSE) project for the Density Functional Theory code CP2K. Following on from the results of the earlier successful dCSE projects here and here. By optimisising the Distributed Block Compressed Sparse Row (DBCSR) matrix multiplication library, which is embedded within CP2K, this project would enable more efficient and scalable sparse matrix operations and reduce the time to solution for typical simulations.

The achievements of this project are summarised below:

- For Mixed-mode OpenMP/MPI, the aim was to bring the performance up to par with pure MPI for
compute dominated runs, and for communication-dominated runs, an approximate 30%
improvement was expected. At the end of this project, to demonstrate performance of a compute dominated run,
the H2O-64 benchmark with 32 cores showed that the mixed-mode code still lags a little way behind pure
MPI (9% for 2 threads, 42% for 4 threads). Here, only around 40% of the runtime is spent in DBCSR, and
there may also be other significant computations within the code that do not have good threaded performance.
For a communication-dominated regime (e.g. on 512 cores, using 2 threads per MPI process gives a speedup of
**22%**over pure MPI. For the larger H2O-1024 benchmark which spends about 70% of its runtime in DBCSR, using mixed-mode OpenMP/MPI gives better performance than with pure MPI. Performance with 2 threads per process has been increased by**18%**and for 4 threads per process by**35%**. Similarly for H2O-64, the mixed-mode code begins to perform better (e.g. an improvement of**29%**over pure MPI on 4096 cores). Extensive testing of similar benchmark inputs ranging from 32 to 2048 water molecules has showed that the absolute fastest time to solution for all problem sizes could be achieved using 2 OpenMP threads per process. - For runs in the compute-bound regime, a 10% improvement was expected in the peformance
for pure MPI. Following this work, for the H2O-64 case on 32 cores, a
**7%**improvement is now observed. For H2O-1024 there is little change in the peformance for pure MPI, but as noted above large improvements have been made for the mixed-mode code. - For the communication-bound regime, a 10% improvement was also expected. However, it is
difficult to draw general conclusions about performance in the communication-bound regime
from the H2O-64 data. The best case performance (2 threads per process, 512 total cores)
has been increased by
**4%**, but in other cases the mixed-mode code performs slightly worse than before.

There were three main parts to the optimisisations which were performed within this project:

- For the MPI data exchange, profiling showed that there was a load imbalance in MPI_Waitall with some
processes taking up to 6 times longer. This was improved by implementing MPI_Testany, such that
the master thread will now check for completion, while the other threads continue their work. To control
this feature, USE_COMM_THREAD and THREAD_LOAD were introduced as new parameters in the GLOBAL / DBCSR section.
The performance of CP2K with and without the comm thread improvement was demonstrated on a
(strongly DBCSR dominated) 6144 atom calculation of liquid water. Up to a
**13%**improvement was acheived with two threads and up to**20%**with six threads, even on relatively modest numbers of processors. At 2304 cores and above, underloading the comm thread to 80% was also shown to give an improvement over the default settings. - The Node-local data multiply for the atomic blocks of the matrix occurs after the required data has been collected on node. This was improved by replacing the original linear multiplication of the sub-matrices by a new recursive approach which improves spatial locality. A specialised small matrix multiplication library (libsmm) was also developed, this employs an autotuning (loop reordering / unrolling) approach for producing specialised DGEMM routines for a specified set of small matrix sizes. Typically, libsmm outperforms the standard BLAS by up to 10 times and when block sizes become larger, performance tends towards that of the standard BLAS.
- The Data preparation for multiplication step essentially sub-divides the matrix into
a 2D array of images within the DBCSR routine make_images. Here, the CSR index re-ordering routine was
re-written, along with a number of smaller OpenMP optimisisations. For a typical matrix, this
is now
**8%**faster for 128 MPI x 2 OMP, and**43%**faster for 32 MPI x 8 OMP. Currently, the index re-ordering routine uses an efficient quicksort algorithm, a threaded sort (parallel mergesort) was tested, however this approach gave poorer than expected performance. For list sizes of 10,000 elements, the parallel mergesort was faster on Magny-cours but slower on Interlagos (due to the shared FPUs). In general for smaller list sizes, it is more efficient to use the quicksort approach.

To get the best performance with CP2K on HECToR, users are recommended to:

- Choose the right number of MPI processes. If possible choose a number that is a square, a multiple of 32 (the number of cores on a HECToR node) and if possible a power of 2. E.g. 64, 256, 1024.
- Use libsmm, which is included as part of the central version of CP2K on HECToR.
- Use OpenMP for large problem sizes in particular, or on large numbers of cores. Using perhaps 2 or 4 OpenMP threads per MPI process may give better performance than pure MPI. OpenMP may also be of use for problems requiring large amounts of memory, where otherwise nodes would have to be underpopulated.

Please see PDF or HTML for a report which summarises this project.