Data preparation for multiplication

The third aspect of DBCSR that was investigated was the preparation step that occurs before every matrix multiplication, found in the DBCSR routine make_images. This subdivides each local sub-matrix into a 2D array of `images', such that the global set of images is square, and thus suitable for Cannon's algorithm. For example, when using 6 MPI processes (arranged in a 2x3 grid), each process will have an array of 3x2 images, giving a total of 6x6 images globally. In addition, the first column and row shifts (pre-shifts) of Cannon's algorithm are performed. If a matrix is symmetric, blocks which are stored only once in the initial matrix are desymmetrized and stored twice, as the two copies may be sent to different processes. Once the destination of each block (either local or remote) is determined, the data is copied into buffers and sent to the receiving process. The received data will in general have come from a variety of different processes, so the blocks are sorted into the correct CSR order and the index is rebuilt using the dbcsr_finalize routine, which is normally used for merging work matrices from multiple threads together.
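The image counts in the example above can be reproduced with a short calculation. The following is only a sketch, not the make_images code: it assumes that the side of the global image grid is the least common multiple of the two process-grid dimensions, which is consistent with the 2x3 example but is not stated explicitly here. The function name image_grid is hypothetical.

```python
from math import gcd


def image_grid(nprows, npcols):
    """Return (row_images, col_images, global_side) for a process grid.

    Assumption: the global image grid is the smallest square that both
    process-grid dimensions divide, i.e. lcm(nprows, npcols) per side.
    """
    side = nprows * npcols // gcd(nprows, npcols)  # least common multiple
    return side // nprows, side // npcols, side


# 6 MPI processes in a 2x3 grid: 3x2 images per process, 6x6 globally
print(image_grid(2, 3))
# A square process grid needs only a single image per process
print(image_grid(4, 4))
```

Under this assumption a square process grid always yields exactly one image per process, which is the single-image case referred to later in this section.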

Using the CrayPAT API to profile sub-regions within this routine shows that the largest contribution to the runtime comes from dbcsr_finalize. A special case of this routine was written to account for the fact that, rather than merging data from several threads' work matrices, we instead have only a single work matrix which contains unsorted blocks. We also avoid having to account for the case where the matrix being merged into already has existing blocks, since all the blocks making up the new image come from the MPI receive buffer. This, together with a number of other smaller OpenMP optimisations, results in a speedup in cases where there is more than one image (i.e. the number of MPI processes is not a square): 8% faster for 128 MPI x 2 OMP, and 43% faster for 32 MPI x 8 OMP.
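To illustrate why the special case is simpler, the sketch below builds a CSR index from a single unsorted list of blocks with an empty target matrix, so no merge with existing blocks is needed. It is an illustration of the idea only and not the DBCSR implementation; the helper name build_csr_index, the (row, column, data) tuple layout and the 0-based indices are all assumptions made for the example.

```python
def build_csr_index(blocks, nrows):
    """Sort unsorted (row, col, data) blocks and build a CSR row pointer.

    Hypothetical helper: assumes one list of received blocks and an
    empty target matrix, so there is nothing to merge with.
    """
    # CSR order: first by row, then by column
    ordered = sorted(blocks, key=lambda b: (b[0], b[1]))

    row_ptr = [0] * (nrows + 1)
    for row, _col, _data in ordered:
        row_ptr[row + 1] += 1           # count the blocks in each row
    for r in range(nrows):
        row_ptr[r + 1] += row_ptr[r]    # prefix sum gives the row offsets

    col_idx = [b[1] for b in ordered]
    data = [b[2] for b in ordered]
    return row_ptr, col_idx, data


# Blocks as they might arrive from several senders, in no particular order
received = [(2, 0, "c"), (0, 1, "b"), (0, 0, "a"), (2, 2, "d")]
row_ptr, col_idx, data = build_csr_index(received, nrows=3)
print(row_ptr, col_idx, data)
```

Apart from the sort, the work is a single counting pass and a prefix sum, which is why the sort dominates the cost of such a routine.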

The majority of the time taken by the new finalize routine, and by the routine used when there is only one image, is taken up by sorting the received blocks into CSR order (first by row, then by column). This is currently done using an efficient quicksort, which uses only a single thread. A threaded sort (parallel mergesort) was implemented following [11], using the existing quicksort as the base case and the parallel merge from [12].
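The overall structure of such a threaded sort can be sketched as follows. This is not the implementation described above: Python's built-in sorted stands in for the existing quicksort, the merge step is a plain serial pairwise merge rather than the parallel merge of [12], and the function name parallel_sort is hypothetical. Threads in CPython will also not give a real speedup for this workload, so the example shows only the control flow.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def parallel_sort(keys, nthreads):
    """Sort keys by sorting nthreads chunks concurrently, then merging."""
    if nthreads <= 1 or len(keys) < nthreads:
        return sorted(keys)  # base case: the serial sort

    chunk = -(-len(keys) // nthreads)  # ceiling division
    parts = [keys[i:i + chunk] for i in range(0, len(keys), chunk)]

    # Each chunk is sorted independently by the serial sort
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        runs = list(pool.map(sorted, parts))

    # Merge the sorted runs pairwise until one remains (serial here)
    while len(runs) > 1:
        merged = [list(merge(runs[i], runs[i + 1]))
                  for i in range(0, len(runs) - 1, 2)]
        if len(runs) % 2:
            merged.append(runs[-1])  # odd run out is carried forward
        runs = merged
    return runs[0]


# (row, column) keys compare first by row, then by column
blocks = [(2, 1), (0, 3), (1, 0), (0, 1), (2, 0)]
print(parallel_sort(blocks, nthreads=2))
```

With a serial merge, as in this sketch, the merge phase quickly becomes the limiting factor as the thread count grows, which is the motivation for using a parallel merge.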

This gives poorer than expected results for typical list sizes of 10,000 elements (see Table 2). One reason for this is related to the shared 'module' structure of the AMD Interlagos processor. If we run on the Magny-Cours processor, where every core has its own FPU, instruction unit etc., we see much better performance (see the third row of the table). In addition, since we are sorting an array that has been written by MPI (i.e. by a single thread), there is a penalty in accessing the data. If the data were written in parallel, so that each thread has its own portion of the array in cache at the start of the sort, better performance would be achieved (see the fourth row of the table). However, this is not possible in practice and only serves to put an upper limit on the performance of the parallel sort. Larger array sizes give greater speedups (e.g. 2.78x on 4 threads with 100,000 elements), whereas with smaller sizes such as 1,000 elements there is no speedup at all. With more work it might be possible to construct a heuristic for when to use the parallel sort (and with how many threads), but for now the CP2K sort has been left unchanged.

Table 2: Speedup vs CP2K sort on 1 thread (10,000 elements)

Threads	1	2	4	8
sort+merge	0.99	1.35	1.89	1.84
sort only	0.99	1.65	2.67	2.78
Magny-Cours	0.99	1.54	2.43	2.48
Parallel data	0.99	1.44	2.37	2.94
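
One way to read Table 2 is to convert each speedup into a parallel efficiency, i.e. the speedup divided by the number of threads. The short sketch below does this using only the figures from the table; the variable and function names are illustrative.

```python
threads = [1, 2, 4, 8]

# Speedups copied from Table 2 (10,000 elements)
speedup = {
    "sort+merge":    [0.99, 1.35, 1.89, 1.84],
    "sort only":     [0.99, 1.65, 2.67, 2.78],
    "Magny-Cours":   [0.99, 1.54, 2.43, 2.48],
    "Parallel data": [0.99, 1.44, 2.37, 2.94],
}


def efficiency(row):
    """Parallel efficiency (speedup / threads) for one row of the table."""
    return [s / t for s, t in zip(speedup[row], threads)]


for name in speedup:
    print(name, " ".join(f"{e:.2f}" for e in efficiency(name)))
```

At 8 threads the efficiency is below 40% in every row, and below 25% for the complete sort+merge case.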