The linear multiplication described above exhibits poor use of cache, since blocks of the B matrix will be accessed several times - once for every block in the corresponding column of A - but not necessarily soon afterwards (poor temporal locality). In addition, since the A matrix is accessed by row and B by column one of these will stride irregularly through memory (poor spatial locality). To overcome these problems a recursive multiplication scheme was implemented which makes divides up the matrices by halves until the block size is below a preset limit. At this base case of the recursion, the multiplication parameters for the blocks in this section of the matrices are pushed onto the stack (which may be executed if it has grown large enough). Choosing when to stop recursing is a trade-off between overhead of further recursion steps, and the cost of poor memory locality by operating on too large a section of the matrix. The controlling parameter norec defines the maximum number of matrix blocks the recursion will terminate at. Experiments were performed to tune this parameter using dense matrices of size 2000x2000 and varying block sizes (see figure 4). The graph shows that relatively small numbers of blocks should be processed at once, indicating the recursion is not too expensive in most cases. A value of 512 was chosen as a reasonable compromise. The new algorithm was first tested as a standalone code before being integrated into the main CP2K code. Results in the graph are taken from the stand-alone code.