
Node-local data multiply

Initially, when multiplying two matrices (or sub-matrices), DBCSR would loop over the rows of the left matrix (A) and, for each block, loop over all blocks in the corresponding column of the right matrix (B), accumulating the products into the appropriate block of C. The individual block multiplications were performed by DGEMM from the system-provided BLAS (or by the Fortran MATMUL intrinsic if BLAS was not available). In practice, the blocks are not multiplied one at a time as the matrix is traversed. Instead, the parameters for each multiplication (block pointers and sizes) are added to a stack, which is processed once it reaches a certain size. Alternating between accessing the index (building the stack) and accessing the data (processing the stack) in this more granular fashion makes better use of cache. The default stack limit is 1000 entries; it may be overridden by setting the MM_STACK_SIZE variable in the GLOBAL / DBCSR section of the input file.
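The stack mechanism can be illustrated with a minimal Python sketch. It is not DBCSR code: the names matmul_block, multiply_stacked and stack_limit are hypothetical, the blocks are plain nested lists rather than pointers into packed storage, and a simple triple loop stands in for DGEMM. The sketch assumes each A block (i, k) is paired with the B blocks (k, j), as block matrix multiplication requires, and it only shows how work is deferred onto a stack and processed in batches.

```python
# Illustrative sketch only. All names here are hypothetical, not DBCSR routines.

def matmul_block(a, b, c):
    """Accumulate c += a * b for small dense blocks stored as lists of rows.

    Stands in for the DGEMM (or MATMUL) call on a single block pair.
    """
    for i in range(len(a)):
        for j in range(len(b[0])):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(len(b)))


def multiply_stacked(a_blocks, b_blocks, stack_limit=1000):
    """Block-sparse C = A * B with block products deferred onto a stack.

    a_blocks maps (i, k) -> block and b_blocks maps (k, j) -> block.
    Returns (c_blocks, number_of_times_the_stack_was_processed).
    """
    # Group B blocks by block-row so the partners of A(i, k) are found quickly.
    b_by_row = {}
    for (k, j), blk in b_blocks.items():
        b_by_row.setdefault(k, []).append((j, blk))

    c_blocks = {}
    stack = []      # pending (a, b, c) block references
    flushes = 0

    def process_stack():
        # Data phase: perform every queued block multiplication.
        nonlocal flushes
        for a, b, c in stack:
            matmul_block(a, b, c)
        stack.clear()
        flushes += 1

    # Index phase: traverse A row by row and queue the work instead of doing it.
    for (i, k), a in sorted(a_blocks.items()):
        for j, b in b_by_row.get(k, []):
            c = c_blocks.get((i, j))
            if c is None:
                # Create a zeroed target block of the right shape.
                c = [[0.0] * len(b[0]) for _ in range(len(a))]
                c_blocks[(i, j)] = c
            stack.append((a, b, c))
            if len(stack) >= stack_limit:
                process_stack()

    if stack:       # process whatever is left at the end of the traversal
        process_stack()
    return c_blocks, flushes
```

Calling multiply_stacked(a_blocks, b_blocks, stack_limit=1) reproduces the original one-block-at-a-time behaviour, while larger limits batch more multiplications between visits to the index. The result is the same for any limit; only the interleaving of index and data accesses changes.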

Iain Bethune