
distribute_matrix

As mentioned earlier, the matrix used in the collocate and integrate routines is distributed, so before the tasks are mapped, the required matrix blocks must be gathered on each MPI process; conversely, after integration, each matrix block must be scattered back to the process it originated on. This is done by the subroutine distribute_matrix, which simply traverses the task list, packs matrix blocks into an MPI send buffer, and distributes them using MPI_Alltoallv. The received blocks are then unpacked into the correct positions in the local matrix.
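
The routine therefore follows a simple pack / exchange / unpack pattern. A minimal sketch of this structure is given below; the buffer names and the communicator variable are illustrative, not the actual variable names used in the source.

  ! Pack: copy each block into the region of the send buffer
  ! reserved for its destination process
  <pack matrix blocks into send_buf, filling send_sizes and send_disps>

  ! Exchange: a single collective moves every block to its destination
  CALL MPI_Alltoallv(send_buf, send_sizes, send_disps, MPI_DOUBLE_PRECISION, &
                     recv_buf, recv_sizes, recv_disps, MPI_DOUBLE_PRECISION, &
                     group, ierr)

  ! Unpack: copy each received block into its position in the local matrix
  <unpack recv_buf into the local matrix blocks>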

While a significant proportion of this routine is taken up by the communication, the buffer packing step can be accelerated using OpenMP. This is slightly complicated by the fact that the packing loops maintain a running count of the number of data elements sent to each process, so the loop is not easily parallelisable as is.

DO i = 1, SIZE(atom_pair_send)

  l = <process to send this block to>
  <pack buffer using send_disps(l) + send_sizes(l) as offset>
  send_sizes(l) = send_sizes(l) + <size of block packed>

END DO

We need to ensure that the data for a single process is packed by only one thread, thus avoiding a data race on elements of the send_sizes array. To do this, we first traverse the array of atom pairs (in fact this is already done earlier, to calculate the total size required for the send buffer) and find the number of pairs destined for each process, along with the displacement of each process's pairs in the list (a sketch of this counting pass is given after the loop below). The loop can then be transformed into a parallel form:

!$omp parallel do
DO l = 1, group_size
  DO i = 1, send_pair_count(l)

    <pack buffer using send_disps(l) + send_sizes(l) as offset>
    send_sizes(l) = send_sizes(l) + <size of block packed>
  END DO
END DO
!$omp end parallel do
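
The counting pre-pass referred to above might look roughly as follows; send_pair_count and send_pair_disp are illustrative names, and the atom pair list is assumed to be ordered by destination process, so that inside the parallel loop the i-th pair for process l is found at position send_pair_disp(l) + i.

  ! Count the number of atom pairs destined for each process ...
  send_pair_count(:) = 0
  DO i = 1, SIZE(atom_pair_send)

    l = <process to send this block to>
    send_pair_count(l) = send_pair_count(l) + 1

  END DO

  ! ... and the displacement of each process's first pair in the list
  send_pair_disp(1) = 0
  DO l = 2, group_size
    send_pair_disp(l) = send_pair_disp(l-1) + send_pair_count(l-1)
  END DO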

In order to measure the speedup (excluding the communication time, which remains constant), the code was instrumented with CrayPAT regions. The results are shown in Table 2. Clearly the speedup is not as large as might be hoped; the main limitation here is memory bandwidth, which does not increase strongly with the number of threads used. It is also suspected that this loop is significantly imbalanced, particularly as the region of the buffer that a process sends to itself is typically much larger than the regions sent to other processes. Choosing a more appropriate OpenMP schedule might improve this somewhat, but this was not investigated in the time available.


Table 2: Times (in seconds) and speedup for distribute_matrix (using 6 threads on Rosa)

  Region                  Before   After   Speedup
  Packing Send Buffer       2.25    1.60       29%
  Unpacking Recv Buffer     3.00    2.02       33%
  Total user time           7.07    5.04       29%
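
As suggested above, one possible experiment (not tried here) would be a dynamic schedule on the outer loop, so that threads which finish their small per-process regions early can pick up further iterations while another thread packs the large send-to-self region:

!$omp parallel do schedule(dynamic)
DO l = 1, group_size

  <pack all blocks destined for process l, as above>

END DO
!$omp end parallel do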

