The second major change to the rs2pw_transfer routine was to replace the existing use of MPI_Sendrecv with non-blocking MPI, allowing the packing of the buffers to be overlapped with communication. The following pseudocode outlines the approach taken:
  calculate bounds for 'down' direction
  allocate recv buffer
  post MPI_Irecv
  pack send buffer
  send using MPI_Isend
  calculate bounds for 'up' direction
  allocate recv buffer
  post MPI_Irecv
  pack send buffer
  send using MPI_Isend
  MPI_Waitany on both outstanding recvs
  unpack each buffer when it arrives
  MPI_Waitall on both sends
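The pseudocode above might be realised along the following lines. This is a simplified Fortran sketch of the pattern, not the actual CP2K implementation: the routine name, argument names (dest_down, src_down, etc.) and the fixed message size n are illustrative, and the packing/unpacking steps are left as comments.

  ! Sketch of the non-blocking two-direction exchange; names are
  ! illustrative, not taken from rs2pw_transfer itself.
  SUBROUTINE halo_exchange(comm, dest_down, src_down, dest_up, src_up, &
                           sbuf_down, sbuf_up, n)
    USE mpi
    INTEGER, INTENT(IN) :: comm, dest_down, src_down, dest_up, src_up, n
    REAL(KIND=8), INTENT(IN) :: sbuf_down(n), sbuf_up(n)
    REAL(KIND=8), ALLOCATABLE :: rbuf_down(:), rbuf_up(:)
    INTEGER :: rreq(2), sreq(2), idx, ierr
    INTEGER :: stat(MPI_STATUS_SIZE)

    ! 'down' direction: pre-post the receive, then pack and send
    ALLOCATE(rbuf_down(n))
    CALL MPI_Irecv(rbuf_down, n, MPI_DOUBLE_PRECISION, src_down, 1, &
                   comm, rreq(1), ierr)
    ! ... pack sbuf_down here ...
    CALL MPI_Isend(sbuf_down, n, MPI_DOUBLE_PRECISION, dest_down, 1, &
                   comm, sreq(1), ierr)

    ! 'up' direction, likewise; its packing overlaps the 'down' send
    ALLOCATE(rbuf_up(n))
    CALL MPI_Irecv(rbuf_up, n, MPI_DOUBLE_PRECISION, src_up, 2, &
                   comm, rreq(2), ierr)
    ! ... pack sbuf_up here ...
    CALL MPI_Isend(sbuf_up, n, MPI_DOUBLE_PRECISION, dest_up, 2, &
                   comm, sreq(2), ierr)

    ! Unpack each receive buffer as soon as it completes
    CALL MPI_Waitany(2, rreq, idx, stat, ierr)
    ! ... unpack buffer idx ...
    CALL MPI_Waitany(2, rreq, idx, stat, ierr)
    ! ... unpack buffer idx ...

    ! Ensure both sends have completed before the buffers are reused
    CALL MPI_Waitall(2, sreq, MPI_STATUSES_IGNORE, ierr)
    DEALLOCATE(rbuf_down, rbuf_up)
  END SUBROUTINE halo_exchange

Note that the send buffers must not be modified until MPI_Waitall returns, which is why the final wait on the sends cannot be omitted even though the received data has already been unpacked.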
This maximises performance by pre-posting the receives, which makes best use of the `Portals' communication layer underlying Cray's MPI, and by starting each send as early as possible, so that it can be overlapped with the packing of the other buffer.
In practice, this change was found to be performance-neutral. This is believed to be because the time spent actually sending the data is much greater than the time spent packing the buffers, so there is very little time that could potentially be saved. Since only two send/recv pairs are ever active at one time, the scope for overlapping is also limited. As it stands, rs2pw_transfer is called in a loop over the grid levels, so there is the possibility of overlapping communications across iterations, allowing up to a further factor of four (depending on the number of grid levels) more messages in flight. However, the amount of extra book-keeping code this would add, set against a potentially modest performance gain, meant that this was not attempted.
Nevertheless, because non-blocking communication is recommended by Cray as the optimal way to do point-to-point messaging, this change was committed to CVS, as it may be of benefit where the halos are very large or where network performance is poor.