The parallel 1D tri-diagonal solver for the Helmholtz equation contained 5 calls to MPI_ALLTOALLW in the initial implementation. These are performed for each of the 3 layers at every time iteration. Because of their placement within the code, the number of calls is not the relevant factor; the amount of time taken within each call is. The first feature noticed about the collectives was that every entry in the array of data types passed to MPI_ALLTOALLW was identical. Since the main difference between MPI_ALLTOALLW and MPI_ALLTOALLV is that the former allows the data type to vary between processes, the calls to MPI_ALLTOALLW were replaced with calls to MPI_ALLTOALLV, as shown below (original call first, then its replacement):

      ! Original call: one data type per process (all entries of ge_c%types are identical)
      call mpi_alltoallw(psi(1, 2), ge_c%scounts1, ge_c%sindices, ge_c%types, &
        & ge_c%recv(1, 1, 1), ge_c%rcounts1, ge_c%rindices, ge_c%types, &
        & decomp_j%cart_comm, ierr)

      ! Replacement call: a single data type for all processes
      call mpi_alltoallv(psi(1, 2), ge_c%scounts1, ge_c%sindices/8, &
        & MPI_DOUBLE_PRECISION, ge_c%recv(1, 1, 1), ge_c%rcounts1, &
        & ge_c%rindices/8, MPI_DOUBLE_PRECISION, decomp_j%cart_comm, ierr)

It was expected that MPI_ALLTOALLV would perform better than MPI_ALLTOALLW; however, it turned out to be between 5% and 10% worse.
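
The division of ge_c%sindices and ge_c%rindices by 8 in the MPI_ALLTOALLV call reflects a difference in units: MPI_ALLTOALLW takes its displacements in bytes, whereas MPI_ALLTOALLV takes them in multiples of the data type's extent, which is 8 bytes for MPI_DOUBLE_PRECISION on most systems. The following is a minimal sketch of that conversion, not code from the solver. The program name, the array names byte_displs and elem_displs, and the offset values are hypothetical. It queries the element size rather than hard-coding 8:

```fortran
program byte_to_element_displs
  use mpi
  implicit none
  integer :: ierr, dsize
  integer :: byte_displs(3), elem_displs(3)

  call mpi_init(ierr)

  ! Size in bytes of one MPI_DOUBLE_PRECISION element (8 on most systems).
  call mpi_type_size(MPI_DOUBLE_PRECISION, dsize, ierr)

  ! Hypothetical byte offsets, in the units MPI_ALLTOALLW expects.
  byte_displs = (/ 0, 32, 80 /)

  ! Integer division truncates, so check divisibility before converting.
  if (any(mod(byte_displs, dsize) /= 0)) then
    print *, 'displacements are not a multiple of the element size'
    call mpi_abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! Element offsets, in the units MPI_ALLTOALLV expects.
  elem_displs = byte_displs / dsize
  print *, 'element displacements:', elem_displs

  call mpi_finalize(ierr)
end program byte_to_element_displs
```

The send and receive counts need no such conversion when every type entry is a single MPI_DOUBLE_PRECISION, since counts are expressed in elements in both collectives.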

The next step was to develop a method using MPI_ALLTOALL and buffers, making use of the information already available in the code. The benefit of doing this would be to take advantage of the hardware optimised collective (i.e. for the Gemini interconnect). The disadvantage was that MPI_ALLTOALL requires fixed length buffers, so extra code would need to be developed to ensure that the buffers were packed and unpacked efficiently.

Firstly, the size of the buffer for each process is set to double the maximum size of the packed data (2*max_ii2s in the call below). Processes whose packed data is smaller than this maximum pad the remainder of their buffers with zeros. The purpose of this is that all buffers on each process are of equal size and of the same data type (MPI_DOUBLE_PRECISION), such that the following call may then be made:

      call mpi_alltoall(c_full,2*max_ii2s,MPI_DOUBLE_PRECISION, &
        &  coeff_full,2*max_ii2s,MPI_DOUBLE_PRECISION,decomp_j%cart_comm, ierr)

Prior to this call the sending buffer (c_full) has to be packed, and after the call the receiving buffer (coeff_full) has to be unpacked. This is implemented within loops which copy from the array psi to c_full and then from coeff_full to ge_c%recv. The loops involve some strided memory access, governed by ge_c%scounts1 and ge_c%rcounts1; this is unavoidable but, fortunately, does not cause any performance problems.
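
The pack, exchange and unpack pattern described above can be illustrated with a small stand-alone program. This is only a sketch under stated assumptions, not the solver's implementation. The names slot, counts, src, dst, sendbuf and recvbuf are hypothetical, as is the symmetric communication pattern, which is chosen so that the result is easy to check. Each process copies its variable-length segments into zero-padded, fixed-length slots, calls MPI_ALLTOALL, and then copies only the valid part of each received slot back out:

```fortran
program padded_alltoall_sketch
  use mpi
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: slot = 3   ! fixed slot length (hypothetical maximum count)
  integer :: ierr, rank, nprocs, p, off, n
  integer, allocatable :: counts(:)
  real(dp), allocatable :: src(:), dst(:), sendbuf(:), recvbuf(:)
  logical :: ok

  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Hypothetical symmetric pattern: ranks a and b exchange mod(a + b, slot) + 1
  ! values, so one array serves as both the send and the receive counts.
  allocate(counts(nprocs))
  do p = 1, nprocs
    counts(p) = mod(rank + p - 1, slot) + 1
  end do

  ! Variable-length source data, stored contiguously segment by segment.
  allocate(src(sum(counts)), dst(sum(counts)))
  off = 0
  do p = 1, nprocs
    src(off + 1 : off + counts(p)) = real(100 * rank + (p - 1), dp)
    off = off + counts(p)
  end do

  ! Pack: copy each segment to the start of its slot; the padding stays zero.
  allocate(sendbuf(slot * nprocs), recvbuf(slot * nprocs))
  sendbuf = 0.0_dp
  off = 0
  do p = 1, nprocs
    n = counts(p)
    sendbuf((p - 1) * slot + 1 : (p - 1) * slot + n) = src(off + 1 : off + n)
    off = off + n
  end do

  ! Every process sends and receives exactly 'slot' elements per partner.
  call mpi_alltoall(sendbuf, slot, MPI_DOUBLE_PRECISION, &
    & recvbuf, slot, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)

  ! Unpack: keep only the valid part of each slot and discard the padding.
  off = 0
  ok = .true.
  do p = 1, nprocs
    n = counts(p)
    dst(off + 1 : off + n) = recvbuf((p - 1) * slot + 1 : (p - 1) * slot + n)
    ! Rank p - 1 filled the segment destined for this rank with 100*(p - 1) + rank.
    if (any(dst(off + 1 : off + n) /= real(100 * (p - 1) + rank, dp))) ok = .false.
    off = off + n
  end do

  if (.not. ok) print *, 'rank', rank, ': unexpected data after unpacking'
  call mpi_finalize(ierr)
end program padded_alltoall_sketch
```

The padding means more data is transferred than with MPI_ALLTOALLV. The trade-off is only worthwhile when the fixed-length collective is sufficiently faster on the target interconnect.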

Phil Ridley 2012-10-01