The first approach attempted was to implement a buffer shared between all processes on the same SMP node (4 processes per node on the XT4, and up to 24 processes per node on the XT6), using the Unix SHM API to give each process access to the buffer. This follows the approach of the 2Decomp library used in Incompact3D, and makes use of code from David Tanqueray of Cray for identifying which processes belong to which SMP node and for setting up the shared buffers. Note that this approach is not portable, as it relies on details of the /proc filesystem on the Cray platform.
The aim is to combine the data from all processes on an SMP node into a single send buffer owned by one process (the `root') on that node. This has two benefits. Firstly, the number of processes involved in the MPI_Alltoallv operation is reduced by a factor of the SMP width. As the number of messages exchanged is asymptotically O(p^2) in the number of participating processes p, this dramatically reduces the number of messages: e.g. using 64 cores of HECToR Phase 2a with an SMP width of 4, the number of messages falls from 64 × 63 = 4032 to 16 × 15 = 240. Secondly, aggregating messages reduces the impact of network latency, which is proportional to the number of messages. The approach relies on the copies of data into and out of the shared buffer being relatively cheap, as these are an extra step not present in the original algorithm.
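The message-count arithmetic above can be checked with a short sketch. The function names are hypothetical and are not taken from the code described here; the sketch assumes that every participant sends one message to every other participant, and that the core count is an exact multiple of the SMP width.

```python
def alltoall_messages(participants: int) -> int:
    """Messages in an all-to-all exchange, excluding self-sends: p * (p - 1)."""
    return participants * (participants - 1)


def aggregated_messages(cores: int, smp_width: int) -> int:
    """Messages when only one root process per SMP node takes part.

    Assumes (for illustration) that `cores` is an exact multiple of
    `smp_width`, so that every node is fully populated.
    """
    if cores % smp_width != 0:
        raise ValueError("cores must be a multiple of smp_width")
    return alltoall_messages(cores // smp_width)


if __name__ == "__main__":
    # 64 cores with 4 processes per node, as in the example in the text.
    print(alltoall_messages(64))        # 4032
    print(aggregated_messages(64, 4))   # 240
```

The reduction factor approaches the square of the SMP width as the core count grows, which is why wider nodes such as those of the XT6 would be expected to benefit more, provided the extra memory copies remain cheap.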
The implementation stores the shared send and receive buffers in the fft_dlay_descriptor type, which contains parameters relating to the FFT grids (dimensions, indexing arrays etc.). As this type is preserved from one FFT iteration to the next, the buffers do not need to be repeatedly allocated and deallocated.
A substantial amount of additional `book-keeping' is required, as each process needs to know how much data every other process in its SMP node is sending in order to copy its own data to and from the correct regions of the shared buffer. Two extra communicators must also be created: one containing only the root process of each SMP node (used for the MPI_Alltoallv), and one containing just the processes in the SMP node (used for exchanging send and receive counts).
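As an illustration of the kind of book-keeping involved, the following sketch computes where each local process would write into a shared send buffer, together with the counts and displacements the root would pass to the all-to-all. It is not the layout used by the actual code: it assumes a buffer ordered by destination node and then by local rank, and it assumes ranks are placed contiguously on nodes, whereas the implementation described here determines node membership from /proc. All names are hypothetical.

```python
from itertools import accumulate


def exclusive_prefix_sum(values):
    """Return [0, v0, v0 + v1, ...], one entry per input value."""
    return [0] + list(accumulate(values))[:-1]


def shared_buffer_layout(counts):
    """Lay out a shared send buffer for one SMP node.

    counts[r][d] is the number of elements local process r sends to
    destination node d. Assumed layout: one contiguous block per
    destination node, subdivided by local rank.

    Returns (sendcounts, sdispls, offsets), where sendcounts and sdispls
    describe the root's view of the buffer, and offsets[r][d] is the
    position at which local process r copies its data for node d.
    """
    n_local = len(counts)
    n_nodes = len(counts[0])
    # Total data destined for each node, summed over local processes.
    sendcounts = [sum(counts[r][d] for r in range(n_local))
                  for d in range(n_nodes)]
    sdispls = exclusive_prefix_sum(sendcounts)
    offsets = [[0] * n_nodes for _ in range(n_local)]
    for d in range(n_nodes):
        # Position of each local rank inside the block for node d.
        within = exclusive_prefix_sum([counts[r][d] for r in range(n_local)])
        for r in range(n_local):
            offsets[r][d] = sdispls[d] + within[r]
    return sendcounts, sdispls, offsets


def communicator_colours(rank, smp_width):
    """Grouping for the two extra communicators.

    Assumes contiguous rank placement (an assumption of this sketch).
    Returns (node_colour, is_root): ranks sharing node_colour form the
    intra-node communicator; ranks with is_root set form the roots-only one.
    """
    return rank // smp_width, rank % smp_width == 0
```

For two local processes with counts [[2, 1], [3, 4]], the root would send 5 elements to each of the two nodes, and the second process would write its data at offsets 2 and 6. Each process needs the counts of the lower-ranked processes on its node to compute these offsets, which is why the intra-node communicator is used to exchange them.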
Compiling the code with the SHM Alltoall requires the __FFT_SHM macro to be defined, and a compiler that supports Cray pointers (e.g. gfortran with -fcray-pointer).