

MPI Proxy

So that the express user interface process can run on the HECToR login node, a number of changes to AVS/Express are required. Most significantly, all MPI functionality must be removed from the executable so that it can be run outside of the MPI job. Two strategies were considered. The first was to add another communication API to express and the parallel module framework, removing any dependency on MPI. This strategy was rejected because it would have required a significant rewrite of large sections of AVS code, in particular the framework used to manage the parallel modules and rendering. In addition, users developing their own parallel modules would potentially have needed to be aware of both the MPI and non-MPI communication methods.

The second strategy, implemented in this project, is to provide an alternative MPI library that does not use the Cray MPI layer but still allows the express executable to be linked without major source code changes. The express executable can then continue to make MPI function calls, none of which require the Cray MPI layer found on the backend nodes.
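To make the idea concrete, the following is a minimal sketch of the kind of replacement header such a library can expose. The contents shown here are illustrative assumptions rather than the actual XPMT header: the key point is that MPI objects become plain integer handles and the familiar MPI prototypes are re-declared, so express compiles and links against the replacement library with no reference to <mpi.h>.

/* Illustrative sketch only; not the actual XPMT header contents.      */
/* MPI objects are reduced to integer handles that the proxy resolves. */
#ifndef XPMT_NOMPI_H
#define XPMT_NOMPI_H

typedef int MPI_Comm;      /* opaque handle: index into the proxy's table */
typedef int MPI_Datatype;  /* of real Cray MPI objects                    */
typedef struct { int MPI_SOURCE, MPI_TAG, MPI_ERROR; } MPI_Status;

#define MPI_COMM_WORLD 0   /* well-known handle values, resolved by the proxy */
#define MPI_INT        1
#define MPI_SUCCESS    0

int MPI_Init(int *argc, char ***argv);
int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype type,
             int src, int tag, MPI_Comm comm, MPI_Status *status);
int MPI_Finalize(void);

#endif /* XPMT_NOMPI_H */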

Our replacement MPI library is referred to as XPMT (Express MPI Tunnel). The express source includes xpmt_nompi.h (rather than <mpi.h>) and is linked against libxpmt.so, and the result is compiled as a serial login-node executable using the dual-core programming environment. libxpmt.so contains our MPI functions, which communicate with a proxy MPI process via a standard TCP/IP socket. This proxy process is a genuine Cray MPI process (always rank 0 in the MPI job) running on the backend nodes. As shown in Figure [*], the non-MPI express sends requests for MPI functions to be called on the compute node on which the proxy, xpnode, is running. The proxy receives the request together with any arguments required by the requested MPI function. For example, a request for MPI_Send() carries the buffer, count, datatype, destination rank, tag and communicator arguments expected by the MPI function. Upon receiving the request, the xpnode process calls the Cray MPI function with these arguments. Any results of the call (return value, buffer contents etc.) are sent back to the express process via the socket. Hence the non-MPI express process is unaware that it is calling MPI functions via a proxy.

Figure: Forwarded MPI from login nodes to compute nodes. [Image: Plates/ddr-diagram-02]
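As an illustration of the forwarding just described, the sketch below shows how the login-node side of the tunnel (libxpmt.so) might implement MPI_Send(). The request layout, function identifiers and helper names (xpmt_sock, xpmt_sizeof) are assumptions made for this example; the real XPMT wire protocol is not reproduced here.

/* Hypothetical login-node implementation of a forwarded MPI_Send(). */
#include <stdint.h>
#include <unistd.h>        /* write(), read() */
#include "xpmt_nompi.h"

extern int xpmt_sock;                        /* TCP socket to the xpnode proxy    */
extern size_t xpmt_sizeof(MPI_Datatype t);   /* bytes per element of a type       */

enum { XPMT_REQ_SEND = 1 };                  /* one identifier per forwarded call */

struct xpmt_req {                            /* fixed-size request header */
    int32_t func, count, datatype, dest, tag, comm;
};

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    struct xpmt_req req = { XPMT_REQ_SEND, count, type, dest, tag, comm };
    int32_t ret;

    /* Ship the request header and the user buffer to the proxy. */
    write(xpmt_sock, &req, sizeof req);
    write(xpmt_sock, buf, (size_t)count * xpmt_sizeof(type));

    /* Block until the proxy reports the Cray MPI_Send() return value. */
    read(xpmt_sock, &ret, sizeof ret);
    return (int)ret;
}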

The MPI proxy process (xpnode) includes xpmt_mpi.h and <mpi.h> and links against libxpmt_mpi.a and the Cray MPI libraries. This allows it to map XPMT's representation of MPI types to Cray MPI types. In our implementation the XPMT representations of MPI types are all integers that act as indices into a table of real Cray MPI types within the xpnode proxy process. When express creates new MPI objects (communicators, datatypes, statuses etc.) the proxy creates the equivalent objects using the Cray MPI layer, and a mapping between the two representations is maintained.
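A minimal sketch of such a handle table for communicators is shown below. The table size, function names and initialisation are assumptions made for illustration; the real xpnode implementation also maintains equivalent tables for datatypes and other MPI objects.

/* Hypothetical handle table inside the xpnode proxy (communicators only). */
#include <mpi.h>

#define XPMT_MAX_COMMS 64

static MPI_Comm comm_table[XPMT_MAX_COMMS];
static int      comm_count = 0;

/* Called at start-up so that handle 0 always means MPI_COMM_WORLD,
 * matching the constant compiled into the non-MPI express executable. */
void xpmt_init_tables(void)
{
    comm_table[comm_count++] = MPI_COMM_WORLD;
}

/* When express creates a new communicator, the real Cray MPI object is
 * stored here and only the small integer handle is returned over the socket. */
int xpmt_register_comm(MPI_Comm real)
{
    comm_table[comm_count] = real;
    return comm_count++;
}

/* Translate an XPMT handle received from express back into the real object. */
MPI_Comm xpmt_lookup_comm(int handle)
{
    return comm_table[handle];
}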

The pstnode and mpunode MPI processes are unchanged (they are standard Cray MPI executables) and communicate with the xpnode process as though it were the express process. From their point of view the rank 0 process is express (and xpnode is always rank 0), and they only communicate with it in response to MPI functions being called by express. For example, if express posts an MPI_Recv() the proxy xpnode will make the same function call from rank 0. When the pstnode or mpunode processes make the corresponding MPI_Send() call, the proxy xpnode receives the data and passes it back to the non-MPI express process. Hence the sending processes are completely unaware that the xpnode process is a proxy for express.
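The sketch below shows how the proxy side of such a forwarded MPI_Recv() might look. It reuses the hypothetical request header and handle-table helpers from the earlier sketches (with a source rank in place of the destination); none of these names are taken from the actual XPMT source.

/* Hypothetical xpnode handler for a forwarded MPI_Recv() request. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

extern int xpmt_sock;                           /* socket back to express */
extern MPI_Comm     xpmt_lookup_comm(int h);    /* handle-table lookups   */
extern MPI_Datatype xpmt_lookup_type(int h);

struct xpmt_req { int32_t func, count, datatype, src, tag, comm; };

void xpmt_handle_recv(const struct xpmt_req *req)
{
    MPI_Datatype type = xpmt_lookup_type(req->datatype);
    MPI_Comm     comm = xpmt_lookup_comm(req->comm);
    MPI_Status   status;
    int type_size;
    int32_t ret;

    MPI_Type_size(type, &type_size);
    void *buf = malloc((size_t)req->count * type_size);

    /* Rank 0 (xpnode) performs the real receive; the pstnode or mpunode
     * sender sees an ordinary Cray MPI message arriving at rank 0. */
    ret = MPI_Recv(buf, req->count, type, req->src, req->tag, comm, &status);

    /* Forward the return value and the received data back to express. */
    write(xpmt_sock, &ret, sizeof ret);
    write(xpmt_sock, buf, (size_t)req->count * type_size);
    free(buf);
}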

It should be noted that the pstnode and mpunode processes communicate with each other via the Cray MPI layer and so benefit from this optimised library and its use of the Cray interconnect. The largest data transfer occurs between a pstnode and its associated mpunode when geometry is passed for rendering. This communication never touches the proxy xpnode process; it occurs entirely within the Cray MPI domain and so suffers no change in performance as a result of removing Cray MPI from the express user interface. The amount of data sent by the non-MPI express process via the socket is in general small because it consists mainly of command-and-control messages from the express user interface. The global scene-graph information sent from express to the mpunode render processes is also small because most of the geometry is generated by the pstnode processes.


George Leaver 2010-07-29