next up previous contents
Next: Summary Up: MPI Forwarding Previous: XPMT Performance   Contents

Rank Placement

To simplify the aprun command needed to start the various MPI processes (the xpnode proxy and several pstnode and mpunode processes), we merge the functionality of these processes as follows. A new pstmpunode executable is created by merging pstnode and mpunode. A pstmpunode process behaves as the MPI proxy if its rank is 0; otherwise it behaves as a pstnode if its rank is even or as an mpunode if its rank is odd. We retain the pstnode executable and allow it, too, to behave as the MPI proxy when running as rank 0. It can also operate as a dummy process to help rank placement (discussed below). The combined pstmpunode executable allows a simple aprun command to be used. For example, the command for a dataset decomposed into 15 domains and rendered by 15 rendering processes is:

aprun -n 31 -N <mppnppn> pstmpunode <args>
where <args> are the command line options that the MPI proxy process needs in order to contact the express process running on a login node (switches specifying the hostname and port number). Note that, in this example, the MPI proxy (rank 0) prevents us from using all 32 cores in the 32-core queue: the pstmpunode processes other than the proxy must be even in number (so that half can act as pstnode processes and half as mpunode processes), and using all 32 cores would leave 31 of them. Hence in this example we use $15+15+1 = 31$ processes (the $1$ being the MPI proxy rank 0 process).
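The rank-to-role rule and the process count can be sketched as follows. This is an illustrative sketch only, not code from pstmpunode; the function names `role_for_rank` and `total_processes` are hypothetical.

```python
def role_for_rank(rank):
    """Return the role a pstmpunode process takes on for a given MPI rank."""
    if rank == 0:
        return "proxy"      # rank 0 acts as the MPI proxy
    if rank % 2 == 0:
        return "pstnode"    # remaining even ranks behave as pstnode
    return "mpunode"        # odd ranks behave as mpunode


def total_processes(domains):
    """One pstnode-style and one mpunode-style process per domain, plus the proxy."""
    return 2 * domains + 1


# The 15-domain example from the text: ranks 0..30.
roles = [role_for_rank(rank) for rank in range(total_processes(15))]
print(len(roles), roles.count("pstnode"), roles.count("mpunode"))  # prints: 31 15 15
```

The count passed to `aprun -n` is therefore always odd when a single pstmpunode executable is used for every rank.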

AVS/Express DDR automatically assigns each pstnode-style process to an mpunode-style process (both are pstmpunode executables), so that the mpunode-style process, pstmpunode(m), acts as the rendering process for that single pstnode-style process, pstmpunode(p). The odd/even scheme ensures that AVS/Express always pairs pstmpunode(p)$_{2i}$ with pstmpunode(m)$_{2i-1}$. Ideally, the two processes in a pair would be on the same physical backend node so that they communicate using intra-node communication. However, the MPI proxy (rank 0) creates an off-by-one layout in which a pstmpunode may be paired with a process on a different physical node, which may result in a small performance reduction. This off-by-one problem can be overcome at the expense of a few dummy processes. Using the pstnode executable, we can start a number of dummy processes that do not contribute to AVS module processing; they simply pad out the physical node on which the MPI proxy process is running, which allows every pstmpunode pair to be placed on the same physical node. The number of dummy pstnode processes is always mppnppn$-1$. When using dummy pstnode processes we also use pstnode as the MPI proxy process, so the total number of pstnode processes equals the mppnppn value. The small gain in performance may not, however, be worth the cost of running a few dummy processes. With mppnppn=2 in the 32-core queue, the aprun command using dummy pstnode processes becomes:

aprun -n 2 -N 2 pstnode <args> : -n 30 -N 2 pstmpunode
or, with four processes per node (mppnppn=4), which reduces the number of domains to 14:
aprun -n 4 -N 4 pstnode <args> : -n 28 -N 4 pstmpunode
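The arithmetic behind these layouts can be sketched as follows. This is a rough illustration, not part of the ddr script. It assumes that aprun fills nodes with consecutive ranks, so that a rank lands on node `rank // mppnppn`; the helper names `split_pairs` and `dummy_layout` are hypothetical.

```python
def split_pairs(domains, ppn):
    """Without dummies: count pairs (ranks 2i and 2i-1) that straddle two nodes.

    Assumes consecutive ranks fill each node, i.e. node = rank // ppn.
    """
    count = 0
    for i in range(1, domains + 1):
        if (2 * i) // ppn != (2 * i - 1) // ppn:
            count += 1
    return count


def dummy_layout(total_cores, ppn):
    """With dummies: return (domains, aprun command) for a given core count."""
    padding = ppn                   # proxy plus (ppn - 1) dummy pstnode processes
    workers = total_cores - padding
    workers -= workers % 2          # pstmpunode processes must come in pairs
    command = (f"aprun -n {padding} -N {ppn} pstnode <args> "
               f": -n {workers} -N {ppn} pstmpunode")
    return workers // 2, command


print(split_pairs(15, 2))           # prints: 15
print(split_pairs(15, 4))           # prints: 7
print(dummy_layout(32, 2)[1])
print(dummy_layout(32, 4)[1])
```

Under the stated placement assumption, every pair is split across nodes when mppnppn=2 without dummies, and about half are split when mppnppn=4. The last two lines reproduce the two commands given above.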

A ddr shell script is available that generates the PBS job scripts given the number of domains required, the mppnppn setting and whether dummy pstnode processes are to be used to help rank placement. The script submits the express job to the serial queue on the login node. When the express job executes, it submits the pstmpunode job to the relevant parallel queue. The express process then listens for the parallel job to start. The MPI proxy rank 0 process connects to the express process via a socket. At this point the AVS/Express user interface appears and the application can be used. This is similar to the ParaView start-up procedure but is completely automated by the ddr script. The job scripts can also be generated without being submitted if customisation is required.
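The listen-then-connect ordering of this start-up can be illustrated with a generic socket sketch. It is not the AVS/Express protocol: the greeting message, the use of localhost and the function names are assumptions made purely for illustration. In the real set-up the hostname and port are passed to the proxy through <args>.

```python
import socket
import threading


def listen_for_proxy(server_sock, result):
    """'express' side: accept one connection and read until the peer closes."""
    conn, _addr = server_sock.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:
                break
            chunks.append(data)
    result["greeting"] = b"".join(chunks).decode("ascii")


def run_handshake():
    # The listening side binds first, so the port is known before the
    # parallel job (here just a client socket) starts.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    server.listen(1)
    host, port = server.getsockname()

    result = {}
    listener = threading.Thread(target=listen_for_proxy, args=(server, result))
    listener.start()

    # The 'proxy' side connects using the host and port it was given.
    with socket.create_connection((host, port), timeout=5) as proxy:
        proxy.sendall(b"proxy rank 0 ready")  # hypothetical greeting

    listener.join(timeout=5)
    server.close()
    return result.get("greeting")


greeting = run_handshake()
```

The point of the sketch is the ordering: the listener must already be bound before the connecting process is launched, which is why the express job is the one that submits the parallel job.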

George Leaver 2010-07-29