Partitioning the Unstructured Grid

To partition the unstructured Gambit mesh we need to assign each cell to an appropriate MPI process in the communicator. This assignment depends upon the number of MPI processes with which the end simulation is to be run. One suitable approach is to use an external partitioning package such as METIS [7], [9] or Scotch [8], [10]. Alternatively, we may use a simpler method of structured partitioning along one, two or three Cartesian directions.
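To make the graph-based route concrete, the sketch below partitions the dual graph of the mesh (cells as vertices, shared faces as edges) with METIS. It assumes the METIS 5.x C API and a CSR adjacency already built from the Gambit connectivity; the routine name is illustrative and does not appear in the CABARET source.

\begin{verbatim}
#include <metis.h>
#include <stdio.h>

/* Minimal sketch: partition the dual graph of the mesh into nparts
 * pieces with METIS 5.x.  xadj/adjncy hold the cell adjacency in
 * CSR form; building them from the Gambit connectivity is not
 * shown here.  On success, part[i] gives the rank owning cell i. */
int partition_cells(idx_t ncells, idx_t *xadj, idx_t *adjncy,
                    idx_t nparts, idx_t *part)
{
    idx_t ncon = 1;      /* one balance constraint (cell count) */
    idx_t objval;        /* edge-cut returned by METIS */

    int rc = METIS_PartGraphKway(&ncells, &ncon, xadj, adjncy,
                                 NULL, NULL, NULL,  /* no weights */
                                 &nparts, NULL, NULL,
                                 NULL,              /* default options */
                                 &objval, part);
    if (rc != METIS_OK) {
        fprintf(stderr, "METIS_PartGraphKway failed (%d)\n", rc);
        return -1;
    }
    return 0;
}
\end{verbatim}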

The choice between an external partitioning package and a structured decomposition depends upon the geometry in question. For our test case within this project (the backward-facing step geometry), subsequent results will show that a structured decomposition is most efficient. However, for the arbitrary geometries that may be considered in future CABARET simulations, the code must be able to handle unstructured decompositions.
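For a geometry such as the backward-facing step, a structured decomposition can be as simple as binning cells into equal-width slabs along one Cartesian direction by cell-centroid coordinate. The following is a minimal sketch; the routine name, the centroid array and the bounds are illustrative assumptions, not taken from the code.

\begin{verbatim}
/* Minimal sketch of a 1-D structured decomposition: cells are
 * binned by the x-coordinate of their centroid into nparts slabs
 * of equal width.  xc[] (cell centroids) and the bounds xmin/xmax
 * are assumed to come from the mesh reader. */
void slab_partition(int ncells, const double *xc,
                    double xmin, double xmax,
                    int nparts, int *part /* out: cell -> rank */)
{
    double width = (xmax - xmin) / nparts;
    for (int i = 0; i < ncells; i++) {
        int p = (int)((xc[i] - xmin) / width);
        if (p >= nparts) p = nparts - 1;  /* clamp cells on xmax */
        part[i] = p;
    }
}
\end{verbatim}

Note that equal-width slabs balance the cell count only when the mesh is roughly uniform along the chosen direction; binning cells by their sorted centroid rank instead would balance the counts exactly.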

The communication routines within the code have been designed to accommodate unstructured decompositions, if required. A separate (pre-processing) partitioning routine has also been developed; it can either process a decomposition from an external package or generate a structured one. Because the Gambit grid is unstructured, its data are non-contiguous, so the partitioner must either allocate enough global memory to read in the entire grid at once or read it in separate parts. In practice it is more efficient to read in the entire grid on a single XT/XE node and then use a shared-memory approach.
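The sketch below illustrates the shared-memory idea under the same illustrative assumptions as above: the full cell-to-partition map is held once in node memory, and each MPI rank scans it for the cells of its own partition with an OpenMP-threaded loop, so almost no inter-rank communication is needed.

\begin{verbatim}
#include <mpi.h>
#include <omp.h>

/* Minimal sketch of the hybrid pattern: part[] (cell -> partition)
 * is shared in node memory, and each MPI rank counts the cells of
 * "its" partition.  The function name is illustrative. */
int count_my_cells(int ncells, const int *part)
{
    int rank, n = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each thread scans a chunk of the shared array; counts are
     * combined with an OpenMP reduction, not MPI messages */
    #pragma omp parallel for reduction(+:n) schedule(static)
    for (int i = 0; i < ncells; i++)
        if (part[i] == rank)
            n++;
    return n;
}
\end{verbatim}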

The partitioner has therefore been implemented with hybrid OpenMP/MPI parallelism for multi-core efficiency. The end application is capable of generating the partition data for a $ 5.12\times10^7$ cell Gambit mesh for 250 MPI partitions within 29 s wall-clock time, using 240 cores over 10 fully populated Phase 2b nodes (XT6: 24-core Opteron 6172 2.1 GHz processors, arranged as two dual hex-core sockets per node in a non-uniform memory architecture). It can be run with up to as many cores as requested partitions, and it exhibits linear scalability since the algorithm involves little communication.
