The HECToR Service is now closed and has been superseded by ARCHER.

Developing Hybrid OpenMP/MPI Parallelism for Fluidity-ICOM - Next generation Geophysical Fluid Modelling Technology

This was the third Distributed Computational Science and Engineering (dCSE) project for improving the three-dimensional non-hydrostatic parallel ocean model Fluidity-ICOM. The Fluidity-ICOM code uses control volume finite element discretisation methods on meshes which may be unstructured in all three dimensions and which may also adapt to optimally resolve solution dynamics. This work enhances the multi-core performance of the code, to enable increased scalability and efficiency on HECToR. It also builds upon the success of the two previous Fluidity-ICOM dCSE projects.

The long term aim is for Fluidity-ICOM to be capable of simulating the global circulation and to resolve selected coupled dynamics down to a resolution of O(1km) - in particular, ocean boundary currents, convective plumes and tidal fronts on the NW European shelf. This project will move Fluidity-ICOM closer to that goal by improving the multi-core performance of the code for HECToR and future many-core architectures. In particular, the following will be performed:

  • Implementation of MPI/OpenMP mixed-mode parallelisation of the finite element assembly stage in Fluidity-ICOM.
  • Optimisation of the HYPRE Library usage for linear preconditioners/solvers for large core counts.
  • Benchmarking and code re-engineering for hybrid mesh adaptivity.

The following list highlights the major developments:

  • Most of the matrix assembly can be parallelised with OpenMP by using an efficient node-colouring method, which avoids mutual-exclusion synchronisation directives such as CRITICAL.
  • PETSc matrix stashing (non-local assembly) performs no redundant calculations. However, it does incur the cost of maintaining and communicating stashed rows, and this overhead increases with the MPI process count. A further complication of non-local assembly is that the stashing code within PETSc is not thread safe.
  • Local assembly has the advantage of requiring no MPI communication, as everything is performed locally, and the benchmark results show that its redundant calculations do not significantly impact performance. Furthermore, local assembly scales significantly better than non-local assembly at higher core counts. Assembly can therefore be treated as an inherently local process, so the focus is on optimising performance local to the compute node.
  • The current OpenMP standard (3.0), which has been implemented by most popular compilers, does not cover page placement at all. For memory-bound applications like Fluidity-ICOM, it is therefore important to ensure that memory gets mapped into the locality domains of the processors that actually access it, to minimise NUMA traffic. In addition to our implementation of a first-touch policy, which improves data locality, thread pinning can be used to guarantee that threads execute on the cores which initially mapped their memory regions, in order to maintain locality of data access.
  • For the Fluidity-ICOM matrix assembly kernels, the performance bottleneck becomes memory allocation for automatic arrays. Using a NUMA-aware heap manager, such as TCMalloc, allows the pure OpenMP version to outperform the pure MPI version.
  • Benchmark results with HYPRE and threaded PETSc show that the mixed-mode MPI/OpenMP version can outperform the pure MPI version at high core counts, where I/O becomes the major bottleneck for the pure MPI version.
  • With mixed-mode MPI/OpenMP, Fluidity-ICOM can now run on up to 32K cores, giving Fluidity the capability to tackle "grand-challenge" problems.
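The colouring idea behind the threaded assembly can be illustrated with a minimal sketch in C. This is not Fluidity-ICOM code: the 1D mesh, the `assemble` function and the even/odd colouring are hypothetical stand-ins. The point is that elements of the same colour share no nodes, so each coloured sweep can update the global array in parallel without CRITICAL sections or atomics.

```c
/* Illustrative 1D mesh: NELEM linear elements, element e touching
 * nodes e and e+1. Names and mesh are hypothetical, not Fluidity's. */
#define NELEM 8
#define NNODE (NELEM + 1)
#define NCOLOUR 2

/* Colour-based assembly: within one colour, no two elements share a
 * node, so the inner loop is safe to thread with no synchronisation. */
void assemble(double node_sum[NNODE]) {
    int colour[NELEM];
    for (int e = 0; e < NELEM; e++)
        colour[e] = e % 2;              /* greedy even/odd colouring */
    for (int n = 0; n < NNODE; n++)
        node_sum[n] = 0.0;

    for (int c = 0; c < NCOLOUR; c++) { /* sweep colour by colour */
        #pragma omp parallel for
        for (int e = 0; e < NELEM; e++) {
            if (colour[e] != c) continue;
            double elem_contrib = 1.0;  /* stand-in for a local element matrix entry */
            node_sum[e]     += elem_contrib; /* no node is written twice in one sweep */
            node_sum[e + 1] += elem_contrib;
        }
    }
}
```

On a real unstructured mesh the colouring is computed with a greedy graph-colouring pass over the element connectivity rather than the trivial even/odd pattern above, but the loop structure is the same: an outer serial loop over colours and an inner parallel loop over elements.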
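The first-touch policy mentioned above can also be sketched briefly (again illustrative, not Fluidity code; the function name and array are invented). Under Linux's first-touch placement, the thread that first writes a page causes the OS to allocate that page in the thread's own NUMA domain, so initialising the data with the same static loop schedule that the compute loop uses keeps each thread's accesses NUMA-local.

```c
#include <stdlib.h>

#define N (1 << 20)

/* First-touch sketch: parallel initialisation with the same static
 * schedule as the compute loop, so each thread later reads pages it
 * placed in its own NUMA domain. */
double sum_with_first_touch(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return -1.0;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 1.0;        /* first touch: pages placed near this thread */

    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (long i = 0; i < N; i++)
        s += a[i];         /* same schedule: mostly NUMA-local reads */

    free(a);
    return s;
}
```

First touch only helps if threads stay where they started, which is why it is paired with thread pinning; on a Cray system this would typically be done at launch time (for example with `aprun`'s CPU-affinity options) or via the compiler runtime's affinity settings.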

Please see the PDF or HTML version of the report which summarises this project.