Fluidity-ICOM: High Performance Computing Driven Software Development for Next-Generation Modelling of the World's Oceans

The aim of this Distributed Computational Science and Engineering (dCSE) project was to improve the performance of the three-dimensional non-hydrostatic parallel ocean model Fluidity-ICOM. The Fluidity-ICOM code uses control volume finite element discretisation methods on meshes which may be unstructured in all three dimensions and which may also adapt to optimally resolve solution dynamics. This project has transformed Fluidity-ICOM from a code that was primarily used on institution-level clusters, typically with 64 tasks per simulation, into a high-performance, scalable code which can be run efficiently on 4096 cores of the current HECToR hardware (Cray XT4 Phase2a). Fluidity-ICOM has been parallelised with MPI and optimised for HECToR alongside continual in-depth performance analysis.

The following list highlights the major developments:

  • The matrix assembly code has been optimised, including blocking. Fluidity-ICOM now supports block compressed sparse row (block-CSR) storage for the assembly and solution of vector fields and discontinuous Galerkin (DG) fields.
  • Interleaved I/O has been implemented for the vtu output. Performance analysis was carried out with the gyre test case; so far, no performance improvement has been observed. The parallel I/O strategy has not yet been applied to the mesh file output, as the final file format has not yet been decided.
  • An optimal renumbering method for parallel linear solver performance has been implemented (provided via the PETSc interface). In general, Reverse Cuthill-McKee is recommended for the best performance.
  • Fluidity-ICOM has relatively complex dependencies on third-party software. Several modules were therefore created so that HECToR users can easily set up the software environment and install Fluidity-ICOM on HECToR.
  • The differentially heated rotating annulus benchmark was used to evaluate the scalability of mesh adaptivity. A scalability analysis of both the parallel mesh optimisation algorithm and the complete geophysical fluid dynamics (GFD) model was performed. This allows the performance of the parallel mesh optimisation method to be evaluated in the context of a "real" application.
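
To illustrate why a renumbering such as Reverse Cuthill-McKee can help a sparse linear solver, the following is a minimal, self-contained sketch in Python. It is not the PETSc routine used by Fluidity-ICOM; the function names and the small scrambled path graph are assumptions made purely for illustration. The sketch reorders the nodes of a graph with a degree-sorted breadth-first search, reverses the result, and compares the matrix bandwidth before and after.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency):
    """Return a Reverse Cuthill-McKee ordering of an undirected graph.

    adjacency[i] lists the nodes coupled to node i. The returned list
    gives the old node indices in their new sequence.
    """
    n = len(adjacency)
    degree = [len(neighbours) for neighbours in adjacency]
    visited = [False] * n
    order = []
    # Handle each connected component, starting from a low-degree node.
    for start in sorted(range(n), key=lambda i: degree[i]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            node = queue.popleft()
            order.append(node)
            # Enqueue unvisited neighbours in order of increasing degree.
            for nb in sorted(adjacency[node], key=lambda i: degree[i]):
                if not visited[nb]:
                    visited[nb] = True
                    queue.append(nb)
    order.reverse()  # the "reverse" step of Reverse Cuthill-McKee
    return order


def bandwidth(adjacency, order):
    """Largest distance between coupled nodes under the given ordering."""
    position = {old: new for new, old in enumerate(order)}
    return max(
        (abs(position[i] - position[j])
         for i, nbs in enumerate(adjacency) for j in nbs),
        default=0,
    )


# Hypothetical example: a simple chain of 8 nodes with scrambled labels.
path = [0, 4, 1, 5, 2, 6, 3, 7]
adjacency = [[] for _ in path]
for a, b in zip(path, path[1:]):
    adjacency[a].append(b)
    adjacency[b].append(a)

order = reverse_cuthill_mckee(adjacency)
before = bandwidth(adjacency, list(range(len(path))))
after = bandwidth(adjacency, order)
print(before, after)
```

A smaller bandwidth means that the non-zero entries of the matrix sit closer to the diagonal, which tends to improve cache reuse in sparse matrix-vector products and the quality of incomplete factorisations. How much this helps in practice depends on the mesh and the solver configuration.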

Extensive profiling has been performed with several benchmark test cases using CrayPAT and VampirTrace:

  • Automatic profiling proved to be of limited use for large test cases, although the MPI statistics it produces are still very useful; they also helped to identify the problems with surface labelling which cause large overheads for CrayPAT. There are still ongoing issues with PETSc instrumentation.
  • VampirTrace (GNU version) proved to be useful for tracing the mesh adaptivity part of the code, and several interesting results were obtained.
  • Profiling real-world applications has proved to be a big challenge. It required a considerable understanding of the profiling tools and extensive knowledge of the software itself. Manual instrumentation had to be introduced in order to focus on specific sections of the code. Determining a suitable way to reduce the size of the profiling data without losing fine-grained detail was critical for successful profiling. Inevitably, this procedure involved much experimentation and a large number of profiling runs.
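
The idea of manual instrumentation that focuses on specific code sections while keeping the profile small can be sketched as follows. This is a generic Python illustration only; it does not use the CrayPAT or VampirTrace interfaces, and the names `region`, `report`, "assemble" and "solve" are hypothetical. Instead of recording every event, it accumulates a total time and a call count per named region, which keeps the amount of profiling data independent of the number of calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated [wall-clock seconds, call count] for each named region.
_totals = defaultdict(lambda: [0.0, 0])


@contextmanager
def region(name):
    """Time one named section of code and add it to the running totals."""
    start = time.perf_counter()
    try:
        yield
    finally:
        entry = _totals[name]
        entry[0] += time.perf_counter() - start
        entry[1] += 1


def report():
    """Return (name, total_seconds, calls) tuples, most expensive first."""
    rows = [(name, total, calls) for name, (total, calls) in _totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical usage: wrap only the phases of interest in each time step.
for _ in range(3):
    with region("assemble"):
        sum(i * i for i in range(10_000))
    with region("solve"):
        sum(i for i in range(1_000))

summary = report()
for name, total, calls in summary:
    print(f"{name}: {total:.6f} s over {calls} calls")
```

The trade-off is that per-call detail is lost; a tracing tool is still needed when the timing of individual events matters.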

Please see PDF or HTML for a report which summarises this project.