Fluidity-ICOM: High Performance Computing Driven Software Development for Next-Generation Modelling of the World's Oceans
The aim of this Distributed Computational Science and Engineering (dCSE) project was to improve the performance of the three-dimensional non-hydrostatic parallel ocean model Fluidity-ICOM. The Fluidity-ICOM code uses control volume finite element discretisation methods on meshes which may be unstructured in all three dimensions and which may also adapt to optimally resolve solution dynamics. This project has transformed Fluidity-ICOM from a code primarily used on institution-level clusters, typically with 64 tasks per simulation, into a scalable code which runs efficiently on 4096 cores of the current HECToR hardware (Cray XT4 Phase2a). Fluidity-ICOM has been parallelised with MPI and optimised for HECToR alongside continual in-depth performance analysis.
The following list highlights the major developments:
- The matrix assembly code has been optimised, including blocking. Fluidity-ICOM now supports block-CSR storage for the assembly and solution of vector fields and DG fields.
- Interleaved I/O has been implemented for the vtu output. Performance analysis with the gyre test case has so far shown no improvement. The parallel I/O strategy has not yet been applied to the mesh file output, as the final file format has not yet been decided.
- An optimal renumbering method for parallel linear solver performance has been implemented (provided via the PETSc interface). In general, Reverse Cuthill-McKee is recommended for best performance.
- Fluidity-ICOM has relatively complex dependencies on third-party software; several environment modules were therefore created so that HECToR users can easily set up the software environment and install Fluidity-ICOM on HECToR.
- The differentially heated rotating annulus benchmark was used to evaluate the scalability of mesh adaptivity. A scalability analysis of both the parallel mesh optimisation algorithm and of the complete GFD model was performed. This allows the performance of the parallel mesh optimisation method to be evaluated in the context of a "real" application.
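To illustrate why block-CSR storage pays off for vector fields, the sketch below compares scalar CSR with block-CSR (BSR) for a matrix whose entries couple two components per node. This is an illustration of the storage format only, using SciPy and a made-up 3-node block sparsity pattern; it is not the Fluidity-ICOM or PETSc code path.

```python
import numpy as np
from scipy.sparse import bsr_matrix, csr_matrix

# Hypothetical 3-node mesh with 2 field components per node: the 6x6
# matrix consists of dense 2x2 blocks wherever two nodes are coupled.
mask = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1]])                       # node-to-node coupling
block = np.arange(1, 5).reshape(2, 2)              # a fully dense 2x2 block
dense = np.kron(mask, block)                       # expand to scalar entries

A_csr = csr_matrix(dense)                          # one column index per entry
A_bsr = bsr_matrix(dense, blocksize=(2, 2))        # one column index per block

# 28 vs 7: BSR stores one index per 2x2 block instead of one per scalar
# entry, shrinking the index arrays by the block area (4x here).
print(A_csr.indices.size, A_bsr.indices.size)
```

Beyond the smaller index arrays, the dense block kernels improve cache reuse during assembly and matrix-vector products, which is where the benefit for vector and DG fields comes from.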
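The effect of Reverse Cuthill-McKee renumbering can be sketched with SciPy. This is an illustration of the idea only: Fluidity-ICOM obtains the ordering through the PETSc interface, and the small scrambled path graph below is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# A path graph of 8 nodes whose labels have been scrambled, so the
# adjacency matrix has a large bandwidth under the original numbering.
edges = [(0, 7), (7, 3), (3, 5), (5, 1), (1, 6), (6, 2), (2, 4)]
rows, cols = [], []
for i, j in edges:
    rows += [i, j]
    cols += [j, i]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(8, 8))

def bandwidth(m):
    """Maximum distance of any nonzero from the diagonal."""
    m = m.tocoo()
    return int(np.abs(m.row - m.col).max())

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm]                    # apply the new numbering

# RCM clusters the nonzeros near the diagonal (bandwidth 7 -> 1 for
# this path when an endpoint is numbered first), which improves data
# locality in sparse solvers.
print(bandwidth(A), bandwidth(A_rcm))
```

Tighter bandwidth means matrix rows touch nearby entries of the solution vector, improving cache behaviour in the sparse matrix-vector products that dominate Krylov solver time.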
Extensive profiling has been performed with several benchmark test cases using CrayPAT and VampirTrace:
- Automatic profiling proved not to be very useful for large test cases, but its MPI statistics remained very useful and helped to identify problems with surface labelling that caused large overheads in CrayPAT. Issues with PETSc instrumentation remain ongoing.
- VampirTrace (GNU version) proved useful for tracing the mesh adaptivity code, and several interesting results were obtained.
- Profiling real-world applications proved to be a major challenge. It required a considerable understanding of the profiling tools and extensive knowledge of the software itself. Manual instrumentation was introduced in order to focus on specific sections of the code. Determining a suitable way to reduce the profiling data size without losing fine-grained detail was critical to successful profiling. Inevitably this procedure involved much experimentation, requiring large numbers of profiling runs.