Mesh reordering in Fluidity using Hilbert space-filling curves

 

Mark Filipiak

EPCC, University of Edinburgh

March 2013

Abstract

Fluidity is open-source, multi-scale, general purpose CFD model.  It is a finite element model using an unstructured mesh which is adapted during the course of a simulation.  Reordering of the mesh elements and vertices has been implemented, reordering occurring at input and at each adapt step.  For an OpenMP simulation using all 4 UMA regions in a single HECToR node, a 3‒5% performance improvement is seen overall, with a 5‒20% improvement in the threaded element assembly and solve steps.

 

1.      Introduction. 2

2.      Objectives and outcome. 2

3.      Renumbering methods. 2

Nested dissection. 4

Hilbert space-filling curve. 4

4.      Renumbering libraries. 4

5.      Implementation. 5

6.      Results. 6

Benchmark test. 10

7.      Impact. 13

8.      Conclusions. 13

Acknowledgements. 13

References. 13

 

1.     Introduction

Fluidity is an open-source, multi-scale, general purpose CFD model developed by the Applied Modelling and Computation Group (AMCG) at Imperial College.  The focus for development of Fluidity is ocean modelling.  Fluidity is capable of resolving flows simultaneously on global, basin, regional, and process scales, enabling oceanographic research that is not possible with other models.  It is an adaptive model that can optimise the number and placement of computational degrees of freedom dynamically through the course of a simulation.  The long term aim is the capability to simulate the global circulation and to resolve selected coupled dynamics down to a resolution of 1km – in particular, ocean boundary currents, convection plumes, and tidal fronts on the NW European shelf.

Fluidity currently runs on a range of computing platforms, from desktop workstations to HPC platforms, such as HECToR.  Fluidity is currently limited for the size of problems that are being investigated, especially on systems with NUMA (Non Uniform Memory Access) multicore processors, of which HECToR is an example.  The purpose of this project is to improve the scaling behaviour of Fluidity by implementing memory access techniques to better exploit the hybrid OpenMP/MPI code within the finite element assembly and solve routines.

2.     Objectives and outcome

The key objective of this project was to improve the ordering of topological and computational meshes in Fluidity, by adding new features in Fluidity's decomposition software, fldecomp, and adding mesh renumbering code in the central Fluidity trunk code.  Correct functionality would be checked with new tests added to the existing suite of 22,000 unit and regression tests.  NUMA performance would be profiled and benchmarked.

A subsidiary objective was to improve the colouring strategy used in the hybrid assembly loops to allow non-dependent element assembly to be performed concurrently.  The existing strategy adversely affects cache coherency.

The work to add the mesh renumbering code in the central Fluidity trunk code was technically challenging and took much longer than planned in the proposal.  This meant that the scope of the key objective was reduced and the subsidiary objective was not achieved.  Only Hilbert space-filling curve renumbering was implemented; other methods can be readily added to the code.  The renumbering is so central in the code that the existing regression tests were sufficient to test correct functionality.  The renumbering is applied to the topographical mesh only.  Discontinuous Galerkin and higher order elements are created at run time and the quality of their numbering is strongly dependent on the quality of the numbering of the topographical mesh.  So that we can access the benefit of renumbering directly, we focus on the backward facing step test case, which uses a first-order, piecewise linear, continuous Galerkin discretisation

3.     Renumbering methods

The “first touch” memory policy is implemented in Fluidity to ensure memory pages are bound to local memory on NUMA architectures (e.g. the array of velocities at the nodes of the computational mesh).  It is assumed that thread/process affinity is specified in the user environment (aprun enforces this on HECToR for example).  However, this does not translate into locality of memory access during the assemble step if the data for the physical nodes comprising an element are not located close to each other in memory, nor, during the solve step, if the physical neighbours of a node are not located close to each other in memory.

In general, mesh construction methods are designed to produce correct meshes rather than meshes with good data locality.  Even if the mesh generator renumbers the mesh before outputting, subsequent domain decomposition for MPI can degrade data locality.  Reordering the mesh in the Fluidity data structures can improve the data locality so that during assembly, the data for each element is all close in memory and, during solution, the node-to-node memory distance is small, reducing matrix band widths.

Many methods have been used for choosing a good ordering of nodes in a computational mesh (e.g. nested dissection, quadtree/octree decomposition, space-filling  curves) or in the resulting matrix to be solved (e.g. reverse Cuthill-McKee ordering) [1].  The methods are either ‘top-down’ or ‘bottom-up’.  A typical top-down method is the nested dissection method, where the mesh is split recursively down to a chosen minimum size and the resulting binary tree is traversed in depth-first order to give an ordering of the minimum-size meshes.  With an arbitrary ordering of the nodes in each minimum-size mesh this will give an ordering of all the nodes.  A typical bottom-up method is the Hilbert space-filling curve method, where a (finite) Hilbert curve (a mapping from [0,1]  into the unit square or cube, see Figure 1) is computed for the physical domain covered by the mesh.  The Hilbert coordinates of the nodes then give an ordering of the nodes [2].

Figure 1:  a. 5th order Hilbert curve in 2D [from en.wikipedia.org, author António Miguel de Campos].  b. 3rd order Hilbert curve in 3D [from hr.wikipedia.org, author Stevo-88].

 

The traversal of the hierarchical structure of the dissection or of the curve gives the desired locality of nodes in the ordering.  Nodes that are close in the ordering will be close in the mesh or in the physical domain.  However, there will be ‘boundary’ nodes at all scales which are close in the mesh or physical domain but which are far apart in the ordering;  this cannot be avoided.  Both methods have a hierarchical structure in physical space, so that there is locality at all scales.  The locality property is just what is needed for ensuring physical locality corresponds to memory locality.  The hierarchical property means that the methods can be generally applied without tuning to the actual number of UMA regions.  The hierarchical structure also means that the resulting ordering will also have good cache locality, no matter what size or number of levels of cache.

The two methods have their advantages and disadvantages:

Nested dissection

Advantages

·         Mesh-based, so will work with any topology, e.g. thin spherical shells (oceans).

Disadvantages

·         Choice of sub-mesh size at which to stop the recursion will depend on the cache size of the processor but usually be much smaller than the size of a UMA region so will still be ‘NUMA-oblivious’.

·         May be slow to calculate the bisections.

Hilbert space-filling curve

Advantages

·         Assuming the machine precision is large enough to generate a high order approximation to the Hilbert curve, locality will be achieved at all scales, so the ordering will be cache-oblivious and NUMA oblivious.

·         Can be fast to calculate the ordering.

Disadvantages

·         Ordering is based on the physical coordinates and may give poor results for physical domains with extreme aspect ratios or that are embedded in a higher dimension space, e.g. spherical shells (oceans).  Recent work on mesh traversal by a self-avoiding walk [3] [4] may lead to a mesh-based analogue of the space-filling curve but it is not clear that it will have the desired hierarchical structure.

4.     Renumbering libraries

Because of the modularity of the task, and its wide application, renumbering methods have been implemented in many high performance software libraries.  In Table 1 we list the libraries considered.

Table 1:  Possible software libraries for renumbering

Library

Open source

Parallel

Comments

Boost graph library

yes

yes

Several sparse matrix ordering schemes (only in the serial library).

Leda

no

no

No ordering or decomposition  algorithms.

Scotch

yes

yes

Recursive dissection ordering.  Scotch is included in Zoltan.

Zoltan

yes

yes

Contains Scotch and ParMetis, and Hilbert space-filling curve ordering.

Chaco

yes

no

Partitioning only.

Jostle

no

evaluation version only

Partitioning only.

ParMetis

yes

yes

Recursive dissection ordering.  ParMetis is included in Zoltan.

 

Zoltan is already integrated into the Fluidity code for load balancing and Zoltan provides Hilbert space-filling curve ordering and recursive dissection ordering (via its 3rd party libraries ParMetis and Scotch).  Thus Zoltan was the obvious choice of library to use.  During the project there was time only to evaluate the Hilbert space-filling curve method.  However, the literature indicates that this is a particularly good choice cache-oblivious choice for unstructured computation [5].  The implementation of reordering in Fluidity has been structured so that it is easy to add new ordering methods and recursive dissection can be added as an option in the future.

5.      Implementation

The main computation in Fluidity consists of the assembly and solution of the momentum (velocity), pressure, and transport equations.  The velocity, pressure and tracer fields are defined on the same elements as the basic mesh but can have different order (e.g. quadratic elements) and different continuity.  The meshes for these fields are derived from the basic mesh and will have the same elements and element connectivity but can have different numbers of nodes and node connectivity and require ordering separately.  Colleagues at Imperial College, London are developing a numbering scheme for such derived meshes but this still requires that the basic mesh has a good ordering to start with.

In Fluidity, data structures are hierarchical, with a “state” variable holding the current simulation state.  This state variable holds multiple fields, each of which carry information about the discretisation they are using, which in turn carries information on the topological mesh.  In addition, any decomposition must also work with extruded and periodic meshes.  Extruded meshes are commonly used in oceanic simulations, where a 2-D surface mesh is created and then extruded downwards to a known bathymetry to create either a z-layered or σ-layered mesh.  As such, only the 2-D mesh is decomposed, but the full 3-D mesh must still be ordered correctly.  Periodic meshes contain connectivity between elements on one or more of the domain edges, which must be taken into account when ordering elements.

An initial step was to convert the code to use the Fluidity-based tool flredecomp instead of the stand-alone tool fldecomp for the initial decomposition of a mesh when running in parallel.  Then the reordering code could be added to Fluidity only.  The regression tests were converted to use flredecomp and some bugs uncovered by this change were corrected.

The positions of the mesh vertices and element centres are re-ordered together so that the element ordering matches the node ordering.  This gives a permutation to be applied to the meshes and positions (and auxiliary data structures).  These reordered meshes are used as normal in Fluidity to derive all the other meshes on which the fields are defined.

There are three cases where the topological mesh should be renumbered:

·         when the initial mesh is input, unless some input data (e.g., from a checkpoint) is defined on the initial mesh;

·         after the mesh is adapted: this is the easiest case and the new mesh can always be re-ordered since all the fields are interpolated from the old meshes to the new meshes.

·         in the middle of load balancing, after the basic mesh has been load balanced but before the fields have been re-distributed across the processors.  In this case, the mapping from ‘universal’ (over all MPI processes) numbers to ‘global’ (local to one MPI process) numbers also has to be permuted.

The mesh in each process is reordered separately, within each process.  This is sufficient to improve the NUMA and cache performance but does mean that the halo regions, shared between processes, cannot be reordered:  Fluidity exploits a fixed order in the haloes to overlap communications and computation.

The code is now available as a branch of the Fluidity code (https://code.launchpad.net/~fluidity-core/fluidity/hilbert) and is currently being reviewed for merging into the main development trunk.  It will be made generally available at the first available release of Fluidity.

6.     Results

The code was developed on a local machine that is the equivalent of 2 nodes of HECToR Phase 2.  The regression tests were also performed on this machine.  Using reordering is sometimes faster and sometimes slower for these tests.  Over the 50 largest tests, assembly using the reordered meshes took 95-105% of the time for the unordered meshes, depending on the test.  The velocity solve using the reordered meshes takes 90-120% of the time for the unordered meshes, depending on the test.  However, these tests are all much smaller than real-world simulations, and in many cases the input mesh already has a good ordering.  Example meshes, original and reordered, are shown in Figure 2, Figure 3, and Figure 4.  The re-ordering of the mesh in each process does not affect the halo regions shared between processes, as shown in Figure 5.

 

Figure 2: Mesh for 2D tracer diffusion (test case prescribed_adaptivity):  a. Original ordering of the coordinate mesh nodes.  b. Hilbert ordering of the coordinate mesh nodes.

 </