Porting and Optimisation of Code_Saturne on HECToR

 

Zhi Shang, Charles Moulinec, David R. Emerson[1], Xiaojun Gu

 

Computational Science and Engineering Department

Science and Technology Facilities Council, Daresbury Laboratory

Warrington WA4 4AD, UK

 

Abstract

The move towards petaflop computing will require scientific software to run efficiently on many thousands of processors. For computational fluid dynamics, this imposes new challenges. We need to be able to generate very large computational grids, in excess of one billion computational cells, to ensure the processors have enough work. In addition, we need to partition these large computational meshes for efficient execution on these large-scale facilities. As most grid generation codes are serial and proprietary, there is little the user can do. However, the majority of mesh partitioning software is available open-source and this study aims to understand how these codes perform when we need to create an extremely large number of computational domains. In particular, we seek to run our fluid dynamics software on a petascale system with more than 100,000 cores. This work focuses on the open-source software, Code_Saturne, and investigates the issues associated with pre-processing. The mesh partitioning software considered in this report has been restricted to open-source packages, namely Metis, ParMetis, PT-Scotch and Zoltan. Today, Metis is the de facto standard, but it is a sequential code and is therefore limited by memory requirements. Parallel mesh partitioning software, such as ParMetis and PT-Scotch, can overcome this limitation provided the quality of the partition (edge cuts, load balance) remains good. During our study, we found that the time required to partition 121M tetrahedral elements varied with the package, with Metis consistently requiring the least time. However, in all cases, the time to perform the partition was modest and was not found to be a significant issue. In contrast, the memory constraints did vary with the package: PT-Scotch could generate mesh partitions in parallel (up to 131072 domains) using only 16 cores, whereas ParMetis 3.1.1 required a minimum of 32 cores, rising to 512 cores to create the 131072 domains. An analysis of the metrics suggests that the larger number of cores required by ParMetis results in a partition with a poorer load balance. In practice, however, the simulation run time did not reflect this observation and, for up to 1024 cores, ParMetis produced the lowest time to solution. Above 1024 cores, and up to 8192 cores, the sequential version of Metis showed the best speed-up. For 2048 and 4096 cores, PT-Scotch provided better performance than ParMetis. In general, all packages did a reasonable job and it is difficult to identify any specific trends that would lead to one package being clearly superior to the others.

Keywords: Code_Saturne, mesh partitioning, Metis, ParMetis, PT-Scotch, Zoltan, HECToR

Contents

1 Introduction..............................................................................................................................3

2 Porting of mesh partitioning software packages into Code_Saturne...........................5

    2.1 Metis 5.0pre2....................................................................................................................5

    2.2 ParMetis 3.1.1...................................................................................................................6

    2.3 PT-Scotch 5.1....................................................................................................................6

    2.4 Zoltan 3.0..........................................................................................................................6

    2.5 Mesh partitioning quality..................................................................................................7

    2.6 Parallel performance on HECToR....................................................................................9

3 Conclusions............................................................................................................................10

Acknowledgements...................................................................................................................11

References.................................................................................................................................11

1. Introduction

The move towards hardware involving very large numbers of processing units, with state-of-the-art systems exceeding 100,000 cores, is highlighting many issues related to algorithmic scalability. However, for Computational Fluid Dynamics (CFD) software, a new challenge has emerged relating to the pre-processing stage. In common with many engineering topics, the system of equations (for CFD this is the Navier-Stokes equations) must be discretised onto a computational mesh. To run in parallel, the mesh needs to be partitioned into domains of equal size to ensure a good load balance. For structured grids, this is fairly straightforward, but partitioning unstructured grids has always been more challenging. The move to petascale computing has made this challenge very immediate. As the computational meshes would have to be very large to run efficiently on 100,000 cores, the partitioning software would have to run in parallel. The aim of this project was to investigate how the available partitioning software performs when creating the very large numbers of domains required at such core counts.

 

The Computational Fluid Dynamics (CFD) software, Code_Saturne, has been under development since 1997 by EDF R&D (Electricité de France) [1]. The software is based on a collocated Finite Volume Method (FVM) that accepts three-dimensional meshes built with any type of cell (tetrahedral, hexahedral, prismatic, pyramidal, polyhedral) and with any type of grid structure (unstructured, block structured, hybrid). This allows Code_Saturne to model highly complex geometries. It can simulate either incompressible or compressible flows with or without heat transfer and turbulence.

 

From the outset, Code_Saturne was designed as a parallel code and works as follows: the pre-processor reads the mesh file and currently partitions the mesh with Metis or Scotch to produce the input files for the solver. Once the simulation is complete, the output is post-processed and converted into readable files by different visualization packages (such as ParaView). Parallel code coupling capabilities are provided by EDF’s FVM library. Since 2007, Code_Saturne has been open-source and is available to any user [2]. To retain the open-source nature of the proposed work, we have only considered partitioning software that is freely available.

 

One significant advantage of Code_Saturne is its industrial pedigree. The code was originally designed for industrial applications and research activities in several fields related to energy production. These include nuclear power thermal-hydraulics, gas and coal combustion, turbo-machinery, heating, ventilation and air conditioning. To highlight the ability of the code to handle complex geometries, Figure 1 illustrates a CFD simulation of air flow around the Daresbury tower whereas Figure 2 involves water flow around the DARPA-2 submarine.

(a) velocity field

(b) turbulence kinetic energy

Figure 1: air flow around the Daresbury tower

Figure 2: DARPA-2 submarine model

 

 

The submarine test case [3, 4], shown in Figure 2, was chosen to test the partitioning software and the scalability of Code_Saturne on HECToR. Basic details concerning the submarine’s geometry are listed in Table 1.

 

Table 1: Submarine geometry

Geometry length (L)    4.355 m
Body diameter (D)      0.507 m
Exit diameter          0.0075 m
Sail height (h)        0.206 m

 

The flow parameters, briefly listed in Table 2, correspond to the DARPA-2 experiment [3, 4].

 

Table 2: Flow parameters

Free stream velocity                           9 m s^-1
Reynolds number (based on geometry length)     3.89 × 10^7
Outlet pressure                                2.01 × 10^5 N m^-2

 

Table 3 lists the simulation parameters used for Code_Saturne. A standard wall function method was used for the near wall treatment.

 

Table 3: Simulation parameters

Turbulence model
y+                 ≈ 30 (within 25 to 70)
Discretisation     SIMPLE (steady state)
Solver             Algebraic multigrid

 

The statistics of the test mesh are summarised in Table 4.

 

Table 4: Test mesh statistics

Number of cells             195,877
Number of interior faces    404,737
Number of boundary faces    9,324
Number of vertices          47,131

 

 

 

Figure 3 compares the pressure coefficient (Cp) at different cross sections along the submarine's body with experimental data and with results from several other CFD codes.

 

Figure 3: comparisons of pressure coefficients

 

From the comparisons in Figure 3, it can be seen that the results obtained with Code_Saturne are in good agreement with the experimental data, with a range of commercial software packages (Fluent [5], STAR-CD [6] and CFX [7]) and with the open-source software OpenFOAM [8].

 

For this dCSE project, version 2.0.0-beta2 of Code_Saturne was used. Several open-source mesh partitioning packages have been investigated, including Metis, ParMetis, PT-Scotch and Zoltan, all of which were integrated into Code_Saturne 2.0.0-beta2 to analyse their parallel partitioning performance. A very large submarine test case, involving 121,989,150 (121M) tetrahedral cells, was created to test the software.

 

 

 

 

2. Porting of mesh partitioning software packages into Code_Saturne

 

In common with many parallel CFD codes, inter-process communication is carried out by exchanging data via halo cells [9], as indicated in Figure 4. The halo cells represent the inner boundaries between different sub-domains. In this report, each sub-domain is allocated to one core (processor). All of the statistics relating to the quality of the mesh partitioning in this report include the halo cells.

 

 

Figure 4: working mechanism of halo cells in Code_Saturne
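To illustrate how halo data are exchanged, the sketch below shows a generic MPI halo update between neighbouring sub-domains. This is not Code_Saturne's actual implementation (which is handled by EDF's FVM library); the function name, the neighbour lists and the buffer layout are hypothetical and deliberately simplified.

/* Generic halo-exchange sketch (not Code_Saturne's FVM implementation).
 * Each rank sends the cell values lying on an inner boundary to the
 * neighbouring sub-domain and receives that neighbour's values into its
 * halo (ghost) cells. Neighbour lists and buffer sizes are assumed to
 * have been built by the partitioning stage.                           */
#include <stdlib.h>
#include <mpi.h>

void exchange_halo(int n_neighbours,
                   const int *neighbour_rank,  /* MPI rank of each adjacent sub-domain */
                   const int *n_interface,     /* number of cells shared with each one */
                   double **send_buf,          /* [i]: packed interface values to send */
                   double **halo_buf,          /* [i]: ghost values received from [i]  */
                   MPI_Comm comm)
{
    MPI_Request *req = malloc(2 * (size_t)n_neighbours * sizeof(MPI_Request));

    /* Post all receives first, then the matching sends (tag 0). */
    for (int i = 0; i < n_neighbours; i++)
        MPI_Irecv(halo_buf[i], n_interface[i], MPI_DOUBLE,
                  neighbour_rank[i], 0, comm, &req[i]);

    for (int i = 0; i < n_neighbours; i++)
        MPI_Isend(send_buf[i], n_interface[i], MPI_DOUBLE,
                  neighbour_rank[i], 0, comm, &req[n_neighbours + i]);

    /* Wait until every halo has been updated before the solver resumes. */
    MPI_Waitall(2 * n_neighbours, req, MPI_STATUSES_IGNORE);
    free(req);
}

Each sub-domain posts one receive and one send per neighbour, so the number of neighbours and the size of the interfaces, which are exactly the quantities reported in the partition metrics later in this section, determine the communication cost per time step.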

 

 

2.1 Metis 5.0pre2

 

Metis 5.0pre2 is a sequential mesh partitioning package [10] and is widely regarded as the de facto standard. It produces high-quality partitioned meshes for efficient parallel execution. Since it is a serial package, the Metis 5.0pre2 library is linked into the pre-processor stage of Code_Saturne. The METIS_PartGraphKway function of Metis is employed by Code_Saturne to perform the mesh partitioning.

 

The library is linked by setting the following variable in the Code_Saturne installation file:

 

#########################

## Preprocessor Installation ##

#########################

METISPATH=$HOME/metis-5.0pre2
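As an illustration of the call made at the pre-processing stage, the sketch below partitions a small graph with METIS_PartGraphKway. It is written against the released Metis 5.x C API (CSR arrays of idx_t); the 5.0pre2 pre-release used in this work may differ in minor details, and the graph itself is a hypothetical four-cell example rather than a real mesh.

/* Minimal sketch of a METIS_PartGraphKway call (hypothetical stand-alone
 * driver, not the Code_Saturne pre-processor itself).                   */
#include <stdio.h>
#include <metis.h>

int main(void)
{
    /* A 4-vertex ring graph in CSR (xadj/adjncy) form. */
    idx_t nvtxs = 4, ncon = 1, nparts = 2, objval;
    idx_t xadj[]   = {0, 2, 4, 6, 8};
    idx_t adjncy[] = {1, 3, 0, 2, 1, 3, 0, 2};
    idx_t part[4];

    int ret = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                  NULL, NULL, NULL,        /* no weights      */
                                  &nparts, NULL, NULL, NULL,
                                  &objval, part);
    if (ret != METIS_OK)
        return 1;

    printf("edge-cut: %d\n", (int)objval);
    for (idx_t i = 0; i < nvtxs; i++)
        printf("cell %d -> domain %d\n", (int)i, (int)part[i]);
    return 0;
}

In Code_Saturne, the xadj and adjncy arrays are built from the cell-to-cell connectivity of the mesh, nparts is the requested number of sub-domains, and part[] returns the sub-domain assigned to each cell.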

 

 

2.2 ParMetis 3.1.1

 

ParMetis 3.1.1 is the parallel version of the Metis partitioning package [11]. One of the aims of this work is to investigate its ability to produce high-quality mesh partitions in parallel, since for the very large meshes envisaged it will not be possible to use the sequential version of Metis due to memory constraints. The ParMetis 3.1.1 library was introduced into Code_Saturne to enable parallel mesh partitioning, with the ParMETIS_V3_PartKway function employed to perform the partitioning.

 

The libraries are linked, in a slightly different way from the serial version, through the following lines in the ‘runcase’ file of the Code_Saturne Kernel:

 

#####################################################

CS_LIB_ADD_metis="$HOME/ParMetis-3.1.1/libmetis.a"

CS_LIB_ADD_parmetis="$HOME/ParMetis-3.1.1/libparmetis.a"

#####################################################
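The sketch below shows the corresponding distributed call, ParMETIS_V3_PartKway, for a toy graph spread over two MPI ranks. It follows the documented ParMetis 3.x interface (idxtype arrays and a vtxdist array describing how vertices are distributed across processes); the driver itself is hypothetical and is not part of Code_Saturne.

/* Minimal ParMETIS_V3_PartKway sketch; build against ParMetis 3.1.1
 * and run with: mpirun -np 2 ./a.out                                */
#include <stdio.h>
#include <mpi.h>
#include <parmetis.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size != 2) { MPI_Finalize(); return 1; }   /* sketch assumes 2 ranks */

    /* A 4-vertex ring distributed over 2 ranks (2 local vertices each). */
    idxtype vtxdist[3] = {0, 2, 4};
    idxtype xadj[3]    = {0, 2, 4};
    idxtype adjncy0[4] = {1, 3, 0, 2};   /* neighbours of global vertices 0,1 */
    idxtype adjncy1[4] = {1, 3, 2, 0};   /* neighbours of global vertices 2,3 */
    idxtype *adjncy    = (rank == 0) ? adjncy0 : adjncy1;

    int wgtflag = 0, numflag = 0, ncon = 1, nparts = 2;
    int options[3] = {0, 0, 0};          /* default options            */
    float tpwgts[2] = {0.5f, 0.5f};      /* equal target domain weights */
    float ubvec[1]  = {1.05f};           /* 5% imbalance tolerance      */
    int edgecut;
    idxtype part[2];                     /* sub-domain of each local vertex */

    ParMETIS_V3_PartKway(vtxdist, xadj, adjncy, NULL, NULL,
                         &wgtflag, &numflag, &ncon, &nparts,
                         tpwgts, ubvec, options, &edgecut, part, &comm);

    printf("rank %d: local vertices -> domains %d %d (edge-cut %d)\n",
           rank, (int)part[0], (int)part[1], edgecut);

    MPI_Finalize();
    return 0;
}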

 

 

2.3 PT-Scotch 5.1

 

An alternative and promising parallel partitioning package is PT-Scotch 5.1 [12]. Our goal was to assess the quality of the partitions it creates in parallel. The PT-Scotch 5.1 library was introduced through the Kernel of Code_Saturne, with the SCOTCH_dgraphPart function employed to perform the mesh partitioning.

 

The library is linked through the following line in the ‘runcase’ file of the Code_Saturne Kernel:

 

####################################################

CS_LIB_ADD_ptscotch="$HOME/scotch_5.1/lib/libptscotch.a"

####################################################
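A corresponding PT-Scotch sketch is given below for the same toy distributed graph. It follows the documented PT-Scotch 5.1 call sequence (SCOTCH_dgraphInit, SCOTCH_dgraphBuild, SCOTCH_stratInit, SCOTCH_dgraphPart) with a default strategy; the driver is hypothetical and is not taken from Code_Saturne.

/* Minimal PT-Scotch 5.1 sketch; run with: mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>
#include <ptscotch.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) { MPI_Finalize(); return 1; }   /* sketch assumes 2 ranks */

    /* Local part of the distributed graph: 2 vertices per rank, CSR layout. */
    SCOTCH_Num vertloctab[3] = {0, 2, 4};
    SCOTCH_Num edgeloc0[4]   = {1, 3, 0, 2};   /* neighbours of global vertices 0,1 */
    SCOTCH_Num edgeloc1[4]   = {1, 3, 2, 0};   /* neighbours of global vertices 2,3 */
    SCOTCH_Num *edgeloctab   = (rank == 0) ? edgeloc0 : edgeloc1;
    SCOTCH_Num partloctab[2];

    SCOTCH_Dgraph grafdat;
    SCOTCH_Strat  strat;

    SCOTCH_dgraphInit(&grafdat, MPI_COMM_WORLD);
    SCOTCH_dgraphBuild(&grafdat,
                       0,             /* baseval: 0-based numbering  */
                       2, 2,          /* vertlocnbr, vertlocmax      */
                       vertloctab, NULL,
                       NULL, NULL,    /* no vertex weights or labels */
                       4, 4,          /* edgelocnbr, edgelocsiz      */
                       edgeloctab, NULL, NULL);
    SCOTCH_stratInit(&strat);         /* default partitioning strategy */

    SCOTCH_dgraphPart(&grafdat, 2, &strat, partloctab);   /* 2 sub-domains */

    printf("rank %d: local vertices -> domains %d %d\n",
           rank, (int)partloctab[0], (int)partloctab[1]);

    SCOTCH_stratExit(&strat);
    SCOTCH_dgraphExit(&grafdat);
    MPI_Finalize();
    return 0;
}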

 

 

2.4 Zoltan 3.0

 

Zoltan 3.0 is another parallel software package that can be used for mesh partitioning [13]. For the results presented, geometric (RIB) mesh partitioning was used. However, the limited time available prevented an in-depth study, and the Zoltan results should be regarded as preliminary.

 

The Zoltan 3.0 library is introduced through the Kernel of Code_Saturne and is linked via the following line in the ‘runcase’ file:

 

############################################################

CS_LIB_ADD_zoltan="$HOME/Zoltan_v3.0/src/Obj_generic/libzoltan.a"

############################################################
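For completeness, the sketch below shows how a geometric (RIB) partition is requested through Zoltan's query-function interface: the application registers callbacks that return the number of local objects, their identifiers and their coordinates, and Zoltan_LB_Partition returns the import/export lists. The point data and callbacks are hypothetical and the example is not Code_Saturne's own integration.

/* Hypothetical sketch of a geometric (RIB) partition with Zoltan's C API. */
#include <stdio.h>
#include <mpi.h>
#include <zoltan.h>

#define N_LOCAL 4                       /* cells owned by this rank */
static double coords[N_LOCAL][3];       /* cell-centre coordinates  */
static ZOLTAN_ID_TYPE gids[N_LOCAL];    /* global cell ids          */

static int num_obj(void *data, int *ierr)
{ *ierr = ZOLTAN_OK; return N_LOCAL; }

static void obj_list(void *data, int ngid, int nlid,
                     ZOLTAN_ID_PTR gid, ZOLTAN_ID_PTR lid,
                     int wgt_dim, float *wgts, int *ierr)
{
    for (int i = 0; i < N_LOCAL; i++) { gid[i] = gids[i]; lid[i] = i; }
    *ierr = ZOLTAN_OK;
}

static int num_geom(void *data, int *ierr)
{ *ierr = ZOLTAN_OK; return 3; }        /* 3D coordinates */

static void geom_multi(void *data, int ngid, int nlid, int nobj,
                       ZOLTAN_ID_PTR gid, ZOLTAN_ID_PTR lid,
                       int ndim, double *geom, int *ierr)
{
    for (int i = 0; i < nobj; i++)
        for (int d = 0; d < ndim; d++)
            geom[i * ndim + d] = coords[lid[i]][d];
    *ierr = ZOLTAN_OK;
}

int main(int argc, char **argv)
{
    float ver;
    int rank, changes, ngid, nlid, nimp, nexp;
    ZOLTAN_ID_PTR imp_gid, imp_lid, exp_gid, exp_lid;
    int *imp_proc, *imp_part, *exp_proc, *exp_part;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    Zoltan_Initialize(argc, argv, &ver);

    for (int i = 0; i < N_LOCAL; i++) { /* dummy geometry: points on a line */
        gids[i] = rank * N_LOCAL + i;
        coords[i][0] = gids[i]; coords[i][1] = 0.0; coords[i][2] = 0.0;
    }

    struct Zoltan_Struct *zz = Zoltan_Create(MPI_COMM_WORLD);
    Zoltan_Set_Param(zz, "LB_METHOD", "RIB");   /* recursive inertial bisection */

    Zoltan_Set_Num_Obj_Fn(zz, num_obj, NULL);
    Zoltan_Set_Obj_List_Fn(zz, obj_list, NULL);
    Zoltan_Set_Num_Geom_Fn(zz, num_geom, NULL);
    Zoltan_Set_Geom_Multi_Fn(zz, geom_multi, NULL);

    Zoltan_LB_Partition(zz, &changes, &ngid, &nlid,
                        &nimp, &imp_gid, &imp_lid, &imp_proc, &imp_part,
                        &nexp, &exp_gid, &exp_lid, &exp_proc, &exp_part);

    printf("rank %d: %d cells to export to other sub-domains\n", rank, nexp);

    Zoltan_LB_Free_Part(&imp_gid, &imp_lid, &imp_proc, &imp_part);
    Zoltan_LB_Free_Part(&exp_gid, &exp_lid, &exp_proc, &exp_part);
    Zoltan_Destroy(&zz);
    MPI_Finalize();
    return 0;
}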

 

 

2.5 Mesh partitioning quality

 

The submarine geometry previously discussed is the standard test case; for these tests it has 121,989,150 (121M) computational cells. The key factors associated with good parallel performance for unstructured grids are the load balance and the statistics associated with neighbours and halo cells. The partitioning packages considered are Metis 5.0pre2, ParMetis 3.1.1, PT-Scotch 5.1 and Zoltan (RIB) 3.0.

 

Due to memory constraints, all the partitioning with Metis 5.0pre2 was carried out on a 96 GB SGI machine at Daresbury Laboratory [14]. The peak memory used by Metis 5.0pre2 to partition the 121M case into 8192 domains is around 32 GB, which is beyond the memory available on a HECToR Phase 2a processing node. Partitioning the grid in parallel, however, removes this memory limit for large-scale mesh partitioning. In addition to the partitioning tools discussed, Code_Saturne has its own tools for partitioning meshes in parallel. The two approaches available are Space-Filling Curves (SFC) [15] and a simple strategy that just divides the computational cells evenly by processor number. At the time of this study, these tools were in the beta stage of development and testing by EDF and it was not possible to use them to generate domains in parallel.
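Although the SFC tools could not be used here, the underlying idea is straightforward: assign each cell a Morton (Z-order) key computed from its quantised cell-centre coordinates, sort the cells by key, and cut the sorted list into equal-sized chunks. The sketch below illustrates the key computation and the equal-chunk split for the serial case; the function names and the quantisation are hypothetical and this is not Code_Saturne's implementation.

/* Illustration of space-filling-curve (Morton order) partitioning;
 * a serial sketch with hypothetical inputs.                         */
#include <stdint.h>
#include <stddef.h>

/* Spread the lower 21 bits of x so that two zero bits separate each bit. */
static uint64_t spread_bits(uint64_t x)
{
    x &= 0x1fffff;
    x = (x | x << 32) & 0x1f00000000ffffULL;
    x = (x | x << 16) & 0x1f0000ff0000ffULL;
    x = (x | x << 8)  & 0x100f00f00f00f00fULL;
    x = (x | x << 4)  & 0x10c30c30c30c30c3ULL;
    x = (x | x << 2)  & 0x1249249249249249ULL;
    return x;
}

/* Morton key of a cell centre quantised onto a 2^21 grid in each direction. */
static uint64_t morton_key(uint32_t ix, uint32_t iy, uint32_t iz)
{
    return spread_bits(ix) | (spread_bits(iy) << 1) | (spread_bits(iz) << 2);
}

/* After sorting the cells by Morton key, cut the list into equal chunks. */
static int domain_of(size_t sorted_index, size_t n_cells, int n_domains)
{
    return (int)((sorted_index * (size_t)n_domains) / n_cells);
}

Because neighbouring cells tend to have nearby Morton keys, the equal-sized chunks are geometrically compact, which is why SFC methods give a cheap, perfectly load-balanced (if somewhat higher edge-cut) alternative to graph partitioning.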

 

A special focus is placed on the parallel graph partitioning tools, ParMetis 3.1.1 and PT-Scotch 5.1, for the generation of 32 to 131072 sub-domains from the original 121M grid. Tables 5 and 6 give, as a function of the number of sub-domains, the minimum number of processors ParMetis 3.1.1 and PT-Scotch 5.1 must be run on, the time spent on the graph transfer, the partitioning time, the number of edge cuts, the load balance and the maximum number of neighbours a sub-domain may have.

 

The load balance is defined as the number of cells in the smallest sub-domain divided by the number of cells in the largest sub-domain.
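Both metrics can be computed directly from the partition array. The sketch below follows the definitions used in this report: the load balance is the smallest sub-domain size divided by the largest, and an edge is cut whenever an interior face joins two cells assigned to different sub-domains (each interior face corresponds to one edge of the cell-connectivity graph). The array names are hypothetical.

/* Compute the load balance and edge-cut metrics from a partition array
 * (serial sketch using the definitions in the text).
 * part[c]      : sub-domain of cell c, 0 <= part[c] < n_domains
 * face_cell[f] : the two cells adjacent to interior face f            */
#include <stdlib.h>

double load_balance(const int *part, size_t n_cells, int n_domains,
                    const int (*face_cell)[2], size_t n_faces,
                    size_t *edge_cut)
{
    size_t *count = calloc((size_t)n_domains, sizeof *count);
    for (size_t c = 0; c < n_cells; c++)
        count[part[c]]++;

    size_t min = n_cells, max = 0;
    for (int d = 0; d < n_domains; d++) {
        if (count[d] < min) min = count[d];
        if (count[d] > max) max = count[d];
    }

    /* A face whose two cells belong to different sub-domains is "cut". */
    *edge_cut = 0;
    for (size_t f = 0; f < n_faces; f++)
        if (part[face_cell[f][0]] != part[face_cell[f][1]])
            (*edge_cut)++;

    free(count);
    return (double)min / (double)max;   /* smallest / largest sub-domain */
}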

 

ParMetis 3.1.1 requires at least 32 processors to perform the partitioning up to 8192 sub-domains, and at least 512 for the 131072 case. Conversely, PT-Scotch 5.1 only requires 16 processors for all the partitions. The available partitioning strategy clearly depends on the number of processors the parallel partitioner is run on and this could have an impact on the quality of the partition obtained.

 

ParMetis 3.1.1's graph transfer time scales very well (a speed-up of 12.5 against an ideal of 16 going from 32 to 512 processors), and PT-Scotch 5.1's graph transfer time on 16 processors is about twice that of ParMetis 3.1.1 on 32 processors, which, allowing for the factor of two in processor count, indicates that both partitioning tools exhibit similar graph-transfer performance. On 32 processors, partitioning with ParMetis 3.1.1 takes only about 10% more time for 8192 sub-domains than for 32 sub-domains. For 16384 domains and above, 64 to 512 processors are used, but ParMetis 3.1.1 requires less than 40 seconds to complete. The PT-Scotch 5.1 partitioning time is much longer (from just over 220 seconds for 32 sub-domains to almost 520 seconds for 131072 sub-domains). In practice, the times are all very modest and demonstrate that the computational time associated with large-scale partitioning is not a major issue. However, as expected, memory constraints do have an impact on how the partitioning is performed for large-scale problems.

 

An important measure of the resulting partition is the number of edge cuts, which indicates the amount of communication to be performed between sub-domains. For this particular case, the partitions generated by ParMetis 3.1.1 lead to a slightly larger number of edge cuts than those obtained by PT-Scotch 5.1. In general, for partitions of up to 65536 sub-domains, the difference between the two packages is less than 14% and typically below 10%. However, for the largest partition, which involves 131072 sub-domains and is typical of the scale needed for a realistic petascale platform, the number of edge cuts is about 57% larger for ParMetis 3.1.1.

 

Considering the two software packages under investigation, PT-Scotch 5.1 produces a good load balance (above 83% for all cases), whereas ParMetis 3.1.1 shows a high degree of variability, with generally poor load balancing for large numbers of sub-domains. Finally, the maximum number of neighbours associated with a partition increases for both ParMetis 3.1.1 and PT-Scotch 5.1, but is generally lower with PT-Scotch. These metrics suggest that better performance should be observed with Code_Saturne for sub-domains obtained with PT-Scotch 5.1 rather than with ParMetis 3.1.1.

 

 

 

 

Table 5: Metrics for the 121M case - ParMetis 3.1.1

Domains    Min. processors    Graph transfer time (s)    Partitioning time (s)    Edge cuts    Load balance    Max neighbours
32         32                 22.22                      49.28                    1034756      0.94            23
128        32                 22.40                      49.69                    2014746      0.88            25
512        32                 22.35                      50.17                    3406940      0.66            35
2048       32                 22.74                      51.59                    5647363      0.34            76
8192       32                 22.31                      55.73                    9205459      0.36            72
16384      64                 11.90                      32.05                    11683666     0.43            Not available
32768      64                 12.12                      36.72                    14607593     0.40            Not available
65536      128                6.33                       35.70                    18474955     0.68            Not available
131072     512                1.79                       35.49                    36006281     0.31            Not available

 

 

 

 

 

 

Table 6: Metrics for the 121M case - PT-Scotch 5.1

Domains    Min. processors    Graph transfer time (s)    Partitioning time (s)    Edge cuts    Load balance    Max neighbours
32         16                 47.15                      221.04                   924584       0.93            15
128        16                 43.33                      257.66                   1779971      0.92            26
512        16                 43.56                      296.48                   3143164      0.90            43
2048       16                 44.87                      338.25                   5299619      0.87            46
8192       16                 53.51                      385.38                   8755670      0.86            50
16384      16                 51.68                      412.05                   11173729     0.86            54
32768      16                 56.27                      442.80                   14188598     0.84            64
65536      16                 62.99                      478.25                   18075714     0.83            61
131072     16                 66.61                      519.08                   22911992     0.83            50

 

 

2.6 Parallel performance on HECToR

 

For the work presented here, all of the tests have been performed on HECToR Phase 2a (Cray XT4) [16]. Figure 9 shows the CPU time per time step as a function of the number of cores for Code_Saturne simulations running on up to 8192 cores, with meshes partitioned by Metis 5.0pre2 (on the 96 GB SGI machine at Daresbury Laboratory, as described in Section 2.5), ParMetis 3.1.1, PT-Scotch 5.1 and Zoltan (RIB) 3.0. The CPU time per time step decreases for all of the simulations as the number of cores is increased, with ParMetis 3.1.1 partitioning leading to the fastest simulation, although there is no really significant difference up to 512 cores. From 1024 cores onwards, the Zoltan (RIB) geometric partitioning clearly degrades the performance of Code_Saturne. This is to be expected, as geometry-based partitioning tools usually do not perform as well as graph-based tools, with the number of edge cuts and neighbours being very large. From 2048 cores onwards, Metis 5.0pre2 produces better results than ParMetis 3.1.1 and PT-Scotch 5.1. In Tables 5 and 6, PT-Scotch 5.1 generally showed better metrics than ParMetis 3.1.1 and, for 2048 and 4096 cores, this is confirmed by Code_Saturne performing better. However, there is no conclusive evidence to suggest that the improved metrics offered by PT-Scotch 5.1 result in the best code performance.

Figure 9: CPU time per time step as a function of the number of cores

Figure 10: Speed-up as a function of the number of cores

 

 

 

Figure 10 shows the speed-up, based on the CPU time per iteration, as a function of the number of cores. Metis 5.0pre2 demonstrates the best performance, with an almost ideal speed-up up to 2048 cores, whereas Zoltan (RIB) 3.0 exhibits the poorest performance. Overall, PT-Scotch 5.1 is the best parallel partitioner for 2048 and 4096 cores, as shown by Code_Saturne's speed-up, which is very close (about 10% lower) to the speed-up obtained when Metis 5.0pre2 is used as the partitioning tool. Despite the poor metric indicators of ParMetis 3.1.1, especially at high core counts, the code generally performs well and, at lower core counts, produces the minimum run time. Even at 8192 cores, it is only just below Metis 5.0pre2, which gives the best performance at the highest number of cores tested on HECToR Phase 2a.

 

3. Conclusions

 

Mesh partitioning is a key component of solving grid-based problems on unstructured meshes, and the advent of petascale and, before 2020, exascale systems has highlighted the need to revisit this “solved” problem. To test the suitability of the available partitioning software for creating very large-scale mesh partitions, we have used Code_Saturne, an open-source CFD package that is used extensively in industry and across Europe. The geometric problem is based on the DARPA submarine, which was meshed with 121M tetrahedral elements to reflect the scale of the problems expected on a petascale architecture.

 

The partitioning software considered was Metis 5.0pre2, ParMetis 3.1.1, PT-Scotch 5.1 and Zoltan (RIB) 3.0 which are all available as open-source packages. As partitioning a graph is considered to be an NP-hard problem, all packages use heuristics in their solution strategy. This naturally leads to differences in the algorithms employed and their implementation with the corresponding result that each software package produces a different partition. As Metis 5.0pre2 is sequential, there are natural memory limitations that impact the total number of sub-domains the package can create.

 

For PT-Scotch 5.1, we found that we could generate 131072 sub-domains using just 16 cores. In contrast, ParMetis 3.1.1 required a minimum of 32 cores to partition the DARPA submarine, and this number grew with the number of sub-domains, the 131072 partition requiring at least 512 cores. ParMetis 3.1.1 was, however, consistently faster than PT-Scotch, although in practice the amount of time required to partition the mesh was always very modest. Although the time is not a major issue, it is clear that memory constraints could have an impact on deciding which package to use.

 

If we consider the metrics presented in Tables 5 and 6, the indications are that PT-Scotch 5.1 might provide the better solution. In practice, this was not the case. In contrast to the statistics produced, ParMetis 3.1.1 provided the minimum run times up to 1024 cores, whereas Metis 5.0pre2 delivered the best performance above 1024 cores and up to the limit of 8192 cores tested on HECToR Phase 2a. We did find that PT-Scotch 5.1 performed better than ParMetis 3.1.1 on 2048 and 4096 cores. In general, however, it is not possible to identify specific trends that would lead to one package being clearly superior to the others.

 

The results presented for Zoltan are very preliminary. Although this package shows the worst performance here, the limited time available to investigate it means that it would be unfair to take this as a definitive result. However, one aspect that became apparent was that it was necessary to allocate one core to each sub-domain being created. This could be attributed to a limited understanding of how the software works, but it would be an undesirable feature.

 

As a final observation, all of the packages considered performed well and are very similar at low core counts. We should also note that these results are only valid for the DARPA test case running with Code_Saturne on HECToR Phase 2a, and other codes and problems could behave differently, although we anticipate that the results and observations are fairly general.

 

Acknowledgements

The authors would like to thank the Engineering and Physical Sciences Research Council (EPSRC) for their support of Collaborative Computational Project 12 (CCP12) and Dr. Ming Jiang of STFC who assisted with the partitioning libraries of Code_Saturne.

 

This project was funded under the HECToR Distributed Computational Science and Engineering (CSE) Service operated by NAG Ltd. HECToR - A Research Councils UK High End Computing Service - is the UK’s national supercomputing service, managed by EPSRC on behalf of the participating Research Councils. Its mission is to support capability science and engineering in UK academia. The HECToR supercomputers are managed by UoE HPCx Ltd and the CSE Support Service is provided by NAG Ltd.

References

[1] F. Archambeau, N. Mechitoua, M. Sakiz. Code_Saturne: A Finite Volume Code for the Computation of Turbulent Incompressible Flows – Industrial Applications. International Journal on Finite Volumes, 1(1), 2004.

 

[2] Code_Saturne open source: http://www.code-saturne.org.

 

[3] M. Sohaib, M. Ayub, S. Bilal, S. Zahir, M.A. Khan. Calculation of flows over underwater bodies with hull, sail and appendages. Technical Report of National Engineering and Scientific Commission, Islamabad, 2001.

 

[4] Cindy C. Whitfield. Steady and Unsteady Force and Moment Data on a DARPA2 Submarine. Master Thesis of the Faculty of the Virginia Polytechnic Institute and State University, August 1999, USA.

 

[5] www.fluent.co.uk.

 

[6] www.cd-adapco.com/products/STAR-CD.

 

[7] www.ansys.com/products/fluid-dynamics/cfx.

 

[8] www.openfoam.com.

 

[9] EDF R&D. Code_Saturne version 1.3.2 practical user’s guide. April 2008.

 

[10] G. Karypis, V. Kumar. Metis: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices. Version 5.0, 2007.

 

[11] http://glaros.dtc.umn.edu/gkhome/views/metis/.

 

[12] http://www.labri.fr/perso/pelegrin/scotch/.

 

[13] http://www.cs.sandia.gov/zoltan/Zoltan.html.

 

[14] http://www.cse.scitech.ac.uk/sog/.

 

[15] http://www.win.tue.nl/~hermanh/stack/dagstuhl08-talk.pdf.

 

[16] http://www.hector.ac.uk.



[1]               Corresponding author at: Department of Computational Science and Engineering, Science and Technology Facilities Council, Daresbury Laboratory, Warrington WA4 4AD, United Kingdom. Tel: +44 1925 603221; Fax: +44 1925 603634.

                E-mail address: david.emerson@stfc.ac.uk (D.R.Emerson)