Direct Numerical Simulation (DNS) of turbulent fluid flows - a dCSE Project

Ning Li
Numerical Algorithms Group (NAG)
Wilkinson House, Jordan Hill Road,


Date: April 29, 2012




1 Introduction

This project concerns the further development of Incompact3D, an incompressible Navier-Stokes solver, and its sister codes, Compact3D and QuasiCompact3D, for compressible flow simulations. These applications are used by the Turbulence, Mixing and Flow Control group at Imperial College and its academic collaborators to conduct state-of-the-art turbulence studies.

This is the second dCSE project for code Incompact3D. In a very successful first dCSE, the scalability of Incompact3D was significantly improved via the creation of 2DECOMP&FFT, a 2D pencil decomposition communication framework with a distributed FFT interface [3]. As a result, Incompact3D now regularly uses 4,000-16,000 cores on HECToR for production runs.

The current project contains two major work modules. First, the two sister codes of Incompact3D, both originally in serial form, are to be parallelised using the same 2DECOMP&FFT framework. This is to enable the research group to simulate large-scale compressible turbulent flows. Second, a new communication library is to be introduced to enable the embedding of explicit numerical schemes within the framework of a spatially implicit code. This makes it possible to use Incompact3D to study a wider range of problems, such as particle-turbulence interaction.

The project was granted 6 months of full-time effort for one person, which translates to 1 year of effort on a 50% part-time basis. The project officially started in March 2011.

2 General serial code modernisation and improvement

The Incompact3D family of applications has a long history. Unlike Incompact3D itself, which had already been modernised during the previous dCSE project, the two sister codes were still written in old Fortran 77 style and required major updates before parallelisation. The key work carried out during this project is summarised below:

The original serial codes were imported into an SVN server for version control - the same SVN server used by the previous Incompact3D project. Code validations based on a standard test case were carried out after each incremental change. After all the changes, the new serial codes, although barely recognisable, still produce results identical to those of the originals.

This code modernisation work was not originally planned in the proposal but was found to be necessary to ensure high-quality parallelisation later.

3 Strategy for parallelisation

As proposed, the code parallelisation makes use of the 2DECOMP&FFT library created in the previous dCSE project.

Because of the flexible design of the 2DECOMP&FFT library, there are a number of ways to parallelise the serial Compact3D and QuasiCompact3D codes. After consulting the code author and key users, the following strategies were used:

This set of procedures was designed to help future code development. Should the developers need to add new functionality, they can plan their new algorithms as if they were writing serial code. By following the strategies documented above, new algorithms can easily be parallelised to the same standard.

Validating an MPI code is normally only possible when all development work is complete. During this project, a procedure was established to test the intermediate codes regularly by running them on a single MPI rank (with all communication code being covered and tested on a local node), which should always produce results identical to those of the serial code. This helped pick up a number of coding errors.

When the parallelisation was completed, the code was moved onto a cluster for debugging. The main procedure was to compare intermediate results stored in the distributed arrays to those stored in the corresponding arrays in the serial code. The code was checked section by section until all data matched to machine accuracy. Further validation was done by evaluating integral quantities (such as system entropy) over the computational domain - these should remain constant regardless of the domain decomposition. In the end, the parallel code was shown to produce numerical results identical to those of the original serial code.
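To illustrate the integral-quantity check described above, the following minimal sketch (not the project's actual validation code; the array name phi and the reference value are hypothetical) sums the locally stored pencil data on each rank and combines the partial sums with MPI_ALLREDUCE. The global result should match the serial reference to machine accuracy whatever 2D processor grid is used (on a uniform mesh the sum differs from the integral only by a constant cell volume):

      ! Sketch of a decomposition-independent integral check.
      ! phi holds the locally stored pencil data on each MPI rank.
      subroutine check_integral(phi, ref_value)
        use MPI
        implicit none
        real(8), intent(in) :: phi(:,:,:)
        real(8), intent(in) :: ref_value   ! value from the serial reference run
        real(8) :: local_sum, global_sum
        integer :: ierr

        local_sum = sum(phi)
        call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                           MPI_SUM, MPI_COMM_WORLD, ierr)

        ! The global value should be independent of the domain decomposition.
        if (abs(global_sum - ref_value) > 1.0d-12*abs(ref_value)) then
           print *, 'Integral check failed: ', global_sum, ref_value
        end if
      end subroutine check_integral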

4 Parallel performance

The new code was ported onto HECToR for further validation and benchmarking. The performance study on HECToR was based on a test case supplied by a key Compact3D user. This is a small simulation using a 3D mesh of 75×101×75 points. Note that this problem size is about two orders of magnitude smaller than typical large production simulations. Due to the time constraints of this project, and the complexity of setting up a large-scale simulation, it was not possible to benchmark a more realistic case.


Figure 1: Strong scaling of Compact3D.

Figure 1 shows the strong scaling of Compact3D. First of all, it can be seen that the absolute speed-up over the serial run is not very impressive - 12 times faster when using 64 cores. However, such performance is broadly in line with the behaviour of the base 2DECOMP library - the ALLTOALL type of communication introduces significant overhead at smaller scales, but good scaling can be observed in larger parallel runs. This claim is supported by a second set of benchmarks in which the mesh of the small test case was artificially doubled in each dimension. The code then demonstrates much improved scaling, with a parallel efficiency of nearly 80% between 32 and 128 cores.
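For reference, the figures quoted above follow the usual definitions of speed-up and parallel efficiency (a reminder rather than a new result):

      \[ S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad
         E_{p_1 \to p_2} = \frac{p_1\,T_{p_1}}{p_2\,T_{p_2}}, \]

so a 12-fold speed-up on 64 cores corresponds to an absolute efficiency of about 19%, while the doubled-mesh case retains roughly 80% relative efficiency going from 32 to 128 cores.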

5 Implementation of a new communication library

Incompact3D is parallelised using a transpose-based method because its key algorithms are spatially implicit and require boundary-to-boundary data access. There are, however, many situations in which stencil-based explicit schemes are desirable. For example:

To support explicit numerical schemes, a second set of communication routines has been introduced. This involves building halo cells around the pencils (as defined in the 2D decomposition) and writing point-to-point communication code to enable neighbouring pencils to exchange data.

The original target, as outlined in the proposal, was to perform particle tracking only in X-pencils, using a 4th-order finite difference scheme for spatial interpolations (with a five-point stencil, therefore requiring two layers of halo cells in the communication code). However, because of the general usefulness of these explicit schemes, a more general communication routine has been designed and implemented in 2DECOMP&FFT to support any number of halo cells.
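To make the halo-width requirement concrete, any 4th-order five-point formula references points i-2 to i+2. The sketch below shows the familiar 4th-order central first derivative as an analogous 1D example only (the actual interpolation scheme used for particle tracking may differ); with two layers of halo cells on each side of a pencil, such a stencil can be evaluated at every locally owned point without further communication:

      ! 4th-order central difference on a five-point stencil (i-2 ... i+2),
      ! shown as a 1D example only.  The two extra cells on each side of the
      ! local index range 1..n are the halo cells filled by communication.
      subroutine der1_4th(f, dfdx, n, dx)
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: f(1-2:n+2)   ! local data plus two halo layers
        real(8), intent(in)  :: dx
        real(8), intent(out) :: dfdx(n)
        integer :: i
        do i = 1, n
           dfdx(i) = (-f(i+2) + 8.0d0*f(i+1) - 8.0d0*f(i-1) + f(i-2)) / (12.0d0*dx)
        end do
      end subroutine der1_4th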

In related work, the P3DFFT package [2] (which also uses a 2D pencil decomposition) implemented its halo-cell support using the following storage arrangement:

      |000000000000000000000000000|11111|22222|33333|44444|55555|66666|
                   ^                 ^     ^     ^     ^     ^     ^ 
              Pencil data            W     E     S     N     B     T
As shown, the first section of the memory contains the data in a pencil. The data in the halo cells is then appended to the pencil data, in the order of west/east (the 1st dimension), south/north (the 2nd dimension), and bottom/top (the 3rd dimension). The advantage of such a storage format is its efficiency - the same memory reference can be used in the global transposition routines, where the halo cells are simply ignored. The disadvantage is that extra code has to be written in applications to map the halo-cell data to a more convenient (i,j,k)-indexed format before it can be used.

In the current project, a more user-friendly design was adopted. A single library routine has been introduced in 2DECOMP&FFT, in the form of:

      call update_halo(var, var_halo, level)
Here the first parameter var, an input array, contains the normal pencil-distributed data as defined by the domain decomposition. After the subroutine call, the second parameter var_halo, an output, returns all the original data plus the halo data collected from neighbouring processes through MPI. The third parameter level simply defines how many layers of overlap are required. var_halo should be declared in the calling routine as either a 3D allocatable array or a pointer. Its memory space will be calculated and allocated by the library, and when the subroutine returns it can be indexed by the calling program using the normal i,j,k indices. A more general form of this subroutine is also available:
      call update_halo(var, var_halo, level, opt_decomp, opt_global)
The two optional parameters, opt_decomp and opt_global, enable the use of arbitrary global data sizes and global coordinates in data storage.
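As a usage illustration, the hedged sketch below (a minimal example rather than production code; the array names and mesh size are hypothetical, and the exact index bounds of the halo-extended array should be checked against the library documentation [1]) obtains one layer of halo cells around an X-pencil and then applies a simple three-point operation across pencil boundaries in the second dimension:

      program halo_demo
        use MPI
        use decomp_2d
        implicit none
        integer, parameter :: nx=64, ny=64, nz=64   ! hypothetical global mesh
        integer, parameter :: p_row=2, p_col=2      ! must satisfy p_row*p_col = no. of ranks
        real(8), allocatable, dimension(:,:,:) :: u, u_halo
        integer :: j, k, ierr

        call MPI_INIT(ierr)
        call decomp_2d_init(nx, ny, nz, p_row, p_col)

        ! Normal X-pencil storage as defined by the decomposition.
        allocate(u(xsize(1), xsize(2), xsize(3)))
        u = 1.0d0                                   ! placeholder data

        ! One layer of overlapping cells; u_halo is allocated by the library.
        call update_halo(u, u_halo, 1)

        ! Example stencil in Y crossing pencil boundaries.  This sketch assumes
        ! u_halo is indexed 0..xsize(2)+1 and 0..xsize(3)+1 in the distributed
        ! dimensions; physical-boundary treatment is omitted.
        do k = 1, xsize(3)
           do j = 1, xsize(2)
              u(:, j, k) = 0.25d0*u_halo(:, j-1, k) + 0.5d0*u_halo(:, j, k) &
                         + 0.25d0*u_halo(:, j+1, k)
           end do
        end do

        deallocate(u, u_halo)
        call decomp_2d_finalize
        call MPI_FINALIZE(ierr)
      end program halo_demo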

For transpose-based algorithms, boundary conditions can normally be applied to the dimension residing completely in local memory, just as in the serial code. The library does, however, provide direct support for periodic conditions, so that pencils bordering one side of the domain can exchange information with those on the other side.

The library internally handles the communication between neighbouring pencils using standard MPI non-blocking point-to-point communication routines. Full details of the halo-cell support are available in the 2DECOMP&FFT documentation [1].
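For readers unfamiliar with this communication pattern, the sketch below shows the general idea of a non-blocking halo exchange between neighbouring processes in one (periodic) direction. It is a simplified illustration of the pattern only, not the library's actual implementation; the buffer layout and the communicator are assumptions:

      program halo_exchange_sketch
        use MPI
        implicit none
        integer, parameter :: n = 8        ! local points in the exchanged direction
        real(8) :: f(0:n+1)                ! local data plus one halo cell each side
        integer :: comm1d, rank, nprocs, left, right, ierr
        integer :: req(4), stats(MPI_STATUS_SIZE,4)

        call MPI_INIT(ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

        ! 1D periodic Cartesian communicator; left/right are the two neighbours.
        call MPI_CART_CREATE(MPI_COMM_WORLD, 1, (/nprocs/), (/.true./), &
                             .false., comm1d, ierr)
        call MPI_COMM_RANK(comm1d, rank, ierr)
        call MPI_CART_SHIFT(comm1d, 0, 1, left, right, ierr)

        f = real(rank, 8)                  ! placeholder local data

        ! Post receives for the halo cells, then send the boundary cells.
        call MPI_IRECV(f(0),   1, MPI_DOUBLE_PRECISION, left,  0, comm1d, req(1), ierr)
        call MPI_IRECV(f(n+1), 1, MPI_DOUBLE_PRECISION, right, 1, comm1d, req(2), ierr)
        call MPI_ISEND(f(n),   1, MPI_DOUBLE_PRECISION, right, 0, comm1d, req(3), ierr)
        call MPI_ISEND(f(1),   1, MPI_DOUBLE_PRECISION, left,  1, comm1d, req(4), ierr)
        call MPI_WAITALL(4, req, stats, ierr)

        ! f(0) and f(n+1) now hold copies of the neighbours' boundary values.
        call MPI_FINALIZE(ierr)
      end program halo_exchange_sketch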

The validation of the new halo-cell communication was straightforward. This is because any communication that can be achieved using halo-cell exchanges can also be performed using global transpositions, albeit much more slowly. The two sets of communication code therefore validate each other.

The new communication code is now being used by Incompact3D to implement several key algorithms. For example, when using the Immersed Boundary method to model solid bodies in the computational domain, Incompact3D needs to perform 3D interpolations to prescribe the correct boundary conditions on the fluid-solid interface. Such interpolations, using explicit numerical schemes, are made possible by the new communication code. Incompact3D's particle tracking algorithm, which involves interpolating the velocity field from mesh points to arbitrary spatial locations, also requires the halo-cell support.
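As a schematic illustration of this kind of mesh-to-point interpolation, the sketch below performs simple trilinear interpolation of one field component to an arbitrary location. It is a low-order stand-in for the 4th-order scheme mentioned above, with hypothetical argument names; the field is assumed to already include whatever halo cells are needed so that the eight surrounding mesh points are always locally available:

      ! Trilinear interpolation of a mesh field to the point (x,y,z), given
      ! relative to the coordinate of u(1,1,1) on a uniform mesh.  The caller
      ! must ensure that the cell containing the point (and its +1 neighbours)
      ! lies inside the array, e.g. via halo cells near pencil boundaries.
      function interp_trilinear(u, x, y, z, dx, dy, dz) result(val)
        implicit none
        real(8), intent(in) :: u(:,:,:)
        real(8), intent(in) :: x, y, z      ! position relative to u(1,1,1)
        real(8), intent(in) :: dx, dy, dz   ! uniform mesh spacings
        real(8) :: val, fx, fy, fz
        integer :: i, j, k

        ! Cell containing the point and fractional offsets inside it.
        i = int(x/dx) + 1;  fx = x/dx - real(i-1, 8)
        j = int(y/dy) + 1;  fy = y/dy - real(j-1, 8)
        k = int(z/dz) + 1;  fz = z/dz - real(k-1, 8)

        ! Standard trilinear blend of the eight surrounding mesh values.
        val = (1-fx)*(1-fy)*(1-fz)*u(i  ,j  ,k  ) + fx*(1-fy)*(1-fz)*u(i+1,j  ,k  ) &
            + (1-fx)*   fy *(1-fz)*u(i  ,j+1,k  ) + fx*   fy *(1-fz)*u(i+1,j+1,k  ) &
            + (1-fx)*(1-fy)*   fz *u(i  ,j  ,k+1) + fx*(1-fy)*   fz *u(i+1,j  ,k+1) &
            + (1-fx)*   fy *   fz *u(i  ,j+1,k+1) + fx*   fy *   fz *u(i+1,j+1,k+1)
      end function interp_trilinear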

6 Concluding Remarks

Through this dedicated software engineering project, Incompact3D's sister codes have been parallelised. A second communication framework has been introduced to expand Incompact3D's capabilities. All major objectives in the original proposal were achieved.

As discussed in Section 3, there is still significant scope for improving the performance of the new parallel code by adjusting its algorithms to make better use of the communication library. Such work is more of an algorithm-design nature than software engineering. This project provides the PI and his associates with a good platform on which to implement further improvements.

As in the previous dCSE project, reusable software components have been extracted and added to the 2DECOMP&FFT library. A new version 1.4 of 2DECOMP&FFT was released in October 2011.

7 Acknowledgments

This project was funded under the HECToR Distributed Computational Science and Engineering (CSE) Service operated by NAG Ltd. HECToR - A Research Councils UK High End Computing Service - is the UK's national supercomputing service, managed by EPSRC on behalf of the participating Research Councils. Its mission is to support capability science and engineering in UK academia. The HECToR supercomputers are managed by UoE HPCx Ltd and the CSE Support Service is provided by NAG Ltd. http://www.hector.ac.uk

Bibliography

[1] 2DECOMP&FFT website. http://www.2decomp.org.

[2] P3DFFT website. http://code.google.com/p/p3dfft/.

[3] N. Li and S. Laizet. 2DECOMP&FFT - a highly scalable 2D decomposition library and FFT interface. In Cray User Group 2010 conference, May 2010.



Footnotes

1 (particle tracking)
Note that partitioning the computational domain for particle tracking, as discussed in this project, is not always suitable for large-scale problems. When particle tracking is computationally more expensive than obtaining the base flow field and the particle distribution is uneven, domain decomposition can introduce significant load imbalance. In that case, consider using a task-based decomposition instead.

