The HECToR Service is now closed and has been superseded by ARCHER.

Optimising the Parallelisation of a Harmonic Balance Navier-Stokes Solver for the Ultra-rapid Analysis of Wind Turbine, Turbomachinery and Aircraft Wing Periodic Flows

This dCSE project was concerned with improving the parallel performance of the COSA CFD system. COSA is a general-purpose compressible Navier-Stokes code which includes solvers for three flow models: steady, time-domain (TD) and harmonic balance (HB). All three solvers use a finite volume scheme with structured multi-block grids. For low-speed flows, such as those associated with horizontal axis wind turbines, low-speed preconditioning may also be used. All three COSA solvers (steady, TD and HB) may be run with one of three parallelisation strategies: pure MPI, pure OpenMP, or hybrid OpenMP and MPI. COSA uses a standard block-structured domain decomposition in which each block may be assigned to one MPI rank, so the maximum number of MPI processes is limited by the number of geometric partitions (grid blocks) in the simulation. Fine-grain parallelisation is, however, possible via OpenMP threads, which further partition the harmonic-mode space or, if the number of modes is too small to be divided among the threads, parallelise the top-level loops within each harmonic mode.
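As an illustration of this two-level parallelisation, the following minimal C sketch (not COSA source; the kernel name update_block_mode and the loop bounds are hypothetical) assigns one grid block to each MPI rank and shares the harmonic modes of that block among the OpenMP threads.

    #include <mpi.h>
    #include <omp.h>

    /* hypothetical per-block, per-mode update kernel */
    static void update_block_mode(int block, int mode) { (void)block; (void)mode; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nharms = 31;    /* e.g. 31 real harmonics, as in Test 1 */
        int myblock = rank;       /* coarse grain: one grid block per MPI rank */

        /* fine grain: the harmonic modes of this block are shared
         * among the OpenMP threads */
        #pragma omp parallel for schedule(static)
        for (int mode = 0; mode < nharms; ++mode)
            update_block_mode(myblock, mode);

        MPI_Finalize();
        return 0;
    }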

At the start of this project, for a representative grid of 262,144 cells and 31 real harmonics (Test 1), the parallel efficiency of the COSA HB solver was around 50% when using 512 cores on HECToR. The overall aim of the project was to improve the parallel efficiency of the HB solver by optimising both the MPI communications and the hybrid OpenMP and MPI implementation. This was to be achieved by the following:

  • Reducing the number of MPI messages for the point-to-point communications, by updating the code to assemble halo data into buffers before communication (a packing sketch follows this list).
  • Improving the use of OpenMP within the code in three main respects: initialisation of the shared data to ensure good data locality; the use of single OpenMP parallel regions to reduce the work-distribution overhead; and the use of MPI from within OpenMP regions to reduce synchronisation overhead.
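The first item above is a message-aggregation strategy. A minimal C sketch of the idea, assuming a hypothetical halo_value accessor and illustrative nhalo/nvar sizes, packs all halo data destined for one neighbouring block into a single buffer and sends it as one message rather than many small ones:

    #include <mpi.h>
    #include <stdlib.h>

    /* hypothetical accessor for one flow variable of one halo cell */
    static double halo_value(int cell, int var) { return cell + 0.1 * var; }

    void send_halo(int neighbour_rank, int nhalo, int nvar)
    {
        double *buf = malloc((size_t)nhalo * nvar * sizeof *buf);

        /* pack every variable of every halo cell into one buffer ... */
        for (int c = 0; c < nhalo; ++c)
            for (int v = 0; v < nvar; ++v)
                buf[c * nvar + v] = halo_value(c, v);

        /* ... and send it as one message instead of nhalo*nvar small ones */
        MPI_Send(buf, nhalo * nvar, MPI_DOUBLE, neighbour_rank,
                 /* tag */ 0, MPI_COMM_WORLD);
        free(buf);
    }

On the receiving side the buffer would be unpacked into the halo cells in the corresponding order.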

The overall outcome of this work may be summarised as follows:

  • Combining the MPI messages for the point-to-point communications did not improve the overall performance of Test 1 on HECToR; this was because the Gemini interconnect handles large numbers of small messages extremely well. To demonstrate the benefit, Test 1 was also performed on a Bull supercomputer with two 8-core Intel Xeon processors per node and an InfiniBand QDR interconnect, where a significant improvement (up to 100x) was achieved. Work was also performed to reduce the number of MPI global collectives by nearly a factor of 10 (see the collective-combining sketch after this list).
  • Individual MPI I/O calls for writing the restart and Tecplot output files were combined (see the MPI-I/O sketch after this list). For Test 1 with 512 processes, the original run-time was 421 seconds without I/O and 701 seconds with I/O; the new code now takes 547 seconds with I/O. For a larger test case (a 2048-block grid with 4,194,304 cells and 17 real harmonics) the new code is 30% faster with 2048 processes and 40% faster with 512 processes.
  • Calls to BLAS and LAPACK were introduced for the matrix inversion and matrix-vector multiplication in the linear algebra routines (see the sketch after this list). When demonstrated on 512 processes, this reduced the overall run-time for the larger test case by nearly half (~740 seconds vs ~1400 seconds).
  • The combined MPI and general optimisations have resulted in improved scaling: for a representative test case running on 256 or 512 cores on HECToR, the overall run-time is now half that of the original code.
  • To improve the use of OpenMP within the code, it was found that certain routines within OpenMP regions consumed far less run-time than the key HB routines. The original OpenMP directives were therefore removed from areas without enough work to justify the overheads of the parallelisation, and OpenMP was re-implemented for the key HB routines whilst keeping overheads to a minimum by placing the parallel regions at the highest possible loop levels (see the loop-level sketch after this list).
  • The initialisation and zeroing of the arrays were also re-developed to use a "first touch" approach, so that this is now done in parallel with each thread touching its own array section (see the first-touch sketch after this list).
  • To demonstrate the initial OpenMP improvements, Test 1 was run in hybrid mode using 512 MPI tasks, each with 4 OpenMP threads (2048 cores in total); the overall scaling was 3.26.
  • Further hybrid OpenMP and MPI optimisations were also performed. Firstly, as the OpenMP threads work on independent data, MPI calls can be made by individual threads; the MPI thread support level is therefore specified as MPI_THREAD_MULTIPLE (see the sketch after this list). Secondly, a hybrid file opening and closing strategy was implemented. These changes increased the overall scaling to 3.68.
  • The original hybrid code gave a representative parallel efficiency of around 55%; with the new code this is now around 90%. This represents a significant saving in computational resources, and it means that simulations can now be performed in around half the original time.
  • These developments have been incorporated back into the main COSA source code. COSA has also been installed as a central module on HECToR.
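The reduction in global collectives mentioned above can be illustrated by combining several scalar reductions into a single MPI_Allreduce over an array; the residual names below are purely illustrative, not COSA variables.

    #include <mpi.h>

    /* sum three residual norms across all ranks with one collective
     * instead of three separate MPI_Allreduce calls */
    void reduce_residuals(double *rho_res, double *mom_res, double *ener_res)
    {
        double local[3] = { *rho_res, *mom_res, *ener_res };
        double global[3];

        MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        *rho_res  = global[0];
        *mom_res  = global[1];
        *ener_res = global[2];
    }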
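The combined MPI-I/O writes can be sketched as follows, assuming each rank writes one contiguous block of doubles at a rank-dependent offset of an already-opened shared file; the layout is illustrative rather than COSA's actual restart or Tecplot format.

    #include <mpi.h>

    /* each rank writes its whole block of output with one collective call
     * at a rank-dependent offset, instead of one small write per variable */
    void write_block(MPI_File fh, int rank, const double *data, int ndata)
    {
        MPI_Offset offset = (MPI_Offset)rank * ndata * sizeof(double);

        MPI_File_write_at_all(fh, offset, data, ndata, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
    }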
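The linear algebra change can be illustrated with LU-based inversion (dgetrf/dgetri) and a matrix-vector product (dgemv); the C LAPACKE/CBLAS wrappers are used here purely to keep the sketches in one language, and the routine names invert/matvec are hypothetical.

    #include <stdlib.h>
    #include <lapacke.h>
    #include <cblas.h>

    /* invert the n x n row-major matrix a in place via LU factorisation;
     * returns 0 on success */
    int invert(double *a, int n)
    {
        lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
        lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, n, n, a, n, ipiv);
        if (info == 0)
            info = LAPACKE_dgetri(LAPACK_ROW_MAJOR, n, a, n, ipiv);
        free(ipiv);
        return (int)info;
    }

    /* y = A * x for an n x n row-major matrix A */
    void matvec(const double *a, const double *x, double *y, int n)
    {
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                    1.0, a, n, x, 1, 0.0, y, 1);
    }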
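Placing the parallel regions at the highest possible loop level can be sketched as below, with a hypothetical flux_kernel standing in for the real HB routines; a single worksharing construct covers the whole loop nest instead of many small regions inside low-level routines.

    /* hypothetical low-level kernel called for every cell of every mode */
    static void flux_kernel(int mode, int cell) { (void)mode; (void)cell; }

    void hb_residual(int nharms, int ncells)
    {
        /* one worksharing construct at the outermost (harmonic) loop,
         * rather than a parallel region inside every low-level routine */
        #pragma omp parallel for schedule(static)
        for (int mode = 0; mode < nharms; ++mode)
            for (int cell = 0; cell < ncells; ++cell)
                flux_kernel(mode, cell);
    }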
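The "first touch" initialisation can be sketched as follows: each thread zeroes the section of the array it will later compute on, so the corresponding memory pages are placed local to that thread's NUMA node (the function name and array are illustrative).

    #include <stdlib.h>

    /* allocate and zero an array in parallel so each thread first touches
     * (and therefore places locally) the section it will later use */
    double *alloc_first_touch(size_t n)
    {
        double *q = malloc(n * sizeof *q);

        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; ++i)
            q[i] = 0.0;

        return q;
    }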
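Requesting full MPI thread support, so that individual OpenMP threads can make their own MPI calls, looks as follows; error handling is minimal and the body of the parallel region is left as a comment.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            fprintf(stderr, "warning: full MPI thread support not available\n");

        #pragma omp parallel
        {
            /* each OpenMP thread may now post its own sends and receives,
             * e.g. for the halo data of the harmonics it handles */
        }

        MPI_Finalize();
        return 0;
    }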

Please see the PDF or HTML report which summarises this project.