The HECToR Service is now closed and has been superseded by ARCHER.

Enhancement of a high-order CFD solver for many-core architecture

This Distributed Computational Science and Engineering (dCSE) project further developed the Block Overset Fast Flow Solver (BOFFS) high-order CFD code. BOFFS can be used on overset, structured grids to perform high-resolution large eddy simulations (LES) of turbulent flows for turbomachinery applications.

The main aim of the project was to enable more realistic turnaround times for grids with more than 100 million points (and hundreds of blocks) by making more efficient use of thousands of processing cores on HECToR. In particular, the work aimed to improve the scalability and performance of BOFFS on many-core architectures by:

  • Optimising the MPI used for the inter-block data transfers.
  • Improving the memory utilisation.
  • Updating the OpenMP used for the intra-block computations.

The individual achievements of the project are summarised below:

  • For the inter-block data transfers, an asynchronous method to transfer data between blocks was implemented in MPI. This allows data buffers to be packed and unpacked while each MPI process is waiting, and also enables grids with more complex block structures to be handled. Overall, it reduced the time taken for inter-block communication to a third of its original value.
  • For better memory utilisation, the static arrays for all main variables were replaced by allocatable forms. Originally, BOFFS was restricted to the GNU compiler because of its use of large static arrays. This work has enabled BOFFS to be built with any of the default compilers on HECToR: GNU, PGI and Cray. This is important because both PGI and Cray give better performance.
  • The OpenMP sections for the intra-block computations in the Tri-Diagonal Matrix Solver (TDMA) were improved by implementing a red-black decomposition. The updated BOFFS now has good scalability up to 8 threads per MPI task and gives reasonable performance up to 32 threads for certain problem sizes. A speedup of up to 1.4 times can be achieved compared with the original code.
  • The performance of BOFFS was tested with the GNU, PGI and Cray compilers. Cray and PGI were found to give the best performance and OpenMP scalability. A 1.5 times speedup can be achieved for the 12 million grid point test cases with 4 and 32 blocks.
  • In general, the most cost-effective way to run BOFFS on HECToR phase 3 is with 8 OpenMP threads and either 2 or 4 MPI tasks per node, depending on the size of the problem.
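
To illustrate the red-black idea mentioned above, the following is a minimal sketch and not the BOFFS implementation. It assumes that the decomposition colours grid lines alternately, so that all tri-diagonal solves of one colour depend only on lines of the other colour and can be shared among threads. Python stands in for the production code, the 2D Laplace model problem is chosen purely for illustration, and the names thomas, solve_row and red_black_sweep are hypothetical.

```python
def thomas(a, b, c, d):
    """Solve a tri-diagonal system with the Thomas algorithm (TDMA).

    a: sub-diagonal (a[0] unused), b: diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def solve_row(u, j):
    """Implicit solve along interior row j of the 5-point Laplace stencil.

    Neighbouring rows j-1 and j+1 are treated as known values.
    """
    nx = len(u[0])
    m = nx - 2  # number of interior unknowns in the row
    a = [-1.0] * m
    b = [4.0] * m
    c = [-1.0] * m
    d = [u[j - 1][i] + u[j + 1][i] for i in range(1, nx - 1)]
    d[0] += u[j][0]        # left boundary value moves to the right-hand side
    d[-1] += u[j][nx - 1]  # right boundary value likewise
    return thomas(a, b, c, d)


def red_black_sweep(u):
    """One relaxation sweep: all odd ("red") rows, then all even ("black") rows."""
    ny = len(u)
    for colour in (1, 0):
        rows = [j for j in range(1, ny - 1) if j % 2 == colour]
        # Rows of one colour read only rows of the other colour, so these
        # solves are independent; in a threaded code this loop is the
        # natural candidate for parallel execution.
        results = [solve_row(u, j) for j in rows]
        for j, x in zip(rows, results):
            u[j][1:-1] = x
```

Calling red_black_sweep repeatedly on a grid whose boundary values are fixed relaxes the interior towards the discrete solution. The point of the colouring is that no solve within a colour reads data written by another solve of the same colour, which removes the ordering dependency of a plain line-by-line sweep.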

Please see PDF or HTML for a report which summarises this project.