## Improved Scaling for Direct Numerical Simulations of Turbulence

This Distributed Computational Science and Engineering (dCSE) project aimed to develop two CFD codes: SWT and SS3F. Both codes solve the Navier-Stokes (N-S) equations using a pseudospectral method, in which the nonlinear advective terms of the N-S equations are evaluated in real space and the necessary transforms to and from wave space are carried out in slices (the data decomposition). SWT has been used for the direct numerical simulation of turbulent flow in an infinite plane channel and of turbulent Couette-Poiseuille flow. SS3F solves the incompressible N-S equations in the Boussinesq approximation using a 3-D Fourier representation, and is predominantly used to simulate the dynamics of vortices in stratified flow. SWT uses the same FFT routines as SS3F, except that one coordinate direction is represented with Chebyshev basis functions.
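As a rough illustration of the pseudospectral idea described above, the following Python sketch evaluates a 1-D advective term u ∂u/∂x: the derivative is taken in wave space, the product is formed pointwise in real space, and the result is transformed back. It is not taken from SWT or SS3F. NumPy, the function name `nonlinear_term` and the periodic domain of length 2π are assumptions made for the example, and dealiasing is omitted for brevity.

```python
import numpy as np


def nonlinear_term(u_hat, length=2.0 * np.pi):
    """Return the Fourier coefficients of u * du/dx (hypothetical helper).

    u_hat holds the FFT of u on a uniform periodic grid of the given length.
    """
    n = u_hat.size
    # Angular wavenumbers in NumPy's FFT ordering.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)
    # Differentiate in wave space, then bring both fields to real space.
    u = np.fft.ifft(u_hat).real
    du_dx = np.fft.ifft(1j * k * u_hat).real
    # Form the product pointwise in real space and transform it back.
    return np.fft.fft(u * du_dx)


n = 32
x = 2.0 * np.pi * np.arange(n) / n
u_hat = np.fft.fft(np.sin(x))
product = np.fft.ifft(nonlinear_term(u_hat)).real
# For u = sin(x) the exact product is sin(x) * cos(x) = 0.5 * sin(2x).
max_error = np.max(np.abs(product - 0.5 * np.sin(2.0 * x)))
```

For this smooth test field the difference from the analytic product should be at round-off level. In three dimensions the same pattern needs a forward and an inverse transform in every direction at each time step, which is why the transform implementation dominates the cost.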

The aim of this project was to remove the limit on parallel scalability that both codes had reached on HECToR phase 2b. This was to be achieved by improving the parallel FFT implementation for SWT and SS3F, in particular:

- The FFT and cosine transform calls in the SS3F and SWT codes would be updated to use FFTW3.
- SS3F would be modified to use a 2-D domain decomposition through the FFT routines of the 2DECOMP&FFT library.
- The modified FFT and cosine transform calls would be used to create a serial Chebyshev transform routine for the library.
- A 2-D-FFT/1-D-Chebyshev transform capability would be developed for use with the 2DECOMP&FFT library.
- SWT would be modified to use this transform by implementing a 2-D domain decomposition.
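The link between the cosine transform and the Chebyshev transform in the list above can be sketched as follows. On the Gauss-Lobatto points x_j = cos(πj/N) a Chebyshev series becomes a cosine series, so its coefficients follow from a type-I discrete cosine transform (the REDFT00 kind in FFTW3). The Python sketch below builds that transform from an FFT of the even extension using NumPy. The function name `chebyshev_coefficients` and the choice of grid are assumptions for illustration and are not necessarily what the project's serial routine does.

```python
import numpy as np


def chebyshev_coefficients(values):
    """Chebyshev coefficients a_k of f from samples at x_j = cos(pi*j/N).

    values has length N + 1 (j = 0..N); the result has length N + 1 (k = 0..N).
    Hypothetical helper: a type-I DCT built from a real FFT.
    """
    n = values.size - 1
    # Even extension [f_0, ..., f_N, f_{N-1}, ..., f_1] of length 2N.
    extended = np.concatenate([values, values[-2:0:-1]])
    # The FFT of an even real sequence is real; keep modes 0..N.
    spectrum = np.fft.rfft(extended).real
    coeffs = spectrum / n
    # The first and last coefficients carry an extra factor of one half.
    coeffs[0] *= 0.5
    coeffs[n] *= 0.5
    return coeffs


n = 8
x = np.cos(np.pi * np.arange(n + 1) / n)
# f = 3*T_0 - 2*T_1 + 4*T_2, with T_2(x) = 2x^2 - 1.
values = 3.0 - 2.0 * x + 4.0 * (2.0 * x**2 - 1.0)
coeffs = chebyshev_coefficients(values)
```

Here `coeffs` recovers 3, -2 and 4 in its first three entries, with the remaining entries zero to round-off. A 2-D-FFT/1-D-Chebyshev transform combines a routine of this kind in one direction with ordinary FFTs in the other two.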

The individual achievements of the project are summarised below:

- For both codes, a 1-D cosine transform was implemented using FFTW3 and the planning flags (FFTW_ESTIMATE and FFTW_PATIENT) were tuned. For a 128×720×1440 mode problem, SS3F showed a 51% reduction in wall-clock time per time step, and a 55% reduction for 256×1440×2880 modes. For a 3072×325×1024 mode problem on 6144 cores, SWT showed a 53% reduction.
- A parallel 2-D-FFT / 1-D-Chebyshev transform was implemented to work with the 1-D cosine transform from FFTW3 for both SS3F and SWT.
- For a 768×1536×3072 mode test problem, SS3F now scales to over 12000 cores with good efficiency.
- For a representative 3072×325×1024 mode problem, SWT now scales to 8192 cores with good efficiency. The new code requires around 34% fewer kAUs to perform the same amount of work as the original code.
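The scaling results above rest on the move from slices to a 2-D domain decomposition. With a slice decomposition only one grid direction is distributed, so the number of ranks cannot exceed the number of points in that direction, whereas a 2-D decomposition distributes two directions over a processor grid. The Python sketch below works out the local array extents for such a decomposition. The 96×128 processor grid, the rank ordering and the helper names are assumptions chosen for illustration, not the layout used by the project or by 2DECOMP&FFT.

```python
def split_extent(n, parts, index):
    """Return (start, size) of block `index` when n points are split over `parts` ranks."""
    base, extra = divmod(n, parts)
    size = base + (1 if index < extra else 0)
    start = index * base + min(index, extra)
    return start, size


def pencil_shape(grid, p_row, p_col, rank):
    """Local extents of an x-pencil: x stays whole, y and z are split.

    The row-major rank-to-(row, col) mapping is a hypothetical choice.
    """
    nx, ny, nz = grid
    row, col = divmod(rank, p_col)
    _, local_ny = split_extent(ny, p_row, row)
    _, local_nz = split_extent(nz, p_col, col)
    return nx, local_ny, local_nz


grid = (768, 1536, 3072)   # mode counts of the SS3F test problem
p_row, p_col = 96, 128     # assumed processor grid, 12288 ranks in total
shapes = [pencil_shape(grid, p_row, p_col, r) for r in range(p_row * p_col)]
total_points = sum(nx * ny * nz for nx, ny, nz in shapes)
```

With these assumed values every rank holds a 768×16×24 pencil and the pencils together cover the whole grid. A slice decomposition of the same grid could use at most 3072 ranks, even when the longest direction is the one being split.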

A report summarising this project is available in PDF and HTML formats.