Improved Data Distribution Routines for Gyrokinetic Plasma Simulations
GS2 is an initial value simulation code developed to study low-frequency turbulence in magnetized plasma. GS2 solves the gyrokinetic equations for perturbed distribution functions together with Maxwell's equations for the turbulent electric and magnetic fields within a plasma. It is typically used to assess the microstability of plasmas produced in the laboratory and to calculate key properties of the turbulence which results from instabilities. It is also used to simulate turbulence in plasmas which occur in nature, such as in astrophysical and magnetospheric systems.
This dCSE project will follow on from the work identified in Upgrading the FFTs in GS2. It will optimise parts of the GS2 code by improving the performance of the transformation of data between the linear and non-linear parts of the code. This transformation generally involves FFT calculations along with associated data copying and MPI communications. Improved performance will be achieved by replacing the indirect addressing used in the data copying routines with more efficient direct access. Furthermore, new decomposition functionality will be developed to reduce the amount of communication and data copying required when using non-optimal process counts for a given user simulation. The new decomposition functionality will enable GS2 to use a much wider range of process counts efficiently, giving users the flexibility to select the process count that matches the simulation, resources, and system they are using.
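To illustrate the difference between indirect and direct addressing in a local data copy, the following is a minimal sketch in Python. It is not GS2 code (GS2 is written in Fortran), and the two-dimensional layout, the function names, and the sizes `nx` and `ny` are all assumptions chosen only for illustration. The indirect version looks up every source and destination offset in precomputed index arrays, while the direct version computes the offsets from the loop counters and so avoids the extra memory reads.

```python
def build_index_maps(nx, ny):
    """Precompute source and destination offsets for every element.

    Hypothetical layouts: the source is stored x-major (offset ix*ny + iy),
    the destination y-major (offset iy*nx + ix).
    """
    src_index, dst_index = [], []
    for ix in range(nx):
        for iy in range(ny):
            src_index.append(ix * ny + iy)
            dst_index.append(iy * nx + ix)
    return src_index, dst_index


def copy_indirect(src, src_index, dst_index):
    """Copy using indirect addressing: two index lookups per element."""
    dst = [None] * len(src)
    for k in range(len(src_index)):
        dst[dst_index[k]] = src[src_index[k]]
    return dst


def copy_direct(src, nx, ny):
    """Copy using direct addressing: offsets come from the loop counters."""
    dst = [None] * len(src)
    for ix in range(nx):
        row_start = ix * ny  # start of this x-row in the source layout
        for iy in range(ny):
            dst[iy * nx + ix] = src[row_start + iy]
    return dst


# Both versions produce the same redistributed data.
nx, ny = 3, 4
src = list(range(nx * ny))
src_index, dst_index = build_index_maps(nx, ny)
indirect_result = copy_indirect(src, src_index, dst_index)
direct_result = copy_direct(src, nx, ny)
```

In a compiled language the direct form also gives the compiler regular, predictable access patterns, which is one plausible reason such a change can pay off; the sketch only demonstrates that the two forms are equivalent, not their relative cost.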
The individual achievements of the project are summarised below:
- The performance of the local data copying associated with the data transform between the linear and non-linear calculations was improved by around 40-50% by replacing the costly indirect addressing with direct access mechanisms. This optimisation has the most significant impact on performance at lower process counts: at larger core counts there is less data on each processor to be kept local in the transformation, so performance is influenced more by remote copies.
- The data decompositions used for the non-linear calculations were also improved. A new unbalanced decomposition was implemented in which slightly different amounts of data are allocated to each process. This alleviates the large communication costs previously observed at large process counts.
- A reduction in the overall runtime of the code of up to nearly 20% has been achieved for a representative benchmark. On HECToR phase 3, the overall reduction in runtime is 7% with 512 cores, 17% with 1536 cores, and nearly 20% with 2048 cores.
- The work on the unbalanced decomposition approach is likely to be applicable to optimising other scientific simulation codes where both real space and k-space data domains are used.
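The idea of an unbalanced decomposition mentioned in the list above can be sketched generically as follows. This is only an illustration of allocating slightly different amounts of data to each process; it is not the decomposition implemented in GS2, and the function names and the example sizes are hypothetical.

```python
def unbalanced_counts(n_items, n_procs):
    """Split n_items over n_procs so that block sizes differ by at most one.

    The first (n_items % n_procs) ranks receive one extra item.
    """
    base, extra = divmod(n_items, n_procs)
    return [base + 1 if rank < extra else base for rank in range(n_procs)]


def block_ranges(counts):
    """Turn per-rank counts into half-open (start, end) index ranges."""
    ranges = []
    start = 0
    for count in counts:
        ranges.append((start, start + count))
        start += count
    return ranges


# Hypothetical example: 10 items over 4 processes.
counts = unbalanced_counts(10, 4)
ranges = block_ranges(counts)
```

With these example values the first two ranks hold three items each and the last two hold two each, so every process has work even though the item count is not a multiple of the process count.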
Please see PDF or HTML for a report which summarises this project.
