Upgrading the FFTs in GS2
This Distributed Computational Science and Engineering (dCSE) project was concerned with improving the performance on HECToR Phase2a of the FFTs within the magnetic plasma turbulence modelling code GS2. The GS2 application uses a gyrokinetic approach to simulate micro-turbulence within magnetised fusion plasmas. The code spends a significant amount of its time performing fast Fourier transforms, and the overall aim of this project was to upgrade the code from the legacy FFTW2 library to the newer FFTW3 library. This work required an in-depth analysis of how GS2 uses FFTW. The outcomes of the project are summarised below.
GS2 has been re-engineered to gain the option of using FFTW3:
- Following a detailed analysis of GS2's call tree, the transformation routines have been re-implemented using FFTW3.
FFTW3's exploitation of SSE instructions has not improved GS2's performance:
- At the outset of the project it was expected that moving GS2 onto FFTW3 would reduce the time spent on the FFTs. This is because FFTW3 can utilise the SSE instructions of the Opteron processors deployed on the HECToR system, while FFTW2 cannot.
- Detailed analysis shows that, for the FFT calls relevant to GS2, the benefits from the SSE instructions are at best minimal. This was unexpected and should be of general interest to the HECToR user community. There is little benefit from SSE even when the problem size is reduced so that it fits into cache, or when only a single core of each processor is used so that the compute task has more memory bandwidth and level 3 cache.
Other significant GS2 performance issues have been identified:
- An in-depth analysis of the profile after upgrading the FFTW library has shown the data redistribution routines inside the transformation routines to be very costly.
- Profiling showed that, for small processor counts, the time was consumed inside a single loop that rearranges the data. This loop makes extensive use of indirect addressing, which cannot be optimised by the compiler. This was uncovered late in the project, and insufficient resources were available to fully understand this complex code.
- For a special case the project demonstrated that substantial performance gains can be achieved by removing the indirect addressing. For the subroutines c_redist_22_inv and c_redist_22, their time cost was reduced by almost a factor of 2 after removing the indirect addressing.
- Since indirect addressing is at the core of the application, a clean, well-engineered solution to this problem would be very worthwhile.
- It was also found that initialisation was expensive for the non-accelerated transform implementation. It would be interesting to understand the reason for this high cost.
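To illustrate the indirect-addressing issue described above, the following is a minimal sketch in C. It is not GS2's actual code. The type `redist_map` and the functions `redist_indirect` and `redist_direct_transpose` are hypothetical names introduced only for this illustration, and the choice of a matrix transpose as the "special case" is an assumption, since the summary does not say which case was optimised.

```c
#include <stddef.h>

/* Hypothetical index map: element i is read from src[from[i]]
 * and written to dst[to[i]]. Not taken from GS2. */
typedef struct {
    size_t n;            /* number of elements to move */
    const size_t *from;  /* source index for each element */
    const size_t *to;    /* destination index for each element */
} redist_map;

/* Indirect version: every load and store goes through an index array,
 * so the compiler cannot assume a regular access pattern. */
void redist_indirect(const redist_map *map, const double *src, double *dst)
{
    for (size_t i = 0; i < map->n; ++i) {
        dst[map->to[i]] = src[map->from[i]];
    }
}

/* Direct version for an assumed special case in which the map is known
 * to be a transpose of a row-major nrow x ncol array. The indices are
 * computed from the loop counters, so no index arrays are read. */
void redist_direct_transpose(size_t nrow, size_t ncol,
                             const double *src, double *dst)
{
    for (size_t r = 0; r < nrow; ++r) {
        for (size_t c = 0; c < ncol; ++c) {
            dst[c * nrow + r] = src[r * ncol + c];
        }
    }
}
```

When the index arrays encode a regular pattern such as this one, both functions produce the same result, but the direct form avoids the extra memory traffic of reading `from` and `to` and exposes the access pattern to the compiler. Whether a similar gain is available for the general redistribution in GS2 would depend on how regular its index maps are.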
Please see PDF or HTML for a report which summarises this project.