Replacing the FFT routines was reasonably straightforward: a naive approach using individual 1-D transforms was chosen. Multiple cosine transforms were also tested for the Chebyshev transform routines described in section 3 below, but they gave no significant performance benefit (testing was carried out using the SWT code on a 3072 x 325 x 1024 mode problem running on 6144 cores).
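The "individual 1-D transforms" pattern can be illustrated with a minimal sketch. It is not the SS3F or SWT implementation: a naive O(n^2) DFT in pure Python stands in for the library FFT call, and the names `dft_1d` and `transform_rows_individually` are hypothetical, chosen only for this example. It assumes that "multiple" transforms refers to handing many lines to the library in a single call, as opposed to one call per line as shown here.

```python
import cmath

def dft_1d(x):
    """Naive O(n^2) forward DFT of one sequence.

    Stand-in for a library 1-D FFT call (assumption: the real codes
    call an FFT library here instead).
    """
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]

def transform_rows_individually(data):
    """Apply an independent 1-D transform to each row, one call per row."""
    return [dft_1d(row) for row in data]

# Two short test lines: a unit impulse and a constant signal.
data = [
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
spectra = transform_rows_individually(data)
```

A multiple-transform interface would take the whole 2-D array in one call and return the same spectra; only the number of library calls and the plan layout differ, which is consistent with the small performance difference reported above.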
FFTW planning flags were also investigated; FFTW_ESTIMATE was found to perform just as well as FFTW_PATIENT. The relevant test was carried out with the SS3F code on the test case described in section 2.1 above, using 360 cores over 12 nodes.
In spite of the simplicity of this implementation, substantial performance gains were realised for both codes. Test results for SS3F are summarised in table 1.
A similar test was carried out for SWT, using a domain size of 3072 x 325 x 1024; a reduction of 53 % was observed in that case (see section 4.2 below).