Figure 6 shows absolute performance plotted against node count, for a test case (3072x321x1024 modes) chosen to be representative of the simulations planned using SWT in the near future. Node count is again chosen in preference to core count, because the original code was not quite able to run this case using all 32 cores per node, owing to its 1-D domain decomposition. The original code is represented in the figure with an empty square symbol. Replacing the original FFT routines with FFTW3 equivalents resulted in a significant performance improvement - using the same number of cores, wall clock time per time step fell by about 53%.
In addition, memory requirements were slightly reduced. This made it possible to reduce the required node count by 1 to 11, with all nodes but one fully populated (the last could almost certainly also have been fully utilised but for the fact that one grid dimension was not a multiple of 32). This (11 node) run is shown in the figure as an empty diamond; using 12 nodes instead improved performance by about 5% (this is not shown in figure 6).
Scaling to a higher number of processors was not possible in either of these cases, owing to the 1-D domain decomposition (nor would it have been possible to use fewer nodes; reducing the core count would only have been possible by reducing the number of cores used per node). This case therefore represents the limit of what could realistically be achieved using the 1-D decomposition. Note that - perhaps owing to the dimensions of the large static arrays used - both 1-D code versions could in fact only be successfully compiled for this particular case by using the pathscale compiler. This compiler is no longer supported on HECToR, and it is likely that it no longer produces optimal code for phase 3 of this machine. Some of the observed performance improvement is probably due to its replacement by the Cray compiler. Recall, however, that large performance benefits (>100 %) were obtained by replacing the same FFT routines in the SS3F code, even when the Cray compiler was used in both cases.
The implementation of a 2-D decomposition using the 2DECOMP&FFT library has made it possible to reduce by at least an order of magnitude the wall clock time required to run the most challenging case possible with the original code version, without giving up efficiency gains resulting from the modification of the FFTs. Note that even using 256 nodes (8192 cores), although scaling less efficiently than at lower node counts, the new version still requires about 34% fewer AUs to perform the same amount of work as the original (see figure 7), and does so about 30x faster. More important, however, is probably the fact that still larger grids are now thinkable, something that was not necessarily true prior to this work (eventually the memory required per core would have exceeded the total memory per node, at which point progress becomes impossible rather than merely very expensive - unless out of core algorithms or simply paging to disk are considered acceptable). The scaling of the simulation program with problem size, as well as processor count, is therefore of interest.
The four symbol styles shown in figure 8 illustrates the scaling of the SWT code with the total number of gridpoints , for four different simulation sizes, each over a range of decompositions. Low values on the -axis correspond to a large number of cores relative to the , while high values imply a lesser degree of parallelism. The -axis variable is a measure of efficiency - the number of gridpoints that can be advanced by one time step by expending one allocation unit. Perfect parallel scaling, therefore, would mean that all like symbols would align horizontally - i.e. efficiency independent of decomposition. In practice, we would expect to see a positive slope on the left hand side of the graph, and this is indeed the case. Above approximately collocation points per core, however, scaling is quite good, independently of the total problem size .
The horizontal lines in figure 8 show the expected maximum efficiency for each , based on the best efficiency obtained for the smallest (the 1024x221x384 grid), and assuming scaling. scaling would correspond to collapse of all symbols onto one curve (at least towards the right hand side of the graph), while - the computational complexity scaling of the fast Fourier transform - would correspond to good alignment of each set of symbols with the horizontal line of the same colour. It appears that the real scaling is intermediate between the two - the best efficiency obtained at a given shows a tendency to decline a little more slowly than implied by complexity. This is very encouraging, as it suggests that - apart from the superlinear scaling in that results from the increasing range of time scales (equation 7) - approximate Gustafson scaling holds. The wall clock time required per time step will therefore increase only very slowly if both and the processor count are increased in the same proportion; although the number of time steps required will increase.