Figure 4 shows parallel scaling for a planned simulation using SS3F. The original code is included for reference (open square), although the number of cores used per node had to be reduced from 32 to 6 in order to run this case. Node count is therefore chosen in preference to core count for the x-axis; this reflects the actual resources occupied in running this simulation in a way that core count does not.
The improvement in efficiency (i.e. performance at the minimum node count of 192, relative to the original code) appears at first glance to be entirely due to the use of all 32 cores per node: it is a factor of approximately 5, not far off 32/6 ≈ 5.3. If true, the contribution expected from the replacement of the original FFT routines, and confirmed for smaller test cases, is absent. However, the 32 AMD Interlagos cores on each node share many resources, notably L3 cache and interconnect bandwidth, so this may not be a fair comparison.
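The arithmetic behind this comparison can be sketched as follows. This is an illustration only: it assumes the observed speedup is exactly 5, whereas the text gives it only as approximate, and the function and variable names are hypothetical rather than taken from SS3F.

```python
# Values quoted in the text; OBSERVED_SPEEDUP is an assumption, since the
# text gives the factor only as "approximately 5".
CORES_PER_NODE = 32       # cores available on each node
ORIGINAL_CORES_USED = 6   # cores per node the original code could use here
MIN_NODES = 192           # smallest node count in the scaling study
OBSERVED_SPEEDUP = 5.0    # assumed exact for illustration


def core_ratio(cores_full, cores_used):
    """Speedup expected purely from using more cores on each node."""
    return cores_full / cores_used


def residual_factor(observed_speedup, ratio):
    """Part of the observed speedup not explained by the core-count ratio.

    A value near 1 means the extra cores account for essentially all of
    the gain; a value above 1 would indicate an additional contribution.
    """
    return observed_speedup / ratio


ratio = core_ratio(CORES_PER_NODE, ORIGINAL_CORES_USED)
residual = residual_factor(OBSERVED_SPEEDUP, ratio)

# Cores actually active at the minimum node count in each configuration.
active_original = MIN_NODES * ORIGINAL_CORES_USED
active_new = MIN_NODES * CORES_PER_NODE

print(f"core-count ratio 32/6 : {ratio:.2f}")
print(f"residual factor       : {residual:.2f}")
print(f"active cores, original: {active_original}")
print(f"active cores, new     : {active_new}")
```

Under these assumptions the residual factor is slightly below 1, which is consistent with the observation that no separate contribution from the FFT replacement is visible at this node count. Because the cores on a node share cache and interconnect bandwidth, the core-count ratio is at best an upper bound on what the extra cores alone could deliver, so the residual should be read with that caveat.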
Scaling to over 12000 cores is efficient, but in this case the same good efficiency does not extend to 18000 cores.