

Initial performance analysis

Both of the programs described above have properties that ultimately limit their parallel scalability, and these limits had, at the time of writing, been reached. In pursuit of the lines of inquiry opened in [4] and [9], simulations were planned that would require grids of approximately 3 and 1 billion collocation points, using SS3F and SWT respectively. In both cases, more cores would need to be allocated than could in fact be used, because the 1-D domain decomposition requires each core to store and process an integer number of planes.
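As a minimal, hedged illustration of this constraint (the function and grid sizes below are hypothetical, not taken from either code), the following C fragment shows what a 1-D decomposition over whole planes implies when the core count does not divide the number of planes: either some processes carry an extra plane, or cores are left idle.

#include <stdio.h>

/* Hypothetical sketch: distribute nplanes x-z planes over ncores
   processes, an integer number of planes each (1-D decomposition). */
static void plane_distribution(int nplanes, int ncores)
{
    int base  = nplanes / ncores;   /* planes held by every process      */
    int extra = nplanes % ncores;   /* processes holding one plane more  */

    if (base == 0)
        printf("%d cores: only %d can be used, %d sit idle\n",
               ncores, nplanes, ncores - nplanes);
    else if (extra == 0)
        printf("%d cores: balanced, %d planes each\n", ncores, base);
    else
        printf("%d cores: imbalanced, %d processes hold %d planes, "
               "the rest hold %d\n", ncores, extra, base + 1, base);
}

int main(void)
{
    plane_distribution(1440, 360);   /* divides evenly                */
    plane_distribution(1440, 384);   /* does not: imbalance results   */
    return 0;
}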

Initial investigations also showed that neither code performed particularly well on the HECToR architecture (phase 2b at the time), either in terms of per-processor performance or of parallel scaling, and that this had worsened relative to phase 2a. Test cases smaller than the planned problem sizes (3 and 1 billion grid points) were used, as scaling tests would not otherwise be meaningful; in particular, the core count would be almost meaningless, since many if not most cores would be left idle.

Figure 1: Parallel scaling of SS3F [SS3F_scaling1.eps]

Figure 1 shows parallel scaling for SS3F on a grid (128x720x1440 modes) previously used to obtain the results published in [4]. This appears acceptable; however, it is not possible to increase the core count for this problem beyond 360 without suffering load imbalance: perfect load balance requires that the second and third dimensions both be divisible by the core count, the former without dealiasing (i.e. 1080), the latter with (1440). At the beginning of the project this code in fact permitted only very slight load imbalance (a fixed computational load was enforced for all but one process). It is reasonable to assume that merely loosening this restriction would not be a particularly good use of development effort; for the test case considered here, for instance, each process is responsible for 3 $x-z$ planes, so an imbalance of a single plane would be fairly significant.
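The figure of 360 follows directly from the divisibility requirement: the largest core count dividing both 1080 and 1440 is their greatest common divisor. A minimal sketch of that calculation (illustrative only, not part of SS3F):

#include <stdio.h>

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

int main(void)
{
    /* Second dimension without dealiasing (1080), third with (1440):
       perfect load balance needs a core count dividing both.        */
    printf("largest balanced core count: %d\n", gcd(1080, 1440)); /* 360 */
    return 0;
}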

Figure 2: Parallel scaling of SWT, small grid [SWT_scaling_small_grid.eps]
Figure 3: Parallel scaling of SWT, medium grid [SWT_scaling_medium_grid.eps]

The results for SWT (figures 2 and 3) appear less satisfactory, and scaling is notably poorer on the XE6 than on the XT4 for both simulation sizes, despite the XT4 data having been collected before the introduction of the Gemini interconnect. It would appear that the SWT parallelisation may not be well suited to architectures with large numbers of cores per CPU. Performance per core, however, was superior on phase 2b.

Both codes originally implemented a parallel transpose using MPI non-blocking sends and receives. On Cray machines, this approach performs less well than MPI_ALLTOALLV (used by 2DECOMP&FFT), so some parallel efficiency gains are likely to arise from this substitution.
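A hedged sketch of the substitution is shown below. The buffer layout, counts and displacements are assumed to have been computed from the decomposition (they are placeholders here, not the arrays used in either code); the point is simply that the per-partner MPI_Isend/MPI_Irecv/MPI_Wait pattern is replaced by a single collective of the kind 2DECOMP&FFT uses.

#include <mpi.h>

/* Sketch only: exchange one block with every other rank as part of a
   parallel transpose.  The counts/displacements describe how much of
   sendbuf goes to, and how much of recvbuf comes from, each rank.    */
void transpose_exchange(double *sendbuf, int *sendcounts, int *sdispls,
                        double *recvbuf, int *recvcounts, int *rdispls,
                        MPI_Comm comm)
{
    /* Originally: one non-blocking send and receive per partner rank,
       followed by MPI_WAIT on each request.  On the Cray XE6 the
       collective below tends to perform better.                      */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);
}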

In-depth profiling of the SWT code, using the 864x325x324 grid on 324 cores, was carried out by Dr. Ning Li of the NAG HECToR CSE team, and the results (obtained using CrayPAT) are shown below.

Time % | Time | Imb. Time | Imb. | Calls |Experiment=1
| | | Time % | |Group
| | | | | Function
| | | | | PE='HIDE'
100.0% | 110.186539 | -- | -- | 85892.9 |Total
|----------------------------------------------------------------
| 63.9% | 70.354830 | -- | -- | 7983.9 |USER
||---------------------------------------------------------------
|| 19.7% | 21.745259 | 5.791653 | 21.1% | 1.0 |MAIN_
|| 13.8% | 15.219145 | 8.783854 | 36.7% | 2181.8 |radb3_
|| 9.1% | 10.063469 | 8.069853 | 44.6% | 722.2 |radf3_
|| 3.6% | 3.998016 | 2.350367 | 37.1% | 180.6 |radb4_
|| 3.3% | 3.680287 | 2.335971 | 38.9% | 180.6 |radf4_
|| 3.1% | 3.427989 | 1.016561 | 22.9% | 722.2 |passb3_
|| 2.8% | 3.075736 | 1.263592 | 29.2% | 722.2 |passf3_
|| 2.6% | 2.817066 | 1.345977 | 32.4% | 545.4 |radb4l_
|| 1.7% | 1.910091 | 1.269967 | 40.1% | 180.6 |radf4f_
|| 1.0% | 1.110035 | 0.026740 | 2.4% | 545.4 |rfftb1_
||===============================================================
| 25.9% | 28.577012 | -- | -- | 77845.0 |MPI
||---------------------------------------------------------------
|| 25.4% | 27.943499 | 18.023463 | 39.3% | 38880.0 |MPI_WAIT
||===============================================================
| 10.2% | 11.254697 | -- | -- | 64.0 |MPI_SYNC
||---------------------------------------------------------------
|| 9.9% | 10.945641 | 10.932999 | 93.6% | 11.0 |mpi_reduce_(sync)
|================================================================

These results reveal that 25% of the run time was spent in MPI_WAIT, and a further 10% in MPI_SYNC. This suggests load imbalance, which a 2-D domain decomposition could reduce by increasing the number of ways in which a grid can be divided evenly for a given processor count.
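As a hedged sketch of why this helps (using the SS3F grid dimensions from figure 1 purely as an example, and a deliberately simplified balance criterion rather than the exact rules of any particular library), the fragment below counts the process grids available to a 2-D decomposition at a core count that a 1-D decomposition cannot balance:

#include <stdio.h>

/* Simplified criterion (an assumption for illustration): a
   p_row x p_col process grid is balanced if p_row divides the first
   decomposed dimension and p_col divides the second.                */
static int count_balanced_grids(int ncores, int dim1, int dim2)
{
    int count = 0;
    for (int p_row = 1; p_row <= ncores; p_row++) {
        if (ncores % p_row)
            continue;
        int p_col = ncores / p_row;
        if (dim1 % p_row == 0 && dim2 % p_col == 0)
            count++;
    }
    return count;
}

int main(void)
{
    int dim1 = 1080, dim2 = 1440, ncores = 720;   /* SS3F-like example */

    /* 1-D decomposition: the whole core count must divide both dims. */
    printf("1-D balanced on %d cores: %s\n", ncores,
           (dim1 % ncores == 0 && dim2 % ncores == 0) ? "yes" : "no");

    /* 2-D decomposition: many more balanced process grids exist.     */
    printf("balanced 2-D process grids on %d cores: %d\n",
           ncores, count_balanced_grids(ncores, dim1, dim2));
    return 0;
}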

Of the routines in the USER section, all but the main program (much of whose runtime is spent in one-time initialisations and is not significant in a production run) belong to the existing FFT package (vecfft). It is therefore important that these routines be efficient.

However, profiling of the most heavily used FFT routine (radb3) using hardware performance counters showed that:

$\Rightarrow$ D1+D2 cache utilisation is 90.1% (which is quite poor).
$\Rightarrow$ There is no SSE vectorisation at all.

The first result may explain the good per-core performance on phase 2b relative to phase 2a: poor caching means the performance of the FFT routines is likely bounded by memory access speed, and the memory bandwidth per core of phase 2b is superior. Note that the vecfft routines were designed to perform optimally on traditional vector processors (SWT and SS3F have previously been run on the X2 component of the HECToR service). The second result shows that the vecfft optimisations are unsuitable for the short SSE vector instructions.
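As a schematic, hedged illustration of the memory-access pattern involved (the loops below are a stand-in, not code from vecfft), routines written for vector machines typically apply each pass across the whole batch of transforms at once, which keeps vector lengths long but streams the entire data set from memory on every pass; on a cache-based processor, working on a cache-sized block of transforms through all passes gives far better reuse.

/* Illustrative only: apply npass sweeps to lot transforms of length n. */

/* Vector-machine style: each pass sweeps the whole batch, so the full
   array is streamed from memory npass times once it exceeds cache.    */
void passes_batchwise(double *data, int lot, int n, int npass)
{
    for (int p = 0; p < npass; p++)
        for (long j = 0; j < (long)lot * n; j++)
            data[j] = 0.5 * (data[j] + 1.0);    /* stand-in for a pass */
}

/* Cache-blocked style: run all passes on a block of transforms small
   enough to stay cache-resident before moving to the next block.      */
void passes_blocked(double *data, int lot, int n, int npass, int block)
{
    for (int k0 = 0; k0 < lot; k0 += block) {
        int kmax = (k0 + block < lot) ? k0 + block : lot;
        for (int p = 0; p < npass; p++)
            for (long j = (long)k0 * n; j < (long)kmax * n; j++)
                data[j] = 0.5 * (data[j] + 1.0);
    }
}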

A good case can therefore be made that the serial performance of both codes is poor on HECToR; however, this appears to be largely attributable to the FFT routines they use, which perform poorly and dominate the run time.


R.Johnstone 2012-07-31