
Benchmarking and Performance

The first realistic simulation performed with the new band-parallelism was the al1x1 benchmark.

The test simulations were restricted to small cases (8-atom silicon and the al1x1 benchmark) and small numbers of cores ($\leq$8), so that the results could be compared in detail with known results and with serial calculations; once testing was complete we moved on to larger systems. Table 4.2 shows the parallel scaling of the al1x1 benchmark in the new band-parallel mode.


Table 4.2: Parallel scaling of the al1x1 benchmark in band-parallel mode (DM denotes the density-mixing SCF algorithm).

  cores   DM efficiency
      2             65%
      4             50%
      8             35%



Table 4.3: Execution time and parallel efficiency for the 33-atom TiN benchmark (8 k-points). Times are for 40 SCF cycles using the DM algorithm. The 8-core calculation runs purely k-point parallel; the others use mixed band and k-point parallelism.

  cores   Time (s)   Band-parallel efficiency
      8    5085.04   (purely k-point parallel)
     16    3506.66   72%
     32    2469.84   51%
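The band-parallel efficiencies in table 4.3 are consistent with taking the 8-core, purely k-point-parallel run as the baseline, i.e. $\eta_{N} = 8\,T_{8}/(N\,T_{N})$ with $T_{N}$ the time on $N$ cores:

\begin{displaymath}
\eta_{16} = \frac{8 \times 5085.04}{16 \times 3506.66} \approx 72\%,
\qquad
\eta_{32} = \frac{8 \times 5085.04}{32 \times 2469.84} \approx 51\%.
\end{displaymath}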


The performance was analysed using Cray's Performance Analysis Tool (PAT) version 4.2; earlier versions had bugs which prevented them from working properly with the Pathscale compiler and/or Castep. It was also necessary to create, in Castep's obj/linux_x86_64_pathscale/Utility directory, symbolic links to the source files in Source/Utility, and similarly for the Fundamental and Functional directories. The instrumented executable was built with:

pat_build -D trace-max=2048 -u -g mpi,blas,lapack,math castep
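The symbolic links described above could be created with a short script along the following lines. This is only a sketch: it assumes the Source/<subdir> and obj/linux_x86_64_pathscale/<subdir> layout named above and that it is run from the top-level Castep directory.

# Sketch of the symbolic-link workaround needed for PAT; the directory
# names follow the text above and may need adjusting for a given build.
from pathlib import Path

castep_root = Path(".")    # assumption: run from the top-level Castep directory
obj_dir = castep_root / "obj" / "linux_x86_64_pathscale"

for subdir in ("Utility", "Fundamental", "Functional"):
    src_dir = castep_root / "Source" / subdir
    dst_dir = obj_dir / subdir
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in src_dir.iterdir():
        if src.is_file():                       # link every source file present
            link = dst_dir / src.name
            if not link.exists():
                link.symlink_to(src.resolve())  # point back to Source/<subdir>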

We profiled a Castep calculation on the al1x1 benchmark parallelised over 16 nodes (4-way band-parallel, 4-way gv-parallel). The subroutine with the most overhead from the band-parallelism was wave_rotate, whose trace output was:

|   o-> wave_orthonormalise_over_slice         1290                           |
|   o-> electronic_find_eigenslice             2580                           |
|   o-> wave_rotate_slice                      3870        3870     40.06s    |
|   o-> wave_nullify_slice                     3870        3870      0.01s    |
|   o-> wave_allocate_slice                    3870        3870      0.01s    |
|   o-> wave_initialise_slice                  3870        3870      1.75s    |
|   o-> comms_reduce_bnd_logical               3870        3870      0.28s    |
|   o-> comms_reduce_bnd_integer               3870        3870      0.10s    |
|   o-> comms_send_complex                    23220       23220      6.48s    |
|   o-> comms_recv_complex                    23220       23220      7.79s    |
|   o-> wave_copy_slice_slice                  3870        3870      1.09s    |
|   o-> wave_deallocate_slice                  3870        3870      0.00s    |

This is to be expected, since these wavefunction rotations scale cubically with system size, and also incur a communication cost when run band-parallel. Some time was spent optimising this subroutine, and in the end we settled on a refactoring of the communications whereby each node does $\log_2(\mathrm{nodes})$ communication phases, the first phase involving an exchange of half the transformed data, and each subsequent phase exchanging half the data of the previous one. This scheme is illustrated in figure 4.1.

Figure 4.1: The new communication pattern, illustrated for seven nodes in the band group. Nodes with work still to do are coloured blue, and nodes that have finished are coloured yellow. At each of the three communication phases each group of nodes is split to form two child groups. Each node in a child group transforms its local data to produce its contribution to the nodes in the other child group, the `sibling group'; it then sends this data to one of the nodes in that group, and receives the sibling node's contribution to all of the nodes in the child group.
\includegraphics[width=0.9\textwidth]{rotate_comms.eps}
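The schedule can be illustrated with the following sketch. It assumes, for simplicity, a power-of-two band group (the implementation, as figure 4.1 shows for seven nodes, also handles other group sizes) and merely prints which nodes exchange what fraction of a full, non-distributed wavefunction in each phase:

# Sketch of the phased exchange used in the refactored wave_rotate.
# Illustration only: intended for a power-of-two band group.
def exchange_schedule(nodes):
    """Return a list of phases; each phase is a list of
    (sender, receiver, fraction) tuples, where fraction is the share of a
    full non-distributed wavefunction sent in that message."""
    phases = []
    groups = [list(range(nodes))]      # start with one group holding every node
    fraction = 0.5                     # phase 1 exchanges half the transformed data
    while any(len(g) > 1 for g in groups):
        phase = []
        next_groups = []
        for g in groups:
            half = len(g) // 2
            child_a, child_b = g[:half], g[half:]
            # Pair node i of one child group with node i of its sibling group;
            # each pair swaps the contributions destined for the other group.
            for a, b in zip(child_a, child_b):
                phase.append((a, b, fraction))
                phase.append((b, a, fraction))
            next_groups.extend(cg for cg in (child_a, child_b) if cg)
        phases.append(phase)
        groups = next_groups
        fraction /= 2                  # each later phase moves half as much data
    return phases

if __name__ == "__main__":
    for i, phase in enumerate(exchange_schedule(8), start=1):
        print(f"phase {i}:")
        for src, dst, frac in phase:
            print(f"  node {src} -> node {dst}  ({frac:.3f} of a full wavefunction)")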

This communication pattern improved the speed of the wave_rotate subroutine considerably, but at the cost of increased storage: the first phase involves the exchange of half of the newly transformed data, so the send and receive buffers together constitute an entire non-distributed wavefunction. As the band-parallelisation is only efficient over relatively small numbers of nodes (typically $\leq 16$) this has not proved too much of a hindrance thus far, but it would be wise to restrict the buffer size in future, perhaps to a low multiple of a single node's storage, at the cost of slightly more communication phases. Such a change could, of course, be made contingent on the value of the opt_strategy_bias parameter.

Once wave_rotate had been optimised, Castep's performance was measured on the al3x3 benchmark. As can be seen from figure 4.2, the basic band-parallelism implemented in this stage of the project improved Castep's scaling considerably. Using linear interpolation of the data points, we estimated that the maximum number of PEs that can be used with 50% or greater efficiency has increased from about 221 to about 436 (without using the SMP optimisations).

Figure 4.2: Computational time (4.2(a)) and parallel scaling (4.2(b)) of the band-parallel version of Castep, compared to ordinary Castep 4.2, for 10 SCF cycles of the standard al3x3 benchmark.
(a) Castep performance: \includegraphics[width=0.9\textwidth]{init_band_par.eps}
(b) Castep parallel efficiency relative to 16-PE Castep 4.2: \includegraphics[width=0.9\textwidth]{init_band_scaling.eps}
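For reference, the 50%-efficiency crossover can be estimated by linear interpolation along the following lines; the (cores, efficiency) values below are illustrative placeholders only, not the measured data plotted in figure 4.2:

# Estimate where parallel efficiency drops to 50% by linear interpolation.
# The data points here are placeholders for illustration, NOT the measured
# values behind figure 4.2.
import numpy as np

cores      = np.array([  64,  128,  256,  512])   # hypothetical PE counts
efficiency = np.array([0.90, 0.75, 0.60, 0.45])   # hypothetical efficiencies

# np.interp needs increasing x values, so interpolate cores as a function
# of efficiency with both arrays reversed (efficiency falls as cores grow).
pes_at_50_percent = np.interp(0.5, efficiency[::-1], cores[::-1])
print(f"estimated maximum PEs at >= 50% efficiency: {pes_at_50_percent:.0f}")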

