The test simulations were restricted to small test cases (8-atom
silicon, and the al1x1 benchmark) and numbers of cores (8),
where the results could be compared in detail to known results and
serial calculations, but once testing was complete we were able to
move to larger systems. Table
4.2 shows the performance improvement for the al1x1
benchmark using the new band-parallel mode.
|
The performance was analysed using Cray's Performance Analysis Tool (PAT) version 4.2 (earlier versions had bugs which prevented them being used with the Pathscale compiler and/or Castep properly). It was also necessary to create symbolic links to the Castep source files in Source/Utility in Castep's obj/linux_x86_64_pathscale/Utility and similarly for Fundamental and Functional.
pat_build -D trace-max=2048 -u -g mpi,blas,lapack,math castep
We profiled a Castep calculation on the al1x1 benchmark parallelised over 16 nodes (4-way band-parallel, 4-way gv-parallel). The subroutine with the most overhead from the band-parallelism was wave_rotate, the Trace output of which was:
| o-> wave_orthonormalise_over_slice 1290 | | o-> electronic_find_eigenslice 2580 | | o-> wave_rotate_slice 3870 3870 40.06s | | o-> wave_nullify_slice 3870 3870 0.01s | | o-> wave_allocate_slice 3870 3870 0.01s | | o-> wave_initialise_slice 3870 3870 1.75s | | o-> comms_reduce_bnd_logical 3870 3870 0.28s | | o-> comms_reduce_bnd_integer 3870 3870 0.10s | | o-> comms_send_complex 23220 23220 6.48s | | o-> comms_recv_complex 23220 23220 7.79s | | o-> wave_copy_slice_slice 3870 3870 1.09s | | o-> wave_deallocate_slice 3870 3870 0.00s |
This is to be expected, since these wavefunction rotations scale
cubically with system size, and also incur a communication cost when
run band-parallel. Some time was spent optimising this subroutine, and
in the end we settled on a refactoring of the communications whereby
each node does communication phases, the first phase
involving an exchange of half the transformed data, and each
subsequent phase exchanging half the data of the previous one. This
scheme is illustrated in figure 4.1.
![]() |
This communication pattern improved the speed of the
wave_rotate subroutine considerably, but at the cost of
increased storage. Indeed the first phase involves the exchange of
half of the newly transformed data, so the send and receive buffers
constitute an entire non-distributed wavefunction. As the
band-parallelisation is only efficient over relatively small numbers
of nodes (typically ) this has not proved too much of a
hindrance thus far, but it would be wise to restrict this in future,
perhaps to a low multiple of a single node's storage, at the cost of
slightly more communication phases. Such a change could, of course, be
made contingent on the value of the opt_strategy_bias
parameter.
Once wave_rotate had been optimised, Castep's performance was measured on the al3x3 benchmark. As can be seen from figure 4.2, the basic band-parallelism implemented in this stage of the project improved Castep's scaling considerably. Using linear interpolation of the data points we estimated that the maximum number of PEs that can be used with 50% or greater efficiency has been increased from about 221 to about 436 (without using the SMP optimisations).
[Castep performance]
![]() ![]() |