The scheme outlined above works well for calculations over 2 or 3 nodes, but as the number of nodes increases the time taken by the communication calls also increases. If we look at the profiles from Castep's internal trace for wave_rotate_slice on a 4-node calculation (table 4.1) we can see that the higher-numbered nodes are spending more time in the subroutine than the lower-numbered ones.
[Table 4.1: time spent in wave_rotate_slice by each node of a 4-node calculation, from Castep's internal trace]
The reason for this disparity is that by constructing all of the data for a given node before moving on to the next node, we 'serialise' the communication calls: each communication phase is one-to-many or many-to-one, and because Castep uses standard MPI_Send and MPI_Recv calls, every node has to wait for all of the preceding nodes to finish their communications before it can begin. This process proved to be a severe bottleneck for calculations over large numbers of nodes.
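To make the bottleneck concrete, here is a minimal C/MPI sketch of such a many-to-one phase. It is not Castep's wave_rotate_slice (the block size, data type and buffer names are arbitrary assumptions); it only illustrates the blocking pattern described above, in which the root posts its receives in strict rank order, so a higher-numbered node's send cannot complete until every lower-numbered node has been serviced (at least for messages too large for MPI to buffer internally).

```c
/* Sketch of a serialised many-to-one phase (hypothetical sizes and data,
 * not Castep's actual routine): rank 0 collects one block from every other
 * rank in rank order using blocking MPI_Send/MPI_Recv. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1024            /* assumed size of each node's contribution */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *block = malloc(BLOCK * sizeof(double));
    for (int i = 0; i < BLOCK; i++) block[i] = rank;   /* dummy local data */

    if (rank == 0) {
        double *all = malloc((size_t)nprocs * BLOCK * sizeof(double));
        for (int i = 0; i < BLOCK; i++) all[i] = block[i];
        /* Receives are posted for ranks 1, 2, 3, ... in turn, so each
         * higher-numbered node waits for all lower-numbered nodes to be
         * serviced before its own send can complete. */
        for (int src = 1; src < nprocs; src++)
            MPI_Recv(all + (size_t)src * BLOCK, BLOCK, MPI_DOUBLE,
                     src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        free(all);
    } else {
        MPI_Send(block, BLOCK, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    free(block);
    MPI_Finalize();
    return 0;
}
```

This ordering is why the higher-numbered nodes in table 4.1 spend the longest in the subroutine: each one waits for every node before it.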
Removing this bottleneck is straightforward: we simply copy the communication pattern from the dot-all subroutines. The communication pattern is now a cyclic one, where every node constructs the data for the node $n$ hops prior to it. For each $n$ there are two sets of communications, each of two phases. The first communication set consists of each node sending its rotation matrix $n$ places to the left, and receiving a rotation matrix from $n$ places to the right (there is also some exchange of meta-data). Each node then applies the rotation matrix it received to its local data. In the second communication set each node passes the result of its rotation $n$ places to the right, and receives the contribution to its own transformed data from $n$ places to the left.
All of the communication phases are now point-to-point, and many such communications can take place simultaneously.
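A minimal sketch of this cyclic exchange is given below. It is again plain C with MPI, with hypothetical block sizes, buffer names and a stand-in rotation kernel rather than the actual wave_rotate_slice data structures, and it uses combined MPI_Sendrecv calls for each exchange where the real code may pair separate sends and receives. For each hop distance $n$ the first exchange swaps rotation blocks with the nodes $n$ places away, the received block is applied to the local data, and a second exchange returns the resulting contribution.

```c
/* Sketch of the cyclic, point-to-point pattern described above (hypothetical
 * names and sizes; not the actual Castep routine). Each node holds a local
 * data block and one MxM rotation block per node; for each hop distance n it
 * exchanges a rotation block with the nodes n places away, applies the block
 * it received, and exchanges the resulting contribution back. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define M 64   /* assumed size of each node's local block */

/* result += rot * data: a stand-in for applying one rotation block */
static void apply_rotation(const double *rot, const double *data, double *result)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            result[i] += rot[i * M + j] * data[j];
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *data       = calloc(M, sizeof(double));   /* dummy local slice   */
    double *result     = calloc(M, sizeof(double));   /* transformed slice   */
    double *rot        = calloc((size_t)nprocs * M * M, sizeof(double));
                                  /* one rotation block per source node      */
    double *rot_in     = malloc((size_t)M * M * sizeof(double));
    double *contrib    = calloc(M, sizeof(double));
    double *contrib_in = malloc(M * sizeof(double));

    /* Hop 0: the purely local part of the rotation. */
    apply_rotation(rot + (size_t)rank * M * M, data, result);

    for (int n = 1; n < nprocs; n++) {
        int left  = (rank - n + nprocs) % nprocs;   /* n places to the left  */
        int right = (rank + n) % nprocs;            /* n places to the right */

        /* First exchange: send the block of our rotation matrix that acts on
         * the data held n places to the left, and receive the corresponding
         * block from the node n places to the right. */
        MPI_Sendrecv(rot + (size_t)left * M * M, M * M, MPI_DOUBLE, left, 0,
                     rot_in, M * M, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Apply the received block to our local data: this is the
         * contribution to the transformed data of the node to the right. */
        memset(contrib, 0, M * sizeof(double));
        apply_rotation(rot_in, data, contrib);

        /* Second exchange: pass the result n places to the right, and receive
         * the contribution to our own transformed data from n places left. */
        MPI_Sendrecv(contrib, M, MPI_DOUBLE, right, 1,
                     contrib_in, M, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 0; i < M; i++)
            result[i] += contrib_in[i];
    }

    free(data); free(result); free(rot);
    free(rot_in); free(contrib); free(contrib_in);
    MPI_Finalize();
    return 0;
}
```

Because each exchange pairs a single sender with a single receiver, the communications at a given hop distance can proceed concurrently rather than queuing behind one another.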