Table: Benchmark times (in seconds) for wave_write.
No. processing elements Version 5.5 write Collectives write
24 7.09 7.63
48 7.23 7.72
96 9.85 7.88
192 19.49 8.13
384 70.48 8.42

Often, the greatest amount of distribution in a CASTEP job is over G-vectors. Because of this, our initial prototype was to rewrite wave_write to use an MPI gather over G-vectors. Our approach is to collect the data for each band at a particular k-point and spin and then pass each band or block of bands up to the nodes that are both G-vector and band masters. In turn the data is to be passed to the root node for output. The figures in Table 1 show timings, in seconds, for wave_write from the CASTEP 5.5 release and our prototype code that uses collectives.

These benchmarks were run on HECToR Phase 2b with the Gemini interconnect. It should be noted that the time spent in disk I/O for all runs in Table 1 is approximately 6.4s, as determined by the internal trace function of CASTEP. The test case was based on an aluminium oxide `2x2' slab (subsequently referred to as al2x2) that contains 5 k-points,  40000 G-vectors and 288 bands. Only G-vector parallel distribution was used for the runs. The code has been checked for correctness with combinations of all three parallel distribution strategies, producing a correctly formatted CASTEP .check file that is compatible with existing analysis tools. For comparison, CASTEP's ground state SCF calculation on the al2x2 system takes approximately 1300s on 120 processing elements (5-way k-point distribution and 24-way G-vector) and approximately 16s is spent in 2 calls to wave_write.

The same was done for wave_read, rewriting it such that the data is distributed with an MPI scatter over G-vectors. The wave_read routine also contains extra machinery to determine how to distribute the data if the parallel data distribution has changed. Our initial prototype wave_read, provided substantial improvement over that in CASTEP 5.5. However, the collectives version of wave read was taking more than 4 times longer than the corresponding wave_write in some cases (see Table 2 ).

Further investigation, helped greatly by CASTEP's internal trace profiling routines showed some inefficiencies that could easily be rectified. First was that one of the internal arrays for reading in the wave function data was specified as (a block of) bands by G-vectors. Most fortran implementations store arrays in contiguous memory in array element order. The extra stride through memory during the fortran read statement for each band’s set of G-vectors was causing slow running of the code. This was made more efficient by simply switching the order of the array indices so that G-vectors came first. The second optimisation came from rewriting the data handling for reordering the wave function data if the G-vector distribution has changed between runs. In the existing code a conditional statement is evaluated for every plane wave in the read-in data, nested within loops over the nodes in the G-vector group for the current run and the number of bands. The mapping for redistributing the data over G-vectors is independent of bands, so can be done outside of the band loop. In our optimised version of the routine, a many-one vector subscript is used to store the mapping between the read-in data and the distribution for the current run vector subscript. The array of data to be scattered over G-vectors is then assigned in a single statement with an indirect index for the read-in data array.

Table: Benchmark times (in seconds) for wave_read and, for comparison, wave_write.
No. processing Version 5.5 Prototype Optimised Collectives
elements read collectives read collectives read write
24 16.78 14.74 9.15 7.63
48 22.03 15.97 9.20 7.72
96 32.90 18.57 9.23 7.88
192 57.61 23.94 9.36 8.13
384 113.47 34.80 9.60 8.42

Table 2 lists timings for wave_read in its original, prototype and optimised forms. The tests were performed using the al2x2 system, as outlined in 1.2. Time spent in disk I/O in each wave_read was approximately constant at a total of 7.0s. Note that it is expected that wave_read takes longer than the corresponding wave_write because all the handling for reordering data for changed parallel distributions is done in the read routine.

In analogy to the work on wave_read and wave_write, the corresponding read/write routines for the electron density and potential were modified to use MPI collectives. This task was straightforward as these objects are independent of k-points and bands.

The routine wave_apply_symmetry is an essential part of phonon calculations. It transforms a wavefunction from one k-point set to another one related by symmetry operations. For certain classes of large phonon calculations, the version of wave_apply_symmetry in CASTEP version 5.5 fails on HECToR phase 2a with an MPI unexpected buffer error. This error occurs because of the many point-to-point MPI communications that are present. To avoid use of the unexpected buffer, it was decided to use MPI gather over G-vectors for each band's data before applying the symmetry operation. The data is then scattered back to the appropriate nodes. It is possible that a symmetry transformation maps data between k-points that are stored on different nodes. In this case the gathered band data is sent to the node that holds that k-point, where the symmetry operations are applied before scattering.

The modified version of wave_apply_symmetry was then applied in situations where the unexpected buffer error had been reproduced. As expected, the jobs ran to completion without having to alter the MPICH_UNEXPECTED_BUFFER variable. This opens up the ability to perform large phonon calculations without having to dedicate unfeasible amounts of RAM to the unexpected buffer.

At the request of Phil Hasnip of the CASTEP Developers Group, the lessons learned in modifying the above routines were also applied to wave_reassign. This routine is called under the assumption that the k-point and band distribution are unchanged. It takes the data from an existing wavefunction and maps it onto a different plane-wave basis set. It is mostly used during variable cell geometry optimisations and molecular dynamics. The routine approximates the process of wave_write followed by wave_read but without any disk I/O. Performance gains in line with those detailed above were obtained.