The changes to the wave and electronic modules were rather more extensive, and were difficult to test in isolation. To test and debug these modules, a short Castep calculation was run in serial to create a wavefunction and density in a Castep .check file, and the calculation was then restarted from that checkpoint file in both serial and band-parallel modes. Because both jobs started from the same known point, the calculations could be compared in detail, right down to individual wavefunction coefficients where necessary. This, coupled with Castep's built-in trace functionality, enabled bugs to be found and fixed quickly.
The ability to compare data in detail, even mid-calculation, proved necessary, since several bugs were found and fixed that did not affect the final, converged result but either hindered convergence, or could give rise to incorrect answers in unusual circumstances.
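A coefficient-level comparison of this kind can be sketched in a few lines of Python. This is only an illustration: the function name `first_mismatch`, the tolerances, and the coefficient values are all hypothetical, and how the coefficients are extracted from the two Castep runs is not shown here.

```python
def first_mismatch(reference, candidate, rel_tol=1e-12, abs_tol=1e-14):
    """Return (index, ref, cand) for the first coefficient that differs, else None."""
    if len(reference) != len(candidate):
        raise ValueError("coefficient arrays have different lengths")
    for index, (ref, cand) in enumerate(zip(reference, candidate)):
        # Allow for rounding noise, scaled to the size of the coefficient.
        tolerance = max(abs_tol, rel_tol * max(abs(ref), abs(cand)))
        if abs(ref - cand) > tolerance:
            return index, ref, cand
    return None


# Made-up complex coefficients, purely for illustration (not Castep data).
serial_coeffs = [0.5 + 0.1j, -0.25 + 0.0j, 0.0 + 0.75j]
parallel_ok = [0.5 + 0.1j, -0.25 + 1e-16j, 0.0 + 0.75j]
parallel_bad = [0.5 + 0.1j, -0.25 + 0.0j, 0.0 + 0.74j]

print(first_mismatch(serial_coeffs, parallel_ok))
print(first_mismatch(serial_coeffs, parallel_bad))
```

Reporting the first point of divergence, rather than a single pass/fail flag, is what makes it possible to trace a discrepancy back to the step that introduced it.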
For a small 8-atom silicon test calculation performed using the density mixing (DM) method, the serial calculation produces
------------------------------------------------------------------------ <-- SCF
SCF loop       Energy            Fermi         Energy gain       Timer   <-- SCF
                                 energy        per atom          (sec)   <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial    5.10476316E+002  4.62264099E+001                      15.74   <-- SCF
       1  -7.76802126E+002  2.64224391E+000   1.60909805E+002    25.28   <-- SCF
       2  -8.50574887E+002  2.02770490E-001   9.22159500E+000    33.17   <-- SCF
       3  -8.54801574E+002  3.79693598E-001   5.28335886E-001    44.40   <-- SCF
       4  -8.52981743E+002  7.44988320E-001  -2.27478843E-001    52.26   <-- SCF
       5  -8.52884167E+002  9.08590414E-001  -1.21969434E-002    60.13   <-- SCF
       6  -8.52886334E+002  8.98636611E-001   2.70796284E-004    68.36   <-- SCF
       7  -8.52887081E+002  9.06719344E-001   9.34638588E-005    76.22   <-- SCF
       8  -8.52887250E+002  9.10591664E-001   2.11356795E-005    84.07   <-- SCF
       9  -8.52887250E+002  9.11100143E-001  -3.08962712E-008    88.90   <-- SCF
      10  -8.52887250E+002  9.11105407E-001  -4.65249702E-008    93.76   <-- SCF
      11  -8.52887250E+002  9.11110563E-001  -1.77844141E-008    98.98   <-- SCF
------------------------------------------------------------------------ <-- SCF
and a two-core band-parallel calculation produces
------------------------------------------------------------------------ <-- SCF
SCF loop       Energy            Fermi         Energy gain       Timer   <-- SCF
                                 energy        per atom          (sec)   <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial    5.10476316E+002  4.62264099E+001                      16.67   <-- SCF
       1  -7.76802126E+002  2.64224391E+000   1.60909805E+002    28.13   <-- SCF
       2  -8.50574887E+002  2.02770490E-001   9.22159500E+000    38.62   <-- SCF
       3  -8.54801574E+002  3.79693598E-001   5.28335886E-001    52.16   <-- SCF
       4  -8.52981743E+002  7.44988320E-001  -2.27478843E-001    60.98   <-- SCF
       5  -8.52884167E+002  9.08590414E-001  -1.21969434E-002    70.63   <-- SCF
       6  -8.52886334E+002  8.98636611E-001   2.70796284E-004    79.96   <-- SCF
       7  -8.52887081E+002  9.06719344E-001   9.34638588E-005    88.78   <-- SCF
       8  -8.52887250E+002  9.10591664E-001   2.11356795E-005    97.65   <-- SCF
       9  -8.52887250E+002  9.11100143E-001  -3.08962712E-008   102.84   <-- SCF
      10  -8.52887250E+002  9.11105407E-001  -4.65248977E-008   108.05   <-- SCF
      11  -8.52887250E+002  9.11110563E-001  -1.77844503E-008   113.57   <-- SCF
------------------------------------------------------------------------ <-- SCF
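Note that the results as reported are identical for the first 9 SCF
cycles, and differ only in the energy gain for the last two cycles,
by less than 10^-13 eV/atom (about 7 x 10^-14 and 4 x 10^-14 eV/atom
respectively). This is comparable to the rounding error expected
for double-precision arithmetic on a total energy of this magnitude,
and so may be attributed to different rounding errors for the serial
and band-parallel calculations.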
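The comparison of the two runs can be reproduced with a short Python sketch that parses the `<-- SCF` rows and differences the energy-gain column. The rows below are the last three cycles of the two tables above; the function names are illustrative only, and the rounding scale is a rough estimate that assumes the gain per atom is a difference of total energies shared over the 8 atoms.

```python
import sys

# Last three SCF rows copied from the serial and band-parallel DM runs above.
SERIAL = """\
      9  -8.52887250E+002  9.11100143E-001  -3.08962712E-008    88.90   <-- SCF
     10  -8.52887250E+002  9.11105407E-001  -4.65249702E-008    93.76   <-- SCF
     11  -8.52887250E+002  9.11110563E-001  -1.77844141E-008    98.98   <-- SCF
"""
BAND_PARALLEL = """\
      9  -8.52887250E+002  9.11100143E-001  -3.08962712E-008   102.84   <-- SCF
     10  -8.52887250E+002  9.11105407E-001  -4.65248977E-008   108.05   <-- SCF
     11  -8.52887250E+002  9.11110563E-001  -1.77844503E-008   113.57   <-- SCF
"""


def parse_scf(text):
    """Map SCF cycle number -> (energy, Fermi energy, energy gain per atom)."""
    rows = {}
    for line in text.splitlines():
        body, marker, tag = line.partition("<--")
        fields = body.split()
        # Numbered DM rows have five fields: cycle, energy, Fermi, gain, timer.
        if marker and tag.strip() == "SCF" and len(fields) == 5 and fields[0].isdigit():
            rows[int(fields[0])] = tuple(float(f) for f in fields[1:4])
    return rows


def gain_differences(run_a, run_b):
    """Absolute difference in energy gain per atom for cycles present in both runs."""
    return {n: abs(run_a[n][2] - run_b[n][2]) for n in sorted(run_a.keys() & run_b.keys())}


serial = parse_scf(SERIAL)
parallel = parse_scf(BAND_PARALLEL)
diffs = gain_differences(serial, parallel)

# Rough rounding scale (assumption): machine epsilon times |E|, shared over 8 atoms.
scale = sys.float_info.epsilon * abs(serial[11][0]) / 8

for cycle, diff in diffs.items():
    print(f"cycle {cycle:2d}: |delta gain| = {diff:.3e} eV/atom")
print(f"rounding scale ~ {scale:.1e} eV/atom")
```

The differences in cycles 10 and 11 come out within a few multiples of this rounding scale, which is what would be expected if the two runs simply accumulate sums in a different order.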
This calculation takes longer when run band-parallel than in serial (113.57 s against 98.98 s after 11 SCF cycles), but this is not a cause for alarm: the test system is very small, containing only 16 valence bands, so it is not surprising that the communication overhead outweighs the gains.
The same calculation run using the `all-bands' self-consistent code path yields
------------------------------------------------------------------------ <-- SCF
SCF loop       Energy                          Energy gain       Timer   <-- SCF
                                               per atom          (sec)   <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial    6.83465549E+002                                       10.35   <-- SCF
       1  -7.97053977E+002                    1.85064941E+002    20.90   <-- SCF
       2  -8.48247959E+002                    6.39924773E+000    31.57   <-- SCF
       3  -8.50914193E+002                    3.33279207E-001    42.16   <-- SCF
       4  -8.51618587E+002                    8.80493249E-002    54.31   <-- SCF
       5  -8.52080365E+002                    5.77221874E-002    64.82   <-- SCF
       6  -8.52436527E+002                    4.45203123E-002    75.48   <-- SCF
       7  -8.52663071E+002                    2.83179709E-002    85.99   <-- SCF
       8  -8.52769350E+002                    1.32848145E-002    96.64   <-- SCF
       9  -8.52812636E+002                    5.41075552E-003   107.15   <-- SCF
      10  -8.52829576E+002                    2.11747553E-003   117.69   <-- SCF
      11  -8.52836183E+002                    8.25924796E-004   128.48   <-- SCF
------------------------------------------------------------------------ <-- SCF
in serial, and
------------------------------------------------------------------------ <-- SCF
SCF loop       Energy                          Energy gain       Timer   <-- SCF
                                               per atom          (sec)   <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial    6.83465549E+002                                        7.04   <-- SCF
       1  -7.97053977E+002                    1.85064941E+002    15.95   <-- SCF
       2  -8.48247959E+002                    6.39924773E+000    24.90   <-- SCF
       3  -8.50914193E+002                    3.33279207E-001    33.79   <-- SCF
       4  -8.51618587E+002                    8.80493249E-002    43.01   <-- SCF
       5  -8.52080365E+002                    5.77221874E-002    51.94   <-- SCF
       6  -8.52436527E+002                    4.45203123E-002    61.04   <-- SCF
       7  -8.52663071E+002                    2.83179709E-002    69.93   <-- SCF
       8  -8.52769350E+002                    1.32848145E-002    79.12   <-- SCF
       9  -8.52812636E+002                    5.41075552E-003    87.98   <-- SCF
      10  -8.52829576E+002                    2.11747553E-003    96.76   <-- SCF
      11  -8.52836183E+002                    8.25924796E-004   107.40   <-- SCF
------------------------------------------------------------------------ <-- SCF
in two-core band-parallel. Note that this time there is a small speed improvement for the band-parallel run (107.40 s against 128.48 s after 11 SCF cycles). This is because the `all-bands' path does more FFTs per SCF cycle than the DM path, and the FFTs distribute trivially among the band-group.
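The timings in the four tables can be summarised as speedups and mean times per SCF cycle. The following Python sketch uses only the Timer values reported above (the Initial row and cycle 11); the dictionary layout and function names are illustrative, and parallel efficiency is taken here as speedup divided by the number of cores.

```python
# Timer values in seconds, read from the SCF tables above: (Initial, cycle 11).
TIMERS = {
    "DM": {"serial": (15.74, 98.98), "band-parallel": (16.67, 113.57)},
    "all-bands": {"serial": (10.35, 128.48), "band-parallel": (7.04, 107.40)},
}
N_CYCLES = 11
N_CORES = 2


def speedup(method):
    """Serial wall-clock time divided by band-parallel time at cycle 11."""
    return TIMERS[method]["serial"][1] / TIMERS[method]["band-parallel"][1]


def seconds_per_cycle(method, mode):
    """Mean time per SCF cycle, excluding the set-up before the Initial row."""
    start, end = TIMERS[method][mode]
    return (end - start) / N_CYCLES


for method in TIMERS:
    s = speedup(method)
    print(f"{method}: speedup {s:.2f}, efficiency {s / N_CORES:.2f}")
    for mode in TIMERS[method]:
        print(f"  {mode}: {seconds_per_cycle(method, mode):.2f} s per cycle")
```

A speedup below 1 for the DM path and above 1 for the `all-bands' path matches the behaviour described in the text for this small test system.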
With the basic band-parallelism tested and complete, Castep has been demonstrated to work in band-parallel mode for the EDFT and DM algorithms.
The only known problem outstanding is with the EDFT mode. In the EDFT algorithm the empty bands are optimised non-self-consistently after the full bands have been updated, but at the moment this does not use the same algorithm as the DM code path and so is not band-parallel.