HECToR

Boosting the scaling performance of CASTEP: enabling next generation HPC for next generation science

This was the fifth Distributed Computational Science and Engineering (dCSE) project for improving the density functional theory code CASTEP. The work reduces the MPI buffer memory used during I/O and further develops the band parallelism. It builds upon the previous CASTEP projects 1, 2, 3 and 4.

The first objective, to both reduce the MPI buffer memory and optimise the I/O, will be achieved by:

Developing working prototype versions of wave_read, wave_write and wave_apply_symmetry that use MPI collectives instead of point-to-point communications for wavefunction coefficients.
Benchmarking the expected buffer memory requirements and performance. The updated I/O will avoid the creation of stderr files on every core, and will also allow user/system configuration of standard error handling/reporting and scratch file handling.
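As a rough illustration of why collectives can help the wavefunction I/O, the sketch below compares a point-to-point gather, where the root receives one message per rank in turn, with a binomial-tree gather of the kind a collective may use internally. This is a toy model in pure Python, not CASTEP code: the function names are hypothetical, and real MPI libraries choose their own collective algorithms.

```python
def linear_gather_steps(nprocs):
    """Point-to-point model: the root receives from every other rank in turn."""
    return nprocs - 1


def tree_gather(chunks):
    """Simulate a binomial-tree gather of one chunk per rank onto rank 0.

    Returns (data held by rank 0, number of communication rounds).
    Hypothetical helper for illustration only.
    """
    data = {rank: [chunk] for rank, chunk in enumerate(chunks)}
    nprocs = len(chunks)
    rounds = 0
    stride = 1
    while stride < nprocs:
        # In each round, every surviving receiver absorbs the data of the
        # rank `stride` places above it, so messages get fewer and longer.
        for receiver in range(0, nprocs, 2 * stride):
            sender = receiver + stride
            if sender < nprocs:
                data[receiver].extend(data.pop(sender))
        stride *= 2
        rounds += 1
    return data[0], rounds


if __name__ == "__main__":
    for nprocs in (24, 384):
        _, rounds = tree_gather(list(range(nprocs)))
        print(nprocs, "ranks:", linear_gather_steps(nprocs),
              "sequential receives vs", rounds, "tree rounds")
```

In this model the number of sequential steps at the root grows linearly with the rank count for point-to-point transfers but only logarithmically for the tree, which is consistent with the gap widening at higher core counts.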

The second objective, to further develop the band-parallel capability of CASTEP for multi-core architectures, will be achieved by:

Analysing bottlenecks in the initial band-parallel implementation, then rewriting the general inter-process band transformation subroutines to generate transformed data for multiple cores at once. A tree algorithm will also be used to generate and send fewer, longer messages.
Demonstrating a working version of band-parallel CASTEP on HECToR with rewritten band transformation routines for orthogonalisation and subspace diagonalisation, and "triangular matrix" optimisation.
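The source does not spell out the "triangular matrix" optimisation, so the following is only one plausible reading, stated as an assumption: if the band transformation matrix is upper triangular (as happens, for example, in a Cholesky-based orthogonalisation), each new band depends only on the old bands up to its own index, and roughly half of the multiply-adds can be skipped. The function names and data layout below are hypothetical.

```python
def transform_full(bands, t):
    """new[j] = sum_i bands[i] * t[i][j], visiting every element of t."""
    n, npw = len(bands), len(bands[0])
    out = [[0.0] * npw for _ in range(n)]
    ops = 0
    for j in range(n):
        for i in range(n):
            for g in range(npw):
                out[j][g] += bands[i][g] * t[i][j]
            ops += npw
    return out, ops


def transform_upper(bands, t):
    """Same transformation, assuming t[i][j] == 0 whenever i > j."""
    n, npw = len(bands), len(bands[0])
    out = [[0.0] * npw for _ in range(n)]
    ops = 0
    for j in range(n):
        for i in range(j + 1):  # skip the zero block below the diagonal
            for g in range(npw):
                out[j][g] += bands[i][g] * t[i][j]
            ops += npw
    return out, ops


if __name__ == "__main__":
    # Three toy "bands" with two coefficients each (assumed data).
    bands = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    t = [[1.0, 0.5, 0.25],
         [0.0, 1.0, 0.5],
         [0.0, 0.0, 1.0]]
    full, full_ops = transform_full(bands, t)
    upper, upper_ops = transform_upper(bands, t)
    print(full == upper, full_ops, upper_ops)
```

For n bands the triangular version performs n(n+1)/2 band updates instead of n squared, so the saving approaches a factor of two in this step as n grows.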

The individual achievements of the project may be summarised as follows:

MPI collectives were implemented to replace MPI point-to-point communications in the wavefunction I/O routines. For a test case aluminium oxide '2x2' slab (al2x2) containing 5 k-points, ~40000 G-vectors and 288 bands, reading is 2.2 times faster and writing is 1.8 times faster on 24 cores (HECToR Phase 2b). With more cores, these improvements are greater, e.g. with 384 cores, reading is 11.8 times faster and writing is 13.47 times faster. Checkpoints and restarts are now more efficient on HECToR. This also means that classes of phonon calculations using thousands of cores are now feasible without having to tweak MPI environment variables.
A parallel efficiency report is now written at the end of every CASTEP run. In addition to providing the basic parallel efficiencies, the report provides information regarding the parallel decomposition used. This information may also be used to see whether any further optimisations might be possible. The report is also capable of providing details on any aspects of the calculation that are particularly important to the parallelisation (e.g. the k-point distribution and G-vector communications). This will help CASTEP users to take full advantage of the parallel capability of the application and will enable more efficient use on HECToR and other high-end HPC architectures.
The band-parallel capability of CASTEP was improved by implementing an upgraded "triangular matrix" algorithm. This will provide a useful speedup to all band-parallel calculations. In particular, the improved method is 1.08 times faster with 256 HECToR Phase 3 cores and 1.16 times faster with 1024 cores.
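To make the quantities behind a parallel efficiency report concrete, the sketch below computes speedup and parallel efficiency relative to a reference run. The definitions used are the conventional ones and are an assumption here, since the source does not say how CASTEP computes its figures; the function names and the timings in the usage example are hypothetical.

```python
def speedup(t_ref, t):
    """How many times faster a run is than the reference run."""
    return t_ref / t


def parallel_efficiency(t_ref, p_ref, t, p):
    """Efficiency of a run on p cores relative to a reference on p_ref cores.

    A value of 1.0 means the runtime fell in exact proportion to the
    increase in core count.
    """
    return (t_ref * p_ref) / (t * p)


if __name__ == "__main__":
    # Hypothetical timings in seconds, not measurements from the project.
    t_24, t_384 = 100.0, 10.0
    print("speedup:", speedup(t_24, t_384))
    print("efficiency:", parallel_efficiency(t_24, 24, t_384, 384))
```

With these made-up timings, a tenfold speedup on sixteen times as many cores corresponds to an efficiency of 0.625.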

Currently, the I/O improvements and parallel efficiency report have been incorporated into the main CASTEP 6.1 source repository; the band-parallel improvements will be available from CASTEP 7.0.

Please see PDF or HTML for a report which summarises this project.
