The aim of this Distributed Computational Science and Engineering (dCSE)
project is to implement a new parallelisation strategy in the density
functional theory program Castep[1], on top of the existing
parallelisation if possible, in order to extend the number of nodes
on which Castep can be run efficiently. Benchmarking Castep
performance is itself a part of this dCSE project, but it is
anticipated that the new strategy will allow Castep to run efficiently
on O(1000) processing elements (PEs) of the HECToR national
supercomputer.
Castep was included as one of the benchmark programs used in the HECToR
procurement exercise. Increasing the efficiency of Castep's
parallelisation strategies will not only enable HECToR to be used more
productively, but will also enable considerably larger simulations to
be performed and open up many new avenues of research.
This dCSE project commenced on 1st December 2007 and was scheduled to
end on 31st July 2008. The Principal Investigator on the grant was Dr
K. Refson (RAL). Dr M.I.J. Probert (York) and Dr M. Plummer (STFC)
also provided help and support.
At the heart of a Castep calculation is the iterative solution of a
large eigenvalue problem to find the lowest 1% or so of the
eigenstates, called 'bands'. This calculation is currently
parallelised over the components of the bands. The aim of this project
was to implement an additional level of parallelism by distributing
the bands themselves. The project was to comprise four phases:
- Basic Band Parallelism
Split the storage and workload of the dominant parts of a Castep
calculation over the 'bands', in addition to the current
parallelisation scheme.
- Distributed Matrix Inversion and Diagonalisation
At various points of a Castep calculation large matrices need to be
inverted or diagonalised, and this is currently done in serial. In
this phase we will distribute this workload over as many processors as
possible.
- Band-Independent Optimiser
The current optimisation of the bands requires frequent, expensive
orthonormalisation steps that will be even more expensive with the new
band-parallelism. Implementing a different, known optimisation
algorithm that does not require such frequent orthonormalisation
should improve speed and scaling.
- Parallelisation and Optimisation of New Band Optimiser
Parallelise the new band optimisation algorithm and make the new
optimiser as fast and robust as possible.
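The two-level decomposition described in the first phase can be illustrated with a minimal sketch. It is not taken from the Castep source: the function names, the contiguous block distribution and the problem sizes are all assumptions made purely for illustration. Each process rank is mapped onto a grid of band groups and component groups, so that a process owns a block of bands and, within each band, a block of components.

```python
def block_range(n_items, n_parts, part):
    """Return the half-open range [start, stop) of items owned by `part`
    when n_items are split into n_parts contiguous, near-equal blocks."""
    base, extra = divmod(n_items, n_parts)
    start = part * base + min(part, extra)
    stop = start + base + (1 if part < extra else 0)
    return start, stop


def decompose(rank, n_procs, n_band_groups, n_bands, n_components):
    """Map a process rank onto a (band group, component group) grid and
    return the bands and components that rank would own.

    Hypothetical layout: ranks in the same band group are consecutive.
    """
    if n_procs % n_band_groups != 0:
        raise ValueError("n_procs must be divisible by n_band_groups")
    n_comp_groups = n_procs // n_band_groups
    band_group, comp_group = divmod(rank, n_comp_groups)
    return {
        "band_group": band_group,
        "comp_group": comp_group,
        "bands": block_range(n_bands, n_band_groups, band_group),
        "components": block_range(n_components, n_comp_groups, comp_group),
    }


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    for rank in range(8):
        print(rank, decompose(rank, n_procs=8, n_band_groups=2,
                              n_bands=10, n_components=1000))
```

With a fixed number of components per band, adding band groups in this way lets the process count grow without making each process's share of the components smaller, which is the motivation for the extra level of parallelism.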
The overall aim was to enable Castep to scale efficiently to at least
eight times more nodes on HECToR. An additional phase was
introduced at the request of NAG, to investigate Castep performance on
HECToR in general and to determine the best compiler, compiler flags
and libraries to use.
All phases of the project have been completed successfully,
based on the Castep 4.2 source code, though there remains some scope
for optimisation and several possible extensions. Basic Castep
calculations can be parallelised over bands in addition to the usual
parallelisation schemes, and the large matrix diagonalisation and
inversion operations have also been parallelised. Two band-independent
optimisation schemes have been implemented and shown to work under
certain conditions.
The performance of Castep on HECToR has been improved substantially by
this dCSE project. One example is the standard benchmark al3x3,
which now scales effectively to almost four times as many cores
as the ordinary Castep 4.2 (see Figure 1.1).
Figure 1.1: Graph showing the performance and scaling improvement
achieved by this dCSE project (using 8-way band-parallelism) compared
to the ordinary Castep 4.2 code for the standard al3x3 benchmark.
Sarfraz A Nadeem
2008-09-03