Castep Performance



With the new distributed inversion and diagonalisation subroutines, the performance and scaling of Castep improved noticeably. As expected, the improvement was more significant at larger core counts. Figure 5.2 shows the performance gained by distributing the matrix inversion and diagonalisation in this work package.

**Figure 5.2:** Graph showing the performance and scaling improvement achieved by the distributed inversion and diagonalisation work in Work Package 2, compared to the straight band-parallel work from Work Package 1. Each calculation uses 8-way band-parallelism and runs the standard `al3x3` benchmark.
[Image: phase2.eps]

**Figure 5.3:** Comparison of Castep scaling for Work Packages 1 and 2 and the original Castep 4.2, for 10 SCF cycles of the `al3x3` benchmark. Parallel efficiencies were measured relative to the 16-core calculation with Castep 4.2.
[`num_proc_in_smp : 1`] [Image: overall_smp1.eps]
[`num_proc_in_smp : 2`] [Image: overall_smp2.eps]

The distributed diagonalisation, on top of the basic band-parallelism, enables Castep calculations to scale effectively to between two and four times as many cores as Castep 4.2 (see figure 5.3). The standard `al3x3` benchmark can now be run on 1024 cores with almost 50% efficiency. That core count equates to over three cores per atom, and larger calculations are expected to scale better. A large demonstration calculation is in progress that should illustrate the new Castep performance even more clearly.
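The parallel efficiencies quoted here are relative to a 16-core reference run rather than to a serial run. The following is a minimal sketch of how such a relative efficiency could be computed from wall-clock times. The function name `parallel_efficiency` and the timings are hypothetical, chosen only to illustrate the arithmetic. They are not measurements from this work.

```python
def parallel_efficiency(ref_cores, ref_time, cores, time):
    """Parallel efficiency of a run relative to a reference run.

    1.0 means ideal scaling from the reference core count;
    0.5 means the run achieves half of the ideal speed-up.
    """
    speedup = ref_time / time          # measured speed-up over the reference
    ideal_speedup = cores / ref_cores  # speed-up expected from perfect scaling
    return speedup / ideal_speedup


# Hypothetical wall-clock times in seconds (illustrative only, NOT measured data).
timings = {16: 6400.0, 64: 1700.0, 256: 500.0, 1024: 200.0}
ref_cores = 16  # reference core count, matching the 16-core baseline in the text

for cores in sorted(timings):
    eff = parallel_efficiency(ref_cores, timings[ref_cores], cores, timings[cores])
    print(f"{cores:5d} cores: efficiency = {eff:.2f}")
```

With these made-up numbers, the 1024-core run is 32 times faster than the 16-core run where ideal scaling would give 64 times, so its efficiency is 0.5. Measuring against a 16-core baseline means that any inefficiency already present at 16 cores is not counted.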


Sarfraz A Nadeem 2008-09-01