
2. Castep Performance on HECToR (Work Package 0)

2.1 General Castep Performance

Castep's performance is usually limited by two things: orthogonalisation-like operations and FFTs. The orthogonalisation (and subspace diagonalisation) is performed using standard BLAS and LAPACK subroutine calls, such as those provided on HECToR by ACML or Cray's LibSci. Castep has a built-in FFT algorithm for portability, but this is not competitive with tuned FFT libraries such as FFTW, for which Castep provides interfaces to both version 2 and version 3. ACML also provides FFT subroutines.

Castep is written entirely in Fortran 90, and HECToR has three Fortran 90 compilers available: Portland Group (pgf90), Pathscale (pathf90) and GNU's gfortran. Following the benchmarking carried out during the procurement exercise, it was anticipated that Pathscale's pathf90 would be the compiler of choice, and Alan Simpson (EPCC) kindly provided his flags for the Pathscale compiler, based on those Cray used in the procurement:

-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 
-inline -INLINE:preempt=ON

Note that this switches on fast-math. Unless otherwise noted, all program development and benchmarking were performed with the Castep 4.2 codebase, as shipped to the United Kingdom Car-Parrinello (UKCP) consortium; this was the most recent release of Castep at the commencement of this dCSE project and was the version available to end-users on HECToR.

2.2 Benchmarks

The standard Castep benchmarks have not changed for several years, and many are now too small to be useful for parallel scaling tests. The smallest benchmark, al1x1, is a small slab of aluminium oxide and runs in less than 6 minutes on 8 PEs of HECToR. The larger titanium nitride benchmark, TiN (which also contains a single hydrogen atom), takes an hour on 16 PEs only because the DM (density-mixing) algorithm converges slowly; its time per SCF cycle is little more than twice that of the al1x1 benchmark. For this reason we settled on the al3x3 test system as the main benchmark for the parallel scaling tests, since it is large enough to take a reasonable amount of time per SCF cycle, yet small enough to run over a wide range of nodes.

The al3x3 benchmark is essentially a 3x3 surface cell of the al1x1 system; it contains 270 atoms and is calculated at 2 k-points (see figure 2.7).

However, the parameter files for this calculation do not specify Castep's optimisation level. In general it is advisable to tell Castep how to bias its optimisation, e.g. opt_strategy_bias : 3 to optimise for speed (at the expense of using more RAM). Since the default optimisation level is not appropriate for HPC machines such as HECToR, most of our calculations were performed with the addition of opt_strategy_bias : 3 to the Castep parameter file al3x3.param.
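
In practice the change amounts to appending a single keyword to the parameter file; a minimal sketch (the comment uses Castep's usual ! syntax):

! bias Castep towards speed rather than memory
opt_strategy_bias : 3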

2.2.1 FFT

The FFTW version 2 libraries on HECToR are only available for the Portland Group compiler, so the full FFT comparison tests were performed exclusively with pgf90. Cray's LibSci (10.2.1) was used for BLAS and LAPACK. The following compiler flags were used throughout the FFT tests:

-fastsse -O3 -Mipa

To measure the performance of the FFT routines specifically, we used Castep's internal Trace module to profile the two subroutines wave_recip_to_real_slice and wave_real_to_recip_slice. These subroutines take a group of eigenstates, called a wavefunction slice, and Fourier transform them from reciprocal space to real space, or vice versa.

Figure 2.1: Graph showing the relative performance of the four FFT subroutines available to Castep on HECToR for the TiN benchmark. This benchmark transforms wavefunction slices to real space 1634 times and back again.
\includegraphics[width=1.0\textwidth]{epsimages/FFT_TiN.eps}

As can be seen from figure 2.1, FFTW 3.1.1 was the fastest FFT library available on HECToR.


2.2.2 Maths Libraries (BLAS)

Much of the execution time in Castep is spent in the double-precision complex matrix-matrix multiplication subroutine ZGEMM. The orthogonalisation and subspace rotation operations both use ZGEMM to apply unitary transformations to the wavefunctions, and it is also used extensively when computing and applying the so-called non-local projectors. Although the unitary transformations dominate the asymptotic cost of large calculations, the requirement that benchmarks run in a reasonable amount of time means that they are rarely in this rotation-dominated regime. The orthogonalisation and diagonalisation subroutines also include a reasonable amount of extra work, including a memory copy and the updating of meta-data, which can distort the timings for small systems. For these reasons we chose to concentrate on the timings for the non-local projector overlaps as a measure of ZGEMM performance, in particular the subroutine ion_beta_add_multi_recip_all, which is almost exclusively a ZGEMM operation.
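
For orientation, a projector-overlap product of this kind reduces to a single ZGEMM call. The following is a minimal, self-contained sketch; the array names and dimensions are illustrative, not Castep's actual variables:

program zgemm_overlap_sketch
  ! Minimal sketch of a projector-overlap product, overlap = beta^H * phi.
  ! Names and sizes are illustrative, not taken from Castep.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: npw = 1000, nproj = 72, nbands = 128
  complex(dp) :: beta(npw, nproj)        ! non-local projectors (reciprocal space)
  complex(dp) :: phi(npw, nbands)        ! wavefunction coefficients
  complex(dp) :: overlap(nproj, nbands)  ! projector-band overlaps

  beta = (1.0_dp, 0.0_dp); phi = (1.0_dp, 0.0_dp)

  ! 'C' requests the conjugate transpose of beta, so one BLAS call forms
  ! all nproj x nbands overlaps at once.
  call zgemm('C', 'N', nproj, nbands, npw, (1.0_dp, 0.0_dp), beta, npw, &
             phi, npw, (0.0_dp, 0.0_dp), overlap, nproj)

  print *, 'overlap(1,1) =', overlap(1,1)
end program zgemm_overlap_sketch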

For the BLAS tests, the Pathscale compiler (version 3.0) was used throughout with the compiler options:

-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 -inline 
-INLINE:preempt=ON

Figure 2.2: Graph showing the relative performance of the ZGEMM provided by the four maths libraries available to Castep on HECToR for the TiN benchmark. This benchmark performs 4980 projector-projector overlaps using ZGEMM. Castep's internal Trace module was used to report the timings.
\includegraphics[width=1.0\textwidth]{epsimages/BLAS_TiN.eps}

As can be seen from figure 2.2, Cray's LibSci 10.2.1 was by far the fastest BLAS library available on HECToR, at least for ZGEMM.


2.2.3 Compiler

Much of the computational effort in a Castep calculation takes place in the FFT or maths libraries, but there are still significant parts of the code for which no standard library exists. It is the performance of these parts of the code that changes depending on the compiler used and the flags passed to it.

For the Pathscale compiler (3.0) we used the flags provided by Alan Simpson (EPCC) as a base for our investigations,

-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 
-inline -INLINE:preempt=ON
and created six compilation flagsets. The first set, which was used as a base for all the other sets, just used -O3 -OPT:Ofast, and we named this the bare set. The other five used this, plus the options listed below (an illustrative compile line follows the list):
malloc_inline        -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
recip                -OPT:recip=ON 
recip_malloc         -OPT:recip=ON -OPT:malloc_algorithm=1
recip_malloc_inline  -OPT:recip=ON -OPT:malloc_algorithm=1 -inline 
                     -INLINE:preempt=ON 
full                 -OPT:recip=ON -OPT:malloc_algorithm=1 -inline 
                     -INLINE:preempt=ON -march=auto -m64 -msse3 -LNO:simd=2
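
To make the build step concrete, a flagset would be applied through Cray's ftn compiler wrapper once the Pathscale programming environment is loaded. The module and file names below are an illustrative sketch of the usual HECToR workflow, not the actual Castep build system:

module swap PrgEnv-pgi PrgEnv-pathscale     # select pathf90 behind the ftn wrapper
ftn -O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 \
    -inline -INLINE:preempt=ON -c example_module.f90   # the recip_malloc_inline set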

The performance of the various Castep binaries can be seen in figure 2.3. It is clear that the flags we were given by Alan Simpson are indeed the best of this set.

Figure 2.3: Comparison of Castep performance for the Pathscale compiler with various flags, using 39 SCF cycles of the TiN benchmark (top) and 11 SCF cycles of the al3x3 benchmark (bottom).
[TiN benchmark (16 PEs)]
\includegraphics[width=1.0\textwidth]{epsimages/pathscale_flags_TiN.eps}

[al3x3 benchmark (32 PEs)]
\includegraphics[width=1.0\textwidth]{epsimages/pathscale_flags_al3x3.eps}

For the Portland Group compiler we used the base flags from the standard Castep pgf90 build, -fastsse -O3, as a starting point. Unfortunately there seemed to be a problem with the timing routine used in Castep when compiled with pgf90: the timings often gave numbers that were far too small and did not tally with the actual walltime. Indeed, the Castep output showed that the SCF times were `wrapping round' during a run, as in this sample output from an al3x3 benchmark:

------------------------------------------------------------------------ <-- SCF
SCF loop      Energy           Fermi           Energy gain       Timer   <-- SCF
                               energy          per atom          (sec)   <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial  -5.94087234E+004  5.75816046E+001                        71.40  <-- SCF
      1  -7.38921628E+004  4.31787037E+000   5.36423678E+001     399.29  <-- SCF
      2  -7.78877742E+004  1.96972918E+000   1.47985607E+001     689.06  <-- SCF
      3  -7.79878794E+004  1.79936064E+000   3.70760070E-001     954.04  <-- SCF
      4  -7.78423468E+004  1.96558259E+000  -5.39009549E-001    1250.05  <-- SCF
      5  -7.77212605E+004  1.34967844E+000  -4.48467894E-001    1544.50  <-- SCF
      6  -7.77152926E+004  1.12424610E+000  -2.21032775E-002    1863.09  <-- SCF
      7  -7.77129468E+004  1.05359411E+000  -8.68814103E-003      14.53  <-- SCF
      8  -7.77104895E+004  1.02771272E+000  -9.10094481E-003     288.19  <-- SCF
      9  -7.77084348E+004  9.96278161E-001  -7.60993336E-003     582.43  <-- SCF
     10  -7.77059813E+004  1.11167947E+000  -9.08729795E-003     872.09  <-- SCF
     11  -7.77052050E+004  1.16249354E+000  -2.87513162E-003    1162.86  <-- SCF
------------------------------------------------------------------------ <-- SCF
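
One plausible mechanism for this wrap-around, offered purely as an illustration rather than a diagnosis of the pgf90 build, is a wall-clock timer based on Fortran's SYSTEM_CLOCK: the integer count wraps at count_max, and with a 32-bit count and a microsecond tick the period is roughly 2147 seconds, the same order as the point at which the timings above reset. A wrap-safe elapsed-time routine only needs to detect the counter going backwards:

module wrap_safe_timer
  ! Sketch of a wrap-safe elapsed-time routine built on SYSTEM_CLOCK.
  ! This illustrates the general technique; it is not Castep's timer code.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, save  :: last_count = -1
  real(dp), save :: offset = 0.0_dp
contains
  function elapsed_seconds() result(t)
    real(dp) :: t
    integer  :: count, count_rate, count_max
    call system_clock(count, count_rate, count_max)
    ! If the counter has gone backwards it must have passed count_max,
    ! so add one full period to the accumulated offset.
    if (last_count >= 0 .and. count < last_count) then
       offset = offset + real(count_max, dp) / real(count_rate, dp)
    end if
    last_count = count
    t = offset + real(count, dp) / real(count_rate, dp)
  end function elapsed_seconds
end module wrap_safe_timer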

Unfortunately this behaviour meant that we were forced to rely on the PBS output file for the total walltime of each run, which includes set-up and finalisation time that we would have liked to omit. We experimented with various flags to invoke interprocedural optimisation (-Mipa, -Mipa=fast), but the Castep timings remained constant to within one second. Figure 2.4 shows the run times of both the Portland Group and Pathscale compilers, as reported by the PBS output, for the TiN benchmark.

Figure 2.4: Graph comparing the performance of the Pathscale 3.0 compiler (recip_malloc_inline flags) with the Portland Group 7.1.4 compiler (-fastsse -O3 -Mipa) for the TiN benchmark on 16 PEs. The PBS-reported walltime was used for the timings.
\includegraphics[width=1.0\textwidth]{epsimages/compiler_comparison.eps}

2.2.4 Node Usage

Each node on HECToR has two cores, or PEs, so we ran a series of calculations to see how Castep's performance and scaling depend on the number of PEs used per node. We also used these calculations to double-check the results of our investigation into the different libraries. The results are shown in figure 2.5.

Figure 2.5: Comparison of Castep performance for the ACML and LibSci (Goto) BLAS libraries, and the generic GPFA and FFTW3 FFT libraries, run using two cores per node (top) and one core per node (bottom).
[Execution time using 2 cores per node]
\includegraphics[width=1.0\textwidth]{epsimages/Al2O3_3x3_library.eps}

[Execution time using 1 core per node]
\includegraphics[width=1.0\textwidth]{epsimages/Al2O3_3x3_library_ppn1.eps}

The best performance was achieved with the Goto BLAS in Cray's LibSci version 10.2.0 coupled with the FFTW3 library, as can be seen in figure 2.5. The best performance per core was achieved using only one core per node, though the improvement over using both cores was not sufficient to justify the expense (since jobs are charged per node, not per core). Using Castep's facility for optimising communications within an SMP node, the scaling was improved dramatically and rivals that of the one-core-per-node runs.
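
For reference, the number of PEs placed on each node is set at launch time. The job script below is an illustrative sketch of the usual PBS and aprun mechanism on a Cray XT machine such as HECToR; the resource values and seed name are assumptions, not the scripts actually used:

#!/bin/bash
#PBS -l mppwidth=32        # total number of PEs
#PBS -l mppnppn=1          # PEs per node: 1 here, 2 to use both cores
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# -n gives the total number of PEs, -N the PEs per node (matching mppnppn)
aprun -n 32 -N 1 ./castep al3x3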


2.3 Baseline

For our baseline we chose the Pathscale 3.0 binary, compiled with the recip_malloc_inline flags (see section 2.2.3) and linked against Cray's LibSci 10.2.1 and FFTW3, as this seemed to offer the best performance with the Castep 4.2 codebase.

Figure 2.6: Execution time for the 33 atom TiN benchmark. This calculation is performed at 8 k-points.
\includegraphics[width=1.0\textwidth]{epsimages/TiN.eps}

\includegraphics[width=1.0\textwidth]{epsimages/TiN_log_time.eps}

\includegraphics[width=1.0\textwidth]{epsimages/TiN_efficiency.eps}

Figure 2.7: Scaling of execution time with cores for the 270 atom Al2O3 3x3 benchmark. This calculation is performed at 2 k-points.
[Execution time]
\includegraphics[width=1.0\textwidth]{epsimages/Al2O3_time.eps}

[Efficiency with respect to 16 cores]
\includegraphics[width=1.0\textwidth]{epsimages/Al2O3_efficiency.eps}

Figure 2.8: Breakdown of CPU time for 256 (top) and 512 (bottom) cores using 2 ppn, for the Castep Al2O3 3x3 benchmark.
[CPU time for Castep on 256 cores]
\includegraphics[width=1.0\textwidth]{epsimages/256_orig_craypat.eps}

[CPU time for Castep on 512 cores]
\includegraphics[width=1.0\textwidth]{epsimages/512_orig_craypat.eps}

Figure 2.9: The CPU time spent in the two dominant user-level subroutines and their children, for a 512-core (2 ppn) Castep calculation of the Al2O3 3x3 benchmark
[CPU time spent applying the Hamiltonian in Castep]
\includegraphics[width=1.0\textwidth]{epsimages/512_orig_craypat_applyH.eps}

[CPU time spent preconditioning the search direction in Castep]
\includegraphics[width=1.0\textwidth]{epsimages/512_orig_craypat_nlpot.eps}


2.4 Analysis

Both Cray PAT and Castep's built-in Trace module showed that a considerable amount of the ZGEMM time, as well as the non-library time, was spent in nlpot_apply_precon. The non-library time was attributable to a packing routine which takes the unpacked array beta_phi, containing the projections of the wavefunction bands onto the non-local pseudopotential projectors, and packs it into a temporary array. Unfortunately this operation was poorly written: the innermost loop ran over the slowest (last) index of the array.
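
To illustrate the access pattern (with hypothetical array names and shapes, not the actual Castep code): Fortran stores arrays in column-major order, so the innermost loop should run over the first index. A sketch of the slow and the cache-friendly orderings:

program packing_order_sketch
  ! Illustrative only: shows why looping innermost over the last (slowest)
  ! index of a Fortran array is costly, and the cache-friendly alternative.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nproj = 200, nbands = 400
  complex(dp) :: beta_phi(nproj, nbands), packed(nproj, nbands)
  integer :: ip, ib

  beta_phi = (1.0_dp, 0.0_dp)

  ! Poor ordering: the innermost loop strides through memory with stride nproj.
  do ip = 1, nproj
     do ib = 1, nbands
        packed(ip, ib) = beta_phi(ip, ib)
     end do
  end do

  ! Cache-friendly ordering: the innermost loop runs over the first (fastest)
  ! index, so consecutive iterations touch contiguous memory.
  do ib = 1, nbands
     do ip = 1, nproj
        packed(ip, ib) = beta_phi(ip, ib)
     end do
  end do

  print *, 'packed(1,1) =', packed(1,1)
end program packing_order_sketch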

The ZGEMM time in nlpot_apply_precon could also be reduced because the first matrix in the multiplication was in fact Hermitian, so the call could be replaced by ZHEMM to do approximately half the work.
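
As an illustration of that substitution (again with hypothetical array names and sizes, and assuming the Hermitian matrix multiplies from the left), the general and Hermitian BLAS calls look like this:

program zhemm_sketch
  ! Minimal sketch of replacing ZGEMM by ZHEMM when the left-hand matrix is
  ! Hermitian; names and sizes are illustrative, not Castep's.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nproj = 72, nbands = 128
  complex(dp) :: precon(nproj, nproj)     ! Hermitian preconditioning matrix
  complex(dp) :: beta_phi(nproj, nbands)  ! projector-band overlaps
  complex(dp) :: work(nproj, nbands)      ! result

  precon = (1.0_dp, 0.0_dp); beta_phi = (1.0_dp, 0.0_dp)

  ! General form: treats precon as a full general matrix.
  call zgemm('N', 'N', nproj, nbands, nproj, (1.0_dp, 0.0_dp), precon, nproj, &
             beta_phi, nproj, (0.0_dp, 0.0_dp), work, nproj)

  ! Hermitian form: 'L' multiplies from the left, 'U' says only the upper
  ! triangle of precon is referenced, exploiting the Hermitian symmetry.
  call zhemm('L', 'U', nproj, nbands, (1.0_dp, 0.0_dp), precon, nproj, &
             beta_phi, nproj, (0.0_dp, 0.0_dp), work, nproj)

  print *, 'work(1,1) =', work(1,1)
end program zhemm_sketch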

