Castep is written entirely in Fortran 90, and HECToR has three Fortran 90 compilers available: Portland Group (pgf90), Pathscale (pathf90) and GNU's gfortran. Following the benchmarking carried out during the procurement exercise, it was anticipated that Pathscale's pathf90 would be the compiler of choice, and Alan Simpson (EPCC) was kind enough to provide his flags for the Pathscale compiler, based on those Cray used in the procurement:
-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
Note that this switches on fast-math. Unless otherwise noted, all program development and benchmarking were performed with the Castep 4.2 codebase, as shipped to the United Kingdom Car-Parrinello (UKCP) consortium; this was the most recent release of Castep at the commencement of this dCSE project and was the version available on HECToR to end-users.
The al3x3 benchmark is essentially a 3x3 surface cell of the al1x1 system, and has:
However, the parameter files for this calculation do not specify Castep's optimisation level. In general it is advisable to tell Castep how to bias its optimisation, e.g. opt_strategy_bias : 3 to optimise for speed (at the expense of using more RAM). Since the default optimisation level is not appropriate for HPC machines such as HECToR, most of our calculations were performed with opt_strategy_bias : 3 added to the Castep parameter file al3x3.param.
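In practice this is a one-line addition to al3x3.param:

  opt_strategy_bias : 3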
-fastsse -O3 -Mipa
In order to measure the performance of the FFT routines specifically we used Castep's internal Trace module to profile the two subroutines wave_recip_to_real_slice and wave_real_to_recip_slice. These subroutines take a group of eigenstates, called a wavefunction slice, and Fourier transform them from reciprocal space to real space, or vice versa.
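As a rough illustration of the operation being timed, the sketch below measures a single 3D complex FFT using FFTW3's Fortran interface. It is not Castep's actual code (the real routines transform whole wavefunction slices and include the associated data rearrangement), and the grid size is an arbitrary example.

  program time_fft3d
    implicit none
    include 'fftw3.f'
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: nx = 64, ny = 64, nz = 64   ! illustrative grid size
    complex(dp), allocatable :: grid_recip(:,:,:), grid_real(:,:,:)
    integer(kind=8) :: plan
    integer :: t0, t1, rate
    real(dp) :: seconds

    allocate(grid_recip(nx,ny,nz), grid_real(nx,ny,nz))
    grid_recip = (1.0_dp, 0.0_dp)

    ! Plan and execute a single reciprocal-space to real-space transform
    call dfftw_plan_dft_3d(plan, nx, ny, nz, grid_recip, grid_real, &
                           FFTW_BACKWARD, FFTW_ESTIMATE)
    call system_clock(t0, rate)
    call dfftw_execute(plan)
    call system_clock(t1)
    call dfftw_destroy_plan(plan)

    seconds = real(t1 - t0, dp) / real(rate, dp)
    write(*,'(a,f10.4,a)') '3D FFT took ', seconds, ' s'
  end program time_fft3d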
[FFT library performance]
As can be seen from figure 2.1, FFTW 3.1.1 was the fastest FFT library available on HECToR.
For the BLAS tests, the Pathscale compiler (version 3.0) was used throughout with the compiler options:
-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
[ZGEMM performance of the available BLAS libraries]
As can be seen from figure 2.2, Cray's LibSci 10.2.1 was by far the fastest BLAS library available on HECToR, at least for ZGEMM.
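For reference, the kind of operation being compared is a large complex matrix-matrix multiply. A minimal timing sketch (not the actual benchmark harness; the matrix size is arbitrary, and the program must be linked against the BLAS library under test) is:

  program time_zgemm
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 1000                    ! illustrative matrix size
    complex(dp), allocatable :: a(:,:), b(:,:), c(:,:)
    complex(dp) :: alpha, beta
    integer :: t0, t1, rate
    real(dp) :: seconds, gflops

    allocate(a(n,n), b(n,n), c(n,n))
    a = (1.0_dp, 0.5_dp); b = (0.5_dp, -1.0_dp); c = (0.0_dp, 0.0_dp)
    alpha = (1.0_dp, 0.0_dp); beta = (0.0_dp, 0.0_dp)

    call system_clock(t0, rate)
    call zgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n)
    call system_clock(t1)

    seconds = real(t1 - t0, dp) / real(rate, dp)
    gflops  = 8.0_dp*real(n, dp)**3 / seconds / 1.0e9_dp   ! ~8 flops per complex multiply-add
    write(*,'(a,f8.3,a,f8.2,a)') 'ZGEMM took ', seconds, ' s (', gflops, ' GFLOP/s)'
  end program time_zgemm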
For the Pathscale compiler (3.0) we used the flags provided by Alan Simpson (EPCC) as a base for our investigations,
-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
and created six compilation flagsets. The first set, which was used as a base for all the other sets, just used -O3 -OPT:Ofast, and we named this the bare set. The other five used this, plus:
  malloc_inline         -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
  recip                 -OPT:recip=ON
  recip_malloc          -OPT:recip=ON -OPT:malloc_algorithm=1
  recip_malloc_inline   -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
  full                  -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON -march=auto -m64 -msse3 -LNO:simd=2
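Each named set thus corresponds to the bare flags plus the additions listed above; for example, a hypothetical Makefile fragment for the recip_malloc_inline build would contain:

  FFLAGS = -O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON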
The performance of the various Castep binaries can be seen in figure 2.3. It is clear that the flags we were given by Alan Simpson are indeed the best of this set.
[TiN benchmark (16 PEs)]
[al3x3 benchmark (32 PEs)]
For the Portland Group compiler we used the base flags from the standard Castep pgf90 build as a starting point, -fastsse -O3. Unfortunately there seemed to be a problem with the timing routine used in Castep when compiled with pgf90, as the timings often gave numbers that were far too small and did not tally with the actual walltime. Indeed the Castep output showed that the SCF times were 'wrapping round' during a run, as in this sample output from an al3x3 benchmark:
------------------------------------------------------------------------ <-- SCF
SCF loop      Energy            Fermi            Energy gain       Timer <-- SCF
                                energy           per atom          (sec) <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial  -5.94087234E+004  5.75816046E+001                         71.40 <-- SCF
      1  -7.38921628E+004  4.31787037E+000   5.36423678E+001      399.29 <-- SCF
      2  -7.78877742E+004  1.96972918E+000   1.47985607E+001      689.06 <-- SCF
      3  -7.79878794E+004  1.79936064E+000   3.70760070E-001      954.04 <-- SCF
      4  -7.78423468E+004  1.96558259E+000  -5.39009549E-001     1250.05 <-- SCF
      5  -7.77212605E+004  1.34967844E+000  -4.48467894E-001     1544.50 <-- SCF
      6  -7.77152926E+004  1.12424610E+000  -2.21032775E-002     1863.09 <-- SCF
      7  -7.77129468E+004  1.05359411E+000  -8.68814103E-003       14.53 <-- SCF
      8  -7.77104895E+004  1.02771272E+000  -9.10094481E-003      288.19 <-- SCF
      9  -7.77084348E+004  9.96278161E-001  -7.60993336E-003      582.43 <-- SCF
     10  -7.77059813E+004  1.11167947E+000  -9.08729795E-003      872.09 <-- SCF
     11  -7.77052050E+004  1.16249354E+000  -2.87513162E-003     1162.86 <-- SCF
------------------------------------------------------------------------ <-- SCF
Unfortunately this behaviour meant that we were forced to rely on the PBS output file for the total walltime of each run, which includes set-up and finalisation time that we would have preferred to omit. We experimented with various flags to invoke interprocedural optimisation (-Mipa, -Mipa=fast), but the Castep timings remained constant to within one second. Figure 2.4 shows the run times with both the Portland Group and Pathscale compilers, as reported by the PBS output, for the TiN benchmark.
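The behaviour above is consistent with an elapsed-time counter of limited range wrapping back to zero. A generic sketch of wrap-safe timing in standard Fortran (purely illustrative; this is not Castep's timing routine) is:

  subroutine elapsed_seconds(t_start, seconds)
    implicit none
    integer,      intent(in)  :: t_start      ! from an earlier system_clock call
    real(kind=8), intent(out) :: seconds
    integer :: t_now, rate, count_max, ticks

    call system_clock(t_now, rate, count_max)
    if (t_now >= t_start) then
       ticks = t_now - t_start
    else
       ! The clock counter wrapped past count_max; add the part of the
       ! interval before the wrap to the part after it.
       ticks = (count_max - t_start) + t_now + 1
    end if
    seconds = real(ticks, kind=8) / real(rate, kind=8)
  end subroutine elapsed_seconds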
[Execution time using 2 cores per node]
[Execution time using 1 core per node]
The best performance was achieved with the Goto BLAS in Cray's LibSci version 10.2.0 coupled with the FFTW3 library, as can be seen in figure 2.5. The best performance per core was achieved using only one core per node, though the improvement over using both cores was not sufficient to justify the expense (since jobs are charged per node, not per core). Using Castep's facility for optimising communications within an SMP node, the scaling was improved dramatically and rivalled that of the one-core-per-node runs.
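This facility is switched on through the Castep parameter file; as an illustration (the keyword name num_proc_in_smp used here is our assumption and should be checked against the Castep documentation), a dual-core XT4 node would use something of the form:

  num_proc_in_smp : 2   ! cores sharing a node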
We chose the Pathscale 3.0 binary, compiled with the recip_malloc_inline flags (see section 2.2.3) and linked against Cray's LibSci 10.2.1 and FFTW3, as our baseline, since this offered the best performance with the Castep 4.2 codebase.
[Execution time]
[Efficiency with respect to 16 cores]
[CPU time for Castep on 256 cores]
[CPU time for Castep on 512 cores]
[CPU time spent applying the Hamiltonian in Castep]
[CPU time spent preconditioning the search direction in Castep]
The ZGEMM time in nlpot_apply_precon could also be reduced because the first matrix in the multiplication was in fact Hermitian, so the call could be replaced by ZHEMM to do approximately half the work.
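The sketch below illustrates the substitution; it is not the actual nlpot_apply_precon code, the matrix sizes and values are arbitrary, and the program must be linked against a BLAS library. ZHEMM references only the specified triangle of the Hermitian matrix and gives the same result as the general multiply.

  program zhemm_demo
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 4, m = 3               ! illustrative sizes
    complex(dp) :: A(n,n), B(n,m), C1(n,m), C2(n,m)
    complex(dp), parameter :: alpha = (1.0_dp, 0.0_dp), beta = (0.0_dp, 0.0_dp)
    integer :: i, j

    ! Build a Hermitian test matrix A and an arbitrary B
    do j = 1, n
       do i = 1, n
          A(i,j) = cmplx(i + j, i - j, dp)           ! A(j,i) = conjg(A(i,j))
       end do
    end do
    do j = 1, m
       do i = 1, n
          B(i,j) = cmplx(i, j, dp)
       end do
    end do

    ! General multiply: C1 = A * B
    call zgemm('N', 'N', n, m, n, alpha, A, n, B, n, beta, C1, n)

    ! Hermitian multiply: C2 = A * B, referencing only the upper triangle of A
    call zhemm('L', 'U', n, m, alpha, A, n, B, n, beta, C2, n)

    write(*,*) 'Max difference between ZGEMM and ZHEMM results: ', maxval(abs(C1 - C2))
  end program zhemm_demo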