

Much of the computational effort in a Castep calculation takes place in the FFT or maths libraries, but there are still significant parts of the code for which no standard library exists. It is the performance of these parts that depends on the compiler used and the flags passed to it.

Unfortunately GNU's gfortran compiler (4.2.4) is not capable of correctly compiling the Castep 4.2 codebase, so our investigations on HECToR were restricted to the Portland Group and Pathscale compilers. gfortran versions 4.3.0 and later can compile Castep, but are not yet available on HECToR.

For the Pathscale compiler (3.0) we used the flags provided by Alan Simpson (EPCC) as a base for our investigations,

-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 
-inline -INLINE:preempt=ON
and created six compilation flagsets. The first set, which was used as a base for all the other sets, just used -O3 -OPT:Ofast, and we named this the bare set. The other five used this, plus:

malloc_inline        -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
recip                -OPT:recip=ON 
recip_malloc         -OPT:recip=ON -OPT:malloc_algorithm=1
recip_malloc_inline  -OPT:recip=ON -OPT:malloc_algorithm=1 -inline 
full                 -OPT:recip=ON -OPT:malloc_algorithm=1 -inline 
                     -INLINE:preempt=ON -march=auto -m64 -msse3 -LNO:simd=2
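
The six flagsets above can also be written down as data, which makes it easy to generate one compiler flag string per build. The following is a minimal sketch only: the names `BARE`, `EXTRA` and `flagset` are hypothetical and not part of the Castep build system, and the flags are simply copied from the list above.

```python
# Base ("bare") flags shared by every flagset, as given in the text.
BARE = ["-O3", "-OPT:Ofast"]

# Additional flags for each named flagset, copied from the list above.
# The dictionary and function names are illustrative, not part of Castep.
EXTRA = {
    "bare": [],
    "malloc_inline": ["-OPT:malloc_algorithm=1", "-inline",
                      "-INLINE:preempt=ON"],
    "recip": ["-OPT:recip=ON"],
    "recip_malloc": ["-OPT:recip=ON", "-OPT:malloc_algorithm=1"],
    "recip_malloc_inline": ["-OPT:recip=ON", "-OPT:malloc_algorithm=1",
                            "-inline"],
    "full": ["-OPT:recip=ON", "-OPT:malloc_algorithm=1", "-inline",
             "-INLINE:preempt=ON", "-march=auto", "-m64", "-msse3",
             "-LNO:simd=2"],
}


def flagset(name):
    """Return the full flag string for one named flagset."""
    return " ".join(BARE + EXTRA[name])


if __name__ == "__main__":
    # Print one line per flagset, e.g. for pasting into a build script.
    for name in EXTRA:
        print(f"{name:20s} {flagset(name)}")
```

Keeping the bare set separate from the extras mirrors the way the flagsets were defined: every binary shares the same base optimisation level and differs only in the additional options.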

The performance of the various Castep binaries can be seen in figure 3.3. It is clear that the flags we were given by Alan Simpson are indeed the best of this set.

Figure 3.3: Comparison of Castep performance for the Pathscale compiler with various flags, using 39 SCF cycles of the TiN benchmark (3.3(a)) and 11 SCF cycles of the al3x3 benchmark (3.3(b)).
[Figure 3.3 panels: (a) TiN benchmark (16 PEs), pathscale_flags_TiN.eps; (b) al3x3 benchmark (32 PEs), pathscale_flags_al3x3.eps]

For the Portland Group compiler we used the base flags from the standard Castep pgf90 build, -fastsse -O3, as a starting point. Unfortunately there seemed to be a problem with the timing routine used in Castep when compiled with pgf90, as the reported timings were often far too small and did not tally with the actual walltime. Indeed the Castep output showed that the SCF times were 'wrapping round' during a run, as in this sample output from an al3x3 benchmark:

------------------------------------------------------------------------ <-- SCF
SCF loop      Energy           Fermi           Energy gain       Timer   <-- SCF
                               energy          per atom          (sec)   <-- SCF
------------------------------------------------------------------------ <-- SCF
Initial  -5.94087234E+004  5.75816046E+001                        71.40  <-- SCF
      1  -7.38921628E+004  4.31787037E+000   5.36423678E+001     399.29  <-- SCF
      2  -7.78877742E+004  1.96972918E+000   1.47985607E+001     689.06  <-- SCF
      3  -7.79878794E+004  1.79936064E+000   3.70760070E-001     954.04  <-- SCF
      4  -7.78423468E+004  1.96558259E+000  -5.39009549E-001    1250.05  <-- SCF
      5  -7.77212605E+004  1.34967844E+000  -4.48467894E-001    1544.50  <-- SCF
      6  -7.77152926E+004  1.12424610E+000  -2.21032775E-002    1863.09  <-- SCF
      7  -7.77129468E+004  1.05359411E+000  -8.68814103E-003      14.53  <-- SCF
      8  -7.77104895E+004  1.02771272E+000  -9.10094481E-003     288.19  <-- SCF
      9  -7.77084348E+004  9.96278161E-001  -7.60993336E-003     582.43  <-- SCF
     10  -7.77059813E+004  1.11167947E+000  -9.08729795E-003     872.09  <-- SCF
     11  -7.77052050E+004  1.16249354E+000  -2.87513162E-003    1162.86  <-- SCF
------------------------------------------------------------------------ <-- SCF

To our knowledge this behaviour has not been seen on other machines, and it was not reproduced on HECToR with the Pathscale compiler, where the Castep timings agreed with the walltime reported in the PBS output file to within a second. Unfortunately this behaviour meant that we were forced to rely on the PBS output file for the total walltime of each pgf90 run, which includes set-up and finalisation time that we would have liked to omit.
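
A wrapped timer of this kind can in principle be corrected after the fact if the wrap period is known. The sketch below is illustrative only and was not part of our procedure: the cause of the wrapping was not identified, and the period used here (2^31 microseconds, about 2147.48 s, as a signed 32-bit microsecond counter would give) is an assumption, not a measured property of the pgf90 build. The function and variable names are hypothetical. The timer values are those in the sample output above.

```python
def unwrap_timer(readings, period):
    """Return cumulative times, adding `period` whenever a reading drops.

    Only valid if every interval between readings is shorter than `period`,
    so that at most one wrap can occur between consecutive readings.
    """
    unwrapped = []
    offset = 0.0
    previous = None
    for t in readings:
        if previous is not None and t < previous:
            offset += period  # the counter wrapped since the last reading
        unwrapped.append(t + offset)
        previous = t
    return unwrapped


# ASSUMPTION: wrap period of 2^31 microseconds; not confirmed by the source.
ASSUMED_PERIOD = 2**31 / 1e6

# Timer column from the al3x3 sample output (Initial, then SCF cycles 1-11).
timer = [71.40, 399.29, 689.06, 954.04, 1250.05, 1544.50,
         1863.09, 14.53, 288.19, 582.43, 872.09, 1162.86]

corrected = unwrap_timer(timer, ASSUMED_PERIOD)
# Time spent in each SCF cycle, from differences of the corrected readings.
per_cycle = [b - a for a, b in zip(corrected, corrected[1:])]

if __name__ == "__main__":
    for cycle, seconds in enumerate(per_cycle, start=1):
        print(f"SCF cycle {cycle:2d}: {seconds:8.2f} s")
```

Under this assumed period the cycle that spans the wrap comes out at a duration similar to its neighbours, which is consistent with the assumption but does not prove it. The PBS walltime remains the only timing we relied on.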

We experimented with various flags to invoke interprocedural optimisation (-Mipa, -Mipa=fast), but the Castep timings remained constant to within one second. Figure 3.4 shows the run times of both the Portland Group and Pathscale compilers, as reported by the PBS output, for the TiN benchmark.

Figure 3.4: Graph showing the performance of the Pathscale 3.0 compiler (recip_malloc_inline flags) relative to the Portland Group 7.1.4 compiler (-fastsse -O3 -Mipa) for the TiN benchmark on 16 PEs. The timings shown are the walltimes reported by PBS.

Sarfraz A Nadeem 2008-09-01