Unfortunately, GNU's gfortran compiler (4.2.4) cannot correctly compile the Castep 4.2 codebase, so our investigations on HECToR were restricted to the Portland Group and Pathscale compilers. Versions 4.3.0 and later of gfortran can compile Castep, but are not yet available on HECToR.
For the Pathscale compiler (3.0) we used the flags provided by Alan Simpson (EPCC) as a base for our investigations,
-O3 -OPT:Ofast -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
and created six compilation flagsets. The first set, which was used as a base for all the other sets, used only -O3 -OPT:Ofast, and we named this the bare set. The other five used these flags, plus:
    malloc_inline        -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
    recip                -OPT:recip=ON
    recip_malloc         -OPT:recip=ON -OPT:malloc_algorithm=1
    recip_malloc_inline  -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON
    full                 -OPT:recip=ON -OPT:malloc_algorithm=1 -inline -INLINE:preempt=ON -march=auto -m64 -msse3 -LNO:simd=2
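The way each flagset extends the bare set can be sketched in a few lines of Python. This is only an illustration of the composition described above, not the build system actually used on HECToR. The group names (`RECIP`, `MALLOC`, `INLINE`, `ARCH`) and the helper `flags_for` are hypothetical names introduced here, while the flags themselves are taken verbatim from the text.

```python
# Flags shared by every Pathscale flagset (the "bare" set in the text).
BASE = ["-O3", "-OPT:Ofast"]

# Hypothetical groupings of the extra flags, introduced only for this sketch.
RECIP = ["-OPT:recip=ON"]
MALLOC = ["-OPT:malloc_algorithm=1"]
INLINE = ["-inline", "-INLINE:preempt=ON"]
ARCH = ["-march=auto", "-m64", "-msse3", "-LNO:simd=2"]

# Extra flags added on top of BASE for each named flagset.
FLAGSETS = {
    "bare": [],
    "malloc_inline": MALLOC + INLINE,
    "recip": RECIP,
    "recip_malloc": RECIP + MALLOC,
    "recip_malloc_inline": RECIP + MALLOC + INLINE,
    "full": RECIP + MALLOC + INLINE + ARCH,
}


def flags_for(name):
    """Return the complete flag string for the named flagset."""
    return " ".join(BASE + FLAGSETS[name])


if __name__ == "__main__":
    for name in FLAGSETS:
        print(f"{name:20s} {flags_for(name)}")
```

In this layout the recip_malloc_inline set reproduces the flag string quoted from Alan Simpson, and full differs from it only by the four architecture and vectorisation flags.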
The performance of the various Castep binaries can be seen in figure 3.3. It is clear that the flags we were given by Alan Simpson are indeed the best of this set.
[TiN benchmark (16 PEs)]
For the Portland Group compiler we used the base flags from the standard Castep pgf90 build as a starting point, -fastsse -O3. Unfortunately, there seemed to be a problem with the timing routine used in Castep when compiled with pgf90: the reported timings were often far too small and did not tally with the actual walltime. Indeed, the Castep output showed that the SCF times were 'wrapping round' during a run, as in this sample output from an al3x3 benchmark:
    ------------------------------------------------------------------------ <-- SCF
    SCF loop       Energy            Fermi         Energy gain         Timer <-- SCF
                                    energy          per atom           (sec) <-- SCF
    ------------------------------------------------------------------------ <-- SCF
     Initial  -5.94087234E+004  5.75816046E+001                        71.40 <-- SCF
           1  -7.38921628E+004  4.31787037E+000   5.36423678E+001     399.29 <-- SCF
           2  -7.78877742E+004  1.96972918E+000   1.47985607E+001     689.06 <-- SCF
           3  -7.79878794E+004  1.79936064E+000   3.70760070E-001     954.04 <-- SCF
           4  -7.78423468E+004  1.96558259E+000  -5.39009549E-001    1250.05 <-- SCF
           5  -7.77212605E+004  1.34967844E+000  -4.48467894E-001    1544.50 <-- SCF
           6  -7.77152926E+004  1.12424610E+000  -2.21032775E-002    1863.09 <-- SCF
           7  -7.77129468E+004  1.05359411E+000  -8.68814103E-003      14.53 <-- SCF
           8  -7.77104895E+004  1.02771272E+000  -9.10094481E-003     288.19 <-- SCF
           9  -7.77084348E+004  9.96278161E-001  -7.60993336E-003     582.43 <-- SCF
          10  -7.77059813E+004  1.11167947E+000  -9.08729795E-003     872.09 <-- SCF
          11  -7.77052050E+004  1.16249354E+000  -2.87513162E-003    1162.86 <-- SCF
    ------------------------------------------------------------------------ <-- SCF
To our knowledge this behaviour has not been seen on other machines, and it was not reproduced on HECToR with the Pathscale compiler, where the Castep timings agreed with the walltime reported in the PBS output file to within a second. Unfortunately, it meant that we were forced to rely on the PBS output file for the total walltime of each pgf90 run, which includes set-up and finalisation time that we would have liked to omit.
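A wrap of this kind is easy to miss in a long output file, so a small check can be useful. The following is a minimal sketch, assuming the Timer column is the cumulative elapsed time (as the sample suggests) and that every data row carries the `<-- SCF` tag. The function names `parse_timers` and `find_wraps` are hypothetical and are not part of Castep.

```python
def parse_timers(lines):
    """Return (label, timer_seconds) for each SCF data row.

    Assumes data rows start with 'Initial' or an integer loop number
    and end with the timer value followed by the '<-- SCF' tag.
    """
    rows = []
    for line in lines:
        if "<-- SCF" not in line:
            continue
        fields = line.split("<-- SCF")[0].split()
        if not fields:
            continue
        label = fields[0]
        if label != "Initial" and not label.isdigit():
            continue  # header or rule line, not a data row
        rows.append((label, float(fields[-1])))
    return rows


def find_wraps(rows):
    """Return labels of rows whose timer is lower than the previous row's.

    A cumulative timer should never decrease, so any such row marks a wrap.
    """
    wraps = []
    for (_, previous), (label, current) in zip(rows, rows[1:]):
        if current < previous:
            wraps.append(label)
    return wraps


# A few rows taken from the al3x3 sample output above.
SAMPLE = [
    "SCF loop Energy Fermi Energy gain Timer <-- SCF",
    "5 -7.77212605E+004 1.34967844E+000 -4.48467894E-001 1544.50 <-- SCF",
    "6 -7.77152926E+004 1.12424610E+000 -2.21032775E-002 1863.09 <-- SCF",
    "7 -7.77129468E+004 1.05359411E+000 -8.68814103E-003 14.53 <-- SCF",
    "8 -7.77104895E+004 1.02771272E+000 -9.10094481E-003 288.19 <-- SCF",
]

if __name__ == "__main__":
    print(find_wraps(parse_timers(SAMPLE)))
```

The check only flags where a wrap occurred. It makes no attempt to correct the timings, since the wrap period cannot be determined from the output alone.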
We experimented with various flags to invoke interprocedural optimisation (-Mipa, -Mipa=fast), but the Castep timings remained constant to within one second. Figure 3.4 shows the run times obtained with both the Portland Group and Pathscale compilers, as reported by the PBS output for the TiN benchmark.