HECToR Fortran Compiler Performance Comparison
This page presents the performance of different compilers for a collection of Fortran codes in use on HECToR. The performance data given here should be regarded as a straw poll of an arbitrary set of codes and optimization flags. Flags are either derived from existing 'fast' options in bundled Makefiles or based on limited experimentation with commonly used optimization flags. The codes have not been optimized in any way for these experiments; any changes were made only to allow compilation to proceed (e.g. avoiding non-standard features). The same source code was used for every compiler and results were verified for each run.
The Intel and Cray 7.1° compilers are not available on HECToR and are included here because some users have expressed an interest in their use on the machine. In every case there may exist flags as yet untried that further improve the performance of the codes below. Please get in touch if you have any information which you think might improve the quality of the data in this table, or if you would like us to run your code.
This work has been inspired by the Fortran compiler comparisons for benchmark codes performed by Polyhedron Software.
In the table below green indicates the best performing executable, red the poorest. All times are given in seconds.
Last updated: Thu Sep 3 11:21:09 BST 2009
| ↓ code / compiler → | Cray | GNU | Intel | Pathscale | PGI |
| CASTEP | 362.85 | 356.71 | 357.71 | 343.81 | 366.26 |
| TETRIN_PMG | 178.91 | 226.24 | 281.26 | 168.89 | 169.71 |
| DG-DES | 177.58 | 229.88 | 174.27 | 160.40 | 291.31 |
| GWW | error | details | n/a | details | details |
| DL_POLY | 211.36 | 242.66 | error | 246.83 | 226.47 |
| HELIUM | 2160.05 | 1752.12 | 1845.59 | 1854.16 | 1604.86 |
| CASINO | error | 432 | 459 | 407 | 428 |
| Incompact3D | 193.49 | 212.28 | 206.33 | 252.01 | 196.19 |
CASTEP
CASTEP is a plane wave basis density functional theory materials code. CASTEP has Makefiles for a number of compilers and architectures. The compiler flags used in these runs were derived from the 'fast' build option.
Code Version: 4.4
Run: al1x1 benchmark, 8 MPI processes
Cray
Compiler Version: 7.1
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags:
-Dcray -e m -O 3 -O aggress
gfortran
Compiler Version: 4.3.3
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags:
-fconvert=big-endian -frecord-marker=4 -O3
Intel
Compiler Version: 11.0
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags:
-convert big_endian -O3 -fast -ipo -no-prec-div -msse3
Comment: The Intel build was performed off-site on an AMD quadcore machine.
Pathscale
Compiler Version: 3.2
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags:
-byteswapio -O3 -OPT:Ofast -ffast-math -OPT:recip=ON -OPT:malloc_algorithm=1
PGI
Compiler Version: 3.2
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags:
-Mnostdinc -Mbyteswapio -fastsse -O3
TETRIN_PMG
TETRIN_PMG (Parallel Multi-Grid solver for INcompressible fluid on TETRahedral grids) is a 3D Euler/Navier-Stokes incompressible fluid dynamics code based on the Galerkin Finite Volume Method.
Run: 15 timesteps over a single multigrid level using 240 MPI processes.
Cray
Compiler Version: 7.0.4
Libraries: MPT 3.2.0
Flags:
-O3 -Oaggress -Omsgs
Comment: Cray 7.1 (built on Jaguar°) produced a time of 185.75, which may be the result of unsafe optimizations being throttled in the later release.
gfortran
Compiler Version: 4.3.3
Libraries: MPT 3.2.0
Flags:
-march=barcelona -ffast-math -funroll-loops -O3 -ffixed-line-length-none -ftree-vectorizer-verbose=2
Intel
Compiler Version: 11.0
Libraries: MPT 3.2.0
Flags:
-O3 -fast -ipo -no-prec-div -msse3
Comment: The Intel build was performed off-site on an AMD quadcore machine.
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.2.0
Flags:
-Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON
Comment: Without the -OPT:malloc_algorithm=1 flag the runtime was 186.26, which would make Cray the fastest compiler for this code.
PGI
Compiler Version: 8.0.6
Libraries: MPT 3.2.0
Flags:
-Minfo -Mneginfo -Mextend -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64
DG-DES
DG-DES (Dynamic Grid Detached Eddy Simulation) is used to model turbulent flows in aerospace engineering using a dual time stepping Runge-Kutta method and a finite volume mesh.
Run: 50 timesteps of ncylinder using 64 MPI processes. Times given are for the timestepping phase of the calculation.
The preprocessor macros turn on code optimizations and timing info from prior work with the CSE team.
Cray
Compiler Version: 7.1 (Jaguar°)
Libraries: MPT 3.2.0, metis 4.0
Flags:
-O3 -Oaggress -Omsgs -F -DHECTOR_NAGOPT -DHECTOR_NAGTIME
Comment: Version 7.0.4 executable crashed with a segmentation fault.
gfortran
Compiler Version: 4.3.3
Libraries: MPT 3.2.0, metis 4.0
Flags:
-O3 -march=barcelona -ffast-math -funroll-loops -x f95-cpp-input -DHECTOR_NAGOPT -DHECTOR_NAGTIME
Intel
Compiler Version: 11.0
Libraries: MPT 3.2.0, metis 4.0
Flags:
-O3 -fast -ipo -no-prec-div -msse3 -fpp -DHECTOR_NAGOPT -DHECTOR_NAGTIME
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.2.0, metis 4.0
Flags:
-Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON -cpp -DHECTOR_NAGOPT -DHECTOR_NAGTIME
PGI
Compiler Version: 8.0.6
Libraries: MPT 3.2.0, metis 4.0
Flags:
-Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64 -Mpreprocess -DHECTOR_NAGOPT -DHECTOR_NAGTIME
GWW
The GWW code is an extension to the popular electronic structure package Quantum Espresso (QE).
Code Version: GWW_in_QE4.0.4
Run: The performance of QE as used in GWW is to be improved as part of a distributed CSE (dCSE) project. The test case provided for benchmarking as part of the dCSE project (64 silicon atoms) involves running 7 separate parallel jobs in sequence using 3 executables: pw.x, ph.x and gww.x. pw.x and ph.x are from the standard QE distribution, extended to produce output consumable by the new executable gww.x. The Cray 7.0.4 compiler produces an internal compiler error for GWW. Only compilers available on HECToR have been used for this code due to the cost of the benchmark. Using 32 processes the performance data are as follows (with the executable involved given in brackets). Times given are the headline CPU times reported in the process 0 output file:
| ↓ run / compiler → | GNU | Pathscale | PGI |
| 1. exc_scf (pw.x) | 59.66 | 66.70 | 67.94 |
| 2. exc_nscf (pw.x) | 542.97 | 597.10 | 616.02 |
| 3. head_scf (pw.x) | 87.75 | 100.74 | 104.73 |
| 4. head_nscf (ph.x) | 21609 | 26700 | 28380 |
| 5. matrix_scf (pw.x) | 106.67 | 128.99 | 114.70 |
| 6. matrix_nscf (pw.x) | 1239.79* | 1477.37 | 1269.04 |
| 7. gww (gww.x) | 252.02** | 160.12 | 143.47 |
gfortran
Compiler Version: 4.3.3
Libraries: MPT 3.2.0, libsci 10.3.6, FFTW 3.1.1
Flags:
-O3 -march=barcelona -ffast-math -funroll-loops -ffree-line-length-none
Comment: FFTW 3.2.1 failed to link.
*matrix_nscf run originally failed because the default MPICH_PTL_UNEX_EVENTS queue size (20480) is too small. The successful run used
export MPICH_PTL_UNEX_EVENTS=1000000
in the job script.
**The original GWW run failed with a segmentation fault. Two bugs were found in the code:
- In times_gw.f90 on lines 124 and 162 the range of the array section weights_freq(-tf%n:1) should be weights_freq(-tf%n:-1) (i.e. up to -1 instead of 1) to match the extent of the array being copied.
- Variables declared with the pointer attribute have an undefined association status until they are first modified, and pointers passed to the associated function must not be undefined. The GWW code has many lines such as
if(associated(fftd%fd)) deallocate(fftd%fd)
which are executed before the pointers are allocated. With the GNU compiler this produced a segmentation fault, since the test passed and deallocate tried to free unallocated space. This has been fixed by setting all pointer variables to NULL so that the associated test fails on first pass.
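The two fixes described above can be illustrated with a minimal, self-contained sketch. It is not taken from the GWW source: the derived types fft_data and times_freq, the array bounds, the destination array negative_part and the use of default initialisation with null() are all assumptions made for illustration (the actual fix may equally have used nullify or explicit pointer assignment). Only the names weights_freq, tf%n and fftd%fd come from the text.

```fortran
program gww_fix_sketch
  implicit none

  ! Hypothetical stand-in for the GWW derived type. Only the component
  ! name fd comes from the text; its type and rank are assumptions.
  type fft_data
     real, dimension(:), pointer :: fd => null()  ! starts disassociated
  end type fft_data

  ! Hypothetical stand-in for the type of tf; only component n is known.
  type times_freq
     integer :: n
  end type times_freq

  type(fft_data)    :: fftd
  type(times_freq)  :: tf
  real, allocatable :: weights_freq(:)   ! assumed bounds -n:n
  real, allocatable :: negative_part(:)  ! assumed destination, n elements
  integer           :: i

  ! Bug 1: extent of the array section.
  tf%n = 3
  allocate(weights_freq(-tf%n:tf%n))
  allocate(negative_part(tf%n))
  do i = -tf%n, tf%n
     weights_freq(i) = real(i)
  end do

  ! The section (-n:1) holds n+2 elements and (-n:-1) holds n, so only
  ! the second conforms with an n-element destination.
  print *, 'elements in (-n:1): ', size(weights_freq(-tf%n:1))
  print *, 'elements in (-n:-1):', size(weights_freq(-tf%n:-1))
  negative_part = weights_freq(-tf%n:-1)

  ! Bug 2: association status of pointers.
  ! Because fd is initialised to null(), this test is well defined and
  ! false on the first pass, so deallocate is not reached.
  if (associated(fftd%fd)) deallocate(fftd%fd)

  allocate(fftd%fd(8))
  fftd%fd = 0.0

  ! The pointer is now associated, so the same guard frees the memory;
  ! a successful deallocate leaves the pointer disassociated again.
  if (associated(fftd%fd)) deallocate(fftd%fd)
  print *, 'associated after cleanup:', associated(fftd%fd)
end program gww_fix_sketch
```

Without the null() initialisation the first associated test would inspect a pointer of undefined status, which is the situation that led to the segmentation fault reported for the GNU build.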
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.2.0, libsci 10.3.6, FFTW 3.2.1
Flags:
-Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON
Comments: Had to enable the macro __EKO in Modules/stick_base.f90 as a workaround for a Pathscale bug.
Intel
Not done.
PGI
Compiler Version: 8.0.6
Libraries: MPT 3.2.0, libsci 10.3.6, FFTW 3.2.1
Flags:
-Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64
Cray
Internal compiler error for Cray 7.0.4 on file fft_gw.F90 in GWW, even with no optimization:
nid15883> ftn -O0 -N255 -D__GWW -D__FFTW3 -D__MPI -D__PARA -D__CRAYX86 -I../include -I./ -I../Modules -I../iotk/src -I../PW -I../PH -c fft_gw.F90 -o fft_gw.o
cft90: llvm/lib/VMCore/Instructions.cpp:2267: llvm::BitCastInst::BitCastInst(llvm::Value*, const llvm::Type*, const std::string&, llvm::BasicBlock*): Assertion `castIsValid(getOpcode(), S, Ty) && "Illegal BitCast"' failed.
ftn-2116 crayftn: INTERNAL
"/opt/cray/cce/7.0.4/cftn/x86-64/lib/cft90" was terminated due to receipt of signal 06: Aborted.
DL_POLY
DL_POLY is a molecular dynamics code that can be used to simulate a wide variety of molecular systems including simple liquids, ionic liquids and solids, small polar and non-polar molecular systems, bio- and synthetic polymers, ionic polymers and glasses, solutions, simple metals and alloys.
Code Version: 3.10
Run: sodium chloride with Ewald sum (1728000 ions) using 32 processes
Cray
Compiler Version: 7.1 (Jaguar°)
Libraries: MPT 3.2.0
Flags:
-O3 -Oaggress -Omsgs -en
Comments: Cray 7.0.4 compiler recorded a time of 219.99s.
gfortran
Compiler Version: 4.3.3
Libraries: MPT 3.2.0
Flags:
-O3 -march=barcelona -ffast-math -funroll-loops -ffree-line-length-none -Wall -pedantic
Intel
Compiler Version: 11.0
Libraries: MPT 3.2.0
Flags:
no flags
Comment: The job encountered an error at runtime: "too many atoms in CONFIG file" (although the CONFIG file is the same as used for the other executables). Possibly due to linking of PGI MPI and runtime libraries.
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.2.0
Flags:
-byteswapio -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON
PGI
Compiler Version: 8.0.6
Libraries: MPT 3.2.0
Flags:
-Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64
HELIUM
HELIUM models the interaction of the helium atom with laser fields.
Run: 200 timesteps with 240x240 radial grid using 55 processes. Times given are for the wavefunction calculation.
Cray
Compiler Version: 7.0.4
Libraries: MPT 3.2.0
Flags:
-O3 -Oaggress -Omsgs
gfortran
Compiler Version: 4.3.3
Libraries: MPT 3.2.0
Flags:
-O3 -march=barcelona -ffast-math -funroll-loops -ffree-line-length-none
Intel
Compiler Version: 11.0
Libraries: MPT 3.2.0
Flags:
-O3 -fast -ipo -no-prec-div -msse3
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.2.0
Flags:
-Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1
PGI
Compiler Version: 8.0.6
Libraries: MPT 3.2.0
Flags:
-fast -Munroll=n:4 -Mipa=fast -O3 -tp barcelona-64
Comments: -Mipa=inline produces a compile time error message even with the -mcmodel=medium -Mlarge_arrays flags:
/opt/pgi/8.0.6/linux86-64/8.0-6/libso/libpgf90.a(initpar.o): In function `__hpf_initarg':
initpar.c:(.text+0x137): relocation truncated to fit: R_X86_64_PC32 against `.bss'
CASINO
CASINO is a quantum Monte Carlo program for calculating the electronic properties of matter.
Code Version: 3.0.42
Run: 1024 electrons, 64 atoms, fe pseudopotential, blip grid 92 by 92 by 72; 5 configurations per core, no branching. Times given are the average time per block for 9 DMC blocks of 10 steps.
Cray
Compilation with Cray 7.0 and 7.1 compilers crashed due to an internal compiler error:
[F90] casl
ftn-2116 crayftn: INTERNAL
"/opt/cray/cce/7.1.1/cftn/x86-64/lib/cft90" was terminated due to receipt of signal 013: Segmentation fault.
gfortran
Compiler Version: 4.4.1
Libraries: MPT 3.1, libsci 10.3.6
Flags:
-O3 -march=barcelona -ffast-math -funroll-loops -fcray-pointer
Intel
Compiler Version: 11.0
Libraries: MPT 3.1
Flags:
-O3 -no-prec-div -no-prec-sqrt -funroll-loops -no-fp-port -ip -complex-limited-range -par-report0 -vec-report0
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.1, libsci 10.3.2
Flags:
-Ofast -march=barcelona -LNO:full_unroll=4 -OPT:malloc_algorithm=1
PGI
Compiler Version: 8.0.6
Libraries: MPT 3.1, libsci 10.3.2
Flags:
-fast -Munroll=n:4 -Mipa=fast,inline -O4 -tp barcelona-64 -Mlarge_arrays
Incompact3D
Incompact3D is a CFD code for performing very large-scale turbulence simulations. Its numerical framework rests on a simple 3D Cartesian mesh and uses a compact finite difference approach to solve the fluid PDEs. The numerical scheme is implicit and the compact finite difference scheme involves solving a tri-diagonal matrix.
Run: 10 timesteps over a 2048^3 mesh using 1024 processes. Times given are the process average times for the 10 timesteps less the MPI_Alltoall communications.
Cray
Compiler Version: 7.1
Libraries: MPT 3.5
Flags:
-O3 -Oaggress -Omsgs -F
gfortran
Compiler Version: 4.4.1
Libraries: MPT 3.5
Flags:
-O3 -march=barcelona -ffast-math -funroll-loops -ftree-vectorize -fcray-pointer -x f95-cpp-input -ftree-vectorizer-verbose=2
Intel
Compiler Version: 11.0
Libraries: MPT 3.5
Flags:
-O3 -fast -ipo -no-prec-div -msse3 -cpp
Pathscale
Compiler Version: 3.2
Libraries: MPT 3.5
Flags:
-cpp -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON
PGI
Compiler Version: 9.04
Libraries: MPT 3.5
-Mpreprocess -fast -Munroll=n:4 -Mipa=fast -O3 -tp barcelona-64 -Minfo -Mneginfo
° This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
