HECToR Fortran Compiler Performance Comparison

This page presents the performance of different compilers for a collection of Fortran codes being used on HECToR. The performance data given here should be regarded as a straw poll of an arbitrary set of codes and optimization flags. Flags used are either derived from existing 'fast' options in bundled Makefiles or based on limited experimentation with commonly used optimization flags. The codes have not been optimized in any way for these experiments. Any changes made were only to allow compilation to proceed (e.g. avoiding non-standard features). The same source code was used for every compiler and results were verified for each run.

The Intel and Cray 7.1° compilers are not available on HECToR and are included here because some users have expressed an interest in their use on the machine.

In every case there may exist flags as yet untried that further improve the performance of the codes below. Please get in touch if you have any information which you think might improve the quality of the data in this table, or if you would like us to run your code.

This work has been inspired by the Fortran compiler comparisons for benchmark codes performed by Polyhedron Software.

In the table below green indicates the best performing executable, red the poorest. All times are given in seconds.

Last updated: Thu Sep 3 11:21:09 BST 2009

↓ code | compiler →   Cray      GNU       Intel     Pathscale   PGI
CASTEP                362.85    356.71    357.71    343.81      366.26
TETRIN_PMG            178.91    226.24    281.26    168.89      169.71
DG-DES                177.58    229.88    174.27    160.40      291.31
GWW                   error     details   n/a       details     details
DL_POLY               211.36    242.66    error     246.83      226.47
HELIUM                2160.05   1752.12   1845.59   1854.16     1604.86
CASINO                error     432       459       407         428
Incompact3D           193.49    212.28    206.33    252.01      196.19

CASTEP

CASTEP is a plane wave basis density functional theory materials code. CASTEP has Makefiles for a number of compilers and architectures. The compiler flags used in these runs were derived from the 'fast' build option.
Code Version: 4.4
Run: al1x1 benchmark, 8 MPI processes

Cray

Compiler Version: 7.1
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags: -Dcray -e m -O 3 -O aggress

gfortran

Compiler Version: 4.3.3
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags: -fconvert=big-endian -frecord-marker=4 -O3

Intel

Compiler Version: 11.0
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags: -convert big_endian -O3 -fast -ipo -no-prec-div -msse3
Comment: The Intel build was performed off-site on an AMD quadcore machine.

Pathscale

Compiler Version: 3.2
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags: -byteswapio -O3 -OPT:Ofast -ffast-math -OPT:recip=ON -OPT:malloc_algorithm=1

PGI

Compiler Version: 3.2
Libraries: ACML 4.2.0, MPT 3.2.0, FFTW 3.1.1
Flags: -Mnostdinc -Mbyteswapio -fastsse -O3

TETRIN_PMG

TETRIN_PMG (Parallel Multi-Grid solver for INcompressible fluid on TETRahedral grids) is a 3D Euler/Navier-Stokes incompressible fluid dynamics code based on the Galerkin Finite Volume Method.
Run: 15 timesteps over a single multigrid level using 240 MPI processes.

Cray

Compiler Version: 7.0.4
Libraries: MPT 3.2.0
Flags: -O3 -Oaggress -Omsgs
Comment: Cray 7.1 (built on Jaguar°) produced a time of 185.75, which may be the result of unsafe optimizations being throttled in the later release.

gfortran

Compiler Version: 4.3.3
Libraries: MPT 3.2.0
Flags: -march=barcelona -ffast-math -funroll-loops -O3 -ffixed-line-length-none -ftree-vectorizer-verbose=2

Intel

Compiler Version: 11.0
Libraries: MPT 3.2.0
Flags: -O3 -fast -ipo -no-prec-div -msse3
Comment: The Intel build was performed off-site on an AMD quadcore machine.

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.2.0
Flags: -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON
Comment: Without the -OPT:malloc_algorithm=1 flag the runtime was 186.26, which is slower than both the PGI (169.71) and Cray (178.91) times in the table above.

PGI

Compiler Version: 8.0.6
Libraries: MPT 3.2.0
Flags: -Minfo -Mneginfo -Mextend -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64

DG-DES

DG-DES (Dynamic Grid Detached Eddy Simulation) is used to model turbulent flows in aerospace engineering using a dual time stepping Runge-Kutta method and a finite volume mesh.
Run: 50 timesteps of ncylinder using 64 MPI processes. Times given are for the timestepping phase of the calculation.
The preprocessor macros (HECTOR_NAGOPT and HECTOR_NAGTIME) turn on code optimizations and timing information from prior work with the CSE team.

Cray

Compiler Version: 7.1 (Jaguar°)
Libraries: MPT 3.2.0, metis 4.0
Flags: -O3 -Oaggress -Omsgs -F -DHECTOR_NAGOPT -DHECTOR_NAGTIME
Comment: Version 7.0.4 executable crashed with a segmentation fault.

gfortran

Compiler Version: 4.3.3
Libraries: MPT 3.2.0, metis 4.0
Flags: -O3 -march=barcelona -ffast-math -funroll-loops -x f95-cpp-input -DHECTOR_NAGOPT -DHECTOR_NAGTIME

Intel

Compiler Version: 11.0
Libraries: MPT 3.2.0, metis 4.0
Flags: -O3 -fast -ipo -no-prec-div -msse3 -fpp -DHECTOR_NAGOPT -DHECTOR_NAGTIME

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.2.0, metis 4.0
Flags: -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON -cpp -DHECTOR_NAGOPT -DHECTOR_NAGTIME

PGI

Compiler Version: 8.0.6
Libraries: MPT 3.2.0, metis 4.0
Flags: -Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64 -Mpreprocess -DHECTOR_NAGOPT -DHECTOR_NAGTIME

GWW

The GWW code is an extension to the popular electronic structure package Quantum Espresso (QE).
Code Version: GWW_in_QE4.0.4
Run: The test case (64 silicon atoms) was provided for benchmarking as part of a distributed CSE (dCSE) project that aims to improve the performance of QE as used in GWW. It involves running 7 separate parallel jobs in sequence using 3 executables: pw.x, ph.x and gww.x. pw.x and ph.x are from the standard QE distribution, extended to produce output that the new executable gww.x can consume. The Cray 7.0.4 compiler produces an internal compiler error for GWW. Because of the cost of the benchmark, only compilers available on HECToR were used for this code. Using 32 processes, the performance data are as follows, with the executable involved given in brackets. Times are the headline CPU times reported in the process 0 output file:

↓ run | compiler →        GNU         Pathscale   PGI
1. exc_scf (pw.x)         59.66       66.70       67.94
2. exc_nscf (pw.x)        542.97      597.10      616.02
3. head_scf (pw.x)        87.75       100.74      104.73
4. head_nscf (ph.x)       21609       26700       28380
5. matrix_scf (pw.x)      106.67      128.99      114.70
6. matrix_nscf (pw.x)     1239.79*    1477.37     1269.04
7. gww (gww.x)            252.02**    160.12      143.47

gfortran

Compiler Version: 4.3.3
Libraries: MPT 3.2.0, libsci 10.3.6, FFTW 3.1.1
Flags: -O3 -march=barcelona -ffast-math -funroll-loops -ffree-line-length-none
Comment: FFTW 3.2.1 failed to link.
*The matrix_nscf run originally failed because the default MPICH_PTL_UNEX_EVENTS queue size (20480) is too small. The successful run used export MPICH_PTL_UNEX_EVENTS=1000000 in the job script.
**The original GWW run failed with a segmentation fault. Two bugs were found in the code:
  1. In times_gw.f90, on lines 124 and 162, the array section weights_freq(-tf%n:1) should be weights_freq(-tf%n:-1) (i.e. up to -1 instead of 1) to match the extent of the array being copied.
  2. The association status of a variable declared with the pointer attribute is undefined until the pointer is nullified, allocated or pointer-assigned, and a pointer passed to the associated function must not have undefined status. The GWW code has many lines such as: if(associated(fftd%fd)) deallocate(fftd%fd) which are executed before the pointers are allocated. The GNU-built executable produced a segmentation fault because the test passed and deallocate tried to free unallocated space. This has been fixed by setting all pointer variables to NULL so that the associated test fails on the first pass.
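The two fixes listed above can be illustrated with a minimal sketch. It is not taken from the GWW source: the type name fft_data, the value of n (standing in for tf%n), the bounds of weights_freq and the destination array negative_part are all assumptions made for illustration. Only the names fd and weights_freq come from the text.

```fortran
program gww_fix_sketch
  implicit none

  ! Hypothetical stand-in for the GWW type; only the component name "fd"
  ! comes from the text. "=> null()" gives the pointer a defined,
  ! disassociated status from the start.
  type :: fft_data
     real, pointer :: fd(:) => null()
  end type fft_data

  integer, parameter :: n = 4          ! plays the role of tf%n (assumed value)
  type(fft_data) :: fftd
  real :: weights_freq(-n:n)           ! assumed bounds, for illustration only
  real :: negative_part(n)             ! assumed destination with n elements
  integer :: i

  ! Bug 2: the guard is only meaningful once the pointer status is defined.
  if (associated(fftd%fd)) deallocate(fftd%fd)   ! skipped: fd is disassociated
  allocate(fftd%fd(n))
  fftd%fd = 0.0
  if (associated(fftd%fd)) deallocate(fftd%fd)   ! now frees the allocated target

  ! Bug 1: the section -n:-1 has n elements and conforms with negative_part;
  ! the section -n:1 would have n+2 elements and would not.
  do i = -n, n
     weights_freq(i) = real(i)
  end do
  negative_part = weights_freq(-n:-1)

  print *, 'associated after deallocate: ', associated(fftd%fd)
  print *, 'size(-n:-1) =', size(weights_freq(-n:-1)), &
           ' size(-n:1) =', size(weights_freq(-n:1))
end program gww_fix_sketch
```

Default initialization with null() in the type definition is one way to give every pointer component a defined status. An explicit nullify statement before first use would serve the same purpose.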
After fixing these bugs the code ran successfully, albeit more slowly than the PGI and Pathscale builds. The performance bottleneck appears to be the call to lmder1 in the fit_multipole_minpack subroutine in fit_multipole.f90. However, this is not a significant issue since over 90% of the total runtime is due to the head_nscf job.
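The share of the runtime taken by head_nscf can be checked from the GWW table above. The following is a small sketch that only sums the published times. The program and variable names are hypothetical, and it assumes the seven jobs make up the whole benchmark runtime.

```fortran
program head_nscf_share
  implicit none

  ! Times in seconds copied from the GWW table:
  ! rows are jobs 1 to 7, columns are GNU, Pathscale, PGI.
  real, parameter :: times(7,3) = reshape( (/ &
       59.66, 542.97,  87.75, 21609.0, 106.67, 1239.79, 252.02, &
       66.70, 597.10, 100.74, 26700.0, 128.99, 1477.37, 160.12, &
       67.94, 616.02, 104.73, 28380.0, 114.70, 1269.04, 143.47 /), (/ 7, 3 /) )
  character(len=9), parameter :: compiler(3) = &
       (/ 'GNU      ', 'Pathscale', 'PGI      ' /)
  integer, parameter :: head_nscf = 4   ! row index of the head_nscf job
  integer :: j
  real :: total

  do j = 1, 3
     ! reshape fills column by column, so column j holds one compiler's jobs
     total = sum(times(:, j))
     write (*, '(a9, 2x, a, f10.2, a, f5.1, a)') compiler(j), 'total =', total, &
          ' s, head_nscf share = ', 100.0 * times(head_nscf, j) / total, ' %'
  end do
end program head_nscf_share
```

For all three compilers the head_nscf share comes out a little above 90%, so the slower gww step in the GNU build has little effect on the overall ranking.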

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.2.0, libsci 10.3.6, FFTW 3.2.1
Flags: -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON
Comments: The macro __EKO in Modules/stick_base.f90 had to be enabled to work around a Pathscale bug.

Intel

Not done.

PGI

Compiler Version: 8.0.6
Libraries: MPT 3.2.0, libsci 10.3.6, FFTW 3.2.1
Flags: -Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64

Cray

Internal compiler error for Cray 7.0.4 on file fft_gw.F90 in GWW, even with no optimization:
nid15883> ftn -O0 -N255 -D__GWW -D__FFTW3 -D__MPI -D__PARA -D__CRAYX86 -I../include -I./ -I../Modules -I../iotk/src -I../PW -I../PH -c fft_gw.F90 -o fft_gw.o
cft90: llvm/lib/VMCore/Instructions.cpp:2267: llvm::BitCastInst::BitCastInst(llvm::Value*, const llvm::Type*, const std::string&, llvm::BasicBlock*): Assertion `castIsValid(getOpcode(), S, Ty) && "Illegal BitCast"' failed.
ftn-2116 crayftn: INTERNAL
"/opt/cray/cce/7.0.4/cftn/x86-64/lib/cft90" was terminated due to receipt of signal 06: Aborted.

DL_POLY

DL_POLY is a molecular dynamics code that can be used to simulate a wide variety of molecular systems including simple liquids, ionic liquids and solids, small polar and non-polar molecular systems, bio- and synthetic polymers, ionic polymers and glasses, solutions, simple metals and alloys.
Code Version: 3.10
Run: sodium chloride with Ewald sum (1728000 ions) using 32 processes

Cray

Compiler Version: 7.1 (Jaguar°)
Libraries: MPT 3.2.0
Flags: -O3 -Oaggress -Omsgs -en
Comments: The Cray 7.0.4 compiler recorded a time of 219.99 s.

gfortran

Compiler Version: 4.3.3
Libraries: MPT 3.2.0
Flags: -O3 -march=barcelona -ffast-math -funroll-loops -ffree-line-length-none -Wall -pedantic

Intel

Compiler Version: 11.0
Libraries: MPT 3.2.0
Flags: no flags
Comment: The job encountered an error at runtime: "too many atoms in CONFIG file" (although the CONFIG file is the same as that used for the other executables). This is possibly due to linking against the PGI MPI and runtime libraries.

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.2.0
Flags: -byteswapio -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON

PGI

Compiler Version: 8.0.6
Libraries: MPT 3.2.0
Flags: -Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline -O3 -tp barcelona-64

HELIUM

HELIUM models the interaction of the helium atom with laser fields.
Run: 200 timesteps with a 240x240 radial grid using 55 processes. Times given are for the wavefunction calculation.

Cray

Compiler Version: 7.0.4
Libraries: MPT 3.2.0
Flags: -O3 -Oaggress -Omsgs

gfortran

Compiler Version: 4.3.3
Libraries: MPT 3.2.0
Flags: -O3 -march=barcelona -ffast-math -funroll-loops -ffree-line-length-none

Intel

Compiler Version: 11.0
Libraries: MPT 3.2.0
Flags: -O3 -fast -ipo -no-prec-div -msse3

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.2.0
Flags: -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1

PGI

Compiler Version: 8.0.6
Libraries: MPT 3.2.0
Flags: -fast -Munroll=n:4 -Mipa=fast -O3 -tp barcelona-64
Comments: -Mipa=inline produces a compile time error message even with the -mcmodel=medium -Mlarge_arrays flags:
/opt/pgi/8.0.6/linux86-64/8.0-6/libso/libpgf90.a(initpar.o): In function `__hpf_initarg':
initpar.c:(.text+0x137): relocation truncated to fit: R_X86_64_PC32 against `.bss'

CASINO

CASINO is a quantum Monte Carlo program for calculating the electronic properties of matter.
Code Version: 3.0.42
Run: 1024 electrons, 64 atoms, Fe pseudopotential, blip grid 92 by 92 by 72; 5 configurations per core, no branching. Times given are the average time per block for 9 DMC blocks of 10 steps.

Cray

Compilation with Cray 7.0 and 7.1 compilers crashed due to an internal compiler error:
[F90] casl ftn-2116 crayftn: INTERNAL "/opt/cray/cce/7.1.1/cftn/x86-64/lib/cft90" was terminated due to receipt of signal 013: Segmentation fault.

gfortran

Compiler Version: 4.4.1
Libraries: MPT 3.1, libsci 10.3.6
Flags: -O3 -march=barcelona -ffast-math -funroll-loops -fcray-pointer

Intel

Compiler Version: 11.0
Libraries: MPT 3.1
Flags: -O3 -no-prec-div -no-prec-sqrt -funroll-loops -no-fp-port -ip -complex-limited-range -par-report0 -vec-report0

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.1, libsci 10.3.2
Flags: -Ofast -march=barcelona -LNO:full_unroll=4 -OPT:malloc_algorithm=1

PGI

Compiler Version: 8.0.6
Libraries: MPT 3.1, libsci 10.3.2
Flags: -fast -Munroll=n:4 -Mipa=fast,inline -O4 -tp barcelona-64 -Mlarge_arrays

Incompact3D

Incompact3D is a CFD code for very large-scale turbulence simulations. Its numerical framework rests on a simple 3D Cartesian mesh and uses a compact finite difference approach to solve the fluid PDEs. The numerical scheme is implicit, and the compact finite difference scheme involves solving a tridiagonal system.
Run: 10 timesteps over a 2048^3 mesh using 1024 processes. Times given are the per-process average times for the 10 timesteps, excluding the MPI_Alltoall communications.

Cray

Compiler Version: 7.1
Libraries: MPT 3.5
Flags: -O3 -Oaggress -Omsgs -F

gfortran

Compiler Version: 4.4.1
Libraries: MPT 3.5
Flags: -O3 -march=barcelona -ffast-math -funroll-loops -ftree-vectorize -fcray-pointer -x f95-cpp-input -ftree-vectorizer-verbose=2

Intel

Compiler Version: 11.0
Libraries: MPT 3.5
Flags: -O3 -fast -ipo -no-prec-div -msse3 -cpp

Pathscale

Compiler Version: 3.2
Libraries: MPT 3.5
Flags: -cpp -Ofast -LNO:full_unroll=4 -march=barcelona -OPT:malloc_algorithm=1 -LNO:simd_verbose=ON

PGI

Compiler Version: 9.04
Libraries: MPT 3.5
Flags: -Mpreprocess -fast -Munroll=n:4 -Mipa=fast -O3 -tp barcelona-64 -Minfo -Mneginfo


This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.