HECToR Fortran Compiler Performance Comparison

This page presents the performance of different compilers for a collection of Fortran codes used on phase 3 of HECToR. The data were collected by the NAG HECToR CSE team. The performance data given here should be regarded as a straw poll of an arbitrary set of codes and optimization flags. The flags used are either derived from existing 'fast' options in bundled Makefiles or based on limited experimentation with commonly used optimization flags. As such, the data are NOT intended to represent the best performance achievable by any compiler, but rather to indicate what a user can expect when compiling an application “out of the box”. The codes have not been optimized in any way for these experiments. The same source code was used for every compiler, and results were verified for each run.

In every case there may exist flags, as yet untried, that further improve the performance of the codes below. For some of the codes there has been limited experimentation; this is described in the Details section, which also gives the compiler versions, flags and benchmarks used. Please get in touch if you have any information that might improve the quality of the data in this table, or if you would like us to run your code.

This work has been inspired by the Fortran compiler comparisons for benchmark codes performed by Polyhedron Software.

In the table below, green indicates the best-performing executable and red the poorest. All times are given in seconds. Please click on the results to see the compiler flags used and further details about each run.

Last updated: Thu Apr 12 11:44:04 BST 2012

Code        | Cray            | GNU            | PGI
DL_POLY_4   | 390.89          | 390.11         | 363.63
CRYSTAL     | 3952.86         | 3954.67        | 3633.41
ONETEP      | 2250            | 2303           | 2365
CASINO      | 637.98          | 541.17         | 558.15
CASTEP      | 3023.63         | 2746.06        | 2800.78
VASP        | Did not compile | 13591.19       | 13642.90
HELIUM      | 1116            | Did not finish | 1205
Incompact3D | 1.8021          | 2.1092         | 1.9815
CABARET     | 172.03          | 213.20         | 273.14

The clear conclusion from this data is that there is no clear winner! Thus at present we can only recommend that, if possible, users try each compiler on their code and use whichever suits their needs best. Should a user want help with this, please get in touch with the HECToR CSE team, who will be very happy to help.
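The "no clear winner" observation can be checked directly from the table. The following is a minimal Python sketch, not part of the original study: it copies the timings from the table, treats "Did not compile" and "Did not finish" as missing values, and reports each time relative to the fastest result for that code together with a count of fastest results per compiler. The names `times`, `best_compiler` and `relative_times` are introduced here purely for illustration.

```python
# Wall-clock times in seconds, copied from the results table.
# None marks a run that did not compile or did not finish.
times = {
    "DL_POLY_4":   {"Cray": 390.89,  "GNU": 390.11,   "PGI": 363.63},
    "CRYSTAL":     {"Cray": 3952.86, "GNU": 3954.67,  "PGI": 3633.41},
    "ONETEP":      {"Cray": 2250.0,  "GNU": 2303.0,   "PGI": 2365.0},
    "CASINO":      {"Cray": 637.98,  "GNU": 541.17,   "PGI": 558.15},
    "CASTEP":      {"Cray": 3023.63, "GNU": 2746.06,  "PGI": 2800.78},
    "VASP":        {"Cray": None,    "GNU": 13591.19, "PGI": 13642.90},
    "HELIUM":      {"Cray": 1116.0,  "GNU": None,     "PGI": 1205.0},
    "Incompact3D": {"Cray": 1.8021,  "GNU": 2.1092,   "PGI": 1.9815},
    "CABARET":     {"Cray": 172.03,  "GNU": 213.20,   "PGI": 273.14},
}


def finished_runs(row):
    """Keep only the compilers that produced a timing for this code."""
    return {name: t for name, t in row.items() if t is not None}


def best_compiler(row):
    """Return the compiler with the smallest recorded time."""
    finished = finished_runs(row)
    return min(finished, key=finished.get)


def relative_times(row):
    """Express each recorded time as a ratio to the fastest time."""
    finished = finished_runs(row)
    fastest = min(finished.values())
    return {name: t / fastest for name, t in finished.items()}


# Count how often each compiler produced the fastest executable.
wins = {"Cray": 0, "GNU": 0, "PGI": 0}
for code, row in times.items():
    wins[best_compiler(row)] += 1
    ratios = {name: round(r, 3) for name, r in relative_times(row).items()}
    print(f"{code:12s} {ratios}")

print("Fastest results per compiler:", wins)
```

With the table values, this counts four fastest results for Cray, three for GNU and two for PGI, so every compiler is fastest for at least two codes. A simple count ignores the size of each margin and the missing runs, so it should be read alongside the ratios rather than as a ranking.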

Details

DL_POLY_4

DL_POLY is a molecular dynamics code that can be used to simulate a wide variety of molecular systems including simple liquids, ionic liquids and solids, small polar and non-polar molecular systems, bio- and synthetic polymers, ionic polymers and glasses, solutions, simple metals and alloys. Version 4.03 of the code was used. The benchmark was 216,000 ions of sodium chloride, run on 32 cores.
Cray
Compiler version: 8.0.1
Flags: -O3 -en
GNU
Compiler version: 4.6.2
Flags: -O3 -funroll-loops -std=f95 -Wall -pedantic
PGI
Compiler version: 11.9.0
Flags: -O3 -fast

CRYSTAL

CRYSTAL (http://www.crystal.unito.it/) is a periodic ab initio electronic structure code that adopts a local basis set of Gaussian type orbitals. CRYSTAL09 was used. The benchmark was based on the mesoporous silica MCM-41, which contains 579 atoms and uses 7756 basis functions. The runs were on 256 cores.
Cray
Compiler version: 8.0.1
Flags: -O3 -en
GNU
Compiler version: 4.6.2
Flags: -O2 -funroll-loops -std=f95 -Wall -pedantic
Use of -O3 caused the routine POLIPA to generate a segmentation fault.
PGI
Compiler version: 11.9.0
Flags: -O3 -fast

ONETEP

ONETEP (Order-N Electronic Total Energy Package) is a linear-scaling code for quantum-mechanical calculations based on density-functional theory; see http://www2.tcm.phy.cam.ac.uk/onetep/ for more details. The version current as of October 2011 was used. The benchmark was 2 NGWF iterations on a lysozyme protein fragment. Runs were on 64 cores using half-packed nodes (fully packed nodes ran out of memory).
Cray
Compiler version: 8.0.1
Flags: -O2 -Oipa1
Use of higher ipa levels produces incorrect results
GNU
Compiler version: 4.6.2
Flags: -Ofast
PGI
Compiler version: 11.10.0
Flags: -Ofast

CASINO

CASINO (http://www.tcm.phy.cam.ac.uk/~mdt26/casino2.html) is a code for performing quantum Monte Carlo (QMC) electronic structure calculations for finite and periodic systems. Version 2.10 was used. The calculation was on 1024 electrons and 64 atoms, using an Fe pseudopotential, a 92 by 92 by 72 blip grid and 5 configurations per core for 10 steps, with no branching. Time values are an average per block (averaging performed over 4 blocks).
Cray
Compiler version: 8.0.1
Flags: -O1 -h noomp
Use of higher optimisation levels produces incorrect results.
GNU
Compiler version: 4.6.2
Flags: -O3 -ffast-math -funroll-loops -fcray-pointer
PGI
Compiler version: 11.9.0
Flags: -fast -Munroll=n:4 -Mnoipa -O4 -Mlarge_arrays -g -traceback -Minform=severe -Mbackslash

CASTEP

CASTEP (www.castep.org) is a full-featured materials modelling code based on a first-principles quantum mechanical description of electrons and nuclei. Version 6.0 was used. The calculation was a full DFPT phonon dispersion calculation for a single unit cell of rutile TiO2 (6 atoms), and runs were on 96 cores.
Cray
Compiler version: 7.4.4
Flags: -O3 -Ocache3 -Oipa5
GNU
Compiler version: 4.6.1
Flags: -O3 -funroll-loops -fprefetch-loop-arrays
PGI
Compiler version: 11.9.0
Flags: Most files were compiled with -fast -Msmartalloc=huge. However, some performance-critical routines required -O1 to work properly because of an optimiser bug.

VASP

The Vienna Ab initio Simulation Package (VASP) is a computer program for atomic scale materials modelling, e.g. electronic structure calculations and quantum-mechanical molecular dynamics, from first principles. More details can be found at http://www.vasp.at/. Version 5.2.12 was used. Runs were on 128 cores. The benchmark was a supercell of CeO2.
For this code a number of runs were performed, and it was found that the GNU compiler consistently produced faster executables than PGI, but in all cases the difference was small, typically less than 10%.
Cray
Compiler version: 8.0.2
We were unable to compile VASP due to the use of non-standard features in the code rather than a problem with the Cray compiler itself.
GNU
Compiler version: 4.6.2
Flags: -O3 -fexternal-blas -ffast-math -funroll-loops
PGI
Compiler version: 12.1.0
Flags: -O3 -Mvect -fastsse -Mipa=fast,inline

HELIUM

HELIUM models the interaction of the helium atom with laser fields. Runs were on 16290 cores.
Cray
Compiler version: 8.0.1
Flags: -h noomp -e m
GNU
Compiler version: 4.6.2
Flags: -O3
PGI
Compiler version: 11.10.0
Flags: -fastsse -Mipa=fast -Mpreprocess -Minfo -Ktrap=ovf,divz

Incompact3D

Incompact3D (http://www3.imperial.ac.uk/tmfc/people/sylvainlaizet/incompact3d) is a CFD code for large-scale turbulence simulations. Its numerical framework rests on a simple 3D Cartesian mesh and uses a compact finite difference approach to solve the fluid PDEs. The numerical schemes are implicit: the compact finite difference scheme involves solving tri-diagonal matrices, and the pressure Poisson solver uses a Fourier-based spectral method. The MPI communication is of all-to-all type. Run: 1000 timesteps over a 128*64*64 mesh using 64 processes. Each case was repeated 6 times and, for the fastest run, the average time per timestep was recorded.
Cray
Compiler version: 8.0.0
Flags: -e -Fm
GNU
Compiler version: 4.6.2
Flags: -O3 -funroll-loops -ftree-vectorize -cpp
PGI
Compiler version: 11.10.0
Flags: -fast -O3 -Mpreprocess
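The timing procedure described for Incompact3D (repeat each case six times, keep the fastest run, and report its average time per timestep) can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used for the study: it assumes the total wall-clock time of each run is what was measured, the function name `time_per_step` is hypothetical, and the example totals are made-up values rather than HECToR measurements.

```python
def time_per_step(run_totals, n_steps):
    """Average time per timestep for the fastest of several repeated runs.

    run_totals: total wall-clock time in seconds for each repeat
    n_steps:    number of timesteps performed in every run
    """
    if not run_totals:
        raise ValueError("at least one run is required")
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    fastest_total = min(run_totals)  # keep only the fastest repeat
    return fastest_total / n_steps   # average over its timesteps


# Illustrative totals only: these are NOT measurements from HECToR.
example_totals = [1850.0, 1820.0, 1835.0, 1802.0, 1810.0, 1841.0]
print(time_per_step(example_totals, n_steps=1000))
```

Taking the minimum rather than the mean of the repeats reduces the influence of run-to-run noise such as contention from other jobs, which is a common reason for reporting the fastest run.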

CABARET

CABARET is an unstructured hexagonal code based on Monotonically Integrated LES for turbulent flow modelling. It is developed at The Whittle Laboratory, Cambridge University Engineering Department, and the Mathematical Modelling Division, Nuclear Safety Institute, Russian Academy of Science, Moscow, Russia. The code is based on the Compact Accurately Boundary Adjusting high-REsolution Technique, a significant upgrade of the second-order Upwind Leapfrog scheme (Iserlis, 1986) to nonlinear conservation laws (Karabasov and Goloviznin, 2009). The simulations were performed on a 700,000 grid, which corresponds to a reduced-size model of an overexpanded heated jet with supersonic inflow conditions at a Reynolds number of 1000000 included within the flight stream. The timings are for 1000 time steps on 64 cores (2 nodes).
Cray
Compiler version: 8.0.1
Flags: -O3
GNU
Compiler version: 4.6.2
Flags: -O3
For the GNU compiler, it was found that use of -Ofast -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=2 reduced the run time from 213.2 seconds to 201.5 seconds.
PGI
Compiler version: 11.10.0
Flags: -O3 -Mpfi -Mpfo -Minline -Munroll -Mvect
For the PGI compiler, it was found that use of -O3 -Minfo -Mneginfo -fast -Munroll=n:4 -Mipa=fast,inline reduced the run time markedly, from 273.14 seconds to 180.33 seconds.
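To put the two CABARET flag experiments in perspective, the short Python sketch below computes the percentage reduction in run time for each, using only the figures quoted in this section. The helper name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before, after):
    """Percentage by which the run time fell when going from before to after."""
    return 100.0 * (before - after) / before


# Times in seconds, as quoted in the CABARET section.
gnu_reduction = percent_reduction(213.2, 201.5)
pgi_reduction = percent_reduction(273.14, 180.33)

# Cray time from the results table, obtained with -O3 only.
cray_time = 172.03

print(f"GNU: {gnu_reduction:.1f}% faster with the alternative flags")
print(f"PGI: {pgi_reduction:.1f}% faster with the alternative flags")
print("Cray -O3 still fastest:", cray_time < min(201.5, 180.33))
```

The reductions come to roughly 5.5% for GNU and 34% for PGI. The Cray time of 172.03 seconds remains the smallest of the three, although no comparable flag experiment is reported for the Cray compiler, so the comparison is not like for like.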