6. Tuning

This section discusses best practice for improving the performance of your code on Cray XE systems. We begin with a discussion of how to optimise the serial (single-core) compute performance and then discuss how to improve parallel performance.

Please note that these are general guidelines and some/all of the recommendations may not have an impact on your code. We always advise that you analyse the performance of your code using the profiling tools detailed in the Performance analysis section to identify bottlenecks and parallel performance issues (such as load imbalance).

6.1 Optimisation summary

In summary, to get the best performance from your code:

  1. Select the right (parallel) algorithm for your problem. If you do not do this then no amount of optimisation will give you the best performance.
  2. Use the compiler optimisation flags (and use pointers sparingly in your code).
  3. Use the optimised numerical libraries supplied by Cray rather than coding the routines yourself.
  4. Eliminate any load-imbalance in your code (CrayPAT can help identify load-balance issues). If you have load-imbalance then your code will never scale up to large core counts.

6.2 Serial (single-core) optimisation

6.2.1 Compiler optimisation flags

One of the easiest optimisations to perform is to use the correct compiler flags. This technique is extremely simple as it does not require you to modify your source code, although alterations to your source code may allow compiler flags to have more beneficial effects. It is often worth taking the time to try a number of optimisation flag combinations to see what effect they have on the performance of your code. In addition, many of the compilers will provide information on what optimisations they are performing and, more usefully, what optimisations they are not performing and why. The flags needed to enable this information are indicated below.

Typical optimisations that can be performed by the compiler include:

Loop optimisation
such as vectorisation and unrolling.
Inlining
replacing a call to a function with the actual function source code.
Local logical block optimisations
such as scheduling and algebraic identity removal.
Global optimisations
such as constant propagations, dead store eliminations (still within a single source code file).
Inter-procedural analyses
try to optimise across subroutine/function boundary calls (can span multiple source code files).

The compiler-specific documentation and man pages contain more information about which optimisations particular flags will enable/disable.

When using the more aggressive optimisation options it is important to be aware that the results produced by your code may change, for example through a loss of precision. Some optimisation options allow the compiler to change the order of execution and the way in which arithmetic computations are performed. Always test your code after enabling aggressive optimisations to ensure that it still produces the correct result.

Many compiler suites allow pragmas or directives to be placed in the source code to give more information on whether or not (or even how) particular sections of code should be optimised. These can be useful, particularly for restricting optimisation in sections of code where the order of execution is critical. For example, the Portland Group (PGI) compiler provides directives to vectorise individual loops, perform memory prefetching, and select an optimisation level for a code section.

Cray Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive). The default is -O2 but most codes should benefit from increasing this to -O3.

To enable information on successful optimisations use the -Omsgs flag and to enable information on failed optimisations add the -Onegmsgs flag.

GNU Compiler Suite

The -O1, -O2 and -O3 flags instruct the compiler to attempt various levels of optimisation (with -O1 being the least aggressive and -O3 being the most aggressive).

The option -ftree-vectorizer-verbose=N will generate information about attempted loop vectorisations.

PGI Compiler Suite

The most useful set of optimisation flags for most codes will be: -fast -Mipa=fast. Other useful optimisation flags are -O3, -Mpfi, -Mpfo, -Minline, -Munroll and -Mvect.

To enable information on successful optimisations use the -Minfo flag and to enable information on failed optimisations add the -Mneginfo flag.

6.2.2 Using Libraries

Another easy way to boost the serial performance of your code is to use the optimised numerical libraries provided on the Cray XE system. More information on the libraries available on the system can be found in the section: Available Numerical Libraries.

6.2.3 Writing Optimal Serial Code

The speed of computation is determined by the efficiency of your algorithm (essentially the number of operations required to complete the calculation) and how well the compiled executable can exploit the Opteron architecture.

When actually writing your code, the largest single effect you can have on performance is by selecting the appropriate algorithm for the problem you are studying. The algorithm you choose depends on many things, which may include such considerations as:

Precision
Do you need to use double precision floating point numbers? If not, single or mixed-precision algorithms can run up to twice as fast as the double precision versions.
Problem size
What are the scaling properties of your algorithm? Would a different approach allow you to treat larger problems more efficiently?
Complexity
Although a particular algorithm may theoretically have the best scaling properties, is it so complex that this benefit is lost during coding?

Often algorithm selection is non-trivial and a good proportion of code benchmarking and profiling is needed to elucidate the best choice.

Once you have selected the best algorithms for your code you should endeavour to write your code in a way that allows the compiler to exploit the Opteron processor architecture as efficiently as possible.

The first rule is that if your code segment can be replaced by an optimised library call then you should do this (see Available Numerical Libraries). If your code segment does not have an equivalent in one of the standard optimised numerical libraries then you should try to use code constructs that will expose instruction-level parallelism or vectorisation (using the SSE/AVX/FMA4 instructions) to the compiler while avoiding simple optimisations that the compiler can perform easily. For floating-point intensive kernels the following general advice applies:

  • Avoid the use of pointers - these limit the optimisation that the compiler can perform.
  • Avoid using function calls, branching statements and goto statements wherever possible.
  • Only loops of stride 1 are amenable to vectorisation.
  • For nested loops, the innermost loop should be the longest and have a stride of 1.
  • Loops with a low number of iterations and/or little computation should be unrolled.

6.2.4 Cache Optimisation

Main memory access on systems such as Cray XE machines is usually around two orders of magnitude slower than performing a single floating-point operation. One solution used in the Opteron architecture to mitigate this is a hierarchy of smaller, faster memory spaces on the processor known as caches. This solution works because there is often a high chance of a particular address from memory being needed again within a short interval, or of an address from the same vicinity of memory being needed at the same time. This suggests that we can improve the performance of our code by accessing data in memory in a way that allows the cache hierarchy to be used as efficiently as possible.

Cache optimisation can be a very complex subject but we will try to provide a few general principles that can be applied to your codes that should help improve cache efficiency. The CrayPAT tool introduced in the Performance Analysis section can be used to monitor the cache efficiency of your code through the use of hardware counters.

Effectively, in programming for cache efficiency we are seeking to provide additional locality in our code. Here, locality refers to both spatial locality (using data located in blocks of consecutive memory addresses) and temporal locality (using the same address multiple times in a short period of time).

  • Spatial locality can be improved by looping over data (in the innermost loop of nested loops) using a stride of 1 (or, in Fortran, by using array syntax).
  • Temporal locality can be improved by using short loops that do not contain function calls or branching statements.

There are two other ways in which the cache technology can have a detrimental effect on code performance.

Part of the way in which caches are able to achieve high performance is by mapping each memory address onto a fixed number of cache lines; this is known as n-way set associativity. This property of caches can seriously affect the performance of codes where two array variables involved in an operation map onto the same cache line, so that the cache line must be refilled twice for each instance of the operation. One way to minimise this effect is to avoid using powers of 2 for your array sizes (as cache and cache line sizes are always powers of 2) or, if you see this happening in your code, to pad the array with enough extra elements to stop it happening.

The other major effect on user codes comes in the form of so-called TLB misses. The TLB in question is the translation lookaside buffer, the mechanism that the cache/memory hierarchy uses to convert application addresses to physical memory addresses. If a mapping is not contained in the TLB then main memory must be accessed for further information, resulting in a large performance penalty. TLB misses most often occur in codes that loop through an array using a large stride.

The cache layout is detailed in the Architecture section above.

6.3 Parallel optimisation

Some of the most important advice from the serial optimisation section also applies for parallel optimisation, namely:

  • Choose the correct algorithm for your problem.
  • Use vendor-provided libraries wherever possible.

When programming in parallel you will also need to select the parallel programming model to use. As the Cray XE system is an MPP machine with distributed memory you have the following options:

  • Pure MPI - using just the MPI communications library.
  • Pure SHMEM - using just the SHMEM, single-sided communications library.
  • Pure PGAS - using one of the Partitioned Global Address Space (PGAS) implementations, such as Coarray Fortran (CAF) or Unified Parallel C (UPC).
  • Hybrid approach - using a combination of parallel programming models (most often MPI+OpenMP but MPI+CAF and MPI+SHMEM are also used).

The Cray XE interconnect architecture includes hardware support for single-sided communications. This means that SHMEM and PGAS approaches can run very efficiently and, if your algorithm is amenable to such an approach, are worth considering as an alternative to the more traditional pure MPI approach. A caveat here is that if your code makes heavy use of collective communications (for example, all-to-all or allreduce type operations) then you will find that the optimised MPI versions of these routines almost always outperform the equivalents coded using SHMEM or PGAS.

In addition, because Cray XE machines are constructed from quite powerful SMP building blocks (i.e. individual nodes with up to 32 cores), a hybrid programming approach using OpenMP for parallelism within a node and MPI for communications outwith a node will generally produce code with better scaling properties than a pure MPI approach.

6.3.1 Load-imbalance

None of the parallel optimisation advice here will allow your code to scale to larger numbers of cores if your code has a large amount of load-imbalance.

Load-imbalance in parallel algorithms is where different parallel tasks (or threads) have significantly different amounts of computational work to perform. This, in turn, leads to some tasks (or threads) sitting idle at synchronisation points while waiting for other tasks to complete their block of work. This can lead to a large amount of inefficiency in the program and can seriously inhibit good scaling behaviour.

Before optimising the parallel performance of your code it is always worth profiling (see the Profiling section) to try and identify the level of load-imbalance in your code; CrayPAT provides excellent tools for this. If you find a large amount of load-imbalance then you should eliminate this as much as possible before proceeding. Note that load-imbalance may only become apparent once you start using the code on higher numbers of cores.

Eliminating load-imbalance can involve changing the algorithm you are using and/or changing the parallel decomposition of your problem. Generally, this issue is very code specific.

6.3.2 MPI Optimisation

The majority of parallel, scientific software still uses the MPI library as the main way to implement parallelism in the code, so much effort has been put in by Cray software engineers to optimise the MPI performance on Cray XE systems. You should make use of this by using high-level MPI routines for parallel operations wherever possible. For example, you should almost always use MPI collective calls rather than writing your own versions using lower-level MPI sends and receives.

When writing MPI (or hybrid MPI+X) code you should:

  • overlap communication and computation by using non-blocking operations wherever possible;
  • pre-post receives before the matching send operation is called to save memory copies and MPI buffer management overheads;
  • send few large messages rather than many small messages to minimise latency costs;
  • use collective communication routines as little as possible;
  • avoid the use of mpi_sendrecv as this is an extremely slow operation unless the two MPI tasks involved are perfectly synchronised.

Some useful MPI environment variables that can be used to tune the performance of your application are:

MPICH_ENV_DISPLAY
set to display the current environment settings when an MPI program is executed.
MPICH_FAST_MEMCPY
use an optimised memory copy function in all MPI routines.
MPICH_MAX_SHORT_MSG_SIZE
tune the use of the eager messaging protocol which tries to minimise the use of the MPI system buffer. The default on Cray XE systems is usually 128000 bytes. Increasing/decreasing this value may improve performance.
MPICH_COLL_OPT_ON
can give better performance for MPI_Allreduce and MPI_Barrier for large numbers of cores.
MPICH_UNEX_BUFFER_SIZE
increases the buffer size for messages that are received before the receive has been posted (default is 60MB). Increasing this may improve performance if you have a large number of such messages, although it is better to alter the code to pre-post receives if possible.

Use "man intro_mpi" on the machine to show a full list of available options.

6.3.3 Mapping tasks/threads onto cores

The way in which your parallel tasks/threads are mapped onto the cores of the Cray XE compute nodes can have a large effect on performance. Some options you may want to consider are:

  • When underpopulating a compute node with parallel tasks it can often be beneficial to ensure that the parallel tasks are evenly spread across NUMA regions using the -S option to aprun (see below). This has the potential to optimise the memory bandwidth available to each core and to free up the additional cores for use by the multithreaded version of Cray's LibSci library. To do this, set the OMP_NUM_THREADS environment variable to the number of spare cores available to each parallel task and use the "-d $OMP_NUM_THREADS" option to aprun (see below).
  • On the AMD Bulldozer architecture (Interlagos processors), if you use half the cores per node you may be able to get additional performance by ensuring that each core has exclusive access to the shared floating-point unit in each processing module. You can do this by specifying the "-d 2" option to aprun (see Example 3 below).

The aprun command which launches parallel jobs onto Cray XE compute nodes has a range of options for specifying how parallel tasks and threads are mapped onto the actual cores on a node. Some of the most important options are:

-n parallel_tasks
Total number of parallel tasks (not including threads). Default is 1.
-N parallel_tasks_per_node
Number of parallel tasks (not including threads) per node. Default is the number of cores in a node.
-d threads_per_parallel_task
Number of threads per parallel task. For OpenMP codes this will usually be equal to $OMP_NUM_THREADS. Default is 1. This option can also be used to specify a stride between parallel tasks when not using threads (useful on the Interlagos processors for using one core per module).
-S parallel_tasks_per_numa
Number of parallel tasks to assign to each NUMA region on the node. There are 4 NUMA regions per XE compute node. Default is 8.

Some examples should help to illustrate the various options. In all the examples we assume we are running on Cray XE compute nodes that have 32 cores per node arranged into 4 NUMA regions of 8 cores each (Interlagos processors).

Example 1:

Pure MPI job using 1024 MPI tasks (-n option) with 32 tasks per node (-N option):

aprun -n 1024 -N 32 my_app.x

This is analogous to the behaviour of mpiexec on Linux clusters.

Example 2:

Hybrid MPI/OpenMP job using 512 MPI tasks (-n option) with 8 OpenMP threads per MPI task (-d option), 4096 cores in total. There will be 4 MPI tasks per node (-N option) and the 8 OpenMP threads are placed such that the threads associated with each MPI task are assigned to the same NUMA region (1 MPI task per NUMA region, -S option):

aprun -n 512 -N 4 -d 8 -S 1 my_app.x

Example 3:

Pure MPI job using 1024 MPI tasks (-n option) with 16 tasks per node (half-populated, -N option) with one task per Bulldozer module (one task every second core, -d option):

aprun -n 1024 -N 16 -d 2 my_app.x

Further information on job placement can be found in the Cray document:

or by typing:

man aprun

when logged on to HECToR.

6.4 Advanced OpenMP usage

On Cray XE systems, when using the GNU compiler suite, the location of the thread that initialises the data can determine the location of the data. This means that if you allocate your data in the serial portion of the code then the location of the data will be on the NUMA region associated with thread 0. This behaviour can have implications for performance in the parallel regions of the code if a thread from a different NUMA region then tries to access that data. If you are using the Cray or PGI compiler suites then there is no guarantee of where shared data will be located if your OpenMP code spans multiple NUMA regions. We always recommend that OpenMP code does not span multiple NUMA regions on Cray XE systems. See below for recommended task/thread configurations.

You can overcome this limitation, when using the GNU compiler suite, by initialising your data in parallel (within a parallel region) or, for any compiler suite, by not using OpenMP parallel regions that span multiple NUMA regions on a node.

In general, it has been found that it can be difficult to gain any parallel performance when using OpenMP parallel regions that span multiple NUMA regions on a Cray XE compute node. For this reason, you will generally find that it is best to use one of the following task/thread layouts if your code contains OpenMP.

MPI tasks per NUMA region   Threads per MPI task   aprun syntax
1                           8                      aprun -n ... -N 4 -S 1 -d 8 ...
2                           4                      aprun -n ... -N 8 -S 2 -d 4 ...
4                           2                      aprun -n ... -N 16 -S 4 -d 2 ...

There is a known issue with OpenMP thread migration when using the GNU programming environment to compile OpenMP code with multiple parallel regions. When a parallel region is finished and a new parallel region begins all of the threads become assigned to core 0 leading to extremely poor performance. To prevent this happening you should add the "-cc none" option to aprun.

6.4.1 Environment variables

The following are the most important OpenMP environment variables:

OMP_NUM_THREADS=number_of_threads
Sets the maximum number of OpenMP threads available to each parallel task.
OMP_NESTED=true
Enable nested OpenMP parallel regions. Note that this functionality is currently only supported by the Cray and GNU compilers.
OMP_SCHEDULE=policy
Determines how iterations of loops are scheduled.
OMP_STACKSIZE=size
Specifies the size of the stack for threads created.
OMP_WAIT_POLICY=policy
Controls the desired behavior of waiting threads.

A more complete list of OpenMP environment variables can be found at:

6.5 Memory optimisation

Although the dynamic memory allocation procedures in modern programming languages offer a large amount of convenience, the allocation and deallocation functions are time-consuming operations. For this reason they should be avoided in subroutines/functions that are frequently called.

The aprun option -m size[h|hs] specifies the per-PE required Resident Set Size (RSS) memory size in megabytes. (K, M, and G suffixes, case insensitive, are supported). If you do not include the -m option, the default amount of memory available to each task equals the minimum value of (compute node memory size) / (number of cores) calculated for each compute node.

6.5.1 Memory affinity

Please see the discussion of memory affinity in the OpenMP section.

6.5.2 Memory allocation (malloc) tuning

The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. Use the aprun option -ss to specify strict memory containment per NUMA node.

Linux also provides some environment variables to control how malloc behaves. For example, MALLOC_TRIM_THRESHOLD_ sets the amount of free space that must exist at the top of the heap after a free() before malloc will return the memory to the OS. Returning memory to the OS is costly. The default setting of 128 KBytes is much too low for a node with 32 GBytes of memory and one application. Setting it higher might improve performance for some applications.

6.5.3 Using huge pages

Huge pages are virtual memory pages that are larger than the default 4KB page size. They can improve the memory performance for codes that have common access patterns across large datasets. Huge pages can sometimes provide better performance by reducing the number of TLB misses and by enforcing larger sequential physical memory inside each page.

The Cray XE system is set up to have huge pages available by default. The modules craype-hugepages2m and craype-hugepages8m set the necessary link options and environment variables to enable the usage of 2MB or 8MB huge pages respectively. Also, the AMD Opteron supports multiple page sizes (128KB, 512KB, 2MB, 8MB, 16MB, 64MB). The default huge page size is 2 Mbytes. You will also need to load the appropriate craype-hugepages module at runtime (in your job submission script) for huge pages to work.

If you know the memory requirements of your application in advance you should set the -m option to aprun when you launch your job to preallocate the appropriate number of huge pages. This improves performance by reducing operating system overhead. The syntax is:

-m sizeh
request size Mbytes per PE (advisory)
-m sizehs
request size Mbytes per PE (required)

6.6 I/O optimisation

The HECToR CSE team have published some information on how to optimise the I/O on the HECToR system.