Good Practice Guide

HECToR Phase 2B (24-core)

Introduction
Who will benefit?
What can I do to prepare?
Architecture details
Batch Submission
CSE Assistance
References
Appendix: some micro-benchmarking results

Introduction

As of February 2011 the main machine of the HECToR service in phase 2b is a Cray XE6. The XE6 consists of 1856 24-core nodes, giving 44,544 cores in total, increasing the theoretical peak performance and memory footprint from around 200TF and 45TB in phase 2a to over 370TF and 59TB in phase 2b. The phase2a XT4 is still currently operational, although this is no longer the contractual service machine.

This guide explains the key differences in software and architecture which will impact the operation and performance of codes on the XE6 part of the HECToR service.

Who will benefit?

The section on Architecture below highlights the key changes that will have an effect on performance. Phase 2b offers nearly double the number of cores compared to phase 2a and is therefore another upgrade that will favour codes that scale well. The microarchitecture of the system at the level of a single core is very similar to that of the Barcelona processor used in phase 2a and therefore the same emphasis should be put on using the 3-level cache hierarchy effectively and ensuring the use of packed SSE instructions (for more details on these aspects, see the phase 2a guide). Codes that do not use cache and packed SSE instructions effectively will require some work in order to get the most out of the new system and are likely to perform poorly if performance issues are not addressed.

The per-core memory statistics show that there will be a decrease in memory size (8GB shared between 6 cores for the XE6 vs 8GB shared between 4 cores for the XT4), but an increase in memory bandwidth (85.3GB/s across a node vs 12.3GB/s across a node), which means that memory-access-bound codes may show some benefit, but there may be a need to run larger jobs in order to fit the same problems in memory. The new architecture also means that much more shared memory is available, 32GB within a node, which will provide codes that were previously prevented from running on HECToR (due to the previously much smaller available memory-per-node) with the opportunity to use the XT part of the service for the first time, and also benefit codes that are able to make effective use of more shared memory.

The new phase 2b system will be less forgiving of codes that perform adequately on the phase 2a system. For example, codes that scale only to the point that enough processes are used to distribute 2GB of memory per core in phase 2a are likely to perform very poorly when scaled to higher numbers of processes in order to distribute 1.3GB of memory per core in phase 2b.

What can I do to prepare?

Set aside some time to benchmark your code on the new system and compare the results with those from the phase 2a quad core setup. See the Good Practice Guide for Performance Measurement for tips on how to benchmark your code.

The upgrade is also a good time to check that you are using libraries wherever possible. This is important in order to make your life as a developer easier and take advantage of the dedicated effort put into solving a specific problem efficiently. Make sure you are using the most up to date versions. For example, FFTW 3 is designed to make use of packed SSE instructions, whereas FFTW 2 is not, so it may be worth the effort to switch to the new interface. Remember too that vendor-tuned libraries (ACML and libsci) are likely to provide the most efficient routines in most cases.

One of the most important aspects of achieving high performance with the AMD Magny Cours processors, which make up the XE6, is utilising packed SSE instructions, and it is the compiler's job to generate these for you. In order to make sure that packed SSE instructions are being used for the key computational regions of your code use the compiler options -Minfo for the PGI compiler (-Mneginfo to see what is not vectorized, which may be more important) and -LNO:simd_verbose=ON for the Pathscale compiler. See the Good Practice Guide for Serial Code Optimisation for information about compiler optimisation flags.

If compiler optimisation fails to improve performance significantly, the next step is code optimisation. Use CrayPAT to profile your code and understand the performance bottlenecks. It will be more important to use techniques for effective cache management such as loop reordering, cache blocking and prefetching. Also, remember that single precision floating point operations can be carried out at twice the rate of those in double precision, so use double precision reals only where required. See the Good Practice Guide for Serial Code Optimisation for more information about code optimisation.

Hybrid MPI-OpenMP programming may be an option to consider if each process in your code currently contains significant computational sections (e.g. loops) that may be more finely parallelised. Hybrid codes make use of MPI for inter-processor communication and OpenMP for intra-process communication, so that some or all cores within a node use multithreaded shared memory parallelism. The advantage of this approach is that the threads within a process can share much of the same data structures, thus providing more scope for cache optimisation and also reducing off-node traffic.

The natural way to configure a hybrid job under the architecture of an XE6 node is to use 4 MPI processes per node, and within each node use 6 OpenMP threads running on the same die, thus giving uniform memory access to the threads in a process. It is possible to use all 24 cores within a node for shared memory OpenMP threads, but this may not be desirable, depending on how your code accesses memory. See the architecture, batch script and appendix sections for more information on these aspects of running jobs.

Finally, the need to scale to higher numbers of processes means that if your code performs well at the node level, the limiting factor will become sequential activities such as IO. Such activities may have had a tolerable impact on performance in the past, but in order to scale further, phase 2b may be the time to parallelise these regions too. For the specific case of IO, see the parallel IO section of the Good Practice Guide for IO.

Architecture details

A single node

An XE6 node consists of 24 cores sharing a total of 32 GB of memory accessible through a NUMA (Non-Uniform Memory Architecture) design. The 24 cores are packaged as 2 AMD Opteron 6172 2.1GHz processors, code-named "Magny-Cours". It is necessary to look at the architecture of a Magny-Cours processor in order to understand the NUMA nature of an XE6 node.

A Magny-Cours processor is an example of a Directly Connected Multi Chip Module processor. Multi Chip Module (MCM) means that the processor is essentially two hex-core dies (see definition 2) connected together within the same socket.

Each die has direct access to 8GB memory through 2 memory channels. Access to memory connected to other dies must be transferred through a Hyper Transport (HT) link.

Each die has four HT links, which is each 16-bits in width, and one of the links is split into two 8-bit connections.

One and a half links are connected to the neighbouring die within the Magny-Cours processor, giving a 24-bit HT version 3.1 connection within the processor (38.4GB/s bidirectional bandwidth).
Within an XE6 node a whole HT 3.1 link connects a die to its counterpart in the neighbouring processor (i.e. die 0 in processor 0 has a 16-bit connection to die 0 in processor 1, giving 25.6GB/s bidirectional bandwidth).
Half a HT 3.1 link connects a die to the other, "diagonally opposite", die in the neighbouring processor (i.e. die 0 in processor 0 has an 8-bit connection to die 1 in processor 1, giving 12.8GB/s bidirectional bandwidth).
The remaining link is unused in 3 of the 4 dies and in the fourth it is a 16-bit HT version 1 link connected to the interconnect for inter-node communication and IO, with a bidirectional bandwidth of 6.4GB/s. That only one link is directly connected to the interconnect gives interesting results for inter-node communication.

Under this architecture, even though each die is directly connected we have an hierarchy of HT connections and it is this hierarchy which characterises the NUMA nature of an XE6 node: for a process running on a given core, accessing memory local to the die belonging to the core is faster than accessing the memory connected to the neighbouring die, which is faster than accessing the memory connected to the counterpart die in the neighbouring processor, which is again faster than accessing the memory connected to the diagonally opposite die in the neighbouring processor. The results of a benchmark test demonstrate the performance differences.

Within a single hex-core die there is 6MB of shared L3 cache, 512KB of dedicated L2 cache, 64KB of dedicated L1 data cache and 64KB of dedicated L1 instruction cache. Not all of the 6MB of L3 cache will be available to processes running on the die because around 1MB will be used to support a mechanism called HT Assist for reducing the amount of Hyper Transport traffic between dies due to maintaining cache coherency. Instead of constantly probing the caches of other dies, which means communicating via the HT links, a "probe filter" is maintained in L3 cache, thus effectively improving memory transfer bandwidth and latency between dies.

5MB L3 cache shared between 6 cores represents an increase in the amount of cache available per core compared to the 2MB L3 cache shared between 4 cores in phase 2a. L1 and L2 cache sizes remain the same. 32GB shared between 24 cores is a decrease in the amount of main memory per core compared to the 8GB per 4 cores in phase 2a. However, should users wish to do so, there is the opportunity of running sparsely populated jobs so that more memory is available per core (up to 32GB if only one core per node is used), which offers more flexibility although at a higher AU cost.

Figure 1: A logical view of the architecture of an XE6 node. Note that there is a connection and a half between dies in the same socket/Magny-Cours processor.

The Gemini Interconnect

A new interconnect (codenamed 'Gemini') forms a key part of the XE6 system. Compared to the previous SeaStar interconnect this improves the performance of communication-bound codes and also directly supports in hardware Partitioned Global Address Space (PGAS) languages such as Coarray Fortran and UPC, as well as asynchronous one-sided messaging approaches such as Shmem and one-sided MPI. Prior to the introduction of Gemini the system was an XT6 in which the SeaStar interconnect represented a serious performance bottleneck for some codes, making optimisations such as using shared memory segments to coalesce messages into a single off-node communication (as discussed in the Parallel Optimisation good practice guide) and mixed-mode programming crucial for communications-bound codes; with the introduction of Gemini these optimisations are less crucial, but nonetheless are likely to remain relevant for future architectures.

Batch Submission

There are some important differences between submitting jobs for the XE6 and XT4 that you will need to understand. If the batch system is used incorrectly, then you may be using more AUs than you realise because a job will always be charged for an entire compute node of 24 cores even if it actually uses fewer than 24 cores to execute, since compute nodes are not shared between different jobs.

Pure MPI jobs

Here is an example batch script:

  #!/bin/bash --login
  #PBS -N My_job
  #PBS -l mppwidth=48
  #PBS -l mppnppn=24
  #PBS -l walltime=00:20:00
  #PBS -j oe
  #PBS -A budget
  
  cd $PBS_O_WORKDIR

  aprun -n 32 -N 16 ./my_mpi_executable.x arg1 arg2 > stdout

Note that we advise against the previous recommended practice of using awk within a batch script to set the values of variables NPROC and NTASKS to match the PBS requested values of mppwidth and mppnppn. The reason for this is because it is important to make a clear distinction between the resources requested (i.e. what is specified in the PBS directives at the top of your job script) and those actually used (i.e. the values passed to the aprun flags -n, -N, -d, -S). In the above the job will acutally use 32 cores (aprun -n 32) that span 2 compute nodes (making use of 16 cores in each compute node for load-balancing), hence mppwidth=2*24=48 requests 48 cores.

The following points need bearing in mind:

Your job will not share nodes with any other job, thus you will be charged for an entire node of 24 cores even if you actually use fewer than 24.
The #PBS directives that appear at the top of a job script tell the job submission system how many compute nodes your job will require.
The options given to the aprun command determines the resources your job actually uses.

In order to make things clear, we recommend that mppwidth is always the least upper bound number of processes you intend to use that is a multiple of 24, that mppnppn is always set to 24, and that mppdepth is never used.

Hybrid MPI/OpenMP jobs

Here is an example batch script:

  #!/bin/bash --login
  #PBS -N My_job
  #PBS -l mppwidth=96
  #PBS -l mppnppn=24
  #PBS -l walltime=00:20:00
  #PBS -j oe
  #PBS -A budget
  
  cd $PBS_O_WORKDIR
  export OMP_NUM_THREADS=6
  # This variable should be set to allow the ALPS and the OS scheduler to 
  # assign the task affinity rather than the compiler. If this is not set
  # they you may see a large negative effect on performance.
  export PSC_OMP_AFFINITY=FALSE

  aprun -n 15 -N 4 -d 6 -S 1 ./my_mpi_executable.x arg1 arg2 > stdout

This last example gives the "natural" configuration for a hybrid job discussed above: on each node we are placing 4 MPI processes, one on each die, and each process may fork 6 threads to run on the same die. This example uses ceiling(15/4) = 4 compute nodes. Note that the -S aprun flag is a new one, which defines how many processes to put on each die in order to allow an even distribution for sparsely populated jobs (i.e. jobs in which fewer than 24 processes are used per node). By default processes are placed sequentially within a node, that is die 0 is filled first (top left in Figure 1), then die 1 (bottom left) then die 2 (top right) and then die 3 (bottom right). Note, if you are using a multi-threaded version of libsci 10.4.1, set GOTO_NUM_THREADS rather than OMP_NUM_THREADS in your job script.

We recommend users base their job scripts on either of the above examples, unless there is a good reason not to. For example, running jobs sparsely populated within a node in order to utilise more memory per process, in which case you should consult the user guide and aprun man page or contact the helpdesk for assistance.

CSE Assistance

If you would like to explore how to take advantage of the potential performance increase available in phase 2b, or would like dedicated help and advice on benchmarking, profiling or improving your code, please get in touch with the NAG CSE team: hector-cse@nag.co.uk. The CSE team run regular training courses on the topics discussed above, and more (see the References section). Check the schedule for current courses. If you would like to attend a course not currently scheduled, please contact us.

References

The previous good practice guide for phase 2a puts the upgrade into context.
Cray's XT6 brochure (pdf)
Cray's XE6 brochure (pdf)
AMD's Magny Cours zone
An AMD page on Directly Connected architectures
A review of the Magny Cours architecture
If you would like to learn more about OpenMP, optimising codes for multicore architectures, profiling and optimisation, or parallel IO the CSE service runs training courses (follow links). Please check the training schedule and get in touch if you would like to arrange training at your institution free of charge.

Appendix: some micro-benchmarking results