Good Practice Guide

Parallel Optimisation

Introduction

Here we provide an introduction to optimisation techniques that may improve parallel performance and scaling on HECToR. It assumes that the reader has some experience of parallel programming including basic MPI and OpenMP. (The CSE team run regular training courses, including MPI and OpenMP courses.)

Scaling is a measurement of the ability for a parallel code to use increasing numbers of cores efficiently. A scalable application is one that, when the number of processors is increased, performs better by a factor which justifies the additional resource employed. Making a parallel application scale to many thousands of processes requires not only careful attention to the communication, data and work distribution but also to the choice of the algorithms to use. Since the choice of algorithm is too broad a subject and very particular to application domain to include in this brief guide we concentrate on general good practices towards parallel optimisation on HECToR.

MPI /OpenMP Tips

Some general tips when writing MPI applications:

Try to post receives before sends (this minimises the use of MPI system buffers and the associated overhead of managing the buffers).
Overlap communication and computation (use non-blocking operations where possible to minimise the time spent waiting for data from other processes).
Send large messages (this minimises communications latency, the cost associated with setting up communication).
Use collectives only where necessary (but if they are necessary use them; don't re-write them in terms of point-to-point comms, since collectives are often optimised in the library).

Shared memory optimisations

Multicore technology is now the default for scientific and numerical computing from workstations to the largest supercomputers - but your software won't achieve good performance unless it is optimised for multicore processors. Phase3 of the HECToR service is a Cray XE6, which is a multicore system with 32 cores per node. The 32 cores within a node have non-uniform access to 32GB of shared memory, which provides the opportunity to use so-called "mixed-mode" or "hybrid" parallelisation techniques, i.e. using OpenMP within nodes to exploit parallelism as well as MPI.

As shown in our phase 3 guide a Cray XE6 node consists of four 8-core dies with uniform (SMP) access to 8GB of local memory. Each of the four dies can also access the memory of each of the other three, but at a slower rate (this is why memory access across the whole node is non-uniform). Hence, the natural configuration for hybrid programming on the XE6 is to use 4 MPI processes per node, one for each die, and for each to spawn 8 threads. This ensures that threads are working on local data with faster access rates. In order to place your MPI processes on different dies you must use the -S aprun option. See the example job script given in the phase3 good practice guide.

Even if your code cannot/does not make use of OpenMP it may be efficient to spread your MPI processes among different dies within a node ("under populating a node") in order to reduce contention on resources such as memory and cache. This will leave a number of cores unused in each node, which could be utilised by linking your code against the threaded version of Cray's maths library, libsci. Thus, your application will exercise multithreaded BLAS or LAPACK routines when called.

In addition to under populating, it is also possible to optimise the communication of global data by sending a single message per node rather than however many processes per node are being used (in the worst case, 32). The idea is that each process in a node directly writes its data into a shared send buffer (i.e. an array that is accessible by all processes in a node), and only one process per node participates in the global operation on the shared data. Once data is received, each process can read its portion directly from a shared receive buffer.

This has been done successfully for some high profile codes on HECToR using UNIX System V shared memory segments. For example, see the report on how this has been done for CASTEP (PDF).

The dCSE project CASINO (http://www.hector.ac.uk/cse/distributedcse/reports/casino/ was redesigned by using hybrid MPI-OpenMP and System V shared memory techniques. The code with mixed mode parallelism speeds up the computation by a factor of 1.6-1.8 compared with pure MPI code.

If you would like help and advice applying any of the above techniques to your own code, please don't hesitate to get in touch with the HECToR CSE team: hector-cse@nag.co.uk.

References

Practical Parallel Programming. Gregory V. Wilson. MIT Press. 1995.
Designing and Building Parallel Programs. Ian Foster. Addison-Wesley. 1994.
M. Quinn, Parallel Programming in C with MPI and OpenMP. McGraw Hill, 2004.
MPI standard: Documents on MPI can be found at http://www.mpi-forum.org/docs/docs.html
Using MPI2, William Gropp, Ewing Lusk and Rajeev Thakur; MIT Press, 1999.
Parallel programming with MPI, Peter S. Pancheco. The complete set of C and Fortran example programs for this book are available at: http://www.cs.usfca.edu/mpi
OpenMP specification: Documents on OpenMP can be found at http://openmp.org/wp
Parallel Programming with OpenMP. R.Chandra, R.Menon, L.Dagum, D.Kohr

If you would like to learn more about MPI, OpenMP, optimising codes for multicore architectures, profiling and optimisation, or parallel IO the CSE service runs regular training courses. Please check the training schedule and get in touch if you would like to arrange training at your institution free of charge.