The HECToR Service is now closed and has been superseded by ARCHER.

Performance enhancements for the GLOMAP aerosol model Part 2

The overall aim of this Distributed Computational Science and Engineering (dCSE) project was to enable the aerosol simulation code GLOMAP Mode MPI to make better use of multi-core architectures, building on the results of the earlier successful dCSE project. This work started while HECToR was configured as a Cray XT4h (Phase 2a, with four cores per node) and was still active when the system was upgraded to a Cray XT6 (Phase 2b, with 24 cores per node but without the Gemini interconnect).

The project re-factors GLOMAP Mode MPI and parallelises it via a hybrid OpenMP-MPI method, implementing mixed-mode OpenMP parallelism in the main code regions (a minimal sketch of this mixed-mode structure follows the list below). This is driven by the need to exploit the increasing number of cores available per chip and the decreasing amount of memory available per core. The work will enable GLOMAP Mode MPI to make more efficient use of multi-core architectures and achieve reasonable scalability on HECToR for a representative 128x64x31 grid simulation (T42). The result will be demonstrated on HECToR for up to 128 MPI tasks through the following steps:

  • Test and summarise the potential performance gains achievable for a hybrid OpenMP-MPI GLOMAP Mode MPI over the pure MPI version on Phase 2a and Phase 2b of HECToR.
  • Develop the hybrid OpenMP-MPI GLOMAP Mode MPI code and validate for Phase 2a and Phase 2b of HECToR.
  • Perform further (quick) optimisations for the hybrid OpenMP-MPI GLOMAP Mode MPI and validate.
  • Benchmark performance of the new hybrid OpenMP-MPI GLOMAP Mode MPI on HECToR Phase 2b.
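
A minimal sketch of the mixed-mode structure referred to above is given here. It is purely illustrative and is not taken from the GLOMAP Mode MPI source; the program name and the use of MPI_THREAD_FUNNELED (only the master thread of each task making MPI calls) are assumptions about how OpenMP is commonly layered on top of an existing MPI domain decomposition.

    ! Illustrative mixed-mode skeleton (not GLOMAP source): each MPI task
    ! owns a patch of the domain and spawns OpenMP threads for the work
    ! inside that patch.
    program hybrid_skeleton
      use mpi
      use omp_lib
      implicit none
      integer :: ierr, provided, rank, ntasks, nthreads

      ! MPI_THREAD_FUNNELED: only the master thread makes MPI calls.
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

      nthreads = 1
    !$OMP PARALLEL
    !$OMP MASTER
      nthreads = omp_get_num_threads()
    !$OMP END MASTER
    !$OMP END PARALLEL

      if (rank == 0) write(*,*) 'MPI tasks:', ntasks, &
                                ' OpenMP threads per task:', nthreads

      ! ... per-patch advection/chemistry would be called here, with
      !     OMP PARALLEL DO applied inside each MPI task's patch ...

      call MPI_Finalize(ierr)
    end program hybrid_skeleton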

At the start of this project GLOMAP Mode MPI was rarely used with more than 64 MPI tasks; as a result of this work, production runs are now regularly performed with 64 MPI tasks and at least 2 OpenMP threads per task. Furthermore, a speedup of up to 2.5 times can now be achieved for the T42 grid. The individual achievements of the project are summarised below:

  • Five regions of the code which would benefit most from OpenMP were identified (ADVX2, ADVY2, ADVZ2, CONSOM and CHIMIE). All five have now been parallelised with OMP PARALLEL DO directives (an illustrative loop sketch follows this list).
  • OpenMP only helps with the normal operation of GLOMAP Mode MPI if the number of threads is chosen to match the number of latitudes in the surface patches. The T42 case is limited by the decomposition method to a maximum of 128 MPI tasks (typical usage is 32 MPI tasks), at which point there are only 2 latitudes per patch. Overall, the new hybrid version of GLOMAP Mode MPI makes it possible for researchers to use more cores than was previously economical, e.g. 128 cores for a simulation with 32 MPI tasks.
  • The new hybrid OpenMP-MPI implementation of GLOMAP Mode MPI has been successful because cores that were previously idle on a node can now be loaded with work from OpenMP threads, thereby reducing the time per iteration of the simulation. The practice of distributing the MPI tasks sparsely has arisen from the increase in resolution and complexity of simulations. The extra OpenMP threads do require extra memory, but not as much as replicating the whole code, as would be needed with a dense distribution of MPI tasks.
  • Using a sparse placement of MPI tasks allows the code to run more efficiently, and recommendations for determining an optimal mode of operation have been made to the HECToR users of GLOMAP Mode MPI.
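
The following sketch illustrates the kind of OMP PARALLEL DO placement described in the first bullet above, and why the thread count is tied to the number of latitudes per patch. It is not GLOMAP source: the routine name advect_patch, the array names and the dimensions nlon_patch, nlat_patch and nlev are all hypothetical.

    ! Illustrative only: an OMP PARALLEL DO over the latitudes owned by
    ! one MPI task's patch, in the style applied to ADVX2, ADVY2, ADVZ2,
    ! CONSOM and CHIMIE. With the default static schedule, any thread
    ! beyond nlat_patch (e.g. a third thread when a patch holds only
    ! 2 latitudes) receives no iterations, hence the advice to match
    ! the thread count to the number of latitudes per patch.
    subroutine advect_patch(field, tend, nlon_patch, nlat_patch, nlev)
      implicit none
      integer, intent(in)    :: nlon_patch, nlat_patch, nlev
      real,    intent(in)    :: tend(nlon_patch, nlat_patch, nlev)
      real,    intent(inout) :: field(nlon_patch, nlat_patch, nlev)
      integer :: i, j, k

    !$OMP PARALLEL DO PRIVATE(i, k)
      do j = 1, nlat_patch
         do k = 1, nlev
            do i = 1, nlon_patch
               field(i, j, k) = field(i, j, k) + tend(i, j, k)
            end do
         end do
      end do
    !$OMP END PARALLEL DO
    end subroutine advect_patch

In practice the sparse placement mentioned above would be arranged at job-launch time, for example by under-populating nodes with MPI tasks via aprun's -N (tasks per node) and -d (threads per task) options so that each task's OpenMP threads have otherwise idle cores to run on; the exact settings depend on the HECToR phase and the chosen task count.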

Please see the PDF or HTML version of the report which summarises this work.