The HECToR Service is now closed and has been superseded by ARCHER.

GloMAP Pt1

This Distributed Computational Science and Engineering (dCSE) project was to improve the performance of the aerosol simulation code GLOMAP MODE MPI in order to make better use of HECToR. Improved utilisation of the Phase 1 and 2a hardware was achieved by: (i) re-factoring targeted areas of the code for more efficient memory access; (ii) implementing more efficient use of MPI for the communications; (iii) determining the most effective compiler options and applying them in the compilation scripts.

  • The overall performance gain from this dCSE work is a speedup of up to 12.4% for the T42 grid when decomposed over 32 MPI tasks. In this configuration the simulations typically require 2 minutes per simulated day, i.e. of the order of 12 hours per simulated year, so a 10-year simulation would cost about 120 hours x 32 cores of compute time (at 7.5 AUs per core-hour), i.e. 28800 AUs, and the saving could be as much as 3500 AUs per simulation. From the scientist's point of view, the simulation would be complete after roughly 105 hours rather than 120, allowing more time for analysis of the results.
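
The cost estimate can be reproduced directly from the figures quoted above; the snippet below is only a worked restatement of that arithmetic (the quoted 28800 AUs comes from rounding the run time to 12 hours per simulated year before multiplying up).

    #include <stdio.h>

    int main(void)
    {
        const double minutes_per_model_day = 2.0;           /* T42 grid, 32 MPI tasks */
        const double model_days            = 365.0 * 10.0;  /* a 10-year simulation */
        const int    cores                 = 32;
        const double aus_per_core_hour     = 7.5;
        const double speedup               = 0.124;         /* 12.4% reduction in run time */

        double hours      = minutes_per_model_day * model_days / 60.0; /* ~120 wall-clock hours */
        double core_hours = hours * cores;
        double aus        = core_hours * aus_per_core_hour;            /* ~29000 AUs */

        printf("cost: %.0f AUs, possible saving: %.0f AUs\n", aus, aus * speedup);
        printf("run time after speedup: %.0f hours\n", hours * (1.0 - speedup));
        return 0;
    }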

The individual achievements of the project are summarised below:

  • The compiler optimisation level was raised from the original setting to include vectorisation and some inlining. The “-fast” flag resulted in a reduction of ~3% in simulation time compared with using “-O3” alone.
  • Several areas of the code were modified because their loop ordering did not make the best use of the processor architecture. In some cases the ordering could not be changed, since the code works with the data in three ways: planes of latitude, planes of altitude, and, for the chemistry scheme, the three dimensions of grid points unwrapped into one-dimensional arrays. The chemistry scheme was changed to unwrap planes of latitude into one-dimensional arrays, resulting in a ~16% reduction in run time (the general idea is illustrated in the first sketch after this list).
  • There were several places where the communications were inefficient because of awkward strides through the data arrays as they were loaded into communication buffers; this was revealed by the MPI workload shown in the CrayPAT reports. The resulting reduction in run time is of the order of 9% (one standard way of handling strided communication data is shown in the second sketch after this list).
  • The file access model is master I/O: all data has to pass through “task 0”, which requires additional memory (unfortunately replicated on all nodes), creates a bottleneck when the other tasks send their data, and takes no advantage of the Lustre disk servers. A file handling system that uses MPI-I/O would help by reducing the amount of data re-distribution and by removing the requirement for arrays sized to the global dimensions of the simulation (a minimal sketch is given in the third example after this list).
  • The analysis has shown that the code runs faster with one MPI task per node. This is likely due to several factors: more memory is available to the calculation, both in shared cache and in RAM, and the node's single SeaStar network chip handles the traffic of only one MPI task. This observation has led on to a subsequent project to implement OpenMP, accelerating the simulation by enabling an MPI task to use the otherwise idle cores on the node if such a configuration is chosen (the final sketch after this list illustrates this hybrid arrangement).
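
The kind of change made for the chemistry scheme can be illustrated with a minimal sketch. This is not GLOMAP code: GLOMAP is Fortran and column-major, so the contiguous index is the first one there, whereas this C (row-major) sketch has it last; the array name, dimensions and trivial kernel are invented.

    #include <stddef.h>

    #define NLEV 60
    #define NLAT 64
    #define NLON 128

    /* Slow: the innermost loop strides through memory (stride = NLON doubles). */
    static void update_strided(double field[NLEV][NLAT][NLON])
    {
        for (size_t k = 0; k < NLEV; k++)
            for (size_t i = 0; i < NLON; i++)
                for (size_t j = 0; j < NLAT; j++)   /* non-contiguous inner loop */
                    field[k][j][i] *= 1.01;
    }

    /* Fast: one plane is unwrapped into a contiguous one-dimensional run, so
     * the inner loop walks consecutive memory locations (unit stride). */
    static void update_unwrapped(double field[NLEV][NLAT][NLON])
    {
        for (size_t k = 0; k < NLEV; k++) {
            double *plane = &field[k][0][0];        /* NLAT*NLON contiguous values */
            for (size_t n = 0; n < (size_t)NLAT * NLON; n++)
                plane[n] *= 1.01;
        }
    }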
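
One standard way of dealing with awkward strides in communication buffers, not necessarily the approach taken in the project, is to describe the strided section with an MPI derived datatype and let the library gather the data. The halo layout in the sketch below is invented.

    #include <mpi.h>

    #define NLAT 64
    #define NLON 128

    /* Send one longitude column (stride NLON through a row-major array) to a
     * neighbouring task without copying it into a temporary buffer first. */
    static void send_column(double field[NLAT][NLON], int col,
                            int dest, MPI_Comm comm)
    {
        MPI_Datatype column;
        MPI_Type_vector(NLAT, 1, NLON, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        MPI_Send(&field[0][col], 1, column, dest, /*tag=*/0, comm);

        MPI_Type_free(&column);
    }

Whether a derived datatype or a re-ordered packing loop is faster depends on the MPI implementation, so the ~9% gain measured in the project should not be read as a prediction for this sketch.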
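
A minimal sketch of the suggested MPI-I/O model: every task writes its own slice of a field at the appropriate offset, so no task needs arrays sized to the global grid and nothing is funnelled through task 0. The file name and data layout are hypothetical.

    #include <mpi.h>

    static void write_field(const double *local, int nlocal, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        MPI_File_open(comm, "glomap_field.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each task's block starts where the previous task's block ends. */
        MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);
        MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }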
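
Finally, the hybrid configuration explored in the follow-on project can be sketched as one MPI task per node with OpenMP threads occupying the remaining cores; the loop body below is a placeholder rather than a GLOMAP kernel.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* MPI_THREAD_FUNNELED: only the thread that initialised MPI makes MPI
         * calls, matching the usual "MPI between nodes, OpenMP within a node"
         * pattern. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { N = 1000000 };
        static double work[N];

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            work[i] = (double)i * 0.5;   /* placeholder for the real kernel */

        if (rank == 0)
            printf("running with %d OpenMP threads per MPI task\n",
                   omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }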

A report summarising this project is available as PDF or HTML.