The HECToR Service is now closed and has been superseded by ARCHER.

WRF code Optimisation for Meso-scale Process Studies (WOMPS)

This Distributed Computational Science and Engineering (dCSE) project investigated various aspects of the performance of the WRF model on a Cray XT4 (HECToR) with four cores per node and on a Cray XT5 with 12 cores per node (at CSCS). WRF is a regional- to global-scale simulation model intended for both research applications and operational weather forecasting. Building WRF in hybrid (MPI/OpenMP) mode was found to give the best absolute and parallel-scaling performance, and in fact proved essential in achieving the same performance on the 12-core nodes of the XT5 as was obtained on the quad-core nodes of the XT4.
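As an illustration of what the hybrid (MPI/OpenMP) mode amounts to in practice, the sketch below shows the usual pattern rather than WRF's own source: each MPI task owns a patch of the domain and spawns a team of OpenMP threads that share the work on it, for example one task per socket with a thread on each of its cores. The FUNNELED threading level and the printed diagnostic are choices made for this example.

    /* Minimal hybrid MPI/OpenMP skeleton (illustrative only, not WRF source).
     * Each MPI task owns a patch of the domain; its OpenMP threads then share
     * the work on that patch, e.g. one task per socket with a thread per core. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* FUNNELED is enough when only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d of %d is running %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

On a Cray XT such an executable would typically be launched with aprun, with OMP_NUM_THREADS and the task-placement options chosen so that, for example, one MPI task runs per socket and its threads fill the remaining cores of that socket.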

  • The main aims of the project were to: i) optimise the WRF model for HECToR, with particular attention to cache usage and compiler options; ii) recommend the best choice of domain decomposition; iii) recommend the optimum I/O configuration and report the findings; and iv) report on the effects of increasing the vertical resolution.

The individual findings and achievements of the project are summarised below:

  • Memory bandwidth remains a bottleneck even when nodes are populated with just one task per socket.
  • Tuning the OpenMP threads by further decomposing a PE’s patch of the model domain into tiles that are shared amongst the threads significantly improved the hit rate of the D2 cache (from around 20% to over 70%) and could improve WRF’s performance by ∼5% on 1024 PEs, rising to ∼20% on 64 PEs (see the tiling sketch after this list).
  • For I/O, attempts to improve the performance of the writes themselves proved unsuccessful, since they consist of a large number of small writes, one for each variable; it may be productive to experiment with the caching used in the netCDF library in order to tackle this. Of the WRF I/O options, the ‘I/O quilting’ functionality proved the most successful: it hides the time taken to write the data to disk by using dedicated ‘I/O server’ PEs. The remaining time-limiting aspect is then the gathering (via MPI_Gatherv) of data from the compute PEs to the I/O servers (a sketch of this step follows the list).
  • Experiments in which the vertical resolution of a model was increased indicated that the run-time is proportional to the number of vertical layers in the model grid.
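The tiling finding above refers to the following idea. The sketch is illustrative C rather than WRF code: the patch dimensions NX, NY and NZ, the field array and the step_patch routine are invented for the example. The point is that splitting a task's patch into more tiles than there are threads keeps each tile's working set small enough to stay resident in cache between loops; in WRF itself the number of tiles per patch is controlled by the numtiles namelist setting.

    /* Sketch of the tiling idea (not WRF code): a task's patch of NX*NY columns
     * is split into numtiles strips in the j (south-north) direction, and the
     * OpenMP threads take the strips in turn.  More tiles than threads means a
     * smaller working set per tile, and hence better cache reuse. */
    #include <omp.h>

    #define NX 80   /* patch size in i (west-east); illustrative value */
    #define NY 80   /* patch size in j (south-north)                   */
    #define NZ 40   /* number of vertical levels                       */

    static double field[NY][NX][NZ];     /* one 3-D model field on this patch */

    void step_patch(int numtiles)
    {
        /* Each loop iteration is one tile; OpenMP shares the tiles among the
         * threads rather than giving each thread one fixed block of the patch. */
        #pragma omp parallel for schedule(dynamic)
        for (int t = 0; t < numtiles; t++) {
            int jts = (t * NY) / numtiles;          /* tile start in j           */
            int jte = ((t + 1) * NY) / numtiles;    /* tile end (exclusive)      */
            for (int j = jts; j < jte; j++)
                for (int i = 0; i < NX; i++)
                    for (int k = 0; k < NZ; k++)
                        field[j][i][k] *= 0.99;     /* stand-in for real physics */
        }
    }

    int main(void)
    {
        step_patch(8);    /* e.g. 8 tiles shared among 4 threads */
        return 0;
    }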
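The gathering step mentioned in the I/O bullet can be pictured as follows. This is an illustrative sketch rather than WRF's quilting implementation: the choice of rank 0 as the I/O server, the per-rank patch sizes and the printed message are all invented, but the two-stage pattern (gather the per-rank counts, then gather the data with MPI_Gatherv) is the operation whose cost the project identified as the remaining limit once the disk writes are hidden.

    /* Sketch of the data-gathering stage of I/O quilting (illustrative, not the
     * WRF implementation): compute ranks send their patch of a field to a
     * dedicated I/O-server rank with MPI_Gatherv, and only that rank touches
     * the disk, so the compute ranks can carry on time-stepping. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int server = 0;                             /* rank 0 plays the I/O server      */
        int npoints = (rank == server) ? 0 : 100 + rank;  /* patch sizes differ between ranks */
        double *patch = malloc((npoints > 0 ? npoints : 1) * sizeof(double));
        for (int i = 0; i < npoints; i++) patch[i] = rank;

        int *counts = NULL, *displs = NULL;
        double *global = NULL;
        if (rank == server) {
            counts = malloc(size * sizeof(int));
            displs = malloc(size * sizeof(int));
        }

        /* The server first learns how much each compute rank will contribute... */
        MPI_Gather(&npoints, 1, MPI_INT, counts, 1, MPI_INT, server, MPI_COMM_WORLD);

        int total = 0;
        if (rank == server) {
            for (int r = 0; r < size; r++) { displs[r] = total; total += counts[r]; }
            global = malloc((total > 0 ? total : 1) * sizeof(double));
        }

        /* ...then gathers the patches themselves; this is the step whose cost
         * remains visible to the compute PEs once the writes are hidden. */
        MPI_Gatherv(patch, npoints, MPI_DOUBLE,
                    global, counts, displs, MPI_DOUBLE, server, MPI_COMM_WORLD);

        if (rank == server)
            printf("I/O server gathered %d points; the write to disk would happen here\n", total);

        free(patch); free(counts); free(displs); free(global);
        MPI_Finalize();
        return 0;
    }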

Please see the PDF or HTML version of the report that summarises this project.