DSTAR uses IO for two main tasks: i) collecting physical quantities from a selected set of monitoring points inside the simulation domain and ii) writing the checkpoint data needed to restart the computation.
In the pre-dCSE version of the code, the monitoring data was collected in ASCII text files, one file per observation point. This could lead to hundreds or thousands of individual files being accessed over the parallel file system.
The restart data was written in binary format by a subgroup of MPI tasks, each of which collected the data from its associated ranks and wrote it to disk serially; that is, each block was written to the file immediately after it was received from one of the associated ranks. This approach saves buffer memory but blocks the progress of the other associated ranks. The data layout depended on the ranks used in the computation, which made reconfiguring the computation rather inflexible.
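The dependence of the file layout on the rank decomposition can be illustrated with a small serial sketch. It is not DSTAR code and uses no MPI: the function names, the near-equal 2D block decomposition, and the rank ordering are all assumptions made for illustration. The first layout concatenates each rank's local block in rank order, as a collect-and-write scheme of the kind described above would; the second places every cell at its global offset, which is the effect commonly obtained with MPI-IO file views.

```python
def block_bounds(n, parts, idx):
    """Start and end index of block `idx` when n cells are split into
    `parts` near-equal contiguous blocks (hypothetical decomposition)."""
    base, extra = divmod(n, parts)
    start = idx * base + min(idx, extra)
    return start, start + base + (1 if idx < extra else 0)


def rank_ordered_layout(field, nx, ny, px, py):
    """File contents when each rank's local block is appended in rank order.
    `field` is a flat, row-major nx*ny list; px*py is the rank grid."""
    out = []
    for rx in range(px):
        for ry in range(py):
            x0, x1 = block_bounds(nx, px, rx)
            y0, y1 = block_bounds(ny, py, ry)
            for i in range(x0, x1):
                out.extend(field[i * ny + j] for j in range(y0, y1))
    return out


def global_index_layout(field, nx, ny, px, py):
    """File contents when each rank writes its cells at their global offsets."""
    out = [None] * (nx * ny)
    for rx in range(px):
        for ry in range(py):
            x0, x1 = block_bounds(nx, px, rx)
            y0, y1 = block_bounds(ny, py, ry)
            for i in range(x0, x1):
                for j in range(y0, y1):
                    out[i * ny + j] = field[i * ny + j]
    return out


if __name__ == "__main__":
    nx, ny = 6, 4
    field = list(range(nx * ny))
    # Same data, two different rank grids.
    print(rank_ordered_layout(field, nx, ny, 2, 2))
    print(rank_ordered_layout(field, nx, ny, 3, 1))
    print(global_index_layout(field, nx, ny, 2, 2))
    print(global_index_layout(field, nx, ny, 3, 1))
```

With the rank-ordered layout the two rank grids produce different files for the same field, so a restart on a different decomposition needs a conversion step. With the global-offset layout the file is identical for both grids.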
In other applications it was found that when IO is performed in this manner, especially with ASCII text files, scalability is lost as the number of MPI ranks reaches the 1000-10000 range. To ensure scalable IO when using more than 1000 MPI ranks, the original IO operations were modified as follows:
Also related to IO, the logging mechanism was changed to avoid a multiple-file access pattern. In the new version, log messages are written to a single file by all MPI tasks using the shared file pointer provided by MPI-IO. For debugging purposes, a subroutine that dumps the contents of the global arrays has also been provided, along with a post-processing program for inspecting or comparing sections of the dumped arrays.
The IO benchmark was carried out for three grid sizes; see Table 1, columns 4-8. The write time of the monitoring data is significantly improved by using a single file. Somewhat surprisingly, however, IO for the restart file is fastest when using Fortran IO with one writer per node, even for runs that used 18,432 MPI ranks and approximately 217 GB of checkpoint data. This result suggests that the best strategy for the restart operation is to use MPI-IO while fine-tuning the run parameters (e.g. when searching for the best 2D MPI rank decomposition) and to switch to Fortran IO for the production runs, if this is faster.
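To give a sense of scale for the largest run, the short calculation below estimates the checkpoint volume per rank and per writer. It is only a back-of-the-envelope sketch: the function name is hypothetical, 217 GB is taken as 217e9 bytes, and the number of ranks per node (24 here) is an assumed value that is not stated in the text.

```python
def checkpoint_volume(total_bytes, n_ranks, ranks_per_node):
    """Split a checkpoint of total_bytes evenly over ranks, and over
    one writer per node (ranks_per_node is an assumed parameter)."""
    n_writers = -(-n_ranks // ranks_per_node)  # ceiling division
    return {
        "bytes_per_rank": total_bytes / n_ranks,
        "n_writers": n_writers,
        "bytes_per_writer": total_bytes / n_writers,
    }


if __name__ == "__main__":
    # 217 GB and 18,432 ranks come from the text; 24 ranks per node is assumed.
    v = checkpoint_volume(217e9, 18432, 24)
    print(f"per rank:   {v['bytes_per_rank'] / 1e6:.2f} MB")
    print(f"writers:    {v['n_writers']}")
    print(f"per writer: {v['bytes_per_writer'] / 1e6:.2f} MB")
```

Under these assumptions each rank holds only on the order of ten megabytes, while each node-level writer handles a few hundred megabytes in one stream.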
Lucian Anton 2011-09-13