Measuring the I/O speed for comparison proved difficult on HECToR. CrayPAT, Cray's profiling tool, did not handle netCDF/HDF5 API calls well, so we resorted to calls to the MPI_Wtime routine to measure the elapsed time for both serial and parallel I/O.
We show results for the case of 128 processes on 4 nodes, fully packed at 32 processes per node. Where the times recorded by the parallel processes differed, we show the worst case. For the serial case, we show the elapsed time on the master process. We time only the calls to the read or write functions themselves, i.e. fprintf/fscanf or nc_put_vara/nc_get_vara, and not the associated stores to memory, calls to transformation routines, or counter increments, which are identical between versions.
As can be seen from both Tables 1 and 2, the relative I/O speed in the parallel case is much worse than in the serial case; the absolute time, however, is much lower and represents a significant speedup.
From Figure 2 we can see that the solver scales almost identically in the parallel and serial cases. The total wall-clock time for runs using the new parallel I/O routines scales much better than for those using the serial I/O routines, although it is still not as close to linear as the solver alone, indicating that further optimisation of the parallel I/O may yield additional gains. The speedup over a single core is 22 with serial I/O and 90 with parallel I/O; the speedup of the solver alone is 180.