Results

Measuring I/O speed for comparison proved difficult on HECToR. CrayPAT, Cray's profiling tool, did not handle netCDF/HDF5 API calls well. We therefore resorted to calls to the MPI_Wtime routine to measure the elapsed time for both serial and parallel I/O.

We show results for 128 processes on 4 nodes, fully packed at 32 processes per node. Where times differed between parallel processes, we show the worst-case result recorded. For the serial case, we show the elapsed time on the master process. We time only calls to read or write functions, i.e. fprintf/fscanf or nc_put_vara/nc_get_vara, and not the associated stores to memory, calls to transformation routines or counter increments, which are identical between versions.
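The timing approach described above can be sketched as follows. This is a minimal illustration, not the code used for the measurements: the names `io_timer`, `io_timer_start`, `io_timer_stop` and `worst_case_time` are hypothetical, and timestamps are passed in as plain arguments so that the block compiles without MPI. In an MPI program the timestamps would come from MPI_Wtime, and the worst-case value would typically be obtained with a maximum reduction across processes rather than a local loop.

```c
#include <stddef.h>

/* Hypothetical helper: accumulates elapsed time spent inside I/O calls only. */
typedef struct {
    double started_at; /* timestamp of the most recent start */
    double total;      /* accumulated seconds inside timed regions */
} io_timer;

/* Mark the start of a timed I/O call; `now` stands in for MPI_Wtime(). */
void io_timer_start(io_timer *t, double now) {
    t->started_at = now;
}

/* Mark the end of a timed I/O call and add its duration to the total. */
void io_timer_stop(io_timer *t, double now) {
    t->total += now - t->started_at;
}

/* Worst-case (maximum) accumulated time over all processes; n must be >= 1.
   Stands in for a maximum reduction across MPI processes. */
double worst_case_time(const double *per_process, size_t n) {
    double worst = per_process[0];
    for (size_t i = 1; i < n; ++i) {
        if (per_process[i] > worst) {
            worst = per_process[i];
        }
    }
    return worst;
}
```

Only the read or write call itself would sit between `io_timer_start` and `io_timer_stop`, so that stores to memory, transformation routines and counter increments stay outside the timed region.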


Table 1: Input file sizes and times for 128 processes, fully packed at 32 processes per node.
          Input
          File size   Worst-case time   Approximate I/O speed
Serial    567 MB      30 s              18.9 MB/s
Parallel  2.9 MB      0.4 s             0.05 MB/s



Table 2: Output file sizes and times for 128 processes, fully packed at 32 processes per node.
          Output
          File size   Worst-case time   Approximate I/O speed
Serial    2600 MB     118 s             22 MB/s
Parallel  994 MB      7 s               1 MB/s


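As can be seen from Tables 1 and 2, the relative I/O speed for the parallel case is much worse than that of the serial case. The absolute time is much lower, however, falling from 30 s to 0.4 s for input (roughly 75 times faster) and from 118 s to 7 s for output (roughly 17 times faster), which represents a significant speedup.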
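The arithmetic behind the tabulated speeds can be checked with a short sketch. The function names are hypothetical. Note that the parallel speeds in Tables 1 and 2 are close to the aggregate rate (file size over worst-case time) divided by the 128 processes, about 0.057 MB/s and 1.1 MB/s; reading them as per-process figures is an inference from the numbers, not something the text states.

```c
/* Aggregate throughput in MB/s: total file size divided by elapsed time. */
double aggregate_mb_per_s(double size_mb, double seconds) {
    return size_mb / seconds;
}

/* Per-process share of the aggregate throughput (assumed interpretation
   of the parallel "approximate I/O speed" column). */
double per_process_mb_per_s(double size_mb, double seconds, int nprocs) {
    return size_mb / seconds / nprocs;
}

/* How many times faster the parallel I/O completed, by elapsed time. */
double time_ratio(double serial_seconds, double parallel_seconds) {
    return serial_seconds / parallel_seconds;
}
```

For example, `aggregate_mb_per_s(567.0, 30.0)` reproduces the serial input speed of 18.9 MB/s, and `time_ratio(118.0, 7.0)` gives the roughly 17-fold reduction in output time.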

Figure 2: Strong scaling of PARA-BMU using serial and parallel I/O showing ideal case (Linear), solver only (Solver) and complete runtime (Total).
\includegraphics[width=0.88\textwidth, keepaspectratio]{allscaling2.eps}

From Figure 2 we can see that the solver scales almost identically in the parallel and serial cases. The total wall clock time for calculations using the new parallel I/O routines scales much better than for those using the serial I/O routines. It is still not as close to linear as the solver alone, however, indicating that there may be further gains from optimising the parallel I/O. The speedup over a single core is $\sim$22 with serial I/O and $\sim$90 with parallel I/O; the speedup of the solver alone is $\sim$180.
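For reference, the quoted speedups can be related as follows, taking speedup in the usual sense of single-core time over time on $p$ cores. The symbols $T(p)$ and $S(p)$ are notation introduced here for illustration, and the comparison between I/O variants assumes that both share a similar single-core baseline, which the text does not state.

\[
S(p) = \frac{T(1)}{T(p)}, \qquad
\frac{S_{\mathrm{parallel\ I/O}}}{S_{\mathrm{serial\ I/O}}} \approx \frac{90}{22} \approx 4.1, \qquad
\frac{S_{\mathrm{parallel\ I/O}}}{S_{\mathrm{solver}}} \approx \frac{90}{180} = 0.5
\]

Under that assumption, the complete run with parallel I/O finishes roughly four times faster than with serial I/O at the core count where these speedups were measured, while reaching about half of the solver-only speedup.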