Results

Measuring the I/O speed for comparison proved difficult on HECToR. CrayPAT, Cray's profiling tool, did not handle netCDF/HDF5 API calls well. We therefore resorted to calls to the MPI_Wtime routine to measure the elapsed time for both serial and parallel I/O.

We show results for 128 processes on 4 nodes, fully packed at 32 processes per node. Where times differed between parallel processes, we show the worst-case result recorded. For the serial case, we show the elapsed time on the master process. We time only the calls to read or write functions, i.e. fprintf/fscanf or nc_put_vara/nc_get_vara, and not the associated stores to memory, calls to transformation routines or counter increments, which are identical between versions.
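The timing approach described above can be sketched as follows. This is a minimal illustration, not the code used in PARA-BMU: the names `io_timer`, `wallclock_fn`, `fake_clock` and `worst_case_time` are hypothetical. The clock is passed in as a function pointer so that the sketch compiles without an MPI installation; in an MPI program one would pass `MPI_Wtime`, which has the same `double (void)` signature, and the worst case over processes would more naturally be obtained with a reduction using `MPI_MAX`.

```c
#include <stddef.h>

/* Wall-clock source returning seconds. MPI_Wtime matches this signature. */
typedef double (*wallclock_fn)(void);

/* Hypothetical accumulator for time spent inside I/O calls only. */
typedef struct {
    wallclock_fn now;   /* clock to sample */
    double start;       /* time at the most recent io_timer_start */
    double elapsed;     /* total seconds accumulated so far */
} io_timer;

static void io_timer_init(io_timer *t, wallclock_fn now)
{
    t->now = now;
    t->start = 0.0;
    t->elapsed = 0.0;
}

/* Call immediately before a read/write call, e.g. nc_put_vara(...). */
static void io_timer_start(io_timer *t)
{
    t->start = t->now();
}

/* Call immediately after the read/write call returns. */
static void io_timer_stop(io_timer *t)
{
    t->elapsed += t->now() - t->start;
}

/* Largest of the per-process elapsed times (the "worst case").
 * Assumes n >= 1 and non-negative times. */
static double worst_case_time(const double *per_process, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (per_process[i] > worst) {
            worst = per_process[i];
        }
    }
    return worst;
}

/* Stand-in clock (an assumption for this sketch only): advances by
 * 0.5 s on every call so the example is deterministic without MPI. */
static double fake_time = 0.0;
static double fake_clock(void)
{
    fake_time += 0.5;
    return fake_time;
}
```

Wrapping only the I/O call between `io_timer_start` and `io_timer_stop` keeps memory stores, transformation routines and counter increments out of the measurement, which is the restriction described above.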

Table 1: Input file sizes and times for 128 processes, fully packed at 32 processes/node.

| Input | File size | Worst-case time | Approximate I/O speed |
|---|---|---|---|
| Serial | 567 MB | 30 s | 18.9 MB/s |
| Parallel | 2.9 MB | 0.4 s | 0.05 MB/s |

Table 2: Output file sizes and times for 128 processes, fully packed at 32 processes/node.

| Output | File size | Worst-case time | Approximate I/O speed |
|---|---|---|---|
| Serial | 2600 MB | 118 s | 22 MB/s |
| Parallel | 994 MB | 7 s | 1 MB/s |

As Tables 1 and 2 show, the approximate I/O speed reported for the parallel case is much lower than for the serial case. However, the absolute time is also much lower (0.4 s against 30 s for input, and 7 s against 118 s for output), which represents a significant speedup.
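The arithmetic behind these figures can be checked with a short sketch. The serial speeds match file size divided by elapsed time (567 MB / 30 s = 18.9 MB/s). The parallel speeds are consistent with the aggregate rate divided by the 128 processes (2.9 MB / 0.4 s / 128 is roughly 0.057 MB/s, and 994 MB / 7 s / 128 is roughly 1.1 MB/s). That per-process reading is an assumption made here to reproduce the tabulated values; the text does not state it. The function names below are hypothetical.

```c
#include <stdio.h>

/* Size divided by time, optionally shared across nprocs processes.
 * Use nprocs = 1 for the serial case. */
static double speed_mb_per_s(double size_mb, double seconds, int nprocs)
{
    return size_mb / seconds / (double)nprocs;
}

/* Ratio of serial to parallel elapsed time. */
static double time_speedup(double serial_s, double parallel_s)
{
    return serial_s / parallel_s;
}

/* Print the values derived from Tables 1 and 2. */
static void report(void)
{
    printf("input  serial   %.2f MB/s\n", speed_mb_per_s(567.0, 30.0, 1));
    printf("input  parallel %.3f MB/s per process\n",
           speed_mb_per_s(2.9, 0.4, 128));
    printf("output serial   %.2f MB/s\n", speed_mb_per_s(2600.0, 118.0, 1));
    printf("output parallel %.3f MB/s per process\n",
           speed_mb_per_s(994.0, 7.0, 128));
    printf("input  time speedup %.1f\n", time_speedup(30.0, 0.4));
    printf("output time speedup %.1f\n", time_speedup(118.0, 7.0));
}
```

On these numbers the elapsed I/O time falls by a factor of 75 for input and about 17 for output, even though the per-process rate is small.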

**Figure 2:** Strong scaling of PARA-BMU using serial and parallel I/O showing ideal case (Linear), solver only (Solver) and complete runtime (Total).
$\includegraphics[width=0.88\textwidth, keepaspectratio]{allscaling2.eps}$

Figure 2 shows that the solver scales almost identically in the parallel and serial cases. The total wall-clock time for calculations using the new parallel I/O routines scales much better than for those using the serial I/O routines, although it is still not as close to linear as the solver alone. This indicates that there may still be gains to be found by further optimising the parallel I/O. The speedup over a single core is $\sim 22$ for serial I/O and $\sim 90$ for parallel I/O, while the speedup of the solver alone is $\sim 180$.