The HECToR Service is now closed and has been superseded by ARCHER.

Adding Parallel I/O to PARA-BMU

VOX-FE is a voxel-based finite element bone modelling suite developed by Prof. Michael Fagan's Medical & Biological Engineering group at the University of Hull. It is one of the demonstrator applications for the EPSRC-funded "Novel Asynchronous Algorithms and Software for Large Sparse Systems" project, and the core algorithms of VOX-FE are being redeveloped for increased scalability and functionality. The VOX-FE suite comprises two parts: a GUI for manipulating bone structures and visualising the results of applying strain forces, and an MPI-parallelised finite element solver, PARA-BMU, which performs the computation required to solve the linear elasticity problem and calculate stresses and strains in the bone. Example applications include computing the maximum principal strain in a human mandible (jaw bone) undergoing incisor biting, or understanding the stresses in an axially loaded femur.

In this dCSE project, the primary goal was to improve the scalability of the code by parallelising the I/O routines. To that end there were two objectives:

  • Reduce file sizes by converting to the netCDF-4 (HDF5-based) format, with a target reduction of between 2 and 20 times.
  • Increase I/O speed by using parallel netCDF routines, with a target speedup of at least 3 to 4 times.

By using the libraries mentioned above, we were able to meet both objectives. We reduced file sizes by up to a factor of 190 and reduced I/O time by up to a factor of 7. We conclude that the use of the freely available netCDF libraries, with their parallel HDF5 backend, is an easy way to add parallel I/O to an existing application. We note that conversion programs may be required (as in this case) to convert to/from the original formats. Use of netCDF also allows files to be made self-describing and portable between systems, along with offering good opportunities for compression.
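Large compression factors of this kind are plausible for voxel data because a regular voxel grid is dominated by long runs of identical material labels, which deflate-style compression (the same zlib algorithm that netCDF-4/HDF5 applies per chunk when deflation is enabled) handles very well. A minimal stdlib sketch, using synthetic data rather than a real VOX-FE model, illustrates the effect:

```python
import random
import struct
import zlib

# Synthetic voxel grid: mostly empty space (label 0) with ~10% bone (label 1),
# packed as 4-byte little-endian integers, as a naive binary dump might store it.
# These proportions are illustrative, not taken from a real VOX-FE model.
random.seed(0)
voxels = [1 if random.random() < 0.1 else 0 for _ in range(100_000)]
raw = b"".join(struct.pack("<i", v) for v in voxels)

# Deflate-compress the buffer, as an HDF5 chunk filter would.
compressed = zlib.compress(raw, level=6)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: ~{ratio:.0f}x")
```

Real models with more structure (connected bone regions rather than independent random voxels) compress even better, which is consistent with the factor-of-190 reduction reported above.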

Figure 1: Strong scaling of PARA-BMU using serial and parallel I/O showing ideal case (Linear), solver only (Solver) and complete runtime (Total).

From Figure 1 we can see that the solver scales almost identically in the parallel and serial cases. The total wall clock time for calculations using the new parallel I/O routines scales much better than for those using the serial I/O routines, although it is still not as close to linear as the solver alone, indicating that there may still be gains to be found by further optimising the parallel I/O. The speedup over a single core is ~22 for serial I/O and ~90 for parallel I/O. The speedup of the solver alone is ~180.
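The gap between the solver-only and total speedups is what Amdahl's law predicts when a small non-parallelised fraction of the runtime (here, I/O) remains. A short sketch, using an illustrative core count of 512 and hypothetical serial fractions (the report does not state these values), shows how shrinking that fraction moves the total speedup towards the solver's:

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Overall speedup predicted by Amdahl's law:
    S = 1 / (f + (1 - f) / p), where f is the non-parallel fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Hypothetical numbers for illustration only: at 512 cores, a serial I/O
# cost of ~4% of single-core runtime caps the overall speedup near 24,
# while ~0.9% allows roughly 90, and 0% recovers the ideal 512 --
# the same qualitative shape as the curves in Figure 1.
for f in (0.04, 0.009, 0.0):
    print(f"serial fraction {f:.3f}: speedup {amdahl_speedup(f, 512):.1f}")
```

This is why parallelising the I/O improves total scaling so markedly even though the solver itself was unchanged.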

Please see the PDF or HTML version for a report which summarises this project.