Starting from the new, refactored code, the serial implementation of CVODE had to be redone. Whilst this was not a huge task, and the original work could be used as a guide, it still consumed some effort. However, this initial effort resulted in a marked improvement in performance. For a sphere with 20K vertices and 100K finite elements, running in serial on the XT4, DVODE took 3098.86 seconds for a single external field step, whereas the CVODE version took 1890.35 seconds. This represents a speed-up factor of 1.64.
For the semi-parallel version, the computation can be split into two parts: the implicit ODE solver, and the so-called right-hand side (RHS) function, which the implicit ODE solver requires to determine the time-step update. The RHS function computes the energies described in Section 2 and is the most computationally intensive part of the code. The semi-parallel strategy was to compute the RHS in serial, then scatter the values of the energies computed in the RHS across the parallel elements. The call to the CVODE solver would be made in parallel, and the results of the time step then gathered back together, to compute the next RHS in serial again.
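One step of the intended cycle can be sketched with generic MPI calls; this is a minimal, hedged outline, and the names Y_global and Y_local, the problem size, and the placeholder arithmetic are illustrative only, not taken from the actual code:

```fortran
! Sketch of one step of the proposed serial-parallel-serial cycle.
! Names (Y_global, Y_local) and sizes are illustrative only.
program semi_parallel_cycle
  use mpi
  implicit none
  integer, parameter :: DP = kind(1.0d0)
  integer :: rank, nprocs, ierr, NEQ, NEQ_local
  real(kind=DP), allocatable :: Y_global(:), Y_local(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  NEQ = 3*4                          ! stand-in for 3*NMAX
  NEQ_local = NEQ/nprocs             ! assumes nprocs divides NEQ exactly
  allocate(Y_local(NEQ_local))
  allocate(Y_global(NEQ))            ! only really needed on the root rank
  if (rank == 0) Y_global = 1.0_DP   ! stand-in for the serial RHS result

  ! Scatter the serially computed energies across the processes,
  call MPI_Scatter(Y_global, NEQ_local, MPI_DOUBLE_PRECISION,  &
                   Y_local,  NEQ_local, MPI_DOUBLE_PRECISION,  &
                   0, MPI_COMM_WORLD, ierr)

  ! ... take the implicit time step in parallel (FCVODE would act on
  !     the local data here); a trivial update stands in for it ...
  Y_local = 2.0_DP*Y_local

  ! then gather the updated solution back for the next serial RHS.
  call MPI_Gather(Y_local,  NEQ_local, MPI_DOUBLE_PRECISION,  &
                  Y_global, NEQ_local, MPI_DOUBLE_PRECISION,  &
                  0, MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program semi_parallel_cycle
```

The scatter before the parallel step and the gather after it are exactly the two global communications the strategy introduces per time step.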
This gather/scatter serial-parallel-serial computation was decided upon as a staged development strategy: the first stage was to implement the parallel CVODE solver call with the RHS in serial. The parallelisation of the RHS for a fully parallel code would be implemented in a later package.
When attempting to implement this in the refactored, simplified code, it became clear that this approach simply would not work, because of the way the user-supplied RHS function is called by the CVODE library at each time step. When using the CVODE library, the user calls a set-up routine in parallel, which sets the sizes of the arrays to be used by the CVODE library, as shown in the code fragment below.
NEQ = 3*NMAX
! calculate the local NEQ and the displacements
NEQ_local = NEQ/nprocs
call fnvinitp(comm, 1, NEQ_local, NEQ, IER)
call fcvmalloc(T, Y, METH, ITMETH, IATOL, RTOL, ATOL,
$              IOUT, ROUT, IPAR, RPAR, IER)
where NMAX is the number of finite elements of the mesh, which hence determines the size of the arrays representing the matrices. NEQ_local is then the size of the array local to each processor.
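The comment in the fragment above also mentions displacements: when nprocs does not divide NEQ exactly, a block decomposition needs a per-rank size and a per-rank offset. The following is a hedged sketch of one standard way to compute them, where the first mod(NEQ, nprocs) ranks take one extra equation; all names and sizes are illustrative, not taken from the actual code:

```fortran
! Block decomposition of NEQ equations over nprocs processes when the
! division is not exact: the first rem ranks take one extra equation.
! All names here are illustrative.
program block_decomposition
  implicit none
  integer :: NEQ, nprocs, rank, rem, NEQ_local, disp

  NEQ    = 10
  nprocs = 4
  rem    = mod(NEQ, nprocs)
  do rank = 0, nprocs-1
     NEQ_local = NEQ/nprocs
     if (rank < rem) NEQ_local = NEQ_local + 1
     ! disp = 0-based start index of this rank's block of equations
     disp = rank*(NEQ/nprocs) + min(rank, rem)
     print *, 'rank', rank, 'NEQ_local', NEQ_local, 'disp', disp
  end do
end program block_decomposition
```

For NEQ = 10 over 4 processes this gives local sizes 3, 3, 2, 2 at displacements 0, 3, 6, 8, which sum back to NEQ.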
In the main program, the call to the CVODE ODE solver is made in parallel. The user supplied RHS function, which is called by the CVODE solver, is also called in parallel. So, all the data structures have local, not global scope.
CALL FCVODE(TOUT, T, Y, ITASK, IER)
The call to the CVODE solver is shown in the code fragment above, passing the array Y, which holds the total derivatives of the global magnetic energy. The user-supplied RHS function, named FCVFUN, is shown in the code fragment below.
SUBROUTINE FCVFUN(T, Y, YDOT, IPAR, RPAR, IER)
  use mpi_global
  IMPLICIT NONE
  REAL (KIND=DP) :: T, Y(3*NMAX), YDOT(3*NMAX)
The array Y is passed to the subroutine FCVFUN by the CVODE library and is local in scope: it is set up by the fnvinitp routine and populated by CVODE. Whilst this local array can be declared to be any size the programmer wishes, the data is spread out across the processors, so there is no array holding a complete global copy of the data. In the serial version this did not matter, because the initial setup was serial and so the whole array was populated.
The original idea was to compute the RHS in serial on a single (master) process, and then scatter the data to all processors. However, the array which holds the data in the RHS function, being local in scope, holds only NEQ_local values and not the whole data set, so the computation cannot be done in serial because the data is already distributed. The serial computation could not be implemented without first gathering all the data. Whilst it would have been possible to override the CVODE parallel implementation, implement this by hand, and even get it to work correctly, the high unforeseen communication overhead would have seriously undermined any performance gain from working in parallel. The ultimate goal was a fully parallel code, so after consulting with the PI and NAG CSE it was decided to abandon the semi-parallel strategy and to develop a fully parallel code, where the RHS is implemented in parallel from the beginning.
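Had the override been pursued, every call to the RHS function would have had to reassemble the global state before the serial computation could run, for example with an MPI_Allgather inside the routine. A hedged sketch of what that would have looked like follows; the routine name, argument list and slice arithmetic are illustrative, not the actual FCVFUN interface, and the serial energy computation is elided:

```fortran
! Sketch of the abandoned workaround: gather the distributed Y inside
! every call to the RHS, compute serially, then keep only the local
! slice of YDOT. All names here are illustrative.
subroutine fcvfun_workaround(T, Y_local, YDOT_local, NEQ_local, NEQ, comm, IER)
  use mpi
  implicit none
  integer, parameter :: DP = kind(1.0d0)
  integer,       intent(in)  :: NEQ_local, NEQ, comm
  real(kind=DP), intent(in)  :: T, Y_local(NEQ_local)
  real(kind=DP), intent(out) :: YDOT_local(NEQ_local)
  integer,       intent(out) :: IER
  real(kind=DP) :: Y_global(NEQ), YDOT_global(NEQ)
  integer       :: rank

  call MPI_Comm_rank(comm, rank, IER)

  ! Global communication on EVERY RHS evaluation: this is the
  ! unforeseen overhead that undermined the semi-parallel strategy.
  call MPI_Allgather(Y_local,  NEQ_local, MPI_DOUBLE_PRECISION,  &
                     Y_global, NEQ_local, MPI_DOUBLE_PRECISION,  &
                     comm, IER)

  ! ... compute the energies serially from Y_global into YDOT_global ...

  ! Each rank then keeps only its own slice (assuming equal block sizes).
  YDOT_local = YDOT_global(rank*NEQ_local+1 : (rank+1)*NEQ_local)
end subroutine fcvfun_workaround
```

Since the RHS is evaluated at least once per time step, this all-to-all traffic scales with both NEQ and the number of processes, which is why the strategy was abandoned in favour of a fully parallel RHS.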
Chris Maynard 2011-06-08