Parallel CVODE for Time Evolution

Once the data structures had been constructed in parallel, the computational methods could be addressed. There were two main tasks to enable this in the PETSc MicroMag code: the call to the CVODE solver, and the RHS function, which itself calls PETSc's Krylov subspace solver methods such as CG. The PETSc library comes with documentation and an extensive set of examples demonstrating how the various computational methods can be called. Whilst examining and running the SUNDIALS example it was discovered that there is a bug in the PETSc-SUNDIALS interface for anything other than a trivially small number of processors. The CVODE solver would run, but report that it had taken the maximum allowed number of steps without reaching the final time. The example ran correctly in serial, and in parallel only if the number of MPI tasks did not exceed 6 and the system size was not increased by more than $10\%$. The maximum allowed number of steps and the final time are both adjustable parameters.
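
As a concrete illustration of what these two tasks amount to, the sketch below performs a Krylov (CG) solve inside a TS right-hand-side callback and drives the time evolution through PETSc's SUNDIALS wrapper. It is a toy problem, not the MicroMag code: the operator, the right-hand side $\dot{u} = -A^{-1}u$ and all variable names are placeholders, it assumes PETSc was configured with SUNDIALS support, and it is written roughly against the 3.1-series C API (calls such as TSStep, TSSetDuration and the four-argument KSPSetOperators were renamed or changed in later releases).

#include "petscts.h"

typedef struct {
  Mat A;    /* stand-in operator for the effective-field solve */
  KSP ksp;  /* Krylov solver (CG), called from inside the RHS  */
  Vec h;    /* work vector holding the solve result            */
} AppCtx;

/* RHS callback handed to CVODE via the TS interface: given u at time t,
   return du/dt in udot.  The inner KSPSolve stands in for the Krylov
   work that the real MicroMag RHS performs. */
static PetscErrorCode RHSFunction(TS ts, PetscReal t, Vec u, Vec udot, void *ptr)
{
  AppCtx         *ctx = (AppCtx*)ptr;
  PetscErrorCode ierr;

  ierr = KSPSolve(ctx->ksp, u, ctx->h);CHKERRQ(ierr); /* solve A h = u with CG */
  ierr = VecCopy(ctx->h, udot);CHKERRQ(ierr);
  ierr = VecScale(udot, -1.0);CHKERRQ(ierr);          /* toy model: du/dt = -A^{-1} u */
  return 0;
}

int main(int argc, char **argv)
{
  TS             ts;
  Vec            u;
  AppCtx         ctx;
  PetscInt       i, lo, hi, steps, maxsteps = 1000, n = 64;
  PetscReal      ftime, dt = 0.01, tmax = 1.0;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, PETSC_NULL, PETSC_NULL);CHKERRQ(ierr);

  /* Distributed state vector and a diagonal SPD stand-in operator A = 2I */
  ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n, &u);CHKERRQ(ierr);
  ierr = VecSet(u, 1.0);CHKERRQ(ierr);
  ierr = VecDuplicate(u, &ctx.h);CHKERRQ(ierr);
  ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n,
                         1, PETSC_NULL, 0, PETSC_NULL, &ctx.A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(ctx.A, &lo, &hi);CHKERRQ(ierr);
  for (i = lo; i < hi; i++) {
    ierr = MatSetValue(ctx.A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(ctx.A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(ctx.A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Task 2: the Krylov solver the RHS function will call */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ctx.ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ctx.ksp, ctx.A, ctx.A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetType(ctx.ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ctx.ksp);CHKERRQ(ierr);

  /* Task 1: the CVODE call, reached through PETSc's TS wrapper */
  ierr = TSCreate(PETSC_COMM_WORLD, &ts);CHKERRQ(ierr);
  ierr = TSSetProblemType(ts, TS_NONLINEAR);CHKERRQ(ierr);
  ierr = TSSetType(ts, TSSUNDIALS);CHKERRQ(ierr);     /* needs PETSc built with SUNDIALS */
  ierr = TSSetRHSFunction(ts, RHSFunction, &ctx);CHKERRQ(ierr);
  ierr = TSSetSolution(ts, u);CHKERRQ(ierr);
  ierr = TSSetInitialTimeStep(ts, 0.0, dt);CHKERRQ(ierr);
  ierr = TSSetDuration(ts, maxsteps, tmax);CHKERRQ(ierr); /* max steps and final time */
  ierr = TSSetFromOptions(ts);CHKERRQ(ierr);
  ierr = TSStep(ts, &steps, &ftime);CHKERRQ(ierr);    /* drives the integration in 3.1 */

  ierr = TSDestroy(ts);CHKERRQ(ierr);
  ierr = KSPDestroy(ctx.ksp);CHKERRQ(ierr);
  ierr = MatDestroy(ctx.A);CHKERRQ(ierr);
  ierr = VecDestroy(ctx.h);CHKERRQ(ierr);
  ierr = VecDestroy(u);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

The TSSetType call is the single point at which the integrator is chosen; this is the hook used below to work around the bug.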

Varying all the allowed parameters did not change this behaviour. For example, running PETSc example 4 (ts/examples/tutorials/ex4) with the SUNDIALS time-stepper (TS) solver in version 3.1-p3, on 128 nodes of HECToR, with the system size increased to $m=1024$ (the number of unknowns being $m^2$), resulted in the following error:

[CVODE ERROR]  CVODE
  At t = 0.182961, mxstep steps taken before reaching tout.
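
For reference, the failing run was launched along the following lines on HECToR's Cray back end; the aprun invocation is illustrative (the task count is elided), with -ts_type sundials selecting the SUNDIALS integrator and -m setting the grid dimension in ex4:

aprun -n <ntasks> ./ex4 -m 1024 -ts_type sundials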

This was reported to the PETSc developers, who confirmed that there was indeed a bug. They committed to fixing it, but without any timescale for doing so. This presented a significant problem to the successful completion of the project. It was decided that the best course of action was to replace the call to the SUNDIALS solver with a simple Euler solver, which is a native PETSc method. The code could then run correctly in parallel, albeit with limited scientific usability. The solver could be switched back to SUNDIALS once the PETSc developers issued a patch or a new release. A further problem was the amount of time consumed in discovering, and subsequently confirming, the bug with the PETSc developers. With limited time remaining to produce a working parallel code, a successful conclusion to the project would be difficult.
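
Because the integrator is selected through the common TS interface, the workaround amounted to changing the solver type in one place, which also makes it easy to revert once a fixed PETSc release appears. A minimal sketch:

  ierr = TSSetType(ts, TSEULER);CHKERRQ(ierr);  /* native PETSc forward Euler, replacing TSSUNDIALS */

or, equivalently, overriding the type at run time with the option -ts_type euler.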
