The original version of the code (compiled with -O3) stops on time steps 27 and 54 (as reported by time.step and 1_time.step files). The code exits normally when the zonal velocities become too large - i.e. the model has become unstable. The ocean.output file reports the following:
===>>> : E R R O R =========== stpctl: the zonal velocity is larger than 20 m/s ====== kt= 27 max abs(U): 23.37 , i j k: 123 3 18Because the error output appears in the un-nested output file this tells us that something is wrong with the un-nested part of the model/calculation.
Before investigating this further the optimisation level was reduced to -O0 and the code re-run. Setting the optimisation level to -O0 removes all optimisations and should allow us to detect any possible bugs which could be causing the code to fail. Reducing the optimisation level to -O0 results in the code running to completion, finishing on time steps 5475 and 10950 as desired. To investigate which subset of the source code is affected by the optimisation level the optimisation level is increased systematically. Table 12 summarises the results.
It appears that increasing the optimisation level of the main /WORK/Makefile from -O1 to -O2 changes the code in such a way as to cause a crash. From the PGI documentation the key differences between the optimisation levels are as follows:
Basically -02 includes global optimisations such as induction recognition and loop invariant motion. Neither of these should cause the code to blow up but clearly something is going wrong. It's possible the optimisation highlights an existing bug in the code or that the compiler is doing something it shouldn't. It's also possible that the optimisation brings out an existing numerical instability, e.g. calculations being performed in a slightly different order which results in the velocity increasing too fast. The later hypothesis can be tested by reducing the time step used to see if this allows the code to run successfully.
The time step is specified in the namelist file(s) via the rdt parameter. For the BASIC model the time step used in the nested model is half that used in the non-nested model. The initial values of rdt for the non-nested and nested models are respectively 5760.0 seconds and 2880.0 seconds. To ascertain whether the time step is too large the model can be re-run with a smaller time step. Initially the time steps were reduced by half to 2880.0 and 1440.0 respectively, however, the zonal velocity still increased too fast. Reducing the time steps but a factor of four relative the the original, i.e. to 1440.0 and 720.0 seconds allows the run to successfully.
We can gain a better understanding of what is going on by plotting out the zonal velocity as a function of model time. The zonal velocity can be obtained by modifying the stp_ctl subroutine in stpctl.F90 such that it outputs the zonal velocity every time step and not just when a problem occurs. A diff of the original and modified code gives:
diff stpctl.F90 stpctl.F90.orig < !FR dump out velocity to file for the failed step < WRITE(99,*) kt, zumax, ii, ij, ik < !FR dump out velocity to file for the failed step 150,153d146 < !FR dump out velocity to file for every time step < WRITE(99,*) kt, zumax, rdt < !FR dump out velocity to file for every time step
Figure 11 shows the zonal velocity plotted against model time for the non-nested (i.e. namelist model). Figure 12 shows the zonal velocity plotted against model time for the nested (i.e. 1_namelist) model.
From figures 11 and 12 it is clear that the problem with the velocity occurs for the non-nested model (see red line on figure 11). The original time step of 5760.0 and 2880.0 seconds for the non-nested and nested models is simply too large. Reducing the time step by a factor of 4, to 1440.0 and 720.0 seconds prevents the zonal velocity from blowing up. With the reduced time step the -O0 and -O2 versions of the code behave very similarly. The zonal velocities are almost identical (the blue and green lines overlap on figures 11 and 12).
It seems there are two choices for running the AGRIF models; run with a code compiled with -O1 to avoid the numerical instabilities or reduce the time step. The second solution is safer as the model could become unstable even with -O1 or less. The researchers have reduced the time step as suggested and confirmed that the results appear sensible when compared with those obtained on their SGI cluster.