The MERGED AGRIF model also initially failed with an error of the following form, from the ocean.output file:
===>>> : E R R O R =========== stpctl: the zonal velocity is larger than 20 m/s ====== kt= 4 max abs(U): 50.06 , i j k: 243 56 53
and also from the 1_ocean.output file:
===>>> : E R R O R =========== stpctl: the zonal velocity is larger than 20 m/s ====== kt= 16 max abs(U): 461.8 , i j k: 314 3 30
According to the output files (*ocean.output and time.step), the model stops at time steps 4, 16 and 64 on the successive levels of nesting.
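When comparing many runs, it can help to pull the time step, peak velocity and grid indices out of the stpctl message automatically. The following is a minimal sketch in Python, assuming the message has the single-line layout quoted above; the function name parse_stpctl and the regular expression are illustrative and not part of NEMO.

```python
import re

# Pattern for the stpctl abort message in the form quoted in this section.
# The layout is an assumption based on the two examples shown above.
STPCTL = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([0-9.eE+-]+)\s*,"
    r"\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stpctl(text):
    """Return (kt, max_abs_u, (i, j, k)) for the first stpctl error, or None."""
    match = STPCTL.search(text)
    if match is None:
        return None
    kt = int(match.group(1))           # time step at which the model stopped
    max_abs_u = float(match.group(2))  # largest zonal velocity in m/s
    ijk = tuple(int(g) for g in match.groups()[2:])  # grid indices i, j, k
    return kt, max_abs_u, ijk


sample = (
    "===>>> : E R R O R =========== stpctl: the zonal velocity is larger "
    "than 20 m/s ====== kt= 4 max abs(U): 50.06 , i j k: 243 56 53"
)
print(parse_stpctl(sample))
```

Applied to each of the ocean.output files in turn, this gives one record per grid, which makes it easier to see at a glance which level of nesting failed first.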
As with the BASIC model, different compiler options were explored to see if the optimisation level was a factor in causing the velocities to grow uncontrollably. The MERGED model was compiled incrementally with optimisation levels from -O3 down to -O0. For each optimisation level the code stops in the same manner as described above.
The time step, represented by the value of rdt in the namelist, 1_namelist and 2_namelist files, was also varied to see if reducing it gave any improvement. Table 15 summarises the results.
From Table 15 it seems that, regardless of the time step used, the MERGED AGRIF model continues to crash early in the model run. This suggests that, unlike the BASIC model, the choice of time step is not the issue. The zonal velocity was extracted as described in section 10.2. Figure 13 shows the zonal velocity plotted as a function of the elapsed model time for three different values of the time step, rdt. From Figure 13 it is clear that the problem lies with the outermost (i.e. coarsest) model (namelist), with the model using 1_namelist (i.e. the first level of nesting) also affected, but to a lesser degree.
[Figure 13: zonal velocity as a function of elapsed model time for three values of the time step, rdt.]
Further investigation of the output files suggests that something actually goes wrong prior to time step 4. The mpp_output.0* files are found to contain NaN values as early as time step 2. The cause of these NaN values is currently unknown but clearly they should not be present if the code is running correctly.
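One way to pin down where the NaN values first appear is to scan the output files for the earliest NaN token. The sketch below assumes the mpp_output.0* files are plain text with whitespace-separated fields, which has not been verified here; the function name first_nan is hypothetical.

```python
import math


def first_nan(lines):
    """Return (line_number, token) for the first NaN token, or None.

    Assumes plain-text output with whitespace-separated fields.
    """
    for line_number, line in enumerate(lines, start=1):
        for token in line.split():
            try:
                value = float(token)
            except ValueError:
                continue  # not a number (labels, punctuation and so on)
            if math.isnan(value):
                return line_number, token
    return None
```

An open file object can be passed directly as lines, so each output file can be checked in turn and the earliest line reported for each process.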
Locating the cause of these NaN values will likely be fundamental to getting the merged model to run on HECToR. Unfortunately, the PGI compiler doesn't allow direct trapping of NaN values. The -Ktrap flag can be used to trap a number of other numerical problems, e.g. denormalised operand, divide-by-zero, overflow, underflow and inexact result. Trapping NaN values may instead be possible using the isnan function, a logical function which returns true if its argument is a NaN and false otherwise. To use it, the code must be instrumented with isnan() calls wherever a NaN value is suspected.
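The kind of instrumentation meant here can be illustrated with a short sketch. It is written in Python purely for illustration; in NEMO itself the equivalent check would be Fortran code calling isnan() on the array elements. The function name check_nan, the field[k][j][i] layout and the 1-based indices in the message are all assumptions made for this example.

```python
import math


def check_nan(name, field, kt):
    """Raise on the first NaN found in a 3-D nested list field[k][j][i].

    Indices in the message are 1-based, mirroring the "i j k" convention
    of the stpctl output. The array layout is an assumption.
    """
    for k, level in enumerate(field, start=1):
        for j, row in enumerate(level, start=1):
            for i, value in enumerate(row, start=1):
                if math.isnan(value):
                    raise FloatingPointError(
                        f"NaN in {name} at kt={kt}, i j k: {i} {j} {k}"
                    )
```

Calling such a check after each suspect routine within a time step narrows down which routine first produces the NaN, at the cost of a full sweep over the array per call.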
The PathScale compiler, however, provides options which may help with NaN detection particularly if the NaN values are arising from uninitialised values. The two flags of interest are:
Compiling with PathScale version 3.0 fails on lib_mpp.F90 with an internal compiler error:
ftn -freeform -c -Dkey_agrif -Dkey_agrif_nolim -Dkey_trabbl_dif -Dkey_mpp_mpi -Dkey_orca_r1=64 -Dkey_lim2 -Dkey_dynspg_flt -Dkey_diaeiv -Dkey_ldfslp -Dkey_traldf_c2d -Dkey_traldf_eiv -Dkey_dynldf_c3d -Dkey_dtatem -Dkey_dtasal -Dkey_tradmp -Dkey_trabbc -Dkey_zdftke -Dkey_zdfddm -O0 -r8 -module ../../../lib -I../../../lib -I../../../lib/oce -I/home/n01/n01/fionanem/netcdf/3.6.2/include \
OPAFILES/lib_mpp.F90 || { if [ -f lib_mpp.L ] ; then mv lib_mpp.L ../../../tmp ; fi ; false ; exit ; }
/opt/xt-asyncpe/1.0c/bin/ftn: INFO: linux target is being used
pathf90-3.0 INTERNAL ERROR: /opt/pathscale/lib/3.0/mfef95 died due to signal 11
Please report this problem to <support@pathscale.com>.
Problem report saved as /home/n01/n01/fionanem/.ekopath-bugs/pathf90-3.0_error_HgpoTn.i
Please review the above file and, if possible, attach it to your problem report.
make: *** [../../../lib/oce/libopa.a(lib_mpp.o)] Error 1
Version 3.1 was also tested; again it fails on lib_mpp.F90, but with a slightly different error:
ftn -freeform -c -O0 -r8 -module ../../../lib -I../../../lib -I../../../lib/oce -I/home/n01/n01/fionanem/netcdf/3.6.2/include \
lib_mpp.f|| { if [ -f lib_mpp.L ] ; then mv lib_mpp.L ../../../tmp ; fi ; false ; exit ; }
/opt/xt-asyncpe/1.0c/bin/ftn: INFO: linux target is being used
Signal: Segmentation fault in IR->WHIRL Conversion phase.
"lib_mpp.f": Error: Signal Segmentation fault in phase IR->WHIRL Conversion -- processing aborted
*** Internal stack backtrace:
pathf90-3.1 INTERNAL ERROR: /opt/pathscale/lib/3.1/mfef95 died due to signal 4
make: *** [../../../lib/oce/libopa.a(lib_mpp.o)] Error 1
These errors suggest an internal problem with the PathScale compiler and as such will need to be referred back to the compiler developers for a bug fix. This has been submitted as HECToR query Q29941 and is currently under investigation.
With the PGI compiler, it transpires that the -Msave flag has the side-effect of initialising variables to zero. Compiling with this flag results in the code hanging or taking an inordinate amount of time to run (one time step completed in an hour on 256 processors), which makes its use impractical.
We have also tried compiling with the -Mbounds flag, which performs array bounds checking at compile time and at run time. Running the executable built with -Mbounds caused the code to crash with an error message stating that one of the array indices was negative. The affected file was fldread.F90. Removing -Mbounds and re-running whilst writing out the affected indices demonstrated that no negative values occur. It is therefore possible that the -Mbounds flag has altered the code in such a way as to create or highlight a new problem which doesn't appear when the flag is omitted. The results of this test are somewhat inconclusive.
The Totalview debugger should be able to provide some helpful information, e.g. by tracing through the code or watching variables. Unfortunately, it has so far not been possible to obtain any symbolic information when the nested version of NEMO (or indeed any code) is compiled with the PGI compiler. The problem has been replicated with a simple "hello world" type program and filed as a bug with Cray/Totalview via HECToR query Q22386. The PathScale compiler does not suffer from this problem, but as we cannot compile the code with PathScale this doesn't help.
A number of attempts have been made to compile the code on the TDS as this has the latest versions of the compilers, system libraries, operating system etc. However, no PathScale licence is present on the TDS and the Totalview licence also appears to be invalid and therefore limited progress was possible on the TDS.