
MERGED nested model

Having found the cause of, and a solution to, the crashes in the BASIC nested model, we now turn to the MERGED model. This model is more complex because it allows for two levels of nesting and also includes the code required by NOCS to carry out their research.

The MERGED AGRIF model also failed initially, with an error of the following form (from the ocean.output file):

 ===>>> : E R R O R

  stpctl: the zonal velocity is larger than 20 m/s
 kt=     4 max abs(U):   50.06    , i j k:  243  56  53
and also from the 1_ocean.output file:
 ===>>> : E R R O R

  stpctl: the zonal velocity is larger than 20 m/s
 kt=    16 max abs(U):   461.8    , i j k:  314   3  30
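
When several output files have to be compared, the diagnostic line can be pulled out mechanically. The following is a minimal sketch, assuming the line layout is exactly as in the two excerpts above; the function name parse_stpctl and the dictionary keys are illustrative and not part of NEMO.

```python
import re

# Pattern for the diagnostic line that follows the stpctl error banner.
# The layout is taken from the excerpts above; other NEMO versions may differ.
STPCTL_LINE = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([0-9.eE+-]+)\s*,\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stpctl(text):
    """Return a list of dicts, one per velocity diagnostic line found in text."""
    hits = []
    for match in STPCTL_LINE.finditer(text):
        kt, umax, i, j, k = match.groups()
        hits.append(
            {"kt": int(kt), "max_abs_u": float(umax),
             "i": int(i), "j": int(j), "k": int(k)}
        )
    return hits


# The first excerpt above, used as sample input.
sample = """ ===>>> : E R R O R

  stpctl: the zonal velocity is larger than 20 m/s
 kt=     4 max abs(U):   50.06    , i j k:  243  56  53
"""
print(parse_stpctl(sample))
```

Applied to the contents of ocean.output and 1_ocean.output, this gives the time step, the maximum velocity and the grid indices for each grid in a form that is easy to tabulate.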

According to the output files (*ocean.output and *time.step), the model stops at time steps 4, 16 and 64 on the three grids, i.e. the coarse grid and the first and second levels of nesting respectively.

As with the BASIC model, different compiler options were explored to see whether the optimisation level was a factor in causing the velocities to grow uncontrollably. The MERGED model was compiled at successively lower optimisation levels, from -O3 down to -O0. At every optimisation level the code stops in the manner described above.

The time step, set by the value of rdt in the namelist, 1_namelist and 2_namelist files, was also varied to see whether reducing it gave any improvement. Table 15 summarises the results.

Table 15: Time steps at which the MERGED AGRIF model crashes for different time steps

Time step, rdt in seconds              Step at which the model stops
namelist   1_namelist   2_namelist     time.step   1_time.step   2_time.step
3600.0     900.0        225.0          4           16            62
1600.0     400.0        100.0          4           16            64
400.0      100.0        25.0           6           24            96
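
The entries of Table 15 can be cross-checked by converting each step count into elapsed model time. The sketch below assumes that elapsed time is simply rdt multiplied by the step number reported in the corresponding time.step file; the names TABLE_15 and elapsed_seconds are illustrative.

```python
# Rows of Table 15: (rdt per grid in seconds, step at which each grid stopped).
# Grid order in both tuples: namelist, 1_namelist, 2_namelist.
TABLE_15 = [
    ((3600.0, 900.0, 225.0), (4, 16, 62)),
    ((1600.0, 400.0, 100.0), (4, 16, 64)),
    ((400.0, 100.0, 25.0), (6, 24, 96)),
]


def elapsed_seconds(rdts, steps):
    """Model time reached by each grid, assuming time = rdt * step number."""
    return [rdt * n for rdt, n in zip(rdts, steps)]


for rdts, steps in TABLE_15:
    print(rdts, "->", elapsed_seconds(rdts, steps))
```

Under that assumption, the second and third rows give the same elapsed time on all three grids (6400 s and 2400 s respectively). In the first row the two coarser grids reach 14400 s and the finest grid 13950 s. The three step counts in each row therefore appear to describe a single point in model time rather than three unrelated failures.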

From table 15 it seems that, regardless of the time step used, the MERGED AGRIF model continues to crash early in the model run. This suggests that, unlike the BASIC model, the choice of time step is not the issue. The zonal velocity was extracted as described in section 10.2. Figure 13 shows the zonal velocity plotted as a function of elapsed model time for three different values of the time step, rdt. From figure 13 it is clear that the problem lies with the outermost (i.e. coarsest) model (namelist); the model using 1_namelist (i.e. the first level of nesting) is also affected, but to a lesser degree.

Figure 13: Zonal velocity against model time for the uppermost model, represented by the namelist file.
[Images: merged_namelist, merged_namelist1, merged_namelist2]

Further investigation of the output files suggests that something actually goes wrong prior to time step 4. The mpp_output.0* files are found to contain NaN values as early as time step 2. The cause of these NaN values is currently unknown but clearly they should not be present if the code is running correctly.
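
A search of this kind is easy to automate. The following is a minimal sketch, assuming the mpp_output.0* files are plain text and that NaN values are printed as a separate token such as "NaN"; the function names and the default glob pattern are illustrative.

```python
import re
from pathlib import Path

# Matches NaN as a separate token in any letter case ("NaN", "nan", "NAN"),
# but not as part of a longer word or identifier.
NAN_TOKEN = re.compile(r"(?<![A-Za-z0-9_])nan(?![A-Za-z0-9_])", re.IGNORECASE)


def first_nan_line(lines):
    """Return (line_number, line) for the first line holding a NaN token, else None."""
    for number, line in enumerate(lines, start=1):
        if NAN_TOKEN.search(line):
            return number, line.rstrip("\n")
    return None


def scan_files(directory, pattern="mpp_output.0*"):
    """Map each matching file name to its first NaN line (or None if clean)."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(errors="replace") as handle:
            results[path.name] = first_nan_line(handle)
    return results
```

Calling scan_files on the run directory reports, per process output file, the first line on which a NaN appears, which helps to narrow down the time step and the process on which the values first go wrong.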

Locating the cause of these NaN values will likely be fundamental to getting the MERGED model to run on HECToR. Unfortunately, the PGI compiler does not allow direct trapping of NaN values. The -Ktrap flag can be used to trap a number of other numerical problems, e.g. denormalised operand, divide-by-zero, overflow, underflow and inexact result. Trapping NaN values may still be possible by instrumenting the code with the isnan function, a logical function that returns true if a NaN value is detected and false otherwise. To use it, isnan() calls must be inserted at every point where a NaN value is suspected.
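
The instrumentation pattern itself is simple. The model is written in Fortran, so the Python sketch below is only an illustration of the idea: a check is called after each suspect computation and stops at the first NaN, reporting where it was found. The names find_first_nan and check_field, and the nested-list field layout, are hypothetical.

```python
import math


def find_first_nan(field):
    """Return the (i, j, k) index of the first NaN in a nested list, or None.

    Indices are 1-based to match the i j k convention of the stpctl messages.
    """
    for i, plane in enumerate(field, start=1):
        for j, row in enumerate(plane, start=1):
            for k, value in enumerate(row, start=1):
                if math.isnan(value):
                    return i, j, k
    return None


def check_field(name, field, step):
    """Raise as soon as a NaN is seen, naming the field, step and location."""
    location = find_first_nan(field)
    if location is not None:
        raise FloatingPointError(
            f"NaN in {name} at step {step}, i j k = {location}"
        )


# Hypothetical 2 x 2 x 2 field with one NaN planted at i=2, j=1, k=2.
u = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, float("nan")], [0.7, 0.8]]]
print(find_first_nan(u))
```

Placing such checks progressively earlier in the time step narrows down the routine in which the first NaN is produced, at the cost of scanning whole arrays on every call.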

The PathScale compiler, however, provides options which may help with NaN detection particularly if the NaN values are arising from uninitialised values. The two flags of interest are:

Various attempts have been made to compile the MERGED model with the PathScale compiler, but the compiler fails with an internal error. The output from version 3.0 is shown below:

ftn -freeform -c -Dkey_agrif -Dkey_agrif_nolim -Dkey_trabbl_dif
-Dkey_mpp_mpi -Dkey_orca_r1=64 -Dkey_lim2 -Dkey_dynspg_flt 
-Dkey_diaeiv -Dkey_ldfslp -Dkey_traldf_c2d -Dkey_traldf_eiv 
-Dkey_dynldf_c3d -Dkey_dtatem -Dkey_dtasal -Dkey_tradmp 
-Dkey_trabbc -Dkey_zdftke -Dkey_zdfddm -O0 -r8
-module ../../../lib -I../../../lib -I../../../lib/oce
-I/home/n01/n01/fionanem/netcdf/3.6.2/include \
OPAFILES/lib_mpp.F90 || { if [ -f lib_mpp.L ] ; 
then mv lib_mpp.L
../../../tmp ; fi ; false ; exit ; }
/opt/xt-asyncpe/1.0c/bin/ftn: INFO: linux target is being used
pathf90-3.0 INTERNAL ERROR: /opt/pathscale/lib/3.0/mfef95 died 
due to signal 11

Please report this problem to <>.
Problem report saved as
Please review the above file and, if possible, attach it to 
your problem report.
make: *** [../../../lib/oce/libopa.a(lib_mpp.o)] Error 1

Version 3.1 was also tested. It again fails on the file lib_mpp.F90, but with a slightly different error:

ftn -freeform -c -O0 -r8 -module ../../../lib -I../../../lib
-I../../../lib/oce -I/home/n01/n01/fionanem/netcdf/3.6.2/include \
lib_mpp.f|| { if [ -f lib_mpp.L ] ; then mv lib_mpp.L ../../../tmp ; 
fi ;
false ; exit ; }
/opt/xt-asyncpe/1.0c/bin/ftn: INFO: linux target is being used
Signal: Segmentation fault in IR->WHIRL Conversion phase.
"lib_mpp.f": Error: Signal Segmentation fault in phase IR->WHIRL 
Conversion -- processing aborted
*** Internal stack backtrace:
pathf90-3.1 INTERNAL ERROR: /opt/pathscale/lib/3.1/mfef95 died due 
to signal 4
make: *** [../../../lib/oce/libopa.a(lib_mpp.o)] Error 1

These errors suggest an internal problem with the PathScale compiler and as such will need to be referred back to the compiler developers for a bug fix. This has been submitted as HECToR query Q29941 and is currently under investigation.

With the PGI compiler, it transpires that the -Msave flag has the side effect of initialising variables to zero. Compiling with this flag results in the code either hanging or taking an inordinate amount of time to run (one time step completed in an hour on 256 processors), which makes its use impractical.

We have also tried compiling with the -Mbounds flag, which performs array bounds checking at compile time and at run time. The executable compiled with -Mbounds crashed with an error message stating that one of the array indices was negative. The affected file was fldread.F90. Removing -Mbounds and re-running while writing out the affected indices showed that no negative values occur. It is therefore possible that the -Mbounds flag alters the code in a way that creates or exposes a problem which does not appear when the flag is omitted. The results of this test are inconclusive.

The Totalview debugger should be able to provide some helpful information, e.g. by tracing through the code and watching variables. Unfortunately, it has so far not been possible to obtain any symbolic information when the nested version of NEMO (or indeed any code) is compiled with the PGI compiler. The problem has been replicated with a simple hello-world type program and filed as a bug with Cray/Totalview via HECToR query Q22386. The PathScale compiler does not suffer from this problem, but as we cannot compile the code with PathScale this does not help.

A number of attempts have been made to compile the code on the TDS, as this has the latest versions of the compilers, system libraries, operating system etc. However, no PathScale licence is present on the TDS and the Totalview licence also appears to be invalid, so only limited progress was possible there.
