Next: Performance of NEMO V3.0 Up: NEMO V3.0 Previous: Compilation   Contents


Performance for different compiler flags

A number of different compiler flags have been tested for NEMO V3.0. The results are summarised in Table 10. These tests were all carried out using a 16 by 16 processor grid with the land cells removed, that is jpni=16, jpnj=16 and jpnij=221. Both cores were used for all tests, i.e. mppnppn=2 was specified in the batch script.


Table 10: Runtime for 60 time steps for different compiler flags for the PGI compiler suite. Version 7.1.4 was used unless stated otherwise. All tests were run with jpni=16, jpnj=16 and jpnij=221.

Compiler flags | Time for 60 steps (seconds), or failure mode
-O0 | 163.520
-O1 | 157.123
-O2 | 138.382
-O3 | 139.466
-O4 | 137.642
-fast -O3 | fails with segmentation violation
-fast -O3 (PGI 7.2.3) | fails with segmentation violation
-O2 -Munroll=c:1 | 139.568
-O2 -Munroll=c:1 -Mnoframe | 138.862
-O2 -Munroll=c:1 -Mnoframe -Mlre | fails at step 1
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline | fails at step 1
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse | segmentation fault
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse | segmentation fault
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align | segmentation fault
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz | segmentation fault
-O2 -Munroll=c:1 -Mnoframe -Mautoinline -Mscalarsse -Mcache_align -Mflushz | runs??
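
The timings in Table 10 can be converted into relative changes in runtime. The following is a minimal sketch of that arithmetic; the dictionary simply copies the -O0 to -O4 rows of the table, and the helper name percent_faster is a hypothetical choice made for illustration.

```python
# Timings (seconds for 60 time steps) copied from the -O0 to -O4 rows of Table 10.
timings = {
    "-O0": 163.520,
    "-O1": 157.123,
    "-O2": 138.382,
    "-O3": 139.466,
    "-O4": 137.642,
}


def percent_faster(baseline, candidate):
    """Percentage reduction in runtime of `candidate` relative to `baseline`.

    A negative value means `candidate` is slower than `baseline`.
    """
    return 100.0 * (timings[baseline] - timings[candidate]) / timings[baseline]


# Compare every optimisation level against the unoptimised build.
for flag in ("-O1", "-O2", "-O3", "-O4"):
    print(f"{flag} vs -O0: {percent_faster('-O0', flag):6.2f}% change in runtime")

# Compare the higher levels against -O2.
for flag in ("-O3", "-O4"):
    print(f"{flag} vs -O2: {percent_faster('-O2', flag):6.2f}% change in runtime")
```

Each configuration appears once in the table, so differences of less than one percent may well lie within run-to-run variation.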


Increasing the level of optimisation from -O0 to -O2 reduces the runtime from 163.5 to 138.4 seconds, an improvement of around 15%. Going beyond -O2 makes little further difference: -O4 is less than 1% faster than -O2, and -O3 is marginally slower. The -fast flag results in a segmentation violation. As this flag invokes a number of different optimisations, we tested each of these in turn to ascertain which particular flags cause the problem. The command pgf90 -help -fast lists the optimisations invoked by -fast, e.g.

fionanem@nid15879:~> pgf90 -help -fast
Reading rcfile /opt/pgi/7.1.4/linux86-64/7.1-4/bin/.pgf90rc
-fast     Common optimizations; 
          includes -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
          == -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
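
The procedure of testing the flags in turn can be sketched as follows. This is an illustrative Python snippet, not the script used for the tests: it extracts the flags from the help text shown above and prints the cumulative flag sets that correspond to the rows of Table 10. The function names expand_fast and cumulative_sets are hypothetical.

```python
# Help text as printed by `pgf90 -help -fast` above (PGI 7.1.4).
help_text = """\
-fast     Common optimizations;
          includes -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
          == -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
"""


def expand_fast(text):
    """Return the individual flags that -fast expands to, in listed order."""
    # Keep only tokens that look like flags, and drop -fast itself.
    return [tok for tok in text.split() if tok.startswith("-") and tok != "-fast"]


def cumulative_sets(flags):
    """Yield flag strings that add one flag at a time, in the given order."""
    for n in range(1, len(flags) + 1):
        yield " ".join(flags[:n])


flags = expand_fast(help_text)
for flag_set in cumulative_sets(flags):
    # Each line is one candidate set of compiler flags to build and run with.
    print(flag_set)
```

Adding one flag at a time means that the first set which fails identifies a problematic flag; that flag can then be dropped and the remaining flags tested again, as in the final row of Table 10.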

The -Munroll=c:1 flag enables loop unrolling, with c:1 specifying that loops with a constant trip count of 1 or less are completely unrolled. The -Mnoframe flag stops the compiler from setting up a true stack frame pointer for each function. The -Mlre flag enables loop-carried redundancy elimination, i.e. expressions that would be recomputed in successive iterations of a loop are instead computed once and carried over to the next iteration. The -Mautoinline flag enables automatic function inlining in C/C++ and thus does not apply to NEMO. The -Mvect=sse flag allows the vectoriser to generate SSE instructions. The -Mscalarsse flag generates scalar SSE code using the xmm registers; this flag also implies -Mflushz. The -Mcache_align flag aligns objects on cache-line boundaries. The -Mflushz flag sets SSE arithmetic to ``flush-to-zero'' mode, in which results too small to be represented as normalised numbers are set to zero.
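
To illustrate what loop-carried redundancy elimination does, the sketch below applies the transformation by hand to a small loop. It is written in Python purely for illustration; it is not NEMO code and does not reflect how the PGI compiler implements -Mlre. The function names are hypothetical.

```python
def pair_sums_naive(a):
    """Return b[i] = a[i]**2 + a[i+1]**2 for each adjacent pair.

    Every interior element is squared twice: once as a[i+1] in one
    iteration and again as a[i] in the following iteration.
    """
    return [a[i] * a[i] + a[i + 1] * a[i + 1] for i in range(len(a) - 1)]


def pair_sums_lre(a):
    """Same result, with the redundant square carried between iterations."""
    if len(a) < 2:
        return []
    out = []
    carried = a[0] * a[0]          # stands in for the "previous" iteration
    for i in range(len(a) - 1):
        nxt = a[i + 1] * a[i + 1]  # the only new square in this iteration
        out.append(carried + nxt)
        carried = nxt              # loop-carried temporary for iteration i+1
    return out


values = [1.0, 2.0, 3.0, 4.0]
print(pair_sums_naive(values))
print(pair_sums_lre(values))
```

Both functions return the same list. The transformation is only valid if the carried temporary still holds the correct value when the next iteration starts.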

From Table 10 we see that adding either -Mlre or -Mvect=sse causes the code to fail at runtime. None of the other flags invoked by -fast appears to cause significant problems. With -Mlre the zonal velocity becomes very large, suggesting that the loop-carried redundancy elimination may have removed a loop temporary that was actually required. The reason for the failure when -Mvect=sse is added is unknown. Ultimately, the additional flags do not give a significant performance improvement over -O2 or -O3, and thus -O3 will be used in future.

