Table 10: Performance for different compiler flags

Compiler flags	Time for 60 steps (seconds)
`-O0`	163.520
`-O1`	157.123
`-O2`	138.382
`-O3`	139.466
`-O4`	137.642
`-fast -O3`	fails with segmentation violation
`-fast -O3` on PGI 7.2.3	fails with segmentation violation
`-O2 -Munroll=c:1`	139.568
`-O2 -Munroll=c:1 -Mnoframe`	138.862
`-O2 -Munroll=c:1 -Mnoframe -Mlre`	fails step 1
`-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline`	fails step 1
`-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse`	seg fault
`-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse`	seg fault
`-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align`	seg fault
`-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz`	seg fault
`-O2 -Munroll=c:1 -Mnoframe -Mautoinline -Mscalarsse -Mcache_align -Mflushz`	runs??
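The relative differences between the successful runs are easier to judge as percentages. The following is a minimal sketch that uses only the timings listed in the table; the helper names `percent_faster` and `report` are illustrative and not part of any existing tool.

```python
# Timings in seconds for 60 steps, copied from the table above.
# Only the configurations that completed with a recorded time are included.
timings = {
    "-O0": 163.520,
    "-O1": 157.123,
    "-O2": 138.382,
    "-O3": 139.466,
    "-O4": 137.642,
    "-O2 -Munroll=c:1": 139.568,
    "-O2 -Munroll=c:1 -Mnoframe": 138.862,
}


def percent_faster(baseline: float, candidate: float) -> float:
    """Return by how many percent `candidate` is shorter than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


def report(baseline_flags: str = "-O0") -> list[tuple[str, float]]:
    """Pair each flag set with its percentage gain over the chosen baseline."""
    base = timings[baseline_flags]
    return [
        (flags, round(percent_faster(base, seconds), 1))
        for flags, seconds in timings.items()
    ]


if __name__ == "__main__":
    for flags, gain in report():
        print(f"{flags:<30} {gain:5.1f}% faster than -O0")
```

Passing a different baseline, for example `report("-O2")`, shows how small the differences among the higher optimisation levels are; negative values indicate a slower run than the baseline.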

Increasing the optimisation level from -O0 to -O2 reduces the time for 60 steps from 163.520 to 138.382 seconds, a reduction of about 15%. Moving from -O2 to -O3 or -O4 changes the runtime by less than 1%. The -fast flag results in a segmentation violation. As this flag invokes a number of different optimisations, we tested each of these in turn to ascertain which particular flags cause the problem. The command pgf90 -help -fast lists the optimisations invoked by -fast, e.g.

The -Munroll=c:1 flag enables loop unrolling, with c:1 specifying that loops with a constant iteration count of 1 or less are completely unrolled. The -Mnoframe flag prevents the compiler from generating code that sets up a true stack frame pointer for every function. The -Mlre flag enables loop-carried redundancy elimination, i.e. redundant expressions and memory references that are carried from one loop iteration to the next are removed. The -Mautoinline flag automatically enables function inlining in C/C++ and thus does not apply to NEMO, which is written in Fortran. The -Mvect=sse flag allows vector pipelining to be used with SSE instructions. The -Mscalarsse flag generates scalar SSE code with xmm registers; this flag also implies -Mflushz. The -Mcache_align flag ensures that objects are aligned on cache-line boundaries. The -Mflushz flag sets the SSE unit to "flush-to-zero" mode, in which denormal results (numbers too close to zero to be represented at full precision) are automatically replaced by zero.
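To make loop-carried redundancy elimination concrete, the sketch below shows by hand the kind of rewrite such an optimisation performs. It is an illustration only: the two-point sum and the function names are hypothetical, it is not NEMO code, and it is not the output of the PGI compiler.

```python
def stencil_naive(b):
    """Compute a[i] = b[i] + b[i+1]; every iteration reads two elements of b."""
    a = [0.0] * (len(b) - 1)
    for i in range(len(b) - 1):
        a[i] = b[i] + b[i + 1]
    return a


def stencil_lre(b):
    """Same result, but the value read as b[i+1] in one iteration is
    carried over and reused as b[i] in the next, so each iteration
    performs only one new read of b."""
    n = len(b) - 1
    a = [0.0] * max(n, 0)
    if n <= 0:
        return a
    carried = b[0]          # value carried across iterations
    for i in range(n):
        nxt = b[i + 1]      # the only read of b in this iteration
        a[i] = carried + nxt
        carried = nxt       # becomes b[i] for the next iteration
    return a


if __name__ == "__main__":
    data = [1.0, 2.0, 4.0, 8.0]
    print(stencil_naive(data))
    print(stencil_lre(data))
```

Both functions return the same list for the same input. The rewrite is only valid if nothing modifies the carried value between iterations, which is why an incorrect application of this optimisation can silently change results.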

From Table 10 we see that the addition of the flags -Mlre and -Mvect=sse causes the code to fail at runtime. None of the other flags invoked by -fast appears to cause significant issues. The -Mlre flag causes the zonal velocity to become very large, suggesting that the loop-carried redundancy elimination may have removed a loop temporary that was actually required. The reason for the failure when -Mvect=sse is added is unknown. Ultimately, the additional flags do not give significant performance improvements over -O2 or -O3, and thus -O3 will be used in future.
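The flag-by-flag procedure used to isolate the problem flags can be sketched as follows. This is a minimal illustration, assuming the flags are added in the order they appear in the table; `run_ok` is a hypothetical callable standing in for "build and run the model with these flags and report whether it completed", and no real compiler is invoked here.

```python
BASE = "-O2"

# Components of -fast, in the order they were added in the table.
FAST_COMPONENTS = [
    "-Munroll=c:1",
    "-Mnoframe",
    "-Mlre",
    "-Mautoinline",
    "-Mvect=sse",
    "-Mscalarsse",
    "-Mcache_align",
    "-Mflushz",
]


def cumulative_flag_sets(base, components):
    """Return flag strings that add one component at a time to `base`."""
    sets = []
    current = [base]
    for flag in components:
        current.append(flag)
        sets.append(" ".join(current))
    return sets


def without(flags, suspects):
    """Return `flags` with every flag listed in `suspects` removed."""
    return " ".join(f for f in flags.split() if f not in suspects)


def first_failing_flag(base, components, run_ok):
    """Return the first added flag for which `run_ok` (hypothetical) fails."""
    for flag, flags in zip(components, cumulative_flag_sets(base, components)):
        if not run_ok(flags):
            return flag
    return None


if __name__ == "__main__":
    for flags in cumulative_flag_sets(BASE, FAST_COMPONENTS):
        print(flags)
    full = cumulative_flag_sets(BASE, FAST_COMPONENTS)[-1]
    print(without(full, {"-Mlre", "-Mvect=sse"}))
```

Removing the two suspect flags from the full set reproduces the final row of the table. Because a cumulative test stops being informative after the first failure, each later suspect has to be checked again with the earlier failing flag removed.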