Removing the land-only grid cells
So far we have considered decompositions in which all the grid cells are
used, i.e. those where the code has jpnij = jpni x jpnj. However,
many decompositions give rise to grid cells which contain only land. These
land-only cells are essentially redundant in an ocean model and can be
removed. In the code this means that the value of jpnij can be
reduced such that jpnij <= jpni x jpnj. It is anticipated that
removing land-only cells may improve the performance of the code, since branches
into land-only regions no longer take place and any I/O associated with
the land cells is also removed. Furthermore, removing the land cells reduces
the number of processors needed and hence the number of AU's required.
The NEMO code does not remove the land cells automatically: for a chosen
decomposition the user must determine separately how many cells contain
only land. A tool written by Andrew Coward can be used to determine the
number of active (ocean-containing) and dead (land-only) cells. The
procedure is as follows:
- Use the nocsprocmap code to generate the layout.dat
file for the required decomposition. For example, running the command
acc/NTOOLS/NOCSPROCMAP/nocspmap_r25 -f bathy_meter.nc
-i 16 -j 16 -s
gives the number of active (i.e. ocean-containing) regions for a jpni = 16
by jpnj = 16 processor grid.
- Alter the appropriate line of par_oce.F90 so that the value
of jpnij is reduced such that the land-only squares are removed, as
sketched below. For a 16 by 16 grid there are 35 land-only squares, and thus
jpnij = 221 instead of 256.
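As a rough illustration only (the exact form of the declarations in
par_oce.F90 differs between NEMO versions, so the lines below are an assumed
sketch rather than the actual source), the change for the 16 by 16 example
amounts to something like:

   ! Sketch of the relevant parameters in par_oce.F90 (illustrative only)
   INTEGER, PUBLIC, PARAMETER ::   jpni  = 16    ! number of processors along i
   INTEGER, PUBLIC, PARAMETER ::   jpnj  = 16    ! number of processors along j
   INTEGER, PUBLIC, PARAMETER ::   jpnij = 221   ! number of local domains actually used:
                                                 ! jpni*jpnj = 256 minus the 35 land-only cells

The value given to jpnij should match the number of active regions reported
by the nocspmap tool, otherwise the domain decomposition set-up is likely to
fail at start-up.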
Table 3 gives the number of land-only cells for a
variety of processor grid configurations. The reduction in the number of
processors required is generally around 10%; for very large processor
counts (>256) the reduction can be considerably larger, reaching as much as 25%.
Table 3:
Number of land-only squares for a variety of processor grids.
The percentage saved is the fraction of cells removed as land-only and
corresponds to the reduction in the number of AU's required for the
computation.
jpni   jpnj   Total cells   Land-only cells   Percentage saved
  6      6        36               0               0.00%
  7      7        49               1               2.04%
  8      8        64               2               3.13%
  9      9        81               6               7.41%
 10     10       100              10              10.00%
 11     11       121              13              10.74%
 12     12       144              14               9.72%
 13     13       169              21              12.43%
 14     14       196              22              11.22%
 15     15       225              29              12.89%
 16     16       256              35              13.67%
 20     20       400              65              16.25%
 30     30       900             193              21.44%
 32     32      1024             230              22.46%
 40     40      1600             398              24.88%
 16      8       128              11               8.59%
 32     16       512              92              17.97%
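The percentage saved is simply the ratio of land-only cells to the total
number of cells. The following minimal sketch (a stand-alone program with
the 16 by 16 numbers from Table 3 hard-wired as an example) reproduces the
arithmetic behind the table:

   program land_cell_saving
      implicit none
      integer :: jpni, jpnj, nland, jpnij
      real    :: saved

      jpni  = 16                       ! processors along i
      jpnj  = 16                       ! processors along j
      nland = 35                       ! land-only cells reported for this grid
      jpnij = jpni*jpnj - nland        ! active domains actually required
      saved = 100.0 * real(nland) / real(jpni*jpnj)

      write(*,'(a,i4)')     'jpnij            = ', jpnij
      write(*,'(a,f6.2,a)') 'percentage saved = ', saved, '%'
   end program land_cell_saving

For the 16 by 16 grid this gives jpnij = 221 and a saving of 13.67%, as in
Table 3.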
We now investigate whether removing the land-only cells has any impact on
the runtime of the NEMO code. We hope that avoiding branches into land-only
regions, and the I/O associated with the land cells, will reduce the
runtime. For this test we have considered only 128, 256, 512 and 1024
processor grids. The results are given in Table 4.
Table 4:
Runtime comparison for 60 time steps with and without land-only
squares included, on 128, 256, 512 and 1024 processor grids.
jpni   jpnj   jpnij   Time for 60 steps (seconds)
 32     32    1024        110.795
 32     32     794        100.011
 16     32     512        117.642
 16     32     420        111.282
 16     16     256        146.607
 16     16     221        136.180
  8     16     128        236.182
  8     16     117        240.951
From Table 4 we can see that for 256 processors and
above, removing the land-only squares reduces the total runtime by
up to 10 seconds, which corresponds to a reduction of around 7-10%.
For the 128-processor run, removal of the land-only cells actually gives a
small increase in the total runtime. This difference is within normal
repeatability errors and could be a result of heavy load on the system
when the test was run. As the runtime does not improve greatly
with the removal of the land-only cells, the main motivation for removing
them is to reduce the number of AU's used for each calculation.
Assuming the runtime is not affected detrimentally, the reduction in
AU usage will be as given in Table 3.
The times given in Table 4 are the times that the NEMO
code reports when it writes the information from time step 60 to disk. This,
however, is not the whole story. At the end of the run, NEMO also dumps
out the restart files required to restart the computation from the final
time step. These restart files are significantly larger than the files
output at each individual time step and thus take a reasonable amount
of time to write out to disk. Unfortunately the code does not output any
timings which include the writing of these restart files. One way to
estimate the time taken to write out these restart files is to
look at the actual time taken by the parallel run as reported by the batch
system. The PBS output file gives the walltime in hh:mm:ss. By subtracting
the time taken for 60 steps from the walltime, as sketched below, we obtain
an estimate of the time taken over and above the step-by-step output, i.e.
an estimate of the time taken to read in the input data and output the final
restart files. To obtain accurate time estimates, timers should be inserted
into the code, but as a first pass this method will let us find out whether
there is any variation with processor count. The amount of time that NEMO
spends in I/O and initialisation will be discussed in Section 6.9.
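As a first-pass sketch of this subtraction (the walltime string below is a
hypothetical value for illustration, not one taken from the PBS logs), the
conversion from hh:mm:ss and the subtraction of the 60-step time could be
done as follows:

   program restart_io_estimate
      implicit none
      character(len=8) :: walltime = '00:02:31'  ! hh:mm:ss from the PBS output (hypothetical value)
      real             :: t60      = 110.795     ! time for 60 steps reported by NEMO (1024-domain run, Table 4)
      integer          :: hh, mm, ss
      real             :: overhead

      ! Convert the batch-system walltime to seconds and subtract the
      ! reported step time to estimate initialisation + restart-file I/O.
      read(walltime,'(i2,1x,i2,1x,i2)') hh, mm, ss
      overhead = real(3600*hh + 60*mm + ss) - t60
      write(*,'(a,f8.1,a)') 'estimated initialisation and restart I/O time: ', overhead, ' s'
   end program restart_io_estimate

The same subtraction applied to each processor count gives a rough picture
of how this overhead varies with the size of the run.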