Removing the land-only grid cells
So far we have considered decompositions in which all the grid cells are
used, i.e. those where the code has jpnij = jpni x jpnj. However,
many decompositions give rise to grid cells which contain only land. These
land-only cells are essentially redundant in an ocean model and can be
removed. In the code this means that the value of jpnij can be
reduced such that jpnij <= jpni x jpnj. It is anticipated that
removing land-only cells may improve the performance of the code, since branches
into land-only regions no longer take place and any I/O associated with
the land cells is also removed. Furthermore, removing the land cells reduces
the number of processors needed and hence the number of AU's required.
The NEMO code does not remove the land cells automatically: for a chosen
decomposition the user must determine separately how many cells contain
only land. A tool written by Andrew Coward can be used to determine the
number of active (ocean-containing) and dead (land-only) cells. The
procedure is as follows:
- Use the nocsprocmap code to generate the layout.dat
file for the required decomposition. For example, running the command
acc/NTOOLS/NOCSPROCMAP/nocspmap_r25 -f bathy_meter.nc
-i 16 -j 16 -s
gives the number of active (i.e. ocean-containing) regions for a jpni = 16
by jpnj = 16 processor grid.
- Alter the appropriate line of par_oce.F90 so that the value
of jpnij is reduced such that the land-only squares are removed, as
sketched below. For a 16 by 16 grid there are 35 land-only squares, and thus
jpnij = 221 instead of 256.
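As a rough illustration only (the exact form of the declarations in
par_oce.F90 differs between NEMO versions, so the lines below are an assumed
sketch rather than the actual source), the change for the 16 by 16 example
amounts to something like:

   ! Sketch of the relevant parameters in par_oce.F90 (illustrative only)
   INTEGER, PUBLIC, PARAMETER ::   jpni  = 16    ! number of processors along i
   INTEGER, PUBLIC, PARAMETER ::   jpnj  = 16    ! number of processors along j
   INTEGER, PUBLIC, PARAMETER ::   jpnij = 221   ! number of local domains actually used:
                                                 ! jpni*jpnj = 256 minus the 35 land-only cells

The value given to jpnij should match the number of active regions reported
by the nocspmap tool, otherwise the domain decomposition set-up is likely to
fail at start-up.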
Table 3 gives the number of land-only cells for a
variety of processor grid configurations. The reduction in the number of
processors required is generally around 10%; for very large processor
counts (>256) the reduction can be considerably larger, reaching as much as 25%.
Table 3:
Number of land-only squares for a variety of processor grids.
The percentage saved is the fraction of cells removed as land-only and
corresponds to the reduction in the number of AU's required for the
computation.
jpni   jpnj   Total cells   Land-only cells   Percentage saved
  6      6        36               0               0.00%
  7      7        49               1               2.04%
  8      8        64               2               3.13%
  9      9        81               6               7.41%
 10     10       100              10              10.00%
 11     11       121              13              10.74%
 12     12       144              14               9.72%
 13     13       169              21              12.43%
 14     14       196              22              11.22%
 15     15       225              29              12.89%
 16     16       256              35              13.67%
 20     20       400              65              16.25%
 30     30       900             193              21.44%
 32     32      1024             230              22.46%
 40     40      1600             398              24.88%
 16      8       128              11               8.59%
 32     16       512              92              17.97%
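The percentage saved is simply the ratio of land-only cells to the total
number of cells. The following minimal sketch (a stand-alone program with
the 16 by 16 numbers from Table 3 hard-wired as an example) reproduces the
arithmetic behind the table:

   program land_cell_saving
      implicit none
      integer :: jpni, jpnj, nland, jpnij
      real    :: saved

      jpni  = 16                       ! processors along i
      jpnj  = 16                       ! processors along j
      nland = 35                       ! land-only cells reported for this grid
      jpnij = jpni*jpnj - nland        ! active domains actually required
      saved = 100.0 * real(nland) / real(jpni*jpnj)

      write(*,'(a,i4)')     'jpnij            = ', jpnij
      write(*,'(a,f6.2,a)') 'percentage saved = ', saved, '%'
   end program land_cell_saving

For the 16 by 16 grid this gives jpnij = 221 and a saving of 13.67%, as in
Table 3.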
We now investigate whether removing the land-only cells has any impact on
the runtime of the NEMO code. We hope that avoiding branches into land-only
regions, and the I/O associated with the land cells, will reduce the
runtime. For this test we have considered only 128, 256, 512 and 1024
processor grids. The results are given in Table 4.
Table 4:
Runtime comparison for 60 time steps with and without land-only
squares included, on 128, 256, 512 and 1024 processor grids.
jpni   jpnj   jpnij   Time for 60 steps (seconds)
 32     32    1024        110.795
 32     32     794        100.011
 16     32     512        117.642
 16     32     420        111.282
 16     16     256        146.607
 16     16     221        136.180
  8     16     128        236.182
  8     16     117        240.951
From Table 4 we can see that for 256 processors and
above, removing the land-only squares reduces the total runtime by
up to 10 seconds, which corresponds to a reduction of around 7-10%.
For the 128-processor run, removal of the land-only cells actually gives a
small increase in the total runtime. This difference is within normal
repeatability errors and could be a result of heavy load on the system
when the test was run. As the runtime does not improve greatly
with the removal of the land-only cells, the main motivation for removing
them is to reduce the number of AU's used for each calculation.
Assuming the runtime is not affected detrimentally, the reduction in
AU usage will be as given in Table 3.
The times given in Table 4 are the times that the NEMO
code reports when it writes the information from time step 60 to disk. This,
however, is not the whole story. At the end of the run, NEMO also dumps
out the restart files required to restart the computation from the final
time step. These restart files are significantly larger than the files
output at each individual time step and thus take a reasonable amount
of time to write out to disk. Unfortunately the code does not output any
timings which include the writing of these restart files. One way to
estimate the time taken to write out these restart files is to
look at the actual time taken by the parallel run as reported by the batch
system. The PBS output file gives the walltime in hh:mm:ss. By subtracting
the time taken for 60 steps from the walltime, as sketched below, we obtain
an estimate of the time taken over and above the step-by-step output, i.e.
an estimate of the time taken to read in the input data and output the final
restart files. To obtain accurate time estimates, timers should be inserted
into the code, but as a first pass this method will let us find out whether
there is any variation with processor count. The amount of time that NEMO
spends in I/O and initialisation will be discussed in Section 6.9.
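As a first-pass sketch of this subtraction (the walltime string below is a
hypothetical value for illustration, not one taken from the PBS logs), the
conversion from hh:mm:ss and the subtraction of the 60-step time could be
done as follows:

   program restart_io_estimate
      implicit none
      character(len=8) :: walltime = '00:02:31'  ! hh:mm:ss from the PBS output (hypothetical value)
      real             :: t60      = 110.795     ! time for 60 steps reported by NEMO (1024-domain run, Table 4)
      integer          :: hh, mm, ss
      real             :: overhead

      ! Convert the batch-system walltime to seconds and subtract the
      ! reported step time to estimate initialisation + restart-file I/O.
      read(walltime,'(i2,1x,i2,1x,i2)') hh, mm, ss
      overhead = real(3600*hh + 60*mm + ss) - t60
      write(*,'(a,f8.1,a)') 'estimated initialisation and restart I/O time: ', overhead, ' s'
   end program restart_io_estimate

The same subtraction applied to each processor count gives a rough picture
of how this overhead varies with the size of the run.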