This benchmark is a simple extension of the 2D lid-driven cavity flow, which is the first OpenFOAM tutorial [9].
As in the 2D case, the top wall moves in the x-direction at a speed of 1 m s^{-1} whilst the remaining walls are stationary. The flow is assumed laminar and is solved on a uniform mesh using the icoFoam solver for laminar, isothermal, incompressible flow.
Unlike the 2D case, all output was switched off. Further, two cases were created for cubic meshes with different discretisations, namely 100^{3} and 200^{3}.
One of these discretisations is of particular interest as it has been employed elsewhere [10, 11] for benchmarking purposes, and is thus open to comparison.
To create the 3D cases, the blocks entry of the blockMeshDict file was changed to
hex (0 1 2 3 4 5 6 7) (100 100 100) simpleGrading (1 1 1)
and
hex (0 1 2 3 4 5 6 7) (200 200 200) simpleGrading (1 1 1)
respectively.
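For context, a minimal sketch of the surrounding blockMeshDict for the 100^{3} case, following the layout of the standard cavity tutorial; the convertToMeters factor and the vertex ordering are assumptions carried over from that tutorial, and the boundary section (moving top wall, fixed remaining walls) is omitted for brevity:

```
convertToMeters 0.1;

vertices
(
    (0 0 0)
    (1 0 0)
    (1 1 0)
    (0 1 0)
    (0 0 1)
    (1 0 1)
    (1 1 1)
    (0 1 1)
);

blocks
(
    hex (0 1 2 3 4 5 6 7) (100 100 100) simpleGrading (1 1 1)
);
```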
For both the 100^{3} and 200^{3} cases, the controlDict file contained
application       icoFoam;
startFrom         startTime;
startTime         0;
stopAt            endTime;
endTime           0.025;
deltaT            0.005;
writeControl      timeStep;
writeInterval     20;
purgeWrite        0;
writeFormat       ascii;
writePrecision    6;
writeCompression  uncompressed;
timeFormat        general;
timePrecision     6;
runTimeModifiable yes;
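The run length implied by these settings can be checked with a little arithmetic (a plain Python sketch of our own, not part of the OpenFOAM case):

```python
# Arithmetic implied by the controlDict settings above.
end_time = 0.025
delta_t = 0.005
write_interval = 20   # writeControl timeStep: interval measured in steps

n_steps = round(end_time / delta_t)   # number of time steps in the run
n_writes = n_steps // write_interval  # completed write intervals

print(n_steps)   # 5
print(n_writes)  # 0: no fields are written during a 5-step run,
                 # consistent with output being switched off
```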
For 8 cores, the decomposeParDict file contains the lines
numberOfSubdomains 8;

method simple;

simpleCoeffs
{
    n     (2 2 2);
    delta 0.001;
}
where
n (2 2 2);
denotes a cubic layout of 8 processors, np_{x}=2, np_{y}=2 and np_{z}=2, in a virtual grid. Many processor configurations were investigated; for 8 processors, for instance, (8 x 1 x 1), (1 x 8 x 1), (1 x 1 x 8), (4 x 2 x 1), (1 x 2 x 4), etc. Simulations were run for numberOfSubdomains = 1, 2, 4, 8, ....
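The candidate topologies for a given core count can be enumerated mechanically. A short Python sketch (our own illustration, not OpenFOAM code):

```python
def topologies(p):
    """All ordered (np_x, np_y, np_z) layouts with np_x * np_y * np_z == p."""
    return [(x, y, p // (x * y))
            for x in range(1, p + 1) if p % x == 0
            for y in range(1, p // x + 1) if (p // x) % y == 0]

layouts = topologies(8)
print(len(layouts))          # 10 candidate layouts for 8 cores
print((2, 2, 2) in layouts)  # True: the cubic layout used above
```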
It has previously been reported [10] that varying the processor virtual topology when running a 3D lid-driven cavity flow case on a Cray XT4/XT5 has little effect on performance. Our investigation showed that, to a first approximation, this is indeed the case. However, we found a slight performance gain when np_{z} is kept as low as possible. This is because C++ stores the last dimension of an array contiguously in memory; if the last dimension is not distributed over processors, longer contiguous runs of data are kept in cache lines.
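This cache argument can be quantified crudely: in a row-major N^3 array the last index is contiguous, so each subdomain stores runs of N/np_{z} consecutive values. An illustrative Python sketch under that assumption:

```python
N = 200  # cells per direction in the 200^3 case

def contiguous_run(np_z, N=N):
    # Length of the contiguous run of cells each rank holds along
    # the last (z) dimension of a row-major array.
    return N // np_z

print(contiguous_run(1))  # 200: whole z-lines stay contiguous
print(contiguous_run(8))  # 25: each z-line is fragmented eight-fold
```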
We ran the 100^{3} and 200^{3} cases for 5 time steps, and the 200^{3} case again for 40 time steps, to investigate the impact of start-up costs.
A summary of the results is displayed in figure 1.

The timing and performance results are presented in table 1, where the performance figures are calculated as described in section 6.2.2.
From table 1, it can be seen that for the 100^{3} case run for 5 time steps the optimum number of cores is 128, and for the 200^{3} case run for 5 time steps it is 512. However, the optimum number of cores for the 200^{3} case run for 40 time steps is 1024. This implies that running the simulation for only 5 time steps was not long enough to amortise the start-up cost of the simulation.
Thus, from this exercise, we suggest that, when using icoFoam in 3D on a regular mesh of 200^{3}, i.e. 8 million, grid points, simulations should be run on 1024 cores.
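It is worth noting that the two long-run optima reported above correspond to the same load per core, which suggests reading the result as a grain-size limit rather than a property of either mesh; a quick check in Python:

```python
# Cells per core at the optimal core counts reported above.
print(100**3 / 128)   # 7812.5 cells per core (100^3 case)
print(200**3 / 1024)  # 7812.5 cells per core (200^3 case, 40 steps)
```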