Second parallelism level with OpenMP

The previous study shows that spreading the computation of a single configuration across processors is rather expensive, whereas distributing it among the cores of the same processor looks promising. In phase II HECToR is equipped with quad-core processors, and the next stages of the HECToR service are expected to use processors with a higher number of cores and/or memory shared among the processors of a blade. In this hardware framework mixed-mode programming is the natural option for implementing SLP: OpenMP accelerates the computation of one configuration, while separate MPI tasks handle separate sets of configurations.

OpenMP parallelism at loop level is relatively easy to implement in CASINO, because the code contains patterns of nested loops over the number of electrons. These loops compute various physical quantities that are sums of functions depending on the coordinates of two or more electrons.

The OpenMP parallelism is implemented for the inner loops of the computation. Parallelising the outer loop, which from a theoretical point of view should be more efficient, is in practice very hard to do with OpenMP for two main reasons:

  1. the code uses a buffering algorithm for various quantities, which avoids recomputing the same quantity but introduces data dependencies,
  2. most importantly at present, the OpenMP implementations in the compilers used struggle with procedure calls that use module variables. The PGI and Pathscale compilers have difficulties with module variables that may also be thread private (this problem disappeared in PGI at version 8.0.6). The GCC compiler (version 4.3.3) crashes during compilation if the parallel region is contained in an internal subroutine (this problem was solved in version 4.4.0). Sometimes the code can be rewritten to stay within the part of the OpenMP specification that a compiler implements, but such solutions are time consuming and interfere unnecessarily with the work of the other developers. Ideally, OpenMP at the loop level should not change the serial code.
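The inner-loop approach described above can be illustrated with a minimal sketch. It is written in C rather than in CASINO's Fortran, and the names `pair_term_r2` and `pair_sum` as well as the functional form of the pair term are hypothetical stand-ins, not CASINO routines. The point it illustrates is that the serial loop body is left untouched and only a directive is added in front of the inner loop.

```c
/* Hypothetical two-electron term as a function of the squared distance.
   It stands in for a Jastrow- or Ewald-like pair function and is NOT
   CASINO's actual functional form. */
static double pair_term_r2(double r2)
{
    return 1.0 / (1.0 + r2);
}

/* Sum of pair_term_r2 over all pairs i < j.
   pos holds n electrons as consecutive (x, y, z) triples. */
double pair_sum(const double *pos, int n)
{
    double total = 0.0;

    for (int i = 0; i < n; ++i) {       /* outer loop stays serial */
        double partial = 0.0;

        /* Inner loop is shared among the threads. Each thread accumulates
           a private copy of partial; the copies are added together when
           the loop finishes. */
        #pragma omp parallel for reduction(+:partial)
        for (int j = i + 1; j < n; ++j) {
            double dx = pos[3 * j]     - pos[3 * i];
            double dy = pos[3 * j + 1] - pos[3 * i + 1];
            double dz = pos[3 * j + 2] - pos[3 * i + 2];
            partial += pair_term_r2(dx * dx + dy * dy + dz * dz);
        }
        total += partial;
    }
    return total;
}
```

Compiled without OpenMP support the directive is ignored and the function runs serially with the same result (up to floating-point summation order); with GCC the parallel version is enabled by `-fopenmp`. A parallel region is opened once per outer iteration, which is one plausible source of the overhead seen when the inner loops are short.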

Table: Computing times in seconds for the OpenMP SLP algorithm on quad-core processors for 1, 2 and 4 threads. For comparison, the first column shows timing data for the executable compiled without OpenMP flags. The times are for the five sections of the code that are computed in parallel (OPO, Jastrow, Ewald, update of the $\bar D$ matrix, the electron-electron distances $R_{ee}$) and for the full DMC sector. In brackets are the values of the efficiency parameter defined by Eq ([*]), where $t_1$ is taken from the first column.
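
| Section | No OpenMP | 1 thread | 2 threads | 4 threads |
|---|---|---|---|---|
| System 1, 1024 electrons | | | | |
| OPO | 102 | 105 | 73 (0.38) | 61 (0.56) |
| Jastrow | 246 | 302 | 216 (0.14) | 170 (0.15) |
| Ewald | 79 | 78 | 39 (1.03) | 20 (0.98) |
| $\bar D$ | 49 | 47 | 19 (1.58) | 9 (1.48) |
| $R_{ee}$ | 123 | 114 | 65 (0.89) | 40 (0.69) |
| DMC | 672 | 723 | 481 (0.40) | 363 (0.28) |
| System 2, 1536 electrons | | | | |
| OPO | 218 | 232 | 166 (0.31) | 147 (0.16) |
| Jastrow | 539 | 648 | 470 (0.15) | 420 (0.09) |
| Ewald | 176 | 176 | 90 (0.96) | 46 (0.94) |
| $\bar D$ | 178 | 180 | 126 (0.41) | 124 (0.15) |
| $R_{ee}$ | 265 | 249 | 143 (0.85) | 84 (0.72) |
| DMC | 1542 | 1654 | 1154 (0.34) | 966 (0.20) |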

The benchmark tests for the OpenMP parallelism were done on the same system as in section [*]. A direct comparison between the numerical values is not straightforward, however, as the CASINO code has undergone significant changes between the two measurements. Nevertheless, the efficiency parameter of Eq ([*]) can be used to compare the performance of the two SLP algorithms.
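The efficiency equation itself is not reproduced in this section. As a hedged illustration, most of the bracketed values in the timing table are consistent with the expression $(t_1/t_n - 1)/(n - 1)$, where $t_n$ is the time measured with $n$ threads. This form is inferred from the tabulated numbers and is an assumption, not a quotation of the referenced equation; the function name `slp_efficiency` is likewise hypothetical.

```c
#include <assert.h>

/* ASSUMED efficiency parameter: the extra speed-up gained per extra
   thread, (t1/tn - 1)/(n - 1). The form is inferred from the bracketed
   values in the timing table, not taken from the referenced equation.
   t1: time of the executable built without OpenMP flags
   tn: time measured with nthreads threads (nthreads > 1) */
double slp_efficiency(double t1, double tn, int nthreads)
{
    assert(nthreads > 1 && tn > 0.0);
    return (t1 / tn - 1.0) / (double)(nthreads - 1);
}
```

For example, the Ewald times of System 1 (79 s without OpenMP, 39 s with two threads) give about 1.03, and the full DMC sector with four threads (672 s and 363 s) gives about 0.28. Under this assumed definition a value of one corresponds to ideal scaling and a value of zero to no gain over the serial executable.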

At first sight, two features of the values presented in table [*] need to be discussed.

The efficiency of the OpenMP parallel loops varies. Some sections, such as Ewald and $\bar D$ for "System 1", show an efficiency close to or larger than one and a very small overhead. On the other hand, the overhead of the parallel loops in the Jastrow sector is very large for both systems and the scaling is rather poor. The extreme case of poor scaling is the computation of $\bar D$ for "System 2": going from two to four threads brings practically no speed gain, even though "System 1" scales very well in this case. The main suspect for this variation is cache memory utilisation, as the linear size of the $\bar D$ matrix of "System 2" is $50\%$ larger than that of "System 1", so the matrix holds $2.25$ times as many elements.

In conclusion, SLP implemented with OpenMP offers better performance than the MPI pool algorithm on the benchmark cases. Some sectors of the parallel calculation (Jastrow, $\bar D$) must be investigated further in order to understand their poor scaling. As the compilers improve in this area and the number of cores per processor increases, the parallel computation should be implemented at the level of the first loop over electrons. For nodes with more than 10 cores and models with a very large number of electrons ($>10^4$), one should consider nested OpenMP parallelism in order to obtain the best performance.
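The nested variant suggested above can be sketched as follows. This is a C illustration under stated assumptions, not CASINO code: `pair_term_nested` and `pair_sum_nested` are hypothetical names, the pair term is a placeholder, and the sketch ignores the buffering dependencies and module-variable issues that make the outer loop hard to parallelise in the real code.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical pair function of the squared distance; a placeholder,
   not CASINO's actual functional form. */
static double pair_term_nested(double r2)
{
    return 1.0 / (1.0 + r2);
}

/* Sum over all pairs i < j with both loop levels parallelised.
   pos holds n electrons as consecutive (x, y, z) triples. */
double pair_sum_nested(const double *pos, int n)
{
    double total = 0.0;

#ifdef _OPENMP
    /* Allow two active levels of parallel regions (OpenMP 3.0 or later). */
    omp_set_max_active_levels(2);
#endif

    /* First level: the outer loop over electrons. The triangular loop
       gives uneven work per iteration, hence the dynamic schedule. */
    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        double partial = 0.0;   /* private to the outer-level thread */

        /* Second level: each outer thread may open its own inner team. */
        #pragma omp parallel for reduction(+:partial)
        for (int j = i + 1; j < n; ++j) {
            double dx = pos[3 * j]     - pos[3 * i];
            double dy = pos[3 * j + 1] - pos[3 * i + 1];
            double dz = pos[3 * j + 2] - pos[3 * i + 2];
            partial += pair_term_nested(dx * dx + dy * dy + dz * dz);
        }
        total += partial;
    }
    return total;
}
```

How many threads each level receives is left to the runtime here; in practice the split between the two levels would have to be tuned to the node, and the inner level is only likely to pay off when the inner loops are long, as in the very large models mentioned above.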