Second level of parallelism with OpenMP

The previous study shows that the parallel computation of a single configuration across processors is rather expensive, while the performance of distributed computation on the same processor is promising. In phase II HECToR is equipped with quad-core processors, and it is expected that the next stages of the HECToR service will use processors with a higher number of cores and/or shared memory across the processors belonging to a blade. In this hardware framework, mixed-mode programming is the straightforward option for the implementation of SLP: OpenMP can be used to accelerate the computation of one configuration, while separate MPI tasks handle separate sets of configurations.
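A minimal sketch of this mixed-mode structure is given below (illustrative only, not CASINO source; the program and variable names are ours). Each MPI task owns its own set of configurations, and since only the master thread of each task calls MPI, funnelled thread support is sufficient.

  program mixed_mode_sketch
    use mpi
    implicit none
    integer :: ierr, provided, rank, ntasks

    ! Funnelled support is enough when only the master thread calls MPI.
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

    ! ... each MPI task loops over its own set of configurations here,
    !     using OpenMP parallel loops to accelerate one configuration ...

    call MPI_Finalize(ierr)
  end program mixed_mode_sketch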

OpenMP parallelism at loop level is relatively easy to implement in CASINO, as the code contains patterns of nested loops over the number of electrons for the computation of various physical quantities that are sums of functions depending on the coordinates of two or more electrons.
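The sketch below illustrates this loop pattern with an arbitrary placeholder pair function (it is not code taken from CASINO); as described next, the OpenMP directives are applied to the inner loop.

  subroutine pair_sum_sketch(nele, r, e_total)
    implicit none
    integer, intent(in)       :: nele
    real(kind=8), intent(in)  :: r(3, nele)   ! electron coordinates
    real(kind=8), intent(out) :: e_total
    integer :: i, j
    real(kind=8) :: rij, e_i

    e_total = 0.d0
    do i = 2, nele
       e_i = 0.d0
       ! Inner-loop parallelism: each thread accumulates part of the pair sum.
  !$omp parallel do default(shared) private(j, rij) reduction(+:e_i)
       do j = 1, i - 1
          rij = sqrt(sum((r(:, i) - r(:, j))**2))
          e_i = e_i + 1.d0/rij        ! placeholder pair function
       end do
  !$omp end parallel do
       e_total = e_total + e_i
    end do
  end subroutine pair_sum_sketch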

The OpenMP parallelism is implemented for the inner loops of these computations. Parallelising the outer loop, which from a theoretical point of view should be more efficient, is in practice very hard to do with OpenMP for two main reasons:

  1. the code uses a buffering algorithm for various quantities in order to avoid recomputing the same quantity, but this introduces data dependencies between iterations;
  2. more importantly at the moment, the current implementations of the OpenMP specification in the compilers used struggle to handle procedure calls that use module variables. The PGI and PathScale compilers have difficulties with module variables that are also threadprivate (this problem disappeared in PGI version 8.0.6), and the GCC compiler (version 4.3.3) crashes during compilation if the parallel region is contained in an internal subroutine (solved in version 4.4.0). It is sometimes possible to rewrite the code to fit the implemented subset of the OpenMP specification, but such solutions are time consuming and interfere unnecessarily with the work of the other developers; ideally, OpenMP at loop level should not change the serial code. A sketch of the problematic threadprivate pattern is given after this list.
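The sketch below shows the kind of construct that exposed these compiler problems (the module, variable and procedure names are illustrative, not taken from CASINO): a parallel region calls a module procedure that writes to a threadprivate module variable.

  module scratch_mod
    implicit none
    real(kind=8) :: work_buffer(1024)
  !$omp threadprivate(work_buffer)
  contains
    subroutine fill_buffer(x)
      real(kind=8), intent(in) :: x
      work_buffer = x              ! each thread writes its own private copy
    end subroutine fill_buffer
  end module scratch_mod

  program threadprivate_sketch
    use scratch_mod
    implicit none
    integer :: i
  !$omp parallel do private(i)
    do i = 1, 8
       call fill_buffer(real(i, kind=8))
    end do
  !$omp end parallel do
  end program threadprivate_sketch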


Table: Computing times in seconds for the OpenMP SLP algorithm on quad-core processors with 1, 2 and 4 threads. For comparison, the first column shows timing data for the executable compiled without OpenMP flags. The times are for the five sections of the code that are computed in parallel (OPO, Jastrow, Ewald, update of the $ \bar D$ matrix, the electron-electron distances $ R_{ee}$) and for the full DMC sector. In brackets are the values of the efficiency parameter defined by Eq ([*]), where $ t_1$ is taken from the first column.
Sector       No OpenMP    1 thread    2 threads      4 threads

System 1: 1024 electrons
OPO              102         105       73 (0.38)      61 (0.56)
Jastrow          246         302      216 (0.14)     170 (0.15)
Ewald             79          78       39 (1.03)      20 (0.98)
$ \bar D$         49          47       19 (1.58)       9 (1.48)
$ R_{ee}$        123         114       65 (0.89)      40 (0.69)
DMC              672         723      481 (0.40)     363 (0.28)

System 2: 1536 electrons
OPO              218         232      166 (0.31)     147 (0.16)
Jastrow          539         648      470 (0.15)     420 (0.09)
Ewald            176         176       90 (0.96)      46 (0.94)
$ \bar D$        178         180      126 (0.41)     124 (0.15)
$ R_{ee}$        265         249      143 (0.85)      84 (0.72)
DMC             1542        1654     1154 (0.34)     966 (0.20)


The benchmark tests for the OpenMP parallelism were run on the same systems as in section [*]. A direct comparison between the numerical values is not straightforward, however, as the CASINO code has undergone significant changes between the two sets of measurements; nevertheless, the efficiency parameter of Eq ([*]) can be used to compare the performance of the two SLP algorithms.
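For orientation, the tabulated efficiencies are consistent with the incremental form $ E_n = (t_1/t_n - 1)/(n-1)$, where $ t_n$ is the time on $ n$ threads and $ t_1$ the time of the executable built without OpenMP; this expression is only reconstructed from the tabulated data and may differ in detail from Eq ([*]). For example, for the Ewald sector of System 1 on four threads it gives $ (79/20 - 1)/3 \approx 0.98$, in agreement with the table.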

At first sight, two features of the values presented in table [*] need to be discussed.

The efficiency of the OpenMP parallel loops varies. Some sections, such as Ewald and $ \bar D$ for "System 1", show efficiencies larger than one and a very small overhead. On the other hand, the overhead of the parallel loops in the Jastrow sector is very large for both systems and the scaling is rather poor. The extreme case of poor scaling is the computation of $ \bar D$ with four threads for "System 2", which shows practically no speed gain, despite the fact that "System 1" scales very well in this case. The main suspect for this variation is cache memory utilisation, as the linear size of the $ \bar D$ matrix of "System 2" is $ 50\%$ larger than that of "System 1".

In conclusion, SLP implemented with OpenMP offers better performance than the MPI pool algorithm on the benchmark cases. Some sectors of the parallel calculation (Jastrow, $ \bar D$) must be investigated further in order to understand their poor scaling. As compiler support in this area improves and the number of cores per processor increases, the parallel computation should be implemented at the level of the first (outer) loop over electrons. For nodes with more than 10 cores and models with a very large number of electrons ($ >10^4$), nested OpenMP parallelism should be considered in order to obtain the best performance; a sketch of such nested parallelism is given below.
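The following is a minimal sketch of nested OpenMP parallelism (illustrative only, with an arbitrary placeholder pair term and hard-coded thread counts; not CASINO source): an outer team shares the first loop over electrons and each outer thread spawns an inner team for the pair sum.

  program nested_sketch
    use omp_lib
    implicit none
    integer, parameter :: nele = 10000
    integer :: i, j
    real(kind=8) :: total, partial

    call omp_set_nested(.true.)          ! enable nested parallel regions
    total = 0.d0
  !$omp parallel do private(i, j, partial) reduction(+:total) num_threads(4)
    do i = 2, nele
       partial = 0.d0
       ! Inner team works on the pair sum belonging to electron i.
  !$omp parallel do private(j) reduction(+:partial) num_threads(2)
       do j = 1, i - 1
          partial = partial + 1.d0/real(i - j, kind=8)   ! placeholder pair term
       end do
  !$omp end parallel do
       total = total + partial
    end do
  !$omp end parallel do
    print *, 'total =', total
  end program nested_sketch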