The previous study shows that parallel computation of a single configuration across processors is rather expensive, while the performance of distributed computation on the same processor is promising. In phase II HECToR is equipped with quad-core processors, and the next stages of the HECToR service are expected to use processors with a higher number of cores and/or shared memory between the processors belonging to a blade. In this hardware framework, mixed-mode programming is the straightforward option for the implementation of SLP: OpenMP can be used to accelerate the computation of one configuration, while separate MPI tasks are kept for separate sets of configurations.
OpenMP parallelism at loop level is relatively easy to implement in CASINO, as the code contains patterns of nested loops over the number of electrons for the computation of various physical quantities that are sums of functions depending on the coordinates of two or more electrons.
The OpenMP parallelism is implemented for the inner loops of the computation. Parallelising the outer loop, which from a theoretical point of view should be more efficient, is in practice very hard with OpenMP for two main reasons:
The benchmark tests for the OpenMP parallelism were done on the same system as in section . A direct comparison between the numerical values is not straightforward, however, as the CASINO code underwent significant changes between the two measurements; nevertheless, the efficiency parameter of Eq () can be used to compare the performance of the two SLP algorithms.
At first sight, two features of the values presented in table  need to be discussed.
The efficiency of the OpenMP parallel loops varies. Sections such as Ewald and  for ''System 1'' show efficiency larger than one and a very small overhead. On the other hand, the overhead of the parallel loops in the Jastrow sector is very large for both systems, and the scaling is rather poor. The extreme case of poor scaling is the four-thread computation of the  for ''System 2'', which shows practically no speed gain, even though ''System 1'' behaves perfectly in this case. The main suspect for this variation is cache memory utilisation, as the linear size of the matrix for ''System 2'' is larger than that for ''System 1''.
In conclusion, SLP implemented with OpenMP offers better performance than the MPI pool algorithm on the benchmark cases. The poor scaling in some sectors of the parallel calculation (Jastrow, ) must be investigated further. As compilers improve in this area and the number of cores per processor increases, the parallel computation should be implemented at the level of the first loop over electrons. For nodes with more than 10 cores and models with a very large number of electrons (), nested OpenMP parallelism should be considered in order to obtain the best performance.