Second level of parallelism with MPI

The second-level parallelism (SLP) algorithm seeks to increase the computation speed by employing more than one task to move a RW configuration to its next state. Since the computation time of a configuration in CASINO scales as $N_{e}^{p}$, with $p=2,3$, the computation time per configuration for a system with $\mathcal{O}(10^4)$ electrons could be more than 100 times longer than for a system with $\mathcal{O}(10^3)$ electrons, which is the maximum size reached by current calculations. As byproducts, SLP solves the problem of the large BC data, because the BC are distributed among the group of tasks that perform the computation for one configuration, and it improves the load balance of the parallel computation, because the relative difference between the numbers of configurations on different pools decreases with the pool size (for a calculation with a fixed number of configurations per task or per pool of tasks).

As in the case of the MPI-2S algorithm, SLP divides the tasks into groups, named pools, of a given size (typically 2 or 4). At start-up the program reads the BC and distributes them among the pool members, similarly to the MPI-2S algorithm. The difference is that only one configuration at a time is computed by all the tasks belonging to a pool. One of the tasks, named the "pool head", controls the computation and signals to the other tasks which step of the computation comes next. In this manner the synchronisation problem of the MPI-2S algorithm is removed, and the pool's tasks can be used to compute in parallel further quantities besides the orbitals: the sums that appear in the Jastrow factor, the potential energy, and the linear-algebra operations needed for the Slater matrices.
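The pool structure can be realised with standard MPI communicators. The following minimal C sketch is only an illustration under our own assumptions (pool size hard-coded, rank 0 of each pool chosen as pool head, a placeholder integer standing in for the control signal); it is not the CASINO source. It splits MPI_COMM_WORLD into pool communicators with MPI_Comm_split and lets each pool head broadcast the next step to its pool:

    /* Minimal sketch (illustration only, not the CASINO source): building
     * pools of npool MPI tasks.  Rank 0 of each pool plays the role of the
     * "pool head" that signals the next computation step to its pool. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int npool = 4;                 /* pool size, typically 2 or 4 */
        int world_rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* tasks with the same colour end up in the same pool communicator */
        int colour = world_rank / npool;
        MPI_Comm pool_comm;
        MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &pool_comm);

        int pool_rank, pool_size;
        MPI_Comm_rank(pool_comm, &pool_rank);
        MPI_Comm_size(pool_comm, &pool_size);

        /* the pool head decides the next step (placeholder value here) and
           broadcasts it to the other tasks of its pool */
        int next_step = (pool_rank == 0) ? 1 : 0;
        MPI_Bcast(&next_step, 1, MPI_INT, 0, pool_comm);

        printf("task %d: pool %d, pool rank %d of %d, next step %d\n",
               world_rank, colour, pool_rank, pool_size, next_step);

        MPI_Comm_free(&pool_comm);
        MPI_Finalize();
        return 0;
    }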

We analyse the efficiency of the SLP algorithm in the following way: in the ideal case, for a pool of size $n$ the computation time of one configuration would be $t_n=t_1/n$. However, the communication time between tasks is not negligible, and the work is not equally distributed over the pool's tasks because some computations are done only on the pool's head. We can measure the efficiency of pool usage with the following parameter:

$\displaystyle \eta=\frac{t_1/t_n-1}{n-1}$ (6)

where $\eta$ takes the value 1 in the ideal case, equals 0 if the use of SLP does not increase the computation speed, and becomes negative if the computation time increases with the pool size.
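For example, for the OPO times on the quadcore processor quoted in Table [*] below, a pool of size 2 gives

$\displaystyle \eta=\frac{141/93-1}{2-1}\approx 0.52,$

i.e. about half of the ideal speed-up is realised.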


Table: Computing times in seconds for the SLP algorithm on dual-core (DC) and quadcore (QC) processors with pools of sizes 1, 2 and 4. The times are given for the three sections of the code that are computed in parallel (OPO, Jastrow, Ewald) and for the whole DMC segment, which also contains sections executed in serial mode on the pool's head. In brackets are the values of the efficiency parameter defined by Eq. ([*]).
                        DC                               QC
pool size      1       2          4            1        2          4
System 1: 1024 electrons
OPO          118     80(0.46)   78(0.17)     141      93(0.52)   64(0.40)
Jastrow      271    218(0.24)  189(0.14)     199     151(0.32)  123(0.21)
Ewald         79     59(0.34)   34(0.44)     122      90(0.36)   52(0.45)
DMC          773    656(0.18)  592(0.10)     743     610(0.22)  518(0.14)
System 2: 1536 electrons
OPO          267    171(0.56)  143(0.29)     311     197(0.58)  136(0.43)
Jastrow      640    505(0.27)  473(0.12)     454     340(0.36)  317(0.14)
Ewald        183    137(0.34)   81(0.42)     276     207(0.33)  121(0.43)
DMC         1965   1709(0.15) 1531(0.09)    1847    1522(0.21) 1358(0.12)


In Table [*] we present the computation times for the three sections that are done in parallel over the pool, the one-particle orbitals (OPO), the Jastrow function and the Ewald sum, as well as for the whole DMC section; the pool sizes are 1, 2 and 4. The input file is identical to that used for the shared-memory measurements, see Table [*]. The efficiency parameter shows that the best efficiency is obtained for the OPO computation with pools of size 2. For pools of size 4 the OPO computation is clearly more efficient on the quadcore processor, while the other quantities have similar performance on both architectures, though slightly better on the quadcore. We note also that the efficiency of the OPO calculation increases for the larger system.


Table: Computing times in seconds for the Slater determinants computed with ScaLAPACK over the pool, either by the recursive algorithm, $O(N_{e}^{2})$, or by LU factorisation, $O(N_{e}^{3})$. The $O(N_{e}^{2})$ method shows longer times because its number of calls is larger, by a small multiple of $N_e$, than that of the $O(N_{e}^{3})$ method. In brackets are the values of the efficiency parameter defined by Eq. ([*]).
                          DC
pool size           1        2            4
System 1: 1024 electrons
det $O(N^{3})$     12      5.5(1.18)     4.7(0.52)
det $O(N^{2})$     94      106(-0.11)     76(0.08)
System 2: 1536 electrons
det $O(N^{3})$     39       18(1.17)      10(0.90)
det $O(N^{2})$    330      344(-0.01)    231(0.14)


The overall efficiency of the current implementation in the DMC sector is rather small because the computations of the Slater determinant and of the associated matrices are done on the pool's head. The Slater determinants are computed in two ways in CASINO: i) using the LU factorisation of the Slater matrix, which scales as $N_{e}^3$; ii) using an iterative relationship for the cofactor matrix [9], which scales as $N_{e}^2$ but is numerically unstable. We have implemented a parallel computation of these two subroutines over the pool's cores for VMC calculations using ScaLAPACK subroutines. The timing results, Table [*], show excellent scaling for the $N_{e}^3$ algorithm but little improvement, or a slight degradation, for the $N_{e}^2$ algorithm, which in fact carries a much larger weight in the computation time (in the latest version of CASINO the LU computation of the Slater determinant is called by default only every 100,000 time steps).
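For orientation, the $N_{e}^2$ method amounts to a rank-one (Sherman-Morrison type) update of the inverse Slater matrix after a single-electron move. The following C sketch only illustrates that idea under our own conventions (row-major storage of the inverse, Slater matrix $D_{ij}=\phi_j(r_i)$, electron $k$ moved to $r'$); it is not the CASINO routine of Ref. [9] and omits any numerical-stability safeguards:

    /* Illustrative sketch (not the CASINO routine): rank-one update of the
     * inverse Slater matrix when electron k moves.  dinv[j*n+l] stores
     * (D^{-1})_{jl}; new_col[j] holds phi_j(r') for the moved electron.
     * Cost is O(n^2) per move, versus O(n^3) for a full LU factorisation. */
    #include <stdlib.h>

    /* Returns the determinant ratio q = det(D')/det(D) and updates dinv. */
    double slater_update(double *dinv, const double *new_col, int k, int n)
    {
        /* s[l] = sum_j phi_j(r') (D^{-1})_{jl}; note s[k] is the ratio q */
        double *s = malloc(n * sizeof(double));
        for (int l = 0; l < n; ++l) {
            s[l] = 0.0;
            for (int j = 0; j < n; ++j)
                s[l] += new_col[j] * dinv[j * n + l];
        }
        double q = s[k];

        /* Sherman-Morrison: update all columns l != k, then rescale column k */
        for (int j = 0; j < n; ++j) {
            double djk = dinv[j * n + k];
            for (int l = 0; l < n; ++l)
                if (l != k)
                    dinv[j * n + l] -= (djk / q) * s[l];
            dinv[j * n + k] = djk / q;
        }

        free(s);
        return q;
    }

Because each update divides by the ratio $q$, round-off errors accumulate over many moves, which is consistent with the instability noted above and with the periodic refresh of the determinant by the exact $N_{e}^3$ LU evaluation.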