Second level of parallelism with MPI

The second-level parallelism (SLP) algorithm seeks to increase the computation speed by employing more than one task to move a RW configuration to its next state. Since the computation time of a configuration in CASINO scales as $N_{e}^{p}$, with $p=2,3$, the computation time per configuration for a system with $\mathcal{O}(10^4)$ electrons could be more than 100 times longer than for a system with $\mathcal{O}(10^3)$ electrons, which is the maximum size reached by current calculations. As byproducts, SLP solves the problem of the large BC data, because the BC are distributed among the group of tasks that perform the computation for one configuration, and it improves the load balance of the parallel computation, because the relative difference between the numbers of configurations on different pools decreases with the pool size (for a calculation with a fixed number of configurations per task or per pool of tasks).

As in the case of the MPI-2S algorithm, SLP divides the tasks into groups, named pools, of a given size (typically 2 or 4). At start-up the program reads the BC and distributes them among the pool members, similarly to the MPI-2S algorithm. The difference is that only one configuration at a time is computed by all the tasks belonging to a pool. One of the tasks, named the "pool head", controls the computation and signals to the other tasks which step of the computation comes next. In this manner the synchronisation problem of the MPI-2S algorithm is removed, and the pool's tasks can be used to compute in parallel further quantities besides the orbitals: the sums that appear in the Jastrow factor, the potential energy, and the linear-algebra operations needed for the Slater matrices.
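The pool structure can be realised with standard MPI communicators. The following minimal C sketch is only an illustration under our own assumptions (pool size hard-coded, rank 0 of each pool chosen as pool head, a placeholder integer standing in for the control signal); it is not the CASINO source. It splits MPI_COMM_WORLD into pool communicators with MPI_Comm_split and lets each pool head broadcast the next step to its pool:

    /* Minimal sketch (illustration only, not the CASINO source): building
     * pools of npool MPI tasks.  Rank 0 of each pool plays the role of the
     * "pool head" that signals the next computation step to its pool. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int npool = 4;                 /* pool size, typically 2 or 4 */
        int world_rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* tasks with the same colour end up in the same pool communicator */
        int colour = world_rank / npool;
        MPI_Comm pool_comm;
        MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &pool_comm);

        int pool_rank, pool_size;
        MPI_Comm_rank(pool_comm, &pool_rank);
        MPI_Comm_size(pool_comm, &pool_size);

        /* the pool head decides the next step (placeholder value here) and
           broadcasts it to the other tasks of its pool */
        int next_step = (pool_rank == 0) ? 1 : 0;
        MPI_Bcast(&next_step, 1, MPI_INT, 0, pool_comm);

        printf("task %d: pool %d, pool rank %d of %d, next step %d\n",
               world_rank, colour, pool_rank, pool_size, next_step);

        MPI_Comm_free(&pool_comm);
        MPI_Finalize();
        return 0;
    }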

We analyse the efficiency of the SLP algorithm in the following way: in the ideal case, for a pool of size $n$ the computation time of one configuration would be $t_n=t_1/n$. However, the communication time between tasks is not negligible, and the work is not equally distributed over the pool's tasks because some computations are done only on the pool's head. We can measure the efficiency of pool usage with the following parameter:

$\displaystyle \eta=\frac{t_1/t_n-1}{n-1}$ (6)

where $\eta$ takes the value 1 in the ideal case, equals 0 if the use of SLP does not increase the computation speed, and becomes negative if the computation time increases with the pool size.
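For example, for the OPO times on the quadcore processor quoted in Table [*] below, a pool of size 2 gives

$\displaystyle \eta=\frac{141/93-1}{2-1}\approx 0.52,$

i.e. about half of the ideal speed-up is realised.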


Table: Computing times in seconds for the SLP algorithm on dual-core (DC) and quadcore (QC) processors with pools of sizes 1, 2 and 4. The times are given for the three sections of the code that are computed in parallel (OPO, Jastrow, Ewald) and for the whole DMC segment, which also contains sections executed in serial mode on the pool's head. In brackets are the values of the efficiency parameter defined by Eq. ([*]).
                        DC                               QC
pool size      1       2          4            1        2          4
System 1: 1024 electrons
OPO          118     80(0.46)   78(0.17)     141      93(0.52)   64(0.40)
Jastrow      271    218(0.24)  189(0.14)     199     151(0.32)  123(0.21)
Ewald         79     59(0.34)   34(0.44)     122      90(0.36)   52(0.45)
DMC          773    656(0.18)  592(0.10)     743     610(0.22)  518(0.14)
System 2: 1536 electrons
OPO          267    171(0.56)  143(0.29)     311     197(0.58)  136(0.43)
Jastrow      640    505(0.27)  473(0.12)     454     340(0.36)  317(0.14)
Ewald        183    137(0.34)   81(0.42)     276     207(0.33)  121(0.43)
DMC         1965   1709(0.15) 1531(0.09)    1847    1522(0.21) 1358(0.12)


In Table [*] we present the computation times for the three sections that are done in parallel over the pool, the one-particle orbitals (OPO), the Jastrow function and the Ewald sum, as well as for the whole DMC section; the pool sizes are 1, 2 and 4. The input file is identical to that used for the shared-memory measurements, see Table [*]. The efficiency parameter shows that the best efficiency is obtained for the OPO computation with pools of size 2. For pools of size 4 the OPO computation is clearly more efficient on the quadcore processor, while the other quantities have similar performance on both architectures, though slightly better on the quadcore. We note also that the efficiency of the OPO calculation increases for the larger system.


Table: Computing times in seconds for the Slater determinants computed with ScaLAPACK over the pool, either by the recursive algorithm, $O(N_{e}^{2})$, or by LU factorisation, $O(N_{e}^{3})$. The $O(N_{e}^{2})$ method shows longer times because its number of calls is larger, by a small multiple of $N_e$, than that of the $O(N_{e}^{3})$ method. In brackets are the values of the efficiency parameter defined by Eq. ([*]).
                          DC
pool size           1        2            4
System 1: 1024 electrons
det $O(N^{3})$     12      5.5(1.18)     4.7(0.52)
det $O(N^{2})$     94      106(-0.11)     76(0.08)
System 2: 1536 electrons
det $O(N^{3})$     39       18(1.17)      10(0.90)
det $O(N^{2})$    330      344(-0.01)    231(0.14)


The overall efficiency of the current implementation in the DMC sector is rather small because the computations of the Slater determinant and of the associated matrices are done on the pool's head. The Slater determinants are computed in two ways in CASINO: i) using the LU factorisation of the Slater matrix, which scales as $N_{e}^3$; ii) using an iterative relationship for the cofactor matrix [9], which scales as $N_{e}^2$ but is numerically unstable. We have implemented a parallel computation of these two subroutines over the pool's cores for VMC calculations using ScaLAPACK subroutines. The timing results, Table [*], show excellent scaling for the $N_{e}^3$ algorithm but little improvement, or a slight degradation, for the $N_{e}^2$ algorithm, which in fact carries a much larger weight in the computation time (in the latest version of CASINO the LU computation of the Slater determinant is called by default only every 100,000 time steps).
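For orientation, the $N_{e}^2$ method amounts to a rank-one (Sherman-Morrison type) update of the inverse Slater matrix after a single-electron move. The following C sketch only illustrates that idea under our own conventions (row-major storage of the inverse, Slater matrix $D_{ij}=\phi_j(r_i)$, electron $k$ moved to $r'$); it is not the CASINO routine of Ref. [9] and omits any numerical-stability safeguards:

    /* Illustrative sketch (not the CASINO routine): rank-one update of the
     * inverse Slater matrix when electron k moves.  dinv[j*n+l] stores
     * (D^{-1})_{jl}; new_col[j] holds phi_j(r') for the moved electron.
     * Cost is O(n^2) per move, versus O(n^3) for a full LU factorisation. */
    #include <stdlib.h>

    /* Returns the determinant ratio q = det(D')/det(D) and updates dinv. */
    double slater_update(double *dinv, const double *new_col, int k, int n)
    {
        /* s[l] = sum_j phi_j(r') (D^{-1})_{jl}; note s[k] is the ratio q */
        double *s = malloc(n * sizeof(double));
        for (int l = 0; l < n; ++l) {
            s[l] = 0.0;
            for (int j = 0; j < n; ++j)
                s[l] += new_col[j] * dinv[j * n + l];
        }
        double q = s[k];

        /* Sherman-Morrison: update all columns l != k, then rescale column k */
        for (int j = 0; j < n; ++j) {
            double djk = dinv[j * n + k];
            for (int l = 0; l < n; ++l)
                if (l != k)
                    dinv[j * n + l] -= (djk / q) * s[l];
            dinv[j * n + k] = djk / q;
        }

        free(s);
        return q;
    }

Because each update divides by the ratio $q$, round-off errors accumulate over many moves, which is consistent with the instability noted above and with the periodic refresh of the determinant by the exact $N_{e}^3$ LU evaluation.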