Load Balancing

The third and final major section of work looked at improving load balancing of the code, particularly in the region where gaussian basis functions held in sparse matrix representation are mapped to the real space grids (steps I,VI in figure 1). The real space to plane wave transfer and the FFT are well load balanced as the amount of work per process is proportional to the number of grid points, which are evenly distributed either as planes, pencils (plane wave grids) or blocks (real space grids). However, the task of mapping gaussians from the matrix to the grid is not so well load balanced. The matrix is distributed across all processors in a ScaLAPACK block-cyclic style. However, in general, each gaussian contained in a process' matrix block, may not be located in it's own section of the real space grids.

The existing load balancing scheme works by each process building a task list of the gaussians contained within it's local matrix block. Each particular gaussian will be mapped to one level of the real space grids, which may be either distributed or replicated. Thus the task list will be made up of `distributed tasks' which must be completed by the process with the corresponding section of the real space grid, and `replicated tasks' which may be completed by any process. A cost model exists in CP2K which is very accurate at predicting the computational cost of a particular task (see figure 4). This is used to estimate the load on each process due to distributed tasks and the replicated tasks are then assigned to processes in order to even out the load.

For homogeneous systems approximately the same number of atoms will reside in each process' section of the grid, so the distribution of gaussian mapping tasks will be well load balanced. For example, the bench_64 test case on 64 cores is able to achieve a nearly perfect load balance because there is large enough number of replicated tasks to remove the unbalanced load or distributed tasks:

 At the end of the load_balance_distributed
 Maximum load:                75667
 Average load:                68312
 Minimum load:                13060

 At the end of the load_balance_replicated
 Maximum load:               123552
 Average load:               123457
 Minimum load:               123374

However, for non-homogenous systems including interfaces and clusters (e.g. large molecules), some processes will have very few distributed tasks that they can process, as they may have few atoms in their local real space grid. If the number of replicated tasks is not enough, a poor load balance will result. To test this scenario, the W216 test case, described in section 2.1 was used. As shown, for W216 on 256 cores, there is a severe load imbalance in this region of the code, with the most heavily loaded processor having over 6 times as much work as the least loaded.

Figure 4: Excellent correlation between predicted and measured cost for task mapping (J. VandeVondele)

 At the end of the load_balance_distributed
 Maximum load:              1738978
 Average load:               176232
 Minimum load:                    0

 At the end of the load_balance_replicated
 Maximum load:              1738978
 Average load:               475032
 Minimum load:               286053

To address this, a scheme was proposed where instead of each process owning the same segment of the real space grid on each grid level, the mapping of grid blocks to processes could be varied at each level. Blocks which result in a heavy load on one grid level could be paired with blocks with low loads on another, thus distributing the load more evenly. This concept was introduced into the code as `real' and `virtual' ranks. Each process has a fixed real rank and a virtual rank for each grid level which is the rank of the process that would have held the particular section of the grid if there was no reordering. As the load balancing step is done before the grids are allocated, an appropriate mapping of real to virtual ranks is chosen based on the load of the distributed tasks, then the grids are allocated on each process corresponding to their virtual ranks. There are a number of changes required in the realspace to planewave transfer routines to ensure that the reordered grid data is sent to the correct process for transferring to the plane wave grid, but this is facilitated by the use of a pair of mapping arrays real2virtual and virtual2real which are members of the real space grid data structure and are used to convert between the two orderings as needed.

For the same problem as above, using the new load balancing scheme, the load on the most overloaded process is reduced by 30%, and this is now only 3.5 times the load of the least loaded process. For this particular problem it is not possible to find a perfect load balance, as there is a single grid level block which has more load associated with it than then total average load. It is possible to overcome this by setting up the grid levels so that they are more closely spaced, and thus there is less load on each grid level. However, this comes at an increased memory cost for the extra grid levels and also affects the numerics of the calculation slightly (1 in 10$^6$). As shown in figures 5 and 6 if it is possible to balance the load perfectly, then this algorithm will succeed.

 After load_balance_distributed
 Maximum load:              1165637
 Average load:               176232
 Minimum load:                    0

 After load_balance_replicated
 Maximum  load:              1165637
 Average  load:               475032
 Minimum  load:               317590

Figure 5: W216 load balance on 16 cores - perfect load balance achieved

Figure 6: W216 load balance on 64 cores - several heavily loaded grid sections stop perfect local balance