A load balancing algorithm

for Molecular Dynamics simulations

 

Laurence J. Ellison, 10th May 2011

 

 

Abstract

 

This report describes a prototype load balancing algorithm designed to improve the efficiency of molecular dynamics simulations, specifically those in which the system under study is characterised by large spatial variations in atom density.  At root the parallelisation scheme used in the load balancer is similar to the widely used domain decomposition scheme; however, each domain is further subdivided into a number of cells.  The coordinates of the atoms within cells can be reallocated from processors with a heavy work load to ones with more modest work loads, to effect a more even work distribution.

The prototype load balancing  algorithm is based around a rudimentary MD time stepping loop in which only 2-body interactions are considered and the atom trajectories are non-physical.  The intention here was to provide a simple framework on which to develop and test the basic algorithm.  The results of performance tests carried out on HECToR are encouraging, with good speed ups observed for a number of simple test systems.  The largest test run consisted of some 5 million atoms run on a maximum of 2744 cores. 

The prototype load balancer code as it stands provides a good basis for understanding how the performance of the algorithm is affected by various operational parameters.  However, further work is necessary to integrate the algorithm into DL_POLY_4, a long-standing, general-purpose MD package with a wide user base.  This is the ultimate aim of the project and the ultimate test for the load balancing algorithm.  If this can be achieved, it is hoped that the improved performance conferred by the load balancing will enable users to obtain results in a shorter time frame and to make better use of the resources on which they run their jobs.

 

 

Contents

1. Introduction

2. Code overview

3. Performance test results

4. Further work

5. Summary and conclusions

 

Appendix: Performance test data

 

1. Introduction

 

In molecular dynamics (MD) simulations parallelised using the domain decomposition (DD) scheme, the simulation volume, which we refer to as the MD cell, is divided into a number of spatial domains of equal size, each one being allocated to a particular core* on the machine on which the simulation is run.  The work load for each core consists of two main components: (i) the calculation of the forces acting on the atoms within its domain and (ii) the message passing necessary for importing additional atomic coordinates from neighbouring cores as required to complete those force calculations.  If the distribution of atoms within the MD cell is more or less uniform, then each core should have roughly the same work load: the work loads are said to be balanced.  If however the atom distribution becomes significantly non-uniform, as in figure 1(a), then we may well find that certain domains contain large numbers of atoms whilst others are relatively sparsely populated.  The effect of this will be that cores responsible for regions of high atom density will have large work loads whilst those responsible for the sparse regions will be largely idle for much of the time.  There are two major disadvantages here.  Firstly, the MD simulation will only run as quickly as the slowest participating core, so the average step time, and thus the time to solution, will be dictated by the core with the heaviest work load.  Secondly, the available hardware is not being utilised as efficiently as it might be, since many of the cores are doing very little work.

 

Figure 1.

(a) Schematic representation of the MD cell partitioned into domains; the shaded disk represents a highly non-uniform atom distribution. (b) The basic strategy on which the load balancing algorithm is based is to subdivide each domain into cells and reallocate cells from cores responsible for densely populated domains to cores responsible for domains containing fewer atoms.

The prototype load balancing program that we have developed seeks to ameliorate this situation by reallocating atoms from busy cores to less busy ones.  In order to do this each domain is subdivided into a number of cells, each of the same size (figure 1(b)).  Periodically, within the MD time stepping loop, each core estimates its overall work load as a function of the current atom distribution.  The work loads are then pooled via global communication and, if at this point there is sufficient imbalance in the work loads, cells are transferred from busy cores to less busy ones.  By the transferal of a cell, we mean that all the atoms within that cell are allocated to another core, which then becomes responsible for calculating the forces on the atoms in that cell.  Cell transfer continues until the work loads are balanced within a desired tolerance.  This should result in a speedup of the simulation, i.e. we should observe a reduction in the average time taken per step as compared to the equivalent test run without load balancing.  Furthermore all the cores should be engaged in doing work for the majority of the simulation time.  A successful load balancer therefore allows the user to obtain results in a shorter time frame and makes more effective use of the hardware available.

The primary objectives of this project were (i) to develop a prototype load balancer designed to improve the performance of parallel MD simulations, (ii) to thoroughly test the load balancer and understand the effect of the various operational parameters on its performance and (iii) to incorporate the load balancing algorithm into the latest release of the general purpose MD package DL_POLY_4, which uses the DD method of parallelisation but does not currently feature any load balancing functionality.  We have succeeded in developing an algorithm that is robust and flexible; in particular the cell and domain dimensions can be selected completely independently of the cut-off distance for atom-atom interactions.  The code has been exhaustively tested for errors and the results of the performance tests are encouraging.  In a number of these tests the load balancer achieved a speedup close to the theoretical maximum, and very rarely did the load balanced version run slower than the control test.  In other words it appears that in most circumstances there is little to be lost by deploying the load balancer but potentially much to be gained in terms of efficiency.  Unfortunately, in the time available, it was not possible to integrate the load balancing algorithm into DL_POLY_4.  Both codes are complex in their own ways and so the task of incorporating the one into the other is highly non-trivial.  Moreover, unforeseen design issues arose in the latter stages of the load balancer development that demanded attention, and it was deemed more important to make sure that the prototype was, as far as we could be sure, bug free and operating as efficiently as possible than to integrate a defective prototype into DL_POLY_4 prematurely.  However, some preliminary work has been done to outline how the integration might be implemented.

      The remainder of this report is divided into four further sections.  In section 2 we give an overview of the architecture of the load balancer code.  Section 3 presents the results of some of the performance testing that was carried out.  In section 4 we discuss weaknesses in the existing code and steps that might be taken to improve its effectiveness.  Section 5 summarises and draws conclusions from the work presented in the earlier sections.

 

 

2. Code overview  

 

The load balancer source code is written in Fortran 90 and message passing is implemented using MPI calls.  A serialised version of the source code has been put through the Forcheck application on the NAG CSE machine, softeng.cse.rl.ac.uk, to check for compliance with the F90 standard and weed out dubious programming practices.  In terms of organisation the top level program consists of three stages, presented in summary form in figure 2.

      In stage 1 the preliminary steps common to many simulation codes are performed, such as the allocation of permanent arrays, definitions of constants, reading of user defined directives from an input file and setting up of the initial atom configuration.  We will not go into the details here; a comprehensive overview of the top level subroutines that implement these steps can be found in the documentation that accompanies the source code.  In addition the header and embedded comments in each subroutine provide further detail.

      Stage 3 of the program is  primarily concerned with the output of diagnostic data. The diagnostics include checks of global cell and atom tallies, local and global time averages for each primary subroutine (useful for inferring bottlenecks and disparities in work load) and also the amount of memory allocated to the major dynamic arrays used by the program.

      The meat of the code is located within stage 2, the time stepping loop, where the bulk of the communication and computational work is done.  The role of each major block of code in the time stepping loop will now be outlined.

Figure 2.  Summary of the key procedures carried out in the load balancing program.  Within the time stepping loop, nstep is the step index and num_steps the total number of steps to be executed.

      Program Main

 

             Stage 1: Preliminaries

 

             Initialise MPI

             Make preliminary definitions

             Open diagnostic files

             Read user input file

             Setting up the initial atom configuration

             Construction of domain and cell maps and neighbour lists

             Allocation of permanent bookkeeping arrays

             Partitioning of the MD cell into domains

             Subdivision of the domains into cells

 

             Stage 2: Time stepping loop

 

             Do nstep=1, num_steps, 1

 

                   Get required cell populations in order to determine initial load imbalance

                    Load balancing, i.e. reallocation of cells if work loads are sufficiently imbalanced

                   Get locations of cells required for local forces calculations

                   Import atoms required for local force calculations

                   Calculate total forces acting on on-core atoms

                   Export opposite force components acting on off-core atoms

                   Atom translations

                   Reallocate atoms that have crossed cell boundaries to their new cells

 

                   Check local cell and atom tallies

                   Optional measurement of ρ (a parameter used in the work load estimate) from timing data

                   Option to restore all cells to their original cores

                   Wind up current time step

 

             End Do

 

             Stage 3: Wind up simulation

 

             Check global cell and atom tallies

             Output diagnostic data

             Deallocation of permanent arrays

 

      End Program Main

 

 

Get required cell populations in order to determine initial load imbalance

The estimate of the relative overall work load, W, on each core is given by the following formula:

 

W = W1 + ρW2            (1)

 

Here W1 is the number of 2-body interactions that need to be evaluated by that core and W2 is the number of atom coordinates that need to be imported from other cores to complete the 2-body force calculations.  The parameter ρ is the ratio of the average cost of importing a single atom to the average cost of evaluating a single pair interaction.  The calculation of W is made separately on each processor; then, at the start of the load balancing subroutine, the results are pooled and the global work distribution thus obtained.
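
As a rough illustration of this step, the sketch below forms the estimate of equation (1) on each core and pools the results so that the imbalance criterion (described under 'Load balancing' below) can be tested.  It is a minimal Fortran 90/MPI sketch only; the variable names and the placeholder values of W1 and W2 are illustrative and are not those used in the prototype.

      ! Minimal sketch (not the prototype code): each core forms the work
      ! load estimate of equation (1) and the estimates are pooled so that
      ! the imbalance criterion can be tested.
      program workload_sketch
        use mpi
        implicit none
        integer :: ierr, rank, nprocs
        real(kind=8) :: w1, w2, w, mean_w, sd_w
        real(kind=8), allocatable :: w_all(:)
        real(kind=8), parameter :: rho = 25.0d0   ! assumed import/pair cost ratio
        real(kind=8), parameter :: tol = 0.05d0   ! imbalance threshold

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! Placeholder counts; in the prototype these come from the required
        ! cell populations gathered in the preceding step.
        w1 = real(100000 + 10000*rank, kind=8)   ! pair interactions to evaluate
        w2 = real(2000 + 200*rank, kind=8)       ! atom coordinates to import

        w = w1 + rho*w2                          ! equation (1)

        ! Pool the estimates so that every core sees the global distribution.
        allocate(w_all(nprocs))
        call MPI_Allgather(w, 1, MPI_DOUBLE_PRECISION, w_all, 1, &
                           MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)

        mean_w = sum(w_all)/real(nprocs, kind=8)
        sd_w   = sqrt(sum((w_all - mean_w)**2)/real(nprocs, kind=8))

        if (rank == 0 .and. sd_w/mean_w > tol) then
           print *, 'relative imbalance', sd_w/mean_w, ': trigger load balancing'
        end if

        deallocate(w_all)
        call MPI_Finalize(ierr)
      end program workload_sketch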

      Now in order to perform the work load calculation, the local core must know the populations of each of its required cells.  The term required cell has a specific meaning that derives from the way 2-body interactions are calculated in the prototype load balancer.  Consider two atoms i and j located in cells A and B respectively which are within the cut-off distance of one another as shown in figure 3(a).

 

 

The force on j due to i is equal and opposite to the force on i due to j.  Therefore we can choose to calculate only the force component fij and simply add the opposite of this, fji = -fij, to the net force on j.  The upshot of this is that to complete the force calculations for the atoms in a given 'central' cell, we do not require the coordinates of the atoms in all the surrounding cells within the cut-off of that cell, only those within a 'semi halo', as it were, as shown in figure 3(b); these are the so-called required cells.

      Now some of the required cells for a given central cell allocated to a particular core may well be allocated to a different core; they are said to be off-core.  The coordinates of atoms within required cells that are off-core will therefore need to be imported.  Given that load balancing causes cells to be arbitrarily moved from core to core, the task of finding out which core a particular cell is located on, and the subsequent acquisition of the coordinates of the atoms within that cell, presents special difficulties.  The strategy used to solve this problem is to periodically update the home core (the core responsible for the domain on which a cell is physically located) with information about its so-called native cells, i.e. cells physically located on that core's domain.  The type of information that needs to be stored on home cores is typically the current hosts of its native cells (the cores they are currently allocated to) and their populations.  Any core requiring information about an off-core cell, wherever it may be located, contacts the home core of that cell to acquire it.  This 'forwarding address' paradigm is predicated on the fact that although a cell could be allocated to any core in the system, given the identity of the cell (specified within the code by a unique global index) its physical location, and therefore its home core, can always be determined.  A scheme based on this idea is used to furnish the required cell populations.  The scheme is too involved to describe fully here; further details are contained in the documentation accompanying the source code.  A crucial aspect of the scheme that should be mentioned, however, is that the transfer of information within it is accomplished by highly selective point-to-point message passing.  This sort of efficiency is essential if the benefits of load balancing are not to be outweighed by the extra communication overheads entailed by moving data off local cores.  Similarly, global communications must be kept to a minimum because their use is likely to undermine the scalability of the algorithm.
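
The property that underpins the forwarding address paradigm, namely that a cell's home core can always be recovered from its global index, amounts to a little integer arithmetic.  The sketch below shows one way this mapping might look; the particular indexing convention (x varying fastest) and the domain and cell counts are assumptions made for illustration and are not necessarily those used in the prototype.

      ! Minimal sketch: recover a cell's home core from its global index.
      program home_core_sketch
        implicit none
        integer, parameter :: ndx=4, ndy=4, ndz=4   ! domains per direction
        integer, parameter :: ncx=5, ncy=5, ncz=5   ! cells per domain per direction
        integer :: gcell, ix, iy, iz, dx, dy, dz, home

        do gcell = 0, ndx*ncx*ndy*ncy*ndz*ncz - 1, 997
           ! global cell coordinates (x varying fastest)
           ix = mod(gcell, ndx*ncx)
           iy = mod(gcell/(ndx*ncx), ndy*ncy)
           iz = gcell/((ndx*ncx)*(ndy*ncy))
           ! domain coordinates follow from integer division
           dx = ix/ncx
           dy = iy/ncy
           dz = iz/ncz
           ! home core = rank of the core owning that domain
           home = dx + ndx*(dy + ndy*dz)
           print '(a,i8,a,i6)', 'cell ', gcell, ' -> home core ', home
        end do
      end program home_core_sketch

Because the mapping is pure arithmetic, no communication is needed to identify a cell's home core; communication is only needed to ask that home core where the cell is currently hosted.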

Load balancing

Having obtained the populations of its required cells, each core now makes an estimate of its overall work load.  The work loads of all cores are then pooled via a global summation and each core calculates the mean and standard deviation of this work load distribution.  If the ratio of the standard deviation to the mean is above a desired threshold then load balancing is implemented.  The first step in performing load balancing is to rank the cores in descending order of work load.  The busiest cores are then paired up with the least busy cores.  Within each of these partnerships, the busy core sends a single cell to the less busy one.  Each core then reassesses its work load (taking into account the redistribution of atoms that has occurred as a result of the cell transfer) and the two partners share this information.  If the ratio of the difference between the two partners' work loads to their average work load remains above a specified threshold then another cell transfer occurs.  This procedure is repeated until the required balance is reached between the two partners, at which point the load balancing loop is terminated for these two cores.  A single round of load balancing in a given time step will usually only balance the work loads within each pair of partner cores.  However, after several rounds of load balancing, in which different sets of partnerships are set up, the work load distribution will tend to become balanced globally.
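
The pairing and transfer logic just described can be summarised, in serial form, as in the sketch below.  This is a minimal sketch only: the work loads, the fixed amount of work carried by a single transferred cell and the stopping rule when a further transfer would overshoot are illustrative assumptions, whereas in the prototype the two partners re-evaluate their work loads from the actual cell populations after every transfer.

      ! Serial sketch of the pairing logic: rank cores by estimated work
      ! load, pair the busiest with the least busy and move one cell's
      ! worth of work per iteration until the pairwise imbalance is within
      ! tolerance or a further single-cell transfer would overshoot.
      program pairing_sketch
        implicit none
        integer, parameter :: p = 8
        real(kind=8) :: w(p), cell_work, tol
        integer :: order(p), i, j, k, busy, idle

        w = (/ 900.d0, 120.d0, 80.d0, 300.d0, 60.d0, 700.d0, 150.d0, 90.d0 /)
        cell_work = 40.d0    ! work carried by one transferred cell (illustrative)
        tol = 0.05d0         ! pairwise imbalance tolerance

        ! crude descending sort of core indices by work load
        order = (/ (i, i=1,p) /)
        do i = 1, p-1
           do j = i+1, p
              if (w(order(j)) > w(order(i))) then
                 k = order(i); order(i) = order(j); order(j) = k
              end if
           end do
        end do

        ! pair busiest with least busy and transfer cells one at a time
        do i = 1, p/2
           busy = order(i)
           idle = order(p+1-i)
           do while (abs(w(busy)-w(idle))/(0.5d0*(w(busy)+w(idle))) > tol &
                     .and. w(busy)-w(idle) > cell_work)
              w(busy) = w(busy) - cell_work    ! one cell leaves the busy core
              w(idle) = w(idle) + cell_work    ! ...and arrives at its partner
           end do
        end do
        print '(8f8.1)', w
      end program pairing_sketch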

 

Get locations of cells required for local forces calculations

Before the force calculation loop can proceed, each core needs to import the cell data required to evaluate all of its pair interactions.  In an ordinary (non load balanced) DD based MD simulation, this is achieved by each core receiving halo data from the cores responsible for domains adjacent to its own.  But if load balancing has been implemented, and hence cells transferred away from their home cores, this is no longer sufficient.  A core can no longer assume that the atoms it requires are in the possession of the halo cores; it is therefore necessary for it to determine which cores currently host the cells it requires and to request the coordinates of those cells' atoms from those hosts.  A specialised scheme, based on the forwarding address paradigm described earlier and similar to the one used to acquire cell populations, is implemented to achieve this.  Again, the details of the scheme are too complex to relate here, but the result of it is that each core acquires the current locations of its required cells from the home cores of those cells.  In the course of this procedure the home cores also inform the host cores of how many requests for cells they can expect to receive.  This is essential since without this information the host cores would not know how many receives to post to accommodate all such requests.  The anticipation of the numbers of messages that cores can expect to receive was in fact the most challenging aspect of the program design.
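
To illustrate why the expected-request count matters, the sketch below shows a generic MPI pattern in which a host, once told how many requests to expect, posts exactly that many receives from any source and answers them in turn.  This is shown purely for illustration: the counts, tags and the toy client/host split are assumptions, and the prototype's own routines, which are more selective, are described in the accompanying documentation.

      ! Generic sketch: a host that knows how many cell requests to expect
      ! can service them with a simple loop of receives from any source.
      program expected_requests_sketch
        use mpi
        implicit none
        integer :: ierr, rank, nprocs, n_expected, k, req_cell
        integer :: status(MPI_STATUS_SIZE)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        if (rank == 0) then
           ! Host side: in the prototype n_expected is supplied by the home
           ! cores; here every other rank simply sends one request.
           n_expected = nprocs - 1
           do k = 1, n_expected
              call MPI_Recv(req_cell, 1, MPI_INTEGER, MPI_ANY_SOURCE, 0, &
                            MPI_COMM_WORLD, status, ierr)
              ! ...look up the requested cell and reply with its atom data...
           end do
        else
           ! Client side: request a (placeholder) global cell index from rank 0.
           req_cell = 42
           call MPI_Send(req_cell, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
        end if

        call MPI_Finalize(ierr)
      end program expected_requests_sketch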

 

Import atoms required for local force calculations

At this stage, then, each core in its capacity as a client (a term loosely used to describe a core that is requesting information) knows where to acquire its required cells and, in its capacity as a host, knows how many requests to expect from other cores requesting their required cells.  The required atom coordinates are imported and the force loop proceeds.

 

Calculate total forces acting on on-core atoms

Within the forces calculation loop as it stands, only 2-body interactions in the form of a 12-6 Lennard-Jones potential are evaluated.  Clearly this falls well short of the functionality of a multi-purpose MD code such as DL_POLY_4, with its plethora of both inter- and intra-molecular interactions.  The justification for this is that to develop a load balancing algorithm from scratch that incorporated the full gamut of molecular interactions would have been an extremely complex task and certainly very much more difficult to debug.  Therefore this prototype program was developed with only as much functionality as was considered sufficient to demonstrate that (i) the algorithm is technically robust and (ii) it has the potential to improve the efficiency of more realistic MD simulations.  The forces loop, even in its current rudimentary state, should provide a reasonable representation of the level of computational work load that cores have to contend with in MD simulations.  In other words it provides a reasonable basis for testing the performance of the load balancing algorithm.
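
For reference, the sketch below shows the form of 12-6 Lennard-Jones force evaluation assumed throughout this report, with each pair visited once and the equal and opposite component accumulated directly, as described in the discussion of required cells.  It is a serial sketch: the parameter values and the all-pairs loop (rather than the cell-based loop used in the prototype) are simplifications for illustration.

      ! Serial sketch of the 12-6 Lennard-Jones pair force with a cut-off,
      ! accumulating fji = -fij so that each pair is evaluated only once.
      program lj_sketch
        implicit none
        integer, parameter :: n = 64
        real(kind=8), parameter :: eps = 1.0d0, sigma = 0.2d0, rcut = 0.499d0
        real(kind=8) :: x(3,n), f(3,n), dr(3), r2, sr2, sr6, sr12, fscal
        integer :: i, j

        call random_number(x)      ! arbitrary positions in a unit box
        f = 0.0d0

        do i = 1, n-1
           do j = i+1, n
              dr = x(:,i) - x(:,j)
              r2 = sum(dr*dr)
              if (r2 < rcut*rcut) then
                 sr2   = sigma*sigma/r2
                 sr6   = sr2**3
                 sr12  = sr6*sr6
                 fscal = 24.0d0*eps*(2.0d0*sr12 - sr6)/r2   ! |f|/r for the 12-6 form
                 f(:,i) = f(:,i) + fscal*dr                 ! force on i due to j
                 f(:,j) = f(:,j) - fscal*dr                 ! equal and opposite on j
              end if
           end do
        end do
        print '(a,3es12.4)', 'net force on atom 1: ', f(:,1)
      end program lj_sketch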

 

Export opposite force components acting on off-core atoms

Recalling the earlier discussion of the concept of required cells, following the force calculations it is necessary for the equal and opposite force components acting on atoms whose coordinates were imported to be sent back to the cores from which they came.  This completes the evaluation of the net forces acting on each atom in the system.

 

Atom translations

In an MD simulation proper, the next stage after the forces loop would be for each core to integrate the equations of motion for the atoms allocated to it.  We do not do this here; instead, simple arbitrary translations are applied to the atoms.  There is the option to apply random displacements in randomly chosen directions or to move each atom by a fixed distance along a specified direction.  Again, the justification for this rudimentary approach is that the purpose of this program is not to perform fully fledged molecular simulations; rather it is to provide a test bed for the load balancing algorithm.  The facility for setting up model configurations and evolving them in simple, controllable ways is part of that process.
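
Both translation modes are trivial to express; the sketch below is a minimal serial illustration with assumed values for the fixed shift and the maximum random displacement (the values actually used in the tests are given in section 3).

      ! Sketch of the two non-physical translation modes: a fixed shift along
      ! a given direction, or random displacements capped at dmax per component.
      program translate_sketch
        implicit none
        integer, parameter :: n = 10
        real(kind=8) :: x(3,n), shift(3), rnd(3,n), dmax
        logical :: use_random

        call random_number(x)
        use_random = .true.
        shift = (/ 0.01d0, 0.0d0, 0.0d0 /)   ! fixed drift in +x (illustrative)
        dmax  = 0.05d0                       ! cap on a random displacement component

        if (use_random) then
           call random_number(rnd)                    ! components in [0,1)
           x = x + dmax*(2.0d0*rnd - 1.0d0)           ! components in [-dmax,dmax)
        else
           x = x + spread(shift, dim=2, ncopies=n)    ! same shift for every atom
        end if
        print '(3f10.4)', x(:,1)
      end program translate_sketch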

 

Reallocate atoms that have crossed cell boundaries to new cells

Ordinarily, following atom translations in DD based MD simulations, atoms only have to be reassigned to different cores when they cross domain boundaries.  These cores will always be the ones responsible for the adjacent domains.  However, if cells have been transferred away from their home cores, as they have when load balancing has been implemented, any atom that crosses a cell boundary must be considered a candidate for reassignment, since the neighbouring cell into which it has moved could currently be allocated to any core on the machine.  To account for this we again have recourse to a scheme closely related to the 'forwarding address' paradigm.  Essentially what happens is that each home core acquires the coordinates of all the atoms that have crossed the boundaries of its native cells (cells that are physically located on the home core's domain).  Some of these atoms, as well as crossing cell boundaries, will have crossed into the domains adjacent to the home domain.  The home core sends the coordinates of such atoms to the home core of the appropriate adjacent domain.  Each home core then sends out any atoms that now happen to lie on any of its absent cells (native cells that were reallocated to other cores by the load balancer) to the current hosts of those cells.  In this way each core acquires the complement of atoms that, following translations, are physically located on the cells that are currently allocated to it.

 

Check local cell and atom tallies

In order to make sure nothing has gone awry in the bookkeeping procedures used to move cell data around the system and keep track of its whereabouts, certain tallies and data fields are checked to ensure consistency in the information they contain.

 

Optional measurement of ρ from timing data

Recall that the variable ρ is the ratio of the average cost of importing a single atom to the average cost of evaluating a single pair interaction.  In most of the testing work carried out, the value assigned to this variable was simply a reasonable guess.  In the earlier stages of the code development, assigning an accurate value to it was not regarded as being overly critical, since typically the estimated number of pair interactions in equation (1) dwarfs the number of atoms to be imported and thus the load balancing is dominated by the former term.  Hence a reallocation of cells resulting in approximately the same numbers of atoms on each core generally resulted in a reasonable load balance.  Nevertheless it would clearly be preferable to obtain a more accurate value for ρ and thus a better work load estimate.  With this in mind, in the latter stages of the code development, attempts were made to obtain an empirical value for ρ based on timings of the subroutines in which atom coordinates are imported and forces exported (before and after the forces loop) and the timing for the forces loop itself.  Thus an empirical value of ρ was calculated as follows:

ρ_measured = <import cost> / <pair cost>

 

where  the average cost associated with the import of an atom coordinate and the subsequent export of the corresponding equal and opposite force to the host of that atom is given by

 

<import cost> = (t_import + t_export) / N_imports

 

where t_import is the time spent in the subroutine responsible for importing atoms, t_export the time spent in the subroutine which exports forces and N_imports is the number of atoms imported.

The average cost of evaluating a pair interaction is given by

 

<pair cost> = t_forces / N_pairs

 

where t_forces is the time spent in the forces loop and N_pairs is the number of pair interactions evaluated within the loop.

 

The intention was to use ρ_measured in the load balancing subroutine in the following time step, with the aim of obtaining a more accurate estimate of the core work load and thus a more effective data distribution across the hardware.  Unfortunately it was found that the timings for the import atoms and export forces subroutines were subject to large fluctuations, both from time step to time step and from core to core.  When the measured value was fed into the work load estimate calculation, it almost always resulted in a worsening of performance, presumably due to some form of feedback mechanism: if for some reason at a particular time step the value of t_import happens to increase dramatically, then in the next time step the relative cost of importing atoms will be overestimated and the load balance skewed as a result.  Consequently, in the performance tests carried out, we had no option but to keep using a reasonable guess for ρ in the work load estimation.
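
The measurement itself amounts to bracketing the three routines with wall-clock timers.  The sketch below shows the shape of such a measurement using MPI_Wtime; the empty stand-in subroutines and the placeholder counters are assumptions made purely for illustration.

      ! Sketch of the empirical rho measurement: time the import/export
      ! routines and the forces loop, then form the two average costs.
      program rho_sketch
        use mpi
        implicit none
        integer :: ierr, n_imports, n_pairs
        real(kind=8) :: t0, t_import, t_export, t_forces
        real(kind=8) :: import_cost, pair_cost, rho_measured

        call MPI_Init(ierr)

        t0 = MPI_Wtime();  call import_atoms();   t_import = MPI_Wtime() - t0
        t0 = MPI_Wtime();  call forces_loop();    t_forces = MPI_Wtime() - t0
        t0 = MPI_Wtime();  call export_forces();  t_export = MPI_Wtime() - t0

        n_imports = 5000      ! atoms imported this step (placeholder)
        n_pairs   = 200000    ! pair interactions evaluated this step (placeholder)

        import_cost = (t_import + t_export)/real(n_imports, kind=8)
        pair_cost   = t_forces/real(n_pairs, kind=8)
        if (pair_cost > 0.0d0) then
           rho_measured = import_cost/pair_cost   ! candidate rho for the next step
           print *, 'rho_measured =', rho_measured
        end if

        call MPI_Finalize(ierr)

      contains

        ! Empty stand-ins for the routines actually timed in the prototype.
        subroutine import_atoms()
        end subroutine import_atoms

        subroutine forces_loop()
        end subroutine forces_loop

        subroutine export_forces()
        end subroutine export_forces

      end program rho_sketch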

 

Option to restore all cells to their original cores

This procedure has two purposes.  Firstly, it enables us to reverse the effects of load balancing.  This would be particularly useful if the system of interest is initially non-uniform in its atom distribution but in time evolves into one that is more or less isotropic.  Secondly, it provides an opportunity for a very robust check that the load balancer has operated without error.  After all cells have been returned to their home cores, a check of the coordinates of the atoms they contain should show that they all lie within the bounds of the domain associated with that core.  Any exception to this would indicate that at some point atom data had been allocated to the wrong cell and/or core.
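
A minimal sketch of that consistency check is given below; the domain bounds and the randomly generated positions are placeholders, whereas in the prototype the test is applied to the atoms actually held by each core after the restore.

      ! Sketch of the post-restore check: every atom now held locally must
      ! lie inside the local domain's bounds.
      program restore_check_sketch
        implicit none
        integer, parameter :: n = 1000
        real(kind=8) :: x(3,n), dlo(3), dhi(3)
        integer :: i, nbad

        call random_number(x)
        dlo = (/ 0.0d0, 0.0d0, 0.0d0 /)   ! lower corner of the local domain
        dhi = (/ 1.0d0, 1.0d0, 1.0d0 /)   ! upper corner of the local domain

        nbad = 0
        do i = 1, n
           if (any(x(:,i) < dlo) .or. any(x(:,i) >= dhi)) nbad = nbad + 1
        end do
        if (nbad > 0) then
           print *, 'bookkeeping error: ', nbad, ' atoms outside the home domain'
        else
           print *, 'all atoms lie within the home domain: restore consistent'
        end if
      end program restore_check_sketch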

 

Wind up this step

To conclude each time step, temporary arrays allocated for the current time step are deallocated.  The total time taken for the step is also measured and the average step time accumulated; the average step time is the primary observation by which we assess the effectiveness of the load balancer.

3. Performance test results

 

We now present the results of performance tests of the load balancer under various conditions.  There is a rather bewildering range of parameters that may affect how the load balancer performs.  On the one hand we have the physical properties of the MD system under consideration: its degree of inhomogeneity in terms of atom density, the total number of atoms in the system, the computational demands of the particular interaction(s) between the atoms and the cut-off distance(s) for the interaction(s).  Then we have the hardware constraints: the total number of processors available as well as the particular characteristics of the processors, e.g. the clock speed, bandwidth, latency and cores per processor; the choice of compiler and compiler flags may also be a significant factor.  In addition there are what we might term the operational parameters of the load balancer itself, such as the choice of cell partitioning, the value of ρ and how finely we want the work loads to be balanced.  In an MD simulation proper some of these would be predetermined, such as the potential cut-off; however, for the purposes of testing the code we are free to choose any values within reason.  It would be a huge undertaking to explore this parameter space exhaustively, so here we confine ourselves to a few simple types of initial atom configuration, systematically alter the number of cores on which these systems are run and also repeat the tests for a number of different cut-off values.

      In all tests, the principal measure of performance is the speed-up, i.e. the ratio of the average step time for a run carried out without load balancing to the average step time for an equivalent run with load balancing switched on.  All tests reported here were carried out on HECToR Phase 2b (XE6) processors on 14-18/03/2011.  Note that each XE6 processor has 24 cores.  The source code was compiled and linked using the PGI compiler with the flags -O3 and -fastsse (the same as the flags used for compiling DL_POLY_4 on HECToR).  In all tests ρ was set at a value of 25.0.  The tolerance for the load balance (the target value for the ratio of the difference between the two partners' work loads in the load balancing subroutine to their average work load) was set at 0.05.  Most of the tests were run for 100 time steps (from time to time the results were checked against those obtained from equivalent longer runs to ensure that 100 steps was indeed a long enough test to give representative results).

 

System 1

In the first set of tests, one octant of the cubic MD cell was occupied with atoms, arranged on a lattice, whilst the remaining volume was completely empty.  This is a rather artificial set-up but its simplicity makes it easier to interpret the results of the tests.  The volume of the MD cell, V, was systematically increased in direct proportion to the number of cores, P, on which the system was run.  The fraction of the MD cell that was occupied thus remained constant at 1/8.  The spacing between the atoms in the occupied region was fixed so that the number of atoms in the system also increased in proportion to the system volume.  The choice of domain subdivision is completely at the user's discretion and there is no hard and fast rule for the ideal number of cells.  However, as a general rule of thumb, we regard domain subdivisions of 4x4x4=64 cells or 5x5x5=125 cells as being sensible choices.  If the cell mesh is too coarse, then the packets of atom data that are moved from core to core by the load balancer will be too large to permit an even distribution of work loads.  If on the other hand the subdivision is too fine, then large numbers of cells will need to be reallocated in order to achieve load balance.  The undesirable side effect of this is that the atom data, since it is divided into many small packets, will tend to become highly dispersed across the machine, and so the communication overhead associated with the acquisition of off-core data will be large and result in poor performance.

The same set of tests was carried out for a number of different cut-off distances, rcut, for the Lennard-Jones pair interaction.  The smallest cut-off used was 0.499 arbitrary units (a.u.), just under 2.5 times the atom spacing, which was set at 0.2 a.u.  The choice of this minimum cut-off value was based on the fact that typically in MD simulations the Lennard-Jones potential is truncated at around 2.5σ, where σ is the atom-atom separation at which the potential is zero.  The potential minimum, which physically corresponds to the equilibrium separation and here equates to the lattice spacing, occurs at a slightly larger separation of 2^(1/6)σ or ~1.122σ.  Thus the smallest cut-off value of 0.499 a.u. corresponds roughly to the minimum cut-off that is likely to be used in a real MD simulation.  However, other types of interactions that might be active, such as metal interactions and certainly electrostatic interactions, would have larger cut-offs; for this reason we repeated the tests for larger cut-off values.  In this set of tests, no translations were applied to the atoms because, first and foremost, we wished to understand how the load balancer copes with a given unbalanced atom configuration given a particular domain decomposition and cut-off.  Having a static configuration makes the interpretation of the results of the tests less complicated.  The complete set of parameters used for the System 1 tests is listed in table 1.  The timings for the tests with load balancing off and on are tabulated in the appendix at the end of the report; figure 4 shows the resultant speed-ups.

 

Table 1. Parameters for the System 1 performance tests.  These tests were carried out for four different cut-off values: 0.499, 0.999, 1.999 and 2.999 a.u.

 

P                                8        64        216       512       1000      1728      2744
N                                15625    125000    421875    1000000   1953125   3375000   5359375
% volume occupied                12.5     12.5      12.5      12.5      12.5      12.5      12.5
MD cell dimensions/[a.u.]        10.0     20.0      30.0      40.0      50.0      60.0      70.0
Domain decomposition             2x2x2    4x4x4     6x6x6     8x8x8     10x10x10  12x12x12  14x14x14
Domain dimensions/[a.u.]         5.0      5.0       5.0       5.0       5.0       5.0       5.0
Population of occupied domains   15625    15625     15625     15625     15625     15625     15625
Cell subdivision                 5x5x5    5x5x5     5x5x5     5x5x5     5x5x5     5x5x5     5x5x5
Cell dimensions/[a.u.]           1.0      1.0       1.0       1.0       1.0       1.0       1.0
Population of occupied cells     125      125       125       125       125       125       125

 

Figure 4. Speed-ups measured for the System 1 performance tests (vertical axis: speed up).

In nearly all the System 1 tests we observe good speed-up values in the range 6.0 to 8.0, the latter value being the theoretical maximum given the 1/8 fraction of the MD cell that is occupied by atoms.  The only System 1 test in which load balancing delivers a relatively poor speed-up is the one run on the largest core count with the largest cut-off.  We believe the reason for this anomaly is that in this test the total number of cells into which the occupied region is divided is at its largest: it will be P/8 multiplied by the number of cells into which each domain is subdivided, so for the P=2744 system the number of occupied cells is 343x125=42875.  Many of these cells will be allocated to other cores by the load balancer, resulting in much dispersion of the atom data.  This, coupled with the large cut-off, will incur a heavy communication penalty in order for cores to import required cells and export forces.  The result, we suspect, is that the communications network becomes overloaded before and after the forces loop (where atoms are imported and forces exported), hence the inferior speed-up value.  This suspicion was given credence by the fact that when the test was rerun using the same number of cores overall, but with the number of cores per processor set at 12 as opposed to the full complement of 24 on each XE6 processor, the speed-up rose from 1.68 to 7.12.  Presumably the act of running on fewer cores per processor increased the amount of bandwidth available per core, thus enabling them to better cope with the heavy communication costs.  We also note that for all the cut-offs the speed-up tends to peak at P=216 and then declines slowly with increasing core counts.  Again this is attributable to the fact that, in this mode of testing, as we increase the core count the number of occupied cells rises along with the number of cores available onto which they can be moved by the load balancer.  As a consequence the atom data tends to become increasingly dispersed across the hardware and the number of communications necessary to acquire off-core data rises.

 

System 2

In the second round of performance tests only a single domain was occupied with atoms; the atom spacing was again set at 0.2 a.u. and the atoms were held static.  Since only a single domain is occupied, as P, and with it the number of domains, increases, the fraction of the MD cell containing atoms falls in direct proportion.  The domains were again each subdivided into 125 cells.  The complete set of parameters used for the System 2 tests is listed in table 2.  The timings for the tests with load balancing off and on are tabulated in the appendix at the end of the report; figure 5 shows the resultant speed-ups.

 

Table 2. Parameters for the System 2 performance tests.  These tests were carried out for four cut-off values: 0.499, 0.999, 1.999 and 2.999 a.u.

 

 

P                                8        27       64       125      216      512
N                                15625    15625    15625    15625    15625    15625
% volume occupied                12.50    3.70     1.56     0.80     0.46     0.20
MD cell dimensions/[a.u.]        10.0     15.0     20.0     25.0     30.0     40.0
Domain decomposition             2x2x2    3x3x3    4x4x4    5x5x5    6x6x6    8x8x8
Domain dimensions/[a.u.]         5.0      5.0      5.0      5.0      5.0      5.0
Population of occupied domains   15625    15625    15625    15625    15625    15625
Cell subdivision                 5x5x5    5x5x5    5x5x5    5x5x5    5x5x5    5x5x5
Cell dimensions/[a.u.]           1.0      1.0      1.0      1.0      1.0      1.0
Population of occupied cells     125      125      125      125      125      125

 

 

We find that the peak speed-up occurs at P=125 or P=216 and then slowly declines with higher core counts.  This is not so surprising: once the number of cores matches the number of occupied cells available for reallocation, we would not expect any further benefit from increasing the core count, since the number of cores that can be allocated a populated cell can never exceed 125.  It is not obvious, therefore, why the peak speed-up should occur at P=216 rather than P=125 for the rcut=0.499 and rcut=0.999 systems.  Nevertheless the overall message from this set of tests is fairly clear: as the dense-to-sparse volume fraction of the system diminishes there is, relatively speaking, less benefit to be had from load balancing, particularly at high processor counts, since in this regime the computational work load per processor will be relatively small and in addition the data will tend to become more widely dispersed across the machine.

 

Figure 5. Speedups measured for the System 2 performance tests.

 

System 3

In the third round of performance tests, the system set-up was as for System 1: a cubic MD cell with one octant occupied and the remaining volume empty.  This time, though, the physical size of the system was kept fixed whilst the number of cores was increased.  The atom spacing was 0.1 a.u. in this set of tests.  The number of cells per domain was again 125 and once again the atoms remained static throughout.  The timings for the tests with load balancing off and on are tabulated in the appendix at the end of the report; figure 6 shows the resultant speed-ups.

Table 3. Parameters for the System 3 performance tests.  These tests were carried out for four cut-off values: 0.249, 0.499, 0.999 and 1.999 a.u.

 

 

P                                8        27       64       125      216      512
N                                125000   125000   125000   125000   125000   125000
% volume occupied                12.5     12.5     12.5     12.5     12.5     12.5
MD cell dimensions/[a.u.]        10.0     10.0     10.0     10.0     10.0     10.0
Domain decomposition             2x2x2    3x3x3    4x4x4    5x5x5    6x6x6    8x8x8
Domain dimensions/[a.u.]         5.00     3.33     2.50     2.00     1.67     1.25
Population of occupied domains   125000   37037    15625    8000     4630     1953
Cell subdivision                 5x5x5    5x5x5    5x5x5    5x5x5    5x5x5    5x5x5
Cell dimensions/[a.u.]           1.000    0.667    0.500    0.400    0.333    0.250
Population of occupied cells     1000     296      125      64       37       16

The key characteristic of this test set-up is that as the core count increases, the number of atoms per core and per cell falls.  This in fact represents a very wasteful use of hardware, irrespective of whether or not we attempt load balancing.  The more cores we throw at this system, the lower will be the computational work load per core, but at the same time the communication overheads will increase as the cut-off takes in an increasing number of domains and cells.  This is reflected by the fact that, both for the non load balanced and load balanced tests, the average step time does not decrease significantly beyond core counts of 100 or so; indeed the step time actually increases for the rcut=1.999 load balanced test.  The message here is clear: when setting aside resources for parallel jobs, the magnitude of the computational work load that the job entails must always be considered.  It is much better to run many small-system simulations simultaneously on a few cores each than to run a single small job on many cores.

 

Figure 6. Speedups measured for the System 3 performance tests.

 

System 4

In the final sets of tests we made an effort to set up scenarios more in keeping with a real MD simulation.  The test systems consisted of a spherical region of dense atom concentration of radius 1.25 a.u. within a cubic MD cell of size 4.0 a.u., divided into 4x4x4=64 cubic domains of size 1.0 a.u.  Thus all tests were run on 64 cores.  Note that the volume of the sphere relative to that of the MD cell is approximately 1/8, in other words similar to the value for the single occupied octant set-ups of the System 1 and System 3 tests described earlier.  The cell subdivision was also 4x4x4 and thus the cell size 0.25 a.u.

      In one set of tests, 'Sphere 1', the atom spacing was set at 0.1 a.u., which resulted in an atom population of 8144 in the spherical region, a relatively small number of atoms for a run carried out on 64 cores.  The cut-off used for this set of tests was 0.249 a.u.  In a second set of tests, 'Sphere 2', the atom spacing was halved to 0.05 a.u., resulting in an atom count of 72327 within the sphere and thus a more substantial computational work load for the hardware.  The cut-off used in these tests was 0.1249 a.u., in keeping with the halving of the atom separation.  Several additional variations were applied to both the basic Sphere 1 and Sphere 2 set-ups, as follows:

 

1.       The atoms within the sphere remained static and the surrounding volume was devoid of atoms.

2.       The sphere was again surrounded by vacuum but each atom within it was moved in the positive x-direction at a rate of 0.01 a.u. every time step.  This obviously allows us to see how the load balancer copes when the atoms in the system are moving across cell and domain boundaries en masse.

3.       The sphere as a whole remained in place and was surrounded by vacuum, but the atoms within it were set in random motion, with the maximum displacement component per time step set at 0.10 and 0.05 a.u. for Sphere 1 and Sphere 2 respectively.

4.       The sphere of static atoms was surrounded by a sea of randomly placed atoms at a number density approximately 1% of that within the sphere.  This is a crude representation of the type of situation we might find in a simulation where a gas has condensed into a droplet of liquid surrounded by a vapour.

5.       As (4) but with all atoms in both the sphere and the surrounding 'vapour' subject to the same random motions as in (3).

6.       As (5) but run for a much greater number of steps so that by the end of the run the atom distribution was essentially isotropic.

 

The timings for the tests with load balancing off and on are tabulated in the appendix at the end of the report; figure 8 shows the resultant speed-ups.  Comparing the Sphere 1 and Sphere 2 tests, we see that the speed-ups for the Sphere 1 tests are much lower than for Sphere 2.  This reflects the fact that in the former the number of atoms is relatively small, around 125 per core, as opposed to the latter, where there are over 1000 atoms per core.  As noted earlier, in general the benefits of load balancing will be more pronounced when the computational work load per core is greater.

      We also observe that neither the imposition of atom translations nor the presence of a diffuse gas of atoms surrounding the dense spherical region has a marked effect on the speed-ups in either Sphere 1 or Sphere 2 systems.  We can therefore be more confident that the load balancing algorithm is likely to perform well under true simulation conditions where atoms are moving and where the density variations are not so clear cut as they were in the earlier tests.


Figure 8. Speedups measured for the System 4 performance tests.

      Perhaps the most encouraging results are shown in figures 9(a) and (b).  These are graphs of the time step durations plotted against step index for the variant 6 tests, in which random atom translations were applied for long enough for the system to take on a uniform density*.  By the end of the runs it is true that the step times for the load balanced tests are slightly longer than for the non load balanced runs.  This is undoubtedly due to the fact that the load balancing done in the early stages of the test, when the atom distribution was highly non-uniform, has resulted in a dispersion of the data which persists even when the system has become isotropic.  This will inevitably incur additional communication overheads over and above those necessary in the equivalent non load balanced run.  However, the differences between step timings at the end of the runs are relatively small, which suggests that the residual effects of load balancing need not be severe even when the simulation tends towards an isotropic state.  More generally, it suggests that the load balancing algorithm is fairly efficient.

 

Figure 9. Time step durations vs. time step index as measured in the variant 6 tests for (a) the Sphere 1 system and (b) the Sphere 2 system.

Furthermore, recall from the overview of the program in section 2 that the facility exists within the time stepping loop, at any point, to return all cells to their home processors, thus reversing the effects of prior load balancing.  It would be fairly simple to check automatically for the emergence of uniformity in the atom distribution (for example by periodically checking globally how many atoms are physically located on each domain) and hence roll back the load balancing at the appropriate juncture.  This lends further weight to the belief that, though load balancing inevitably leads to the dispersion of data and therefore an increase in the communications overhead, the benefits of its application are likely to outweigh its disadvantages in many situations.
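
A minimal sketch of such a uniformity test is given below: the per-domain atom counts are gathered and their spread compared with a threshold.  The counts, the threshold and the decision to roll back are illustrative assumptions rather than features of the current prototype.

      ! Sketch of the suggested uniformity test: gather the number of atoms
      ! physically located on each domain and, if the spread is small,
      ! trigger the existing 'restore all cells to home cores' option.
      program uniformity_sketch
        use mpi
        implicit none
        integer :: ierr, rank, nprocs, n_local
        integer, allocatable :: n_dom(:)
        real(kind=8) :: spread_ratio

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        n_local = 1000 + 5*rank    ! atoms physically located on this core's domain (placeholder)

        allocate(n_dom(nprocs))
        call MPI_Allgather(n_local, 1, MPI_INTEGER, n_dom, 1, MPI_INTEGER, &
                           MPI_COMM_WORLD, ierr)

        spread_ratio = real(maxval(n_dom) - minval(n_dom), kind=8) / &
                       (sum(n_dom)/real(nprocs, kind=8))
        if (spread_ratio < 0.1d0 .and. rank == 0) then
           print *, 'atom distribution near uniform: roll back load balancing'
        end if

        deallocate(n_dom)
        call MPI_Finalize(ierr)
      end program uniformity_sketch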

 

 

4. Further work

 

      There are two key areas in which the prototype load balancer could be improved.  Firstly, more sophisticated work load redistribution schemes could be devised, i.e. better algorithms for choosing which cells are moved and where they are moved to.  A desirable feature of such schemes would be that the amount of data moved off-core in order to achieve load balancing is kept to a minimum.  Also, when cells are reallocated, ones that are physically contiguous should be kept together on the same core as far as possible.  These steps will tend to minimise the dispersion of data and thus the amount of extra communication that has to take place in order for processors to locate and request all the data they require.

      The reallocation of cells is of course based on the estimated work loads for each core.  More effective load balancing therefore partly depends on more accurate work load estimation, and this is the second main area in which improvements could be made.  The current work load estimate is lacking in two respects.  Firstly, though it takes into consideration the number of atoms that are imported by each core prior to the forces loop, the number of atoms that are exported is not included in the work load estimate.  The number of imports can be determined exactly from the information already held on each core at this point.  Unfortunately the number of atoms that need to be exported, that is to say the number of atoms that will be requested by other cores, cannot be determined from local information.  Essentially this is because the requests come in from outside, and the local core's information regarding the distribution of data on other cores is not complete enough to make this prediction.  A rough and ready solution to this problem would be to use the number of atoms that were exported in the previous time step in the work load estimate for the current time step.  This might serve reasonably well as a first approximation, particularly if the amount of load balancing that took place in the previous step had been relatively modest, in which case the distribution of atoms across the cores would not have changed too drastically.

      A second major deficiency in the work load estimate is that it does not use an accurate value of ρ, the ratio of the cost of importing a single atom coordinate to that of evaluating a single pairwise interaction.  As described in section 2, attempts have been made to measure ρ separately on each core on the basis of timings of the subroutines concerned with the import/export of atoms and the forces loop.  These measurements have shown that the cost of evaluating a pairwise interaction is fairly consistent, both as a function of time and across all cores on the system.  This is due to the fact that execution of the forces loop is an entirely local operation.  The communications costs, on the other hand, fluctuate considerably depending upon the time step at which they were measured and the core on which they were measured.  The more one thinks about why this should be the case, the more one is drawn to the conclusion that it is because communication is at root a collective operation: the time spent by one core on message passing is inextricably connected with events on other cores.  As a result it has proved impossible to pin down a reliable value for ρ.  One might therefore conclude that the best way forward would be to dispense with ρ and the work load estimate altogether and simply determine the overall work load on each core by pure measurement.  However, this proposition ultimately leads us to the same conundrum: if a significant proportion of a core's overall work load involves communication (as it must do if data is moved off-core) then its value is never independent of its dealings with other cores in the system.  In other words it may actually be practically impossible to define individual work loads.  This is clearly a difficult issue which requires further research.

One of the original aims of the project was to incorporate the load balancer into DL_POLY_4.  Unfortunately we were not able to achieve this in the time available due to the complexity of the task.  However, with sustained effort, the integration of the two codes should be entirely feasible, and some thought has gone into how this might best be approached, at least for 2-body forces.  Augmentation of the existing load balancing algorithm to allow for the load balancing of other types of interatomic interactions besides two-body forces, such as metal potentials and 3- and 4-body interactions, should be fairly straightforward.  Extending the algorithm to encompass the load balancing of the electrostatic force calculations as well as intramolecular interactions would require intimate knowledge of how these features are set up in DL_POLY_4.  The integration of the load balancer into DL_POLY_4, a well established package with a wide user base, would constitute the ultimate test of its effectiveness, since the test scenarios would be real MD simulations as opposed to the rather artificial model systems on which the prototype load balancer has been tested.

Further testing of the code, along with more detailed scrutiny of the diagnostic data, would certainly be advisable.  The work carried out so far indicates that the main communication bottlenecks occur on either side of the forces loop, when atoms are imported and forces exported.  More detailed study might provide indications as to how matters could be improved.  It would also provide a clearer impression of the types of situation in which this form of load balancing is not viable and those in which it could provide the most benefit.  However, as noted above, the true measure of the load balancer's effectiveness must ultimately be how much it can speed up real MD simulations, i.e. DL_POLY_4 runs.  So although testing the prototype version is undoubtedly instructive, there can be no substitute for testing it once it is actually incorporated into DL_POLY_4.  If it does prove its worth in this context, it may well also be worth making the effort to incorporate the load balancing algorithm into other types of particle based simulation, such as dissipative particle dynamics (DPD).

 

5. Summary and conclusions

 

The load balancing algorithm we have developed has a number of shortcomings; in particular, the estimate of work loads that determines when load balancing should be initiated and when it should cease needs to be refined.  Clearly a more accurate work load estimate is a requirement for a better work load distribution.  Also, the algorithm that selects which cells are moved off-core and the new host cores to which they are reallocated could almost certainly be improved.  The main benefit of this would be a minimisation of the dispersal of data and thus a reduction in the communication overheads incurred by load balancing.

      Nevertheless, in the performance tests carried out so far, the code as it stands has produced fairly promising results.  In systems with a relatively high computational work load, i.e. expensive forces loops, the application of load balancing resulted in a good speed-up.  Conversely, it rarely resulted in a slow down and, even when load balancing had been applied to a test system that tended towards uniform density, the slow down was far from severe.

      One subject that we have not considered in the report so far, but that is certainly relevant, is the likely future development of parallel computers.  It is widely believed that the clock speeds of individual cores on parallel machines will not increase significantly; however, the number of cores per processor is set to rise, something that is likely to motivate greater efforts to improve the efficiency of the communications between cores.  If the computational power of individual cores is not going to improve, then there remains a strong impetus to redistribute work in situations where certain cores have heavier work loads than others.  On the other hand, as message passing becomes more efficient, the penalty for redistributing data should be lessened.  Both of these future developments would tend to play to the advantage of the load balancing algorithm described here.

      Finally, it is worth adding that no matter how well designed an MD program is, whether it is load balanced or not, the efficient use of parallel computing resources also depends in large part on the diligence of the user.  Every molecular system has its own particular characteristics, and these should be considered from a practical as well as a scientific perspective, so as to make good choices for the operational parameters of a simulation.

 

This project was funded under the HECToR Distributed Computational Science and Engineering (CSE) Service operated by NAG Ltd.  HECToR, a Research Councils UK High End Computing Service, is the UK's national supercomputing service, managed by EPSRC on behalf of the participating Research Councils.  Its mission is to support capability science and engineering in UK academia.  The HECToR supercomputers are managed by UoE HPCx Ltd and the CSE Support Service is provided by NAG Ltd.  http://www.hector.ac.uk.

 

 

 

Appendix: Performance test data

 

All timings quoted are the average step times in seconds as measured from 100 step runs unless stated otherwise.  Note that the timings for the System 3, rcut=1.999 a.u. tests on 216 and 512 cores are missing simply because the wall time allocated for the jobs was used up before the designated number of steps was completed.

 

System 1

              rcut=0.499 a.u.       rcut=0.999 a.u.       rcut=1.999 a.u.       rcut=2.999 a.u.
P             LB off     LB on      LB off     LB on      LB off     LB on      LB off     LB on
8             0.3066     0.0461     0.3881     0.0581     1.5267     0.2399     3.1259     0.4654
64            0.4271     0.0609     0.5298     0.0784     2.6310     0.3733     6.7656     0.8614
216           0.4837     0.0670     0.5912     0.0841     3.2028     0.4274     8.9271     1.1091
512           0.4897     0.0724     0.5964     0.0881     3.2277     0.4435     8.9432     1.1789
1000          0.4953     0.0745     0.6022     0.0936     3.2796     0.4800     9.1024     1.2640
1728          0.4971     0.0775     0.6025     0.0942     3.2616     0.4817     8.9987     1.2892
2744          0.4990     0.0795     0.6079     0.1020     3.3096     0.5029     9.1931     5.4773
2744 (rcut=2.999 a.u. only, rerun with 12 cores per processor)                  9.1931     1.2682

 

System 2

              rcut=0.499 a.u.       rcut=0.999 a.u.       rcut=1.999 a.u.       rcut=2.999 a.u.
P             LB off     LB on      LB off     LB on      LB off     LB on      LB off     LB on
8             0.3066     0.0462     0.3869     0.0579     1.5193     0.2385     3.0950     0.4594
27            0.3052     0.0197     0.3857     0.0244     1.5180     0.0970     3.0986     0.2006
64            0.3053     0.0143     0.3857     0.0173     1.5177     0.0668     3.0983     0.1354
125           0.3055     0.0129     0.3859     0.0154     1.5181     0.0580     3.0992     0.1095
216           0.3070     0.0111     0.3878     0.0129     1.5222     0.0592     3.1137     0.1114
512           0.3056     0.0118     0.3859     0.0133     1.5189     0.0636     3.0987     0.1161

 

System 3

              rcut=0.249 a.u.       rcut=0.499 a.u.       rcut=0.999 a.u.       rcut=1.999 a.u.
P             LB off     LB on      LB off     LB on      LB off     LB on      LB off     LB on
8             18.0160    2.6022     18.8400    2.6909     25.0370    3.4596     98.2720    15.1416
27            1.7435     0.2774     1.9637     0.3065     7.8583     1.2883     23.3460    4.2972
64            0.4278     0.06075    0.5323     0.07775    2.6434     0.3706     13.4240    1.7522
125           0.1369     0.02062    0.4399     0.06026    1.8093     0.2241     8.5814     2.4804
216           0.0511     0.01103    0.2180     0.05083    1.0779     0.1232     -          -
512           0.0145     0.007095   0.07317    0.01844    0.4822     0.1223     -          -

 

System 4

 

                              Sphere 1              Sphere 2
Test variant   time steps     LB off     LB on      LB off     LB on
1              100            0.00685    0.00319    0.244      0.0333
2              40000          0.00680    0.00376    0.245      0.0346
3              100            0.00632    0.00441    0.268      0.0364
4              100            0.00681    0.00396    0.244      0.0352
5              100            0.00628    0.00444    0.268      0.0389
6              40000          0.00222    0.00283    0.01257    0.01196

 



*     In this report we use the term 'core' to refer to a distinct processing unit that communicates with other cores via message passing.

*     That the system was isotropic by the end of these runs was indicated by the fact that the numbers of atoms physically located on each domain were approximately equal.