The HECToR Service is now closed and has been superseded by ARCHER.

HECToRNews : Advice for running applications on HECToR Phase 2b

Welcome to a special edition of the HECToR newsletter, which provides advice on running applications on the HECToR Phase 2b (XT6) system. This information may be particularly relevant to users whose applications are communication-intensive, and which may not be performing as well as expected on the XT6 with the current Seastar2 interconnect.

Please do not hesitate to contact the helpdesk at support@hector.ac.uk with any questions or comments about running codes on the XT6. We are happy to help you port, benchmark and optimise your code, to make best use of the XT6 system.

General Advice

For certain applications users may have noticed that running a fully populated job (i.e. using all cores per node for MPI processes) on the Phase 2b XT6 machine results in inferior performance compared with the same job on the Phase 2a XT4 machine. This is because XT6 nodes have 24 cores which currently share a single link to the same type of interconnect as 4-core XT4 nodes, resulting in increased contention for off-node communications.

Whilst it is anticipated that the new interconnect (codenamed Gemini), which is due to be installed later this year, will mitigate these problems, there are things that XT6 users can do now in order to try to match or better the AU cost of simulations being run on the XT4, and so take advantage of the newer machine.

There are 3 things users can do to reduce this cost:

  1. Run jobs sparsely populated
  2. Use a shared-memory optimisation
  3. Use multi-threaded maths routines

These three optimisations are complementary, so it is recommended that, if possible, users take advantage of all of them. It may also be necessary to reduce the number of MPI processes being used.

To sparsely populate the compute nodes, extra flags on the aprun command are required: -S and possibly -d, which may be new to you. The -S flag indicates how many MPI processes to place on each hex-core die on the node[1]. The -d flag is only required for codes that support multiple threads; it indicates how many threads each MPI process should use for multi-threaded routines. If -d is omitted, 1 thread per MPI process is assumed.

Thus as an example

aprun -n 36 -N 4 -S 1 -d 6 ./myexecutable

will launch a 36 MPI process job. This job will have 4 MPI processes on each node, and 1 MPI process per die on the node, and each MPI process will consist of 6 threads. Note that even though you specify the -d flag, you must still set the appropriate environment variables for your job (e.g. OMP_NUM_THREADS, GOTO_NUM_THREADS) to the same value.

Note that sparsely populating a node will lead to more nodes being used. This means that the PBS mppwidth, mppnppn and mppdepth fields in your job scripts will have to be changed accordingly to match the values you pass to aprun (the values for -n, -N and -d respectively).
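
As a minimal sketch following the convention just described (the budget code z03 and the executable name are placeholders), a job script for the 36-process example above might look like:

#! /bin/bash --login
#PBS -N myjob
#PBS -A z03
#PBS -l mppwidth=36
#PBS -l mppnppn=4
#PBS -l mppdepth=6
#PBS -l walltime=00:20:00
#PBS -j eo

cd "$PBS_O_WORKDIR"

# The threading environment variables should match the value given to -d.
export OMP_NUM_THREADS=6
export GOTO_NUM_THREADS=6

# 36 MPI processes, 4 per node, 1 per die, 6 threads per process.
aprun -n 36 -N 4 -S 1 -d 6 ./myexecutable

(The castepsub-generated script shown later in this newsletter requests the same nine nodes in a slightly different but equivalent way, by reserving all 24 cores per node with mppwidth=216 and mppnppn=24.)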

Because the charging system on HECToR is per node[2], your job may run more quickly yet still be more expensive, and only through experimentation will you find the best way for your particular case. In many cases codes are iterative in some sense: for instance they may time step, or have steps in a minimisation scheme. In such cases it may be a good idea to run the code for a small number of steps with different values for the various aprun flags to find the most effective combination for you. To help in this search, below are some suggestions for a number of widely-used applications.

CASINO

In general CASINO performs well with fully populated nodes on the XT6. However, the smaller amount of memory per core compared with the XT4 may be a problem for users whose computations require the so-called 'blips basis' with a size close to or larger than 1GB. The latest version of CASINO (version 2.6) includes a NUMA-aware System V shared memory segment capability, which allows the blips basis to be shared by multiple cores, reducing overall memory usage while still allowing all cores on the node to be utilised.

Note: a significant slowdown has been observed if the same System V memory segment is used across the whole 24-core node. It is therefore recommended that sharing of the blips basis is limited to at most 6 cores. The size of the group of cores that share memory within a node can be set with the environment variable CASINO_NUMABLK.
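
As a sketch (the executable path and core counts here are purely illustrative), a fully populated run sharing the blips basis within groups of 6 cores might be launched as:

# Share the blips basis among groups of at most 6 cores, as recommended above.
export CASINO_NUMABLK=6

# Fully populated nodes: 24 MPI processes per node.
aprun -n 240 -N 24 ./casino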

This information, and any further updates, will be added to the CASINO section of the HECToR Users Guide at

   https://wiki.hector.ac.uk/userwiki/CASINO

which can be accessed with the same login and password as your HECToR SAFE account.

CASTEP

In the case of CASTEP, take the example of the standard al3x3 benchmark, which takes around 1350 seconds (timing the SCF cycles) to run on the XT4 using 144 MPI processes and 4 processes per node; the same job run on the XT6 using 24 processes per node takes around 3026 seconds. At the current costs of 7.5 AUs/core/hour for the XT4 and 3.77 for the XT6 this works out at a 1.12 times increase in cost, but perhaps more importantly the difference in raw performance is unacceptable.
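
For reference, using the per-node rates given in footnote [2], the figures quoted above work out as:

   XT4: (144/4) nodes * 30 AUs per node * 1350/3600 hours = 405 AUs
   XT6: (144/24) nodes * 90.48 AUs per node * 3026/3600 hours = 456 AUs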

It is possible to make use of all three optimisation types listed above, since a shared memory optimisation is available for CASTEP, included in the castep/5.5 module, and CASTEP makes heavy use of BLAS/LAPACK maths routines.

In the example of al3x3, the 144 process XT4 job mentioned above can be improved in terms of time and AU cost significantly by using the castep/5.5 executable with 36 MPI processes, 4 per node and for each MPI process using 6 threads for maths routines. This results in a runtime of around 1254 seconds and a 54% AU cost reduction over the XT4 run.

When exploiting the shared memory optimisation it is important that you include the num_proc_in_smp directive in your .param file to indicate to CASTEP how many MPI processes per node you are using. The format of this directive is

num_proc_in_smp : <number of processes per node>
e.g. in the al3x3 example above this is:
num_proc_in_smp : 4

In order to use the shared-memory optimisations, load the module castep/5.5 and use the castepsub command to either print out your job script for submission, or submit your job for you.

The castepsub command for the al3x3 example is shown below, along with the output that it produces:
phase2b> module load castep/5.5
phase2b> castepsub -d -n 36 -N 4 -t 6 -W 00:20:00 al3x3
#! /bin/bash --login
#PBS -A z03
#PBS -N al3x3
#PBS -l mppwidth=216
#PBS -l mppnppn=24
#PBS -l walltime=00:20:00
#PBS -j eo
cd "$PBS_O_WORKDIR"

module load castep/5.5

export TMPDIR=$PBS_O_WORKDIR
export GFORTRAN_TMPDIR=$PBS_O_WORKDIR
export PSPOT_DIR=/usr/local/packages/castep/input/Pseudopotentials-MS
export MPICH_UNEX_BUFFER_SIZE=128M
export MPICH_PTL_UNEX_EVENTS=131072
export PSC_OMP_AFFINITY=FALSE
export KMP_AFFINITY=none
export OMP_NUM_THREADS=6
export GOTO_NUM_THREADS=6

export MPICH_CPUMASK_DISPLAY=1

aprun -n 36 -N 4 -d 6 -S 1 \
   /usr/local/packages/castep/bin/5.5-phase2b/castep al3x3

The syntax of the castepsub command is:
 castepsub [-d] [-A account] [-n nproc] [-N ppnode] [-S ppdie] \
           [-W walltime] [-t nthreads] castep_seed
If the -d flag is omitted castepsub will submit the job rather than print the script. In most cases the -S option is determined automatically. The -A account option may be omitted if the environment variable ACCOUNT is set to the name of the account; e.g. in the ".bashrc" file
  export ACCOUNT=z03
will set this if you use the bash shell.

You may want to experiment with different numbers of MPI processes (-n), MPI processes per node (-N), MPI processes per die (-S) and threads per MPI process (-t option to castepsub, or -d if calling aprun directly) on a small number of SCF cycles to see which options work out best in your case. The number of SCF cycles can be limited for benchmarking purposes by setting max_scf_cycles in your .param file. Remember that if you change -N you should also change num_proc_in_smp in your .param file to match, and adapt -t (or -d on aprun) accordingly so that the number of threads times the number of MPI processes per node equals 24.
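
For example, to limit a benchmarking run to a handful of SCF cycles you might add a line such as the following to your .param file (the value 5 is purely illustrative):

max_scf_cycles : 5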

This information, and any further updates, will be added to the CASTEP section of the HECToR Users Guide at

   https://wiki.hector.ac.uk/userwiki/CASTEP

which can be accessed with the same login and password as your HECToR SAFE account.

CP2K

Users should always use the cp2k.psmp (MPI/OMP hybrid parallel) executable, which can be found at $CP2K/cp2k.psmp when the cp2k module is loaded. In general for the XT6, running fully populated with 12 MPI tasks per node (2 threads per task) is likely to be the best choice for up to ~288 cores, e.g.
 aprun -n 144 -N 12 -S 3 -d 2 $CP2K/cp2k.psmp
For larger runs it will usually be better to use 4 tasks per node (6 threads per task).
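
For example (the overall job size here is illustrative), a 576-core run with 96 MPI tasks spread 4 per node, 1 per die, and 6 threads per task could be launched as:

export OMP_NUM_THREADS=6
aprun -n 96 -N 4 -S 1 -d 6 $CP2K/cp2k.psmp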

If planning large/long runs, users are advised to run a short benchmark of their calculation (e.g. a small number of SCF cycles) to find the optimum combination of MPI tasks and threads.

This information, and any further updates, will be added to the CP2K section of the HECToR Users Guide at

   https://wiki.hector.ac.uk/userwiki/CP2K

which can be accessed with the same login and password as your HECToR SAFE account.

DL_POLY

Firstly, users should be aware that there are 2 versions of DL_POLY available on HECToR, DL_POLY_2 and DL_POLY_3, and that, due to the very different designs of the code, they are complementary in terms of the problem sizes and process counts at which each performs best.

DL_POLY_2 is best for small system sizes and relatively small processor counts; small in this context means roughly fewer than a few tens of thousands of particles and 64 processors. DL_POLY_3, on the other hand, is designed to handle large system sizes on large processor counts. As the main focus here is minimising the cost of runs on large processor counts, the rest of this section will concentrate on DL_POLY_3.

Also note that neither version of DL_POLY supports multiple threads, so you should NOT specify the -d flag on the aprun command.

Experiments on the XT6 so far indicate that, while full population of nodes is generally not too bad for DL_POLY_3, sparse population can:

  1. Almost always result in a better run time
  2. Sometimes, with slight underpopulation, result in a cheaper run

Results below illustrate this for the standard benchmark TEST8, available from

   ftp://ftp.dl.ac.uk/ccp5/DL_POLY/DL_POLY_3.0/DATA/

This is a simulation of gramicidin A molecules in water, and consists of roughly 800,000 particles. Electrostatics are handled by the SPME method, and constraints via SHAKE. In the table, 'Full' indicates fully populated nodes; in the other cases the flags passed to aprun are shown.
                 Time (s)                    Cost (AUs)
Cores   -N16 -S4  -N20 -S5  Full   -N16 -S4  -N20 -S5  Full
16         218       224     229      5.47     5.63    5.75
32         102       105     101      5.14     5.31    5.09
64          70        80      80      7.08     8.09    6.01
128         56        63      74     11.26    11.16   14.94
256         50        52      56     20.01    17.14   15.44
512         48        51      57     38.60    33.33   31.86

It can be seen that the above comments apply: slight underpopulation leads to a marked improvement in run time, and can in certain cases lead to a reduction in cost. An example of the latter point is either of -N16 -S4 or -N20 -S5 on 128 cores. More generally, it is suggested that users follow the prescription above: run DL_POLY_3 for a small number of timesteps and find which set of flags best suits their particular case.
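
For example, the 128-core -N16 -S4 case from the table above could be launched as follows (the executable path is illustrative; note that no -d flag is given, since DL_POLY does not use threads):

aprun -n 128 -N 16 -S 4 ./DLPOLY.Z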

This information, and any further updates, will be added to the DL_POLY section of the HECToR Users Guide at

   https://wiki.hector.ac.uk/userwiki/DL_POLY

which can be accessed with the same login and password as your HECToR SAFE account.

VASP

Firstly note that VASP can NOT exploit multiple threads, so you should NOT specify the -d flag.

In general for VASP experimentation on the XT6 has shown:

  1. In a large number of cases users can get the same performance, at the same cost as the XT4, if 3 cores per die are used. For example, consider the following benchmark: H defect in Pd, 32 atoms with 10 k-points, PAW pseudopotential, pure DFT functional. This takes 303.3 secs to complete on the quad-core XT4 using the vasp5 executable on 72 cores. The cost is
       (72/4) nodes * 30 AUs per node * 303.3/3600 hours = 45.49 AUs.
    
    On the XT6 the same benchmark takes 377.4 secs on 72 cores if the nodes are fully populated. This requires 72/24=3 nodes, hence the cost is
       (72/24) nodes * 90.5 AUs per node * 377.4/3600 hours = 28.47 AUs.
    
    which is significantly cheaper than on the XT4, but requires a longer runtime. However if the job only employs 3 cores per die (12 per node), 6 nodes will be required. Then the job takes only 298.7 secs, which costs
       (72/12) nodes * 90.5 AUs per node * 298.7/3600 hours = 45.05 AUs.
    
    Hence the result is obtained both more quickly and more cheaply than on the XT4.

    It should be noted that at larger core counts (600, for example) the performance is worse than on the XT4, but sparse population of the nodes is still preferable for performance. For example, a test case that takes 848 secs on 600 cores (1060 AUs) on the XT4 completes in 1210 secs on 600 cores on the XT6 with fully-populated nodes (760.45 AUs), or in 1072.7 secs using 12 cores per node (1348 AUs).

    To sum up, using 3 cores per die (12 per node) seems a good compromise between cost and performance; an example aprun line for this layout is given after this list.

  2. It is quite important for performance that users use the vasp5.gamma executable (or vasp.gamma if you are using VASP 4.6) when doing Gamma-point calculations.

  3. For large calculations using more than 72 cores the shared-memory version of vasp5/5.2 provided by EPCC is suggested (vasp5/5.2-shm). This version of the code performs better at large core counts (for example, a test case that normally takes 1120 secs on 600 cores on the XT6 takes 1042 secs when using the shared memory version of the code on the same number of cores). It should be noted, though, that the code might not work for some choices of NPAR and has not yet gone through extensive testing.
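
As an illustration of point 1 above, the 72-core, 12-processes-per-node layout could be launched as sketched below (it is assumed here that the vasp5 executable is in the working directory; no -d flag is given, since VASP is not threaded):

# 72 MPI processes, 12 per node (3 per die).
aprun -n 72 -N 12 -S 3 ./vasp5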

This information, and any further updates, will be added to the VASP section of the HECToR Users Guide at

   https://wiki.hector.ac.uk/userwiki/VASP

which can be accessed with the same login and password as your HECToR SAFE account.

Footnotes

[1] The 24-core node is effectively made up of 4 hex-core dies: http://www.hector.ac.uk/cse/documentation/Phase2b/#arch.

[2] The current charging rates on HECToR are as follows:

  • XT4 (Phase 2a): 7.50 AUs/core-hour, thus 30.00 AUs/node-hour
  • XT6 (Phase 2b): 3.77 AUs/core-hour, thus 90.48 AUs/node-hour
The charging rate on the XT6 will be reviewed when the Gemini update has been completed.

Fri Nov 5 08:45:40 GMT 2010