6. Running Jobs on HECToR

The HECToR facility uses PBSpro to schedule jobs. Writing a submission script is typically the most convenient way to submit your job to the batch system. Example submission scripts (with explanations) are provided below for the most common job types.

Once you have written your job submission script you can validate it by using the checkScript command (see below).

More advanced use of PBS and job submission scripts is covered in the HECToR Optimisation Guide - Batch System Chapter.

If you have any questions or queries about the batch system or on how to submit jobs do not hesitate to contact the HECToR Helpdesk.

6.1 Using PBSpro

You generally interact with PBS in two ways: through options specified in job submission scripts (these are detailed in the examples below) and by using PBSpro commands on the login nodes. There are three key commands used to interact with PBS: qsub, qstat and qdel.

Check the man page of PBS for more advanced commands:

  man pbs

The qsub command

To submit a job, type:

  qsub parallel_script.pbs

This will submit your job script "parallel_script.pbs" to the job-queues. See the sections below for details on how to write job scripts.

The qstat command

Use the command qstat to view HECToR's job queue. For example:

  qstat -q

will list all available queues on the HECToR facility you are logged in to.

You can view just your jobs by using:

  qstat -u $USER

The qdel command

The command qdel enables you to remove jobs from the job queue. If the job is running, qdel will abort it. The usage is:

  qdel [Job-ID]

You can obtain the Job ID from the first column in the output of qstat.

6.2 Output from PBS jobs

Standard output and standard error from your batch jobs can be found in the files <jobname>.o<Job ID> and <jobname>.e<Job ID> respectively (or just in <jobname>.o<Job ID> if you specified the -j oe option to PBS).
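As a rough illustration of the naming scheme above, the following sketch derives the expected output file names from a job name and a job ID. It assumes that the job ID has the form <number>.<server> and that only the numeric part appears in the file names; the function name pbs_output_files and the ID "123456.sdb" are hypothetical.

```shell
#!/bin/bash
# pbs_output_files JOB_NAME JOB_ID
# Prints the expected stdout and stderr file names for a job.
# Assumption: JOB_ID looks like "<number>.<server>" and only the
# numeric part is used in the file names.
pbs_output_files() {
    local job_name=$1 job_id=$2
    local seq=${job_id%%.*}    # strip everything from the first "."
    echo "${job_name}.o${seq} ${job_name}.e${seq}"
}

# Hypothetical job ID, for illustration only
pbs_output_files My_job 123456.sdb
```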

You may see lines that look like the following in standard output:

Resources requested: mpparch=XT,mppnppn=4,mppwidth=128,ncpus=1,place=pack,
walltime=00:30:00
Resources allocated: cpupercent=0,cput=00:00:00,mem=6384kb,ncpus=1,vmem=36148kb,
walltime=00:03:36

The "Resources requested" line indicates the resources that you specified in your job submission script. The "Resources allocated" line is actually meaningless as it relates to resources allocated on the front-end node for submission of the job rather than the resources that will be used in the actual calculation.

6.3 Running Parallel Jobs

The Phase 3 system consists of individual nodes with two 16-core AMD Opteron processors. This results in 32 cores in each node sharing 32GB of main memory. Although the memory on each node is logically shared between the 32 cores, the node actually comprises 4 "NUMA regions" each with 8 cores. Within each NUMA region the 8 cores have fully uniform access (latency and bandwidth) to the memory associated with that NUMA region. However, when referencing the memory of the other 3 NUMA regions, the latency increases and bandwidth is reduced. This has performance implications when you are using a shared-memory programming model (e.g. OpenMP) and the optimal solutions are discussed in more detail below.

6.3.1 Parallel job launcher aprun

The job launcher for parallel jobs on HECToR is aprun. This needs to be started from a subdirectory of the /work space. A sample MPI job launch line looks like:

aprun -n 1024 -N 32 my_mpi_executable.x arg1 arg2

This will start the parallel executable "my_mpi_executable.x" with the arguments "arg1" and "arg2". The job will be started using 1024 MPI tasks with 32 tasks placed on each node (remember that a phase 3 node consists of 32 cores).

The most important aprun flags are:

-n
Specifies the total number of distributed memory parallel tasks you want (not including shared-memory threads).
-N
Specifies the number of distributed memory parallel tasks per node. There is a choice of 1-32 for phase 3 (XE6) nodes. As you are charged per node on HECToR the most economic choice is always to run with "fully-packed" nodes if possible, i.e. -N 32. Running "unpacked" or "underpopulated" (i.e. not using all the cores on a node) is useful if you need large amounts of memory per parallel task or you are using more than one shared-memory thread per parallel task. When running "unpacked", please keep in mind that your budget gets charged for all cores of each node, even if some of the other cores are idle.
-d
Specifies the number of cores for each parallel task to use for shared-memory threading. (This is in addition to the OMP_NUM_THREADS environment variable if you are using OpenMP for your shared memory programming.)
-S
Specifies the number of parallel tasks to place on each NUMA region (each phase 3 (XE6) node is composed of 4 NUMA regions). For mixed-mode jobs (for example, using MPI and OpenMP) it will typically be desirable to use this flag to place parallel tasks on separate NUMA regions so that shared-memory threads in the same team access the same local memory.

Please use man aprun and aprun -h to query further options.

6.3.2 Task affinity for "unpacked" jobs

As described above, a phase 3 node is logically divided into 4 NUMA regions. By default the scheduler places tasks on the first region, then the second, and so on. If you are not using all the cores on a node (i.e. you are running "unpacked") this will introduce a slight imbalance between tasks on the first region and those on the subsequent ones. To ensure an even distribution of tasks, further options have to be passed to the aprun command. The "-S" option defines how many parallel tasks should be placed on each NUMA region and enables an even distribution of tasks over the regions within a node. For example, to run "unpacked" using 120 MPI tasks, 20 per node, with 5 tasks per NUMA region the following aprun command is required:

aprun -n 120 -N 20 -S 5 my_executable.x

This will place the executable over 6 nodes. Each node will have 20 MPI tasks, with 5 tasks on each of its NUMA regions. To ensure that the correct resources are allocated to the job, the PBS job file should set mppnppn to 32 and increase mppwidth so that enough nodes are allocated. For example, running 120 MPI tasks with 20 tasks per node requires 6 nodes, so the PBS job should set mppnppn to 32 and mppwidth to 192 (6 x 32). The values passed to the aprun command are unaffected (but can no longer be read from the PBS job info). The recommended formula for calculating the value of mppwidth from the n, N and S arguments passed to aprun is:

mppwidth = ((n + N-1)/N) * 32    (integer division, i.e. n/N rounded up; N = 4*S where S is used)
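The formula above can be evaluated with shell integer arithmetic. The following is a minimal sketch; the function name required_mppwidth is hypothetical, and the value of 32 cores per node is the phase 3 (XE6) figure given in this guide.

```shell
#!/bin/bash
# required_mppwidth N_TASKS TASKS_PER_NODE
# Prints the mppwidth to request for an "unpacked" job:
# the number of nodes (n/N rounded up) times 32 cores per node.
required_mppwidth() {
    local n=$1 tasks_per_node=$2 cores_per_node=32
    # Integer division of (n + N - 1) by N rounds n/N up
    local nodes=$(( (n + tasks_per_node - 1) / tasks_per_node ))
    echo $(( nodes * cores_per_node ))
}

# Example from the text: 120 MPI tasks, 20 per node (-S 5, so N = 4*5)
required_mppwidth 120 20
```

For the example in the text this prints 192, matching the value of mppwidth given above.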

6.3.3 OpenMP Codes

Codes that use OpenMP can use up to 32 threads per parallel task. While using 32 threads per task and placing 1 task on each node will work, it is unlikely to yield the best performance since the thread team will straddle more than one NUMA region. The recommended way of placing such hybrid (MPI/OpenMP) code is to place one MPI task per NUMA region and use 8-way OpenMP (e.g. aprun -N 4 -d 8 ....). In this way all the OpenMP data accesses go to memory local to the task's own NUMA region. We recommend that users experiment with various combinations of threads and ranks to find the optimal performance for their application.

Job submission scripts for parallel jobs using MPI

A simple MPI job submission script would look like:

#!/bin/bash --login
#PBS -N My_job
#PBS -l mppwidth=2048
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
#PBS -A budget               
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job
aprun -n 2048 -N 32 ./my_mpi_executable.x arg1 arg2

This will run your executable "my_mpi_executable.x" in parallel with 2048 MPI tasks. PBS will allocate 64 nodes to your job and place 32 tasks on each node (one per core).

Important: You have to change into a subdirectory of /work (your workspace), before calling aprun. If your submission directory (where you issued the qsub command) is part of /work, you can use the environment variable $PBS_O_WORKDIR, in the way shown above, to change into the required directory.

All PBS options start with a "#PBS"-string. The individual options are explained below:

-N My_job
The PBS option -N gives a name to your job. In the example the name will be "My_job". Obviously you can replace the string "My_job" with any other string you want. The name will be used in various places. In particular it will be used in the queue listing and to generate the name of your output and/or error file(s).
-l mppwidth
Use the PBS option -l mppwidth to request the total number of MPI tasks for your job.
-l mppnppn
The PBS option -l mppnppn tells the scheduler how many processes to place on a node. Valid choices are from 1 to 32. Regardless of the number of processes per node, your budget gets charged for the number of nodes reserved for your job, multiplied by the time it took to run.
-l walltime
The PBS option -l walltime is used to specify the maximum wall clock time required for your job. The line #PBS -l walltime=00:20:00 will request twenty minutes. If your job exceeds the requested wall time, it will be terminated by PBS. It is advisable to ask for a slightly longer period than you expect your job to consume. To achieve a better turnaround time for short jobs and to get hanging jobs terminated before they consume excessive amounts of time, it is typically best to keep this extra time reasonably small. Accounting is done after your job has finished. Your budget is charged based on the time which your job has actually consumed. The requested wall time is not considered for accounting purposes.
-A budget
The PBS option -A budget specifies the budget your job is going to be charged to. Please contact the principal investigator (PI) or a project manager (PM) of your project for details on the budget you should be using. You need to replace the string "budget" with the string specifying your budget.

6.3.4 Job submission scripts for parallel jobs using OpenMP

An example OpenMP job submission script is shown below.

#!/bin/bash --login
#PBS -N My_job
# Request the number of cores that you need in total
#PBS -l mppwidth=32
#PBS -l walltime=00:03:00
#PBS -A budget

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=32

# Send the appropriate options to the scheduler
aprun -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

This will run your executable "my_openmp_executable.x" using 32 threads on one node. We set the environment variable OMP_NUM_THREADS to 32.

As there is no distributed memory parallel component to this code, we use -n 1 -N 1 to assign a single task to the job.

The changes to the PBS options are:

-l mppwidth
Use the PBS option -l mppwidth to request the total number of threads in your job. Valid choices for an OpenMP job on HECToR are from 1 to 32.

6.3.5 Job submission scripts for hybrid parallel jobs using MPI and OpenMP (Mixed Mode)

An example mixed mode job submission script is shown below.

#!/bin/bash --login
#PBS -N My_job
#PBS -l mppwidth=4096
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
#PBS -A budget

# Change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per parallel task
export OMP_NUM_THREADS=8

# This variable should be set to allow the ALPS and the OS scheduler to 
# assign the task affinity rather than the compiler. If this is not set
# then you may see a large negative effect on performance.
export PSC_OMP_AFFINITY=FALSE

aprun -n 512 -N 4 -d 8 -S 1 ./my_mixed_executable.x arg1 arg2

This will run your executable "my_mixed_executable.x" in parallel with 512 MPI processes. PBSpro will allocate 128 nodes to your job and place 4 MPI tasks on each node. You then have 8 cores associated with each process for your OpenMP threads. We set both the environment variable OMP_NUM_THREADS and the aprun flag -d to 8 (threads), giving a total use of 4096 cores (512 MPI * 8 OpenMP). We also set PSC_OMP_AFFINITY=FALSE to ensure we get the maximum performance out of the system.

As each node contains 4 NUMA regions (i.e. integrated circuit with 8 processing cores) you will often find that the best performance is gained by allocating 4 MPI processes per node, with each MPI process running on a separate NUMA region, and 8 OpenMP threads per MPI process. The aprun flag -S controls how many MPI processes will be placed on each NUMA region.

Even outside of the context of running mixed mode jobs, it may be desirable to underpopulate nodes (i.e. use fewer than 32 MPI tasks per node) in order to reduce contention on resources (such as memory and access to the interconnect). Underpopulating nodes can be done by following the above mixed mode example job script, without setting the parts that control threading: OMP_NUM_THREADS and the aprun flag -d.

Differences in PBS options from previous examples:

-l mppwidth
Use the PBS option -l mppwidth to request the total number of parallel tasks in your job. For hybrid codes this should be the number of MPI tasks * the number of OpenMP threads per task.
-l mppnppn
The PBS option -l mppnppn tells the scheduler how many processes to place on a node. Valid choices are from 1 to 32. Regardless of the number of processes per node, your budget gets charged for the number of nodes reserved for your job, multiplied by the time it took. This number should be a factor of the mppwidth.
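The arithmetic behind the mixed mode example can be sketched as a small shell function that derives mppwidth and the aprun options from the number of MPI tasks and the number of OpenMP threads per task. The function name hybrid_layout is hypothetical. The sketch assumes 32 cores per node in 4 NUMA regions of 8 cores, a thread count that divides 8 (so that no task straddles a NUMA region), and a task count that fills whole nodes.

```shell
#!/bin/bash
# hybrid_layout MPI_TASKS THREADS_PER_TASK
# Prints the mppwidth to request and a matching set of aprun options.
# Assumptions: 32 cores per node, 4 NUMA regions of 8 cores each,
# and THREADS_PER_TASK divides 8.
hybrid_layout() {
    local tasks=$1 threads=$2
    local cores_per_node=32 cores_per_numa=8
    if [ "$threads" -lt 1 ] || [ $(( cores_per_numa % threads )) -ne 0 ]; then
        echo "threads per task must divide $cores_per_numa" >&2
        return 1
    fi
    local per_node=$(( cores_per_node / threads ))   # value for aprun -N
    local per_numa=$(( cores_per_numa / threads ))   # value for aprun -S
    local mppwidth=$(( tasks * threads ))            # MPI tasks * threads
    echo "mppwidth=$mppwidth aprun -n $tasks -N $per_node -d $threads -S $per_numa"
}

# The mixed mode example above: 512 MPI tasks with 8 threads each
hybrid_layout 512 8
```

With 512 tasks and 8 threads this reproduces the values used in the example script (mppwidth=4096 and aprun -n 512 -N 4 -d 8 -S 1).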

6.3.6 Running Coarray Fortran Jobs on the XE nodes

A simple Coarray Fortran job submission script would look like:

#!/bin/bash --login
#PBS -N My_job
#PBS -l mppwidth=2048
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
#PBS -A budget               
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job
aprun -n 2048 -N 32 ./my_caf_executable.x

This will run your executable "my_caf_executable.x" in parallel with 2048 images. PBS will allocate 64 nodes to your job and place 32 images on each node (one per core). If you compiled using the flag -X to set the number of images at compile time then the number passed to aprun through the -n flag must be the same.

Coarrays are stored on the symmetric heap (which is used to store Coarrays and SHMEM data objects). The environment variable XT_SYMMETRIC_HEAP_SIZE controls the size (in bytes) of the symmetric heap. It understands k,m,g suffixes for kilobytes, megabytes and gigabytes and it must be large enough to contain all memory that has been symmetrically allocated (not in the data segment). The default size when using Coarray Fortran is 32 megabytes.

6.4 checkScript job submission script validation tool

The checkScript tool has been written to allow users to validate their job submission scripts before submitting their jobs. The tool will read your job submission script and try to identify errors, problems or inconsistencies.

Note that the tool currently only validates parallel job submission scripts. Serial and low priority jobs are not included.

An example of the sort of output the tool can give would be:

user@hector-xe6-3:/work/x01/x01/user> checkScript submit.pbs 

checkScript: Validate HECToR Job Submission Scripts
===================================================

++ Warning: mppnppn not specified ++
     Default number of cores per node (32) will be used.
++ Warning: aprun command not found in script file ++
     Please check you are calling your parallel job correctly.

Using z01 budget. Remaining kAUs = 2796.859

Script details
---------------
       User: user
Script file: submit.pbs
  Directory: /esfs2/x01/x01/user            (ok)

Requested resources
-------------------
   mppwidth =           2048 cores               (ok)
    mppnppn =             32 cores   (set to default)
      nodes =             64                     (ok)
   walltime =        12:00:0                     (ok)
     budget =            z01                     (ok)


AU Usage Estimate (if full job time used)
-----------------------------------------
                   Raw kAUs =         69.120
     No capability discount =          0.000
     Estimated charged kAUs =         69.120

checkScript finished: 2 warning(s) and 0 error(s).

6.5 OOM (Out of Memory) Error Messages

There is 1 GB of memory per core, so codes that require large amounts of memory per core may have to reduce the amount of data held by each MPI task, or run with fewer MPI tasks per node. Applications that attempt to access more memory than is available to the system will abort, producing an error similar to the following:

OOM killer terminated this process.

If this happens to your code on the phase 3 system, there are two potential solutions:

  • Run the same dataset over a larger number of cores, if your application scales easily to more processors.
  • If it is difficult or undesirable to run with more MPI tasks, run with fewer MPI tasks per node. This is also known as running "unpacked". (Note: you will still be charged as if you had used the full node.)
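To see how much memory running "unpacked" frees up, the following sketch divides the 32GB of node memory quoted in this guide by the number of MPI tasks placed on each node. The function name mem_per_task_mb is hypothetical, and the result is an upper bound, since some memory on each node is not available to applications.

```shell
#!/bin/bash
# mem_per_task_mb TASKS_PER_NODE
# Prints an upper bound on the memory (in MB) available to each MPI
# task, assuming 32GB per node shared evenly between the tasks.
mem_per_task_mb() {
    local tasks_per_node=$1
    local node_mem_mb=$(( 32 * 1024 ))
    echo $(( node_mem_mb / tasks_per_node ))
}

# Compare fully-packed nodes with two "unpacked" layouts
for n in 32 16 8; do
    echo "aprun -N $n: up to $(mem_per_task_mb "$n") MB per task"
done
```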

6.6 Automatic re-running of jobs after system restart

If the HECToR system crashes then all jobs that were running at the time of the system halt will be automatically restarted when the system is brought back up. Sometimes, this default behaviour is undesirable (for example, if you are using checkpoint files to manage your runs). You can change the behaviour for a particular job by adding the following line to your job submission script:

#PBS -r n

This prevents the job from being automatically restarted when the system returns.

6.7 Batch System Layout and Limits

The batch system is laid out so that all you need to do is request the number of processes you need (along with the number of processes per node) and the time for your job. The scheduling system will then place the job in the appropriate queue.

6.7.1 Parallel Queue Configuration

The largest job you can request is currently:

  • Phase 3, XE6 nodes: 2048 nodes (65536 processing cores with mppnppn=32). The longest time that can be requested is currently 12 hours.

The queues are arranged in the following way:

  • For phase 3, XE6 parallel jobs, number of nodes of: 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048.
  • For phase 3, XE6 parallel jobs, times of: 20 minutes, 1 hour, 3 hours, 6 hours and 12 hours.

The names of the queues are derived from the queue limits and resources. For example, the 128 node, 6 hour queue would be called "par:128n_6h" and would be used for a job with mppwidth=4096 and mppnppn=32.

If your job does not exactly match any of the queues it will be placed in the smallest queue that matches your requests for processors and time.
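The queue selection rule above can be sketched in shell as follows. This only illustrates the rule and is not the scheduler's actual logic: the function name queue_for is hypothetical, and the time labels other than "6h" (in particular "20m") are assumptions modelled on the queue name quoted above and on the serial queue names.

```shell
#!/bin/bash
# queue_for NODES MINUTES
# Prints the name of the smallest queue that fits the request,
# using the node counts and time limits listed above.
# Assumption: queue names follow the pattern par:<nodes>n_<time>.
queue_for() {
    local nodes=$1 minutes=$2
    local size limit q_nodes="" q_time=""
    for size in 4 8 16 32 64 128 256 512 1024 2048; do
        if [ "$nodes" -le "$size" ]; then q_nodes=$size; break; fi
    done
    # Each entry is <limit in minutes>:<assumed label>
    for limit in 20:20m 60:1h 180:3h 360:6h 720:12h; do
        if [ "$minutes" -le "${limit%%:*}" ]; then q_time=${limit##*:}; break; fi
    done
    if [ -z "$q_nodes" ] || [ -z "$q_time" ]; then
        echo "request exceeds the largest queue" >&2
        return 1
    fi
    echo "par:${q_nodes}n_${q_time}"
}

# A 100 node, 5 hour request falls into the 128 node, 6 hour queue
queue_for 100 300
```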

6.7.2 Job Limits

We would ask that users avoid filling the job queues with large batches of jobs. We would suggest you use job chaining if you need to do so. A guide to job chaining can be found on the HECToR User Wiki (you will need your HECToR SAFE details to log in). The hard limits in the batch system are dynamically altered depending on the load, backlog and size of jobs on the systems, so they may differ from the default values given below at any one particular time. You can often find out the reason why a particular job is not running by using the command:

qstat -f 

and looking in the "comment" section of the output.

The default limits are:

  • Maximum 8 jobs total per user per machine in the queues, whether running, waiting or held.
  • There is a system limit of a maximum 4 jobs running per user per queue at any one time.
  • There is a system limit of a maximum of 8 jobs running per user across the entire system.

6.7.3 Low Priority Queue

A low priority queue is available on HECToR. The key facts are as follows:

  • A single low priority queue is available.
  • Class 1a projects will not be charged for low priority jobs. All other projects will be charged.
  • The minimum job size supported in the queue is 4096 cores. This limit is subject to change depending upon operational necessity.
  • The maximum job size supported in the queue is 14,186 cores. This limit is subject to change depending upon operational necessity.
  • The maximum run time supported is 3 hours.
  • Jobs can be submitted to the queue at any time.
  • The low priority queue will only be enabled when the backlog of 'normal' jobs is below 3 hours. The backlog is currently re-calculated hourly.
  • Jobs in the low priority queue will continue to run right up to the start time of planned maintenance slots. The low priority queue will not be drained. We would therefore encourage you to use check-pointing.
  • Any low priority jobs which fail to complete as a result of planned maintenance or a node/system failure will not be refunded.
  • The execution queue is called 'low'. To submit a low priority job, users should add the following to their PBS job submission script:
    #PBS -q lowpriority

6.7.4 Capability Incentives

The Capability Incentives Scheme is an encouragement to users to broaden their computational science and to exploit the capabilities of the service. Under the scheme, jobs will be discounted at three levels depending on how well they scale. The three levels of incentives are Gold, Silver and Bronze.

Level    Min Number of Cores    AU Discount
Bronze   2048                   5%
Silver   4096                   15%
Gold     8192                   30%
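As an illustration of the thresholds in the table, the following sketch maps a core count to an incentive level and discount. The function name capability_discount is hypothetical. It encodes only the minimum core counts; since the scheme also depends on how well jobs scale, treat the result as the best case rather than the discount that will actually be applied.

```shell
#!/bin/bash
# capability_discount CORES
# Prints the incentive level and the AU discount (in percent) implied
# by the minimum core counts in the table above. Other eligibility
# criteria of the scheme are not modelled here.
capability_discount() {
    local cores=$1
    if   [ "$cores" -ge 8192 ]; then echo "Gold 30"
    elif [ "$cores" -ge 4096 ]; then echo "Silver 15"
    elif [ "$cores" -ge 2048 ]; then echo "Bronze 5"
    else echo "None 0"
    fi
}

capability_discount 4096
```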

6.8 Reservations

Reservations are available on HECToR. These allow users to reserve a number of nodes for a specified length of time starting at a particular time on the system.

Reservations will require justification. They will only be approved if the request could not be fulfilled with the standard queues. Possible uses for a reservation would be:

  • An exceptional job requires longer than 12 hours runtime.
  • You require a job/jobs to run at a particular time e.g. for a demonstration or course.
  • You require a larger number of nodes than are available through the usual queues.

Note: Reservation requests must be submitted at least 24 hours in advance of the reservation start time. If requesting a reservation for a Monday at 12:00, please ensure this is received by the Friday at 12:00 at the latest. The same applies over Bank Holidays.

Reservations will be charged at 1.5 times the usual AU rate and you will be charged the full rate for the entire reservation at the time of booking, whether or not you use the nodes for the full time. In addition, you will not be refunded the AUs if you fail to use them due to a job crash unless this crash is due to a system failure.

To request a reservation please use the form on your main SAFE page. You need to provide the following:

  • the start time and date of the reservation;
  • the end time and date of the reservation;
  • the project code for the reservation;
  • the number of nodes required;
  • your justification for the reservation (as above, this must be provided or the request will be rejected).

Your request will be checked by the Helpdesk and if approved you will be provided a reservation ID which can be used on the system. You submit jobs to a reservation using the qsub command in the following way:

qsub -q <reservation ID> <job submission script>

6.9 Serial Queues

The serial queues on the HECToR facility are designed for large compilations, post-calculation analysis and data manipulation. They should be used for jobs which do not require parallel processing but which would have an adverse impact on the operation of the login nodes if they were run interactively.

Example uses include: compressing large data files, visualising large datasets, large compilations and transferring large amounts of data off the system.

Useful information on the serial queues

  • The serial queues have lengths of 20 minutes, 1 hour, 3 hours, 6 hours and 12 hours.
  • Eight serial jobs can run concurrently.
  • Budgets must be specified but no charges will be applied for time used in the serial queues.
  • As the walltime for serial jobs cannot be effectively determined, jobs in the serial queues may be terminated on a system shutdown in the same way as interactive use of the login nodes.
  • Up to 4 jobs can share a single dual-core processor in the serial queues.
  • Up to 4 jobs share 8GB of memory so the amount of memory available for your job varies depending on what else is running in the queues.
  • Current serial queue limits are listed below, but are subject to change:
    serial_12h : 1 max/user, 1 max total jobs
    serial_6h : 1 max/user, 1 max total jobs
    serial_3h : 1 max/user, 2 max total jobs
    serial_1h : 1 max/user, 4 max total jobs
    serial_20m : 1 max/user, 8 max total jobs

6.9.1 Example Serial Job Submission Script

The following PBS options are used in serial job submission scripts:

  • The option -q serial must be included.
  • The -l cput=hh:mm:ss must be used to specify the calculation time limit.

For example, here is a serial job submission script to compress an output file (with a time limit of twenty minutes):

#!/bin/bash --login
#
#PBS -q serial
#PBS -l cput=00:20:00
#PBS -A budget

cd $PBS_O_WORKDIR

gzip output.dat

The #PBS string at the start of the line indicates that you are going to be specifying a PBS option. The options included in this submission script are:

-q serial
The PBSpro option -q is used to specify the queue for your job. The line #PBS -q serial requests the serial queues.
-l cput
The PBSpro option -l cput is used to specify the CPU time required for your serial job. The line #PBS -l cput=00:20:00 will request twenty minutes. If your job exceeds the requested time, it will be terminated by PBS. It is advisable to ask for a slightly longer period than you expect your job to consume. To achieve a better turnaround time for short jobs and to get hanging jobs terminated before they consume excessive amounts of time, it is typically best to keep this extra time reasonably small.
-A budget
The PBS option -A specifies the budget your job is going to be charged to. Please contact the principal investigator (PI) or a project manager (PM) of your project for details on the budget you should be using. You need to replace the string "budget" with the string specifying your budget.

The remaining part of the script specifies the commands to be run in the same way as for a normal shell script.

6.9.2 Example Serial Job Submission Script for Visualisation Software

You can also run visualisation jobs (e.g. Xconv) through the serial queues. In these types of jobs you need to submit the job and then keep your terminal open until the job runs, at which point the application window will appear on your desktop. The application will finish when you close it manually or when you hit the time limit specified in your submission script.

One additional change is needed to the submission script to enable the viewing of graphical data: the $DISPLAY variable must be exported into the job environment. (This all assumes that you have logged in with ssh -Y and have an Xserver running on your desktop.)

An example submission script for the Xconv application is shown below.

#!/bin/bash --login
#
#PBS -q serial
# Export the $DISPLAY variable
#PBS -v DISPLAY
#PBS -l cput=00:20:00
# Replace with your budget
#PBS -A budget

cd $PBS_O_WORKDIR

# Assuming that 'xconv' is in your path
xconv

6.9.3 Problems with serial commands in parallel job scripts

If serial commands are issued in parallel job scripts then they will not be dispatched to backend nodes. Instead, they will run on the aprun job launcher node to the detriment of the batch scheduler and the entire HECToR facility that you are running on. For this reason it is critical that no serial commands (such as compression, post-processing or compilation) are included in parallel job scripts. All commands in parallel job scripts should begin with aprun.

To run serial commands, please use the serial queues provided (see documentation above).
