The HECToR Service is now closed and has been superseded by ARCHER.

6. Running Jobs on HECToR

The HECToR facility uses PBSpro to schedule jobs. Writing a submission script is typically the most convenient way to submit your job to the batch system. Example submission scripts (with explanations) are provided below for the most common job types.

You can also use the bolt job submission script creation tool (see below) to generate correct job submission scripts quickly and easily.

Once you have written your job submission script you can validate it by using the checkScript command (see below).

More advanced use of PBS and job submission scripts is covered in the HECToR Optimisation Guide - Batch System Chapter.

If you have any questions or queries about the batch system or on how to submit jobs do not hesitate to contact the HECToR Helpdesk.


6.1 Using PBSpro

You generally interact with PBS in two ways: through options specified in job submission scripts (these are detailed below in the examples) and by using PBSpro commands on the login nodes. There are three key commands used to interact with PBS:

Check the man page of PBS for more advanced commands:

  man pbs

The qsub command

To submit a job, type:

  qsub parallel_script.pbs

This will submit your job script "parallel_script.pbs" to the job-queues. See the sections below for details on how to write job scripts.

The qstat command

Use the command qstat to view HECToR's job queue. For example:

  qstat -q

will list all available queues on the HECToR facility you are logged in to.

You can view just your jobs by using:

  qstat -u $USER

The qdel command

The command qdel enables you to remove jobs from the job queue. If the job is running, qdel will abort it. The usage is:

  qdel [Job-ID]

You can obtain the Job ID from the first column in the output of qstat.

6.2 Output from PBS jobs

Standard output and standard error from your batch jobs can be found in the files <jobname>.o<Job ID> and <jobname>.e<Job ID> respectively (or just in <jobname>.o<Job ID> if you specified the -j oe option to PBS to join the two streams).
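For example, to merge standard error into the standard output file (and set the <jobname> part of the file names) you would add options along these lines to your job submission script (a minimal sketch):

  #PBS -N My_job
  #PBS -j oe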

Standard Output and Standard Error from jobs running on the compute nodes appear in files placed in the job's working directory at the end of the run. However, writing a lot of information to stdout/stderr (e.g. printing within a loop) can overload the filesystem used to hold this output before it is written to disk. Therefore, recommended practice is to redirect stdout/stderr in your job script as follows:

aprun -n 4096 -N 32 ./myexe >& stdouterr

This way messages can be monitored as the job is progressing, rather than having to wait for the stdout/stderr files to appear at the end of the run. Following this practice stdout/stderr files will still be generated, but the stdout file will contain job information and stderr should be empty.

You may see lines that look like the following in standard output:

Resources requested: mpparch=XT,mppnppn=4,mppwidth=128,ncpus=1,place=pack,
walltime=00:30:00
Resources allocated: cpupercent=0,cput=00:00:00,mem=6384kb,ncpus=1,vmem=36148kb,
walltime=00:03:36

The "Resources requested" line indicates the resources that you specified in your job submission script. The "Resources allocated" line is actually meaningless as it relates to resources allocated on the front-end node for submission of the job rather than the resources that will be used in the actual calculation.

6.3 bolt: Job submission script creation tool

The bolt job submission script creation tool has been written by EPCC to simplify the process of writing job submission scripts for modern multicore architectures. Based on the options you supply, bolt will try to generate a job submission script that uses HECToR as efficiently as possible.

bolt can generate job submission scripts for both parallel and serial jobs on HECToR. MPI, OpenMP and hybrid MPI/OpenMP jobs are supported. Low-priority jobs are not supported.

If there are problems or errors in your job parameter specifications then bolt will print warnings or errors. However, bolt cannot detect all problems so you may wish to run the checkScript tool on any job submission scripts prior to running them to check for budget errors, etc.

6.3.1 Basic Usage

The basic syntax for using bolt is:

bolt -n [parallel tasks] -N [parallel tasks per node] -d [number of threads per task] \
     -t [wallclock time (h:m:s)] -o [script name]  [arguments...]

For example, to generate a job submission script to run an executable called 'my_prog.x' for 3 hours using 4096 parallel tasks and 16 tasks per compute node, you could use:

bolt -n 4096 -N 16 -t 3:0:0 -o my_job.bolt my_prog.x arg1 arg2

This would generate the job submission script 'my_job.bolt' with the correct options to run 'my_prog.x' with the command line arguments 'arg1' and 'arg2'. You could then run the script with the command:

qsub my_job.bolt

Note: if you do not specify the script name with the '-o' option then your script will be written to a file called 'a.bolt'.

Note: if you do not specify the number of parallel tasks then bolt will generate a serial job submission script.
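For example, a serial job submission script that compresses a file for up to 20 minutes could be generated with something like the following (the script name, file name and time limit are illustrative):

  bolt -t 0:20:0 -o my_serial.bolt gzip output.dat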

6.3.2 Specifying which budget to use

bolt will use your default budget code for all job scripts generated. If you have more than one code (for example, you have a project sub-budget or are a member of multiple projects) then you can specify a different budget code to bolt by using the '-A' option. For the example given above:

bolt -n 4096 -N 16 -t 3:0:0 -o my_job.bolt -A z01-sub my_prog.x arg1 arg2

6.3.3 Using OpenMP threads

The '-d' option to bolt allows you to specify the number of OpenMP threads to use. If the number of parallel tasks is 1 (the default) then you will get a pure OpenMP job. If you have more than one parallel task then bolt will produce a script for a hybrid MPI/OpenMP job.

For example, to run a 4 thread, 6 hour OpenMP job with the executable 'my_omp.x' you would use:

bolt -d 4 -t 6:0:0 -o my_omp_job.bolt my_omp.x

To run a hybrid MPI/OpenMP job using 1024 MPI tasks and 8 OpenMP threads per MPI task for 12 hours you would use:

bolt -n 1024 -d 8 -t 12:0:0 -o my_hybrid_job.bolt my_hybrid.x

6.3.4 Further help

You can access further help on using bolt on the HECToR machine with the '-h' option to the bolt command. Other useful options include:

-s
Write and submit the job script rather than just writing the job script.
-p
Force the job to be parallel even if it only uses a single parallel task.

6.4 Running Parallel Jobs

The Phase 3 system consists of individual nodes with two 16-core AMD Opteron processors. This results in 32 cores in each node sharing 32GB of main memory. Although the memory on each node is logically shared between the 32 cores, the node actually comprises 4 "NUMA regions", each with 8 cores. Within each NUMA region the 8 cores have fully uniform access (latency and bandwidth) to the memory associated with that NUMA region. However, when referencing the memory of the other 3 NUMA regions, the latency increases and the bandwidth is reduced. This has performance implications when you are using a shared-memory programming model (e.g. OpenMP) and the optimal solutions are discussed in more detail below.

6.4.1 Parallel job launcher aprun

The job launcher for parallel jobs on HECToR is aprun. This needs to be started from a subdirectory of the /work space. A sample MPI job launch line looks like:

aprun -n 1024 -N 32 my_mpi_executable.x arg1 arg2

This will start the parallel executable "my_mpi_executable.x" with the arguments "arg1" and "arg2". The job will be started using 1024 MPI tasks with 32 tasks placed on each node (remember that a phase 3 node consists of 32 cores).

The bolt job submission script creation tool will create job submission scripts with the correct settings for the aprun flags for parallel jobs on HECToR.

The most important aprun flags are:

-n
Specifies the total number of distributed memory parallel tasks you want (not including shared-memory threads).
-N
Specifies the number of distributed memory parallel tasks per node. There is a choice of 1-32 for phase 3 (XE6) nodes. As you are charged per node on HECToR the most economic choice is always to run with "fully-packed" nodes if possible, i.e. -N 32. Running "unpacked" or "underpopulated" (i.e. not using all the cores on a node) is useful if you need large amounts of memory per parallel task or you are using more than one shared-memory thread per parallel task. When running "unpacked", please keep in mind that your budget gets charged for all cores of each node, even if some of the other cores are idle.
-d
Specifies the number of cores for each parallel task to use for shared-memory threading. (This is in addition to the OMP_NUM_THREADS environment variable if you are using OpenMP for your shared memory programming.)
-S
Specifies the number of parallel tasks to place on each NUMA region (each phase 3 (XE6) node is composed of 4 NUMA regions). For mixed-mode jobs (for example, using MPI and OpenMP) it will typically be desirable to use this flag to place parallel tasks on separate NUMA regions so that shared-memory threads in the same team access the same local memory.

Please use man aprun and aprun -h to query further options.

6.4.2 Task affinity for "unpacked" jobs

As described above, a phase 3 node is logically divided into 4 NUMA regions. By default, the scheduler places tasks on the first region, then the second, and so on. If you are not using all the cores on a node (i.e. you are running "unpacked") this introduces a slight imbalance between tasks on the first NUMA regions and those on the later ones. To ensure an even distribution of tasks, further options have to be passed to the aprun command. The "-S" option defines how many parallel tasks should be placed on each NUMA region and enables an even distribution of tasks over the regions within a node. For example, to run "unpacked" using 120 MPI tasks, 20 per node, with 5 tasks per NUMA region, the following aprun command is required:

aprun -n 120 -N 20 -S 5 my_executable.x

This will place the executable over 6 nodes; each node will have 20 MPI tasks, with 5 tasks distributed to each of the NUMA regions on the node. To ensure that the correct resources are allocated to the job, the PBS job file should set mppnppn to 32, and the value of mppwidth must be increased correspondingly so that the right number of nodes is allocated. For example, running 120 MPI tasks with 20 tasks per node requires 6 nodes, so the PBS job should set mppnppn to 32 and mppwidth to 192. The values passed to the aprun command are unaffected (but can no longer be read from the PBS job info). The recommended formula for calculating the value of mppwidth from the n, N and S arguments passed to aprun is:

mppwidth = ((n + N - 1) / N) * 32    (using integer division; N = 4*S when the -S option is used)
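As a quick check, this can be evaluated in bash for the example above (120 tasks, 20 per node):

  n=120; N=20
  echo $(( (n + N - 1) / N * 32 ))    # prints 192, the mppwidth to request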

The bolt job submission script creation tool will create job submission scripts with the correct settings for the aprun flags for parallel jobs that use underpopulated nodes on HECToR. The tool will automatically try to select the optimal layout of tasks on a node for your job.

6.4.3 OpenMP Codes

Codes that use OpenMP can use up to 32 threads per parallel task. While using 32 threads per task and placing 1 task on each node will work, it is not likely to yield the best performance since the thread group will straddle more than one NUMA region. The recommended way of placing such hybrid (MPI/OpenMP) codes is to place one MPI task per NUMA region and use 8-way OpenMP (e.g. aprun -N 4 -d 8 ...). In this way all the OpenMP threads in a team access the memory local to their own NUMA region. We recommend users experiment with various thread and task combinations to find the optimal performance for their application.
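For illustration, a fully-packed hybrid job placing 4 MPI tasks per node (one per NUMA region) with 8 OpenMP threads each might be launched as follows (the executable name is illustrative; a complete mixed-mode script, which also sets PSC_OMP_AFFINITY, is given in section 6.4.5 below):

  export OMP_NUM_THREADS=8
  aprun -n 128 -N 4 -S 1 -d 8 ./my_hybrid_executable.x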

Job submission scripts for parallel jobs using MPI

A simple MPI job submission script would look like:

#!/bin/bash --login
#PBS -N My_job
#PBS -l mppwidth=2048
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
#PBS -A budget

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)               
  
# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of threads to 1
#   This prevents any system libraries from automatically 
#   using threading.
export OMP_NUM_THREADS=1

# Launch the parallel job
aprun -n 2048 -N 32 ./my_mpi_executable.x arg1 arg2

This will run your executable "my_mpi_executable.x" in parallel with 2048 MPI tasks. PBS will allocate 64 nodes to your job and place 32 tasks on each node (one per core).

Important: You have to change into a subdirectory of /work (your workspace), before calling aprun. If your submission directory (where you issued the qsub command) is part of /work, you can use the environment variable $PBS_O_WORKDIR, in the way shown above, to change into the required directory.

All PBS options start with the string "#PBS". The individual options are explained below:

-N My_job
The PBS option -N gives a name to your job. In the example the name will be "My_job". Obviously you can replace the string "My_job" with any other string you want. The name will be used in various places. In particular it will be used in the queue listing and to generate the name of your output and/or error file(s).
-l mppwidth
Use the PBS option -l mppwidth to request the total number of MPI tasks for your job.
-l mppnppn
The PBS option -l mppnppn tells the scheduler how many processes to place on a node. Valid choices are from 1 to 32. Regardless of the number of processes per node, your budget is charged for the number of nodes reserved for your job, multiplied by the time it took to run.
-l walltime
The PBS option -l walltime is used to specify the maximum wall clock time required for your job. The line #PBS -l walltime=00:20:00 will request twenty minutes. If your job exceeds the requested wall time, it will be terminated by PBS. It is advisable to ask for a slightly longer period than you expect your job to consume. To achieve a better turnaround time for short jobs, and to get hanging jobs terminated before they consume excessive amounts of time, it is typically best to keep this extra time reasonably small. Our accounting is done after your job has finished. Your budget is charged based on the time which your job actually consumed; the requested wall time is not considered for accounting purposes.
-A budget
The PBS option -A budget specifies the budget your job is going to be charged to. Please contact the principal investigator (PI) or a project manager (PM) of your project for details on the budget you should be using. You need to replace the string "budget" with the string specifying your budget.

6.4.4 Job submission scripts for parallel jobs using OpenMP

An example OpenMP job submission script is shown below.

#!/bin/bash --login
#PBS -N My_job
# Request the number of cores that you need in total
#PBS -l mppwidth=32
#PBS -l walltime=00:03:00
#PBS -A budget

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per node
export OMP_NUM_THREADS=32

# Send the appropriate options to the scheduler
aprun -n 1 -N 1 -d $OMP_NUM_THREADS ./my_openmp_executable.x arg1 arg2

This will run your executable "my_openmp_executable.x" using 32 threads on one node. We set the environment variable OMP_NUM_THREADS to 32.

As there is no distributed memory parallel component to this code, we use -n 1 -N 1 to assign a single task to the job.

The changes to the PBS options are:

-l mppwidth
Use the PBS option -l mppwidth to request the total number of threads in your job. Valid choices for an OpenMP job on HECToR are from 1 to 32.

6.4.5 Job submission scripts for hybrid parallel jobs using MPI and OpenMP (Mixed Mode)

An example mixed mode job submission script is shown below.

#!/bin/bash --login
#PBS -N My_job
#PBS -l mppwidth=4096
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
#PBS -A budget

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# Set the number of OpenMP threads per parallel task
export OMP_NUM_THREADS=8

# This variable should be set to allow ALPS and the OS scheduler to
# assign the task affinity rather than the compiler. If this is not
# set then you may see a large negative effect on performance.
export PSC_OMP_AFFINITY=FALSE

aprun -n 512 -N 4 -d 8 -S 1 ./my_mixed_executable.x arg1 arg2

This will run your executable "my_mixed_executable.x" in parallel with 512 MPI processes. PBS pro will allocate 128 nodes to your job and place 4 MPI tasks on each node. You then have 8 cores associated with each process for your OpenMP threads. We set the environment variable OMP_NUM_THREADS and the aprun flag both to 8 (threads), giving a total use of 4096 cores (512 MPI * 8 OpenMP). We also set the PSC_OMP_AFFINITY=FALSE to ensure we get the maximum performance out of the system.

As each node contains 4 NUMA regions (each an integrated circuit with 8 processing cores) you will often find that the best performance is gained by allocating 4 MPI processes per node, with each MPI process running on a separate NUMA region, and 8 OpenMP threads per MPI process. The aprun flag -S controls how many MPI processes will be placed on each NUMA region.

Even outside of the context of running mixed mode jobs, it may be desirable to underpopulate nodes (i.e. use fewer than 32 MPI tasks per node) in order to reduce contention for resources (such as memory and access to the interconnect). Underpopulating nodes can be done by following the above mixed mode example job script, without setting the parts that control threading: OMP_NUM_THREADS and the aprun flag -d.
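As an illustration, the key lines for running 512 MPI tasks underpopulated at 16 tasks per node (4 per NUMA region) might look like this (the executable name is illustrative; mppwidth follows from the formula in section 6.4.2):

  #PBS -l mppwidth=1024
  #PBS -l mppnppn=32

  aprun -n 512 -N 16 -S 4 ./my_mpi_executable.x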

Differences in PBS options from previous examples:

-l mppwidth
Use the PBS option -l mppwidth to request the total number of parallel tasks in your job. For hybrid codes this should be the number of MPI tasks multiplied by the number of OpenMP threads per task.
-l mppnppn
The PBS option -l mppnppn tells the scheduler how many processes to place on a node. Valid choices are from 1 to 32. Regardless of the number of processes per node, your budget is charged for the number of nodes reserved for your job, multiplied by the time it took to run. This number should be a factor of mppwidth.

Do not set mppdepth with this method.

6.4.6 Running Coarray Fortran Jobs on the XE nodes

A simple Coarray Fortran job submission script would look like:

#!/bin/bash --login
#PBS -N My_job
#PBS -l mppwidth=2048
#PBS -l mppnppn=32
#PBS -l walltime=00:20:00
#PBS -A budget               

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Launch the parallel job
aprun -n 2048 -N 32 ./my_caf_executable.x

This will run your executable "my_caf_executable.x" in parallel with 2048 images. PBS will allocate 64 nodes to your job and place 32 images on each node (one per core). If you compiled using the flag -X to set the number of images at compile time then the number passed to aprun through the -n flag must be the same.

Coarrays are stored on the symmetric heap (which also holds SHMEM data objects). The environment variable XT_SYMMETRIC_HEAP_SIZE controls the size (in bytes) of the symmetric heap. It understands the k, m and g suffixes for kilobytes, megabytes and gigabytes, and it must be large enough to contain all memory that has been symmetrically allocated (i.e. not in the data segment). The default size when using Coarray Fortran is 64 megabytes.
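For example, to request a larger symmetric heap (the value shown is illustrative) you would set the variable in your job script before the aprun line:

  export XT_SYMMETRIC_HEAP_SIZE=256m
  aprun -n 2048 -N 32 ./my_caf_executable.x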

6.4.7 Running Chapel Jobs on the XE nodes

A simple Chapel job submission script would look like:

#!/bin/bash --login
#PBS -N chapel_example
#PBS -l mppwidth=32
#PBS -l mppnppn=32
#PBS -l walltime=00:10:00
#PBS -A budget

# Make sure any symbolic links are resolved to absolute path
export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# Load the chapel module
module load chapel
# Set the environment variable so we can use aprun directly
export CHPL_LAUNCHER=None

aprun -n 1 ./my_chapel_executable -nl 1

6.4.8 Interactive jobs

Interactive jobs on Cray XE systems are useful for debugging or developmental work as they allow you to issue 'aprun' commands directly from the command line. To submit an interactive job reserving 256 cores for 1 hour you would use the following qsub command:

qsub -IVl mppwidth=256,walltime=1:0:0 -A budget

When you submit this job your terminal will display something like:

qsub: waiting for job 492383.sdb to start

and once the job runs you will be returned to a standard Linux command line. While the job lasts you will be able to run parallel jobs by using the 'aprun' command directly at your command prompt. The maximum number of cores you can use is limited by the value of mppwidth you specified at submission time. Remember that only the /work filesystem is accessible from the compute nodes, although you will be able to access the /home filesystem from the command line. You will normally need to change to a directory on the /work filesystem before you can run a job.
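For example, once the interactive job has started (and staying within the mppwidth you requested) you could run something like the following; the directory and executable names are illustrative:

  cd /work/x01/x01/user/my_run_dir
  aprun -n 256 -N 32 ./my_executable.x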

6.5 checkScript job submission script validation tool

The checkScript tool has been written to allow users to validate their job submission scripts before submitting their jobs. The tool will read your job submission script and try to identify errors, problems or inconsistencies.

Note that the tool currently only validates parallel job submission scripts; serial and low-priority jobs are not covered.

An example of the sort of output the tool can give would be:

user@hector-xe6-3:/work/x01/x01/user> checkScript submit.pbs 

checkScript: Validate HECToR Job Submission Scripts
===================================================

++ Warning: mppnppn not specified ++
     Default number of cores per node (32) will be used.
++ Warning: aprun command not found in script file ++
     Please check you are calling your parallel job correctly.

Using z01 budget. Remaining kAUs = 2796.859

Script details
---------------
       User: user
Script file: submit.pbs
  Directory: /esfs2/x01/x01/user            (ok)

Requested resources
-------------------
   mppwidth =           2048 cores               (ok)
    mppnppn =             32 cores   (set to default)
      nodes =             64                     (ok)
   walltime =        12:00:0                     (ok)
     budget =            z01                     (ok)


AU Usage Estimate (if full job time used)
-----------------------------------------
                   Raw kAUs =         69.120
     No capability discount =          0.000
     Estimated charged kAUs =         69.120

checkScript finished: 2 warning(s) and 0 error(s).

6.6 OOM (Out of Memory) Error Messages

There is 1 GB of memory per core, so codes that require large amounts of memory per core may have to reduce the amount of data held by each MPI task, or run with fewer MPI tasks per node. Applications that attempt to access more memory than is available on the node will abort, producing an error similar to the following:

OOM killer terminated this process.

If this happens to your code on the phase 3 system, there are two potential solutions:

  • If your application scales easily over more processors, try running the same dataset over a larger number of cores so that each MPI task holds less data.
  • If it is difficult or undesirable to run with more MPI tasks, then it is possible to run with fewer MPI tasks per node. This is also known as running "unpacked". (Note: you will still be charged as if you had used the full node.)

6.7 Automatic re-running of jobs after system restart

If the HECToR system crashes then all jobs that were running at the time of the system halt will NOT be automatically restarted when the system is brought back up.

You can change this default behaviour for a particular job by adding the following line to your job submission script:

#PBS -r y 

This automatically restarts the job when the system returns.

6.8 Batch System Layout and Limits

The batch system is laid out so that all you need to do is request the number of processes you need (along with the number of processes per node) and the time for your job. The scheduling system will then place the job in the appropriate queue.

6.8.1 Parallel Queue Configuration

The current limits on parallel jobs are:

  • Phase 3, XE6 nodes: 1 minute to 12 hours; 1-2048 nodes (32-65536 processing cores with fully-populated nodes).
  • Phase 3, XE6 nodes: 1 minute to 24 hours; 5-128 nodes (160-4096 processing cores with fully-populated nodes).

The queues are arranged in the following way:

  • For phase 3, XE6 parallel jobs, number of nodes of: 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048.
  • For phase 3, XE6 parallel jobs, times of: 20 minutes, 1 hour, 3 hours, 6 hours, 12 hours and 24 hours. 24 hour queues are only available for jobs of 5-128 nodes.

The names of the queues are derived from the queue limits and resources. For example, the 128 node, 6 hour queue would be called "par:128n_6h" and would be used for a job with mppwidth=4096 and mppnppn=32.

If your job does not exactly match any of the queues it will be placed in the smallest queue that matches your requests for processors and time.

6.8.2 Job Limits

To ensure fair access we ask that users avoid filling the job queues with large batches of jobs. If you want to run large sequences of jobs on HECToR then you may find the Batch system chapter of the HECToR Optimisation Guide useful - it lists a number of techniques for running large sets of batch jobs on HECToR.

We also ask that users do not place large numbers of held jobs in the HECToR queues. The queue setup on HECToR is not static; it is a dynamic environment that modifies various scheduling parameters (for example, queue running limits, user running limits, enabling of low-priority access) based on calculations of both the current and projected workload. The projected workload is particularly important as it is used, amongst other things, to decide whether large numbers of large jobs can be released to run or should be throttled back to ensure fair access and minimise waiting times. One of the primary inputs to the projected workload calculation is information on the currently queued and held jobs, since a user action (or a related job finishing) can change the status of any held job to queued at any time. As such, held jobs must be counted as part of the backlog of work on the system.

You can often find out the reason why a particular job is not running by using the command:

qstat -f [Job-ID]

and looking in the "comment" section of the output.

The default limits are:

  • Maximum 8 jobs total per user per machine in the queues, either running, waiting or held.
  • There is a system limit of a maximum 4 jobs running per user per queue at any one time.
  • There is a system limit of a maximum of 8 jobs running per user across the entire system.

6.8.3 Queue Status

You can check where the space is in the queues by using the 'queueStatus' command. This lists the current number of jobs running and queued in each queue class as well as the limit on concurrently running jobs in that class. An example of the output from 'queueStatus' is shown below.

Current HECToR queue status:
                                                                  Queue Status
                                                        (running/queued/limit)
Nodes (Cores)          20m           1h           3h           6h          12h
============= ============ ============ ============ ============ ============
      4 (128)        Empty        Empty        Space        Space         Busy
                  (0/0/48)     (0/0/48)     (6/0/48)    (19/0/48)    (39/2/48)
------------- ------------ ------------ ------------ ------------ ------------
      8 (256)        Empty        Space        Space        Space         Busy
                  (0/0/36)     (1/0/36)     (1/0/36)     (4/0/36)    (22/0/36)
------------- ------------ ------------ ------------ ------------ ------------
     16 (512)        Empty        Empty        Empty        Space        Space
                  (0/0/36)     (0/0/36)     (0/0/36)     (2/0/36)    (14/6/36)
------------- ------------ ------------ ------------ ------------ ------------
    32 (1024)        Empty        Empty        Empty        Empty         Busy
                  (0/0/24)     (0/0/24)     (0/0/24)     (0/0/24)    (14/2/24)
------------- ------------ ------------ ------------ ------------ ------------
    64 (2048)        Empty        Empty        Space        Space        Space
                  (0/0/24)     (0/0/24)     (1/0/24)     (2/0/24)     (3/1/24)
------------- ------------ ------------ ------------ ------------ ------------
   128 (4096)        Empty        Empty        Empty        Empty        Empty
                  (0/0/24)     (0/0/24)     (0/0/24)     (0/0/24)     (0/0/24)
------------- ------------ ------------ ------------ ------------ ------------
   256 (8192)        Empty        Empty        Empty        Empty        Empty
                  (0/0/12)     (0/0/12)     (0/0/12)     (0/0/12)     (0/0/12)
------------- ------------ ------------ ------------ ------------ ------------
  512 (16384)        Empty        Empty        Empty        Empty        Empty
                   (0/0/6)      (0/0/6)      (0/0/6)      (0/0/6)      (0/0/6)
------------- ------------ ------------ ------------ ------------ ------------
 1024 (32768)        Empty        Empty        Empty        Empty        Empty
                   (0/0/2)      (0/0/2)      (0/0/2)      (0/0/2)      (0/0/2)
------------- ------------ ------------ ------------ ------------ ------------
 2048 (65536)        Empty        Empty        Empty        Empty        Empty
                   (0/0/1)      (0/0/1)      (0/0/1)      (0/0/1)      (0/0/1)
------------- ------------ ------------ ------------ ------------ ------------

This table is also available on the Service Status page of the HECToR website.

6.8.4 Low Priority Queue

A low priority queue is available on HECToR. The key facts are as follows:

  • A single low priority queue is available.
  • Class 1a projects will not be charged for low priority jobs. All other projects will be charged.
  • The minimum job size supported in the queue is 512 cores (16 nodes). This limit is subject to change depending upon operational necessity.
  • The maximum job size supported in the queue is 16,384 cores (512 nodes). This limit is subject to change depending upon operational necessity.
  • The maximum run time supported is 3 hours.
  • Users can have a maximum of 2 low priority jobs on the system at any one time, with a maximum of one job running at any one time. Any additional jobs are liable for removal.
  • Jobs can be submitted to the queue at any time.
  • The low priority queue will only be enabled when the backlog of 'normal' jobs is below 3 hours. The backlog is currently re-calculated hourly. This 3-hour trigger point may be varied according to operational necessity.
  • Jobs in the low priority queue will continue to run right up to the start time of planned maintenance slots. The low priority queue will not be drained. We would therefore encourage you to use check-pointing.
  • Any low priority jobs which fail to complete as a result of planned maintenance or a node/system failure will not be refunded.
  • The execution queue is called 'low'. To submit a low priority job, users should add the following to their PBS job submission script:
            #PBS -q lowpriority
            

6.8.5 Capability Incentives

The Capability Incentives Scheme is an encouragement to users to broaden their computational science and to exploit the capabilities of the service. Under the scheme, jobs will be discounted at three levels depending on how well they scale. The three levels of incentives are Gold, Silver and Bronze.

Level    Min. number of cores    AU discount
Bronze   8192                    5%
Silver   16384                   20%
Gold     32768                   50%

6.9 Reservations

Reservations are available on HECToR. These allow users to reserve a number of nodes for a specified length of time starting at a particular time on the system.

Reservations will require justification. They will only be approved if the request could not be fulfilled with the standard queues. Possible uses for a reservation would be:

  • An exceptional job requires longer than 24 hours runtime.
  • You require a job/jobs to run at a particular time e.g. for a demonstration or course.
  • You require a larger number of nodes than are available through the usual queues.

Note: Reservation requests must be submitted at least 48 hours in advance of the reservation start time. If requesting a reservation for a Monday at 12:00, please ensure the request is received by 12:00 on the preceding Friday at the latest. The same applies over Bank Holidays.

Reservations will be charged at 1.5 times the usual AU rate and you will be charged the full rate for the entire reservation at the time of booking, whether or not you use the nodes for the full time. In addition, you will not be refunded the AUs if you fail to use them due to a job crash unless this crash is due to a system failure.

To request a reservation please use the form on your main SAFE page. You need to provide the following:

  • the start time and date of the reservation;
  • the end time and date of the reservation;
  • the project code for the reservation;
  • the number of nodes required;
  • your justification for the reservation (as above, this must be provided or the request will be rejected).

Your request will be checked by the Helpdesk and, if approved, you will be provided with a reservation ID which can be used on the system. You submit jobs to a reservation using the qsub command in the following way:

qsub -q <reservation ID> <job submission script>

6.10 Developing/debugging codes in the batch system - Interactive jobs

The nature of the batch system on HECToR does not lend itself to developing or debugging code as the queues are primarily set up for production jobs.

When you are developing or debugging code you often want to run many short jobs with a small amount of editing the code in between runs. One of the best ways to achieve this on HECToR is to use interactive jobs. An interactive job allows you to issue the 'aprun' commands directly on the command line.

You submit an interactive job using the following syntax:

qsub -IV -l mppwidth=128,walltime=1:0:0 -A z01

This example will run a 1 hour, 128 processor job and charge it to the budget z01.

When you submit the job you will get a message like:

qsub: waiting for job 587453.sdb to start

Once the job starts you will get a command prompt where you can type 'aprun' commands directly, and you can also use standard editing and compilation commands. Remember that only the /work filesystem is accessible from the compute nodes, though you will be able to access the /home filesystem from the command line. You will normally need to change to a directory on the /work filesystem before you can run a job.

6.11 Serial Queues

The serial queues on the HECToR facility are designed for large compilations, post-calculation analysis and data manipulation. They should be used for jobs which do not require parallel processing but which would have an adverse impact on the operation of the login nodes if they were run interactively.

Example uses include: compressing large data files, visualising large datasets, large compilations and transferring large amounts of data off the system.

Useful information on the serial queues

  • The serial queues have lengths of 20 minutes, 1 hour, 3 hours, 6 hours and 12 hours.
  • Eight serial jobs can run concurrently.
  • Budgets must be specified but no charges will be applied for time used in the serial queues.
  • As the walltime for serial jobs cannot be effectively determined, jobs in the serial queues may be terminated on a system shutdown in the same way as interactive use of the login nodes.
  • Up to 4 jobs can share a single dual-core processor in the serial queues.
  • Up to 4 jobs share 8GB of memory so the amount of memory available for your job varies depending on what else is running in the queues.
  • Current serial queue limits are listed below, but are subject to change:
    serial_12h : 1 max/user, 1 max total jobs
    serial_6h : 1 max/user, 1 max total jobs
    serial_3h : 1 max/user, 2 max total jobs
    serial_1h : 1 max/user, 4 max total jobs
    serial_20m : 1 max/user, 8 max total jobs

6.11.1 Example Serial Job Submission Script

The following PBS options are used in serial job submission scripts:

  • The option -q serial must be included.
  • The option -l cput=hh:mm:ss must be used to specify the calculation time limit.

You can use the bolt job submission script creation tool to generate serial job submission scripts with the correct options.

For example, here is a serial job submission script to compress an output file (with a time limit of twenty minutes):

#!/bin/bash --login
#
#PBS -q serial
#PBS -l cput=00:20:00
#PBS -A budget

cd $PBS_O_WORKDIR

gzip output.dat

The #PBS string at the start of a line indicates that the line specifies a PBS option. The options included in this submission script are:

-q serial
The PBS option -q is used to specify the queue for your job. The line #PBS -q serial requests the serial queues.
-l cput
The PBS option -l cput is used to specify the CPU time required for your serial job. The line #PBS -l cput=00:20:00 will request twenty minutes. If your job exceeds the requested time, it will be terminated by PBS. It is advisable to ask for a slightly longer period than you expect your job to consume. To achieve a better turnaround time for short jobs, and to get hanging jobs terminated before they consume excessive amounts of time, it is typically best to keep this extra time reasonably small.
-A budget
The PBS option -A specifies the budget your job is going to be charged to. Please contact the principal investigator (PI) or a project manager (PM) of your project for details on the budget you should be using. You need to replace the string "budget" with the string specifying your budget.

The remaining part of the script specifies the commands to be run in the same way as for a normal shell script.

6.11.2 Example Serial Job Submission Script for Visualisation Software

You can also run visualisation jobs (e.g. Xconv) through the serial queues. In these types of jobs you need to submit the job and then keep your terminal open until the job runs, at which point the application window will appear on your desktop. The application will finish when you close it manually or when you hit the time limit specified in your submission script.

One additional change is needed to the submission script to enable the viewing of graphical data: the $DISPLAY variable must be exported into the job environment. (This all assumes that you have logged in with ssh -Y and have an Xserver running on your desktop.)

An example submission script for the Xconv application is shown below.

#!/bin/bash --login
#
#PBS -q serial
# Export the $DISPLAY variable
#PBS -v DISPLAY
#PBS -l cput=00:20:00
# Replace with your budget
#PBS -A budget

cd $PBS_O_WORKDIR

# Assuming that 'xconv' is in your path
xconv

6.11.3 Interactive serial jobs

Serial interactive jobs run in much the same way as the parallel interactive jobs described above, except that you need to set the job time using 'cput' and you need to specify the 'serial' queue. For example, to submit a 1-hour interactive serial job you would use:

qsub -IVl cput=1:0:0 -q serial -A budget

(Remember to replace 'budget' with your budget code.) When you submit this job your terminal will display something like:

qsub: waiting for job 492383.sdb to start

and once the job runs you will be returned to a standard Linux command line from which you can run your CPU-intensive commands.

6.11.4 Problems with serial commands in parallel job scripts

If serial commands are issued in parallel job scripts then they will not be dispatched to backend nodes. Instead, they will run on the aprun job launcher node, to the detriment of the batch scheduler and the entire HECToR facility that you are running on. For this reason it is critical that no serial commands (such as compression, post-processing or compilation) are included in parallel job scripts. All commands in parallel job scripts should begin with aprun.

To run serial commands, please use the serial queues provided (see documentation above).
