HECToR

FAQ: Running Jobs

This section describes how to run parallel jobs.

Which batch system is used on HECToR?
Which application should I use to load and execute my program on the compute nodes?
How do I submit jobs?
How many jobs can I queue/run at once?
Why does PBS not run my script?
How do I specify the number of cores on which my job should run?
How do I ask for a single process per node?
How do I find out which queues are available on the system?
Can I see which nodes my job is running on?
12 hours and 65,536 cores is not enough for my job. How can I use more?
My job re-ran automatically after a system restart - how can I stop this happening?
How can I control the placement of scratch (/tmp) files at runtime?

Go back to the FAQ index.

Q. Which batch system is used on HECToR?

A. HECToR uses the Portable Batch System (PBS).

Q. Which application should I use to load and execute my program on the compute nodes?

A. On Compute Node Linux, you should use aprun.

Q. How do I submit jobs?

A. The command qsub may be used to submit a PBS job, usually from a script but possibly from standard input. Scripts can be written in Linux shells such as bash or sh, as well as Perl, etc.

A PBS job script consists of:

shell specification;
any PBS directives;
your tasks: programs, commands or applications;

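Here is an example of a script that: uses /bin/bash as the shell; names the job "Weather1"; requests 4096 processes at 32 per node; limits the run time of the job to one hour of wall-clock time; charges the job to the budget named budget; and then runs the executable ./weathersim from the directory the job was submitted from: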

 
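#!/bin/bash --login
#PBS -N Weather1
#PBS -l mppwidth=4096
#PBS -l mppnppn=32
#PBS -l walltime=1:00:00
#PBS -A budget

# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR

# Launch 4096 processes, 32 per node, matching mppwidth and mppnppn above.
aprun -n 4096 -N 32 ./weathersim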

Always make sure that the first line of your PBS script specifies a shell interpreter, e.g. #!/bin/bash --login. If it is left out, basic run-time error messages (such as reading past the end of a file or accessing an array out of bounds) may be suppressed from your output file. The --login option is needed to use module commands within a submission script (see the Environment FAQ Section).
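As a quick check before submitting, the first-line requirement above can be tested from the command line. This is only a minimal sketch: the function name has_shebang and the script name weather1.pbs are hypothetical, and the check only confirms that an interpreter line is present, not that it includes --login.

```shell
#!/bin/bash
# has_shebang FILE: succeed if the first line of FILE starts with "#!".
# (Hypothetical helper, not a HECToR or PBS command.)
has_shebang() {
    head -n 1 "$1" | grep -q '^#!'
}

# Example use, assuming a job script called weather1.pbs:
#   has_shebang weather1.pbs || echo "weather1.pbs has no interpreter line"
```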

Q. How many jobs can I queue/run at once?

A. As specified in the HECToR Code of Conduct, you can have a maximum of 4 jobs running in any one queue at any time and a maximum of 8 jobs running simultaneously on the entire machine. You should have no more than 8 jobs in the HECToR queues at any one time (running, queued or held).

Q. Why does PBS not run my script?

A. This may be caused by carriage-return line-feed (CRLF) line endings in the job script, which occur if the script has been edited on Windows and uploaded to HECToR. Linux systems end each line with the line-feed (LF) character alone. The file command lets you check which is used: it reports any line terminator other than LF. Use the dos2unix command to convert the script.
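The check and the conversion can be combined in a small script. The sketch below is illustrative only: the function names are hypothetical, and the tr fallback (which deletes every carriage return, not just those at line ends) is an assumption for machines where dos2unix is not installed.

```shell
#!/bin/bash
# has_crlf FILE: succeed if any line in FILE ends with a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "${cr}\$" "$1"
}

# fix_line_endings FILE: convert FILE to LF line endings in place.
# Uses dos2unix when available; otherwise strips all CR characters with tr.
fix_line_endings() {
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$1"
    else
        tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    fi
}

# Example use, assuming a job script called job.pbs:
#   has_crlf job.pbs && fix_line_endings job.pbs
```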

Q. How do I specify the number of cores on which my job should run?

A. aprun -n NPE ./myprog.exe launches NPE instances of ./myprog.exe. If -n is not specified, it defaults to 1.

Q. How do I ask for a single process per node?

A. To create NPE instances of myprog.exe and launch each on a separate node, run aprun -n NPE -N 1 ./myprog.exe.
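The values given to -n and -N together determine how many nodes a job occupies. The following sketch works through that arithmetic, assuming aprun places up to N processes on each node before moving to the next; nodes_needed is a hypothetical helper, not a HECToR command.

```shell
#!/bin/bash
# nodes_needed NPE PER_NODE: print the number of nodes occupied when NPE
# processes are placed PER_NODE to a node (rounded up to a whole node).
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 4096 32   # prints 128 (fully packed, 32 processes per node)
nodes_needed 16 1      # prints 16  (one process per node)
nodes_needed 100 32    # prints 4   (the last node is only partly filled)
```

Placing one process per node therefore occupies as many nodes as there are processes, which is worth bearing in mind when choosing a queue.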

Q. How do I find out which queues are available on the system?

A. The qstat -Q command may be used to display the statuses of the queues on HECToR. For example, par:n16_6h is a parallel queue (par) for 16 nodes (n16) and a maximum run wall time of 6 hours (6h).

Q. Can I see which nodes my job is running on?

A. You may use xtnodestat to display information about compute- and service-partition processors and the jobs running in each partition.

Q. 12 hours and 65,536 cores is not enough for my job. How can I use more?

A. The phase 3 machine now offers a maximum job size of 2048 nodes (65,536 cores, fully-packed).

To run for more than 12 hours, one method is to use checkpoints and restarts: each job writes a checkpoint before it finishes and the next job restarts from it. The qsub option -W depend helps here, as it lets you specify dependencies between submitted jobs. See the qsub man page.
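A chain of dependent jobs might be submitted along the following lines. This is a sketch under stated assumptions rather than a recommended procedure: it assumes qsub prints the new job's identifier on standard output, that the depend=afterok:<jobid> form described in the qsub man page is available, and that the job script itself restarts from the latest checkpoint. The names submit_chain and QSUB are hypothetical.

```shell
#!/bin/bash
# QSUB may be overridden to point at a stand-in command for a dry run.
QSUB=${QSUB:-qsub}

# submit_chain SCRIPT COUNT: submit SCRIPT COUNT times, each submission
# starting only after the previous one has finished successfully.
submit_chain() {
    script=$1
    count=$2
    prev=""
    i=1
    while [ "$i" -le "$count" ]; do
        if [ -z "$prev" ]; then
            # First segment: no dependency.
            prev=$("$QSUB" "$script") || return 1
        else
            # Later segments wait for the previous job to exit successfully.
            prev=$("$QSUB" -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "submitted segment $i as $prev"
        i=$((i + 1))
    done
}

# Example use, assuming a restartable job script called run.pbs:
#   submit_chain run.pbs 3
```

Bear in mind that every job in the chain counts towards the limit on the number of jobs you may have in the queues.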

If you require more than 12 hours and/or 65,536 cores and cannot use checkpoints then please contact the helpdesk explaining your requirements.

Q. My job re-ran automatically after a system restart - how can I stop this happening?

A. You should add the following line to your job submission script:

#PBS -r n

which will stop the job being re-run after a system restart.

Q. How can I control the placement of scratch (/tmp) files at runtime?

A. As discussed in the user guide, for PGI use the TMPDIR environment variable and for GNU Fortran use GFORTRAN_TMPDIR.
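In a job script, both variables could be set before the aprun line, for example as below. The choice of a tmp directory under the submission directory is an assumption made for illustration; use whichever location the user guide recommends for your files.

```shell
#!/bin/bash
# Hypothetical scratch location: a tmp directory under the directory the
# job was submitted from (falls back to the current directory outside PBS).
scratch=${PBS_O_WORKDIR:-$PWD}/tmp
mkdir -p "$scratch"

export TMPDIR=$scratch            # read by PGI-compiled programs
export GFORTRAN_TMPDIR=$scratch   # read by GNU Fortran-compiled programs
```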

Go back to the FAQ index.
