FAQ: Running Jobs
This section describes how to run parallel jobs.
- Which batch system is used on HECToR?
- Which application should I use to load and execute my program on the compute nodes?
- How do I submit jobs?
- How many jobs can I queue/run at once?
- Why does PBS not run my script?
- How do I specify the number of cores on which my job should run?
- How do I ask for a single process per node?
- How do I find out which queues are available on the system?
- Can I see which nodes my job is running on?
- 12 hours and 65,536 cores is not enough for my job. How can I use more?
- How can I control the placement of scratch (/tmp) files at runtime?
- My job re-ran automatically after a system restart - how can I stop this happening?
- Q. Which batch system is used on HECToR?
-
A. HECToR uses the Portable Batch System (PBS).
- Q. Which application should I use to load and execute my program on the compute nodes?
-
A. On Compute Node Linux, you should use aprun.
- Q. How do I submit jobs?
-
A. The command qsub may be used to submit a PBS job, usually from a script but possibly from standard input. Scripts can be written for Linux shells such as bash or sh, or in other scripting languages such as Perl.
A PBS job script consists of:
- shell specification;
- any PBS directives;
- your tasks: programs, commands or applications.
Here is an example of a script that: sets the shell to /bin/bash; names the job "Weather1"; limits the run time of the job to one hour of wall-clock time; and then runs the executable ./weathersim:
    #!/bin/bash --login
    #PBS -N Weather1
    #PBS -l mppwidth=4096
    #PBS -l mppnppn=32
    #PBS -l walltime=1:00:00
    #PBS -A budget
    cd $PBS_O_WORKDIR
    aprun -n 4096 -N 32 ./weathersim
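The resource requests in the example can be checked with a little arithmetic. The sketch below assumes that mppwidth is the total number of processes and mppnppn the number of processes placed on each node; the function name nodes_needed is purely illustrative and is not a PBS or Cray command.

```shell
# Number of nodes implied by a given mppwidth and mppnppn
# (assumption: width = total processes, nppn = processes per node).
nodes_needed() {
    width=$1
    nppn=$2
    # Round up so that a partly filled node is still counted.
    echo $(( (width + nppn - 1) / nppn ))
}

nodes_needed 4096 32   # prints 128
```

With mppwidth=4096 and mppnppn=32 the job therefore occupies 128 fully-packed nodes.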
Always make sure that the first line of your PBS script specifies a shell interpreter, e.g. #!/bin/bash --login. Leaving this out may cause basic error messages (such as reading past the EOF, trying to access out of array bounds) to be suppressed from your output file. The --login option is needed to use module commands within a submission script (see the Environment FAQ Section).
- Q. How many jobs can I queue/run at once?
-
A. As specified in the HECToR Code of Conduct, you can have a maximum of 4 jobs running in any one queue at any time and a maximum of 8 jobs running simultaneously on the entire machine. You should have no more than 8 jobs in the HECToR queues at any one time (running, queued or held).
- Q. Why does PBS not run my script?
-
A. This may be caused by carriage-return line-feed (CRLF) line endings in the job script. This will occur if the script has been edited in Windows and uploaded to HECToR. Linux systems use the line-feed (LF) character instead. The file command allows you to check which is used: if something other than LF is used this will be reported. Use the dos2unix command to perform a conversion.
- Q. How do I specify the number of cores on which my job should run?
-
A. The command aprun -n NPE ./myprog.exe launches NPE instances of ./myprog.exe. If -n is not specified, it defaults to 1.
- Q. How do I ask for a single process per node?
-
A. To create NPE instances of myprog.exe and launch each on a separate node, run aprun -n NPE -N 1 ./myprog.exe.
- Q. How do I find out which queues are available on the system?
-
A. The qstat -Q command may be used to display the status of the queues on HECToR. For example, par:n16_6h is a parallel queue (par) for 16 nodes (n16) with a maximum wall-clock run time of 6 hours (6h).
- Q. Can I see which nodes my job is running on?
-
A. You may use xtnodestat to display information about compute- and service-partition processors and the jobs running in each partition.
- Q. 12 hours and 65,536 cores is not enough for my job. How can I use more?
-
A. The phase 3 machine now offers a maximum job size of 2048 nodes (65,536 cores, fully-packed).
To run for more than 12 hours, one method is to use checkpoints/restarts. This can be helped by the qsub option -W depend, which allows you to list job submission dependencies. See the qsub man pages.
If you require more than 12 hours and/or 65,536 cores and cannot use checkpoints then please contact the helpdesk explaining your requirements.
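A minimal sketch of a checkpoint/restart chain built with -W depend is given here. It assumes a hypothetical job script, segment.pbs, that resumes from its most recent checkpoint, and it relies on qsub printing the new job's identifier on standard output. The function name submit_chain and the QSUB_CMD override are illustrative only and are not part of PBS.

```shell
# Submit the same job script several times so that each segment starts
# only after the previous one has finished successfully (afterok).
# QSUB_CMD may be set to another command for a dry run without PBS.
submit_chain() {
    script=$1
    segments=$2
    qsub_cmd=${QSUB_CMD:-qsub}
    prev=""
    i=1
    while [ "$i" -le "$segments" ]; do
        if [ -z "$prev" ]; then
            # First segment: no dependency.
            prev=$("$qsub_cmd" "$script") || return 1
        else
            # Later segments wait for the previous job ID.
            prev=$("$qsub_cmd" -W "depend=afterok:$prev" "$script") || return 1
        fi
        echo "segment $i: $prev"
        i=$((i + 1))
    done
}
```

Calling submit_chain segment.pbs 3 would queue three segments at once. Note that each queued segment counts towards the limit on the number of jobs you may have in the queues.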
- Q. My job re-ran automatically after a system restart - how can I stop this happening?
-
A. You should add the following line to your job submission script:
#PBS -r n
which will stop the job being re-run after a system restart.
- Q. How can I control the placement of scratch (/tmp) files at runtime?
-
A. As discussed in the user guide, for PGI set the TMPDIR environment variable and for GNU Fortran set GFORTRAN_TMPDIR.
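For example, both variables could be set in the job script before the aprun line. This is only a sketch: the directory name scratch_tmp and the SCRATCH_DIR variable are hypothetical placeholders, and the location should be replaced by a directory on a filesystem that the compute nodes can write to.

```shell
# Hypothetical scratch location; override SCRATCH_DIR to choose another.
SCRATCH_DIR=${SCRATCH_DIR:-$PWD/scratch_tmp}
mkdir -p "$SCRATCH_DIR"

# PGI reads TMPDIR; GNU Fortran reads GFORTRAN_TMPDIR.
export TMPDIR="$SCRATCH_DIR"
export GFORTRAN_TMPDIR="$SCRATCH_DIR"
```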