Batch Processing

All of the HPC clusters (with the exception of a few special machines) run under the control of a batch system. All user jobs except short serial test runs must be submitted to the cluster through this batch system. Submitted jobs are routed into a number of queues (depending on the required resources, e.g. runtime) and sorted according to a priority scheme.

A job will run when the required resources become available. On most clusters, a number of nodes is reserved during working hours for short test runs with less than one hour of runtime. These nodes are dedicated to the devel queue. We do not allow MPI-parallel applications on the frontends; short parallel test runs must be performed using batch jobs.

It is also possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.

The older clusters use a software called Torque as the batch system; newer clusters, starting with "meggie", use Slurm instead. Unfortunately, there are many differences between the two systems. We describe both below.

Torque

Commands for Torque

The command to submit jobs is called qsub. To submit a batch job use

qsub <further options> [<job script>]

The job script may be omitted for interactive jobs (see below). After submission, qsub will output the Job ID of your job. It can later be used for identification purposes and is also available as the environment variable $PBS_JOBID in job scripts (see below). These are the most important options for the qsub command:

Option Meaning
Important options for qsub and their meaning
-N <job name> Specifies the name which is shown with qstat. If the option is omitted, the name of the batch script file is used.
-l nodes=<# of nodes>:ppn=<nn> Specifies the number of nodes requested. All current clusters (except the SandyBridge partition within Woody) require you to always request full nodes. Thus, for Emmy you always need to specify :ppn=40, and for Woody (usually) :ppn=4. For other clusters, see the documentation of the respective clusters for the correct ppn values.
-l walltime=HH:MM:SS Specifies the required wall clock time (runtime). When the job reaches the walltime given here, it will be sent a TERM signal. After a few seconds, if the job has not ended yet, it will be sent KILL. If you omit the walltime option, a very short default time will be used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred).
-M x@y -m abe You will get e-mail to x@y when the job is aborted (a), begins (b), or ends (e). You can choose any subset of abe for the -m option. If you omit the -M option, the default mail address assigned to your RRZE account will be used.
-o <standard output file> File name for the standard output stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID.
-e <error output file> File name for the standard error stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID.
-I Interactive job. It is still allowed to specify a job script, but it will be ignored, except for the PBS options it might contain. No code will be executed. Instead, the user will get an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel program with mpirun.
-X Enable X11 forwarding. If the $DISPLAY environment variable is set when submitting the job, an X program running on the compute node(s) will be displayed at the user’s screen. This makes sense only for interactive jobs (see -I option).
-W depend=<dependency list> Makes the job depend on certain conditions. E.g., with -W depend=afterok:12345 the job will only run after job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the qsub man page for more information.
-q <queue> Specifies the Torque queue (see above); default queue is route. Usually it is not required to use this parameter as the route queue automatically forwards the job to an appropriate execution queue.
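As a sketch of how these options combine in practice (the job name, mail address, and script names below are placeholders):

```shell
# Submit an MPI job to Emmy: 4 full nodes, 6 hours, mail on abort and end
qsub -N myjob -l nodes=4:ppn=40,walltime=06:00:00 -M x@y -m ae job.sh

# qsub prints the job ID, which can be captured to build a dependency
# chain: part2.sh starts only if part1.sh finished successfully
JOBID=$(qsub -l nodes=1:ppn=40,walltime=01:00:00 part1.sh)
qsub -W depend=afterok:$JOBID part2.sh
```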

There are several Torque commands for job inspection and control. The following table gives a short summary:

Command Purpose Options
Useful Torque user commands
qstat [<options>] [<JobID>|<queue>] Displays information on jobs. Only the user’s own jobs are displayed. For information on the overall queue status see the section on job priorities. -a display "all" jobs in user-friendly format
-f extended job info
-r display only running jobs
qdel <JobID> ... Removes job from queue
qalter <qsub-options> Changes job parameters previously set by qsub. Only certain parameters may be changed after the job has started. see qsub and the qalter manual page
qcat [<options>]  <JobID> Displays stdout/stderr from a running job -o display stdout (default)
-e display stderr
-f output appended data as the job is running (like tail -f)

The scheduler typically sets environment variables to tell the job about what resources were allocated to it. These can also be used in batch scripts. The most useful are given below:

Useful environment variables for Torque
Job ID $PBS_JOBID
Directory from which the job was submitted $PBS_O_WORKDIR
List of nodes on which job runs (filename) cat $PBS_NODEFILE
Number of nodes allocated to job $PBS_NUM_NODES
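As an illustration of how these variables can be used, the following sketch derives the node and core counts from the node file. The sample node file is created by hand here only so the snippet is self-contained; in a real job, Torque provides $PBS_NODEFILE itself:

```shell
#!/bin/bash
# In a real Torque job, $PBS_NODEFILE is set by the batch system and
# lists one hostname per allocated core. We fake it here for the demo.
PBS_NODEFILE=$(mktemp)
printf 'node01\nnode01\nnode02\nnode02\n' > "$PBS_NODEFILE"

NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)   # distinct nodes
NCORES=$(wc -l < "$PBS_NODEFILE")           # one line per core

echo "allocated $NNODES nodes / $NCORES cores"
rm -f "$PBS_NODEFILE"
```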

 

Batch scripts for Torque

To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like estimated runtime and required number of nodes/CPUs can also be specified there (instead of on the command line):

Example of a batch script (Emmy cluster), MPI parallel job
#!/bin/bash -l
#
# allocate 4 nodes (80 Cores / 160 SMT threads) for 6 hours
#PBS -l nodes=4:ppn=40,walltime=06:00:00
#
# job name 
#PBS -N Sparsejob_33
#
# first non-empty non-comment line ends PBS options


#load required modules (compiler, MPI, ...)
module load example1
# jobs always start in $HOME - 
# change to work directory
cd  ${PBS_O_WORKDIR}

# uncomment the following lines to use $FASTTMP
# mkdir ${FASTTMP}/$PBS_JOBID
# cd ${FASTTMP}/$PBS_JOBID
# copy input file from location where job was submitted
# cp ${PBS_O_WORKDIR}/inputfile .

# run, using only physical cores
mpirun -n 80 a.out -i inputfile -o outputfile

 

Example of a batch script (woody cluster), shared memory parallel job (OpenMP)
#!/bin/bash -l
#
# allocate 1 node (4 Cores) for 6 hours
#PBS -l nodes=1:ppn=4,walltime=06:00:00
#
# job name 
#PBS -N Sparsejob_33
#
# first non-empty non-comment line ends PBS options

#load required modules (compiler, ...)
module load intel64
# jobs always start in $HOME - 
# change to work directory
cd  ${PBS_O_WORKDIR}
export OMP_NUM_THREADS=4
 
# run 
./a.out

The comment lines starting with #PBS are ignored by the shell but interpreted by Torque as options for job submission (see above for an options summary). These options can all be given on the qsub command line as well. The example also shows the use of the $FASTTMP and $HOME variables. $PBS_O_WORKDIR contains the directory where the job was submitted. All batch scripts start executing in the user’s $HOME so some sort of directory change is always in order.

Modules can be loaded from inside a batch script. The only requirement is that you use either a csh-based shell or bash with the -l switch, as in the examples above.

Interactive Jobs with Torque

For testing purposes or when running applications that require some manual intervention (like GUIs), Torque offers interactive access to the compute nodes that have been assigned to a job. To do this, specify the -I option to the qsub command and omit the batch script. When the job is scheduled, you will get a shell on the master node (the first in the assigned job node list). It is possible to use any command, including mpirun, there. If you need X forwarding, use the -X option in addition to -I.
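For example, a half-hour interactive session on one full Emmy node with X11 forwarding could be requested as follows (adjust the ppn value for other clusters):

```shell
# interactive job: 1 node, 30 minutes, X11 forwarding enabled
qsub -I -X -l nodes=1:ppn=40,walltime=00:30:00
```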

Note that the starting time of an interactive batch job cannot reliably be predicted; you have to wait for it to be scheduled. We therefore recommend running such jobs with wallclock time limits of less than one hour, so that the job is routed to the devel queue, for which a number of nodes is reserved during working hours.

Interactive batch jobs do not produce stdout and stderr files. If you want a record of what happened, use e.g. the UNIX script command.

Slurm

Commands for Slurm

The command to submit jobs is called sbatch. To submit a batch job use

sbatch [options] <job script>

After submission, sbatch will output the Job ID of your job. It can later be used for identification purposes and is also available as the environment variable $SLURM_JOBID in job scripts (see below). The following parameters can be specified as options for sbatch or included in the job script by using the script directive #SBATCH:

Important options for sbatch/srun and their meaning
--job-name=<name> Specifies the name which is shown with squeue. If the option is omitted, the name of the batch script file is used.
--nodes=<number> Specifies the number of nodes requested. Default value is 1.
--ntasks=<number> Overall number of tasks (MPI processes). Can be omitted if --nodes and --ntasks-per-node are given. Default value is 1.
--ntasks-per-node=<number> Number of tasks (MPI processes) per node.
--cpus-per-task=<number> Number of threads (logical cores) per task. Used for OpenMP or hybrid jobs.
--time=HH:MM:SS Specifies the required wall clock time (runtime). When the job reaches the walltime given here, it will be sent a TERM signal. After a few seconds, if the job has not ended yet, it will be sent KILL. If you omit the walltime option, a very short default time will be used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred).
--mail-user=<address>

--mail-type=<type>

You will get e-mail to <address> depending on the type you have specified. As type, you can choose BEGIN, END, FAIL, TIME_LIMIT, or ALL. Specifying more than one type is also possible.
--output=<file_name> File name for the standard output stream. This should not be used, since a suitable name is automatically compiled from the job name and the job ID.
--error=<file_name> File name for the standard error stream. Per default, stderr is merged with stdout.
--partition=<partition> Specifies the partition/queue to which the job is submitted. If no partition is given, "work" is used. Partition "devel" has to be requested explicitly if the job qualifies; jobs in this queue run with higher priority.
--constraint=hwperf Access to hardware performance counters (e.g. using likwid-perfctr). Only request that feature if you really want to access the hardware performance counters!
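Combining some of these options on the command line, a submission might look like this (job name, mail address, and script name are placeholders):

```shell
# 4 nodes, 20 MPI tasks per node, 6 hours, mail when the job ends
sbatch --job-name=myjob --nodes=4 --ntasks-per-node=20 \
       --time=06:00:00 --mail-user=x@y --mail-type=END job.sh
```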

There are several Slurm commands for job inspection and control. The following table gives a short summary:

Command Purpose Options
Useful Slurm user commands
squeue [<options>] Displays information on jobs. Only the user’s own jobs are displayed. -t running display currently running jobs
-j <JobID> display info on job <JobID>
scancel <JobID> Removes job from queue or terminates it if it’s already running.
scontrol show job <JobID> Displays very detailed information on jobs.

The scheduler typically sets environment variables to tell the job about what resources were allocated to it. These can also be used in batch scripts. The most useful are given below:

Useful environment variables for Slurm
Job ID $SLURM_JOB_ID
Directory from which the job was submitted $SLURM_SUBMIT_DIR
List of nodes on which job runs $SLURM_JOB_NODELIST
Number of nodes allocated to job $SLURM_JOB_NUM_NODES

By default, Slurm jobs automatically start in the directory from which the job was submitted.
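As a sketch, a job script can combine these variables with Slurm's task-count variables to compute, e.g., the total number of MPI tasks. In a real job, Slurm sets these values; they are assigned by hand here only so the snippet is self-contained:

```shell
#!/bin/bash
# Set by Slurm in a real job; assigned here only for illustration.
SLURM_JOB_NUM_NODES=4
SLURM_NTASKS_PER_NODE=20

NTASKS=$(( SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE ))
echo "job runs $NTASKS tasks on $SLURM_JOB_NUM_NODES nodes"
```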

One difference to Torque is that environment variables set at the time of submission are propagated into the Slurm job, including the currently loaded module files. To get a clean environment in job scripts, it is recommended to add #SBATCH --export=NONE and unset SLURM_EXPORT_ENV to the job script; otherwise, the job will inherit some settings from the submitting shell.

 

Batch scripts for Slurm

Example of a batch script (meggie cluster), MPI parallel job
#!/bin/bash -l
#
# allocate 4 nodes with 20 cores per node = 4*20 = 80 MPI tasks
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name 
#SBATCH --job-name=Sparsejob_33
# do not export environment variables
#SBATCH --export=NONE
#
# first non-empty non-comment line ends SBATCH options

# do not export environment variables
unset SLURM_EXPORT_ENV
# jobs always start in submit directory

#load required modules (compiler, MPI, ...)
module load example1
# uncomment the following lines to use $FASTTMP 
# mkdir ${FASTTMP}/$SLURM_JOB_ID 
# cd ${FASTTMP}/$SLURM_JOB_ID 
# copy input file from location where job was submitted 
# cp ${SLURM_SUBMIT_DIR}/inputfile . 

# run 
srun a.out

Example of a batch script (meggie cluster), shared memory parallel job (OpenMP)
#!/bin/bash -l
#
# allocate 1 node with 20 physical cores, without hyperthreading
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#
# allocate nodes for 6 hours
#SBATCH --time=06:00:00
# job name 
#SBATCH --job-name=Sparsejob_33
# do not export environment variables
#SBATCH --export=NONE
#
# first non-empty non-comment line ends SBATCH options

# do not export environment variables
unset SLURM_EXPORT_ENV
# jobs always start in submit directory

#load required modules (compiler,...)
module load intel64

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# run 
./a.out

Interactive Jobs with Slurm

To run an interactive job with Slurm:

srun [Usual srun arguments for number of nodes, walltime, etc.] --pty /bin/bash -l

This will queue a job and, as soon as it starts, give you a shell on the first allocated node. The parameters for srun are the same as for sbatch, as stated above.
There is currently no way to request X11 forwarding to an interactive SLURM job.
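For example, a one-hour interactive session on a single meggie node could be requested like this:

```shell
# 1 node, all 20 tasks, 1 hour, interactive shell on the allocated node
srun --nodes=1 --ntasks-per-node=20 --time=01:00:00 --pty /bin/bash -l
```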

 

Advanced topics

Staging Out Results

Warning! This does not work with the current version of the batch system due to a software bug!

When a job reaches its walltime limit, it is killed by the batch system. The job’s node-local data will either be deleted (if you use $TMPDIR) or become inaccessible, because login to a node is disallowed if you do not have a job running there. To prevent data loss, Torque waits 60 seconds after the TERM signal before sending the final KILL. If the batch script catches TERM with a signal handler, those 60 seconds can be used to copy node-local data to a global file system:

Example: How to use a shell signal handler to stage out data
#!/bin/bash

# signal handler: catch SIGTERM, save scratch data
trap "sleep 5 ; cd $TMPDIR ; tar cf - * | tar xf - -C ${WOODYHOME}/$PBS_JOBID ; exit" 15

# make job data save directory
mkdir ${WOODYHOME}/$PBS_JOBID

cd $PBS_O_WORKDIR

# assuming a.out stores temp data in $TMPDIR
mpirun ./a.out

The sleep command at the start of the signal handler gives your application some time to shut down before the data is saved. Please note that it is required to use a Bourne or Korn shell variant for catching the TERM signal since csh has only limited facilities for signal handling.
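The trap mechanism itself can be tried out without the batch system. The following self-contained sketch starts a background shell whose TERM handler writes a marker file before exiting, standing in for the tar copy in the job script above:

```shell
#!/bin/bash
# background shell that catches SIGTERM; the handler creates a marker
# file (stand-in for saving scratch data) and exits.
# Note: $MARKER is expanded by the outer shell when building the string.
MARKER=$(mktemp -u)
bash -c "trap 'touch $MARKER; exit 0' TERM; sleep 30 & wait" &
PID=$!

sleep 1              # give the shell time to install its handler
kill -TERM "$PID"    # this is what the batch system does at the limit
wait "$PID"

test -f "$MARKER" && echo "TERM handler ran"
rm -f "$MARKER"
```

The inner shell runs sleep in the background and uses wait, so the trap fires immediately when TERM arrives instead of after the sleep finishes.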

Chain Jobs

For some calculations, it is beneficial to automatically submit a follow-up job after the current run has finished. This can be achieved by including the submit command in your job script. Keep in mind, however, that the job will always resubmit itself, even if something went wrong during the run, e.g. missing input files. This can lead to jobs running wild until they are manually aborted. To prevent this, the job should only resubmit itself if it has run for a sufficiently long time. The following approach can be used:

Example: resubmitting job script for chain jobs
#!/bin/bash

if [ "$SECONDS" -gt "7000" ]; then
    cd ${PBS_O_WORKDIR}
    qsub job_script
fi

The bash environment variable $SECONDS contains the run time of the shell in seconds. Please note that it is not defined for csh.
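The $SECONDS guard can be tried in isolation; in this sketch the threshold is lowered from 7000 to 2 seconds so it finishes quickly:

```shell
#!/bin/bash
# $SECONDS holds the shell's elapsed runtime in seconds
sleep 3

if [ "$SECONDS" -gt 2 ]; then
    echo "runtime long enough - the real script would resubmit here"
fi
```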

On the TinyX clusters, the generic qsub command also has to be used within the submit scripts, instead of the machine-specific qsub.tinyx.

Job Priorities and Reservations

The scheduler of the batch system assigns a priority to each waiting job. This priority value depends on certain parameters, like waiting time, queue, user group, and recently used CPU time (a.k.a. fairshare). The ordering of waiting jobs listed by qstat does not reflect their priority. All waiting jobs, with their assigned priority, are listed anonymously on the HPC user web pages (some of those pages are password protected; execute the docpw command to get the username and password). There you also get a list of all running jobs, any node reservations, and all jobs which cannot be scheduled for some reason. Some of this information is also available in text form: the file /home/woody/STATUS/joblist contains a list of all waiting jobs; the file /home/woody/STATUS/nodelist contains information about node and queue activities.

Job Monitoring

On meggie and emmy, it is possible to access performance data of your finished jobs, including e.g. memory used, floating-point rate, and usage of the (parallel) file system. To review this information, you need a job-specific AccessKey, which can be found in the output file.