Environment on the RRZE HPC systems

We aim to provide an environment across the RRZE production cluster systems that is as homogeneous as possible. This page describes this environment.

This page covers the following topics:

  • modules system
  • Available software on the HPC systems
  • OpenMP Pinning
  • File systems
  • shells
  • Batch processing

modules system

On all RRZE HPC systems, established tools for software development (compilers, editors, …), libraries, and selected applications are available. For many of these applications, it is necessary to set special environment variables so that e.g. search paths are correct or license servers can be found.

To ease selection of and switching between different versions of software packages, all HPC systems at RRZE use the modules system (cf. modules.sourceforge.net). It allows you to conveniently load the necessary configuration for different programs or different versions of the same program and, if necessary, unload it again later.

Important module commands

Overview of the most important module commands
Command What it does
module avail lists available modules
module whatis lists all available modules together with a short description (rather verbose output)
module list shows which modules are currently loaded
module load <pkg> loads the module pkg, i.e. it makes all the settings that are necessary for using the package pkg (e.g. search paths).
module load <pkg>/version loads a specific version of the module pkg instead of the default version.
module unload <pkg> removes the module pkg, i.e. it undoes what the load command did.
module help <pkg> shows a detailed description for module pkg.
module show <pkg> shows what environment variables module pkg actually sets/modifies.

General hints for using modules

  • module commands only affect the current shell.
  • If you want individual modules to be loaded all the time, you can put the load command into your login scripts, e.g. into $HOME/.bash_profile (see the example session after this list).
  • The syntax of the module commands is independent of the shell used. They can thus usually be used unmodified in any type of PBS job script.
  • Some modules cannot be loaded together. In some cases such a conflict is detected automatically during the load command, in which case an error message is printed and no modifications are made.
  • Modules can depend on other modules, so that these are loaded automatically when you load the module. It is also possible to define default versions for modules. As an example, the current Intel compiler modules will depend on IntelMPI and the Intel MKL and load these automatically. If you load just the module intel64, you will get the current default intel compiler version for that cluster. If you want to ensure a specific version, append /versionnumber, e.g. intel64/47.11.
  • A current list of all available modules can be retrieved with the command module avail.
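
Putting this together, a typical interactive session might look like the following sketch (the version number is purely illustrative; use module avail to see what is actually installed on your cluster):

module avail
module load intel64/19.0
module list
module show intel64
module unload intel64

To have a module available at every login, the same load command can simply be placed in $HOME/.bash_profile, as mentioned above.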

Important standard modules

Important standard modules, available on most or all clusters
intel64 This is probably the most used module by far: It loads the current recommended version of the Intel compilers for the current cluster. Note that this will not always be the same version across clusters. This module depends on and automatically loads Intel MPI and MKL on most clusters. If you want to use a different MPI variant, do NOT load this module, but load the module for the MPI variant instead.
openmpi Loads some version of OpenMPI and the matching compiler. Note that OpenMPI is not the MPI-variant recommended by RRZE, but we provide it because some users had better experience with it than the default IntelMPI.
gcc Some version of the GNU compiler collection. Please note that all systems naturally have a default gcc version that is delivered together with the operating system and that is always available without loading any module. However, that version is often a bit dated, so we provide a gcc-module with a somewhat newer version on some clusters.

Some tips that can make working with the modules easier, especially in Makefiles:

  • The MPI modules set the environment variables MPICHROOTDIR and MPIHOME to the base directory of the matching MPICH version. Access to the include files and libraries can therefore be written uniformly in Makefiles as $MPIHOME/include and $MPIHOME/lib.
  • Analogously, the Intel compiler modules set the environment variables INTEL_C_HOME and INTEL_F_HOME to the respective base directory. This is particularly helpful when you want to link Fortran and C++ objects together and have to specify the matching runtime libraries manually (see the sketch after this list).
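
As a sketch only (the -L paths and library names below are placeholders; the actual subdirectories and runtime libraries differ between compiler versions and clusters), these variables can be used on the command line or in a Makefile rule like this:

# compile against the include files and libraries of the currently loaded MPI module
icc -I${MPIHOME}/include -c mysource.c
icc -o mybinary mysource.o -L${MPIHOME}/lib -lmpi
# link Fortran objects into a C++ main program using the Intel Fortran runtime library
icpc -o mixedbinary main.o fsub.o -L${INTEL_F_HOME}/lib/intel64 -lifcore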

Available software on the HPC systems

We provide compilers and some standard-libraries on the clusters.
If you need additional libraries or software, we will only install these globally if there is demand from more than a handful of users. If you are the only group using a software, just install it into your home directory.

The only commercial software we provide on all clusters are the Intel compilers and related tools.
For any other commercial software, HPC@RRZE will NOT provide any licenses. If you want to use any commercial software, you will need to bring the license with you. This is also true for software sub-licensed from the RRZE software group. All calculations you do on the clusters will draw licenses out of your license pool. Please try to clarify any licensing questions before contacting us, as we really do not plan to become experts in software licensing.

We know of the following commercial software that has been run successfully by some users on our clusters.
Please note that we usually do not have experience with these software packages. We will try our best to help you run the software on our clusters, but we expect that you know how to use the software in principle.

Commercial software on the different HPC machines
software remarks
STAR-CCM+
Ansys CFX
gaussian
amber
gromacs
abaqus
maple
mathematica The required licenses are so prohibitively expensive that practically everybody who tried to use it on the clusters gave up on the idea very quickly.
Turbomole

OpenMP Pinning

Introduction

To reach optimum performance with OpenMP codes, correct pinning of the OpenMP threads is essential. As nowadays practically all machines are ccNUMA, where incorrect or no pinning can have devastating effects, this is something that should not be ignored.
We offer a convenient way to do that on the RRZE systems (including our testcluster): We have implemented a small library that replaces the calls made for thread creation from an OpenMP program with variants that do pinning at runtime.

Usage

To simplify usage, there is the wrapper script /apps/rrze/bin/pin_omp that can be used as follows in the simplest case:
/apps/rrze/bin/pin_omp -c 0-7 ./mybinary
This will run ./mybinary and pin the threads to cores 0-7.
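
Typically you will also want to set the number of OpenMP threads explicitly so that it matches the core list (bash syntax shown; under csh/tcsh use setenv instead):

export OMP_NUM_THREADS=8
/apps/rrze/bin/pin_omp -c 0-7 ./mybinary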

Possible problems

Unfortunately, it isn’t always that easy for different reasons:

  • Statically linked binaries cannot be intercepted by our library; it only works for dynamically linked binaries.
  • The library assumes a thread layout as generated by the Intel compilers. Binaries created with other compilers may require different parameters and could even become slower when run under our library.
  • The Intel compilers starting with version 10.0 attempt to do some pinning themselves, which naturally collides with the pinning attempts of our library. This usually leads to all threads being executed on one single CPU core, resulting in horrible performance. However, they only do this on genuine Intel CPUs. If you run the same binary e.g. on an Opteron CPU, the compiler runtime does not attempt any pinning, and the pinning done by our library works as expected. Starting with compiler version 10.1.18 or 11.0, the compiler’s own pinning can be disabled by setting the environment variable KMP_AFFINITY: setenv KMP_AFFINITY disabled
  • The order of the cores given with the -c parameter is not obeyed: -c 0-7 has exactly the same effect as -c 1,3,5,7,0,2,4,6. A different order can however be enforced through the use of an environment variable – see the section for advanced users below.

The following table summarizes the recommended settings:

Recommended combination of compilers and pin_omp wrapper
Compiler on Intel CPUs on non-Intel CPUs (e.g. Opteron)
Intel 9.1 use pin_omp use pin_omp
Intel 10.0 to 10.1.17 DON’T use pin_omp use pin_omp
Intel 10.1.18 and newer use pin_omp and set the variable KMP_AFFINITY to disabled use pin_omp

Further possibilities for advanced users

Advanced users can influence the behavior of the library with environment variables.

Variables for advanced users and their effect
Variable Effect
PINOMP_MASK works like the dplace parameter -x, i.e. it expects a number that is interpreted as a bitmask. The threads for which the corresponding bit is set will not be pinned.
PINOMP_SKIP works like the dplace parameter -s: the thread with this number will not be pinned. Multiple numbers can be given, separated by commas.
PINOMP_CPUS Explicitly specifies the CPU core numbers to use and their sequence, where the core numbers are separated by commas. This overrides the -c command line parameter.
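
For example, to enforce a specific core order via PINOMP_CPUS (the core numbering below is purely illustrative and depends on the machine):

export OMP_NUM_THREADS=8
export PINOMP_CPUS=0,2,4,6,1,3,5,7
/apps/rrze/bin/pin_omp -c 0-7 ./mybinary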

File systems

Overview

A number of file systems is available at RRZE. There is one simple rule to keep in mind: everything that starts with /home/ is available throughout RRZE, which naturally includes all HPC systems. Therefore, e.g. /home/woody is accessible from all clusters, even though it was originally bought together with, and mainly for use by, the Woody cluster.

File system overview
Mount point Access via Purpose Size Backup Data lifetime Quota Remarks
/home/hpc $HOME Storage of source, input and important results 5 TB Yes Account lifetime Yes (restrictive)
/home/vault $HPCVAULT Mid- to long-term storage 60 TB online, a lot more offline (on tape) Yes Account lifetime Yes Hierarchical storage system; files that have not been touched for a long time are automatically moved to tape
/home/woody $WOODYHOME storage for small files (used to be cluster local storage for woody cluster) 88 TB Limited Account lifetime Yes There is limited backup, meaning that backup on this filesystem does not run daily and data is only kept in backup for a rather short time.
/lxfs $FASTTMP High performance parallel I/O; short-term storage 115 TB NO High watermark deletion No; but number of files/directories limited only available on the LiMa cluster
/elxfs $FASTTMP High performance parallel I/O; short-term storage 430 TB NO High watermark deletion No; but number of files/directories limited only available on the Emmy cluster
/lxfs $FASTTMP High performance parallel I/O; short-term storage 850 TB NO High watermark deletion No; but number of files/directories limited only available on the Meggie cluster
/home/cluster64, /home/cluster32, /home/altix, /wsfs cluster local storage 0 B NO Account lifetime Yes Turned off some time ago

NFS file system $HOME

When logging in to any system, you will start in your regular RRZE $HOME directory, which is usually located under /home/hpc/.... There are relatively tight quotas there, so it will most probably be too small for the inputs/outputs of your jobs. It however does offer a lot of nice features, like fine grained snapshots, so use it for „important“ stuff, e.g. your jobscripts, or the source code of the program you’re working on. See the HPC storage page for a more detailed description of the features.

Parallel file systems $FASTTMP

The LiMa, Emmy, and Meggie clusters each have a parallel file system for high-performance short-term storage. Please note that these are entirely separate systems, i.e. you cannot see the files on LiMa’s $FASTTMP in the $FASTTMP on Emmy. They are not available on systems outside of the respective cluster.

The parallel file systems use a high watermark deletion algorithm: When the filling of the file system exceeds a certain limit (e.g. 70%), files will be deleted starting with the oldest and largest files until a filling of less than 60% is reached. Be aware that the normal tar -x command preserves the modification time of the original file instead of the time when the archive is unpacked. So unpacked files may become one of the first candidates for deletion. Use tar -mx or touch in combination with find to work around this. Be aware that the exact time of deletion is unpredictable.
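
For example (sketch; archive and directory names are placeholders):

# unpack so that the files get the current time stamp instead of the archived one
tar -mxf archive.tar
# or refresh the time stamps of an already unpacked directory tree
find ./mydata -type f -exec touch {} +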

Note that parallel file systems are generally not made for handling large numbers of small files. This is by design: parallel file systems achieve their speed by writing to multiple different servers at the same time. However, they do that in blocks, in our case 1 MB. That means that for a file smaller than 1 MB, only one server will ever be used, so the parallel file system can never be faster than a traditional NFS server – on the contrary: due to the larger overhead, it will generally be slower. Parallel file systems can only show their strengths with files that are at least a few megabytes in size, and they excel when very large files are written by many nodes simultaneously (e.g. checkpointing).

shells

In general, two types of shells are available on the HPC systems at RRZE:

  • csh, the C-shell, usually in the form of the feature enhanced tcsh instead of the classic csh.
  • bash

csh used to be the default login shell for all users, not because it is a good shell (it certainly isn’t!), but simply for „historical reasons“. Since ca. 2014 the default shell for new users has been bash, which most people who have used any Linux system will be familiar with. The newer clusters (starting with Emmy) always enforce bash as the login shell, even for old accounts. If you have one of those old accounts still using csh and want to change to bash on the older clusters too, you can contact the ServiceTheke or the HPC team to have your login shell changed.

Batch processing

Introduction

All of the HPC clusters (with the exception of a few special machines) run under the control of a batch system. All user jobs except short serial test runs must be submitted to the cluster through this batch system. The submitted jobs are then routed into a number of queues (depending on the needed resources, e.g. runtime) and sorted according to some priority scheme.

A job will run when the required resources become available. On most clusters, a number of nodes is reserved during working hours for short test runs with less than one hour of runtime. These nodes are dedicated to the devel queue. Do not use the devel queue for production runs. Since we do not allow MPI-parallel applications on the frontends, short parallel test runs must be performed using batch jobs.

It is also possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.

The older clusters use a software called Torque as the batch system, newer clusters starting with „meggie“ instead use Slurm. Sadly, there are many differences between those two systems. We will describe both below.

Commands for Torque

The command to submit jobs is called qsub. To submit a batch job use

qsub <further options> [<job script>]

The job script may be omitted for interactive jobs (see below). After submission, qsub will output the Job ID of your job. It can later be used for identification purposes and is also available as the environment variable $PBS_JOBID in job scripts (see below). These are the most important options for the qsub command:

Important options for qsub and their meaning
Option Meaning
-N <job name> Specifies the name which is shown with qstat. If the option is omitted, the name of the batch script file is used.
-l nodes=<# of nodes>:ppn=<nn> Specifies the number of nodes requested. All current clusters require you to always request full nodes. (The old Cluster64 allowed allocating single CPUs.) Thus, for LiMa you always need to specify :ppn=24, for TinyBlue :ppn=16, and for Woody :ppn=4. For other clusters, see the documentation of the respective cluster for the correct ppn value.
-l walltime=HH:MM:SS Specifies the required wall clock time (runtime). When the job reaches the walltime given here it will be sent a TERM signal. After a few seconds, if the job has not ended yet, it will be sent KILL. If you omit the walltime option, a – very short – default time will be used. Please specify a reasonable runtime, since the scheduler bases its decisions also on this value (short jobs are preferred).
-M x@y -m abe You will get e-mail to x@y when the job is aborted (a), starting (b), and ending (e). You can choose any subset of abe for the -m option. If you omit the -M option, the default mail address assigned to your RRZE account will be used.
-o <standard output file> File name for the standard output stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID.
-e <error output file> File name for the standard error stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID.
-I Interactive job. It is still allowed to specify a job script, but it will be ignored, except for the PBS options it might contain. No code will be executed. Instead, the user will get an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel program with mpirun.
-X Enable X11 forwarding. If the $DISPLAY environment variable is set when submitting the job, an X program running on the compute node(s) will be displayed at the user’s screen. This makes sense only for interactive jobs (see -I option).
-W depend=<dependency list> Makes the job depend on certain conditions. E.g., with -W depend=afterok:12345 the job will only run after job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the qsub man page for more information.
-q <queue> Specifies the Torque queue (see above); default queue is route. Usually it is not required to use this parameter as the route queue automatically forwards the job to an appropriate execution queue.
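
As an illustration, a complete submission could look like this (the ppn value is cluster specific, see above, and the mail address is a placeholder):

qsub -N myjob -l nodes=4:ppn=4,walltime=06:00:00 -M x@y -m abe myjob.sh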

There are several Torque commands for job inspection and control. The following table gives a short summary:

Useful Torque user commands
Command Purpose Options
qstat [<options>] [<JobID>|<queue>] Displays information on jobs. Only the user’s own jobs are displayed. For information on the overall queue status see the section on job priorities. -a display „all“ jobs in user-friendly format
-f extended job info
-r display only running jobs
qdel <JobID> ... Removes job from queue
qalter <qsub-options> Changes job parameters previously set by qsub. Only certain parameters may be changed after the job has started. see qsub and the qalter manual page
qcat [<options>]  <JobID> Displays stdout/stderr from a running job -o display stdout (default)
-e display stderr
-f output appended data as the job is running (like tail -f)
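
For example (12345 is a placeholder job ID):

qstat -a        # user-friendly overview of your jobs
qcat -f 12345   # follow the stdout of a running job
qdel 12345      # remove the job from the queue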

Batch scripts for Torque

To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like estimated runtime and required number of nodes/CPUs can also be specified there (instead of on the command line):

Example of a batch script (Woody cluster)
#!/bin/bash -l
#
# allocate 16 nodes (64 CPUs) for 6 hours
#PBS -l nodes=16:ppn=4,walltime=06:00:00
#
# job name 
#PBS -N Sparsejob_33
#
# stdout and stderr files
#PBS -o job33.out -e job33.err
#
# first non-empty non-comment line ends PBS options

# jobs always start in $HOME -
# change to a temporary job directory on $FASTTMP
mkdir ${FASTTMP}/$PBS_JOBID
cd ${FASTTMP}/$PBS_JOBID
# copy input file from location where job was submitted
cp ${PBS_O_WORKDIR}/inputfile .

# run
mpirun ${HOME}/bin/a.out -i inputfile -o outputfile

# save output on parallel file system
mkdir -p ${FASTTMP}/output/$PBS_JOBID
cp outputfile ${FASTTMP}/output/$PBS_JOBID
cd 
# get rid of the temporary job dir
rm -rf ${FASTTMP}/$PBS_JOBID

The comment lines starting with #PBS are ignored by the shell but interpreted by Torque as options for job submission (see above for an options summary). These options can all be given on the qsub command line as well. The example also shows the use of the $FASTTMP and $HOME variables. $PBS_O_WORKDIR contains the directory where the job was submitted. All batch scripts start executing in the user’s $HOME so some sort of directory change is always in order.

If you have to load modules from inside a batch script, you can do so. The only requirement is that you have to use either a csh-based shell or bash with the -l switch, like in the example above.
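
A minimal sketch (the ppn value and the module name are examples and depend on the cluster):

#!/bin/bash -l
#PBS -l nodes=1:ppn=4,walltime=01:00:00
module load intel64
mpirun ./a.out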

Interactive Jobs with Torque

For testing purposes or when running applications that require some manual intervention (like GUIs), Torque offers interactive access to the compute nodes that have been assigned to a job. To do this, specify the -I option to the qsub command and omit the batch script. When the job is scheduled, you will get a shell on the master node (the first in the assigned job node list). It is possible to use any command, including mpirun, there. If you need X forwarding, use the -X option in addition to -I.

Note that the starting time of an interactive batch job cannot reliably be determined; you have to wait for it to get scheduled. Thus we recommend to always run such jobs with wallclock time limits less than one hour, so that the job will be routed to the devel queue for which a number of nodes is reserved during working hours.

Interactive batch jobs do not produce stdout and stderr files. If you want a record of what happened, use e.g. the UNIX script command.
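
For example, the following requests one node for an interactive session of just under one hour, with X11 forwarding (the ppn value is again cluster specific):

qsub -I -X -l nodes=1:ppn=4,walltime=00:55:00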

Commands for Slurm

Useful Slurm user commands
Command Purpose Options
squeue Displays information on jobs. Only the user’s own jobs are displayed. -t running display currently running jobs
-j <JobID> display info on job <JobID>
scancel <JobID[.StepID]> Removes job from queue or terminates it if it’s already running. StepID is optional, without it the whole job and not just one individual step will be cancelled.
sbatch <options> <jobscript> Submits jobs to the queue. See the sbatch section below.

Batch scripts for Slurm
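
Slurm job scripts work much like the Torque example above, but use #SBATCH directives instead of #PBS. The following is only a minimal sketch; the correct number of tasks per node and any partition or other options depend on the respective cluster, so please check the cluster documentation:

#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#SBATCH --time=06:00:00
#SBATCH --job-name=Sparsejob_33
#SBATCH --output=job33.out
#SBATCH --error=job33.err

# load the required modules and run the program
module load intel64
srun ./a.out -i inputfile -o outputfile

The script is submitted with sbatch <jobscript>. After submission, sbatch prints the job ID, which is also available inside the job as the environment variable $SLURM_JOB_ID.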

Interactive Jobs with Slurm

To run an interactive job with Slurm:

srun [Usual srun arguments] --pty bash

This will queue a job and give you a shell on the first node allocated as soon as the job starts.
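
For example (sketch; additional resource options such as a partition may be required depending on the cluster):

srun --nodes=1 --time=01:00:00 --pty bash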