Woody Compute-Cluster

The RRZE’s „Woody“ is the preferred cluster for serial/single-node throughput jobs.

The cluster has changed significantly over time; details can be found in the history section below. The current hardware configuration looks like this:

  • 40 compute nodes (w10xx nodes) with Xeon E3-1280 CPUs („SandyBridge“, 4 cores, HT disabled, 3.5 GHz base frequency), 8 GB RAM, 500 GB HDD – from 12/2011
  • 72 compute nodes (w11xx nodes) with Xeon E3-1240 v3 CPUs („Haswell“, 4 cores, HT disabled, 3.4 GHz base frequency), 8 GB RAM, 1 TB HDD – from 09/2013
  • 8 compute nodes (w12xx nodes) with Xeon E3-1240 v5 CPUs („Skylake“, 4 cores, HT disabled, 3.5 GHz base frequency), 16 GB RAM, 1 TB HDD – from 04/2016
  • 56 compute nodes (w13xx nodes) with Xeon E3-1240 v5 CPUs („Skylake“, 4 cores, HT disabled, 3.5 GHz base frequency), 32 GB RAM, 1 TB HDD – from 01/2017


front of a rack, with servers in it and lots of cables
The w11xx nodes in Woody

This website shows information regarding the following topics:

Access, User Environment, and File Systems

Access to the machine

Access to the system is granted through the frontend nodes via ssh. Please connect to


and you will be randomly routed to one of the frontends. All systems in the cluster, including the frontends, have private IP addresses; thus they can only be accessed directly from within the FAU networks. If you need access from outside of FAU, first connect to the dialog server cshpc.rrze.uni-erlangen.de and then ssh to Woody from there. While it is possible to ssh directly to a compute node, this is only allowed while you have a batch job running on that node. When all your batch jobs on a node have ended, all of your shells there will be killed automatically.

The login and compute nodes run a 64-bit Ubuntu LTS-version. As on most other RRZE HPC systems, a modules environment is provided to facilitate access to software packages. Type „module avail“ to get a list of available packages.

File Systems

The following table summarizes the available file systems and their features. Also check the main file system table in the HPC environment description.

File system overview for the Woody cluster
Mount point Access via Purpose Technology, size Backup Data lifetime Quota
/home/hpc $HOME Storage of source, input and important results central servers, 5 TB YES + Snapshots Account lifetime YES (very restrictive)
/home/vault Mid- to longterm storage central servers, HSM YES + Snapshots Account lifetime YES
/home/woody $WOODYHOME Storage for small files NFS, 88 TB limited Account lifetime YES
/tmp $TMPDIR Temporary job data directory Node-local, between 400 and 900 GB NO Job runtime NO

Node-local storage $TMPDIR

Each node has at least 400 GB of local hard drive capacity for temporary files available under /tmp (also accessible via /scratch/). All files in these directories will be deleted at the end of a job without any notification.

If possible, compute jobs should use the local disk for scratch space as this reduces the load on the central servers. Important data to be kept can be copied to a cluster-wide volume at the end of the job, even if the job is cancelled by a time limit. See the section on batch processing for details.

In batch scripts the shell variable $TMPDIR points to a node-local, job-exclusive directory whose lifetime is limited to the duration of the batch job. This directory exists on each node of a parallel job separately (it is not shared between the nodes). It will be deleted automatically when the job ends. Please see the section on batch processing for examples on how to use $TMPDIR.

Software Development

You will find a wide variety of software packages in different versions installed on the cluster frontends. The module concept is used to simplify the selection and switching between different software packages and versions. Please see the section on batch processing for a description of how to use modules in batch scripts.



Intel compilers are the recommended choice for software development on Woody. A current version of the Fortran90, C and C++ compilers (called ifort, icc and icpc, respectively) can be selected by loading the intel64 module. For use in scripts and makefiles, the module sets the shell variables $INTEL_F_HOME and $INTEL_C_HOME to the base directories of the compiler packages.

As a starting point, try to use the option combination -O3 -xHost when building objects. All Intel compilers have a -help switch that gives an overview of all available compiler options. For in-depth information please consult the local docs in $INTEL_[F,C]_HOME/doc/ and Intel’s online documentation for their compiler suite (currently named „Intel Parallel Studio XE“).
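As a concrete sketch (the source file names are placeholders, and the exact module name may differ on the system):

```shell
# select the Intel compilers
module load intel64

# C: build with the recommended starting options
icc -O3 -xHost -c solver.c
icc solver.o -o solver

# Fortran: same options; the module variables point to the installation
ifort -O3 -xHost heat.f90 -o heat
ls $INTEL_F_HOME/doc/      # local compiler documentation
```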


All x86-based processors use the little-endian storage format which means that the LSB for multi-byte data has the lowest memory location. The same format is used in unformatted Fortran data files. To simplify the handling of big-endian files (e.g. data you have produced on IBM Power, Sun Ultra, or NEC SX systems) the Intel Fortran compiler has the ability to convert the endianness on the fly in read or write operations. This can be configured separately for different Fortran units. Just set the environment variable F_UFMTENDIAN at run-time.


Effect of the environment variable F_UFMTENDIAN
big everything treated as BE
little everything treated as LE (default)
big:10,20 everything treated as LE, except for units 10 and 20
„big;little:8“ everything treated as BE, except for unit 8
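For example, to read a big-endian restart file on Fortran unit 10 while keeping all other units little-endian, the variable is set at run-time; no recompilation is required (a.out stands for your Fortran binary):

```shell
# unit 10 is converted from big-endian on the fly,
# e.g. for data written on an IBM Power system
export F_UFMTENDIAN="big:10"
./a.out
```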


The GNU compiler collection (GCC) is available directly, without loading a module. As the cluster runs a stable LTS distribution, do not expect to find the latest GCC version here. Be aware that the default Intel MPI module assumes the Intel compiler and does not work with GCC. For details see the section on parallel computing.

MPI Profiling with Intel Trace Collector/Analyzer

Intel Trace Collector/Analyzer are powerful tools that acquire and display information on the communication behaviour of an MPI program. Performance problems related to MPI can be identified by looking at timelines and statistical data. Appropriate filters can reduce the amount of information displayed to a manageable level.

In order to use Trace Collector/Analyzer you have to load the itac module. This section describes only the most basic usage patterns. Complete documentation can be found in ${VT_ROOT}/doc/, on Intel’s ITAC website, or in the Trace Analyzer Help menu.

Trace Collector (ITC)

ITC is a tool for producing tracefiles from a running MPI application. These traces contain information about all MPI calls and messages and, optionally, on functions in the user code. To use ITC in the standard way you only have to re-link your application. If you want to add user function information to the trace, the code must be instrumented manually using the ITC API and recompiled. Please note that we currently support Intel MPI only.

Shell variables for compiling and linking an MPI application with ITC
Variable Use Example Comments
$ITC_LIB Link against ITC libraries mpif90 *.o -o a.out $ITC_LIB Place after object files (but before any MPI library) on linker command line! Trace files are not written if MPI code does not finish correctly.
$ITC_LIBFS Link against „failsafe“ ITC libraries mpif90 *.o -o a.out $ITC_LIBFS Place after object files (but before any MPI library) on linker command line! Use this variant for MPI codes that do not finish correctly. More intrusive than $ITC_LIB.
$ITC_INC Include directory with ITC API headers mpicc $ITC_INC -c hello.c

After an MPI application that has been compiled or linked with ITC has terminated, a collection of trace files is written to the current directory. They follow the naming scheme <binary-name>.stf* and serve as input for the Trace Analyzer tool.
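Putting the table together, a minimal build might look like this (simulation.f90 is a placeholder for your own source):

```shell
module load itac

mpif90 -c simulation.f90
# $ITC_LIB must come after the object files,
# but before any MPI library
mpif90 simulation.o -o simulation $ITC_LIB
```

After a batch run of ./simulation, the trace files simulation.stf* appear in the working directory.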

Trace Analyzer (ITA)

The <binary-name>.stf file produced after running the instrumented MPI application should be used as an argument to the traceanalyzer command:

traceanalyzer <binary-name>.stf

The trace analyzer processes the trace files written by the application and lets you browse through the data. Click on „Charts-Event Timeline“ to see the messages transferred between all MPI processes and the time each process spends in MPI and application code, respectively. Click and drag lets you zoom into the timeline data (zoom out with the „o“ key). „Charts-Message profile“ shows statistics about the communication requirements of each pair of MPI processes. The statistics displays change their content according to the currently displayed data in the timeline window. Please consult the Help menu or the docs in ${VT_ROOT}/doc/ for more information. Additionally, the HPC group of RRZE will be happy to work with you on getting insight into the performance characteristics of your MPI applications.

Parallel Computing

The intended parallelization paradigm on Woody is either message passing using the Message Passing Interface (MPI) or shared-memory programming with OpenMP.


The installed Intel compilers support at least the relevant parts of recent OpenMP standards. The compiler recognizes OpenMP directives if you supply the command line option -openmp or -qopenmp. This is also required for the link step.


Although the cluster is in principle able to support many different MPI versions, we maintain and recommend Intel MPI. Intel MPI supports different compilers (GCC, Intel). If you use the Intel compilers, the appropriate intelmpi module is loaded automatically upon loading the intel64 compiler module. The standard MPI scripts mpif77, mpif90, mpicc and mpicxx are then available. By loading an intelmpi/3.XXX-gnu module instead of the default intelmpi, those scripts will use the GCC.
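As a sketch (mandel.c is a placeholder source file; substitute an actual version for 3.XXX, see „module avail“):

```shell
# Intel compilers: loading intel64 pulls in the matching intelmpi module
module load intel64
mpicc -O3 -xHost -c mandel.c
mpicc mandel.o -o mandel

# GCC instead: load the -gnu flavour of the MPI module
module load intelmpi/3.XXX-gnu
mpicc -O2 -c mandel.c
```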

There are no special prerequisites for running MPI programs. Just use

mpirun [<options>] your-binary your-arguments

By default, one process is started on each allocated CPU (4 per node) in a blockwise fashion, i.e. the first node is filled completely, followed by the second node, etc. If you want to start n<4 processes per node (e.g. because of large memory requirements), you can specify the -npernode n option to mpirun (-pernode is equivalent to -npernode 1). Finally, if you want to start fewer processes than there are allocated CPUs, you can add the -np N option, which will start only N processes.

Examples: We assume that the batch system has allocated 8 nodes (32 processors) for the job.

mpirun a.out

will start 32 processes. With blockwise placement, rank r will run on node r/4 (integer division).

mpirun -npernode 2 a.out

will start 16 processes; with blockwise placement, rank r will run on node r/2 (integer division).

mpirun -pernode -np 4 a.out

will start 4 processes, each on its own node. I.e., 4 of the 8 allocated nodes stay empty. Note that it is currently not possible to start more processes than processors allocated.

We do not support running MPI programs interactively on the frontends. To do interactive testing, please start an interactive batch job on some compute nodes. During working hours, a number of nodes is reserved for short (< 1 hour) tests.

The MPI start mechanism communicates all environment variables that are set in the shell where mpirun is running to all MPI processes. Thus it is not required to change your login scripts in order to export things like OMP_NUM_THREADS, LD_LIBRARY_PATH, etc.


Mathematical Libraries

Intel [Cluster] Math Kernel Library ([C]MKL)

The Math Kernel Library provides threaded BLAS, LAPACK, and FFT routines and some supplementary functions (e.g., random number generators). For distributed-memory parallelization there is also SCALAPACK and CDFT (cluster DFT), together with some sparse solver subroutines. It is highly recommended to use MKL for any kind of linear algebra if possible.

After loading the mkl module, several shell variables are available that help with compiling and linking programs that use MKL:

Environment variables for compiling and linking with MKL
Variable Use Example
$MKL_INC Compiler option(s) for MKL include search path. icc -O3 $MKL_INC -c code.c
$MKL_SHLIB Linker options for dynamic linking of LAPACK, BLAS, FFT ifort *.o -o prog.exe $MKL_SHLIB
$MKL_LIB Linker options for static linking of LAPACK, BLAS, FFT ifort *.o -o prog.exe $MKL_LIB
$MKL_SCALAPACK Linker options for SCALAPACK (includes LAPACK, BLAS, FFT) mpicc *.o -o parsolve.exe $MKL_SCALAPACK
$MKL_CDFT Linker options for Cluster DFT functions (includes BLAS, FFT) mpif90 *.o -o parfft.exe $MKL_CDFT

Many MKL routines are threaded and can run in parallel by setting the OMP_NUM_THREADS shell variable to the desired number of threads. If you do not set OMP_NUM_THREADS, the default number of threads is one. Using OpenMP together with threaded MKL is possible, but the OMP_NUM_THREADS setting will apply to both your code and the MKL routines. If you don’t want this it is possible to force MKL into serial mode by setting the MKL_SERIAL environment variable to YES.
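A short sketch of linking and thread control (dgemm_test.o is a hypothetical object file):

```shell
module load mkl

# dynamic linking against threaded MKL
ifort dgemm_test.o -o dgemm_test $MKL_SHLIB

# run the MKL routines with 4 threads
export OMP_NUM_THREADS=4
./dgemm_test

# alternatively: keep 4 OpenMP threads in your own code,
# but force MKL into serial mode
export MKL_SERIAL=YES
./dgemm_test
```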

For more in-depth information, please refer to Intel’s online documentation on MKL.


FFTW is a high-performance, free library for Fast Fourier Transforms that is used by many software packages. We used to provide a current version of FFTW compatible with the Intel compilers via the fftw[2|3] modules. Nowadays we recommend using the FFTW bindings from the Intel MKL instead.

Environment variables for compiling and linking with FFTW
Variable Use Example
$FFTW_INC Compiler option(s) for FFTW include search path. icc -O3 $FFTW_INC -c code.c
$FFTW_LIB Linker options for (static) linking of FFTW ifort *.o -o prog.exe $FFTW_LIB
$FFTW_BASE Base directory of FFTW installation

The fftw-wisdom and fftw-wisdom-to-conf tools and their manual pages are also provided in the respective search paths.

Batch Processing

All user jobs except short serial test runs must be submitted to the cluster by means of the Torque Resource Manager. Submitted jobs are routed into a number of queues (depending on the required resources, e.g. runtime) and sorted according to a priority scheme. It is normally not necessary to explicitly specify a queue when submitting a job; sorting into the proper queue happens automatically. The queue configuration is as follows:

Queues on the Woody cluster
Queue min – max walltime min – max nodes Availability Comments
route N/A N/A all users Default router queue; sorts jobs into execution queues
devel 0 – 01:00:00 1 – 16 all users Some nodes reserved for queue during working hours
work 01:00:01 – 24:00:00 1 – 32 all users „Workhorse“
onenode 01:00:01 – 48:00:00 1 – 1 all users Only very few jobs from this queue are allowed to run at the same time.
special 0 – infinity 1 – all special users Direct job submit with -q special

If you submit jobs, then by default you can get any type of node: SandyBridge, Haswell or Skylake based w1xxx-nodes. They all have the same number of cores (4) and minimum memory (at least 8 GB) per node, but the speed of the CPUs can be different, which means that job runtimes will vary. You will have to calculate the walltime you request from the batch system so that your jobs can finish even on the slowest nodes.

It is also possible to request certain kinds of nodes from the batch system. Besides the obvious use case of benchmarking, this has two major applications: jobs that use less than a full node are currently only allowed on the SandyBridge nodes, so those need to be requested explicitly; and some applications can benefit from AVX2, which is not available on the SandyBridge-based nodes. Moreover, the Skylake-based nodes have more memory (16 GB or 32 GB). You request a node property by adding it to your -l nodes=... request string, e.g.: qsub -l nodes=1:ppn=4:sb. In general, the following node properties are available:

Available node properties on the Woody cluster
Property Matching nodes (#) Comments
:avx w1xxx (176) Can run on any node that supports AVX, that is all the SandyBridge, Haswell and Skylake nodes.
:sb w10xx (40) Can run on the SandyBridge nodes only. Required for jobs with ppn other than 4.
:hw w11xx (72) Can run on the Haswell nodes only.
:sl w12xx (8) and w13xx (56) Can run on the Skylake nodes (both 16 and 32 GB) only.
:sl16g w12xx (8) Can run on the Skylake nodes with 16 GB RAM only.
:sl32g w13xx (56) Can run on the Skylake nodes with 32 GB RAM only.
:hdd900 w1[1-3]xx (136) Can run on any node with (at least) 900 GB scratch on HDD.

A job will run when the required resources become available. For short test runs with less than one hour of runtime, a number of nodes is reserved during working hours. These nodes are dedicated to the devel queue. Do not use the devel queue for production runs. Since we do not allow MPI-parallel applications on the frontends, short parallel test runs must be performed using batch jobs.

It is also possible to submit interactive jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.

The command to submit jobs is called qsub. To submit a batch job use

qsub <further options> [<job script>]

The job script may be omitted for interactive jobs (see below). After submission, qsub will output the Job ID of your job. It can later be used for identification purposes and is also available as $PBS_JOBID in job scripts (see below). These are the most important options for the qsub command:

Important options for qsub and their meaning
Option Meaning
-N <job name> Specifies the name which is shown with qstat. If the option is omitted, the name of the batch script file is used.
-o <standard output file> File name for the standard output stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID.
-e <error output file> File name for the standard error stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID.
-l nodes=1:ppn=4 Single-node job which can run on any node type. Performance may vary depending on the segment.
-l nodes=1:ppn=<1|2|3|4>:sb Job which runs in the w10xx segment only; ppn values below 4 are allowed only there. If ppn is less than 4, the remaining CPU(s) are considered available by Torque and may be assigned to other jobs. Make sure you only use the fraction of main memory corresponding to the ppn value.
-l walltime=HH:MM:SS Specifies the required wall clock time (runtime). When the job reaches this walltime it is sent a TERM signal. If the job has not ended 60 seconds later, it is sent KILL. See the section on staging out below for hints on how to use this delay for saving important data.
If you omit the walltime option, a short default time is used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred).
-M x@y -m abe You will get e-mail to x@y when the job is aborted (a), starting (b), and ending (e). You can choose any subset of abe for the -m option.
-W depend=<dependency list> Makes the job depend on certain conditions. E.g., with -W depend=afterok:12345 the job will only run after job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the qsub man page for more information.
-r [y|n] Specifies if the job is rerunnable (y, default) or not (n). Under some (error) conditions, Torque will decide to re-queue jobs that had already been running before the error occurred. If a job is not suited for this kind of action, use -r n.
-I Interactive job. It is still allowed to specify a job script, but it will be ignored except for the PBS options. No code will be executed. Instead, the user will get an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel program with mpirun.
-X Enable X11 forwarding. If the $DISPLAY environment variable is set when submitting the job, an X program running on the compute node(s) will be displayed at the user’s screen. This makes sense only for interactive jobs (see -I option).
-q <queue> Specifies the Torque queue (see above); default queue is route. Usually it is not required to use this parameter as the route queue automatically forwards the job to an appropriate execution queue.
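For instance, a hypothetical two-node, ten-hour job on Haswell nodes with mail notification could be submitted like this (job name, mail address and script name are placeholders):

```shell
qsub -N md_run -l nodes=2:ppn=4:hw -l walltime=10:00:00 -M x@y -m ae job.sh
```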

Regular jobs are always required to request all CPUs in a node (ppn=4). Using less than 4 CPUs per node is only supported in the SandyBridge segment.

There are several Torque commands for job inspection and control. The following table gives a short summary:

Useful Torque user commands
Command Purpose Options
qstat [<options>] [<JobID>|<queue>] Displays information on jobs. Only the user’s own jobs are displayed. For information on the overall queue status see the section on job priorities. -a display „all“ jobs in user-friendly format
-f extended job info
-r display only running jobs
qdel <JobID> ... Removes job from queue
qalter <qsub-options> Changes job parameters previously set by qsub. Only certain parameters may be changed after the job has started. see qsub and the qalter manual page
qcat [<options>]  <JobID> Displays stdout/stderr from a running job -o display stdout (default)
-e display stderr
-f output appended data as the job is running (like tail -f)

Batch Scripts

To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like estimated runtime and required number of nodes/CPUs can also be specified there:

Example of a batch script
#!/bin/bash -l
# allocate 1 node (4 CPUs) for 6 hours
#PBS -l nodes=1:ppn=4,walltime=06:00:00
# job name 
#PBS -N Sparsejob_33
# first non-empty non-comment line ends PBS options

# jobs always start in $HOME; thus, change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# run
mpirun ${WOODYHOME}/bin/a.out -i inputfile -o outputfile

The comment lines starting with #PBS are ignored by the shell but interpreted by Torque as options for job submission (see above for an options summary). These options can all be given on the qsub command line as well. The example also shows the use of the $PBS_O_WORKDIR and $WOODYHOME variables. $PBS_O_WORKDIR contains the directory where the job was submitted. All batch scripts start executing in the user’s $HOME, so some sort of directory change is always in order.

If you have to load modules from inside a batch script, you can do so. The only requirement is that you have to use either a csh-based shell or bash with the -l switch, like in the example above.
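A minimal sketch of such a script (module selection and binary name are placeholders):

```shell
#!/bin/bash -l
#PBS -l nodes=1:ppn=4,walltime=02:00:00
#PBS -N mkl_test

# "module" works here because of the "-l" in the shebang line above
module load intel64 mkl

cd $PBS_O_WORKDIR
mpirun ./a.out
```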

Interactive Jobs

The resources of the Woody cluster are mainly available in batch mode. However, for testing purposes or when running applications that require some manual intervention (like GUIs), Torque offers interactive access to the compute nodes that have been assigned to a job. To do this, specify the -I option to the qsub command and omit the batch script. When the job is scheduled, you will get a shell on the master node (the first in the assigned job node list). It is possible to use any command, including mpirun, there. If you need X forwarding, use the -X option in addition to -I.

Note that the starting time of an interactive batch job cannot reliably be determined; you have to wait for it to get scheduled. Thus we recommend to always run such jobs with wallclock time limits less than one hour so the job will be routed to the devel queue for which a number of nodes is reserved during working hours.
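For example, a short interactive session with X11 forwarding could be requested like this:

```shell
# one node for 30 minutes; the short walltime routes the job
# to the devel queue, for which nodes are reserved during working hours
qsub -I -X -l nodes=1:ppn=4,walltime=00:30:00
```

Once the job is scheduled, you get a shell on the master node and can run commands such as mpirun there.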

Interactive batch jobs do not produce stdout and stderr files. If you want a protocol of what’s happened, use e.g. the UNIX script command.

Staging Out Results

Warning! This does not work with the current version of the batch system due to a software bug!

When a job reaches its walltime limit, it will be killed by the batch system. The job’s node-local data will either get deleted (if you use $TMPDIR) or be inaccessible, because login to a node is disallowed if you don’t have a job running there. In order to prevent data loss, Torque waits 60 seconds after the TERM signal before sending the final KILL. If the batch script catches TERM with a signal handler, those 60 seconds can be used to copy node-local data to a global file system:

Example: How to use a shell signal handler to stage out data

# signal handler: catch SIGTERM, save scratch data
trap "sleep 5 ; cd $TMPDIR ; tar cf - * | tar xf - -C ${WOODYHOME}/$PBS_JOBID ; exit" 15

# make job data save directory
mkdir -p ${WOODYHOME}/$PBS_JOBID

# assuming a.out stores temp data in $TMPDIR
mpirun ./a.out

The sleep command at the start of the signal handler gives your application some time to shut down before the data is saved. Please note that it is required to use a Bourne or Korn shell variant for catching the TERM signal since csh has only limited facilities for signal handling.
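The mechanism can be tried out locally with plain bash, independent of the batch system. The directories below are throw-away stand-ins for $TMPDIR and the save directory, and the handler omits the sleep and the final exit of the real job script so that the result can be inspected afterwards:

```shell
# throw-away stand-ins for $TMPDIR and the cluster-wide save directory
TMPDIR=$(mktemp -d)
SAVEDIR=$(mktemp -d)

# catch SIGTERM (signal 15) and copy the scratch data
# (the real handler would "sleep 5" first and end with "exit")
trap 'cd "$TMPDIR" && tar cf - . | tar xf - -C "$SAVEDIR"' 15

echo "intermediate result" > "$TMPDIR/state.dat"   # stand-in for scratch data

kill -TERM $$    # this is what Torque sends at the walltime limit
# execution continues here after the handler has copied the data
ls "$SAVEDIR"    # -> state.dat
```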

Job Priorities and Reservations

The scheduler of the batch system assigns a priority to each waiting job. This priority value depends on certain parameters (like waiting time, queue, user group, and recently used CPU time (a.k.a. fairshare)). The ordering of waiting jobs listed by qstat does not reflect the priority of jobs. All waiting jobs with their assigned priority are listed anonymously on the HPC user web pages (some of those pages are password protected; execute the docpw command to get the username and password). There you also get a list of all running jobs, any node reservations, and all jobs which cannot be scheduled for some reason. Some of this information is also available in text form: The text file /home/woody/STATUS/joblist contains a list of all waiting jobs; the text file /home/woody/STATUS/nodelist contains information about node and queue activities.

Further Information


The cluster was originally delivered at the end of 2006 by the companies Bechtle and HP, with 180 compute nodes, each with two Xeon 5160 „Woodcrest“ chips (4 cores per node) running at 3.0 GHz with 4 MB shared Level 2 cache per dual core, 8 GB of RAM, a 160 GB local scratch disk, and a half-DDR/half-SDR high-speed InfiniBand network. The cluster was expanded to 212 nodes within a year. However, those nodes were replaced over time and turned off one by one; none of them remain today. At the time, it was the main cluster at RRZE, intended for distributed-memory (MPI) or hybrid parallel programs with medium to high communication requirements. It was also the first cluster at RRZE to employ a parallel file system (HP SFS), with a capacity of 15 TB and an aggregated parallel I/O bandwidth of > 900 MB/s. That file system was retired in 2012.

row of racks with servers
The woody cluster in 2006

The system entered the November 2006 Top500 list at rank 124 and was ranked number 329 in November 2007.

In 2012, 40 single-socket compute nodes with Intel Xeon E3-1280 processors (4-core „SandyBridge“, 3.5 GHz, 8 GB RAM and 400 GB of local scratch disk) were added (w10xx nodes). These nodes are connected by GBit Ethernet only. Therefore, only single-node (or single-core) jobs are allowed in this segment.

In 2013, 72 single-socket compute nodes with Intel Xeon E3-1240 v3 processors (4-core „Haswell“, 3.4 GHz, 8 GB RAM and 900 GB of local scratch disk) were added (w11xx nodes). These nodes are connected by GBit Ethernet only. Therefore, only single-node jobs are allowed in this segment. They replaced three racks full of old w0xxx nodes, providing significantly more compute power at a fraction of the power usage.

In 2016, 8 single-socket compute nodes with Intel Xeon E3-1240 v5 processors (4-core „Skylake“, 3.5 GHz, 16 GB RAM and 900 GB of local scratch disk) were added (w12xx nodes). Only single-node jobs are allowed in this segment.

Although Woody was originally designed for running parallel programs that use significantly more than one node, its communication network is weak compared to our other clusters and today’s standards. It is therefore now mostly intended for running single-node jobs. Note, however, that the rule that jobs occupying less than one node are not supported and are subject to being killed without notice still applies: you cannot reserve single CPUs, the minimum allocation is one node. Only in the w10xx segment can single cores be requested, as an exception.