This guide gives you a short overview of the most important aspects of running applications on the HPC systems. For more in-depth information, please refer to the linked documentation.
This guide assumes that you already have an HPC account. If this is not the case, you can get the application form here. Basic usage of the HPC systems is typically free of charge for FAU researchers for publicly funded research. If you have any questions regarding the application, please contact your local RRZE contact person or the HPC-support.
By default, all clusters use Linux operating systems with text-mode only. Basic knowledge of file handling, scripting, editing, etc. under Linux is therefore required.
RRZE operates several HPC systems that are tailored to different use cases. Choosing the appropriate cluster is therefore essential, even though your account will work on most of the systems:
- single-core or single-node (throughput) jobs: Woody and/or TinyEth
- multi-node MPI-parallel jobs: Emmy (and Meggie)
  - access to Meggie is restricted to projects that have already demonstrated efficient resource usage, so it is not a system for beginners
- GPU jobs: TinyGPU or Emmy
  - most of the nodes in TinyGPU have been financed by individual groups; therefore, access restrictions / throttling policies may apply.
- large main memory requirement: TinyFat
  - the modern Broadwell-based nodes have been financed by an individual group; therefore, access restrictions / throttling policies may apply.
Connecting to HPC systems
To log into the HPC front ends, you have to connect via an SSH (Secure Shell) client. Windows users can either use the Linux subsystem included in recent Windows 10 versions or a third-party client such as PuTTY or MobaXterm. Under Linux and Mac, native OpenSSH functionality is available. From within the university network, you can connect using the following command:
In this case, USERNAME is your HPC user name and CLUSTERNAME is the name of the cluster you want to log into, e.g. emmy. If you want to access TinyEth, you also have to connect to
If you want to access the clusters from outside the university network, you have to connect to the dialog server first:
You can then ssh to the cluster front ends from there. As an alternative, you can also use VPN to access the clusters directly.
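The two-step login via the dialog server can usually be collapsed into a single command with the OpenSSH jump-host option `-J` (available in OpenSSH 7.3 and newer). The following is a minimal sketch, not the official login procedure: the host names are pure placeholders and must be replaced with the real host names from the cluster documentation, and the helper `jump_login_cmd` is a hypothetical name. It only prints the command, so it can be tried safely on any machine.

```shell
#!/bin/sh
# Placeholders only: replace with your HPC user name and the real host names
# from the cluster documentation. None of these values are real.
HPC_USER="USERNAME"
DIALOG_HOST="DIALOGSERVER_HOSTNAME"
FRONTEND_HOST="CLUSTERNAME_FRONTEND_HOSTNAME"

# Build a one-step login command that hops through the dialog server.
# -J (ProxyJump) requires OpenSSH 7.3 or newer on the local machine.
jump_login_cmd() {
    printf 'ssh -J %s@%s %s@%s\n' \
        "$HPC_USER" "$DIALOG_HOST" "$HPC_USER" "$FRONTEND_HOST"
}

# Print the command instead of executing it.
jump_login_cmd
```

Once the placeholders are filled in, the printed command can be pasted into a terminal on a machine outside the university network.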
Working with data
Different file systems are accessible from the clusters. Due to their different properties, some might be more suited for the required task than others. The first three classes of directories are available on all HPC systems:
$HOME: standard home directory at login, available under
- small quota (10 GB) – cannot be increased
- backup: regular, additional fine-grained snapshots
- storage of important files only
$WORK: general-purpose work directory
The recommended work directory is $WORK. Its destination may point to different file servers and file systems:
$WOODYHOME: available under
- standard quota 200GB
- no backup
- can be used for input/output files and for small files
$SATURNHOME: available under
/home/titan, both are for share holders only!
- group quota according to payment (typically 25+ TB)
- no backup
- can be used for input/output files and for small files
- HSM file system
$HPCVAULT: available under
- standard quota 100 GB for online-files and quota on the number of files/directories
- backup: regular, additional snapshots
- mid- to long term storage of large files; these files may transparently be migrated to offline tape
- Parallel file systems
- local to emmy/meggie, cannot be accessed from outside these systems
- no backup, no quota for data volume, but high watermark deletion and limits on the number of files/directories
- short term storage, only for high performance parallel I/O, no ASCII files
For all file systems, your personal folder is located in your group directory, for example /home/hpc/GROUPNAME/USERNAME. You can also use the environment variables to access the folders directly.
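To see where these environment variables point on the machine you are logged into, a small helper such as the following can be used. This is a sketch only: `show_dirs` is a hypothetical function name, and which of the variables are actually defined depends on the cluster.

```shell
#!/bin/sh
# show_dirs: print where each storage variable points on the current machine.
# Variables that are not defined here are reported as such.
show_dirs() {
    for name in HOME WORK HPCVAULT FASTTMP; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        printf '%-9s %s\n' "\$$name" "${value:-(not set on this machine)}"
    done
}

show_dirs
```

In scripts, refer to the variables directly (for example `cd "$WORK"`) rather than hard-coding the paths, since the destination of a variable may differ between systems.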
File system quota
Nearly all file systems impose quotas on the data volume and/or the number of files or directories. These quotas may be set per user or per group. A hard quota is an absolute upper limit that cannot be exceeded; a soft quota can be exceeded temporarily for a grace period (7 days), after which it turns into a hard quota. You will be notified automatically if you exceed your personal quota on any file system. You can look up your used quota by typing quota -s or shownicerquota.pl on any cluster front end.
Share holders can look up information on their group quota on $SATURNHOME in text files available as
Under Linux and Mac,
rsync are the preferred ways to copy data from and to a remote machine. Under Windows, either the Linux subsystem or additional tools like WinSCP can be used.
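As an illustration of an rsync transfer, the sketch below assembles an upload command. The host name and the remote path are placeholders, not real values, and `upload_cmd` is a hypothetical helper that only prints the command. The options used are standard rsync options: `-a` preserves permissions and timestamps and recurses into directories, `-v` lists the transferred files, and `--partial` keeps partially transferred files so an interrupted copy can be resumed.

```shell
#!/bin/sh
# Placeholders only: replace with your user name and a real front end host name.
HPC_USER="USERNAME"
FRONTEND_HOST="CLUSTER_FRONTEND_HOSTNAME"

# upload_cmd LOCAL_DIR REMOTE_DIR
# Print an rsync command that copies LOCAL_DIR to REMOTE_DIR on the cluster.
# A trailing slash on LOCAL_DIR copies the directory contents, not the directory itself.
# Note: paths containing spaces would need extra quoting in the printed command.
upload_cmd() {
    printf 'rsync -av --partial %s %s@%s:%s\n' \
        "$1" "$HPC_USER" "$FRONTEND_HOST" "$2"
}

# Hypothetical example paths.
upload_cmd ./input/ /path/on/cluster/input/
```

Swapping the two path arguments in the generated command copies data from the cluster back to the local machine.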
The standard Linux packages are installed on the cluster front ends. On the compute nodes, usually much less software is available.
The majority of software is provided by RRZE via the modules system. It contains a variety of compilers, libraries, and open and commercial software. A module has to be loaded explicitly to become usable. All module commands affect the current shell only. The available modules may differ between the clusters.
The available modules can be listed via module avail. Modules are loaded via module load <modulename> and unloaded via module unload <modulename>. The currently loaded modules are displayed by module list. The module commands can usually be used unmodified in any type of PBS job script.
Some modules cannot be loaded together. In some cases such a conflict is detected automatically during the load command, in which case an error message is printed and no modifications are made. Modules can also depend on other modules, which are then loaded automatically. As an example, the current Intel compiler modules depend on IntelMPI and Intel MKL, which are loaded automatically:
$ module load intel64
$ module list
Currently Loaded Modulefiles:
1) intelmpi/2017up04-intel 2) mkl/2017up05 3) intel64/17.0up05
Compiling parallel applications
To compile your MPI-parallel application, you have to explicitly load the necessary modules. For example, when using the Intel compiler and Intel MPI, just use module load intel64. To use gcc instead, run module load gcc to get the default version of the compiler. In this case, you have to load the desired MPI module manually.
You can then use the wrapper commands, e.g. mpif77 or mpif90, to compile your MPI source code. Prior to running your code, you have to load the same modules as for compiling the program.
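The compile step can be tried with a minimal MPI test program. The following sketch assumes the Intel toolchain from the example above; the file name `hello.f90`, the output name `hello` and the `-O2` flag are illustrative choices, not site recommendations. Both the module load and the compilation are skipped if the respective command is not available in the current shell.

```shell
#!/bin/sh
# Write a minimal MPI test program (hello.f90 is a hypothetical file name).
cat > hello.f90 <<'EOF'
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  print *, 'Hello from rank', rank, 'of', nprocs
  call MPI_Finalize(ierr)
end program hello
EOF

# On a cluster front end, load the compiler and MPI modules first.
if command -v module >/dev/null 2>&1; then
    module load intel64
fi

# Compile only if the MPI wrapper is available in this shell.
if command -v mpif90 >/dev/null 2>&1; then
    mpif90 -O2 -o hello hello.f90 && echo "built ./hello"
else
    echo "mpif90 not found; load a compiler and an MPI module first"
fi
```

The resulting executable should be started through the batch system rather than on the front end.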
More details on running parallel applications can be found here.
The cluster front ends can be used for interactive work like editing input files or compiling your application. The run time of each of your applications is restricted by system limits, e.g., after 1 hour of CPU time your run will be killed. Front ends are shared among all users, so be considerate about which applications you run. Please do not run applications with large computational or memory requirements on the front ends, since this may interfere with the work of other users. MPI parallel jobs are generally not allowed on front ends at RRZE.
Compute nodes cannot be accessed directly. Compute resources have to be requested through resource manager software, the so-called batch system. All user jobs except short serial test runs must be submitted to the cluster through this batch system. This is done by creating a job script that contains all the commands you want to run as well as the requested resources, such as the number of compute nodes and the runtime. The submitted jobs are routed into a number of queues (depending on the needed resources, e.g. runtime) and sorted according to a priority scheme. A job will run when the required resources become available. The output of the job is written into a file in your submit directory.
The older clusters use Torque as the batch system; newer clusters, starting with meggie, use Slurm instead. Unfortunately, there are many differences between those two systems. Please refer to the linked documentation of the two batch systems for details on the required commands and example scripts.
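To give an impression of what a job script looks like, the sketch below writes a minimal Torque script to a file. It is not an official template: the job name, the node and core counts, the walltime and the executable `./my_app` are all hypothetical and must be adapted to the cluster and application. The `#PBS` directives and the `PBS_O_WORKDIR` variable are standard Torque features; Slurm clusters use `#SBATCH` directives and `sbatch` instead.

```shell
#!/bin/sh
# Write a minimal Torque job script to job.sh.
# All resource values below are hypothetical examples.
cat > job.sh <<'EOF'
#!/bin/bash -l
#PBS -N example_job
#PBS -l nodes=2:ppn=40,walltime=01:00:00

# Torque starts the job in $HOME; change to the directory of submission.
cd "$PBS_O_WORKDIR"

# Load the same modules that were used for compiling.
module load intel64

# ./my_app is a placeholder for your own MPI executable.
mpirun ./my_app
EOF

echo "Submit on a Torque cluster with: qsub job.sh"
```

After submission, `qstat` lists the job while it is queued or running.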
It is also possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there. This is especially useful for testing or for applications that cannot be run on the front ends due to their higher computational requirements.
The current status of the clusters can be found here. It also includes a list of running and queued jobs. This information can be useful to assess the current workload of the cluster, which also influences the queuing time of your job. It will also show the Message of the day (MOTD) for each cluster, where changes in the configuration, maintenance times, and other disruptions in service will be announced. The MOTD is also visible when you log into a cluster front end.
Try to use the appropriate amount of parallelism. Since most workloads are not highly scalable, it is not always better to use more cores for your application. It can be beneficial to run scaling experiments to figure out the “sweet spot” of your application.
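A scaling experiment boils down to running the same problem on an increasing number of cores and comparing the runtimes. The sketch below computes the speedup (baseline runtime divided by runtime) and the parallel efficiency (speedup divided by the increase in core count) with awk. The timings are invented for illustration and `scaling_report` is a hypothetical helper name.

```shell
#!/bin/sh
# scaling_report: read "cores runtime_seconds" pairs from stdin and print
# speedup and parallel efficiency relative to the first line (the baseline).
scaling_report() {
    awk '
        NR == 1 { base_cores = $1; base_time = $2 }
        {
            speedup = base_time / $2
            efficiency = speedup / ($1 / base_cores)
            printf "%3d cores: speedup %.2f, efficiency %.0f%%\n", $1, speedup, efficiency * 100
        }'
}

# Hypothetical measurements of the same problem on 1 to 16 cores.
scaling_report <<'EOF'
1 1000
2 520
4 280
8 170
16 130
EOF
```

With these made-up numbers, the efficiency drops below 50% at 16 cores, so a smaller core count would make better use of the allocated resources.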
Check the results of your job regularly to avoid wasting computational resources. You can also check whether your job actually uses the allocated nodes in the intended way and whether it runs with the expected performance. On meggie and emmy, it is also possible to access performance data of your finished jobs, including e.g. memory used, floating point rate and usage of the (parallel) file system. To review this information here, you need a job-specific AccessKey, which can be found in the output file.
Use the appropriate file system for your calculations. Doing tiny-size, high-frequency I/O on a parallel file system may overload the metadata servers. When data becomes obsolete, delete it, especially on the parallel file systems ($FASTTMP). No quota limitations apply there, but if a certain fill level is reached, a high-watermark deletion will be executed, which will affect old files of all users. Data which should be archived should be moved to
If you have a problem with your application that you cannot solve yourself, report it to the HPC-support using your FAU mail address. This will immediately open a helpdesk ticket and someone will get back to you. Please provide as much detail as possible so we know where to look, including user name, cluster name, jobID, file system, time of event, etc.