HPC file systems

[Image: the hard discs and servers of vault]


File systems

Overview

A number of file systems are available at RRZE. They differ in available storage size, backup policy, and intended use. Please consider these properties when choosing a place to store your files. More details on the respective systems are listed below.

There is one simple rule to keep in mind: everything that starts with /home/ is available throughout RRZE, which naturally includes all HPC systems. For example, /home/woody is accessible from all clusters, even though it was originally procured together with the Woody cluster and mainly intended for its use.

File system overview
Mount point | Access via | Purpose | Size | Backup | Data lifetime | Quota | Remarks
/home/hpc | $HOME | Storage of source code, input files and important results | 5 TB | Yes | Account lifetime | Yes (restrictive) | –
/home/vault | $HPCVAULT | Mid- to long-term storage, especially for large files | 60 TB online, a lot more offline (on tape) | Yes | Account lifetime | Yes | Hierarchical storage system: files that have not been touched for a long time are automatically moved to tape.
/home/woody | $WOODYHOME | General purpose work directory and storage for small files (used to be cluster-local storage for the Woody cluster) | 88 TB | Limited | Account lifetime | Yes | Backup does not run daily, and data is only kept in backup for a rather short time.
/home/saturn | $SATURNHOME (only defined if you are eligible) | General purpose work directory and storage for small to large files | 300 TB | No | Account lifetime | Yes (group quota) | No backup; shareholder-only file system, i.e. only groups who paid for the file server have access.
/elxfs | $FASTTMP | High-performance parallel I/O; short-term storage; no large ASCII files! | 430 TB | No | High watermark deletion | No, but the number of files/directories is limited | Only available on the emmy cluster.
/lxfs | $FASTTMP | High-performance parallel I/O; short-term storage; no large ASCII files! | 850 TB | No | High watermark deletion | No, but the number of files/directories is limited | Only available on the meggie cluster.

Home directory $HOME

The home directories of the HPC users are housed in the HPC storage system. These directories are available under the path /home/hpc/GROUPNAME/USERNAME on all RRZE HPC systems. The home directory is the directory in which you are placed right after login, and where most programs try to save settings and similar data. When this directory is unavailable, most programs will stop working or show really strange behaviour – which is why we tried to make the system highly redundant.
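
For example, right after login you can quickly check where your home directory actually lives (the group and user name in the output will of course be your own):

echo $HOME     # prints something like /home/hpc/GROUPNAME/USERNAME
pwd            # right after login, this shows the same directory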

The home directory is protected by fine-grained snapshots and additionally by regular backups. It should therefore be used for "important" data, e.g. your job scripts, the source code of the program you're working on, or unrecoverable input files. The quotas there are relatively tight, so it will most probably be too small for the inputs/outputs of your jobs.

Each user gets a standard quota of 10 gigabytes for the home directory. Quota extensions are not possible.
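
To get a rough idea of how close you are to that quota, you can sum up the size of your home directory with standard tools; note that this only shows your current usage, not the quota limit itself:

du -sh $HOME    # total size of everything below your home directory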

Vault $HPCVAULT

[Image: view inside the tape robot]

Additional storage is provided by an HSM-backed archive section called "vault". Each HPC user has a directory there that is available under the path /home/vault/GROUPNAME/USERNAME on all RRZE HPC systems.

HSM stands for hierarchical storage management and means that data is transparently moved between different types of storage media without the need for user intervention. In our case, these are an online pool on fast SAS hard disk arrays and an offline pool on tapes in our tape robot. When you put a file into the archive section, it will naturally go to the disks first. If you do not use this file for some time, it will at some point be moved to a tape in the tape robot – or actually to two tapes, for redundancy. This is, however, fully transparent to you: even when the file has been moved, you will still see it in your directory. When you access the file, the system will automatically tell the tape robot to fetch the tape and copy the file back to the hard disks. This may take a few minutes, but other than waiting there is no user interaction required.

Because of the migration to tape, this file system should be used for mid- to long-term storage. Similar to $HOME, it is also protected by regular snapshots and backups.

There is a limit (quota) on the space a user can occupy in the online pool, i.e. on the rotating hard disks.

General purpose work directory $WOODYHOME

Despite the name, $WOODYHOME is available from all HPC systems under the path /home/woody/GROUPNAME/USERNAME.  It is intended as a general purpose work directory and should be used for input/output files and as a storage location for small files.

However, bear in mind that backup on $WOODYHOME is limited: backups do not run daily and are only kept for a short time. Important data should therefore be stored in other locations.

The standard quota for each user is 200 Gigabytes.

$SATURNHOME

Access to this shareholder-only filesystem is only available for eligible users. It is intended as a general purpose work directory for both small and large files. Keep in mind that no backup or snapshots are available here.

The quota for this file system is defined for the whole group, not for the individual user. It is dependent on the respective share the group has paid for. If your group is interested in contributing, please contact HPC Services.

Parallel file systems $FASTTMP

The emmy and meggie clusters each have a local parallel file system for high-performance short-term storage. Please note that these are entirely separate systems, i.e. you cannot see the files on emmy's $FASTTMP in the $FASTTMP on meggie. They are not available on systems outside of the respective clusters.

The parallel file systems use a high watermark deletion algorithm: when the fill level of the file system exceeds a certain limit (e.g. 70%), files are deleted, starting with the oldest and largest ones, until the fill level drops below 60%. Be aware that the normal tar -x command preserves the modification time of the original files instead of the time when the archive is unpacked, so freshly unpacked files can be among the first candidates for deletion. Use tar -mx or touch in combination with find to work around this, as sketched below. Also be aware that the exact time of deletion is unpredictable.
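
A short sketch of both workarounds (the archive name results.tar and the directory results/ are just placeholders):

tar -mxf results.tar                      # -m sets the modification time to the time of extraction
find results/ -type f -exec touch {} +    # alternatively, reset the modification time of already unpacked files to "now"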

Note that parallel filesystems generally are not made for handling large amounts of small files or ASCII files. This is by design: Parallel filesystems achieve their amazing speed by writing binary streams to multiple different servers at the same time. However, they do that in blocks, in our case 1 MB. That means that for a file that is smaller than 1 MB, only one server will ever be used, so the parallel filesystem can never be faster than a traditional NFS server – on the contrary: due to larger overhead, it will generally be slower. They can only show their strengths with files that are at least a few megabytes in size, and excel if very large files are written by many nodes simultaneously (e.g. checkpointing).
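
As a rough illustration of the access pattern these file systems are built for, the following writes a single 1 GiB test file in 1 MB blocks; the target path is only a placeholder, use a directory of your own below $FASTTMP:

dd if=/dev/zero of=$FASTTMP/GROUPNAME/USERNAME/ddtest.bin bs=1M count=1024   # one large, sequential write in 1 MB blocks
rm $FASTTMP/GROUPNAME/USERNAME/ddtest.bin                                    # clean up the test file afterwards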

Snapshots

Snapshots work mostly like the name suggests: at certain intervals, the file system takes a "snapshot", which is an exact read-only copy of the contents of the whole file system at one moment in time. In a way, a snapshot is similar to a backup, but with one great restriction: as the "backup" is stored on the exact same file system, it is no protection against disasters – if for some reason the file system fails, all snapshots will be gone as well. Snapshots do, however, provide great protection against user errors, which have always been the number one cause of data loss on the RRZE HPC systems. Users can restore important files that have been deleted or overwritten from an earlier snapshot.

Snapshots are stored in a hidden directory called .snapshots. Please note that this directory is more hidden than usual: it will not even show up with ls -a; it only appears when it is explicitly requested.

This is best explained by an example: let's assume you have a file important.txt in your home directory /home/hpc/exam/example1 that you have been working on for months. You accidentally delete that file. Thanks to snapshots, you should be able to recover most of it, and "only" lose the last few hours of work. If you run ls -l /home/hpc/exam/example1/.snapshots/, you should see something like this:

ls -l /home/hpc/exam/example1/.snapshots/
drwx------ 49 example1 exam 32768  8. Feb 10:54 @GMT-2019.02.10-03.00.00
drwx------ 49 example1 exam 32768 16. Feb 18:06 @GMT-2019.02.17-03.00.00
drwx------ 49 example1 exam 32768 24. Feb 00:15 @GMT-2019.02.24-03.00.00
drwx------ 49 example1 exam 32768 28. Feb 23:06 @GMT-2019.03.01-03.00.00
drwx------ 49 example1 exam 32768  1. Mär 21:34 @GMT-2019.03.03-03.00.00
drwx------ 49 example1 exam 32768  1. Mär 21:34 @GMT-2019.03.02-03.00.00
drwx------ 49 example1 exam 32768  3. Mär 23:54 @GMT-2019.03.04-03.00.00
drwx------ 49 example1 exam 32768  4. Mär 17:01 @GMT-2019.03.05-03.00.00

Each of these directories contains an exact read-only copy of your home directory at the time given in its name. To restore the file to the state it was in at 03:00 UTC on the 5th of March, simply copy it from there back to its original location: cp '/home/hpc/exam/example1/.snapshots/@GMT-2019.03.05-03.00.00/important.txt' '/home/hpc/exam/example1/important.txt'
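
If you are not sure which snapshot still contains a usable version of the file, you can also list all snapshot copies of it at once (same example paths as above):

ls -l /home/hpc/exam/example1/.snapshots/*/important.txt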

Snapshots are enabled on both the home directories and the vault section, but they are taken much more often on the home directories than on vault. Please note that the exact snapshot intervals and the number of snapshots retained may change at any time – you should not rely on the existence of a specific snapshot. Also note that all times given are in GMT/UTC. Depending on whether daylight saving time is active, 03:00 UTC therefore works out to either 05:00 or 04:00 German time. At the time of this writing, snapshots were configured as follows:

Snapshot settings on the home section (/home/hpc)
Interval | Copies retained | Covered timespan
30 minutes (every half and full hour) | 6 | 3 hours
2 hours (every odd-numbered hour: 01:00, 03:00, 05:00, …) | 12 | 1 day
1 day (at 03:00) | 7 | 1 week
1 week (Sundays at 03:00) | 4 | 4 weeks

Snapshot settings on the vault section (/home/vault)
Interval | Copies retained | Covered timespan
1 day (at 03:00) | 7 | 1 week
1 week (Sundays at 03:00) | 4 | 4 weeks

Advanced Topics

Limitations on number of files

Please note that having a large number of small files is quite bad for file system performance. This is true for almost any file system and certainly for all RRZE file servers, but it is a bit tougher for the HPC storage system ($HOME, $HPCVAULT) due to the underlying parallel file system, the snapshots and the hierarchical storage management. We have therefore set a limit on the number of files a user is allowed to have. That limit is set rather high for the home section, so that you are unlikely to hit it unless you try to, because small files are part of the intended usage there. It is, however, set rather tight on the vault section, especially compared to the huge amount of space available there. Note that for every file, a small (1 MB) stub is kept on the disks even if the rest of the file is migrated to tape, meaning that even migrated files take up some disk space. It also means that files smaller than the stub size are never written to tape, because that would not make sense.

If you have a large number of small files in the vault section that you do not intend to use for a long time, please put them into an archive (tar, zip, etc.).
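
A minimal sketch of this (old_project/ is just a placeholder for the directory you want to pack away):

tar -czf old_project.tar.gz old_project/    # pack everything into a single compressed archive
tar -tzf old_project.tar.gz > /dev/null     # verify that the archive is readable
rm -r old_project/                          # only then remove the many small original files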

The same limitations apply to the parallel file systems ($FASTTMP) on meggie and emmy. More details can be found on the respective pages of emmy and meggie.

Access Control Lists (ACLs)

Besides the normal Unix permissions that you set with chmod (where you can set permissions for the owning user, the owning group, and everyone else), the system also supports more advanced ACLs.

However, they are not implemented in the traditional (and non-standardized) way with setfacl/getfacl that users of Linux or Solaris might be familiar with, but in the newer standardized way that NFS version 4 uses. This has both advantages and disadvantages. One advantage is that these ACLs work practically the same way as ACLs on Windows, meaning that you can set them from a Windows client through the usual Explorer interface. The major disadvantages are that they are unnecessarily complex and that their support on the Linux side is far from perfect yet – not least because NFS version 4 is not in a usable state yet. That leads to a few restrictions, which we cover below.

The ACLs can only be edited and viewed from clients that access the file system through a native GPFS client or via CIFS, not from NFS clients. As there are no native GPFS clients available to normal users, the only way to edit them currently is CIFS. Please note that this restriction only applies to setting and displaying ACLs, not to their effectiveness: as access permissions are checked by the file system servers and not by the clients, any ACL that is set will affect every client, even if the client machine has no idea that the ACL exists.

Further Information on HPC storage

The system serves two functions: it houses the normal home directories of all HPC users, and it provides tape-backed mid- to long-term storage for user data. It is based on IBM hardware and software (Spectrum Scale/GPFS, Spectrum Protect/TSM) and went into operation in August 2009.

Technical data

  • 6 file servers, IBM X3650 7979-B3G, 16 GB RAM, 10 Gbit Ethernet
  • 1 console node, IBM X3650 7979-B3G, 16 GB RAM, 10 Gbit Ethernet
  • 1 TSM server, IBM X3650 7979-B3G, 8 GB RAM, 10 Gbit Ethernet
  • IBM TS3500 tape library with currently
    • 6 LTO4 tape drives and 2 expansion frames
    • 2913 LTO4 tape slots
    • >1500 LTO4 tapes
  • 3 IBM DS3500
    • plus 12 IBM EXP3000 expansion units (4 per DS3500)
    • redundant controllers
    • 180 SAS 600 GB 15k rpm drives for data
    • Usable data capacity: 66 TB for vault, 5 TB for homes
  • 1 IBM DS3400 and 1 IBM DS3500 for TSM