The LiMa cluster was retired in December 2018 after more than 8 years of operation. It is no longer available.
All further content is for historic reference only.
The RRZE’s LiMa cluster (manufacturer: NEC) is a high-performance compute resource with a high-speed interconnect, installed in 2010. It is nowadays intended for distributed-memory (MPI) or hybrid parallel programs with low to high communication requirements.
- initially 500 compute nodes, each with two hexa-core Xeon 5650 "Westmere" chips (12 cores + SMT per node) running at 2.66 GHz, with 12 MB shared cache per chip and 24 GB of RAM (DDR3-1333).
- 2 frontend nodes with the same CPUs as the compute nodes, but 48 GB of memory.
- parallel filesystem (LXFS) with a capacity of 100 TB and an aggregated parallel I/O bandwidth of > 3000 MB/s
- InfiniBand interconnect fabric with 40 GBit/s bandwidth per link and direction
- Overall peak performance of ca. 64 TFlop/s (56.7 TFlop/s LINPACK). With this performance, the cluster entered the Top 500 list in November 2010 at rank 130. It was placed at rank 366 a year later and dropped off the list with the June 2012 edition.
LiMa was designed for running parallel programs that use significantly more than one node. However, as better systems for this purpose exist nowadays, single-node jobs are also permitted there. Jobs occupying less than a full node are not supported by RRZE and are subject to being killed without notice.
This website shows information regarding the following topics:
Access, User Environment, and File Systems
Access to the machine
Users can connect by SSH to lima.rrze.uni-erlangen.de and will be randomly routed to one of the two frontends. All systems in the cluster, including the frontends, have private IP addresses in the 10.188.8.0/22 range. Thus they can only be accessed directly from within the FAU networks. If you need access from outside of FAU, you have to connect, for example, to the dialog server cshpc.rrze.uni-erlangen.de first and then ssh to LiMa from there. While it is possible to ssh directly to a compute node, a user is only allowed to do this while they have a batch job running there. When all batch jobs of a user on a node have ended, all of their processes, including any open shells, will be killed automatically.
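For example, from outside the university network the two-step login could look like this (the user name is just a placeholder):

ssh someuser@cshpc.rrze.uni-erlangen.de     # log in to the dialog server first
ssh lima.rrze.uni-erlangen.de               # then hop on to one of the LiMa frontends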
The login and compute nodes run 64-bit CentOS (which is basically Red Hat Enterprise Linux without the support). As on most other RRZE HPC systems, a modules environment is provided to facilitate access to software packages. Type "module avail" to get a list of available packages.
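As an illustration, a typical modules session could look like this (the module name is just an example; consult module avail for what is actually installed):

module avail                 # list all available software packages
module load openmpi          # make the OpenMPI installation available in this shell
module list                  # show which modules are currently loaded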
File Systems
The following table summarizes the available file systems and their features. It is only an excerpt from the main file system table in the HPC environment description.
Mount point | Access via | Purpose | Technology, size | Backup | Data lifetime | Quota |
---|---|---|---|---|---|---|
/home/hpc | $HOME | Storage of source, input and important results | NFS on central servers, small | YES + Snapshots | Account lifetime | YES (restrictive) |
/home/vault | | Mid- to long-term storage | central servers, HSM | YES + Snapshots | Account lifetime | YES |
/home/woody | $WOODYHOME | Short- to mid-term storage of small files | central NFS server | NO | Account lifetime | YES |
/lxfs | $FASTTMP | High-performance parallel I/O; short-term storage | LXFS (Lustre) parallel file system via InfiniBand, 115 TB | NO | High watermark deletion | NO |
Please note the following differences to our other clusters:
- The nodes do not have any local hard disc drives.
- /scratch and /tmp lie in RAM, so it is absolutely NOT possible to store more than a few MB of data there.
NFS file system $HOME
When connecting to one of the frontend nodes, you’ll find yourself in your regular RRZE $HOME directory (/home/hpc/...). There are relatively tight quotas there, so it will most probably be too small for the inputs/outputs of your jobs. It does, however, offer a lot of nice features, like fine-grained snapshots, so use it for "important" stuff, e.g. your job scripts or the source code of the program you’re working on. See the HPC storage page for a more detailed description of the features.
Parallel file system $FASTTMP
The cluster’s parallel file system is mounted on all nodes under /lxfs/$GROUP/$USER/ and is available via the $FASTTMP environment variable. It supports parallel I/O using the MPI-I/O functions and can be accessed with an aggregate bandwidth of >3000 MB/s (and even much higher if caching effects can be exploited).
The parallel file system is strictly intended to be high-performance short-term storage, so a high watermark deletion algorithm is employed: when the filling of the file system exceeds a certain limit (e.g. 80%), files will be deleted starting with the oldest and largest files until a filling of less than 60% is reached. Be aware that the normal tar -x command preserves the modification time of the original file instead of the time when the archive is unpacked, so unpacked files may become one of the first candidates for deletion. Use tar -mx or touch in combination with find to work around this. Be aware that the exact time of deletion is unpredictable.
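As a minimal illustration (archive and directory names are just placeholders), the two workarounds could look like this:

tar -mxf results.tar                                    # -m sets the modification time to the time of unpacking
tar -xf results.tar && find results/ -exec touch {} +   # or unpack normally, then touch everything extracted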
Note that parallel filesystems generally are not made for handling large amounts of small files. This is by design: Parallel filesystems achieve their amazing speed by writing to multiple different servers at the same time. However, they do that in blocks, in our case 1 MB. That means that for a file that is smaller than 1 MB, only one server will ever be used, so the parallel filesystem can never be faster than a traditional NFS server – on the contrary: due to larger overhead, it will generally be slower. They can only show their strengths with files that are at least a few megabytes in size, and excel if very large files are written by many nodes simultanously (e.g. checkpointing). For that reason, we have set a limit on the number of files you can store there.
Batch processing
As with all production clusters at RRZE, resources are controlled through a batch system. The frontends can be used for compiling and very short serial test runs, but everything else has to go through the batch system to the cluster.
Please see the batch system description in our HPC environment description.
The following queues are available on this cluster:
Queue | min – max walltime | min – max nodes | Availability | Comments |
---|---|---|---|---|
route | N/A | N/A | all users | Default router queue; sorts jobs into execution queues |
devel | 0 – 01:00:00 | 1 – 8 | all users | Some nodes reserved for this queue during working hours |
work | 01:00:01 – 24:00:00 | 1 – 64 | all users | "Workhorse" |
big | 01:00:01 – 24:00:00 | 1 – 500 | special users | Not active all the time, as it causes quite some waste. Users can get access for benchmarking or after proving that they can really make use of more than 64 nodes with their codes. |
special | 0 – infinity | 1 – all | special users | Direct job submit with -q special |
As full nodes have to be requested, you always need to specify -l nodes=<nnn>:ppn=24 on qsub.
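For example, a job requesting four full nodes for six hours could be submitted like this (script name and walltime are purely illustrative):

qsub -l nodes=4:ppn=24,walltime=06:00:00 job_script.sh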
MPI
There are two supported MPI implementations on LiMa: The first is IntelMPI, i.e. the same as on Woody, the second is OpenMPI. The reason for this is that an increasing number of applications support OpenMPI, while we have support from the vendor for IntelMPI.
Unfortunately, no MPI start mechanism has proved to be "perfect" on LiMa. You will need to experiment to find the one that works best for your application. The following hints might be helpful:
- There is no mpirun in the default $PATH (unless you have the openmpi module loaded).
- For IntelMPI, to use a start mechanism more or less compatible with the other RRZE clusters, use /apps/rrze/bin/mpirun_rrze-intelmpd -intelmpd -pin 0_1_2_3_4_5_6_7_8_9_10_11 ... In this way, you can explicitly pin all your processes as on the other RRZE clusters. However, this start mechanism can run into problems if the process counts get very large, i.e. you have very large jobs.
- Another option (currently only available on LiMa) is to use one of the official start mechanisms of Intel MPI (assuming you use bash for your job script and intelmpi/4.0.1.007-[intel|gnu] is loaded):
export PPN=12                                  # MPI processes per node (one per physical core)
export NODES=`uniq $PBS_NODEFILE | wc -l`      # number of nodes allocated to the batch job
export I_MPI_PIN=enable                        # let Intel MPI pin the MPI processes
mpiexec.hydra -rmk pbs -ppn $PPN -n $(( $PPN * $NODES )) -print-rank-map ./a.out
Attention: pinning does not work properly under all circumstances for this start method. See chapter 3.2 of /apps/intel/mpi/4.0.1.0007/doc/Reference_Manual.pdf for more details on I_MPI_PIN and friends.
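Put together, the relevant part of a job script using the RRZE wrapper from the list above might look as follows (a sketch only; the module version and binary name are illustrative):

module load intelmpi/4.0.1.007-intel
/apps/rrze/bin/mpirun_rrze-intelmpd -intelmpd -pin 0_1_2_3_4_5_6_7_8_9_10_11 ./a.out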
Further Information
Intel Xeon 5650 "Westmere" Processor
The Xeon 5650 processor implements Intel’s Nehalem microarchitecture and is a hexa-core chip running at 2.66 GHz. The most significant improvements over the Core 2 based chips (as used, e.g., in our Woodcrest cluster) have been made to the memory interface; in addition, the chips can dynamically overclock themselves as long as they stay within their thermal envelope. Since the nodes are inside water-cooled racks, they basically run at 2.93 GHz all the time.
The memory interface controllers are no longer in the chipset but integrated into the CPU, a concept familiar from the Opteron CPUs of Intel’s competitor AMD. Intel has, however, decided to go the whole hog: each CPU has no fewer than three independent memory channels, which leads to vastly improved memory bandwidth compared to Core 2 based CPUs like the Woodcrest. Please note that this improvement really only applies to the memory interface: applications that run mostly from the cache do not run any better on Nehalem/Westmere than on Woodcrest.
The physical CPU sockets are coupled via the QuickPath Interconnect (QPI). As the memory is now attached directly to the CPUs, accesses to the memory of the other socket have to go through QPI and the other processor, so they are more expensive and slower. In other words, Westmere and Nehalem systems are ccNUMA machines.
InfiniBand Interconnect Fabric
The InfiniBand network on LiMa is a quad data rate (QDR) network, i.e. the links run at 40 GBit/s in each direction. It is fully non-blocking, i.e. the backbone is capable of handling the maximum amount of traffic coming in through the client ports without any congestion. However, because InfiniBand uses static routing, i.e. once a route is established between two nodes it does not change even if the load on the backbone links changes, it is possible to generate traffic patterns that cause congestion on individual links. This is, however, unlikely to happen with normal user jobs.