Environment on the RRZE HPC systems
This page covers the following topics:
modules system
On all RRZE HPC systems, established tools for software development (compilers, editors, ...), libraries, and selected applications are available. Many of these applications need special environment variables to be set so that, for example, search paths are correct or license servers can be found.
To ease selection of and switching between different versions of software
packages, all Linux-based HPC systems at RRZE use the
modules system
(cf.
modules.sourceforge.net).
It lets you conveniently load the necessary configuration for
different programs or different versions of the same program and, if necessary,
unload it again later.
Important module commands
| Command | Description |
|---|---|
| module avail | lists available modules |
| module whatis | shows an over-verbose listing of all available modules |
| module list | shows which modules are currently loaded |
| module load <pkg> | loads the module pkg, i.e. applies all settings that are necessary for using the package pkg (e.g. search paths) |
| module load <pkg>/<version> | loads a specific version of the module pkg instead of the default version |
| module unload <pkg> | removes the module pkg, i.e. undoes what the load command did |
| module help <pkg> | shows a detailed description for module pkg |
| module show <pkg> | shows which environment variables module pkg actually sets or modifies |
General hints for using modules
- Modules only ever affect the current shell.
- If individual modules are to be loaded all the time, you can put the command into your login scripts, e.g. into $HOME/.cshrc.
- The syntax of the module commands is independent of the shell used. They can thus usually be used unmodified in any type of PBS jobscript.
- Some modules cannot be loaded together. In some cases such a conflict is detected automatically during the load command, in which case an error message is printed and no modifications are made.
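As a rough illustration of the hints above, the following sketch loads a module from a bash script only if the modules system is actually initialised in the current shell. The module name intel64 is used purely as an example, and the variable MODULES_OK is an invention of this sketch, not something the modules system provides.

```shell
#!/bin/bash -l
# Sketch: guard module commands so the script also works where modules are missing.
if type module >/dev/null 2>&1; then
    module load intel64     # example module name; check "module avail" first
    module list             # show which modules are loaded now
    MODULES_OK=yes          # hypothetical helper variable of this sketch
else
    echo "modules system is not initialised in this shell" >&2
    MODULES_OK=no
fi
```

Because modules only affect the current shell, the same lines have to be repeated in every script or login shell that needs them.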
Important modules on the IA32 cluster
Several meta-modules are available on the IA32 cluster. These modules automatically load further individual modules and thus simplify the selection of recommended software components. Details on the individual modules can be found further down on this page.
| Module | Description |
|---|---|
| intel | This meta-module prepares the use of the Intel C/C++ and Fortran compilers in the version recommended by RRZE for 32-bit systems. It also applies the settings needed for a matching GBit MPICH. |
| intel64 | This meta-module prepares the use of the Intel C/C++ and Fortran compilers in the version recommended by RRZE for 64-bit systems. It also applies the settings needed for a matching GBit MPICH. |
| mpich/gnu and mpich/gnu64 | Strictly speaking, this is not a meta-module, because the GNU compilers are in the search path by default. It prepares the use of MPICH over GBit with the 32/64-bit GNU compilers. |
For a wider choice, individual modules can also be loaded in specific versions:
| Module | Description |
|---|---|
| intel-c/X.Y-Z, intel-f/X.Y-Z, intel64-c/X.Y-Z and intel64-f/X.Y-Z | These modules select specific versions of the compilers. MPI modules are not loaded automatically by them. |
| mpich/gnu64-ib and mpich/intel64-ib | Loads the settings for the InfiniBand variant of MPICH. |
An up-to-date list of all available modules with their
descriptions can be obtained with the commands
module avail and module whatis.
Some tips that can make working with modules easier, particularly in
Makefiles:
- The MPI modules set the environment variables MPICHROOTDIR and MPIHOME to the base directory of the matching MPICH version. Include files and libraries can thus be accessed uniformly in Makefiles as $MPIHOME/include and $MPIHOME/lib.
- Similarly, the Intel compiler modules set the environment variables INTEL_C_HOME and INTEL_F_HOME to the respective base directory. This is particularly helpful if you want to link Fortran and C++ objects together and have to specify the matching libraries manually.
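The following sketch shows one way the MPIHOME variable could be turned into compiler and linker search-path flags. The function name mpi_flags is hypothetical, and the sketch deliberately leaves out the names of the libraries to link, since those depend on the MPICH version in use.

```shell
# Sketch: derive include and library search paths from MPIHOME,
# which is set by the MPI modules as described above.
mpi_flags() {
    if [ -z "${MPIHOME:-}" ]; then
        echo "MPIHOME is not set; load an MPI module first" >&2
        return 1
    fi
    echo "-I${MPIHOME}/include -L${MPIHOME}/lib"
}
```

In a Makefile the same idea would typically be written directly as -I$(MPIHOME)/include and -L$(MPIHOME)/lib, so that switching the MPI module requires no change to the build files.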
Overview of available software on the different HPC systems
The following table gives a rough overview of the software available on
the different HPC systems. Note that this list is likely outdated -
refer to the output of module avail on the respective machines.
| Software | LiMa | Woody | Cluster32/64 | Remarks |
|---|---|---|---|---|
| GNU Compiler | 4.x as default | 4.x as default | 4.x as default | the exact default version depends on the underlying linux distribution. modules for more current gcc versions are usually available. |
| Recommended native compiler | Intel C/C++ and Fortran90 Compiler 11.1 and newer | Intel C/C++ and Fortran90 Compiler 10.1 and newer | Intel C/C++ and Fortran90 Compiler 9.1 and newer | |
| Mathematical library | Intel MKL | Intel MKL | Intel MKL | |
| recommended MPI library | OpenMPI or Intel MPI | Intel MPI | Intel MPI | |
| Parallel debugger | Totalview | DDT | N/A | The parallel debuggers are usually not something users can handle themselves. Please contact HPC services for assistance. |
| Profiling tools | yes | yes | yes | These tools tend to confuse inexperienced users and lead them to incorrect conclusions. Please contact HPC services for assistance. |
Some commercial or semi-commercial software packages are also available on some HPC systems. There are however two important points to note:
- HPC@RRZE does not provide any licenses. If you want to use any commercial software, you will need to bring the license with you. This is also true for software sublicensed from the RRZE software group. All calculations you do on the clusters will draw licenses out of your pool. Please try to clarify any licensing questions before contacting us, as we really do not plan to become experts in software licensing.
- We usually do not have experience with these software packages. We cannot and will not help you with general questions on these packages. We will try our best to help you run the software on our clusters, but we expect you to know how to use the software in principle.
We know of the following commercial software that has been run successfully by some users on our clusters:
| software | remarks |
|---|---|
| STAR-CD | |
| CFX-5 | |
| Fluent | RRZE cannot provide any kind of assistance for this. For any problems, contact Fluent Germany. |
| gaussian | not really usable on our current systems |
| amber | |
| gromacs | |
| abaqus | |
| maple | |
| mathematica | |
OpenMP Pinning
Introduction
To reach optimum performance with OpenMP codes, correct pinning of the OpenMP
threads is essential. As nowadays practically all machines are ccNUMA, where
incorrect or no pinning can have devastating effects, this is something that
should not be ignored.
We offer a convenient way to do that on the RRZE systems (including our
test cluster): we have implemented a small library that replaces the calls
an OpenMP program makes for thread creation with variants that do the pinning
at runtime.
Usage
To simplify usage, there is the wrapper script /apps/rrze/bin/pin_omp
that can be used as follows in the simplest case:
/apps/rrze/bin/pin_omp -c 0-7 ./mybinary
This will run ./mybinary and pin the threads to cores 0-7.
Possible problems
Unfortunately, it is not always that easy, for several reasons:
- Statically linked binaries cannot be intercepted by our library. It only works for dynamically linked binaries.
- The library assumes a thread layout as generated by the Intel compilers. Binaries created with other compilers may require different parameters and could even get slower through our library.
- The Intel compilers starting with version 10.0 attempt to do some pinning themselves, which naturally collides with the pinning attempts of our library. This usually leads to all threads being executed on a single CPU core, resulting in horrible performance. However, they only do that on genuine Intel CPUs. If you run the same binary on, for example, an Opteron CPU, it does not attempt to do pinning, and the pinning done by our library works as expected. Starting with compiler version 10.1.18 or 11.0, this pinning can be disabled by setting the environment variable KMP_AFFINITY:
  setenv KMP_AFFINITY disabled
- The order of the CPU parameter is not obeyed: -c 0-7 has exactly the same effect as -c 1,3,5,7,0,2,4,6. A different order can however be enforced through an environment variable; see the section for advanced users.
The following table summarizes the recommended settings:
| Compiler | on Intel CPUs | on non-Intel CPUs (e.g. Opteron) |
|---|---|---|
| Intel 9.1 | use pin_omp | use pin_omp |
| Intel 10.0 to 10.1.17 | DON'T use pin_omp | use pin_omp |
| Intel 10.1.18 and newer | use pin_omp, set variable KMP_AFFINITY to disabled. | use pin_omp |
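For the last row of the table, a bash version of the setup might look like the sketch below. It uses export where the text above shows the csh form setenv, and it checks for the wrapper first so that it fails gracefully on machines where pin_omp is not installed. The variable name PIN_OMP is just a shorthand introduced here.

```shell
# Sketch for binaries built with Intel compilers 10.1.18 or newer, run on Intel CPUs.
export KMP_AFFINITY=disabled        # switch off the pinning done by the Intel runtime
PIN_OMP=/apps/rrze/bin/pin_omp      # wrapper script described in the Usage section

if [ -x "$PIN_OMP" ]; then
    "$PIN_OMP" -c 0-7 ./mybinary    # pin the threads of ./mybinary to cores 0-7
else
    echo "pin_omp is not available on this machine" >&2
fi
```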
Further possibilities for advanced users
Advanced users can influence the behaviour of the library with environment variables.
| Variable | Effect |
|---|---|
| PINOMP_MASK | Works like the 'dplace' parameter -x, i.e. it expects a number that is interpreted as a bitmask. The threads for which the corresponding bit is set will not be pinned. |
| PINOMP_SKIP | Works like the 'dplace' parameter -s: the thread with this number will not be pinned. Multiple numbers can be given, separated by commas. |
| PINOMP_CPUS | Explicitly specifies the CPU core numbers to use and their order, with the core numbers separated by commas. This overrides the -c command line parameter. |
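As a small illustration of these variables, the sketch below sets an explicit core order and computes a bitmask from a list of thread numbers. The helper pinomp_mask is hypothetical, and it assumes that bit n of the mask stands for thread number n (counting from 0) and that the library accepts the mask as a decimal number; neither detail is stated above, so please verify before relying on it.

```shell
# Explicit core order: first the odd cores, then the even ones.
export PINOMP_CPUS=1,3,5,7,0,2,4,6

# Hypothetical helper: build a bitmask from thread numbers.
# Assumption: bit n corresponds to thread n, counting from 0.
pinomp_mask() {
    mask=0
    for t in "$@"; do
        mask=$(( mask | (1 << t) ))
    done
    echo "$mask"
}

# Leave threads 0 and 2 unpinned (bits 0 and 2 set).
PINOMP_MASK=$(pinomp_mask 0 2)
export PINOMP_MASK
```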
File systems
Overview
A number of file systems are available at RRZE. In the past, every cluster
was procured with a rather large cluster-local NFS server, some of which
are still in operation, which makes the following list rather long.
There is one simple rule to keep in mind: everything that
starts with /home/ is available throughout RRZE,
which naturally includes all HPC systems. For example,
/home/woody is accessible from all clusters, even though it
was bought together with the Woody cluster and mainly for use by
that cluster.
| Mount point | Access via | Purpose | Size | Backup | Data lifetime | Quota | Remarks |
|---|---|---|---|---|---|---|---|
| /home/hpc | $HOME | Storage of source, input and important results | 15 TB | Yes | Account lifetime | Yes (restrictive) | |
| /home/vault | | Mid- to long-term storage | 50 TB online, a lot more offline (on tape) | Yes | Account lifetime | Yes | Hierarchical storage system. Files that have not been touched for a long time are automatically moved to tape. |
| /home/woody | | Used to be cluster-local storage for the Woody cluster, now official storage for small files | 35 TB | NO | Account lifetime | Yes | |
| /home/cluster64 | | Cluster-local storage for cluster64 | 13 TB | NO | Account lifetime | Yes | This filesystem is approaching its end of life. |
| /home/cluster32 | | Cluster-local storage for cluster32 | | | | | Turned off some time ago |
| /home/altix | | Local storage for the Altixes | | | | | Turned off some time ago |
| /wsfs | $FASTTMP | High performance parallel I/O; short-term storage | 15 TB | NO | High watermark deletion | No | Only available on the Woody cluster and on the cluster64 nodes (not the frontends). This filesystem is approaching its end of life. |
| /lxfs | $FASTTMP | High performance parallel I/O; short-term storage | 115 TB | NO | High watermark deletion | No | Only available on the LiMa cluster |
NFS file system $HOME
When logging in to any system, you will start in your regular
RRZE
$HOME directory, which is usually located under /home/hpc/....
There are relatively tight quotas there, so it will most probably be too small
for the inputs/outputs of your jobs. It however does offer a lot of nice
features, like fine grained snapshots, so use it for "important" stuff,
e.g. your jobscripts, or the source code of the program you're working on. See
the HPC storage page for a more detailed
description of the features.
Parallel file systems $FASTTMP
The Woody and the LiMa cluster each have a parallel filesystem for high performance short-term storage. Please note that they are entirely different systems, i.e. you cannot see the files from Woody's $FASTTMP in the $FASTTMP on LiMa. They are not available on systems outside of the respective clusters.
The parallel file systems use a high watermark deletion algorithm:
When the fill level of the file system exceeds a certain limit
(e.g. 80%), files are deleted, starting with the oldest and largest files,
until the fill level drops below 60%. Be aware that the normal tar -x
command preserves the modification time of the original file instead of using the time
when the archive is unpacked. Unpacked files may therefore be among the first candidates
for deletion. Use tar -mx or touch in combination with
find to work around this. Be aware that the exact time of deletion is
unpredictable.
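The two workarounds mentioned above could be wrapped as in the following sketch. The function names and the archive and directory arguments are placeholders chosen for illustration.

```shell
# Unpack an archive so that the extracted files get the current time as
# modification time (-m) instead of the time stored in the archive.
unpack_fresh() {
    archive=$1      # placeholder: path of the tar archive
    target=$2       # placeholder: directory to unpack into
    mkdir -p "$target" && tar -xmf "$archive" -C "$target"
}

# Refresh the modification time of files that were already unpacked without -m.
refresh_mtime() {
    find "$1" -type f -exec touch {} +
}
```

Either variant only resets the age of the files; it does not protect them from deletion once the file system fills up again.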
Note that parallel filesystems are generally not made for handling large numbers of small files. This is by design: parallel filesystems achieve their speed by writing to multiple servers at the same time. However, they do that in blocks, in our case 1 MB. That means that for a file smaller than 1 MB, only one server will ever be used, so the parallel filesystem can never be faster than a traditional NFS server. On the contrary, due to the larger overhead it will generally be slower. Parallel filesystems can only show their strengths with files that are at least a few megabytes in size, and they excel if very large files are written by many nodes simultaneously (e.g. checkpointing).
Shells
In general, two types of shells are available on the HPC systems:
- csh, the C shell, usually in the form of the feature-enhanced tcsh instead of the classic csh
- bash
csh is the default login shell for all users. Whenever a new
account is created, it gets csh as login shell, so every time you
log in, you are in a csh. However, users familiar with
Linux systems are used to bash and are often confused by the
behaviour and very limited feature set of csh. If you want to
use bash instead, you can contact the ServiceTheke or the HPC team
to get your login shell changed. Be warned though that bash is
a supported shell only on the HPC systems, so you should not change to it if you
want to log in to non-HPC systems at RRZE, as you may experience problems
there or even be unable to log in at all.
Batch processing
Introduction
All of the HPC clusters (with the exception of a few special machines) run under the control of a batch system. All user jobs except short serial test runs must be submitted to the cluster through this batch system. The submitted jobs are then routed into a number of queues (depending on the needed resources, e.g. runtime) and sorted according to some priority scheme.
A job will run when the required resources become available.
On most clusters, a number of nodes is reserved during working hours for short
test runs with less than one hour of runtime. These nodes are dedicated to the
devel queue. Do not use the devel queue
for production runs. Since we do not allow MPI-parallel applications
on the frontends, short parallel test runs must be performed
using batch jobs.
It is also possible to submit interactive batch jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.
The command to submit jobs is called qsub. To submit a batch job use
qsub <further options> [<job script>]
The job script may be omitted for interactive
jobs (see below). After submission, qsub will output the Job ID
of your job. It can later be used for identification purposes and is also
available as the environment variable $PBS_JOBID in job scripts
(see below). These are the most important options for the qsub
command:
| Option | Meaning |
|---|---|
| -N <job name> | Specifies the name which is shown with qstat. If the option is omitted, the name of the batch script file is used. |
| -l nodes=<# of nodes>:ppn=<nn> | Specifies the number of nodes requested. All current clusters require you to always request full nodes. (The old Cluster64 allows allocating single CPUs.) Thus, for LiMa you always need to specify :ppn=24, for TinyBlue :ppn=16 and for Woody :ppn=4. For other clusters, see the documentation of the respective cluster for the correct ppn value. |
| -l walltime=HH:MM:SS | Specifies the required wall clock time (runtime). When the job reaches the walltime given here, it will be sent a TERM signal. If the job has not ended after a few more seconds, it will be sent KILL. If you omit the walltime option, a (very short) default time will be used. Please specify a reasonable runtime, since the scheduler also bases its decisions on this value (short jobs are preferred). |
| -M x@y -m abe | You will get e-mail to x@y when the job is aborted (a), starting (b), and ending (e). You can choose any subset of abe for the -m option. If you omit the -M option, the default mail address assigned to your RRZE account will be used. |
| -o <standard output file> | File name for the standard output stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID. |
| -e <error output file> | File name for the standard error stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID. |
| -I | Interactive job. It is still allowed to specify a job script, but it will be ignored, except for the PBS options it might contain. No code will be executed. Instead, the user will get an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel program with mpirun. |
| -X | Enable X11 forwarding. If the $DISPLAY environment variable is set when submitting the job, an X program running on the compute node(s) will be displayed on the user's screen. This makes sense only for interactive jobs (see -I option). |
| -W depend=<dependency list> | Makes the job depend on certain conditions. E.g., with -W depend=afterok:12345 the job will only run after job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the qsub man page for more information. |
| -q <queue> | Specifies the Torque queue (see above); the default queue is route. Usually it is not required to use this parameter, as the route queue automatically forwards the job to an appropriate execution queue. |
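To illustrate the -W depend option, the sketch below submits several job scripts so that each one only starts after the previous one has ended successfully. It relies on qsub printing the job ID on submission, as described above. The function submit_chain and the QSUB variable are inventions of this sketch; QSUB merely allows substituting a stand-in command for dry runs.

```shell
QSUB=${QSUB:-qsub}    # hypothetical override, e.g. for testing without a batch system

# Submit the given job scripts as a chain: every script after the first
# gets "-W depend=afterok:<ID of the previous job>".
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            prev=$("$QSUB" "$script") || return 1
        else
            prev=$("$QSUB" -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$prev"    # report the job ID of each submitted job
    done
}
```

A call such as submit_chain step1.sh step2.sh step3.sh (file names are placeholders) would then print the three job IDs in submission order.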
There are several Torque commands for job inspection and control. The following table gives a short summary:
| Command | Purpose | Options |
|---|---|---|
| qstat [<options>] [<JobID>\|<queue>] | Displays information on jobs. Only the user's own jobs are displayed. For information on the overall queue status see the section on job priorities. | -a: display "all" jobs in user-friendly format; -f: extended job info; -r: display only running jobs |
| qdel <JobID> ... | Removes job from queue | - |
| qalter <qsub-options> <JobID> | Changes job parameters previously set by qsub. Only certain parameters may be changed after the job has started. | see qsub and the qalter manual page |
| qcat [<options>] <JobID> | Displays stdout/stderr from a running job | -o: display stdout (default); -e: display stderr; -f: output appended data as the job is running (like tail -f) |
Batch scripts
To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like estimated runtime and required number of nodes/CPUs can also be specified there (instead of on the command line):
#!/bin/bash -l
#
# allocate 16 nodes (64 CPUs) for 6 hours
#PBS -l nodes=16:ppn=4,walltime=06:00:00
#
# job name
#PBS -N Sparsejob_33
#
# stdout and stderr files
#PBS -o job33.out -e job33.err
#
# first non-empty non-comment line ends PBS options
# jobs always start in $HOME -
# change to a temporary job directory on $FASTTMP
mkdir ${FASTTMP}/$PBS_JOBID
cd ${FASTTMP}/$PBS_JOBID
# copy input file from location where job was submitted
cp ${PBS_O_WORKDIR}/inputfile .
# run
mpirun ${HOME}/bin/a.out -i inputfile -o outputfile
# save output on parallel file system
mkdir -p ${FASTTMP}/output/$PBS_JOBID
cp outputfile ${FASTTMP}/output/$PBS_JOBID
cd
# get rid of the temporary job dir
rm -rf ${FASTTMP}/$PBS_JOBID
The comment lines starting with #PBS are ignored by the shell
but interpreted by Torque as options for job submission (see above for
an options summary). These options can all be given on the qsub
command line as well. The example also shows the use of the
$FASTTMP and $HOME variables. $PBS_O_WORKDIR
contains the directory where the job was submitted. All batch scripts start executing
in the user's $HOME so some sort of directory change is always
in order.
If you have to load modules from inside a batch script, you can do so.
The only requirement is that you have to use either a csh-based shell
or bash with the -l switch, like in the example above.
Interactive Jobs
For testing purposes or when running applications that require some manual
intervention (like GUIs), Torque offers interactive access to the compute nodes
that have been assigned to a job. To do this, specify the -I
option to the qsub command and omit the batch script.
When the job is scheduled, you will get a shell on the master node
(the first in the assigned job node list). It is possible to
use any command, including mpirun, there. If you need
X forwarding, use the -X option in addition to
-I.
Note that the starting time of an interactive batch job cannot reliably
be determined; you have to wait for it to get scheduled. We therefore recommend
always running such jobs with a wallclock time limit of less than one hour, so
that the job will be routed to the devel queue, for which
a number of nodes is reserved during working hours.
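A command line for such an interactive job could be assembled as in the sketch below. The helper interactive_qsub_cmd is hypothetical; it only prints the command instead of running it and refuses walltimes of one hour or more. The ppn value has to match the cluster, e.g. 4 on Woody.

```shell
# Print a qsub command line for an interactive job with X forwarding.
# Arguments: number of nodes, ppn value of the cluster, walltime in minutes.
interactive_qsub_cmd() {
    nodes=$1
    ppn=$2
    minutes=$3
    if [ "$minutes" -lt 1 ] || [ "$minutes" -ge 60 ]; then
        echo "walltime must be between 1 and 59 minutes" >&2
        return 1
    fi
    printf 'qsub -I -X -l nodes=%d:ppn=%d,walltime=00:%02d:00\n' "$nodes" "$ppn" "$minutes"
}

# One Woody node for 30 minutes:
interactive_qsub_cmd 1 4 30
```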
Interactive batch jobs do not produce stdout and stderr
files. If you want a log of what happened, use e.g. the UNIX
script command.



