ATOMIN Cluster Info, News etc


Latest news:

After the power failure on Sunday 30.07 we lost our Infiniband connectivity due to the loss of the IB switch :-( I have moved all the IB traffic to our Ethernet network, but ... remember it is much slower, and MPI will now use the same hardware, so ... both file access and MPI jobs will suffer :-(

PLEASE be friendly to everybody and state in your PBS submission script the number of processors your job will be using on each machine, via the ppn parameter, as in "-l nodes=1:ppn=16". That way jobs that do not fill a whole machine may run together without degrading cluster and job performance :-)
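
For example, a minimal single-node submission script with the ppn declaration could look like the sketch below (the walltime and program name are only illustrative):
#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -l walltime=24:00:00
# go to the directory from which the job was submitted
cd $PBS_O_WORKDIR
# the program should use no more than the 16 requested cores
./my_program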

Happy computing from your admin - Roman.Marcinek@uj.edu.pl


General info

Cluster state - GANGLIA
Cluster state - *stat
The same information as below can be obtained in a terminal session via the info-en (English version) or info-deszno (Polish version) command. For an introduction to PBS (the queue system) see our other cluster's home page.

DESZNO

The system we log into from outside (deszno.if.uj.edu.pl) is the access node of our cluster, known internally as mgmt. This is a relatively small machine: only 2 four-core processors (Intel Xeon E5504 @ 2 GHz), which, besides providing external access, also serves as a monitoring node and the PBS (queue system) server. One can use it for small work such as editing and compiling, but nothing more! All real computing should be done either via PBS (the queue system) on nodes complex01 - complex07, or on complex08, which serves as a test machine (big compilations, test runs etc.).

We also have a second external access node, at the moment under the external name everest.if.uj.edu.pl (IP=149.156.64.23), devoted exclusively to serving our BIG /home partition to the rest of the cluster. The main tasks performed on this node should be transfers to or from the external world, although small tasks such as file editing are allowed. Provided that everything works as expected, queue system commands such as qstat/pestat will also be available there. In all cases please refrain from more intensive work on this node, as it too is a very small machine.

COMPUTING NODES

The computing power of the cluster is provided by 8 nodes: six of them (complex01 to complex06) are 96-core machines (4x4 Intel Xeon E7450 @ 2.4 GHz) equipped with 256 GB of RAM. The remaining two nodes (complex07 and complex08) are smaller: 64 cores (2x4 Intel Xeon X7550 @ 2.0 GHz) and 128 GB of RAM. The complex08 machine is available via ssh from both access nodes, whereas direct access to the other computing nodes is blocked for normal users; please use the queue system to run batch jobs on those nodes. The complex08 system serves mainly as a testing ground for programs, big compilations and other compute-intensive tasks. If needed, we will apply automatic killing of processes running for too long - so far this is not enforced, but beware of the possibility and do not attempt really long computations, as this prevents others from doing their testing. The load of the nodes can be checked either via our home page (ganglia) or with the pestat command.

DISK SPACE

Information about the available disk space can be obtained via the normal UNIX/Linux commands such as df. At some point df -h gave the following output - learn how to interpret it :-)
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda5              16G   13G  2.5G  84% /
/dev/sda8              39G   21G   16G  57% /data
/dev/sda7             7.8G  147M  7.3G   2% /tmp
/dev/sda3              31G  625M   29G   3% /var
/dev/sda2              31G  3.5G   26G  12% /usr
/dev/sda1             122M   34M   82M  30% /boot
tmpfs                 4.9G   12M  4.9G   1% /dev/shm
10.1.1.101:/          146T   11T  135T   8% /home
10.1.1.1:/home1       7.3T  5.5T  1.5T  80% /home1
10.1.1.2:/home2       7.3T  5.0T  1.9T  73% /home2
10.1.1.3:/home3       7.3T  2.2T  4.8T  31% /home3
10.1.1.4:/home4       7.3T  5.4T  1.6T  78% /home4
10.1.1.5:/home5       7.3T  5.7T  1.3T  83% /home5
10.1.1.6:/home6       7.3T  3.1T  3.9T  44% /home6
10.1.1.7:/home7       3.3T  199M  3.1T   1% /home7
10.1.1.8:/home8       3.7T  196M  3.5T   1% /home8
The user-available disk space comes in the /home and /homeX directories. The /homeX file systems are local disks of the complex0X machines; they are mounted via NFS on the remaining machines. The /home partition is a local file system of storage01 (the secondary access node) and is available via NFS to all other machines. For that reason, if a program needs really fast access to disk space, it is recommended to use the local disk of the node on which the program is running; if the transfers are not that big, then NFS transfer via Infiniband will probably be more than adequate. In the case of queue system jobs the execution node may not be known in advance, so one may use the following piece of code in the batch file to find the local disk: /home`hostname | sed s/complex0//` (the hostname command returns the name of the host on which the code runs and sed extracts the node number - the /homeX file systems are numbered accordingly).
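
A minimal sketch of such a batch-script fragment is shown below; the scratch directory layout under /homeX is only an illustration, not a site convention:
# pick the local /homeX disk of the node this job is running on
NODE_NR=`hostname | sed s/complex0//`
SCRATCH=/home${NODE_NR}/$USER/job_${PBS_JOBID}
mkdir -p $SCRATCH
cd $SCRATCH
# ... run the program here, then copy the results back to /home ...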

COMPILERS AND OTHER SOFTWARE STUFF

The same software is installed on all machines (except storage01), including the Intel Cluster Toolkit Compiler. It contains the Intel compilers (C/C++/Fortran), MKL - the Intel Math Kernel Library (containing, among other things, highly optimized lapack and fftw3) - and the Intel MPI implementation. Like most of the extra software, it is installed in the /opt directory (/opt/intel, in fact), so for example ifort (the Intel Fortran compiler) is available as /opt/intel/Compiler/11.1/075/bin/intel64/ifort. There are, of course, standard Linux GNU compilers in different versions available as well. A lot of other software, such as Maple, Mathematica, GSL etc., can also be found. If something is not installed, please ask the administrator.
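
A small example of invoking the compiler from that path (the source file name is illustrative, and the -mkl switch for linking against MKL should be checked against the installed compiler version):
/opt/intel/Compiler/11.1/075/bin/intel64/ifort -O2 -o my_prog my_prog.f90
# linking against MKL (lapack, fftw3 interfaces etc.):
/opt/intel/Compiler/11.1/075/bin/intel64/ifort -O2 -o my_prog my_prog.f90 -mkl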

OPENMP

All Intel (and some GNU) compilers support multi-threading via the OpenMP extension, which enables parallel (multithreaded) jobs within a single node - up to 96 (64) threads. Some trial-and-error testing is needed to check whether the most efficient set-up is achieved with somewhat fewer threads than the number of available computing cores (the operating system sometimes needs quite a lot of computing power itself, especially with active NFS connections). For some compilers it is necessary to point the dynamic loader to the proper library search path, for example for the Intel compilers: export LD_LIBRARY_PATH=/opt/intel/Compiler/11.1/075/lib/intel64/:$LD_LIBRARY_PATH
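
A minimal sketch of compiling and running an OpenMP program with the Intel Fortran compiler (the source file and thread count are illustrative):
export LD_LIBRARY_PATH=/opt/intel/Compiler/11.1/075/lib/intel64/:$LD_LIBRARY_PATH
/opt/intel/Compiler/11.1/075/bin/intel64/ifort -openmp -O2 -o omp_prog omp_prog.f90
# leave a few cores for the system, as suggested above
export OMP_NUM_THREADS=90
./omp_prog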

INTEL MPI

OpenMP is contained within a single node - if even more computing power is needed, one can run parallel jobs via MPI (Message Passing Interface), which should also perform well since our cluster is connected via both Ethernet and Infiniband, the latter delivering our fast interconnect. Intel MPI has been tested and found to behave quite well. It can be found in the /opt/intel/impi directory; the binaries are located in the /opt/intel/impi/3.2.2.006/bin64/ subdirectory. The mpirun command from that directory creates a ring of MPI daemons on predefined hosts and runs the specified MPI program. For jobs demanding more than a single machine, Intel MPI requires a file called mpd.hosts specifying the nodes on which the program should run. It should contain a list of nodes with the appropriate number of processes, in a form similar to:
complex01:96
complex02:96
complex03:96
complex04:96
complex05:96
complex06:96
complex07:64
This file needs to be created in the working directory of the program, i.e., the one from which mpirun is invoked. There is also an option to the mpirun command which allows using a file other than the default one. In the multinode case one also needs to define the communication channel; for Intel MPI and our cluster it is "-r ssh". Another parameter, useful for communication optimization, "-genv I_MPI_DEVICE rdssm", selects mixed (hybrid) communication between cores (shared memory + Infiniband). The full command running such an MPI job (still with Intel MPI) is:
/opt/intel/impi/3.2.2.006/bin64/mpirun -r ssh -genv I_MPI_DEVICE rdssm -np total_core_number ./program_name

PBS ADVICE: as the queue system assigns nodes dynamically, the mpd.hosts file cannot be created in advance. Instead, the system creates a file containing all the assigned nodes and passes its name in the environment variable PBS_NODEFILE - this file needs to be used instead of the default mpd.hosts or, alternatively, one may copy it to mpd.hosts within the batch script before running mpirun.
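
A minimal sketch of a batch script following that advice (the resource request and program name are only illustrative):
#!/bin/bash
#PBS -l nodes=2:ppn=96
#PBS -l walltime=48:00:00
cd $PBS_O_WORKDIR
# use the node list assigned by PBS instead of a hand-written mpd.hosts
cp $PBS_NODEFILE mpd.hosts
# total number of cores assigned to the job
NP=`wc -l < $PBS_NODEFILE`
/opt/intel/impi/3.2.2.006/bin64/mpirun -r ssh -genv I_MPI_DEVICE rdssm -np $NP ./program_name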

OTHER MPI SOFTWARE

All other MPI-related software (libraries etc.) may be listed (and also managed) via:
mpi-selector --list
The description of all that software is far too big to include here; for details contact the administrator and be prepared for LONG reading.
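
Typical usage looks roughly like the lines below (the flags other than --list are based on the standard mpi-selector tool and are given only as a hint):
mpi-selector --list          # show the available MPI stacks
mpi-selector --query         # show the currently selected default
mpi-selector --set <name>    # make <name> the default for new shells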

QUEUE SYSTEM (PBS)

At the moment the following queues are defined:
Q. name   Nodes   Cores   RAM available   Walltime
single    1       96      256 GB          5 days
double    2       192     512 GB          5 days
six       3-6     576     1.5 TB          3 days
small     1       64      128 GB          5 days
The "small" queue is served by complex07 only, whereas complex01 to complex06 serve all the other queues.