ATOMIN Cluster Hardware and Filesystem Info
The system we log in to from outside (deszno.if.uj.edu.pl) is an access node
to our cluster, known internally as mgmt. This is a relatively small machine:
only two quad-core processors (Intel Xeon E5504 @ 2 GHz). Besides providing
external access, it also serves as a monitoring node and a PBS (queue system)
server. One can use it for small tasks such as editing and compiling, but
nothing more! All real computing work should be done via Slurm on nodes
complex01 - complex06, all belonging to the bigone partition/queue, or
complex07/08, belonging to the small partition/queue.
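As an illustration, a minimal Slurm batch script targeting the bigone partition
could look roughly like the sketch below (job name, core count and program name
are placeholders; adjust them to your needs):

  #!/bin/bash
  #SBATCH --job-name=test_job      # placeholder job name
  #SBATCH --partition=bigone       # or: small (complex07/08)
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8        # example core count, adjust as needed
  ./my_program                     # placeholder for your executable

Submit it from the access node with: sbatch job.sh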
We also have a second external access node, at the moment known externally as
deszno2.if.uj.edu.pl (IP 149.156.64.23), devoted exclusively to serving our
BIG /home partition to the rest of the cluster. The main tasks performed on
this node should be transfers to or from the external world, although small
tasks such as file editing are allowed. Provided that everything works as
expected, queue system commands such as qstat/pestat will also be available
there. In all cases please refrain from more intensive work on this node, as
it is also a very small machine.
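For instance, files can be transferred through this node with standard tools
such as scp or rsync (the user name and paths below are placeholders):

  # copy a local file to your home directory on the cluster
  scp results.dat user@deszno2.if.uj.edu.pl:/home/user/
  # synchronize a whole directory, resuming interrupted transfers
  rsync -av --partial data/ user@deszno2.if.uj.edu.pl:/home/user/data/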
COMPUTING NODES
The computing power of the cluster is provided by 8 nodes: six of them
(complex01 to complex06) are 96-core machines (4x4 Intel Xeon E7450 @ 2.4 GHz)
equipped with 256 GB of RAM. The remaining two nodes (complex07 and complex08)
are smaller: 64 cores (2x4 Intel Xeon X7550 @ 2.0 GHz) and 128 GB of RAM; they
are working as they should after a rather long period out of service.
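The current state of the nodes and their resources can be checked with the
standard Slurm commands, for example:

  sinfo                         # partitions (bigone, small) and node states
  scontrol show node complex01  # CPU and memory details of a single node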
CONNECTIVITY
Quite some time ago we lost our big InfiniBand switch and acquired a new but
smaller one, to which we can connect only a single card of each machine. We
therefore gave up InfiniBand-based network file systems and switched to the
more reliable but slower GlusterFS over Ethernet connections. The InfiniBand
cards are now used exclusively for our MPI jobs.
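A multi-node MPI job could thus be submitted with a batch script along these
lines (a rough sketch only: node and task counts are examples, my_mpi_program
is a placeholder, and the exact launcher - mpirun vs srun - depends on the
locally installed MPI):

  #!/bin/bash
  #SBATCH --partition=bigone
  #SBATCH --nodes=2               # spread the job over two complex0X nodes
  #SBATCH --ntasks-per-node=16    # example value, adjust to your program
  mpirun ./my_mpi_program         # MPI traffic should go over the InfiniBand cards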
DISK SPACE
Information about available disk space can be obtained via normal UNIX/Linux
commands such as df. At some point df -h gave the following output - learn
how to interpret it :-)
FILESYSTEM         USED (=)  FREE (-)      %USED  AVAILABLE    TOTAL  MOUNTED ON
udev               [--------------------]     0%       4.9G     4.9G  /dev
tmpfs              [===-----------------]    10%     899.2M   999.2M  /run
/dev/sda2          [=====---------------]    23%      90.6G   117.6G  /
tmpfs              [--------------------]     0%       4.9G     4.9G  /dev/shm
tmpfs              [--------------------]     0%       5.0M     5.0M  /run/lock
tmpfs              [--------------------]     0%       4.9G     4.9G  /sys/fs/cgroup
tmpfs              [--------------------]     0%     999.2M   999.2M  /run/user/0
complex06-ib:home6 [====----------------]    16%       4.8T     5.7T  /home6
complex05-ib:home5 [==------------------]     6%       2.3T     2.4T  /home5
complex02-ib:home2 [==------------------]     7%       5.3T     5.7T  /home2
storage01-ib:home  [=-------------------]     1%      79.5T    80.0T  /home
complex01-ib:home1 [====================]    95%     269.8G     5.7T  /home1
complex03-ib:home3 [===============-----]    75%       1.4T     5.7T  /home3
complex04-ib:home4 [====================]    99%      54.9G     5.7T  /home4
The user-available disk space is provided in the /home and /homeX directories.
The /homeX file systems are local disks of the complex0X machines; for the
remaining machines they are mounted via GlusterFS. The /home partition is a
local file system of storage01 (the secondary access node) and is made
available via GlusterFS to all other machines.
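To check whether a given /homeX is local or GlusterFS-mounted on the node you
are on, the filesystem type reported by df is enough (the exact local type,
e.g. ext4 or xfs, depends on the installation):

  df -hT /home1 /home
  # Type "fuse.glusterfs" means a network (GlusterFS) mount;
  # a local disk shows up with its native type (e.g. ext4/xfs).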
For that reason, if a program needs really fast access to disk space, it is
recommended to use the local disk of the node on which the program is running.
If the transfers are not that big, GlusterFS transfer over 1 Gb Ethernet will
probably be more than adequate.
In the case of queue system jobs the execution node may not be known in advance,
so one may use the following piece of code in the batch file to find out the
local disk:
/home`hostname | sed s/complex0//`
(the hostname command returns the name of the host on which the code runs,
and sed extracts the node number - the /homeX file systems are numbered
accordingly).
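Building on that, a batch script could create a per-job scratch directory on
the local disk roughly as follows (a sketch only: the directory layout and the
use of $USER/$SLURM_JOB_ID are just one possible convention):

  LOCALHOME=/home`hostname | sed s/complex0//`   # e.g. /home3 on complex03
  SCRATCH=$LOCALHOME/$USER/job_$SLURM_JOB_ID     # hypothetical per-job scratch directory
  mkdir -p $SCRATCH
  cd $SCRATCH
  # ... run the program here, then copy the results back to /home ...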