ATOMIN Cluster Hardware and Filesystem Info
The system we log in to from outside (deszno.if.uj.edu.pl) is an access node
to our cluster, known internally as mgmt. This is a relatively small machine:
only two quad-core processors (Intel Xeon E5504 @ 2 GHz). Besides providing
external access, it also serves as a monitoring node and a PBS (queue system)
server. One can use it for small tasks such as editing and compiling, but
nothing more! All real computing work should be done via Slurm on nodes
complex01 - complex06, all belonging to the bigone partition/queue, or
complex07/08, belonging to the small partition/queue.
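As an illustration, a minimal Slurm batch script targeting the bigone partition
could look roughly like the sketch below (job name, core count and program name
are placeholders; adjust them to your needs):

  #!/bin/bash
  #SBATCH --job-name=test_job      # placeholder job name
  #SBATCH --partition=bigone       # or: small (complex07/08)
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8        # example core count, adjust as needed
  ./my_program                     # placeholder for your executable

Submit it from the access node with: sbatch job.sh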
We also have a second external access node, at the moment known externally as
deszno2.if.uj.edu.pl (IP 149.156.64.23), devoted exclusively to serving our
BIG /home partition to the rest of the cluster. The main tasks performed on
this node should be transfers to or from the external world, although small
tasks such as file editing are allowed. Provided that everything works as
expected, queue system commands such as qstat/pestat will also be available
there. In all cases please refrain from more intensive work on this node, as
it is also a very small machine.
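For instance, files can be transferred through this node with standard tools
such as scp or rsync (the user name and paths below are placeholders):

  # copy a local file to your home directory on the cluster
  scp results.dat user@deszno2.if.uj.edu.pl:/home/user/
  # synchronize a whole directory, resuming interrupted transfers
  rsync -av --partial data/ user@deszno2.if.uj.edu.pl:/home/user/data/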
COMPUTING NODES
The computing power of the cluster is provided by 8 nodes: six of them
(complex01 to complex06) are 96-core machines (4x4 Intel Xeon E7450 @ 2.4 GHz)
equipped with 256 GB of RAM. The remaining two nodes (complex07 and complex08)
are smaller: 64 cores (2x4 Intel Xeon X7550 @ 2.0 GHz) and 128 GB of RAM; they
are working as they should after a rather long period out of service.
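The current state of the nodes and their resources can be checked with the
standard Slurm commands, for example:

  sinfo                         # partitions (bigone, small) and node states
  scontrol show node complex01  # CPU and memory details of a single node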
CONNECTIVITY
Quite some time ago we lost our big InfiniBand switch and acquired a new but
smaller one, to which we can connect only a single card of each machine. We
therefore gave up InfiniBand-based network file systems and switched to the
more reliable but slower GlusterFS over Ethernet connections. The InfiniBand
cards are now used exclusively for our MPI jobs.
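A multi-node MPI job could thus be submitted with a batch script along these
lines (a rough sketch only: node and task counts are examples, my_mpi_program
is a placeholder, and the exact launcher - mpirun vs srun - depends on the
locally installed MPI):

  #!/bin/bash
  #SBATCH --partition=bigone
  #SBATCH --nodes=2               # spread the job over two complex0X nodes
  #SBATCH --ntasks-per-node=16    # example value, adjust to your program
  mpirun ./my_mpi_program         # MPI traffic should go over the InfiniBand cards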
DISK SPACE
Information about available disk space can be obtained via normal UNIX/Linux
commands such as df. At some point df -h gave the following output - learn
how to interpret it :-)
FILESYSTEM         USED (=)  FREE (-)      %USED  AVAILABLE    TOTAL  MOUNTED ON
udev               [--------------------]     0%       4.9G     4.9G  /dev
tmpfs              [===-----------------]    10%     899.2M   999.2M  /run
/dev/sda2          [=====---------------]    23%      90.6G   117.6G  /
tmpfs              [--------------------]     0%       4.9G     4.9G  /dev/shm
tmpfs              [--------------------]     0%       5.0M     5.0M  /run/lock
tmpfs              [--------------------]     0%       4.9G     4.9G  /sys/fs/cgroup
tmpfs              [--------------------]     0%     999.2M   999.2M  /run/user/0
complex06-ib:home6 [====----------------]    16%       4.8T     5.7T  /home6
complex05-ib:home5 [==------------------]     6%       2.3T     2.4T  /home5
complex02-ib:home2 [==------------------]     7%       5.3T     5.7T  /home2
storage01-ib:home  [=-------------------]     1%      79.5T    80.0T  /home
complex01-ib:home1 [====================]    95%     269.8G     5.7T  /home1
complex03-ib:home3 [===============-----]    75%       1.4T     5.7T  /home3
complex04-ib:home4 [====================]    99%      54.9G     5.7T  /home4
The user-available disk space is provided in the /home and /homeX directories.
The /homeX file systems are local disks of the complex0X machines; for the
remaining machines they are mounted via GlusterFS. The /home partition is a
local file system of storage01 (the secondary access node) and is made
available via GlusterFS to all other machines.
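To check whether a given /homeX is local or GlusterFS-mounted on the node you
are on, the filesystem type reported by df is enough (the exact local type,
e.g. ext4 or xfs, depends on the installation):

  df -hT /home1 /home
  # Type "fuse.glusterfs" means a network (GlusterFS) mount;
  # a local disk shows up with its native type (e.g. ext4/xfs).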
For that reason, if a program needs really fast access to disk space, it is
recommended to use the local disk of the node on which the program is running.
If the transfers are not that big, GlusterFS transfer over 1 Gb Ethernet will
probably be more than adequate.
In the case of queue system jobs the execution node may not be known in advance,
so one may use the following piece of code in the batch file to find out the
local disk:
/home`hostname | sed s/complex0//`
(the hostname command returns the name of the host on which the code runs,
and sed extracts the node number - the /homeX file systems are numbered
accordingly).
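Building on that, a batch script could create a per-job scratch directory on
the local disk roughly as follows (a sketch only: the directory layout and the
use of $USER/$SLURM_JOB_ID are just one possible convention):

  LOCALHOME=/home`hostname | sed s/complex0//`   # e.g. /home3 on complex03
  SCRATCH=$LOCALHOME/$USER/job_$SLURM_JOB_ID     # hypothetical per-job scratch directory
  mkdir -p $SCRATCH
  cd $SCRATCH
  # ... run the program here, then copy the results back to /home ...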