Understanding the shared file system and /scratch (2025-04)

The TCML cluster consists of 4 different types of nodes:
1. management node: DHCP, DNS, slurm control daemon, imaging of the nodes, user management, management of the shared file system, and other services
2. storage nodes (shared file system nodes) for data and metadata (4 servers)
3. login nodes without GPU (2 virtual machines with fewer resources and 1 physical server)
4. 40 calculating nodes with different GPU types (1080ti, 2080ti, A4000, L40S)

The shared file system (in our case: beegfs) provides the /home directory to node types 1, 3 and 4. Users can transfer and store their data in their home directories, so their data is available on node types 1, 3 and 4. Another part of beegfs provides the folder /common (also available on node types 1, 3 and 4), which can be used to store shared data, e.g. data sets and singularity images. beegfs is mounted on node types 1, 3 and 4 at /mnt/beegfs/; /home is a link to /mnt/beegfs/home/ and /common is a link to /mnt/beegfs/common/.

Every node is connected to a switch, all nodes with 10 GBit/s except the storage nodes, which are connected with 40 GBit/s.

When a user wants to run a neural network (NN) training, he/she has to submit a job to slurm. A user can also run a job on a login node, but the login nodes do not have a GPU :-(

During NN training, some data has to be read millions of times. Access over the network suffers from Ethernet latency and is much slower than access to an SSD on the same calculating node. The solution is to give slurm an additional command inside the job definition file (sbatch file): copy the working data to /scratch. To do this, you have to use a value which isn't known when submitting the job: the job ID, which is assigned when slurm starts the job, e.g.

$> cp -R /common/datasets/MNIST/ /scratch/$SLURM_JOB_ID/

(a complete sbatch sketch follows at the end of this section).

/scratch is a LOCAL file system on every calculating node. It is also available on the login nodes, but only for compatibility. On the calculating nodes, /scratch is part of the local root file system and offers:
calculating nodes with 4 1080ti: up to 1.8 TB on SATA SSDs with RAID10
calculating nodes with 4 A4000: up to 1.8 TB on SATA SSDs with RAID10
calculating nodes with 8 2080ti: up to 1.6 TB on NVMe SSD
calculating nodes with 8 L40S: up to 3.3 TB on NVMe SSD

If you have a "normal" dataset, then starting a job WITH /scratch delays the start of the NN training, because the data first has to be copied to EVERY involved calculating node, but afterwards the job runs much faster than a NN training without /scratch. If you have a dataset of e.g. 500 GB and you define jobs with 1 GPU, several such jobs can end up on the same 4-GPU calculating node and their copies will not fit into its /scratch. Copying a dataset with 40 million files (we had this situation in the past) is also not practical, because the copying time would be very long. So users have to think about which solution/strategy is the better one.

Sometimes users are not aware of this and start jobs without /scratch. If one person runs a job without using /scratch, then the job uses the shared file system for the COMPLETE runtime of the job. If many users run jobs* at the same time without using /scratch, then the shared file system slows down.

* or upload/copy/extract data

beegfs can deliver more than 1 GByte/s to a single node, strongly depending on file sizes.
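As referenced above, here is a minimal sketch of an sbatch file that stages a dataset to the node-local /scratch, trains on the local copy, and cleans up afterwards. Only the cp line with $SLURM_JOB_ID and the /common/datasets/MNIST/ path come from the example above; the resource requests are example values and train.py with its --data-dir option is a placeholder for your own training code.

#!/bin/bash
#SBATCH --job-name=nn-training     # arbitrary example name
#SBATCH --gres=gpu:1               # request 1 GPU
#SBATCH --cpus-per-task=4          # example value, adjust to your job
#SBATCH --mem=16G                  # example value, adjust to your job
#SBATCH --time=08:00:00            # example value, adjust to your job

# Optional: check how much space the node-local /scratch currently offers
df -h /scratch

# Stage the working data to the node-local /scratch.
# $SLURM_JOB_ID is only known once slurm has started the job.
SCRATCH_DIR=/scratch/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"
cp -R /common/datasets/MNIST/ "$SCRATCH_DIR/"

# Train against the local copy instead of the shared file system.
# train.py and its --data-dir option are placeholders for your own code.
python3 train.py --data-dir "$SCRATCH_DIR/MNIST"

# Free the local scratch space for the next jobs on this node.
rm -rf "$SCRATCH_DIR"

The cleanup at the end keeps the limited /scratch space free for the next jobs running on that node.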
To see the usage of beegfs, use:

$> beegfs-ctl --userstats --interval=1 --nodetype=storage --allstats --names --maxlines=20 | awk '{print $1,$22,$23,$26,$27}'

Without filtering (awk) you will get too many parameters.
--interval=1 --> update every second
--maxlines=20 --> show the 20 highest values

If you notice that a lot of users are putting heavy load on beegfs, you can write an email to tcml-contact@listserv.uni-tuebingen.de and we will inform those users.
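If you want to keep a record of what you observed (for example to attach it to such an email), one possibility, as a sketch, is to append the filtered output to a file; the log path /tmp/beegfs-usage.log is only an example:

$> beegfs-ctl --userstats --interval=1 --nodetype=storage --allstats --names --maxlines=20 \
   | awk '{print $1,$22,$23,$26,$27; fflush()}' \
   | tee -a /tmp/beegfs-usage.log

fflush() is added so that awk does not buffer its output on the way into tee; stop the command with Ctrl-C when you have seen enough.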