CS Compute Cluster

The school has a small compute cluster composed of the workstations in Trottier. You can submit jobs from mimi that run simultaneously on up to 32 GPU-equipped machines.

When you submit jobs to the cluster, the code that runs on each machine must be entirely self-contained, so the cluster is best used for experiments where you run the same piece of code repeatedly. An obvious example task would be tuning a hyperparameter of a machine learning model.

Basics of Using the Cluster

To begin using the cluster, first log in to mimi via ssh. mimi is the control host for the cluster, so all of the commands below are run from there.
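
For example, assuming your username is canton14 and that mimi resolves as a hostname from where you are connecting (you may need its fully qualified name instead), logging in looks like:

ssh canton14@mimi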

To get some information about the state of the cluster, use the sinfo command:

canton14@teach-vw2:~$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
teaching-gpu*    up    1:00:00      3  down* open-gpu-[1,5,8]
teaching-gpu*    up    1:00:00     12  drain open-gpu-[2-4,6-7,9-12,14-16]
teaching-gpu*    up    1:00:00     17   idle open-gpu-[13,17-32]

At this time, the teaching-gpu partition has 3 nodes down, 12 nodes in a "drained" state, and 17 nodes idle. Down nodes are powered off, drained nodes currently do not have enough free resources to be assigned new jobs (perhaps they are being used by someone else), and idle nodes are ready for you to use. The asterisk beside the teaching-gpu partition name indicates that it is the default partition. As we add more partitions, you may need to select the one you want when running commands.
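
For example, standard Slurm lets you restrict sinfo to a single partition with the -p option, which prints the same columns as above for just that partition:

canton14@teach-vw2:~$ sinfo -p teaching-gpu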

Since this cluster is shared amongst all CS students, there is a time limit of one hour on all jobs. If your job takes longer than that, it will be terminated early, so please ensure that your experiment will take less than an hour before running it on the cluster. Depending on usage, we may adjust the time limit.
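
If you know your experiment needs much less than the full hour, you can also request a shorter limit up front with Slurm's standard --time option (the limit you request must be at or below the partition's limit), for example:

canton14@teach-vw2:~$ srun -p teaching-gpu -N2 --time=00:15:00 /bin/hostname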

Running Jobs

Once you have found a partition that can run your job, you can launch it. As a simple first example, let's run the hostname command on 9 nodes. To do this, we will use the srun command on mimi.

canton14@teach-vw2:~$ srun -p teaching-gpu -N9 /bin/hostname
open-gpu-21
open-gpu-24
open-gpu-20
open-gpu-19
open-gpu-22
open-gpu-17
open-gpu-18
open-gpu-25
open-gpu-23

As you can see, all it took was prepending the srun command and some options to the command we wanted to run in order to distribute it across the nodes.

The -p teaching-gpu argument specifies the partition we want to use. Since teaching-gpu is the default (and, for now, only) partition, we could have omitted that argument. The -N9 argument specifies that we want our command to run on 9 nodes. If we only wanted it to run on 9 CPU cores (rather than 9 whole nodes), we could have used the -n9 option.

canton14@teach-vw2:~$ srun -p teaching-gpu -n9 /bin/hostname
open-gpu-17
open-gpu-17
open-gpu-17
open-gpu-17
open-gpu-17
open-gpu-18
open-gpu-18
open-gpu-18
open-gpu-18

The benefit of using -N is that each process gets all of the RAM and the GPU of its node to itself, whereas the benefit of using -n is that each machine has 8-12 cores, so you can run 8-12 times as many processes.
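
The two options can also be combined. For example, something like the following should spread 8 tasks over 2 nodes (exactly how the tasks are placed is up to the scheduler):

canton14@teach-vw2:~$ srun -p teaching-gpu -N2 -n8 /bin/hostname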

You can also reserve a set of nodes for interactive use with the salloc command, which takes the same arguments as srun. salloc starts a new shell on the control host; if you prefix commands with srun (without arguments), they will be run across the nodes you previously selected.

canton14@teach-vw2:~$ ps
  PID TTY          TIME CMD
 1229 pts/129  00:00:00 bash
 2453 pts/129  00:00:00 ps
canton14@teach-vw2:~$ salloc -N 10
salloc: Granted job allocation 31
canton14@teach-vw2:~$ ps
  PID TTY          TIME CMD
 1229 pts/129  00:00:00 bash
 1301 pts/129  00:00:00 salloc
 1303 pts/129  00:00:00 bash
 1319 pts/129  00:00:00 ps
canton14@teach-vw2:~$ hostname
teach-vw2
canton14@teach-vw2:~$ srun hostname
open-gpu-19
open-gpu-17
open-gpu-25
open-gpu-23
open-gpu-21
open-gpu-26
open-gpu-18
open-gpu-22
open-gpu-24
open-gpu-20

As you can see from the output of ps, the current shell is running as a subprocess of salloc. To quit salloc, just exit that shell as normal.
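
For example (the exact message salloc prints when the allocation is released may differ between Slurm versions):

canton14@teach-vw2:~$ exit
exit
salloc: Relinquishing job allocation 31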

How to Train a Model Across Nodes

Say you have a (very bad) model which takes an integer as input, adds some random value between zero and one, and returns that as the output. You could run it as a job array with sbatch, with each array task handling one input:

canton14@teach-vw2:~$ cat batch_job.py 
#!/usr/bin/env python3

import os
import random

# Slurm sets SLURM_ARRAY_TASK_ID to this task's index in the job array
jobid = os.getenv('SLURM_ARRAY_TASK_ID')

# the "model": add a random value in [0, 1) to the input integer
result = int(jobid) + random.random()

print(jobid, result)
canton14@teach-vw2:~$ sbatch --array=1-10 -N 10 batch_job.py
Submitted batch job 92
canton14@teach-vw2:~$ cat slurm-92_*
10 10.551001157018066
1 1.7942053382823158
2 2.781956597945983
3 3.9022921961241126
4 4.063291931356006
5 5.501764124355088
6 6.5673130218314775
7 7.08193661136367
8 8.412695441528129
9 9.234586397610121

The output of array task n is written to the file slurm-92_n.out, so you get one output file per task (ten in this example). You can change that behaviour using the -o option to sbatch.
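
For example, sbatch's output filename pattern understands %A (the array's master job ID) and %a (the array task ID), so something like the following should write one results_<jobid>_<taskid>.out file per task:

canton14@teach-vw2:~$ sbatch --array=1-10 -N 10 -o results_%A_%a.out batch_job.py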

This pattern would allow you to tune a hyperparameter of your model. You would just need to have your script train the model, and have the hyperparameter be set via the SLURM_ARRAY_TASK_ID environment variable. Unfortunately, this variable can only be an integer, so you will have to transform it as needed in your code.
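
For example, a sketch of such a script might map the integer task ID onto a grid of learning rates. The train_model function and the learning-rate grid below are placeholders for your own training code, not something provided by the cluster:

#!/usr/bin/env python3

import os

# Slurm sets SLURM_ARRAY_TASK_ID to this task's index (1-10 with --array=1-10)
task_id = int(os.getenv('SLURM_ARRAY_TASK_ID'))

# transform the integer index into the hyperparameter we actually want to try,
# e.g. learning rates 1e-1, 1e-2, ..., 1e-10
learning_rate = 10 ** -task_id

def train_model(lr):
    # placeholder: train your real model here and return a validation score
    return 0.0

print(task_id, learning_rate, train_model(learning_rate))

You would then submit it with sbatch --array=1-10 exactly as above and compare the scores in the resulting output files.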