GPU Job Workflow on MFCF Slurm Cluster

The MFCF cluster is managed by the Slurm resource manager, so you'll need to log in to the Slurm login/head node and use Slurm commands to run your jobs.

Below are the basic steps to get started.

Note: In the following instructions, the generic placeholder <uw-user-id> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.

Basic Workflow to Run Jobs on Slurm:

  1. Copy your project (code + data) to the cluster
  2. Log in to the Slurm head node
  3. Create a Slurm job script
  4. Submit your job using sbatch
  5. Monitor or cancel your job using Slurm commands

Step-by-Step Details

  1. Copy Files to the Cluster

    From a Linux/macOS terminal:

    Use the scp command to copy your data/code:
    scp -r my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/.
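
    If rsync is installed on both your machine and the cluster (an assumption worth checking; it is not required for this workflow), it can resume interrupted transfers and skip files that have not changed:

    rsync -av my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/my-project-dir/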

    From Windows:

    Use the WinSCP GUI tool, or run the same scp command from a Windows PowerShell terminal (after installing OpenSSH if needed):

    scp -r my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/.
  2. Log in to the Head Node
    ssh <uw-user-id>@rsubmit.math.private.uwaterloo.ca
    cd my-project-dir/
    

    NOTE: Never run heavy compute tasks on this node — use Slurm instead.
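
    Lightweight commands are fine on the head node, for example checking that the anaconda3 module used in the job script below actually exists (the module name comes from the example script; adjust it if the cluster names it differently):

    module avail anaconda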

  3. Create a Slurm sbatch Script

    MFCF's Slurm documentation describes the partitions and resources available on the cluster; check it before choosing the settings in the script below.
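
    You can also query the partitions and GPU resources directly from the head node. A minimal sketch, assuming standard sinfo format options (%P partition, %G generic resources such as GPUs, %l time limit, %D node count):

    sinfo -o "%P %G %l %D"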

    Below is an example my-script.sh (not complete: update the paths, settings, Python script name, and user ID as needed):

    #!/bin/bash
    #SBATCH --mail-user=<uw-user-id>@uwaterloo.ca
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --job-name="test2"
    #SBATCH --partition=gpu_p100
    #SBATCH --ntasks=5
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=2
    #SBATCH --time=0-12:00:00
    #SBATCH --mem-per-cpu=1G
    #SBATCH --gres=gpu:p100:1
    #SBATCH --output=%x_%j.out

    ## Load modules
    module load anaconda3

    ## Create the conda environment only if it has not been created already
    MY_CONDA_ENV="gpuenv2"

    CHK_ENV=$(conda env list | grep -w "$MY_CONDA_ENV" | awk '{print $1}')

    echo "CHK_ENV: $CHK_ENV"
    if [ -z "$CHK_ENV" ]; then
        # MY_CONDA_ENV does not exist yet
        echo "$MY_CONDA_ENV doesn't exist, creating it..."
        conda create --yes --name "$MY_CONDA_ENV" python=3.10 numba cudatoolkit -c conda-forge -c nvidia
    fi

    ## Activate the environment (if 'conda activate' fails in a batch job,
    ## 'source activate' may work instead, depending on how the anaconda3 module is set up)
    conda activate "$MY_CONDA_ENV"

    echo "Using python3: $(which python3)"
    echo ""

    python3 my-gpu-test.py
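
    Before submitting a long batch job, it can help to test interactively. A sketch using srun (the partition, GPU type, and limits below simply mirror the sbatch example above; adjust them to what is actually available to you):

    # Request an interactive shell on a GPU node, then check GPU visibility
    srun --partition=gpu_p100 --gres=gpu:p100:1 --cpus-per-task=2 --mem=2G --time=0-00:30:00 --pty bash
    nvidia-smi   # if installed on the GPU nodes, lists the allocated GPU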
    

  4. Use the sbatch Command to Submit the Job
    sbatch my-script.sh
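
    sbatch replies with a line like "Submitted batch job <job_id>". A small sketch for capturing the ID and following the job's output file (named test2_<job_id>.out by the --job-name and --output settings above), assuming the --parsable flag available in current Slurm releases:

    JOBID=$(sbatch --parsable my-script.sh)   # prints only the job ID
    tail -f "test2_${JOBID}.out"              # follow the output once the job starts writing it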

Monitor and Control Jobs

  1. Use the squeue command to check a job's status
    squeue -u <uw-user-id>
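    For jobs that are still pending, squeue can also report Slurm's estimated start time:
    squeue -u <uw-user-id> --start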
  2. Use the scancel command to cancel a job
    scancel <job_id>
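    To cancel all of your own jobs at once, scancel also accepts a user filter:
    scancel -u <uw-user-id>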
  3. Use the sinfo command to check cluster status
    sinfo
  4. Use the sacct command to check resource usage after job completion
    sacct --jobs=18121 --format="Submit,JobID%-15,JobName%20,Partition%20,NCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS"
    Replace 18121 with your actual job ID returned from sbatch.
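    If the seff utility is installed on the cluster (a common Slurm add-on, but its presence here is an assumption), it prints a compact CPU and memory efficiency summary for a finished job:
    seff <job_id>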
  5. Copy results back to your own UW directory

    After completion, transfer your project (code, data and results) to your home system:

    scp -r /work/<uw-user-id>/my-project-dir <uw-user-id>@linux.math.uwaterloo.ca:/u/<uw-user-id>/

Key Safety Reminders

  • Never run heavy compute jobs directly on the head node.
  • Always submit jobs with Slurm (sbatch, srun).
  • Always replace placeholders (<uw-user-id>) with your real UW user ID.

For more details, please refer to MFCF's Slurm documentation.