The MFCF cluster is managed using the Slurm resource manager. Therefore, you'll need to log in to the Slurm login/head node and use Slurm commands to run your jobs.
Below are the basic steps to get started.
Note: In the following instructions, the generic placeholder <userid> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.
Summary of Steps for Running Jobs on Slurm:
- Copy your project (code + data) to the cluster
- Log in to the Slurm head node
- Create a Slurm job script
- Submit your job using sbatch
- Monitor or cancel your job using Slurm commands
Step-by-Step Details
- Copy Files to the Cluster
  From a Linux/macOS terminal, use the scp command to copy your data/code:
    scp -r projectdir/ <userid>@rsubmit.math.private.uwaterloo.ca:~/
  From Windows, use the WinSCP GUI tool or run the same scp command from the Windows PowerShell terminal (after installing OpenSSH if needed):
    scp -r projectdir/ <userid>@rsubmit.math.private.uwaterloo.ca:~/
- Login to the Head Node
    ssh <userid>@rsubmit.math.private.uwaterloo.ca
    cd projectdir/
NOTE: Never run heavy compute tasks on this node — use Slurm instead.
- Create a Slurm sbatch Script
  The following links provide information about the resources available on the MFCF Slurm cluster:
  - For information about available Slurm partitions, check: Slurm partitions
  - For information about GPU partitions and the required --gres option, check: GPU job submission example
  - For per-node resource data, check: Cluster machines specifications

  Below is an example my-script.sh (not complete; update paths, settings, code script, and user ID as needed):

    #!/bin/bash
    #SBATCH --mail-user=<userid>@uwaterloo.ca
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --job-name="test2"
    #SBATCH --partition=gpu_p100
    #SBATCH --ntasks=5
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=2
    #SBATCH --time=0-12:00:00
    #SBATCH --mem-per-cpu=1G
    #SBATCH --gres=gpu:p100:1
    #SBATCH --output=%x_%j.out

    ## Load modules
    module load anaconda3

    ## Check if MYENV has already been created
    MYENV="gpuenv2"
    CHK_ENV=$(conda env list | grep $MYENV | awk '{print $1}')
    echo "CHK_ENV: $CHK_ENV"
    if [ "$CHK_ENV" = "" ]; then
        # if MYENV does not exist
        echo "$MYENV doesn't exist, create it..."
        conda create -y -n $MYENV numba cudatoolkit -c conda-forge -c nvidia
    fi

    conda activate $MYENV
    echo "echo: $(which python3)"
    echo ""
    python3 my-gpu-test.py
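  The script above ends by running my-gpu-test.py, which is not shown in this guide. Below is a minimal sketch of what such a test script could look like, assuming only the numba and cudatoolkit packages installed by the conda environment above; its contents are illustrative and not part of the MFCF documentation.

    # my-gpu-test.py -- illustrative sketch: checks that a CUDA GPU is usable via numba
    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_one(arr):
        # Each thread increments one element of the array
        i = cuda.grid(1)
        if i < arr.shape[0]:
            arr[i] += 1.0

    def main():
        if not cuda.is_available():
            print("No CUDA GPU detected")
            return
        cuda.detect()                                 # print a summary of the visible GPU(s)
        data = np.zeros(1024, dtype=np.float32)
        d_data = cuda.to_device(data)                 # copy input to the GPU
        threads = 128
        blocks = (data.size + threads - 1) // threads
        add_one[blocks, threads](d_data)              # launch the kernel
        result = d_data.copy_to_host()                # copy the result back
        print("Kernel ran OK:", bool((result == 1.0).all()))

    if __name__ == "__main__":
        main()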
- Use the sbatch command to submit the job:
    sbatch my-script.sh
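  If the submission succeeds, sbatch prints the ID of the new job, which you will need for the monitoring commands below. The job number shown here is only an example:
    Submitted batch job 18121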
Monitor and Control jobs
- Use the squeue command to check a job's status:
    squeue --me
- Use the scancel command to kill a job:
    scancel <job_id>
- Use the sinfo command to check the cluster status:
    sinfo
- Use the sacct command to check resource usage after job completion:
    sacct -j 18121 -o "JobName%20,Partition%20,NCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS"
  Replace 18121 with the actual job ID returned by the sbatch command. (For jobs that are still running, see the note below.)
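  The sacct command above reports accounting data for completed jobs. For a job that is still running, the standard Slurm sstat command can report similar statistics for the batch step; the fields below are an illustrative selection and may need adjusting for your job:
    sstat -j <job_id>.batch --format=JobID,MaxRSS,AveCPU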
- Copy results back to your own UW directory
  After completion, transfer your project (code, data, and results) to your home system:
    scp -r /work/<userid>/projectdir linux.math.uwaterloo.ca:/u/<userid>/
Key Safety Reminder
- Never run heavy compute jobs directly on the head node.
- Always submit jobs with Slurm (sbatch, srun); a brief interactive srun example is given below.
- Always replace placeholders (<userid>) with your real UW userid.
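If you need an interactive shell on a compute node (for example, to debug your environment), srun can allocate resources and start a shell for you. The partition name, resource amounts, and time limit below are placeholders; adjust them to the MFCF partitions linked above:
    srun --partition=<partition> --ntasks=1 --cpus-per-task=2 --mem=4G --time=1:00:00 --pty bash -i
Exit the shell when you are finished so the allocation is released.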
For more details, please refer to MFCF's Slurm documentation.