The MFCF cluster is managed using the Slurm resource manager, so you'll need to log in to the Slurm login (head) node and use Slurm commands to run your jobs.
Below are the basic steps to get started.
Note: In the following instructions, the generic placeholder <uw-user-id> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.
Basic Workflow to Run Jobs on Slurm:
- Copy your project (code + data) to the cluster
- Log in to the Slurm head node
- Create Slurm job script
- Submit your job using sbatch
- Monitor or cancel your job using Slurm commands
Step-by-Step Details
- Copy Files to the Cluster
From a Linux/macOS terminal:
Use the scp command to copy your data/code:
```bash
scp -r my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/.
```
From Windows:
Use the WinSCP GUI tool, or run the same scp command from a Windows PowerShell terminal (after installing OpenSSH if needed):
```bash
scp -r my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/.
```
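For large or repeated transfers from Linux/macOS (or WSL), rsync is a common alternative that can resume interrupted copies and skip unchanged files. This is an optional sketch, not part of the MFCF instructions, and assumes rsync is installed on your machine:
```bash
# Incremental, resumable copy of the project directory into your cluster home.
# -a preserves permissions and timestamps, -v is verbose, -P shows progress
# and keeps partially transferred files so the copy can resume.
rsync -avP my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/my-project-dir/
```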
- Log in to the Head Node
```bash
ssh <uw-user-id>@rsubmit.math.private.uwaterloo.ca
cd my-project-dir/
```
NOTE: Never run heavy compute tasks on this node — use Slurm instead.
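If you need to experiment interactively (for example, to debug an environment), you can request a short interactive session on a compute node instead of working on the head node. A minimal sketch; the partition name and resource values are placeholders to adjust for the MFCF partitions described in the next step:
```bash
# Request an interactive shell on a compute node; the session ends when you
# exit the shell or the time limit is reached.
srun --partition=<partition-name> --ntasks=1 --cpus-per-task=2 \
     --mem=4G --time=0-01:00:00 --pty bash
```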
- Create a Slurm sbatch Script
The following links provide information about the available resources on the MFCF Slurm cluster:
- For information about available Slurm partitions, check Slurm partitions
- For information about GPU partitions and the required --gres option, check: GPU job submission example
- For per-node resource data, check: Cluster machines specifications
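You can also summarize the partitions and their GPU (GRES) resources directly from the head node with sinfo. The column selection below is just one possibility:
```bash
# Show partitions with their time limit, node count, CPUs, memory (MB),
# and GRES (GPUs) per node.
sinfo -o "%15P %10l %6D %5c %8m %20G"
```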
Below is an example my-script.sh (not complete – update paths, settings, code script, and user ID as needed):
```bash
#!/bin/bash
#SBATCH --mail-user=<uw-user-id>@uwaterloo.ca
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --job-name="test2"
#SBATCH --partition=gpu_p100
#SBATCH --ntasks=5
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --time=0-12:00:00
#SBATCH --mem-per-cpu=1G
#SBATCH --gres=gpu:p100:1
#SBATCH --output=%x_%j.out

## Load modules
module load anaconda3

## Check if MY_CONDA_ENV has already been created
MY_CONDA_ENV="gpuenv2"
CHK_ENV=$(conda env list | grep $MY_CONDA_ENV | awk '{print $1}')
echo "CHK_ENV: $CHK_ENV"

if [ "$CHK_ENV" = "" ]; then
    # MY_CONDA_ENV does not exist, so create it
    echo "$MY_CONDA_ENV doesn't exist, create it..."
    conda create --yes --name $MY_CONDA_ENV python=3.10 numba cudatoolkit -c conda-forge -c nvidia
fi

conda activate $MY_CONDA_ENV

echo "echo: $(which python3) "
echo ""

python3 my-gpu-test.py
```
- Use the sbatch command to submit the job:
```bash
sbatch my-script.sh
```
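sbatch prints the new job ID when it accepts the script. If you want to reuse that ID in later commands, the --parsable option makes it easy to capture; a small sketch:
```bash
# --parsable makes sbatch print only the job ID (plus the cluster name, if any).
JOB_ID=$(sbatch --parsable my-script.sh)
echo "Submitted job $JOB_ID"

# The ID can then be reused, e.g. to watch only this job in the queue:
squeue -j "$JOB_ID"
```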
Monitor and Control jobs
- Use the squeue command to check a job's status:
```bash
squeue -u <uw-user-id>
```
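squeue also accepts a custom output format if you want more detail than the default columns; the selection below is just an example:
```bash
# Job ID, name, state, elapsed time, node count, and reason/node list.
squeue -u <uw-user-id> -o "%.10i %.20j %.8T %.10M %.6D %R"
```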
- Use the scancel command to kill a job:
```bash
scancel <job_id>
```
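scancel can also select jobs by attributes such as the owning user, which is handy for cancelling several jobs at once:
```bash
# Cancel all of your queued and running jobs (use with care).
scancel -u <uw-user-id>
```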
- Use the sinfo command to check the cluster status:
```bash
sinfo
```
- Use the sacct command to check resource usage after job completion:
```bash
sacct --jobs=18121 --format="Submit,JobID%-15,JobName%20,Partition%20,NCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS"
```
Replace 18121 with your actual job ID returned from sbatch.
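If you no longer have the job ID at hand, sacct can also list your recent jobs by user; a small sketch, with an arbitrary 7-day window:
```bash
# List all of your jobs that started in the last 7 days, with a compact format.
sacct -u <uw-user-id> --starttime=now-7days \
      --format="JobID,JobName,Partition,State,Elapsed,MaxRSS"
```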
- Copy Results Back to Your Own UW Directory
After completion, transfer your project (code, data and results) to your home system:
```bash
scp -r /work/<uw-user-id>/my-project-dir linux.math.uwaterloo.ca:/u/<uw-user-id>/
```
Key Safety Reminders
- Never run heavy compute jobs directly on the head node.
- Always submit jobs with Slurm (sbatch, srun).
- Always replace placeholders (<uw-user-id>) with your real UW user ID.
For more details, please refer to MFCF's Slurm documentation.