GPU Job Workflow on MFCF Slurm Cluster

The MFCF cluster is managed by the Slurm resource manager, so you'll need to log in to the Slurm login/head node and use Slurm commands to run your jobs.

Below are the basic steps to get started.

Note: In the following instructions, the generic placeholder <userid> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.

Summary of Steps for Running Jobs on Slurm:

  • Copy your project (code + data) to the cluster
  • Log in to the Slurm head node
  • Create a Slurm job script
  • Submit your job using sbatch
  • Monitor or cancel your job using Slurm commands

Step-by-Step Details

  • Copy Files to the Cluster

    From a Linux/macOS terminal:

    Use the scp command to copy your data and code:

    scp -r projectdir/ <userid>@rsubmit.math.private.uwaterloo.ca:~/.
     

    From Windows:

    Use the WinSCP GUI tool or run the same scp command from the Windows PowerShell terminal (after installing OpenSSH if needed).

    scp -r projectdir/ <userid>@rsubmit.math.private.uwaterloo.ca:~/.
  • Log in to the Head Node

    ssh <userid>@rsubmit.math.private.uwaterloo.ca
    cd projectdir/
    

    NOTE: Never run heavy compute tasks on this node — use Slurm instead.
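
    If you want to experiment interactively before submitting a batch job, request a shell on a compute node through Slurm rather than working on the head node. The example below is only a sketch (it assumes interactive jobs are permitted; the partition and GPU names are taken from the example script in the next step):

    srun --partition=gpu_p100 --gres=gpu:p100:1 --cpus-per-task=2 --mem=2G --time=1:00:00 --pty bash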

  • Create a Slurm sbatch Script

    For information about the resources available on the MFCF Slurm cluster (partitions, GPU types, memory and time limits), refer to MFCF's Slurm documentation.

    Below is an example my-script.sh (not complete – update paths, settings, the Python script name, and your user ID as needed):

    #!/bin/bash
    #SBATCH --mail-user=<userid>@uwaterloo.ca
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --job-name="test2"
    #SBATCH --partition=gpu_p100
    #SBATCH --ntasks=5
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=2
    #SBATCH --time=0-12:00:00
    #SBATCH --mem-per-cpu=1G
    #SBATCH --gres=gpu:p100:1
    #SBATCH --output=%x_%j.out
    
    ## Load modules
    module load anaconda3
    
    ## Create the conda environment MYENV if it does not already exist
    MYENV="gpuenv2"
    
    CHK_ENV=$(conda env list | grep "$MYENV" | awk '{print $1}')
    
    echo "CHK_ENV: $CHK_ENV"
    if [ -z "$CHK_ENV" ]; then
        # MYENV does not exist yet, so create it
        echo "$MYENV doesn't exist, creating it..."
        conda create -y -n "$MYENV" numba cudatoolkit -c conda-forge -c nvidia
    fi
    
    conda activate "$MYENV"
    
    echo "Using python3: $(which python3)"
    echo ""
    
    python3 my-gpu-test.py
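
    The last line of the script runs my-gpu-test.py, which stands in for your own Python program. If you just want to confirm that the environment and the GPU work, a minimal test file can be created on the head node as shown below. The contents are only a hypothetical example built on the numba package that the script installs:

    cat > my-gpu-test.py << 'EOF'
    # Minimal GPU check (hypothetical example, not an MFCF-provided script)
    import numpy as np
    from numba import cuda
    
    print("CUDA available:", cuda.is_available())
    cuda.detect()  # list the GPU(s) Slurm has assigned to the job
    
    # Trivial kernel: add 1 to every element of an array on the GPU
    @cuda.jit
    def add_one(x):
        i = cuda.grid(1)
        if i < x.size:
            x[i] += 1.0
    
    a = np.zeros(16, dtype=np.float32)
    d_a = cuda.to_device(a)
    add_one[1, 32](d_a)        # 1 block of 32 threads
    print(d_a.copy_to_host())  # expect all ones if the kernel ran
    EOF

    When the job runs, the output of these print statements appears in the job's .out file.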
    

  • Use the sbatch command to submit the job

    sbatch my-script.sh
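
    sbatch prints the ID of the newly submitted job (a line such as "Submitted batch job 18121"). The job's standard output goes to the file named by the --output=%x_%j.out directive in the script, i.e. <job_name>_<job_id>.out (test2_18121.out in this example). To follow it while the job runs:

    tail -f test2_<job_id>.out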

Monitor and Control Jobs

  • Use the squeue command to check a job's status

    squeue --me
  • Use the scancel command to cancel a job

    scancel <job_id>
  • Use the sinfo command to check cluster status

    sinfo
  • Use the sacct command to check resource usage after job completion

    sacct -j 18121 -o "JobName%20,Partition%20,NCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS"

    Replace 18121 with your actual job ID, as returned by the sbatch command.

  • Copy results back to your own UW directory

    After the job completes, transfer your project (code, data, and results) back to your own UW directory:

    scp -r /work/<userid>/projectdir <userid>@linux.math.uwaterloo.ca:/u/<userid>/

Key Safety Reminder

  • Never run heavy compute jobs directly on the head node.
  • Always submit jobs with Slurm (sbatch, srun).
  • Always replace placeholders (<userid>) with your real UW user ID.

For more details, please refer to MFCF's Slurm documentation.