GPU Job Workflow on MFCF Slurm Cluster

The MFCF cluster is managed by the Slurm resource manager, so you'll need to log in to the Slurm login/head node and use Slurm commands to run your jobs.

Below are the basic steps to get started.

Note: In the following instructions, the generic placeholder <uw-user-id> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.

Basic Workflow to Run Jobs on Slurm:

  1. Copy your project (code + data) to the cluster
  2. Log in to the Slurm head node
  3. Create a Slurm job script
  4. Submit your job using sbatch
  5. Monitor or cancel your job using Slurm commands

Step-by-Step Details

  1. Copy Files to the Cluster

    From a Linux/macOS terminal:

    Use the scp command to copy your data/code:
    scp -r my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/.
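
    If rsync is installed on both your machine and the cluster (an assumption worth checking; it is not required for this workflow), it can resume interrupted transfers and skip files that have not changed:

    rsync -av my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/my-project-dir/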

    From Windows:

    Use the WinSCP GUI tool, or run the same scp command from a Windows PowerShell terminal (after installing OpenSSH if needed):

    scp -r my-project-dir/ <uw-user-id>@rsubmit.math.private.uwaterloo.ca:~/.
  2. Log in to the Head Node
    ssh <uw-user-id>@rsubmit.math.private.uwaterloo.ca
    cd my-project-dir/
    

    NOTE: Never run heavy compute tasks on this node — use Slurm instead.
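
    Lightweight commands are fine on the head node, for example checking that the anaconda3 module used in the job script below actually exists (the module name comes from the example script; adjust it if the cluster names it differently):

    module avail anaconda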

  3. Create a Slurm sbatch Script

    MFCF's Slurm documentation describes the partitions and resources available on the cluster; check it before choosing the settings in the script below.
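
    You can also query the partitions and GPU resources directly from the head node. A minimal sketch, assuming standard sinfo format options (%P partition, %G generic resources such as GPUs, %l time limit, %D node count):

    sinfo -o "%P %G %l %D"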

    Below is an example my-script.sh (not complete: update the paths, settings, Python script name, and user ID as needed):

    #!/bin/bash
    #SBATCH --mail-user=<uw-user-id>@uwaterloo.ca
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --job-name="test2"
    #SBATCH --partition=gpu_p100
    #SBATCH --ntasks=5
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=2
    #SBATCH --time=0-12:00:00
    #SBATCH --mem-per-cpu=1G
    #SBATCH --gres=gpu:p100:1
    #SBATCH --output=%x_%j.out

    ## Load modules
    module load anaconda3

    ## Create the conda environment only if it has not been created already
    MY_CONDA_ENV="gpuenv2"

    CHK_ENV=$(conda env list | grep -w "$MY_CONDA_ENV" | awk '{print $1}')

    echo "CHK_ENV: $CHK_ENV"
    if [ -z "$CHK_ENV" ]; then
        # MY_CONDA_ENV does not exist yet
        echo "$MY_CONDA_ENV doesn't exist, creating it..."
        conda create --yes --name "$MY_CONDA_ENV" python=3.10 numba cudatoolkit -c conda-forge -c nvidia
    fi

    ## Activate the environment (if 'conda activate' fails in a batch job,
    ## 'source activate' may work instead, depending on how the anaconda3 module is set up)
    conda activate "$MY_CONDA_ENV"

    echo "Using python3: $(which python3)"
    echo ""

    python3 my-gpu-test.py
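
    Before submitting a long batch job, it can help to test interactively. A sketch using srun (the partition, GPU type, and limits below simply mirror the sbatch example above; adjust them to what is actually available to you):

    # Request an interactive shell on a GPU node, then check GPU visibility
    srun --partition=gpu_p100 --gres=gpu:p100:1 --cpus-per-task=2 --mem=2G --time=0-00:30:00 --pty bash
    nvidia-smi   # if installed on the GPU nodes, lists the allocated GPU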
    

  4. Use the sbatch Command to Submit the Job
    sbatch my-script.sh
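
    sbatch replies with a line like "Submitted batch job <job_id>". A small sketch for capturing the ID and following the job's output file (named test2_<job_id>.out by the --job-name and --output settings above), assuming the --parsable flag available in current Slurm releases:

    JOBID=$(sbatch --parsable my-script.sh)   # prints only the job ID
    tail -f "test2_${JOBID}.out"              # follow the output once the job starts writing it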

Monitor and Control Jobs

  1. Use the squeue command to check a job's status
    squeue -u <uw-user-id>
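    For jobs that are still pending, squeue can also report Slurm's estimated start time:
    squeue -u <uw-user-id> --start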
  2. Use the scancel command to cancel a job
    scancel <job_id>
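    To cancel all of your own jobs at once, scancel also accepts a user filter:
    scancel -u <uw-user-id>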
  3. Use the sinfo command to check cluster status
    sinfo
  4. Use the sacct command to check resource usage after job completion
    sacct --jobs=18121 --format="Submit,JobID%-15,JobName%20,Partition%20,NCPUS,State,ExitCode,Elapsed,CPUTime,MaxRSS"
    Replace 18121 with your actual job ID returned from sbatch.
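    If the seff utility is installed on the cluster (a common Slurm add-on, but its presence here is an assumption), it prints a compact CPU and memory efficiency summary for a finished job:
    seff <job_id>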
  5. Copy results back to your own UW directory

    After completion, transfer your project (code, data and results) to your home system:

    scp -r /work/<uw-user-id>/my-project-dir <uw-user-id>@linux.math.uwaterloo.ca:/u/<uw-user-id>/

Key Safety Reminders

  • Never run heavy compute jobs directly on the head node.
  • Always submit jobs with Slurm (sbatch, srun).
  • Always replace placeholders (<uw-user-id>) with your real UW user ID.

For more details, please refer to MFCF's Slurm documentation.