MPI Job Workflow on MFCF Slurm Cluster

This example shows a full pipeline for running an MPI (parallel CPU) job using Slurm on the MFCF cluster.

The MFCF cluster is managed by the Slurm resource manager, so you will need to log in to the Slurm login/head node and use Slurm commands to run your jobs. Below are the basic steps to get started.

Note: In the following instructions, the generic placeholder <uw_user_id> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.

Basic Workflow to Run Jobs on Slurm:

  1. Copy your project (code + data) to the cluster
  2. Log in to the Slurm head node
  3. Create Slurm job script
  4. Submit your job using sbatch
  5. Monitor or cancel your job using Slurm commands

Step-by-Step Details

  1. Copy your project files (code + data) to the cluster

    Transfer the MPI source file to a working directory on the cluster:

    scp my_test_mpi.c <uw_user_id>@rsubmit.math.private.uwaterloo.ca:/work/<uw_user_id>/my-project-dir/

    Note: you can also develop your code directly on the head node.
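
    If the target directory does not exist on the cluster yet, create it first. A minimal sketch, assuming the same /work path used in the scp command above:

    ssh <uw_user_id>@rsubmit.math.private.uwaterloo.ca "mkdir -p /work/<uw_user_id>/my-project-dir"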

  2. Log in to the Slurm head node
    ssh <uw_user_id>@rsubmit.math.private.uwaterloo.ca
    (Login uses your WatIAM credentials.)
  3. Change into the my-project-dir directory

    cd /work/<uw_user_id>/my-project-dir

  4. Create a Slurm batch script (my_mpi_job.sh)

    Below is an example my_mpi_job.sh (not complete; update paths, settings, the source file name, and your user ID as needed):

    #!/bin/bash
    #SBATCH --mail-user=<uw_user_id>@uwaterloo.ca
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --job-name="my_mpi_job"
    #SBATCH --partition=cpu_pr3
    #SBATCH --account=normal
    #SBATCH --time=00:01:00
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=6
    #SBATCH --mem-per-cpu=1GB
    #SBATCH --output=%x-%j.out
    #SBATCH --error=%x-%j.err
    
    module load mpi/openmpi-uw
    
    # compile parallel code 
    mpicc my_test_mpi.c -o my_mpi_test
    
    # execute parallel job
    srun --mpi=pmi2 ./my_mpi_test
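
    The request above asks for 4 nodes with 6 tasks per node, i.e. 24 MPI ranks, each with 1 GB of memory per CPU. Before submitting, you can check that the partition and MPI module named in the script exist on the cluster; the names below are taken from the example script and may differ for your account (standard Slurm and Environment Modules commands):

    sinfo -p cpu_pr3                      # show the nodes and state of the cpu_pr3 partition
    module avail 2>&1 | grep -i openmpi   # confirm the exact name of the OpenMPI module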
    

  5. Submit the job
    sbatch my_mpi_job.sh

    Expected response format:

    Submitted batch job 655
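
    If you only want the numeric job ID, for example to reuse it in a follow-up command, sbatch's --parsable option prints just the ID (a small sketch):

    JOBID=$(sbatch --parsable my_mpi_job.sh)
    echo "Submitted job $JOBID"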

Monitor and Control Jobs

  1. Monitor job status

    Check your job in the queue:

    squeue -j 655

    Typical status output includes:

    JOBID  PARTITION  NAME        USER          ST  TIME  NODES ...
    655    cpu_pr3    my_mpi_job  <uw_user_id>  R   1:11  4     ...
    
    ST = R means the job is currently running.
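
    A few other standard Slurm commands are useful for monitoring and control; the output file name below follows from --output=%x-%j.out in the example script:

    squeue -u <uw_user_id>        # list all of your pending and running jobs
    scancel 655                   # cancel job 655
    sacct -j 655                  # accounting summary after the job finishes (if accounting is enabled)
    tail -f my_mpi_job-655.out    # follow the job's output file as it runs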
  2. Copy results back to your own UW directory

    After completion, transfer your project (code, data and results) to your home system:

    scp -r /work/<uw_user_id>/my-project-dir <uw_user_id>@linux.math.uwaterloo.ca:/u/<uw_user_id>/
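
    For larger result sets, rsync (if installed on both machines) can resume interrupted transfers and copies only changed files; a hedged alternative to the scp above:

    rsync -av /work/<uw_user_id>/my-project-dir/ <uw_user_id>@linux.math.uwaterloo.ca:/u/<uw_user_id>/my-project-dir/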

Key Safety Reminder

  • Never run heavy compute or long parallel jobs directly on the head node.
  • Always submit them with Slurm (sbatch, srun).
  • Always replace placeholders (<uw_user_id>) with your real UW user ID.

For more details, please refer to MFCF's specialty research Linux servers documentation.