This example shows a full pipeline for running an MPI (parallel CPU) job using Slurm on the MFCF cluster.
The MFCF cluster is managed by the Slurm resource manager, so you need to log in to the Slurm login/head node and use Slurm commands to run your jobs. Below are the basic steps to get started.
Note: In the following instructions, the generic placeholder <uw_user_id> should be replaced with your actual UW user ID (8 characters or fewer). For example: nalamron.
Basic Workflow to Run Jobs on Slurm:
- Copy your project (code + data) to the cluster
- Log in to the Slurm head node
- Create Slurm job script
- Submit your job using sbatch
- Monitor or cancel your job using Slurm commands
Step-by-Step Details
- Copy your project files (code + data) to the cluster
Transfer the MPI source file to a working directory on the cluster:
scp my_test_mpi.c <uw_user_id>@rsubmit.math.private.uwaterloo.ca:/work/<uw_user_id>/my-project-dir/
Note: you can also develop your code directly on the head node.
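If your project consists of more than one file, you can copy the whole directory at once instead; a minimal sketch, assuming the same destination path as above:
scp -r my-project-dir <uw_user_id>@rsubmit.math.private.uwaterloo.ca:/work/<uw_user_id>/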
- Log in to the Slurm head node
ssh <uw_user_id>@rsubmit.math.private.uwaterloo.ca
(Login uses your WatIAM credentials.)
- cd to the my-project-dir directory
cd /work/<uw_user_id>/my-project-dir
- Create a Slurm batch script (my_mpi_job.sh)
Below is an example my_mpi_job.sh (not complete – update paths, settings, source file, and user ID as needed):
#!/bin/bash
#SBATCH --mail-user=<uw_user_id>@uwaterloo.ca
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --job-name="my_mpi_job"
#SBATCH --partition=cpu_pr3
#SBATCH --account=normal
#SBATCH --time=00:01:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=6
#SBATCH --mem-per-cpu=1GB
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

module load mpi/openmpi-uw

# compile parallel code
mpicc my_test_mpi.c -o my_mpi_test

# execute parallel job
srun --mpi=pmi2 ./my_mpi_test
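Optionally, you can have the script record what Slurm actually allocated before launching the MPI program; a small sketch using standard Slurm environment variables, which you could place after the module load line:
# optional: record the allocation Slurm granted to this job
echo "Job ${SLURM_JOB_ID} running on nodes: ${SLURM_JOB_NODELIST}"
echo "Total MPI ranks: ${SLURM_NTASKS}"  # here 4 nodes x 6 tasks per node = 24
These lines end up in the job's .out file and make it easy to confirm that the --nodes and --ntasks-per-node settings took effect.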
- Submit the job
sbatch my_mpi_job.sh
Expected response format:
Submitted batch job 655
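The job's standard output and error are written to files named from the --output and --error patterns in the script (%x expands to the job name, %j to the job ID), so for this example you would look for my_mpi_job-655.out and my_mpi_job-655.err in the submission directory, e.g.:
cat my_mpi_job-655.out
cat my_mpi_job-655.err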
Monitor and Control jobs
- Monitor job status
Check your job in the queue:
squeue --job=655
Typical status output includes:
JOBID  PARTITION         NAME        USER          ST  TIME  NODES ...
  655  cpu_mosaic_guest  my_mpi_job  <uw_user_id>  R   1:11      1 ...
ST = R means the job is currently running.
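If the job no longer appears in the queue, it has finished (or failed). You can also list all of your own jobs, or cancel one that is still queued or running, with the standard Slurm commands:
# list all of your own jobs
squeue -u <uw_user_id>
# cancel the example job by its ID
scancel 655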
- Copy results back to your own UW directory
After completion, transfer your project (code, data, and results) to your home system:
scp -r /work/<uw_user_id>/my-project-dir linux.math.uwaterloo.ca:/u/<uw_user_id>/
Key Safety Reminder
- Never run heavy compute or long parallel jobs directly on the head node.
- Always submit them with Slurm (sbatch, srun).
- Always replace placeholders (<uw_user_id>) with your real UW user ID.
For more details, please refer to MFCF's specialty research Linux servers documentation.