Example steps for using the Math teaching GPU cluster

This example assumes you already have a UW user account and a Slurm account on the teaching GPU cluster.

It shows how to access and use the teaching GPU cluster, and explains the steps needed to set up and run Python code as a GPU job. Follow similar procedures for other job types, such as OpenMP, MPI, and serial jobs.

The example assumes the source code was developed on a machine other than the cluster (for example, your own workstation). The user ID in this example is mathuser.

Copy files (if needed)

Copy the local source file (e.g. mnist-gpu.py) to the /work/mathuser/demo directory on the cluster.

$ scp mnist-gpu.py mathuser@tsubmit.math.private.uwaterloo.ca:/work/mathuser/demo

Log in to the head node

$ ssh mathuser@cpusubmit.math.private.uwaterloo.ca

Use your Nexus credentials to log in.

Create Slurm batch script

An example script (my-py-gpu-job.sh):

#!/bin/bash
#SBATCH --mail-user=mathuser@uwaterloo.ca
#SBATCH --job-name="mnist-test"
#SBATCH --partition=gpu-k80
#SBATCH --account=normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:20:00
#SBATCH --mem=8000
#SBATCH --gres=gpu:k80:1
#SBATCH --output=stdout-%j.log
#SBATCH --error=stderr-%j.log

# Set the environment for anaconda3
module load anaconda3/2023.07.1

# set conda environment name (hypothetical name; choose your own)
MY_CONDA_ENV="my-gpu-env"

# create the $MY_CONDA_ENV conda environment
conda create --yes --name $MY_CONDA_ENV \
-c anaconda numpy pandas tensorflow-gpu

# activate conda environment
conda activate $MY_CONDA_ENV

# run your code
srun python mnist-gpu.py
# or you can use
# python mnist-gpu.py
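
When a GPU job misbehaves, it helps to have a record in the log of what the job was actually allocated. The lines below are a minimal sketch that could be placed in the batch script before the srun line. The helper name report_allocation is hypothetical, and the sketch assumes that Slurm exports SLURM_JOB_ID inside the job and that the cluster's GPU configuration sets CUDA_VISIBLE_DEVICES for jobs requesting --gres=gpu; both are common but depend on the site setup.

```shell
# Sketch: log what the job was allocated before the real work starts.
# report_allocation is a hypothetical helper name, not a Slurm command.
report_allocation() {
    echo "Job ID: ${SLURM_JOB_ID:-not set}"
    echo "Node: $(uname -n)"
    # Assumption: the cluster's GPU setup exports CUDA_VISIBLE_DEVICES.
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-not set}"
    # nvidia-smi ships with the NVIDIA driver; skip it if unavailable.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi || echo "nvidia-smi returned an error"
    fi
}

report_allocation
```

The output ends up in stdout-%j.log alongside the output of mnist-gpu.py, so a "not set" value there is an early hint that the GPU request did not take effect.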

Submit Slurm job

$ sbatch my-py-gpu-job.sh
Submitted batch job 54

Note the job ID number to use in other commands.
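
When submissions are scripted, the job ID can be captured instead of copied by hand. The following is a minimal sketch that assumes the "Submitted batch job <id>" message format shown above; the helper name extract_job_id is hypothetical.

```shell
# Sketch: extract the numeric job ID from sbatch's confirmation message.
# extract_job_id is a hypothetical helper name, not a Slurm command.
extract_job_id() {
    # Expects text such as "Submitted batch job 54"; prints the fourth field.
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Submit only where sbatch exists and the batch script is present.
if command -v sbatch >/dev/null 2>&1 && [ -f my-py-gpu-job.sh ]; then
    JOB_ID=$(extract_job_id "$(sbatch my-py-gpu-job.sh)")
    echo "Submitted job $JOB_ID"
fi
```

As an alternative to parsing the message, sbatch has a --parsable option that prints the job ID without the surrounding text.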

Check job status

To check job status use squeue with the job ID number:

$ squeue --jobs=54
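
To wait for a job to finish before looking at its log files, squeue can be polled until the job is no longer listed. This is a rough sketch, not a site-recommended procedure: the helper name wait_for_job and the 30-second default interval are assumptions, and an empty squeue result is treated as "job finished", which would also happen if squeue itself failed temporarily.

```shell
# Sketch: block until a job no longer appears in the queue.
# wait_for_job is a hypothetical helper name; the 30 s default is arbitrary.
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    # --noheader suppresses the column titles, so empty output means the
    # job is no longer listed (errors for unknown IDs are discarded).
    while [ -n "$(squeue --noheader --jobs="$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Usage on the cluster, with the job ID reported by sbatch.
if command -v squeue >/dev/null 2>&1; then
    wait_for_job 54
    # File names follow the --output and --error patterns in the batch script.
    cat stdout-54.log stderr-54.log
fi
```

The log file names come from the stdout-%j.log and stderr-%j.log patterns in the batch script, where %j is replaced by the job ID.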