Introduction
This page describes how to submit batch and interactive jobs to the Slurm scheduler. Below are examples for GPU jobs and interactive jobs. Batch jobs are submitted with the sbatch command; interactive jobs are started with the srun or salloc commands.
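For example, a batch script (such as the GPU scripts shown below) can be submitted and monitored as follows; the script name gpu-job.sh is only a placeholder:
# submit the batch script to the scheduler
sbatch gpu-job.sh
# check the status of your queued and running jobs
squeue -u $USER
# cancel a job if needed (replace <jobid> with the ID reported by sbatch)
scancel <jobid>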
GPU jobs
For GPU jobs, first log in to tsubmit.math.private.uwaterloo.ca. Then submit the job to one of the available GPU partitions (e.g. the gpu-gtx1080ti partition). Below are examples for launching Python-based and CUDA-based code.
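For instance, after logging in you can list the partitions and the GPU resources (GRES) they offer before choosing one; a quick sketch, with <UWuserid> standing in for your own user ID:
# log in to the Slurm submit host
ssh <UWuserid>@tsubmit.math.private.uwaterloo.ca
# list partitions with their time limits, GRES (GPUs), and nodes
sinfo -o "%P %l %G %N"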
Launching Python GPU code on Slurm
To launch a GPU job, first request a GPU device from Slurm using the --gres option. Then write Python code that calls GPU functions. On the MFCF teaching Slurm cluster, running Python code involves three steps: create a conda environment (as shown in the example below), activate it, and then run your Python code.
Launching Python TensorFlow code on Slurm:
#!/bin/bash
#SBATCH --mail-user=<UWuserid>@uwaterloo.ca
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --job-name="mnist-test"
#SBATCH --partition=gpu-gen
#SBATCH --account=normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:20:00
#SBATCH --mem=8000
#SBATCH --gres=gpu:gtx1080ti:1
#SBATCH --output=stdout-%j.log
#SBATCH --error=stderr-%j.log
# Set the environment for anaconda3
module load anaconda3/2023.07.1
# set conda environment name
MY_CONDA_ENV="gpuenv1"
# create the $MY_CONDA_ENV conda environment...
conda create --yes --name $MY_CONDA_ENV \
-c anaconda numpy pandas tensorflow-gpu
# activate conda environment
conda activate $MY_CONDA_ENV
# run your code
srun python tf-mnist-gpu.py
# or you can use
# python tf-mnist-gpu.py
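To confirm that the job actually sees the allocated GPU, you could add a quick optional check to the script before running your own code (assuming TensorFlow 2.x is installed in the environment):
# optional: verify the GPU is visible inside the job
srun nvidia-smi
srun python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"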
Launching Python PyTorch code on Slurm:
#!/bin/bash
#SBATCH --mail-user=<UWuserid>@uwaterloo.ca
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --job-name="gpu-test"
#SBATCH --partition=gpu-gen
#SBATCH --account=normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00
#SBATCH --mem=2GB
#SBATCH --gres=gpu:gtx1080ti:1
#SBATCH --output=stdout-%x_%j.log
#SBATCH --error=stderr-%x_%j.log
# Set the environment for anaconda3
module load anaconda3/2023.07.1
# set conda environment name
MY_CONDA_ENV="torchenv2"
# create $MY_CONDA_ENV conda environment...
conda create --yes --name $MY_CONDA_ENV pytorch \
torchvision torchaudio pytorch-cuda=12.1 \
-c pytorch -c nvidia
conda activate $MY_CONDA_ENV
echo "echo: $(which python3) "
echo ""
# put your extra installs here
# using conda command (recommended), e.g.
# conda install --yes skorch
# or using the pip command, e.g.
# pip install skorch
srun python3 torch-mnist-gpu.py
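As with the TensorFlow example, you may want to confirm that PyTorch detects the allocated GPU before launching the main script; a minimal optional check:
# optional: verify that PyTorch detects the allocated GPU
srun python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"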
Launching CUDA-based GPU code on Slurm
As with the Python examples, request GPU resources using the --gres option. Then use the srun command to launch your executable (e.g. memtestG80), as shown in the script below.
#!/bin/bash
#SBATCH --mail-user=<UWuserid>@uwaterloo.ca
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --job-name=gpu-gen
#SBATCH --account=normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --partition=gpu-k80
#SBATCH --gres=gpu:k80:1
#SBATCH --output=stdout-%j.log
#SBATCH --error=stderr-%j.log
module load cuda/12.0.0
echo "== Start memory test ============"
srun ./memtestG80
Interactive jobs
Interactive jobs give you console access on the compute nodes, so you can work as if you were logged in to the node directly. Interactive jobs (sessions) are useful for work that requires direct user input, for example:
- Compiling your code, especially when the compute node architecture differs from the head node architecture. For example, it is best to compile CUDA code on a GPU machine whose GPU architecture matches the target device; in this case, create a Slurm interactive session on a GPU compute node (see the sketch after this list).
- Testing and debugging code.
- Running applications with a graphical user interface, such as X Window applications.
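As an illustration of the first case above, once an interactive session on a GPU node has been obtained (as described below), a compile step might look like the following sketch. The source file name, output name, and architecture flag are assumptions, not fixed requirements:
# load the CUDA toolchain on the GPU node
module load cuda/12.0.0
# compile a hypothetical CUDA source file for the gtx1080ti (compute capability 6.1)
nvcc -O2 -arch=sm_61 -o my_gpu_app my_gpu_app.cu
# run a quick test on the allocated GPU
./my_gpu_app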
To launch interactive jobs, use srun with the --pty option. The basic form of this command is:
srun --pty bash -i
The --pty option runs Slurm task zero in pseudo-terminal mode, and bash's -i option tells the shell to run interactively. With no resources explicitly specified, this job runs under the default Slurm settings: default account, default partition, and resource-allocation defaults such as the number of CPUs, memory size, and so on.
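Once the interactive shell starts, you can check what Slurm actually allocated to the session, for example:
# show the job ID of the current interactive session
echo $SLURM_JOB_ID
# show the full allocation details (partition, account, CPUs, memory, GRES)
scontrol show job $SLURM_JOB_ID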
Using srun command-line options, you can request any resources made available to you: for example, the --gres= option requests GPU devices, the --nodes= option specifies the number of nodes, the --partition= option selects the partition, and the --account= option selects the account. The command below demonstrates how to request an interactive session with one GPU device. In this case, the gpu-gen partition and the normal account are selected because this combination allows a GPU device to be allocated.
srun --gres=gpu:gtx1080ti:1 \
--partition=gpu-gen \
--account=normal --pty /bin/bash
Note: You may not get an interactive session immediately. After you execute the srun command, the job is queued, and the interactive session starts on the compute node(s) as soon as the requested resources become available.