Job submit commands with examples - teaching

Introduction

Slurm's main job submission commands are sbatch, salloc, and srun.

Note: Slurm does not automatically copy executable or data files to the nodes allocated to a job. The files must exist either on a local disk or in a global file system (e.g. NFS or CIFS). You can also use the sbcast command to transfer files to local storage on the allocated nodes.
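
For example, inside a job script sbcast can broadcast an input file to node-local storage before the parallel tasks are launched. A minimal sketch (input.dat and ./my_program are placeholder names):


# input.dat and ./my_program are placeholder names
# copy input.dat to /tmp on every node allocated to the job
sbcast input.dat /tmp/input.dat
# each task then reads its local copy
srun ./my_program /tmp/input.dat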

Command sbatch

Submit a job script for later execution.

Syntax


sbatch [options] job_script [args...]

Options

  • Slurm batch job options are usually embedded in a job script prefixed by #SBATCH directives
  • Slurm options can also be passed as command line options or by setting Slurm input environment variables
  • Options passed on the command line override the corresponding options set in the job script or in environment variables
  • See the Passing options to Slurm commands section below for more details
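
For example, assuming a job script that embeds #SBATCH --time=00:10:00 (as in the example below), the limit can be raised for a single run without editing the script, since the command line value takes precedence:


sbatch --time=00:30:00 job01.sh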

Job script

Usually a job script consists of two parts:

  • Slurm directives section:
    • identified by the #SBATCH directive at the beginning of each line in this section
    • this is where sbatch options are set. For reproducibility, use this section (instead of the command line or environment variables) to pass sbatch options. For legibility, use the long form of each option.
  • Job commands section:
    • commands in this section are executed on the allocated node resources. This section is written in the scripting language identified by the interpreter directive (e.g. #!/bin/bash).

Example

First, create a job script (e.g. job01.sh):


#!/bin/bash
#SBATCH --mail-user=userid@uwaterloo.ca
#SBATCH --mail-type=end,fail
#SBATCH --partition=gpu-k80
#SBATCH --job-name="test-job-01"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=3
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
#SBATCH --gres=gpu:k80:1
#SBATCH --output=stdout-%j.log
#SBATCH --error=stderr-%j.log

# Your job commands
srun --label nvidia-smi --query-gpu=gpu_name \
     --format=csv,noheader

Then submit the job script using the sbatch command:


sbatch job01.sh
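
On success, sbatch prints the id assigned to the job and returns immediately (the job id shown here is illustrative):


 Submitted batch job 168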

For more details on sbatch, read the man page (man sbatch) or visit the sbatch page on the SchedMD website.

Command salloc

  • allocate resources (e.g. nodes), possibly with a set of constraints (e.g. number of processors per node), for later utilization
  • typically used to allocate resources and spawn a shell, in which the srun command is then used to launch parallel tasks

Syntax


salloc [options] [command [args...]]

Options

See Passing options to Slurm commands section below for more details.

Example

In this example we use the salloc command to get an allocation and start a bash shell, then use srun to execute the nvidia-smi command.


$ salloc --partition=gpu-k80 --gres=gpu:k80:1 \
         --nodes=1 --ntasks-per-node=3 \
         /bin/bash
 salloc: Granted job allocation 168
 salloc: Waiting for resource configuration
 salloc: Nodes gpu-pt1-01 are ready for job

Now use srun to run the nvidia-smi --query-gpu=gpu_name --format=csv,noheader command on the resources allocated by salloc. The --label option of srun prepends each line of output with the remote task id.


$ srun --label nvidia-smi \
       --query-gpu=gpu_name --format=csv,noheader
 2: Tesla K80
 1: Tesla K80
 0: Tesla K80

To exit the salloc session, use the exit command:


$ exit
 exit
 salloc: Relinquishing job allocation 168
 salloc: Job allocation 168 has been revoked.

For more details on the salloc command, read the man page (man salloc) or visit the salloc page on the SchedMD website.

Command srun

  • Run a parallel job on a cluster managed by Slurm. If run directly from the command line, srun first creates a resource allocation in which to run the parallel job.
  • srun can also be run from within an sbatch script or a salloc shell

Syntax


srun [OPTIONS] executable [args...]

Options

See Passing options to Slurm commands section below for more details.

Example


$ srun --partition=gpu-k80 --gres=gpu:k80:1 \
       --nodes=1 --ntasks-per-node=3 \
       --label nvidia-smi --query-gpu=gpu_name \
       --format=csv,noheader

The output of the above example should look like:


 2: Tesla K80
 0: Tesla K80
 1: Tesla K80

For more details on the srun command, read the man page (man srun) or visit the srun page on the SchedMD website.

Passing options to Slurm commands

  • Command options can be passed in the following ways, listed in order of precedence:
    • On the command line
    • Input environment variables
    • In the job script (for the sbatch command), prefixed by the #SBATCH directive.

The table below shows the most commonly used options. All of them can be used with the sbatch command; srun and salloc support most of them.

Options for all job submit commands (sbatch, srun and salloc)
Option Description
--mail-user Mail address to contact job owner. Must specify a valid email address!
--mail-type When to notify a job owner: none, all, begin, end, fail, requeue, array_tasks
--account Which account to charge. Regular users don't need to specify this option.
--job-name Specify a job name
--time Expected runtime of the job. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes". For example: --time 7-12:30 means 7 days, 12 hours and 30 minutes
--mem-per-cpu Minimum memory required per allocated CPU in megabytes. Different units can be specified using the suffix [K|M|G]
--tmp Specify the amount of disk space that must be available on the compute node(s). The local scratch space for the job is referenced by the variable TMPDIR. Default units are megabytes. Different units can be specified using the suffix [K|M|G|T].
--ntasks Number of tasks (processes). Used for MPI jobs that may run distributed across multiple compute nodes
--nodes Request a certain number of nodes
--ntasks-per-node Specify how many tasks will run on each allocated node. Meant to be used with --nodes.
--cpus-per-task Number of CPUs per task (threads). Used for shared memory jobs that run locally on a single compute node
--partition Request a specific partition. The "all" partition is the default; any other partition must be requested explicitly with this option.
--dependency Defer the start of this job until the specified dependencies have been satisfied (see the example below this table). See man sbatch for a description of all valid dependency types
--hold Submit job in hold state. Job is not allowed to run until explicitly released
--immediate Only submit the job if all requested resources are immediately available
--exclusive Use the compute node(s) exclusively, i.e. do not share nodes with other jobs. CAUTION: Only use this option if you are an experienced user, and you really understand the implications of this feature. If used improperly, the use of this option can lead to a massive waste of computational resources
--constraint Request nodes with certain features. This option allows you to request a homogeneous pool of nodes for your MPI job
Options applying only to sbatch and srun
--output Redirect standard output. All directories specified in the path must exist before the job starts!
--error Redirect standard error. All directories specified in the path must exist before the job starts!
--test-only Validate the batch script and return the estimated start time considering the current cluster state
Options applying only to sbatch
--array Submit an array job. Use "%" to specify the max number of tasks allowed to run concurrently.
--chdir Set the working directory of the job. All relative paths used in the job script are relative to this directory. By default, it is the working directory of the calling process.
--parsable Print the job id only
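
As an example, --parsable and --dependency can be combined to chain two jobs, and --array with a "%" limit caps how many array tasks run at once (step1.sh, step2.sh and array_job.sh are placeholder script names):


# step1.sh, step2.sh and array_job.sh are placeholder script names
# run step2.sh only after step1.sh has completed successfully
jobid=$(sbatch --parsable step1.sh)
sbatch --dependency=afterok:${jobid} step2.sh

# submit a 10-task array job with at most 2 tasks running concurrently
sbatch --array=1-10%2 array_job.sh
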
  • Below are some of the input environment variables for the sbatch command and their equivalent command line options (see the example after the table below)
  • salloc uses similar environment variables prefixed with SALLOC_*; many sbatch environment variables have a salloc equivalent
    • for example, the SBATCH_ACCOUNT environment variable is equivalent to the SALLOC_ACCOUNT variable
  • srun input environment variables start with SLURM_* (e.g. SLURM_ACCOUNT)
Environment variable Equivalent option
SBATCH_ACCOUNT --account
SBATCH_ACCTG_FREQ --acctg-freq
SBATCH_ARRAY_INX --array
SBATCH_CLUSTERS or SLURM_CLUSTERS --clusters
SBATCH_CONN_TYPE --conn-type
SBATCH_CONSTRAINT --constraint
SBATCH_CORE_SPEC --core-spec
SBATCH_DEBUG --verbose
SBATCH_DELAY_BOOT --delay-boot
SBATCH_DISTRIBUTION --distribution
SBATCH_EXCLUSIVE --exclusive
SBATCH_EXPORT --export
SBATCH_GEOMETRY --geometry
SBATCH_GET_USER_ENV --get-user-env
SBATCH_GRES_FLAGS --gres-flags
SBATCH_HINT or SLURM_HINT --hint
SBATCH_IGNORE_PBS --ignore-pbs
SBATCH_IMMEDIATE --immediate
SBATCH_IOLOAD_IMAGE --ioload-image
SBATCH_JOBID --jobid
SBATCH_JOB_NAME --job-name
SBATCH_LINUX_IMAGE --linux-image
SBATCH_MEM_BIND --mem-bind
SBATCH_MLOADER_IMAGE --mloader-image
SBATCH_NETWORK --network
SBATCH_NO_REQUEUE --no-requeue
SBATCH_NO_ROTATE --no-rotate
SBATCH_OPEN_MODE --open-mode
SBATCH_OVERCOMMIT --overcommit
SBATCH_PARTITION --partition
SBATCH_POWER --power
SBATCH_PROFILE --profile
SBATCH_QOS --qos
SBATCH_RAMDISK_IMAGE --ramdisk-image
SBATCH_RESERVATION --reservation
SBATCH_REQUEUE --requeue
SBATCH_SIGNAL --signal
SBATCH_SPREAD_JOB --spread-job
SBATCH_THREAD_SPEC --thread-spec
SBATCH_TIMELIMIT --time
SBATCH_USE_MIN_NODES --use-min-nodes
SBATCH_WAIT --wait
SBATCH_WAIT_ALL_NODES --wait-all-nodes
SBATCH_WCKEY --wckey
SLURM_CONF The location of the Slurm configuration file
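
As an example, a default partition can be set for a shell session through an input environment variable instead of repeating --partition on every submission; a command line option still overrides it (job02.sh is a placeholder script with no #SBATCH --partition line):


# job02.sh is a placeholder script name
export SBATCH_PARTITION=gpu-k80
sbatch job02.sh      # equivalent to: sbatch --partition=gpu-k80 job02.sh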