Controlling and signalling jobs

Introduction

The two main Slurm commands for controlling jobs are scancel and scontrol. scancel cancels or signals jobs, while scontrol can hold, release, suspend, resume, and requeue them.

scancel

scancel is used to signal or cancel jobs, job arrays, or job steps.

Syntax

scancel [OPTIONS...] [job_id[_array_id][.step_id]] [job_id[_array_id][.step_id]...]
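
To make the identifier form job_id[_array_id][.step_id] concrete, the sketch below splits such an identifier into its parts. The helper parse_job_spec is a hypothetical name used only for illustration; it is not part of Slurm and does not call scancel.

```shell
#!/bin/sh
# Split a scancel-style identifier job_id[_array_id][.step_id] into its parts.
# parse_job_spec is a hypothetical helper, not a Slurm command.
parse_job_spec() {
    spec=$1
    step=""
    array=""
    # Everything after the first "." is the step ID.
    case $spec in
        *.*) step=${spec#*.}; spec=${spec%%.*} ;;
    esac
    # Everything after the first "_" is the array task ID.
    case $spec in
        *_*) array=${spec#*_}; spec=${spec%%_*} ;;
    esac
    echo "job=$spec array=$array step=$step"
}

parse_job_spec 1234      # whole job
parse_job_spec 1234.1    # step 1 of job 1234
parse_job_spec 1237_4    # array task 4 of job array 1237
```

The array and step fields are left empty when the identifier does not contain them, which mirrors the optional brackets in the syntax line.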

Common Options

Option        Description
--account     Restrict the scancel operation to jobs under this charge account.
--name        Restrict the scancel operation to jobs with this job name.
--partition   Restrict the scancel operation to jobs in this partition.
--state       Restrict the scancel operation to jobs in this state.
--user        Restrict the scancel operation to jobs owned by this user.
--batch       Signal only the batch step (the shell script), but not any other steps nor any children of the shell script.
--signal      The name or number of the signal to send. If this option is not used, the specified job or step will be terminated.

For more details, see the man page (man scancel) or use the --help option.

Examples

  • Cancel job 1234 along with all of its steps:
    scancel 1234
  • Send SIGTERM to steps 1 and 3 of job 1234:
    scancel --signal=TERM 1234.1 1234.3
  • Send SIGKILL to all steps of job 1235, but do not cancel the job itself:
    scancel --signal=KILL 1235
  • Send SIGUSR1 to the batch shell processes of job 1236:
    scancel --signal=USR1 --batch 1236
  • Cancel all pending jobs belonging to user "bob" in partition "debug":
    scancel --state=PENDING --user=bob --partition=debug
  • Cancel only array ID 4 of job array 1237:
    scancel 1237_4
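
A signal such as SIGUSR1 is only useful if the batch script handles it. The following is a minimal sketch of how a batch script might react to the signal sent by scancel --signal=USR1 --batch. The names on_usr1 and GOT_USR1 are hypothetical, and the script signals itself as a stand-in for scancel so that it can be run without Slurm.

```shell
#!/bin/bash
# Sketch of a batch script body that reacts to SIGUSR1.
# on_usr1 and GOT_USR1 are hypothetical names used for illustration.
GOT_USR1=0

on_usr1() {
    # A real job might save its state or write a checkpoint here.
    GOT_USR1=1
    echo "caught SIGUSR1"
}

# Register the handler for SIGUSR1.
trap on_usr1 USR1

# Stand-in for: scancel --signal=USR1 --batch <job_id>
# The script signals itself so the sketch runs without Slurm.
kill -USR1 $$

echo "GOT_USR1=$GOT_USR1"
```

Note that bash runs a trap only after the current foreground command has finished. A script that needs to react promptly would typically start its long-running work in the background and use the wait builtin, which returns as soon as a trapped signal arrives.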

scontrol

scontrol is used to control jobs, for example to hold, resume, and requeue them.

Syntax

scontrol [options] [command]

scontrol commands for job control

As the syntax shows, scontrol uses commands to control jobs. The commands relevant to job control are listed in the table below. Each of them must be followed by a job ID or a job list, for example: scontrol hold <job_list>. The job_list argument is a comma-separated list of job IDs OR "jobname=" with the job's name.

Command       Description
hold          Prevent a pending job from being started (sets its priority to 0).
release       Release a previously held job to begin execution.
suspend       Suspend a running job. Use the resume command to resume its execution. If a suspended job is requeued, it will be placed in a held state.
resume        Resume a previously suspended job.
requeue       Requeue a running, suspended or finished Slurm batch job into the pending state.
requeuehold   Requeue a running, suspended or finished Slurm batch job into the pending state and place it in a held state (priority zero).
uhold         Prevent a pending job from being started (sets its priority to 0). This command is designed for a system administrator to hold a job so that the job owner may release it, rather than requiring the intervention of a system administrator.
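
The job_list argument can be assembled from individual job IDs. The sketch below shows one way to do this. The helper make_job_list and the job name my_job are hypothetical, and the scontrol commands are printed rather than executed so that the sketch runs without Slurm.

```shell
#!/bin/sh
# Join job IDs into the comma-separated job_list form described above.
# make_job_list is a hypothetical helper, not a Slurm command.
make_job_list() {
    list=""
    for id in "$@"; do
        # Add a comma only when the list already has an entry.
        list="${list:+$list,}$id"
    done
    echo "$list"
}

# Print the commands instead of running them.
echo "scontrol hold $(make_job_list 1245 1246 1250)"
# The "jobname=" form selects jobs by name; my_job is a made-up name.
echo "scontrol release jobname=my_job"
```

Removing the echo and the surrounding quotes would run the commands for real on a system where Slurm is installed.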

Examples

  • To stop pending job 1245 from starting:
    scontrol hold 1245
  • To unhold job 1245 and allow it to start execution again:
    scontrol release 1245
  • To suspend a running job (jobid 1245):
    scontrol suspend 1245
  • To resume a suspended job (jobid 1245):
    scontrol resume 1245
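
The requeue and requeuehold commands follow the same pattern as the examples above. The sketch below wraps them in a dry-run helper so the commands can be inspected before they are executed. The names run_cmd and DRY_RUN are hypothetical and not part of Slurm.

```shell
#!/bin/sh
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
# run_cmd and DRY_RUN are hypothetical names used for illustration.
DRY_RUN=${DRY_RUN:-1}

run_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

job_id=1245   # example job ID, as used in the examples above

# Option 1: put the batch job back into the pending state.
run_cmd scontrol requeue "$job_id"

# Option 2: requeue the job in a held state, then release it
# once it should become eligible to run again.
run_cmd scontrol requeuehold "$job_id"
run_cmd scontrol release "$job_id"
```

With the default setting the script only prints the three commands. Setting DRY_RUN=0 in the environment makes run_cmd execute them instead.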

As with any other Slurm command, see the man page (e.g. man scontrol) for more detailed information. The --help option also prints a brief list of the command's options.