Introduction
The two main Slurm commands for controlling jobs are scancel and scontrol. scancel cancels or signals jobs, while scontrol can hold, release, suspend, resume, and requeue them.
scancel
scancel is used to signal or cancel jobs, job arrays, or job steps.
Syntax
scancel [OPTIONS...] [job_id[_array_id][.step_id]] [job_id[_array_id][.step_id]...]
Common Options
Option | Description |
---|---|
--account | Restrict the scancel operation to jobs under this charge account |
--name | Restrict the scancel operation to jobs with this job name |
--partition | Restrict the scancel operation to jobs in this partition |
--state | Restrict the scancel operation to jobs in this state |
--user | Restrict the scancel operation to jobs owned by this user |
--batch | Signal only the batch step (the shell script), but not any other steps nor any children of the shell script. |
--signal | The name or number of the signal to send. If this option is not used the specified job or step will be terminated. |
For more details, see the man page (man scancel) or use the --help option.
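The filter options above can be combined, in which case scancel only acts on jobs that match every filter. As a minimal sketch, the shell snippet below wraps such a call in two helper functions. The names `run` and `cancel_matching`, the `DRY_RUN` variable, and the job name `test_run` are assumptions made for illustration and are not part of Slurm. By default the snippet only prints the command, so it can be tried without affecting any jobs.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# Hypothetical helper: cancel jobs that match ALL three filters.
# Arguments: job state, partition name, job name.
cancel_matching() {
    local state="$1" partition="$2" name="$3"
    run scancel "--state=${state}" "--partition=${partition}" "--name=${name}"
}

# Show the command that would cancel pending jobs named "test_run" in partition "debug".
cancel_matching PENDING debug test_run
# prints: scancel --state=PENDING --partition=debug --name=test_run
```

Setting `DRY_RUN=0` in the environment makes the helper execute the command instead of printing it.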
Examples
- Cancel job 1234 along with all of its steps:
scancel 1234
- Send SIGTERM to steps 1 and 3 of job 1234:
scancel --signal=TERM 1234.1 1234.3
- Send SIGKILL to all steps of job 1235, but do not cancel the job itself:
scancel --signal=KILL 1235
- Send SIGUSR1 to the batch shell processes of job 1236:
scancel --signal=USR1 --batch 1236
- Cancel all pending jobs belonging to user "bob" in partition "debug":
scancel --state=PENDING --user=bob --partition=debug
- Cancel only array ID 4 of job array 1237:
scancel 1237_4
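The signalling and cancelling examples above can be chained, for instance to warn a job before cancelling it. The following is only a sketch under stated assumptions: the helper names `run` and `warn_then_cancel`, the `DRY_RUN` variable, and the 60-second grace period are illustrative and not part of Slurm, and what the job does on receiving SIGUSR1 depends entirely on the signal handling in its own batch script.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# Hypothetical helper: signal the batch shell of a job, wait, then cancel the job.
# Arguments: job ID, grace period in seconds (defaults to 30).
warn_then_cancel() {
    local jobid="$1" grace="${2:-30}"
    run scancel --signal=USR1 --batch "$jobid"   # SIGUSR1 to the batch shell only
    run sleep "$grace"                           # give the job time to react
    run scancel "$jobid"                         # cancel the job and all of its steps
}

# Dry run for job 1236 with a 60-second grace period.
warn_then_cancel 1236 60
```

In dry-run mode the three commands are printed in order without being executed, which makes it easy to review them before setting `DRY_RUN=0`.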
scontrol
scontrol is used to control jobs (e.g., to hold, resume, and requeue them).
Syntax
scontrol [options] [command]
scontrol commands for job control
As shown in the syntax, scontrol uses commands to control jobs. These commands are listed in the table below. Each command must be followed by a job ID or a job list, for example: scontrol hold <job_list>. The job_list argument is a comma-separated list of job IDs or, for commands such as hold, "jobname=" followed by the job's name.
Command | Description |
---|---|
hold | Prevent a pending job from being started (sets its priority to 0). |
release | Release a previously held job so that it can begin execution. |
suspend | Suspend a running job. Use the resume command to resume its execution. If a suspended job is requeued, it will be placed in a held state. |
resume | Resume a previously suspended job. |
requeue | Requeue a running, suspended, or finished Slurm batch job into the pending state. |
requeuehold | Requeue a running, suspended, or finished Slurm batch job into the pending state and place it in a held state (priority zero). |
uhold | Prevent a pending job from being started (sets its priority to 0). This command is designed for a system administrator to hold a job so that the job owner may release it, rather than requiring the intervention of a system administrator. |
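Because these commands take a comma-separated job list, several jobs can be handled in one call. The sketch below shows one way to build that list from individual job IDs. The helper names `run` and `job_control`, the `DRY_RUN` variable, and the job IDs are assumptions for illustration, not part of Slurm. The snippet prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$*"; fi
}

# Hypothetical helper: apply one scontrol job-control command to several job IDs.
# Arguments: command name from the table above, then one or more job IDs.
job_control() {
    local action="$1"; shift
    case "$action" in
        hold|release|suspend|resume|requeue|requeuehold|uhold) ;;
        *) echo "unsupported command: $action" >&2; return 1 ;;
    esac
    local job_list
    job_list=$(IFS=,; echo "$*")      # join the job IDs with commas in a subshell
    run scontrol "$action" "$job_list"
}

job_control hold 1245 1246 1247      # prints: scontrol hold 1245,1246,1247
job_control release 1245 1246 1247   # prints: scontrol release 1245,1246,1247
```

Joining the IDs inside a subshell keeps the change to `IFS` from leaking into the rest of the script.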
Examples
- To stop pending job 1245 from starting:
scontrol hold 1245
- To release held job 1245 so that it can be scheduled to run:
scontrol release 1245
- To suspend a running job (jobid 1245):
scontrol suspend 1245
- To resume a suspended job (jobid 1245):
scontrol resume 1245
As with any other Slurm command, see the man page (e.g., man scontrol) for more detailed information. The --help option also prints a brief list of the command's options.