Introduction
Slurm provides commands to obtain information about nodes, partitions, jobs, and job steps at different levels of detail. These commands are sinfo, squeue, sstat, scontrol, and sacct. The output of all these commands can be formatted using the --format (-o) or --Format (-O) option, and sorted using the --sort (-S) option. Man pages are available for all commands. Most command options support both a short form and a long form (e.g. -o <output_format> and --format=<output_format>); for readability, the long form is recommended.
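To illustrate these shared options, the sketch below combines --format and --sort for sinfo; the field specifiers used are examples only, not the full set (see the respective man pages for the complete lists):

```shell
# Custom sinfo columns: partition (%P), availability (%a), time limit (%l),
# node count (%D), state (%t); sorted by partition name (--sort=P).
sinfo --format="%P %a %l %D %t" --sort=P
```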
sinfo
Reports status information about nodes and partitions.
Syntax
sinfo [Options...]
Using command line options, the output can be filtered, sorted, and formatted. See the sinfo man page (man sinfo) for more detailed information, use --help or --usage for a short list of the command line options, or visit the sinfo page on the SchedMD website. Below is a summary of the output format as well as node and partition states.
sinfo command common options
Long form | Short | Description |
---|---|---|
--Node | -N | Print information in a node-oriented format with one line per node. The default is to print information in a partition-oriented format. |
--nodes | -n | Print information only about the specified node(s). Multiple nodes may be comma separated or expressed using a node range expression. |
--partition | -p | Print information only about the specified partition(s). Multiple partitions are separated by commas. |
--long | -l | Print more detailed information. |
--exact | -e | If set, do not group node information on multiple nodes unless their configurations to be reported are identical. |
--summarize | -s | List only a partition state summary with no node state details. |
--list-reasons | -R | List reasons nodes are in the down, drained, fail or failing state. |
Output format
By default, the sinfo command displays the following header fields:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
The table below describes these fields:
Header title | Description |
---|---|
PARTITION | Name of a partition. Default partition identified by "*" suffix. |
AVAIL | Partition state: up or down |
TIMELIMIT | Maximum time limit for any user job in days-hours:minutes:seconds |
NODES | Count of nodes with this particular configuration |
STATE | State of the nodes. Possible states include: allocated, completing, down, drained, draining, fail, failing, future, idle |
NODELIST | Names of nodes associated with the configuration/partition. |
sinfo examples
- Report basic node and partition configurations:

  ```
  $ sinfo
  PARTITION     AVAIL  TIMELIMIT   NODES  STATE  NODELIST
  gpu-gen*      up     30-00:00:0      3  idle   gpu-pt1-[01-03]
  gpu-k80       up     60-00:00:0      1  idle   gpu-pt1-01
  gpu-gtx1080ti up     60-00:01:0      2  idle   gpu-pt1-[02-03]
  ```
- Report partition summary information:

  ```
  $ sinfo -s
  PARTITION     AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
  gpu-gen*      up     30-00:00:0  0/3/0/3         gpu-pt1-[01-03]
  gpu-k80       up     60-00:00:0  0/1/0/1         gpu-pt1-01
  gpu-gtx1080ti up     60-00:01:0  0/2/0/2         gpu-pt1-[02-03]
  ```
- Report more complete information about a certain partition:

  ```
  $ sinfo --long --partition=gpu-k80
  Tue Sep 19 13:07:17 2023
  PARTITION  AVAIL  TIMELIMIT   JOB_SIZE  ROOT  OVERSUBS  GROUPS  NODES  STATE  NODELIST
  gpu-k80    up     60-00:00:0  1         no    FORCE:4   all     1      idle   gpu-pt1-01
  ```
- Report only those nodes that are in state idle:

  ```
  $ sinfo --state idle
  PARTITION     AVAIL  TIMELIMIT   NODES  STATE  NODELIST
  gpu-gen*      up     30-00:00:0      3  idle   gpu-pt1-[01-03]
  gpu-k80       up     60-00:00:0      1  idle   gpu-pt1-01
  gpu-gtx1080ti up     60-00:01:0      2  idle   gpu-pt1-[02-03]
  ```
- Report node-oriented information with details and exact matches:

  ```
  $ sinfo -Nel
  Tue Sep 19 13:10:13 2023
  NODELIST    NODES  PARTITION      STATE  CPUS  S:C:T   MEMORY  ...
  gpu-pt1-01      1  gpu-gen*       idle     28  2:14:1  128770  ...
  gpu-pt1-01      1  gpu-k80        idle     28  2:14:1  128770  ...
  gpu-pt1-02      1  gpu-gtx1080ti  idle     24  2:12:1  257810  ...
  gpu-pt1-02      1  gpu-gen*       idle     24  2:12:1  257810  ...
  gpu-pt1-03      1  gpu-gtx1080ti  idle     24  2:12:1  257810  ...
  gpu-pt1-03      1  gpu-gen*       idle     24  2:12:1  257810  ...
  ```
- Report only down, drained and draining nodes and their reason field:

  ```
  $ sinfo -R
  REASON                USER   TIMESTAMP            NODELIST
  Not responding        root   2023-06-20T06:49:17  gpu-pt1-02
  Hardware failure,ETA  slurm  2023-07-20T06:25:05  gpu-pt1-03
  ```
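Beyond the default columns, sinfo's long-name field specifiers can build custom node reports. The following sketch (field names from the sinfo man page; the widths are illustrative) prints one line per node with CPU and memory details:

```shell
# --Format takes long field names, optionally with a :width suffix.
# -N switches to node-oriented output (one line per node).
sinfo -N --Format=nodelist:20,partition:15,cpus:8,memory:10,statelong:12
```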
squeue
Use the squeue command to get a high-level overview of all active (running and pending) jobs in the cluster.
Syntax
$ squeue [options]
These commonly-used options filter the output of the squeue command.
option | Description |
---|---|
--user=user_list | Request job data for a user or a comma-separated list of users. |
--jobs=job_list | Request data based on specific job_list. job_list can be a single job ID or a comma-separated list of job IDs |
--partition=part_list | Get information on jobs running on a partition or a comma-separated list of partitions |
--states=state_list | Display data on jobs in specific states. state_list can be a single state, "all", or a comma-separated list of states. Default: "PD,R,CG" |
squeue command output table header
The headers of the squeue command's default output are:
Header title | Description |
---|---|
JOBID | Job or step ID. For array jobs, the job ID format will be of the form <job_id>_<index> |
PARTITION | Partition of the job/step |
NAME | Name of the job/step |
USER | Owner of the job/step |
ST | State of the job/step. See below for a description of the most common states |
TIME | Time used by the job/step. Format is days-hours:minutes:seconds |
NODES | Number of nodes allocated to the job, or the minimum number of nodes required by a pending job |
NODELIST(REASON) | For pending and failed jobs, this field displays the reason for pending or failure. Otherwise, it shows the list of allocated nodes. See below for a list of the most common reason codes |
You can easily tailor the output format of squeue to your own needs using the --format (-o) or --Format (-O) options. See the man page for more information: man squeue
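As an illustration, the sketch below reproduces a layout close to squeue's default output using short-form field specifiers (codes from the squeue man page; the widths are illustrative):

```shell
# %i job ID, %P partition, %j name, %u user, %t state, %M elapsed time,
# %D node count, %R nodelist or pending reason; ".N" right-justifies to width N.
squeue --format="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
```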
Job States
During its lifetime, a job passes through several states. The most common states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
State symbol | Description |
---|---|
PD | Pending. Job is waiting for resource allocation |
R | Running. Job has an allocation and is running |
S | Suspended. Execution has been suspended and resources have been released for other jobs |
CA | Cancelled. Job was explicitly cancelled by the user or the system administrator |
CG | Completing. Job is in the process of completing. Some processes on some nodes may still be active |
CD | Completed. Job has terminated all processes on all nodes with an exit code of zero |
F | Failed. Job has terminated with non-zero exit code or other failure condition |
REASON column
The REASON column of the squeue output gives you a hint why your job is not running.
Reason code | Description |
---|---|
AssociationJobLimit | The job's association has reached its maximum job count. |
AssociationResourceLimit | The job's association has reached some resource limit. |
AssociationTimeLimit | The job's association has reached its time limit. |
Dependency | This job is waiting for a dependent job to complete. |
InvalidAccount | The job's account is invalid. |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
NodeDown | A node required by the job is down. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
PartitionDown | The partition required by this job is in a DOWN state. |
PartitionNodeLimit | The number of nodes required by this job is outside of its partition's current limits. |
PartitionTimeLimit | The job's time limit exceeds its partition's current time limit. |
Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
Resources | The job is waiting for resources to become available. |
SystemFailure | Failure of the Slurm system, a file system, the network, etc. |
TimeLimit | The job exhausted its time limit. |
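For pending jobs, the reason code can be combined with squeue's --start option, which reports the scheduler's expected start time for each pending job (a sketch; the estimate depends on the scheduler's current plan):

```shell
# Expected start times and reasons for the current user's pending jobs.
squeue --me --start
```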
squeue command examples
- List all jobs of the current user:

  ```
  squeue --me
  ```
- List all pending and running jobs of user jsmith:

  ```
  squeue --user=jsmith --states=PD,R
  ```
- List all currently running jobs of user jsmith in partition gpu-k80:

  ```
  squeue --user=jsmith --partition=gpu-k80 --states=R
  ```
- Print the job steps in the gpu-k80 partition, sorted by user:

  ```
  squeue -s -p gpu-k80 -S u
  ```
- Print information only about jobs 12345, 12346, and 12348:

  ```
  squeue --jobs 12345,12346,12348
  ```
- Print information only about job step 65552.1:

  ```
  squeue --steps 65552.1
  ```
scontrol
Use the scontrol show command to collect information about nodes, partitions, jobs, and job steps.
Syntax
scontrol [options] show entity=entityID (or entity entityID)
Some useful options are --all (-a), --details (-d), and --verbose (-v). Examples of entities are node, partition, job, and step.
Examples
- Show detailed information about the job with ID 500:

  ```
  scontrol --details show job=500
  ```

- Show even more detailed information about the job with ID 500 (including the job script):

  ```
  scontrol -dd show job 500
  ```
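The same show subcommand works for the other entities as well; for example, reusing the node and partition names from the sinfo output earlier on this page:

```shell
# Show configuration and state of a single node
scontrol show node gpu-pt1-01

# Show the definition and limits of a partition
scontrol show partition gpu-k80
```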
sacct
Display accounting data for all jobs and job steps in the Slurm job accounting log or the Slurm database.
Syntax
sacct [options]
Common options
option | description |
---|---|
--endtime=end_time | Select jobs in any state before the specified time. |
--starttime=start_time | Select jobs in any state after the specified time. |
--state=state_list | Select jobs based on their state during the time period given. |
By default, sacct reports jobs owned by the current user, in any state, since 00:00:00 of the current day. To select older jobs, you must specify a start time with the --starttime option. When the --state option is specified and no start time is provided, the default start and end time are both the current time ('now'), so only currently running jobs are displayed.
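Putting these options together, the sketch below selects the current user's jobs that completed or failed since a given date, with a custom column selection (field names per the sacct man page; the date is illustrative):

```shell
# Jobs since 1 September 2023 that ended in COMPLETED or FAILED state.
sacct --starttime=2023-09-01 \
      --state=COMPLETED,FAILED \
      --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```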