Slurm partitions

Partitions in Slurm can be considered as a resource abstraction. A partition configuration defines job limits and access controls for a group of nodes. Slurm allocates resources to a job within the selected partition by taking into consideration the job's requested resources and the partition's available resources and restrictions.

Thirteen partitions are currently configured:

MFCF CPU partitions
- cpu_pr1 : For running jobs on the hpc-pr2 cluster
- cpu_pr3 : For running jobs on the hpc-pr3 cluster
MFCF GPU partitions
- gpu_p100 : For running jobs on the Pascal 100 GPU server
- gpu_a100 : For running jobs on the Ampere 100 GPU server
- gpu_h100 : For running jobs on the Hopper 100 GPU server
- gpu_l40s : For running jobs on the Ada Lovelace L40S GPU server
hagrid cluster partitions
- hagrid_batch : For running batch jobs on the Hagrid cluster
- hagrid_interactive : For interactive jobs on the Hagrid cluster
barrio1 partition
- barrio1 : For running jobs on the barrio1 machine
mosaic cluster partitions: The Mosaic cluster was purchased by two CFI projects and was originally operated by SHARCNET. The owners have contributed this cluster to MFCF for use by the Faculty of Mathematics (other than SCS). The owners and their collaborators retain higher priority for use of the cluster.
- cpu_mosaic_owner : high priority CPU jobs
- cpu_mosaic_guest : low priority CPU jobs
- gpu_k20_mosaic_owner : high priority GPU jobs
- gpu_k20_mosaic_guest : low priority GPU jobs

Details of each partition's resources are shown in the tables below. Resources may be adjusted according to observed usage patterns. The computational resources of different partitions may also overlap. For example, cpu_mosaic_owner and cpu_mosaic_guest share the same computational resources; they differ in job limits, access controls and job priorities.

Useful commands

For detailed information on all partitions on a cluster
scontrol show partition
For information on a specific partition
scontrol show partition <partition-name>
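
For a more compact view than scontrol gives, sinfo can print one line per partition. The helper below is only a sketch: the function name partition_summary is hypothetical, and the chosen columns (partition, time limit, node count, CPUs per node, memory per node in MB, node list) are one possible selection of sinfo format fields.

```shell
# Print a one-line summary per partition using sinfo.
# partition_summary is a hypothetical helper name, not a Slurm command.
partition_summary() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a machine with the Slurm client tools" >&2
        return 1
    fi
    # %P partition, %l time limit, %D nodes, %c CPUs per node,
    # %m memory per node (MB), %N node list
    if [ -n "$1" ]; then
        sinfo --partition="$1" --format="%P %l %D %c %m %N"
    else
        sinfo --format="%P %l %D %c %m %N"
    fi
}
```

Calling partition_summary with no argument lists every partition; partition_summary cpu_pr1 restricts the output to one partition.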

Partition details

cpu_pr1 partition: for running production jobs on the hpc-pr2 cluster.

cpu_pr1 partition resources
Partition name	cpu_pr1
Total available memory	512 GB
Max Cores	96 cores
Threads per core	2 Threads
Total GPU devices	0
GPU memory per device	0 GB
Compute Nodes	hpc-pr2-[01-08]

cpu_pr1 partition per job resource limits
Max runtime (h)	180 hours
Max Nodes	6 Nodes
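
As an illustration of staying within these limits, the following sketch writes a small batch script that selects the cpu_pr1 partition. The file name, job name, node count, task count and run time are placeholder values chosen for the example, not site recommendations.

```shell
# Write an example job script; the quoted 'EOF' keeps $VARIABLES unexpanded.
cat > cpu_pr1_job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name, used in the output file name below.
#SBATCH --job-name=example
#SBATCH --partition=cpu_pr1
# Stay at or below the 6 node per-job limit.
#SBATCH --nodes=2
# Placeholder task count.
#SBATCH --ntasks-per-node=4
# One day, well under the 180 hour limit.
#SBATCH --time=1-00:00:00
# %x is the job name, %j is the job ID.
#SBATCH --output=%x-%j.out

echo "Job $SLURM_JOB_ID in partition $SLURM_JOB_PARTITION on $SLURM_JOB_NODELIST"
# Replace with your own program.
srun hostname
EOF

# Submit it with: sbatch cpu_pr1_job.sh
```

Requests that exceed a partition's per-job limits are typically rejected or left pending, so it helps to keep the requested nodes and time inside the values in the table above.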

cpu_pr3 partition: for running production jobs on the hpc-pr3 cluster.

cpu_pr3 partition resources
Partition name	cpu_pr3
Total available memory	1024 GB
Max Cores	256 cores
Threads per core	2 Threads
Total GPU devices	0
GPU memory per device	0 GB
Compute Nodes	hpc-pr3-[01-08]

cpu_pr3 partition per job resource limits
Max runtime (h)	180 hours
Max Nodes	6 Nodes

gpu_p100 partition: for jobs using the P100 GPUs

gpu_p100 partition resources
Partition name	gpu_p100
Total available memory	44 GB
Max Cores	28 cores
Threads per core	2 Threads
Total GPU devices	4 Tesla P100
GPU memory per device	16 GB
Compute Nodes	gpu-pr1-01

gpu_p100 partition per job resource limits
Max runtime (h)	180 hours
Max Nodes	1 Node

gpu_a100 partition: for jobs using the A100 GPUs

gpu_a100 partition resources
Partition name	gpu_a100
Total available memory	1 TB
Max Cores	64 cores
Threads per core	1 Thread
Total GPU devices	8 Tesla A100
GPU memory per device	4 GPUs with 40 GB, 4 GPUs with 80 GB
Compute Nodes	gpu-pr1-02

gpu_a100 partition per job resource limits
Max runtime (h)	180 hours
Max Nodes	1 Node

gpu_h100 partition: for jobs using the H100 GPUs

gpu_h100 partition resources
Partition name	gpu_h100
Total available memory	1 TB
Max Cores	112 cores
Threads per core	1 Thread
Total GPU devices	4 H100
GPU memory per device	80 GB
Compute Nodes	gpu-pr1-03

gpu_h100 partition per job resource limits
Max runtime (h)	180 hours
Max Nodes	1 Node
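
A GPU job also has to request the GPU devices themselves. The sketch below assumes the GPUs are exposed through Slurm's generic resource mechanism under the name "gpu" and that nvidia-smi is installed on the node; neither is stated on this page, so check the partition configuration first. The file name and the CPU and time values are placeholders.

```shell
# Write an example GPU job script for the gpu_h100 partition.
cat > gpu_h100_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=gpu_h100
# This partition allows at most 1 node per job.
#SBATCH --nodes=1
# Assumption: GPUs are configured as a generic resource named "gpu".
#SBATCH --gres=gpu:1
# Placeholder CPU count and run time (under the 180 hour limit).
#SBATCH --cpus-per-task=8
#SBATCH --time=12:00:00

# Show the GPU(s) visible to the job; replace with your own program.
nvidia-smi
EOF

# Submit it with: sbatch gpu_h100_job.sh
```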

gpu_l40s partition: for jobs using the L40S GPUs

gpu_l40s partition resources
Partition name	gpu_l40s
Total available memory	768 GB
Max Cores	84 cores
Threads per core	1 Thread
Total GPU devices	3 L40S
GPU memory per device	48 GB
Compute Nodes	gpu-pr1-04

gpu_l40s partition per job resource limits
Max runtime (h)	180 hours
Max Nodes	1 Node

hagrid_batch partition: This partition is accessible only by hagrid cluster users. When you select the hagrid_batch partition, you must also specify the hagrid account with the --account=hagrid option.

hagrid_batch partition resources
Partition name	hagrid_batch
Total available memory	1440 GB
Total available cores	160 cores
Threads per core	1 Thread
Total GPU devices	0
Compute Nodes	hagrid[01-08]

hagrid_batch partition per job resource limits
Max runtime (h)	200 hours
Max Nodes	6 Nodes
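
Because the partition and the account have to be given together, a small wrapper can save typing. This is a sketch only: hagrid_submit is a hypothetical name, and the wrapper simply forwards its arguments to sbatch.

```shell
# Submit a job to hagrid_batch with the required account.
# hagrid_submit is a hypothetical helper name, not a Slurm command.
hagrid_submit() {
    # "$@" holds any further sbatch options followed by the job script,
    # for example: hagrid_submit --nodes=2 --time=24:00:00 my_job.sh
    sbatch --partition=hagrid_batch --account=hagrid "$@"
}
```

The same two options can instead be placed in the job script as #SBATCH --partition=hagrid_batch and #SBATCH --account=hagrid.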

hagrid_interactive partition: This partition is for Slurm interactive sessions and is accessible only by hagrid cluster users. Interactive jobs, or sessions, are useful for jobs that require direct user input such as code development, compiling, testing/debugging etc.

hagrid_interactive partition resources
Partition name	hagrid_interactive
Total available memory	180 GB
Total available cores	20 cores
Threads per core	2 Threads
Total GPU devices	0
Compute Nodes	hagrid-storage

hagrid_interactive partition per job resource limits
Max runtime (h)	4 hours
Max Nodes	1 Node
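
One common way to start an interactive session is srun with a pseudo-terminal. The sketch below wraps that in a function; hagrid_shell is a hypothetical name and the one hour run time is a placeholder within the 4 hour limit. Whether an --account option is also needed here is not stated on this page, so the function passes any extra options through to srun.

```shell
# Open an interactive shell on the hagrid_interactive partition.
# hagrid_shell is a hypothetical helper name, not a Slurm command.
hagrid_shell() {
    # Extra srun options given by the caller are inserted before --pty.
    srun --partition=hagrid_interactive --nodes=1 --time=01:00:00 "$@" --pty bash
}
```

Typing exit in the resulting shell ends the session and releases the allocation.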

barrio1 partition: This partition is accessible only by barrio1 cluster users. It is a single-node partition. A Slurm interactive session is recommended for work that requires direct user input, such as code development, compiling, and testing/debugging.

barrio1 partition resources
Partition name	barrio1
AllowAccounts	barrio1
Total available memory	64 GB
Total available cores	8 cores
Threads per core	2 Threads
Total GPU devices	0
Compute Nodes	barrio1.math

barrio1 partition per job resource limits
Max runtime (h)	UNLIMITED
Max Nodes	1 Node

mosaic cluster partitions

The Mosaic cluster consists of 20 dual-CPU machines each with one GPU device, and 4 quad-CPU machines with large memory. The machines are connected by an InfiniBand network. All non-CS Math Faculty researchers and grad students may use the cluster, but the owners and their collaborators retain higher priority.

cpu_mosaic_owner partition: this partition is for the owners and their collaborators

cpu_mosaic_owner partition resources
Partition name	cpu_mosaic_owner
AllowAccounts	mosaic_owners
Total available memory	2304 GB
Total available cores	96 cores
Threads per core	1 Thread
Total GPU devices	0
Compute Nodes	mosaic-[21-23]

cpu_mosaic_owner partition per job resource limits
Max runtime (day-hh:mm:ss)	7-04:00:00
Max Nodes	3 nodes
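
The Mosaic limits are written as day-hh:mm:ss, while other tables on this page use hours. To compare the two, the following sketch converts a day-hh:mm:ss value to whole minutes. The function name slurm_time_to_minutes is hypothetical; it ignores the seconds field and does not handle other Slurm time formats or UNLIMITED.

```shell
# Convert a time limit in day-hh:mm:ss form to whole minutes.
# slurm_time_to_minutes is a hypothetical helper name.
slurm_time_to_minutes() {
    # Split on '-' and ':' so the fields are days, hours, minutes, seconds.
    echo "$1" | awk -F'[-:]' '{ printf "%d\n", ($1 * 24 + $2) * 60 + $3 }'
}

slurm_time_to_minutes 7-04:00:00   # prints 10320
```

Dividing by 60 gives hours, so 7-04:00:00 corresponds to 172 hours.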

cpu_mosaic_guest partition: this partition is for users other than the owners of the Mosaic cluster

cpu_mosaic_guest partition resources
Partition name	cpu_mosaic_guest
AllowAccounts	cpu_mosaic_guest
Total available memory	2304 GB
Total available cores	96 cores
Threads per core	1 Thread
Total GPU devices	0
Compute Nodes	mosaic-[21-23]

cpu_mosaic_guest partition per job resource limits
Max runtime (day-hh:mm:ss)	7-04:00:00
Max Nodes	3 nodes

gpu_k20_mosaic_owner partition: This partition is for the owners. Each node in this partition has one Tesla K20m GPU device. Use this partition for GPU jobs.

gpu_k20_mosaic_owner partition resources
Partition name	gpu_k20_mosaic_owner
AllowAccounts	gpu_k20_mosaic_owner
Total available memory	4608 GB
Total available cores	360 cores
Threads per core	1 Thread
Total GPU devices	18 Tesla Kepler
Compute Nodes	mosaic-[01-19]

gpu_k20_mosaic_owner partition per job resource limits
Max runtime (day-hh:mm:ss)	7-12:01:00
Max Nodes	9 Nodes

gpu_k20_mosaic_guest partition: Anyone may use this partition for GPU jobs, but jobs launched here have a lower priority than jobs launched via gpu_k20_mosaic_owner. Each node in this partition has one Tesla K20m GPU device. The gpu_k20_mosaic_guest and gpu_k20_mosaic_owner partitions share the same machines. Jobs launched via gpu_k20_mosaic_owner will pre-empt jobs in gpu_k20_mosaic_guest.

gpu_k20_mosaic_guest partition resources
Partition name	gpu_k20_mosaic_guest
AllowAccounts	ALL
Total available memory	4608 GB
Total available cores	360 cores
Threads per core	1 Thread
Total GPU devices	18 Tesla Kepler
Compute Nodes	mosaic-[01-19]

gpu_k20_mosaic_guest partition per job resource limits
Max runtime (day-hh:mm:ss)	7-12:01:00
Max Nodes	5 Nodes
