Slurm partitions

Partitions in Slurm can be considered a resource abstraction: a partition configuration defines job limits and access controls for a group of nodes. Slurm allocates resources to a job within the selected partition, taking into account the job's requested resources and the partition's available resources and restrictions.
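For example, a job selects its partition with the --partition option in a submission script. A minimal sketch (the partition name, resource amounts, and time here are illustrative values, not recommendations):

```shell
#!/bin/bash
# Minimal submission-script sketch: request 4 cores and 16 GB for two
# hours in the cpu_pr1 partition (all values illustrative only).
#SBATCH --partition=cpu_pr1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

msg="job step for cpu_pr1 would run here"
echo "$msg"
```

Submitted with sbatch, the job is queued in cpu_pr1; requests exceeding the partition's limits are rejected at submission time.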

The following partitions are currently configured:

  • MFCF CPU partitions
    • cpu_pr1 : For running jobs on the hpc-pr2 cluster
    • cpu_pr3 : For running jobs on the hpc-pr3 cluster
  • MFCF GPU partitions
    • gpu_p100 : For running jobs on the Pascal P100 GPU server
    • gpu_a100 : For running jobs on the Ampere A100 GPU server
    • gpu_h100 : For running jobs on the Hopper H100 GPU server
    • gpu_l40s : For running jobs on the Ada Lovelace L40S GPU server
  • hagrid cluster partitions
    • hagrid_batch : For batch jobs on the hagrid cluster (hagrid users only)
    • hagrid_interactive : For interactive sessions on the hagrid cluster (hagrid users only)

Details of each partition's resources are shown below. Resources may be adjusted according to observed usage patterns, and partitions' computational resources may overlap.

Useful commands

  • For detailed information on all partitions on a cluster
    scontrol show partition
  • For information on a specific partition
    scontrol show partition <partition-name>
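As a sketch of inspecting this output programmatically: scontrol prints space-separated key=value pairs, so a single field can be extracted with standard tools. The sample line below is illustrative, not real output from this cluster; on a login node you would feed in actual scontrol output (its --oneliner flag prints one partition per line).

```shell
# Illustrative sample of scontrol's key=value output format
# (7-12:00:00 means 7 days 12 hours, i.e. 180 hours).
sample="PartitionName=cpu_pr1 MaxNodes=6 MaxTime=7-12:00:00 State=UP"

# Split the pairs onto separate lines and pick out the MaxTime value.
max_time=$(printf '%s\n' $sample | grep '^MaxTime=' | cut -d= -f2)
echo "$max_time"
```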

Partition details

  • cpu_pr1 partition: for running production jobs on the hpc-pr2 cluster.

    cpu_pr1 partition resources:
      Partition name          cpu_pr1
      Total available memory  512 GB
      Max cores               96
      Threads per core        2
      Total GPU devices       0
      GPU memory per device   0 GB
      Compute nodes           hpc-pr2-[01-08]

    cpu_pr1 per-job resource limits:
      Max runtime             180 hours
      Max nodes               6
  • cpu_pr3 partition: for running production jobs on the hpc-pr3 cluster.

    cpu_pr3 partition resources:
      Partition name          cpu_pr3
      Total available memory  1024 GB
      Max cores               256
      Threads per core        2
      Total GPU devices       0
      GPU memory per device   0 GB
      Compute nodes           hpc-pr3-[01-08]

    cpu_pr3 per-job resource limits:
      Max runtime             180 hours
      Max nodes               6
  • gpu_p100 partition: for jobs using the P100 GPUs.

    gpu_p100 partition resources:
      Partition name          gpu_p100
      Total available memory  44 GB
      Max cores               28
      Threads per core        2
      Total GPU devices       4 Tesla P100
      GPU memory per device   16 GB
      Compute nodes           gpu-pr1-01

    gpu_p100 per-job resource limits:
      Max runtime             180 hours
      Max nodes               1
  • gpu_a100 partition: for jobs using the A100 GPUs.

    gpu_a100 partition resources:
      Partition name          gpu_a100
      Total available memory  1 TB
      Max cores               64
      Threads per core        1
      Total GPU devices       8 Tesla A100
      GPU memory per device   4 GPUs with 40 GB; 4 GPUs with 80 GB
      Compute nodes           gpu-pr1-02

    gpu_a100 per-job resource limits:
      Max runtime             180 hours
      Max nodes               1
  • gpu_h100 partition: for jobs using the H100 GPUs.

    gpu_h100 partition resources:
      Partition name          gpu_h100
      Total available memory  1 TB
      Max cores               112
      Threads per core        1
      Total GPU devices       6 H100
      GPU memory per device   4 GPUs with 80 GB; 2 GPUs with 96 GB
      Compute nodes           gpu-pr1-03, gpu-pr1-06

    gpu_h100 per-job resource limits:
      Max runtime             180 hours
      Max nodes               1
  • gpu_l40s partition: for jobs using the L40S GPUs.

    gpu_l40s partition resources:
      Partition name          gpu_l40s
      Total available memory  768 GB
      Max cores               84
      Threads per core        1
      Total GPU devices       5 L40S
      GPU memory per device   48 GB
      Compute nodes           gpu-pr1-04, gpu-pr1-05

    gpu_l40s per-job resource limits:
      Max runtime             180 hours
      Max nodes               1
  • hagrid_batch partition: accessible only to hagrid cluster users. If the hagrid_batch partition is selected, you must specify the hagrid account with the --account=hagrid option.

    hagrid_batch partition resources:
      Partition name          hagrid_batch
      Total available memory  1440 GB
      Total available cores   160
      Threads per core        1
      Total GPU devices       0
      Compute nodes           hagrid[01-08]

    hagrid_batch per-job resource limits:
      Max runtime             200 hours
      Max nodes               6
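Putting the partition and account requirements together, a hagrid_batch submission script might look like the sketch below. Only the --partition and --account lines follow from the text above; the node, task, and time values are illustrative.

```shell
#!/bin/bash
# Sketch of a hagrid_batch job: the hagrid account must be named
# explicitly; the nodes/tasks/time values below are illustrative only.
#SBATCH --partition=hagrid_batch
#SBATCH --account=hagrid
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20
#SBATCH --time=48:00:00

msg="hagrid batch job body goes here"
echo "$msg"
```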
  • hagrid_interactive partition: for Slurm interactive sessions; accessible only to hagrid cluster users. Interactive jobs, or sessions, are useful for work that requires direct user input, such as code development, compiling, and testing/debugging.

    hagrid_interactive partition resources:
      Partition name          hagrid_interactive
      Total available memory  180 GB
      Total available cores   20
      Threads per core        2
      Total GPU devices       0
      Compute nodes           hagrid-storage

    hagrid_interactive per-job resource limits:
      Max runtime             4 hours
      Max nodes               1
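A common way to start such a session is with salloc followed by srun. A sketch, with illustrative option values (whether --account=hagrid is also required here is not stated above; add it if the request is rejected):

```
salloc --partition=hagrid_interactive --time=02:00:00 --cpus-per-task=4 --mem=8G
srun --pty bash
```

When the allocation is granted, srun --pty bash opens a shell on the allocated node; exiting the shell releases the allocation.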