Priorities, Partitions, Quality of Service & Reservations

How to submit jobs to Slurm?

Slurm’s job scheduling and waiting times

When Slurm is not configured for FIFO scheduling, jobs are prioritized in the following order:

  1. Jobs that can preempt: Not enabled in DAIC
  2. Jobs with an advanced reservation: See Slurm's Advanced Resource Reservation Guide
  3. Partition PriorityTier: See Priority tiers
  4. Job priority: See Priority calculations and QoS priority
  5. Job ID

Priority tiers

DAIC partitions are tiered:

  • The general partition is in the lowest priority tier,
  • Department partitions (e.g., insy, st) are in the middle priority tier, and
  • Partitions for specific groups (e.g., influence, mmll) are in the highest priority tier. These partitions correspond to resources contributed by the respective groups or departments (see Contributing departments).

When resources become available, the scheduler will first look for jobs in the highest priority partition that those resources are in, and start the highest (user) priority jobs that fit within the resources (if any). When resources remain, the scheduler will check the next lower priority tier, and so on. Finally, the scheduler will try to backfill lower (user) priority jobs that fit (if any).

The partition priorities have no impact on resources that are in use, so jobs have to wait until the resources become available.
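
To see which tier a partition is in, one option is to read the PriorityTier field that scontrol show partition prints. The sketch below is a minimal, hypothetical helper (the name partition_tiers is not part of Slurm or DAIC) that pulls the partition name and tier out of that output. It can only report partitions that scontrol actually lists for your account.

```shell
# Print "<partition> <PriorityTier>" pairs, highest tier first.
# Input: the text printed by `scontrol show partition`, read from stdin.
partition_tiers() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^PartitionName=/) { name = substr($i, 15) }
                if ($i ~ /^PriorityTier=/)  { print name, substr($i, 14) }
            }
        }
    ' | sort -k2,2nr -k1,1
}

# On a login node (not run here):
#   scontrol show partition | partition_tiers
```

A higher number means a higher tier, so jobs in partitions listed first are considered first for the nodes those partitions share.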

Partition selection

The purpose of this tiering is to let you submit your jobs to multiple partitions (e.g., --partition=mml,insy,general), allowing the scheduler to determine where the job can start the soonest. This ensures your job has the highest possible priority across different partitions in the cluster, without negatively impacting your or others’ resource access.

Keep in mind that:

  • Resources of all partitions (e.g., st) are also part of the general partition (see Fig 1). Thus:
    • Submitting to the general partition allows jobs to use all nodes.
    • Submitting to group-specific partitions alone results in longer waiting times, since the general partition has many more resources than any of them. The bigger the resource pool, the more chances a job has to be scheduled or backfilled.
    • The optimal strategy is to submit to both the general partition and the group-specific partitions you have access to. This lets your job skip over higher-priority jobs that would otherwise start first on resources that are also in the specific partition.
  • You should only submit jobs to partitions that your account has access to. Submitting jobs to unauthorized partitions (e.g., using --partition=insy,st when your submitting account does not have access to both of these) leaves the job in a pending state and generates excessive logging, potentially overloading the Slurm controller nodes.

Correct: explicit default account and partition specification

#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,general

Correct: account omitted, since the implicit default account has access to the specified partitions

#SBATCH --partition=insy,general

Incorrect: multiple partitions with account mismatch

#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,st

Incorrect: wrong account for the partition

#SBATCH --account=ewi-st
#SBATCH --partition=insy
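
Before listing a partition, it can help to check that your account is allowed on it. The sketch below assumes that access is expressed through the AllowAccounts field that scontrol show partition prints (as in AllowAccounts=ALL for the general partition). DAIC may enforce access through other settings as well, so treat this as a rough pre-check only. The helper name account_allowed and the account list in the usage comment are hypothetical.

```shell
# Return success if an account appears in a partition's AllowAccounts value.
# $1: the AllowAccounts value, e.g. "ALL" or a comma-separated account list
# $2: the account you plan to pass to --account
account_allowed() {
    case ",$1," in
        ,ALL,)    return 0 ;;   # every account is allowed
        *",$2,"*) return 0 ;;   # exact match on one list entry
        *)        return 1 ;;
    esac
}

# Usage with a made-up account list:
#   account_allowed "ewi-insy,ewi-insy-prb" ewi-st || echo "do not list this partition"
```

Wrapping both values in commas makes the match exact, so ewi-insy does not accidentally match ewi-insy-prb.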

Priority calculations

Slurm continually calculates job priorities and schedules the execution of jobs based on its configuration. A few configuration parameters affect priority computations:

  • SchedulerType: The type of scheduling used, based on available resources, requested resources, and job priorities. On DAIC, Slurm uses the backfill scheduling mechanism, which allows low-priority jobs to backfill idle resources if doing so does not delay the expected start time of any higher-priority job (based on resource availability).
  • PriorityType: The way priority is computed. On DAIC, a multifactor computation is applied, where job priority at any given time is a weighted sum of the following factors:
    • Fairshare: a measure of the amount of resources that a group (i.e., an account in Slurm terminology) has contributed, and the historical usage of the group and the user.
    • QOS: the quality of service associated with the job, which is specified with the Slurm --qos directive (see QoS priority).
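
To make the weighted sum concrete, here is a small arithmetic sketch. It assumes each factor is a number between 0.0 and 1.0 that is multiplied by a configured weight. The weights used here are invented for illustration and are not DAIC's settings. The function name weighted_priority is hypothetical.

```shell
# Weighted sum of two priority factors, printed as an integer.
# $1, $2: weight and factor (0.0-1.0) for fairshare
# $3, $4: weight and factor (0.0-1.0) for QoS
weighted_priority() {
    awk -v wf="$1" -v ff="$2" -v wq="$3" -v fq="$4" \
        'BEGIN { printf "%d\n", wf * ff + wq * fq }'
}

# Hypothetical weights; the configured values are listed by `scontrol show config`.
weighted_priority 1000 0.25 500 0.5   # prints: 500
```

With these made-up numbers, fairshare contributes 250 and QoS contributes 250. A job from a group with a higher fairshare factor, or in a QoS with a higher factor, ends up with a larger sum.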

The following commands are useful for checking prioritization of your own jobs:

Command                        Purpose
sprio -j <YourJobID>           Determine the priority of your job
squeue -j <YourJobID> --start  Request your job’s estimated start time
sshare -u <YourNetID>          Determine your current fairshare value
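
The three commands can be wrapped in one helper when you want the whole picture for a waiting job. This is only a convenience sketch: the function name job_wait_report and the section labels are hypothetical, and it does nothing beyond calling the commands listed above.

```shell
# Print priority, estimated start time, and fairshare for one job.
# $1: job ID; $2: NetID (defaults to the current user)
job_wait_report() {
    job_id=$1
    net_id=${2:-$USER}
    echo "== Priority factors =="
    sprio -j "$job_id"
    echo "== Estimated start =="
    squeue -j "$job_id" --start
    echo "== Fairshare =="
    sshare -u "$net_id"
}

# On a login node (not run here):
#   job_wait_report <YourJobID>
```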

QoS priority

The purpose of the (multiple) QoSs in DAIC is to optimize the throughput of the cluster and to reduce the waiting times for jobs:

  • Long jobs block resources for a long time, thus leading to long waiting times and fragmentation of resources.
  • Short jobs block resources only for short times, and can more easily fill in the gaps in the scheduling of resources (thus start sooner), and are therefore better for throughput and waiting times.

Thus, DAIC has the following policy:

  • To stimulate short jobs, the short QoS has a higher priority, and allows you to use a larger part of all resources, than the medium and long QoS.

  • To prevent long jobs from blocking all resources in the cluster for long times (thus causing long waiting times), only a certain part of all cluster resources is available to all running long QoS jobs (of all users) combined.

  • All running medium QoS jobs together can use a somewhat larger part of all resources in the cluster, and all running short QoS jobs combined are allowed to fill the biggest part of the cluster.

    • These limits are called the QoS group limits.
    • When this limit is reached, no new jobs with this QoS can be started, until some of the running jobs with this QoS finish and release some resources.
    • The scheduler will indicate this with the reason QoS Group CPU/memory/GRES limit.
  • To prevent one user from single-handedly using all available resources in a certain QoS, there are also limits for the total resources that all running jobs of one user in a specific QoS can use.

    • These are called the QoS per-user limits.
    • When this limit is reached, no new jobs of this user with this QoS can be started, until some of the running jobs of this user and with this QoS finish and release some resources.
    • The scheduler will indicate this with the reason QoS User CPU/memory/GRES limit.

These per-group and per-user limits are set by the DAIC user board, and the scheduler strictly enforces these limits. Thus, no user can use more resources than the amount that was set by the user board. Any (perceived) imbalance in the use of resources by a certain QoS or user should not be held against a user or the scheduler, but should be discussed in the user board.
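
To check whether your pending jobs are waiting on one of these limits, you can list each job with its QoS and pending reason, then keep the QoS-related ones. The sketch below assumes that Slurm reports such reasons with names starting with QOS (for example QOSGrpCpuLimit). The exact reason names on DAIC may differ, and the helper name qos_limited_jobs is hypothetical.

```shell
# Keep only pending jobs whose reason mentions a QoS limit.
# Input lines on stdin: "<job id> <qos> <reason>"
qos_limited_jobs() {
    awk '$3 ~ /^QOS/ { print $1, $2, $3 }'
}

# On a login node (not run here); %i = job ID, %q = QoS, %r = pending reason:
#   squeue -u "$USER" -t PENDING -h -o "%i %q %r" | qos_limited_jobs
```

An empty result means none of your pending jobs is currently held back by a QoS limit. They are more likely waiting for resources or priority.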

Partitions

In Slurm, a partition is a scheduling construct that groups nodes or resources based on certain characteristics or policies. Partitions are used to organize and manage resources within a cluster, and they allow system administrators to control how jobs are allocated and executed on different nodes.

See partition definitions

On DAIC, the scontrol command shows you only the general partition, although more partitions are available.

$ scontrol show partition
PartitionName=general
   AllowGroups=ALL AllowAccounts=ALL DenyQos=influence
   AllocNodes=login[1-3],oodtest Default=YES QoS=N/A
   DefaultTime=00:01:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=3dgi[1-2],100plus,awi[01-26],cor1,gpu[01-11],grs[1-4],influ[1-6],insy[11-16],tbm5,wis1
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=4064 TotalNodes=59 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=1024 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=0.5,Mem=0.083333333G,GRES/gpu=16.0

Quality of Service (QoS)

When you submit a job in a Slurm-based system, it enters a queue waiting for resources. The partition and Quality of Service (QoS) are the two job parameters Slurm uses to assign resources for a job:

  • The partition is a set of compute nodes on which a job can be scheduled. In DAIC, the nodes contributed or funded by a certain group are lumped into a corresponding partition (see Contributing departments). All nodes in DAIC are part of the general partition, but other partitions exist for prioritization purposes on select nodes (see Priority tiers).
  • The Quality of Service is a set of limits that controls what resources a job can use and, therefore, determines the priority level of a job. This includes the run time, CPU, GPU and memory limits on the given partition. Jobs that exceed these limits are automatically terminated (see QoS priority).

For DAIC, Table 1 shows the QoS limits on the general partition.

Table 1: The general partition and its operational and per-QoS per-user limits; specific groups use other partitions and QoS

Partition  QoS          Priority  Max run time  Jobs per user  CPU limits              GPU limits           Memory limits
                                                               Per QoS     Per user    Per QoS    Per user  Per QoS       Per user
general    interactive  high      1 hour        1 running      -           2           -          2         -             16G
general    short        normal    4 hours       10000          3672 (85%)  2160 (50%)  109 (85%)  64 (50%)  23159G (85%)  13623G (50%)
general    medium       medium    1.5 days      2000           3456 (80%)  1512 (35%)  103 (80%)  45 (35%)  21796G (80%)  9536G (35%)
general    long         low       7 days        1000           3240 (75%)  864 (20%)   96 (75%)   25 (20%)  20434G (75%)  5449G (20%)
general    infinite*    none      infinite      1 running      32          -           2          -         250G          -

* infinite QoS jobs will be killed when servers go down, e.g., during maintenance. Submitting jobs with this QoS is not recommended.
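
Because shorter QoSs get higher priority and larger limits, it usually pays to request the shortest QoS whose maximum run time still covers your job. The sketch below encodes only the run-time column of Table 1. The helper name pick_qos is hypothetical, and the limits would need updating if the user board changes them.

```shell
# Print the shortest QoS from Table 1 whose maximum run time covers a job.
# $1: expected run time in whole hours
pick_qos() {
    hours=$1
    if   [ "$hours" -le 4 ];   then echo short    # up to 4 hours
    elif [ "$hours" -le 36 ];  then echo medium   # up to 1.5 days
    elif [ "$hours" -le 168 ]; then echo long     # up to 7 days
    else
        echo "no QoS in Table 1 (other than infinite) allows $hours hours" >&2
        return 1
    fi
}

pick_qos 3    # prints: short
pick_qos 30   # prints: medium
```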

See Quality of Service definitions

On DAIC you can check the QoS policies with the sacctmgr command:

$ sacctmgr list qos
      Name   Priority  GraceTime    Preempt   PreemptExemptTime PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES 
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- ------------- 
    normal          0   00:00:00                                    cluster                              DenyOnLimit               1.000000                                                                                                                                                                                                                cpu=1 
     short         50   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=3562,gre+                                         65536                                                           04:00:00 cpu=2096,gre+                 10000                                      cpu=1,mem=1M 
      long         25   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=3144,gre+                                         65536                                                         7-00:00:00 cpu=838,gres+                  1000                                      cpu=1,mem=1M 
  infinite          0   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=32,gres/+                                         65536                                                                                          1         100                                      cpu=1,mem=1M 
interacti+        100   00:00:00                                    cluster                              DenyOnLimit               2.000000                                                       65536                                                           01:00:00 cpu=2,gres/g+         1           1                                      cpu=1,mem=1M 
   student         10   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=192,gres+                                         65536                                                           04:00:00 cpu=2,gres/g+         1         100                                      cpu=1,mem=1M 
reservati+        100   00:00:00                                    cluster          DenyOnLimit,RequiresReservation               1.000000                                                       65536                                                                                                  10000                                      cpu=1,mem=1M 
 influence        100   00:00:00                                    cluster                              DenyOnLimit               1.000000                                                       65536                                                                                                  10000                                      cpu=1,mem=1M 
guest-sho+         10   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=200,gres+                                         65536                                                           04:00:00 cpu=128,gres+                   100                                      cpu=1,mem=1M 
guest-long          0   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=200,gres+                                         65536                                                         7-00:00:00 cpu=128,gres+         1          10                                      cpu=1,mem=1M 
    medium         35   00:00:00                                    cluster                              DenyOnLimit               1.000000 cpu=3352,gre+                                         65536                                                         1-12:00:00 cpu=1466,gre+                  2000                                      cpu=1,mem=1M 
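
The default listing is wide and truncates long values with a "+". A narrower view can be requested with sacctmgr's format= option together with parsable output. The sketch below sorts such output by priority. The selection of fields is just one reasonable choice, and the helper name qos_by_priority is hypothetical.

```shell
# Sort QoS records by priority, highest first.
# Input lines on stdin: "<name>|<priority>|..." (pipe-separated, no header)
qos_by_priority() {
    sort -t '|' -k2,2nr
}

# On a login node (not run here); -n drops the header, -P prints
# pipe-separated fields without truncation:
#   sacctmgr -n -P show qos \
#       format=Name,Priority,MaxWall,GrpTRES,MaxTRESPerUser,MaxSubmitJobsPerUser \
#       | qos_by_priority
```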

How to use QoS in your sbatch scripts?

In your sbatch.slurm script, you can specify the QoS with the #SBATCH --qos=... option.

Example:

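#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --partition=general
#SBATCH --account=ewi-insy-reit
#SBATCH --qos=short               # This is how you specify QoS
#SBATCH --time=0:01:00            # Requested run time (h:mm:ss); must fit within the QoS maximum
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks-per-node=1       # Tasks per node
#SBATCH --cpus-per-task=2         # CPU cores per task
#SBATCH --mem=1GB                 # Memory per node
#SBATCH --output=slurm-%n-%j.out  # Standard output file (%j is the job ID)
#SBATCH --error=slurm-%n-%j.err   # Standard error file

srun echo 'Hi, from Slurm!'
sleep 30  # Wait for 30 seconds before exiting.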
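
Before submitting, it can be useful to confirm which QoS a script actually requests. The sketch below reads it back from the #SBATCH lines. It is a hypothetical helper (script_qos is not a Slurm command) and only recognizes the --qos=<name> spelling used in the example above.

```shell
# Print the QoS requested by the #SBATCH directives of a job script.
# $1: path to the job script
script_qos() {
    sed -n 's/^#SBATCH[[:space:]]*--qos=\([^[:space:]]*\).*/\1/p' "$1"
}

# Usage (not run here):
#   script_qos sbatch.slurm          # prints the QoS named in the script
#   sbatch --qos=medium sbatch.slurm # a command-line option overrides the script
```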

QoS for reservations

If you have a reservation, you need to specify both --qos=reservation and --reservation=<reservation-name>. See Using reservations for an example.

Resource reservations

Slurm gives the possibility to reserve one or more compute nodes exclusively for a specific user or group of users. A reservation ensures that the designated node (or nodes) are dedicated solely to the reservation holder’s tasks and are not shared with other users during the reserved period. This feature allows users to plan the execution of future workloads, and accommodates cluster users with special needs beyond the batch system (e.g., latency measurement scenarios).

Requesting a Reservation

To request a reservation for nodes, please use the Request Reservation form. You can request a reservation for an entire compute node (or a group of nodes) if you have contributed this node (or these nodes) to the cluster and you have special needs that need to be accommodated.

General guidelines for reservation requests:

  • You can be granted a reservation only on nodes from a partition that is contributed by your group (see Partitions to check the name of the partition contributed by your group, and System specifications for a listing of available nodes and their features).
  • Please ask for the least amount of resources you need, so as to minimize the impact on other users.
  • Plan ahead and request your reservation as early as possible: Reservations usually ignore running jobs, so any running job on the machine(s) you request will continue to run when the reservation starts. While jobs from other users will not start on the reserved node(s), the resources in use by an already running job at the start time of the reservation will not be available in the reservation until this running job ends. The earlier you request resources, the easier it is to allocate them.

Using reservations

Once your reservation request is approved and a reservation is placed on the system, you can run your jobs in the reservation by adding the following options to your Slurm commands: --qos=reservation, --reservation=<name>, and --partition=<partition>. For example, to submit the job job.sbatch to a reservation named icra_iv on the cor1 node in the cor partition, use:

$ sbatch --qos=reservation --reservation=icra_iv --partition=cor job.sbatch

Alternatively, you can add the following lines to the job.sbatch file and submit this file as usual:

#SBATCH --qos=reservation
#SBATCH --reservation=icra_iv
#SBATCH --partition=cor

To make use of an existing reservation you have to specify --qos=reservation and --reservation=<reservation-name> in your sbatch script.
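
Since the same three options are needed for every command that should run inside the reservation, they can be generated in one place. This is a small convenience sketch: the function name reservation_flags is hypothetical, and it assumes a reservation name without spaces.

```shell
# Print the three options a job needs to run inside a reservation.
# $1: reservation name; $2: partition the reserved nodes belong to
reservation_flags() {
    printf '%s --reservation=%s --partition=%s\n' '--qos=reservation' "$1" "$2"
}

reservation_flags icra_iv cor
# prints: --qos=reservation --reservation=icra_iv --partition=cor

# Usage (not run here); the substitution is left unquoted so that it
# expands into three separate options:
#   sbatch $(reservation_flags icra_iv cor) job.sbatch
```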

Viewing reservations

To view all active and future reservations run the scontrol command as follows:

$ scontrol show reservations
ReservationName=icra_iv StartTime=2023-09-09T00:00:00 EndTime=2023-09-16T00:00:00 Duration=7-00:00:00
   Nodes=cor1 NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=cor Flags=
   TRES=cpu=64
   Users=(null) Groups=(null) Accounts=3me-cor Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

ReservationName=maintenance weekend 2023-10-14 StartTime=2023-10-13T20:00:00 EndTime=2023-10-16T09:00:00 Duration=2-13:00:00
   Nodes=3dgi[1-2],100plus,awi[01-26],cor1,gpu[01-11],grs[1-4],influ[1-6],insy[11-12,14-16],tbm5,wis1 NodeCnt=58 CoreCnt=2000 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=4000
   Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
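
The full output is verbose when you only want to know when a reservation runs and whether it is active. The sketch below condenses it to one line per reservation. It assumes the field layout shown above (ReservationName, StartTime and EndTime on the first line, State on a later line), and the helper name reservation_summary is hypothetical.

```shell
# Summarize reservations as "<name> | <start> | <end> | <state>".
# Input: the text printed by `scontrol show reservations`, read from stdin.
reservation_summary() {
    awk '
        /^ReservationName=/ {
            # The name may contain spaces, so cut it out by position.
            line = $0
            sub(/^ReservationName=/, "", line)
            name = line
            sub(/ StartTime=.*/, "", name)
            start = line; sub(/.* StartTime=/, "", start); sub(/ .*/, "", start)
            end_t = line; sub(/.* EndTime=/, "", end_t);   sub(/ .*/, "", end_t)
        }
        / State=/ {
            state = $0
            sub(/.* State=/, "", state); sub(/ .*/, "", state)
            print name " | " start " | " end_t " | " state
        }
    '
}

# On a login node (not run here):
#   scontrol show reservations | reservation_summary
```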

Last modified December 17, 2024: merge job chains and job arrays (8dbe822)