Quality of Service (QoS)
How to submit jobs to Slurm?
When you submit a job on a Slurm-based system, it enters a queue where it waits for resources. The partition and the Quality of Service (QoS) are the two job parameters Slurm uses to assign resources to a job:
- The partition is a set of compute nodes on which a job can be scheduled. In DAIC, the nodes contributed or funded by a certain group are lumped into a corresponding partition (see Contributing departments). All nodes in DAIC are part of the general partition, but other partitions exist for prioritization purposes on select nodes (see Priority tiers). You can inspect the available partitions yourself, as shown in the example after this list.
- The Quality of Service (QoS) is a set of limits that controls what resources a job can use and, therefore, determines the priority level of a job. This includes the run time, CPU, GPU and memory limits on the given partition. Jobs that exceed these limits are automatically terminated (see QoS priority).
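To see which partitions exist and what their time limits are, you can query Slurm directly. This is a minimal sketch using standard Slurm commands; the exact partitions and limits you see depend on your account:
$ sinfo --summarize                                  # one line per partition: availability, time limit, node counts
$ sinfo --partition=general --format="%P %a %l %D"   # partition name, availability, time limit, number of nodes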
For DAIC, Table 1 shows the QoS limits on the general partition.
Partition | QoS | Priority | Max run time | Jobs per user | CPU limit per QoS | CPU limit per user | GPU limit per QoS | GPU limit per user | Memory limit per QoS | Memory limit per user
---|---|---|---|---|---|---|---|---|---|---
general | interactive | high | 1 hour | 1 running | - | 2 | - | 2 | - | 16G
general | short | normal | 4 hours | 10000 | 3672 (85%) | 2160 (50%) | 109 (85%) | 64 (50%) | 23159G (85%) | 13623G (50%)
general | medium | medium | 1 ½ day | 2000 | 3456 (80%) | 1512 (35%) | 103 (80%) | 45 (35%) | 21796G (80%) | 9536G (35%)
general | long | low | 7 days | 1000 | 3240 (75%) | 864 (20%) | 96 (75%) | 25 (20%) | 20434G (75%) | 5449G (20%)
general | infinite* | none | infinite | 1 running | 32 | - | 2 | - | 250G | -
Table 1: QoS limits on the general partition of DAIC.
*Jobs in the infinite QoS will be killed when compute nodes go down, e.g. during maintenance. It is not recommended to submit jobs with this QoS.
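For example, the interactive QoS above allows one short interactive job per user. Such a session can be started with srun along these lines (a sketch; adjust the resource requests, and add an --account option if your group requires it):
$ srun --partition=general --qos=interactive --time=01:00:00 --cpus-per-task=2 --mem=4G --pty bash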
Note
The priority of a job is a function of both QoS and previous usage (less is better). Read Priority and waiting times for more information.
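To see how QoS and previous usage combine for your own jobs, the standard Slurm tools can be used (a sketch; these commands are assumed to be available on DAIC):
$ sprio -u $USER    # per-job breakdown of the priority factors of pending jobs, including the QoS contribution
$ sshare -u $USER   # your fair-share usage; higher past usage lowers the priority of new jobs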
See Quality of Service definitions
On DAIC you can check the QoS policies with the sacctmgr command:
$ sacctmgr list qos
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=1
short 50 00:00:00 cluster DenyOnLimit 1.000000 cpu=3562,gre+ 65536 04:00:00 cpu=2096,gre+ 10000 cpu=1,mem=1M
long 25 00:00:00 cluster DenyOnLimit 1.000000 cpu=3144,gre+ 65536 7-00:00:00 cpu=838,gres+ 1000 cpu=1,mem=1M
infinite 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=32,gres/+ 65536 1 100 cpu=1,mem=1M
interacti+ 100 00:00:00 cluster DenyOnLimit 2.000000 65536 01:00:00 cpu=2,gres/g+ 1 1 cpu=1,mem=1M
student 10 00:00:00 cluster DenyOnLimit 1.000000 cpu=192,gres+ 65536 04:00:00 cpu=2,gres/g+ 1 100 cpu=1,mem=1M
reservati+ 100 00:00:00 cluster DenyOnLimit,RequiresReservation 1.000000 65536 10000 cpu=1,mem=1M
influence 100 00:00:00 cluster DenyOnLimit 1.000000 65536 10000 cpu=1,mem=1M
guest-sho+ 10 00:00:00 cluster DenyOnLimit 1.000000 cpu=200,gres+ 65536 04:00:00 cpu=128,gres+ 100 cpu=1,mem=1M
guest-long 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=200,gres+ 65536 7-00:00:00 cpu=128,gres+ 1 10 cpu=1,mem=1M
medium 35 00:00:00 cluster DenyOnLimit 1.000000 cpu=3352,gre+ 65536 1-12:00:00 cpu=1466,gre+ 2000 cpu=1,mem=1M
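The default listing is wide and truncates long values (the trailing + signs above). To show only the columns of interest, you can pass a format option, for example (a sketch using the column names from the header above):
$ sacctmgr show qos format=Name%12,Priority,MaxWall,MaxTRESPU%40,MaxJobsPU,MaxSubmitPU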
How to use QoS in your sbatch scripts?
In your sbatch.slurm script you can specify the QoS with the #SBATCH --qos=... option.
Example:
#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --partition=general
#SBATCH --account=ewi-insy-reit
#SBATCH --qos=short # This is how you specify QoS
#SBATCH --time=0:01:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=1GB
#SBATCH --output=slurm-%n-%j.out
#SBATCH --error=slurm-%n-%j.err
srun echo 'Hi, from Slurm!'
sleep 30 # Wait for 30 seconds before exiting.
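Assuming the script above is saved as hello-world.sbatch (a hypothetical filename), it can be submitted and monitored like this:
$ sbatch hello-world.sbatch   # submit the job; Slurm prints the assigned job ID
$ squeue -u $USER             # check the state of your jobs in the queue
$ scontrol show job <jobid>   # detailed information, including the partition and QoS used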
QoS for reservations
In case you have a reservation, you need to specify both --qos=reservation and --reservation=<name of your reservation> when submitting jobs.
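For example, the relevant lines in a batch script would look like this (a sketch; my_reservation is a hypothetical reservation name):
#SBATCH --qos=reservation
#SBATCH --reservation=my_reservation   # replace with the name of your reservation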