These pages contain basic concepts and details to make optimal use of TU Delft’s DAIC. Alternatively, you might wish to jump to the Quickstart or Tutorials for more thematic content.
Documentation
- 1: Introduction
- 2: Policies
- 3: System specifications
- 4: User manual
- 4.1: Best practices
- 4.2: Handy commands on DAIC
- 4.3: Connecting to DAIC
- 4.4: Data management
- 4.4.1: Data transfer
- 4.5: Software
- 4.5.1: Available software
- 4.5.2: Modules
- 4.5.3: Installing software
- 4.5.4: Containerization
- 4.6: Job submission
- 4.6.1: Priorities and waiting times
- 4.6.2: Quality of Service (QoS)
- 4.6.3: Partitions
- 4.6.4: Interactive jobs
- 4.6.5: Submitting jobs
- 4.6.6: Monitoring jobs
- 4.6.7: Cancelling jobs
- 4.6.8: Using graphic cards
- 4.6.9: Job arrays
- 4.6.10: Job chains
- 4.6.11: Reservations
- 4.6.12: Kerberos
1 - Introduction
What is an HPC cluster?
A High Performance Computing (HPC) cluster is a collection of (large) computing resources, such as processors (CPUs), graphics processors (GPUs), memory and storage, that are shared among a group of users. Pooling multiple computers in this way makes it possible to perform lengthy and resource-intensive computations beyond the capabilities of a single computer, and is especially useful for modern scientific computing applications, where datasets are typically large, models have many parameters and high complexity, and computations need specialized hardware (like GPUs and FPGAs).
What is DAIC?
The Delft AI Cluster (DAIC), formerly known as INSY-HPC or simply HPC, is a TU Delft High Performance Computing (HPC) cluster consisting of Linux compute nodes (i.e., servers) with substantial processing power and memory for running large, long or GPU-enabled jobs.
Starting as a CS-only cluster in 2015, DAIC has grown over time to serve researchers across many TU Delft departments, while keeping the needs of CS and AI central in each expansion phase. Today, DAIC nodes are organized as partitions that correspond to the groups contributing these resources (see Contributing departments and TU Delft clusters comparison).
1.1 - Contributors and funding
The Delft AI Cluster (DAIC), formerly known as INSY-HPC or simply HPC, was initiated within the INSY department in 2015. Later, resources were joined with ST, collectively called CS@Delft, and with other departments across faculties in subsequent expansion cycles.
Joining DAIC?
If you are interested in joining DAIC as a contributor, please contact us via this TopDesk DAIC Contact Us form.
Contributing departments
The cluster is available (only) to users from participating departments, and access can be arranged through the department’s contact persons (see Access and accounts).
# | DAIC partition | Contributor | Faculty | Faculty abbreviation (English/Dutch) |
---|---|---|---|---|
1 | 3dgi | 3D Geoinformation | Faculty of Architecture and the Built Environment | ABE/BK |
2 | asm | Aerospace Structures and Materials | Faculty of Aerospace Engineering | AE/LR |
3 | imphys | Imaging Physics | Faculty of Applied Sciences | AS/TNW |
4 | cor | Cognitive Robotics | Faculty of Mechanical Engineering | ME |
5 | grs | Geoscience & Remote Sensing | Faculty of Civil Engineering and Geosciences | CEG/CiTG |
6 | influence | Intelligent Systems | Faculty of Electrical Engineering, Mathematics & Computer Science | EEMCS/EWI |
7 | insy | Intelligent Systems | Faculty of Electrical Engineering, Mathematics & Computer Science | EEMCS/EWI |
8 | st | Software Technology | Faculty of Electrical Engineering, Mathematics & Computer Science | EEMCS/EWI |
Funding sources
In addition to funding received from departmental sources, DAIC has also been financially supported by the following projects and granting sources:
1.2 - Advisors and Impact
Advisory board
- Pattern Recognition and Bioinformatics group, Department of Intelligent Systems
- Interactive Intelligence group, Department of Intelligent Systems
- Web Informatics group, Software Technology Department
Citation and Acknowledgement
To help demonstrate the impact of DAIC, we ask that you both cite and acknowledge DAIC in your scientific publications. Please use the following formats:
Delft AI Cluster (DAIC). (2024). The Delft AI Cluster (DAIC), RRID:SCR_025091. https://doi.org/10.4233/rrid:scr_025091
@misc{DAIC,
author = {{Delft AI Cluster (DAIC)}},
title = {The Delft AI Cluster (DAIC), RRID:SCR_025091},
year = {2024},
doi = {10.4233/rrid:scr_025091},
url = {https://doc.daic.tudelft.nl/}
}
TY - DATA
T1 - The Delft AI Cluster (DAIC), RRID:SCR_025091
UR - https://doi.org/10.4233/rrid:scr_025091
PB - TU Delft
PY - 2024
Research reported in this work was partially or completely facilitated by computational resources and support of the Delft AI Cluster (DAIC) at TU Delft (RRID: SCR_025091), but remains the sole responsibility of the authors, not the DAIC team.
Scientific impact in numbers
Since 2015, DAIC has facilitated more than 2000 scientific outputs from the various DAIC-participating departments:
 | Article | Conference/Meeting contribution | Book/Book chapter/Book editing | Dissertation (TU Delft) | Abstract | Other | Editorial | Patent | Grand Total |
---|---|---|---|---|---|---|---|---|---|
Grand Total | 1067 | 854 | 123 | 99 | 69 | 32 | 29 | 8 | 2281 |
These outputs span a wide range of application areas, with titles reflecting an emphasis on data analysis and machine learning:
Reference
The table and wordcloud provided here are based on retrospective retrieval of all DAIC users’ scientific outputs between 2015-2023 from TU Delft’s Pure database. The data has been generated by the Strategic Development – Data Insights team.
Publications using DAIC
Note
The compilation of the following list is done retrospectively by the Data Insights team and/or is based on self-reporting by individual researchers. As a result, it may be neither exhaustive nor complete. If your publication is missing, please let us know by posting it to the ScientificOutput MatterMost channel.
1.3 - TU Delft clusters comparison
Cluster comparison
TU Delft clusters
DAIC is one of several clusters accessible to TU Delft CS researchers (and their collaborators). The table below gives a comparison between these in terms of use case, eligible users, and other characteristics.
DAIC | DelftBlue | DAS | |
---|---|---|---|
Primary use cases | Research, especially in AI | Research & Education | Distributed systems research, streaming applications, edge and fog computing, in-network processing, and complex security and trust policies, Machine learning research, ... |
Contributors | Certain groups within TU Delft (see Contributing departments) | All TU Delft faculties | Multiple universities & SURF |
Eligible users | | All TU Delft affiliates | |
Website | DAIC documentation | DelftBlue Documentation | DAS Documentation |
Contact info | DAIC community | DHPC team | DAS admin |
Request account | Access and accounts | Get an account | Email DAS admin with details like user's affiliation and the planned purpose of the account. |
Getting started | Quickstart | Crash course | |
Hardware | System specifications | DHPC hardware | Head node + … |
Software stack | Software | DHPC modules | Base OS: Rocky Linux, OpenHPC, Slurm Workload Manager |
Data storage | Storage | Storage | Storage: 128 TB (RAID6) |
Access to TU Delft Network storage | ✓ | Only in login nodes | Not supported |
Sharing data in collaboration | ✓ | ✗ | |
Has GPUs? | ✓ | ✓ | ✓ |
Cost of use | Contribution towards hardware purchase | - |
SURF clusters
SURF, the collaborative organization for IT in Dutch education and research, has installed and currently operates the Dutch national supercomputer, Snellius, which as of Q3 2021 houses 144 A100 (40 GB) GPUs (36 gcn nodes × 4 A100 GPUs per node), with other specs detailed in the Snellius hardware and file systems wiki.
SURF also operates other clusters like Spider for processing large structured data sets, and ODISSEI Secure Supercomputer (OSSC) for large-scale analyses of highly-sensitive data. For an overview of SURF clusters, see the SURF wiki.
TU Delft researchers in TBM and CITG already have direct and easy access to the compute power and data services of SURF, while members of other faculties need to apply for access as detailed in SURF’s guide to Apply for access to compute services.
TU Delft cloud resources
For both education and research activities, TU Delft has established the Cloud4Research program. Cloud4Research aims to facilitate the use of public cloud resources, primarily Amazon AWS. At the administrative level, Cloud4Research provides AWS accounts with an initial budget; subsequent billing can be charged to a project code, instead of a personal credit card. At the technical level, the ICT innovation team provides intake meetings to facilitate getting started. Please refer to the Policies and FAQ pages for more details.
2 - Policies
User agreement
This user agreement is intended to establish the expectations between all users and administrators of the cluster with respect to fair-use and fair-share of cluster resources. By using the DAIC cluster you agree to these terms and conditions.
General information about the DAIC cluster
- Cluster structure: The DAIC cluster is made up of shared resources contributed by different labs and groups. The pooling of resources from different groups is beneficial for everyone: it enables larger, parallelized computations and more efficient use of resources with less idle time.
- Basic principles: Regardless of the specific details, cluster use is always based on basic principles of fair-use and fair-share (through priority) of resources, and all users are expected to take care at all times that their cluster use is not hindering other users.
- Policies: Cluster policies are decided by the user board and enforced by various automated and non-automated actions, for example by the job scheduler based on QoS limits and the administrators for ensuring the stability and performance of the cluster.
- Support:
- Cluster administrators offer, during office hours, different levels of support, which include (in order of priority): ensuring the stability and performance of the cluster, providing generic software, helping with cluster-specific questions and problems, and providing information (via e-mails and during the board meeting) about cluster updates.
- Contact persons from participating groups add and manage users at the level of their respective groups, communicate needs and updates between their groups and system administrators, and may help with cluster-specific questions and problems.
- HPC Engineers, in CS@Delft, provide support to (CS) students, researchers and staff members to efficiently use DAIC resources. This includes: maintaining updated documentation resources, running onboarding and advanced training courses on cluster usage, organizing workshops to assess compute needs, plan infrastructure upgrades, and may collaborate with researchers on individual projects as fits.
- More information: Please see the General cluster usage and What to do in case of problems sections on where to find more information about cluster use.
- Cluster workflow:
- The typical steps for running a job on the cluster are: Test → Determine resources → Submit → Monitor job → Repeat until results are obtained. See Quickstart
- You can use the login nodes for testing your code, determining the required resources and submitting jobs (see Computing on login nodes).
- For testing jobs which require larger resources (more than 4 CPUs and/or more than 4 GB of memory and/or one or more powerful GPUs), start an interactive job (see Interactive jobs).
- For determining resources of larger jobs, you can submit a single (short) test job (see Submitting jobs)
- QoS:
- A Quality of Service (QoS) is a set of limits that controls what resources a job can use and determines the priority level of a job. DAIC adopts multiple QoSs to optimize the throughput of job scheduling and to reduce the waiting times in the cluster (see Quality of Service).
- The DAIC QoS limits are set by the DAIC user board, and the scheduler strictly enforces these limits. Thus, no user can use more resources than the amount that was set by the user board.
- Any (perceived) imbalance in the use of resources by a certain QoS or user should not be held against a user or the scheduler, but should be discussed in the user board.
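To inspect the QoS levels defined on the cluster and their limits yourself, you can query Slurm’s accounting database (a minimal sketch; the exact field names may vary with the Slurm version):
sacctmgr show qos format=Name,Priority,MaxWall,MaxTRESPU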
General cluster usage
- You may use cluster resources for your research within the QoS restrictions of your domain user and user group. Depending on your user group, you might be eligible to use specific partitions, giving higher priorities on certain nodes. See Priority tiers, and please check this with your lab.
- Depending on your user group, you might be eligible to get priorities on certain nodes. For example, you might have access to a specialized partition or a limited-time node reservation for your group or department (for example before a conference deadline). Please check this with your lab and try to use these in your *.sbatch file; your jobs should then start faster! See Resources Reservations for more information.
- In general, you will be informed about standard administrative actions on the cluster. All official DAIC cluster e-mails are sent to your official TU Delft mailbox, so it is advised to check it regularly.
- You will receive e-mails about downtimes relating to scheduled maintenance.
- You, or your supervisor, will receive e-mails about scheduled cluster user board meetings where any updates and changes to the cluster structure, software, or hardware will be announced. Please check with your lab or feel free to join the cluster board meetings if you want to be up-to-date about any changes.
- You will receive automated e-mails regarding the efficiency of your jobs. The cluster monitors the use of resources of all jobs. When certain specific inefficiencies are detected for a significant number of jobs in the same day, an automated efficiency mail is sent to inform you about these problems with your resource use, to help you optimize your jobs. These mails will not lead to automatic cancellations or bans. To avoid spamming, limited inefficient use will not trigger a mail.
- You will receive an e-mail when your jobs are canceled or you receive a cluster ban (see the Expectations from cluster users and Regulations sections). You will be informed about why your jobs were canceled or why you were banned from the cluster (often before the bans take place). If the problem is still not clear to you from the e-mails you already received, please follow the steps detailed in the What to do in case of problems section.
- You are not entitled to receive personalized help on how to debug your code via e-mail. It is your responsibility to solve technical problems stemming from your code. Please first consult with your lab for a solution to a technical problem (see What to do in case of problems). However, admins might offer help, advice and solutions along with information regarding a job cancellation or ban. Please listen to such advice, it might help you solve your problem and improve fair use of the cluster.
- You may join cluster user board meetings. In the meetings you will be informed of any new developments, hardware and software updates and can suggest changes and improvements. These meetings take place roughly every 3 months and will be announced by e-mail and on the MatterMost channel.
Expectations from cluster users
- You are responsible for your jobs not interfering with other users’ cluster usage. Please try to always keep in mind that cluster resources are limited and shared between all users, and that fair use benefits everyone.
- You are not allowed to use the cluster for reasons unrelated to your studies and research.
- If your jobs are destructive to other users’ jobs or are threatening cluster integrity, your jobs might be canceled. You have the responsibility at all times to avoid behavior which interferes negatively with other users’ cluster usage. See Regulations.
- If the destructive behavior of your jobs does not change over time or you are unresponsive to e-mails from system admins requesting information or requiring immediate action regarding your cluster use, you might receive a ban from the cluster. See Regulations.
Regulations
- Your jobs might be canceled if:
- The node your jobs are running on becomes unresponsive and the node is automatically restarted.
- The job is overloading the node (for example overloading the network communication of the node).
- The job is adversely affecting the execution of other jobs (jobs that are not using all requested resources (effectively) and thus unfairly block waiting jobs from running may also be canceled).
- The jobs ignore the directions from the administrators (for example if a job is (still) affected by the same problem that the administrators informed you about before, and asked you to fix and test before resubmitting).
- The job is showing clear signs of a problem (like hanging, or being idle, or using only 1 CPU of the multiple CPUs requested, or not using a GPU that was requested).
- You might receive a cluster access ban for:
- Disallowed use of the cluster, including disallowed use of computing time, purposefully ignoring directions, guidelines, fair-use principles and/or (trying to gain) unauthorized access and/or causing disruptions to the cluster or parts thereof (even if unintentional).
- Unresponsiveness to e-mails from system admins requesting information or requiring immediate action regarding your cluster use.
- Repeated problems caused by your cluster use which go unsolved even after attempts to resolve the issue.
- Your cluster use privileges will be returned when all parties are confident that you understand the problem and it won’t reoccur.
- Your jobs won’t be canceled for:
- Scheduled maintenance. This is planned in advance and jobs that would run during scheduled maintenance times won’t start until the end of maintenance.
What to do in case of problems?
When you encounter problems, please follow the subsequent steps, in the indicated order:
- First, please contact your colleagues and fellow cluster users in your lab, concerning problems with your code, job performance and efficiency. They may be running similar jobs and potentially have solutions for your problem.
- You can also ask questions to fellow users on the MatterMost channel.
- For prolonged problems, your initial contact point is your supervisor/PI.
- As a final step, you can contact the cluster administrators for technical sysadmin problems or persistent efficiency problems, or for more information if you are not sure why you are banned from the cluster. You can do this by reporting your question, through the Self Service Portal , to the Service Desk. In your question, refer to the ‘DAIC cluster’.
- For severe recurring problems, complaints and suggestions for policy changes, or issues affecting multiple users, you can contact the DAIC advisory board to bring it up as an agenda point in the next user board meeting.
Responsible cluster usage
You are responsible for ensuring that your jobs run efficiently:
- Please keep an eye on your jobs and the automated efficiency e-mails to check for unexpected behavior.
- Sometimes many jobs from the same user, or from student groups, will be running on many nodes at the same time. While this may seem like one user, or user group, is blocking the cluster for everyone else, please keep in mind that the scheduler operates on a set of predetermined rules based on the QoS and priority settings. We do not want idle resources. Therefore, at the time that those jobs were started, the resources were idle, no higher priority jobs were in the queue and the jobs did not exceed the QoS limits. If you repeatedly observe pending jobs, please bring it up in the user board meeting.
- Short job efficiency: If you are running many (hundreds or thousands of) very short jobs (a few minutes each), keep in mind that starting each job and individually loading the same modules every time creates overhead. When reasonably possible, it might save computation time to instead group some jobs together, as sketched below. The grouped jobs can still be submitted to the short queue if the runtime is less than 4 hours.
- GPU job efficiency: If you are running multi-GPU jobs (for example due to GPU memory limitations), keep in mind that the communication between the GPUs and other CPU processes (for example data loaders) may create overhead. It might be useful to run jobs on fewer GPUs with more GPU memory each, or to take advantage of specialized libraries optimized for multi-GPU computing in your code.
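A minimal sketch of grouping several short tasks into one job script (the module name, input pattern and my_analysis.py script are placeholders for illustration):
#!/bin/bash
#SBATCH --job-name=grouped-short-tasks
#SBATCH --qos=short
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G

# Load the shared modules once, instead of once per short job
module load python   # placeholder module name

# Run several short tasks back-to-back inside one allocation
for input in data/input_*.txt; do
    python my_analysis.py "$input"   # placeholder script and inputs
done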
Citing and acknowledging DAIC
Please cite and acknowledge DAIC in your scientific publications using the format specified in the Citation and Acknowledgement section.
Reporting of scientific outputs
Please remember to post any scientific output based on work performed on DAIC to the ScientificOutput MatterMost channel.
Access and accounts
- DAIC is a cluster dedicated to TU Delft researchers (e.g., PhD students, postdocs, etc.) from participating groups (see Contributing departments).
- To access DAIC resources, eligible candidates from these groups can request an account via the DAIC request Access form.
- Additionally, requests for resource reservations can also be accommodated (see General cluster usage).
3 - System specifications
At present, DAIC and DelftBlue have different software stacks. This pertains to the operating system (CentOS 7 vs Red Hat Enterprise Linux 8, respectively) and, consequently, the available software. Please refer to the respective DelftBlue modules and Software sections before commencing your experiments.
Operating System
DAIC runs the Red Hat Enterprise Linux 7 distribution, which provides the general Linux software. Most common software, including programming languages, libraries and development files for compiling your own software, is installed on the nodes (see Available software). However, a less common program that you need might not be installed. Similarly, if your research requires a state-of-the-art program that is not (yet) available as a package for Red Hat 7, it will not be available on the nodes by default. See Installing software for more information.
Login Nodes
The login nodes are the gateway to the DAIC HPC cluster and are specifically designed for lightweight tasks such as job submission, file management, and compiling code (on certain nodes). These nodes are not intended for running resource-intensive jobs, which should be submitted to the Compute Nodes.
Specifications and usage notes
Hostname | CPU (Sockets x Model) | Total Cores | Total RAM | Operating System | GPU Type | GPU Count | Usage Notes |
---|---|---|---|---|---|---|---|
login1 | 1 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 8 | 15.39 GB | OpenShift Enterprise | Quadro K2200 | 1 | For file transfers, job submission, and lightweight tasks. |
login2 | 1 x Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz | 1 | 3.70 GB | OpenShift Enterprise | N/A | N/A | Virtual server, for non-intensive tasks. No compilation. |
login3 | 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz | 32 | 503.60 GB | RHEV | Quadro K2200 | 1 | For large compilation and interactive sessions. |
Compute Nodes
DAIC compute nodes are all multi CPU servers, with large memories, and some with GPUs. The nodes in the cluster are heterogeneous, i.e. they have different types of hardware (processors, memory, GPUs), different functionality (some more advanced than others) and different performance characteristics. If a program requires specific features, you need to specifically request those for that job (see Submitting jobs).
Note
All compute nodes have Advanced Vector Extensions 1 and 2 (AVX, AVX2) support, and hyper-threading (ht) processors (two CPUs per core, always allocated in pairs).
Note
You can use Slurm’s sinfo command to get various information about cluster nodes. For example, to get an overview of compute nodes on DAIC, you can use the command:
$ sinfo --all --format="%P %N %c %m %G %b" --hide -S P,N -a | grep -v "general" | awk 'NR==1 {print; next} {match($5, /gpu:[^,]+:[0-9]+/); if (RSTART) print $1, $2, $3, $4, substr($5, RSTART, RLENGTH), $6; else print $1, $2, $3, $4, "-", $6 }'
Check out Slurm’s sinfo page and Wikipedia’s awk page for more info on these commands.
List of all nodes
The following table gives an overview of current nodes and their characteristics:
Hostname | CPU (Sockets x Model) | Cores per Socket | Total Cores | CPU Speed (MHz) | Total RAM | GPU Type | GPU Count |
---|---|---|---|---|---|---|---|
100plus | 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz | 16 | 32 | 2097.488 | 755.585 GB | ||
3dgi1 | 1 x AMD EPYC 7502P 32-Core Processor | 32 | 32 | 2500 | 251.41 GB | ||
3dgi2 | 1 x AMD EPYC 7502P 32-Core Processor | 32 | 32 | 2500 | 251.41 GB | ||
awi01 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 2996.569 | 376.384 GB | Tesla V100 PCIe 32GB | 1 |
awi02 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2900.683 | 503.619 GB | Tesla V100 SXM2 16GB | 2 |
awi03 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi04 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 3231.884 | 503.625 GB | ||
awi05 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 3258.984 | 503.625 GB | ||
awi07 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi08 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi09 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi10 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi11 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi12 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi19 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi20 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi21 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi22 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi23 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 3221.038 | 376.385 GB | ||
awi24 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 2580.2 | 376.385 GB | ||
awi25 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 3399.884 | 376.385 GB | ||
awi26 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 3442.7 | 376.385 GB | ||
cor1 | 2 x Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz | 16 | 64 | 3599.975 | 1510.33 GB | Tesla V100 SXM2 32GB | 8 |
gpu01 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu02 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu03 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu04 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu05 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu06 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu07 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu08 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu09 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu10 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu11 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu14 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.613 | 503.275 GB | NVIDIA A40 | 3 |
gpu15 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.938 | 503.275 GB | NVIDIA A40 | 3 |
gpu16 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.604 | 503.275 GB | NVIDIA A40 | 3 |
gpu17 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.878 | 503.275 GB | NVIDIA A40 | 3 |
gpu18 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.57 | 503.275 GB | NVIDIA A40 | 3 |
gpu19 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.682 | 503.275 GB | NVIDIA A40 | 3 |
gpu20 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.651 | 1007.24 GB | NVIDIA A40 | 3 |
gpu21 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.646 | 1007.24 GB | NVIDIA A40 | 3 |
gpu22 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.963 | 1007.24 GB | NVIDIA A40 | 3 |
gpu23 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.658 | 1007.24 GB | NVIDIA A40 | 3 |
gpu24 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.664 | 1007.24 GB | NVIDIA A40 | 3 |
grs1 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3499.804 | 251.633 GB | ||
grs2 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3577.734 | 251.633 GB | ||
grs3 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3499.804 | 251.633 GB | ||
grs4 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3499.804 | 251.633 GB | ||
influ1 | 2 x Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz | 16 | 32 | 2955.816 | 376.391 GB | GeForce RTX 2080 Ti | 8 |
influ2 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 187.232 GB | GeForce RTX 2080 Ti | 4 |
influ3 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 187.232 GB | GeForce RTX 2080 Ti | 4 |
influ4 | 2 x AMD EPYC 7452 32-Core Processor | 32 | 64 | 1500 | 251.626 GB | ||
influ5 | 2 x AMD EPYC 7452 32-Core Processor | 32 | 64 | 2350 | 503.611 GB | ||
influ6 | 2 x AMD EPYC 7452 32-Core Processor | 32 | 64 | 1500 | 503.61 GB | ||
insy15 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 754.33 GB | GeForce RTX 2080 Ti Rev. A | 4 |
insy16 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 754.33 GB | GeForce RTX 2080 Ti Rev. A | 4 |
Total | 1206 | 2380 | 28 TB | 101 | |||
CPUs
All nodes have multiple Central Processing Units (CPUs) that perform the operations. Each CPU can process one thread (i.e. a separate string of computer code) at a time. A computer program consists of one or multiple threads, and thus needs one or multiple CPUs simultaneously to do its computations (see wikipedia's CPU page ).
Note
Most programs use a fixed number of threads. Requesting more CPUs for a program than its number of threads will not make it any faster, because it won’t know how to use the extra CPUs. When a program has fewer CPUs available than its number of threads, the threads will have to time-share the available CPUs (i.e., each thread only gets part-time use of a CPU), and, as a result, the program will run slower (and even slower because of the added overhead of switching between threads). So it is always necessary to match the number of CPUs to the number of threads, or the other way around. See Submitting jobs for setting resources for batch jobs.
The number of threads running simultaneously determines the load of a server. If the number of running threads is equal to the number of available CPUs, the server is loaded 100% (or 1.00). When the number of threads that want to run exceeds the number of available CPUs, the load rises above 100%.
The CPU functionality is provided by the hardware cores in the processor chips in the machines. Traditionally, one physical core contained one logical CPU, thus the CPUs operated completely independent. Most current chips feature hyper-threading: one core contains two (or more) logical CPUs. These CPUs share parts of the core and the cache, so one CPU may have to wait when a shared resource is in use by the other CPU. Therefore these CPUs are always allocated in pairs by the job scheduler.
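For example, a common way to match the number of threads to the number of allocated CPUs in a job script is shown below (a minimal sketch, assuming your program honours the OMP_NUM_THREADS convention; the program name is a placeholder):
#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Use exactly as many threads as CPUs were allocated by Slurm
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_multithreaded_program   # placeholder executable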
GPUs
A few types of GPUs are available in some of the DAIC nodes, as shown in table 1. The total numbers of these GPUs per type and their technical specifications are shown in table 2. See Using graphic cards for requesting GPUs for a computational job.
GPU (slurm) type | Count | Model | Architecture | Compute Capability | CUDA cores | Memory |
---|---|---|---|---|---|---|
a40 | 66 | NVIDIA A40 | Ampere | 8.6 | 10752 | 46068 MiB |
turing | 24 | NVIDIA GeForce RTX 2080 Ti | Turing | 7.5 | 4352 | 11264 MiB |
v100 | 11 | Tesla V100-SXM2-32GB | Volta | 7.0 | 5120 | 32768 MiB |
In table 2, the headers denote:
Model: The official product name of the GPU.
Architecture: The hardware design used, and thus the hardware specifications and performance characteristics of the GPU. Each new architecture brings forward a new generation of GPUs.
Compute capability: Determines the general functionality, available features and CUDA support of the GPU. A GPU with a higher capability supports more advanced functionality.
CUDA cores: The number of cores that perform the computations: the more cores, the more work can be done in parallel (provided that the algorithm can make use of higher parallelization).
Memory: Total installed GPU memory. The GPUs provide their own internal (fixed-size) memory for storing data for GPU computations. All required data needs to fit in the internal memory or your computations will suffer a big performance penalty.
Note
To inspect a given GPU and obtain the data of table 2, you can run the following commands in an interactive session or an sbatch script (see Jobs on GPU resources). The apptainer image used in this code snippet was built as demonstrated in the Apptainer tutorial.
$ sinteractive --cpus-per-task=2 --mem=500 --time=00:02:00 --gres=gpu
Note: interactive sessions are automatically terminated when they reach their time limit (1 hour)!
srun: job 8607783 queued and waiting for resources
srun: job 8607783 has been allocated resources
15:50:29 up 51 days, 3:26, 0 users, load average: 60,33, 59,72, 54,65
SomeNetID@influ1:~$ nvidia-smi --format=csv,noheader --query-gpu=name
NVIDIA GeForce RTX 2080 Ti
SomeNetID@influ1:~$ nvidia-smi -q | grep Architecture
Product Architecture : Turing
SomeNetID@influ1:~$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
7.5
SomeNetID@influ1:~$ apptainer run --nv cuda_based_image.sif | grep "CUDA Cores" # using the apptainer image of the tutorial
(068) Multiprocessors, (064) CUDA Cores/MP: 4352 CUDA Cores
SomeNetID@influ1:~$ nvidia-smi --format=csv,noheader --query-gpu=memory.total
11264 MiB
SomeNetID@influ1:~$ exit
Memory
All machines have large main memories for performing computations on big data sets. A job cannot use more than its allocated amount of memory; if it needs more, it will fail or be killed. It is not possible to combine the memory from multiple nodes for a single task. 32-bit programs can only address (use) up to 3 GB (gigabytes) of memory. See Submitting jobs for setting resources for batch jobs.
Storage
DAIC compute nodes have direct access to the TU Delft home, group and project storage. You can use your TU Delft installed machine or an SCP or SFTP client to transfer files to and from these storage areas and others (see Data transfer), as is demonstrated throughout this page.
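For example, a minimal sketch of copying a file from your local machine into a project storage folder with scp (the file name and project path are placeholders):
scp results.tar.gz <YourNetID>@login.daic.tudelft.nl:/tudelft.net/staff-umbrella/<project>/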
File System Overview
Unlike TU Delft’s DelftBlue, DAIC does not have a dedicated storage filesystem. This means there is no /scratch space for storing temporary files (see DelftBlue’s Storage description and Disk quota and scratch space). Instead, DAIC relies on a direct connection to the TU Delft network storage filesystem (see Overview data storage) from all its nodes, and offers the following types of storage areas:
Personal storage (aka home folder)
The Personal Storage is private and is meant to store personal files (program settings, bookmarks). A backup service protects your home files from both hardware failures and user error (you can restore previous versions of files from up to two weeks ago). The available space is limited by a quota limit (since this space is not meant to be used for research data).
You have two (separate) home folders: one for Linux and one for Windows (because Linux and Windows store program settings differently). You can access these home folders from a machine (running Linux or Windows OS) using a command line interface or a browser via TU Delft’s webdata. For example, the Windows home has a My Documents folder. My Documents can be found on a Linux machine under /winhome/<YourNetID>/My Documents.
Home directory | Access from | Storage location |
---|---|---|
Linux home folder | ||
Linux | /home/nfs/<YourNetID> | |
Windows | only accessible using an scp/sftp client (see SSH access) | |
webdata | not available | |
Windows home folder | ||
Linux | /winhome/<YourNetID> | |
Windows | H: or \\tudelft.net\staff-homes\[a-z]\<YourNetID> | |
webdata | https://webdata.tudelft.nl/staff-homes/[a-z]/<YourNetID> |
It’s possible to access the backups yourself. In Linux the backups are located under the (hidden, read-only) ~/.snapshot/ folder. In Windows you can right-click the H: drive and choose Restore previous versions.
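For example, to list the available snapshots of your Linux home folder and copy a file back from one of them (a minimal sketch; the snapshot and file names are placeholders):
ls ~/.snapshot/
cp ~/.snapshot/<snapshot_name>/myfile.txt ~/myfile.txt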
Note
To see your disk usage, run something like:
du -h '</path/to/folder>' | sort -h | tail
Group storage
The Group Storage is meant to share files (documents, educational and research data) with department/group members. The whole department or group has access to this storage, so this is not for confidential or project data. There is a backup service to protect the files, with previous versions up to two weeks ago. There is a Fair-Use policy for the used space.
Destination | Access from | Storage location |
---|---|---|
Group Storage | ||
Linux | /tudelft.net/staff-groups/<faculty>/<department>/<group> or | |
/tudelft.net/staff-bulk/<faculty>/<department>/<group>/<NetID> | ||
Windows | M: or \\tudelft.net\staff-groups\<faculty>\<department>\<group> or | |
L: or \\tudelft.net\staff-bulk\ewi\insy\<group>\<NetID> | ||
webdata | https://webdata.tudelft.nl/staff-groups/<faculty>/<department>/<group>/ |
Project Storage
The Project Storage is meant for storing (research) data (datasets, generated results, download files and programs, …) for projects. Only the project members (including external persons) can access the data, so this is suitable for confidential data (but you may want to use encryption for highly sensitive confidential data). There is a backup service and a Fair-Use policy for the used space.
Project leaders (or supervisors) can request a Project Storage location via the Self-Service Portal or the Service Desk .
Destination | Access from | Storage location |
---|---|---|
Project Storage | ||
Linux | /tudelft.net/staff-umbrella/<project> | |
Windows | U: or \\tudelft.net\staff-umbrella\<project> | |
webdata | https://webdata.tudelft.nl/staff-umbrella/<project> or … |
Tip
Data deleted from project storage, staff-umbrella, remains in a hidden .snapshot folder. If accidentally deleted, you can recover such data by copying it from the (hidden) .snapshot folder in your storage.
Local Storage
Local storage is meant for temporary storage of (large amounts of) data with fast access on a single computer. You can create your own personal folder inside the local storage. Unlike the network storage above, local storage is only accessible on that computer, not on other computers or through network file servers or webdata. There is no backup service nor quota. The available space is large but fixed, so leave enough space for other users. Files under /tmp that have not been accessed for 10 days are automatically removed.
Destination | Access from | Storage location |
---|---|---|
Local storage | ||
Linux | /tmp/<NetID> | |
Windows | not available | |
webdata | not available |
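For example, a minimal sketch of staging data into your personal folder in local storage on a compute node (the project path and file names are placeholders):
mkdir -p /tmp/$USER
cp /tudelft.net/staff-umbrella/<project>/dataset.tar /tmp/$USER/
# ... run your computations against the local copy, then clean up ...
rm /tmp/$USER/dataset.tar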
Memory Storage
Memory storage is meant for short-term storage of limited amounts of data with very fast access on a single computer. You can create your own personal folder inside the memory storage location. Memory storage is only accessible on that computer, and there is no backup service nor quota. The available space is limited and shared with programs, so leave enough space (the computer will likely crash when you don’t!). Files that have not been accessed for 1 day are automatically removed.
Destination | Access from | Storage location |
---|---|---|
Memory storage | ||
Linux | /dev/shm/<NetID> | |
Windows | not available | |
webdata | not available |
Warning
Use this only when using other storage makes your job or the whole computer slow.
Workload scheduler
DAIC uses the Slurm scheduler to efficiently manage workloads. All jobs for the cluster have to be submitted as batch jobs into a queue. The scheduler then manages and prioritizes the jobs in the queue, allocates resources (CPUs, memory) for the jobs, executes the jobs and enforces the resource allocations. See the job submission pages for more information.
A Slurm-based cluster is composed of a set of login nodes that are used to access the cluster and submit computational jobs. A central manager orchestrates computational demands across a set of compute nodes. These nodes are organized logically into groups called partitions, which define job limits or access rights. The central manager provides fault-tolerant hierarchical communications to ensure optimal and fair use of available compute resources by eligible users, and to make it easier to run and schedule complex jobs across compute resources (multiple nodes).
4 - User manual
4.1 - Best practices
The available processing power and memory in DAIC are large, but still limited. You should use the available resources efficiently and fairly. This page lays out a few general principles and guidelines for considerate use of DAIC.
Using shared resources
The computing nodes within DAIC are primarily meant to run large, long (non-interactive) jobs. You share these resources with other users across departments. Thus, you need to be cautious of your usage so you do not hinder other users.
To help protect the active jobs and resources, when a login node becomes overloaded, new logins to this node are automatically disabled. This means that you will sometimes have to wait for other jobs to finish and at other times ICT may have to kill a job to create space for other users.
One rule: Respect your fellow users.
Implication: we reserve the right to terminate any job or process that we feel is clearly interfering with the ability of others to complete work, regardless of technical measures or its resource usage.
Best practices
- Connect only directly from the bastion server to the login nodes (See Connecting to DAIC)
- Always choose the login node with the lowest use (most importantly system load and memory usage), by checking the Current resource usage page or the servers command for information.
- Each login node displays a message at login. Make sure you understand it before proceeding. This message includes the current load of the node, so look at it at every login.
- Only use the storage best suited to your files (See Storage).
- Do interactive code development, debugging and testing on your local machine, as much as possible. In the cluster, try to organize your code as scripts, instead of working interactively in the command line.
- If you need to test and debug in the cluster, for example in a GPU node, request an interactive session and do not work on the login node itself (See Interactive jobs on compute nodes).
- Save results frequently: your job can crash, the compute node can become overloaded, or the network shares can become unavailable.
- Write your code in a modular way, so that you can continue the job from the point where it last crashed.
- Actively monitor the status of your jobs:
  - Make sure your job runs normally and is not hindering other jobs. Check the following at the start of a job and thereafter at least twice a day:
    - If your job is not working correctly (or halted) because of a programming error, terminate it immediately; debug and fix the problem instead of just trying again (the result will almost certainly be exactly the same).
    - If your screen’s Kerberos ticket has expired, renew it so your job can successfully save its results.
    - Use the top program to monitor the CPU (%CPU) and memory (%MEM) usage of your code, as sketched after this list. If either is too high, kill your code so it doesn’t cause problems for other users.
    - Don’t leave top running unless you are continuously watching it; press q to quit.
    - Watch the current resource usage (see the Current resource usage page or use the servers command), and if the node is running close to its limits (higher than 90% load or memory, swap or disk usage), consider moving your job to a less busy node.
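For example, to watch only your own processes on the node you are logged in to, or to take a one-shot, non-interactive snapshot (a minimal sketch using standard top options):
top -u $USER                       # interactive view of your own processes; press q to quit
top -b -n 1 -u $USER | head -n 20  # one-shot snapshot, e.g. for logging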
Computing on login nodes
You can use login nodes for basic tasks like compiling software, preparing submission scripts for the batch queue, submitting and monitoring jobs in the batch queue, analyzing results, and moving data or managing files.
Small-scale interactive work may be acceptable on login nodes if your resource requirements are minimal.
- Please do not run production research computations on the login nodes. If you need more resources, request an interactive session on a compute node instead (See Interactive jobs on compute nodes).
Note
Most multi-threaded applications (such as Java and Matlab) will automatically use all CPU cores of a node, and thus take away processing power from other jobs. If you can specify the number of threads, set it to at most 25% (¼) of the cores in that node (for a node with 16 cores, use at most 4; this leaves enough processing capacity for other users). Also see How do I request CPUs for a multithreaded program?
4.2 - Handy commands on DAIC
BASH commands
BASH (Bourne Again SHell) is an open-source Unix shell and command language. It is the default shell on many Linux distributions and macOS, and it’s available on Windows via the Windows Subsystem for Linux, Git BASH, and other emulators. BASH is widely used for scripting and automating tasks in a computing environment. Below are some fundamental BASH commands with examples and brief explanations, aiding users in effective navigation and task execution. Remember to use these commands carefully, especially those that can modify or delete files and directories. They are fundamental tools for interacting with BASH and managing your tasks effectively.
man
The man
command is a tool for displaying the manual pages (documentation) of various commands and utilities available on Unix-like operating systems. It is an essential resource for users seeking detailed information about a specific command, program, or configuration file.
Basic Usage
Display the manual page for a command:
man <command>
This displays the manual page for the specified command.
Examples
Show the manual page for the ls
command:
man ls
Show the manual page for the man
command:
man man
echo
Used for displaying a line of text/string that is passed as an argument. This is a fundamental command for displaying output in shell scripts.
Example: Display “Hello, World!”.
echo "Hello, World!"
cd
Changes the current directory to another directory. It’s a basic command to navigate through the filesystem.
Example: Change to the home directory.
cd ~
ls
Lists the contents of a directory. It’s a key command to view files and directories.
Example: List all files and directories in the current directory, including hidden files.
ls -a
tree
The tree
command is a utility that displays the directory structure of a path in a tree-like format. It provides a visual representation of the hierarchy of files and directories, making it easier to understand the organization of a file system.
Basic Usage
Display the directory tree structure:
tree [path]
This command displays the directory structure starting from the specified path or the current directory if no path is specified.
Options
- -a: Display all files and directories, including hidden ones (those starting with a dot).
- -d: Display only directories, omitting files.
- -L level: Limit the depth of the tree to the specified level.
- --noreport: Suppress the file and directory count summary at the end of the output.
- -H baseHREF: Create an HTML output starting with the specified base URL.
- -o filename: Output the tree structure to a file with the specified name.
- --charset encoding: Use the specified character encoding (e.g., UTF-8).
- -P pattern: Only display files matching the specified pattern (e.g., *.txt).
- -I pattern: Exclude files and directories matching the specified pattern (e.g., *.bak).
Examples
Display the directory tree structure starting from the current directory:
tree
Display the directory tree structure from a specific path:
tree /path/to/start
Display only directories in the tree structure:
tree -d
Display the tree structure and limit the depth to 2 levels:
tree -L 2
Display the tree structure and output it to a file:
tree -o output.txt
Display all files and directories, including hidden ones:
tree -a
The tree
command is a helpful tool for quickly understanding the layout of a directory and its contents. It is especially useful for navigating complex file systems and identifying the location of files and directories within a hierarchy.
which
The which
command shows the full path of a command’s executable file by searching the directories listed in the PATH
environment variable.
Basic Usage
Find the path of a command:
which command
This displays the full path of the specified command’s executable file.
Examples
Find the path of the ls command:
which ls
Find the path of the python command:
which python
whereis
The whereis
command locates not only the executable file but also the source and manual page files of a command, if available.
Basic Usage
Locate a command:
whereis command
This displays the paths to the executable, source, and manual page files of the specified command, if they exist.
Options
- -b: Search only for binaries (executable files).
- -m: Search only for manual pages.
- -s: Search only for source files.
- -u: Search for any missing information (binaries, source, or manual) and report it.
- -B path: Add a directory to the search path for binaries.
- -M path: Add a directory to the search path for manual pages.
- -S path: Add a directory to the search path for source files.
Examples
Locate the ls command:
whereis ls
Locate only the source files of the gcc command:
whereis -s gcc
cat
Concatenates and displays file contents. It’s commonly used to view the contents of a file.
Example: Display the contents of a file named example.txt.
cat example.txt
grep
Searches for patterns in files. It’s a powerful tool for searching text using patterns.
Example: Search for the word “example” in file.txt.
grep "example" file.txt
find
Searches for files in a directory hierarchy. This command is essential for locating files and directories.
Example: Find all .txt files in the current directory.
find . -name "*.txt"
mkdir
Creates a new directory.
Example: Create a directory named new_directory.
mkdir new_directory
rm
Removes files or directories. It’s a critical command for file management.
Example 1: Remove a file named example.txt.
rm example.txt
Example 2: Remove a directory and its contents (recursively).
rm -r directory_name
Warning: Be extremely cautious with rm -r, especially when used with . (current directory) or .. (parent directory), as this can lead to irreversible deletion of files. Never use rm -r . in a directory unless you are absolutely sure about deleting all its contents.
cp
Copies files and directories.
Example: Copy file1.txt to file2.txt.
cp file1.txt file2.txt
mv
Moves or renames files and directories.
Example: Rename oldname.txt to newname.txt.
mv oldname.txt newname.txt
for, do, done
A for loop in Bash allows you to iterate over a list of items, such as an array, a set of files, or even a range of numbers. Below are a few examples of how you can use a for loop in Bash.
Iterating over a list of strings
In this example, the for loop iterates over a list of strings and prints each one:
# List of items
items=("apple" "banana" "cherry")
# Loop through each item
for item in "${items[@]}"; do
echo "Item: $item"
done
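As a further illustration (a minimal sketch; the file pattern is a placeholder), a loop over a range of numbers and a loop over a set of files:
# Loop over a range of numbers
for i in {1..5}; do
    echo "Number: $i"
done

# Loop over all .txt files in the current directory
for file in *.txt; do
    echo "Processing $file"
done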
if, (else), then
The if
statement in Bash scripting is used to execute a block of code conditionally based on whether an expression evaluates to true or false. Below are examples of how you can use an if statement in Bash:
filepath="/path/to/file.txt"
if [ -f "$filepath" ]; then
echo "The file exists."
else
echo "The file does not exist."
fi
alias
In Bash, an alias
is a shortcut for a command. You can define an alias to simplify the execution of commonly used commands or to add default options to commands you frequently use. Here are some examples of how to create and use aliases in Bash:
Creating a simple alias
You can create an alias by using the alias command followed by the alias name and the command it represents. Here’s an example of a simple alias:
alias ll="ls -l"
Another commonly used alias is md as a shortcut for mkdir:
alias md="mkdir"
You can add these instructions to your .bashrc file in order to load them when logging in to the cluster.
Slurm commands
SLURM (Simple Linux Utility for Resource Management) is an open-source job scheduler used on many of the world’s supercomputers and compute clusters. It allows users to efficiently manage computing resources and queue their computational jobs for execution. Below are some essential SLURM commands with examples and brief explanations, helping users navigate and utilize these resources effectively. Remember to replace <jobid>
with your specific job ID where necessary. These commands are vital tools for interacting with SLURM and managing your compute tasks effectively.
sinteractive
For requesting an interactive node, typically during testing phases. Compute resources such as memory, time, and GPUs are specified as part of the command, similar to sbatch
directives.
Example: Request a 10-minute GPU node session.
sinteractive --time=00:10:00 --gres=gpu
sbatch
Used for submitting a script to SLURM for queuing in batch mode. The script includes directives at the top to specify required resources.
Example: Submit a job using a script named script.sh.
sbatch script.sh
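For illustration, a minimal sketch of what script.sh could look like (the QoS, resource values, module and program names are placeholders to adapt to your own job; see Submitting jobs):
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --qos=short
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --output=slurm-%j.out

# Load required software (placeholder module name)
module load python

# Run the actual computation (placeholder script)
srun python my_script.py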
squeue
Checks the status of jobs in the SLURM queue. Useful for tracking your job’s status and understanding the queue’s state, and to find a specific jobid
of a particular job.
Example: Check the status of all your queued jobs.
squeue -u $USER
scancel
Cancels a job or all jobs of a user. Vital for managing jobs that are no longer needed or were submitted in error.
Example 1: Cancel a specific job with job ID <jobid>.
scancel <jobid>
Example 2: Cancel all jobs for the current user.
scancel -u $USER
slurmtop
A DAIC-specific command to view the top jobs in the queues and their resource usage.
Example:
slurmtop
scontrol
Shows detailed information and resources allocated to the job with the specified SLURM job ID.
Example: Show details of a job with job ID <jobid>.
scontrol show job <jobid>
sinfo
Displays information about SLURM nodes and partitions. Key command for understanding the state of the cluster.
Example: Display information about all nodes and partitions.
sinfo
sacct
Displays accounting data for all jobs and job steps. Useful for tracking resource usage and performance metrics. Example: Display accounting data for all jobs.
sacct --format=JobID,JobName%30,State,Elapsed,Timelimit,AllocNodes,Priority,Start,NodeList
Other
module
In the context of Unix-like operating systems, the module
command is part of the environment modules system, a tool that provides a dynamic approach to managing the user environment. This system allows users to load and unload different software packages or environments on demand.
Basic Usage
Load a module:
module load module-name
This command loads the specified module, setting up the environment variables and paths needed for the software package.
Unload a module:
module unload module-name
This command unloads the specified module, removing any environment variables and paths associated with it.
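Besides loading and unloading, the environment modules system also lets you list modules (a minimal sketch):
module avail   # list all modules available on the system
module list    # list the modules currently loaded in your session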
For a more detailed description of module, see Modules.
4.3 - Connecting to DAIC
SSH access
If you have a valid DAIC account (see Access and accounts), you can access DAIC resources using an SSH client. SSH (Secure SHell) is a protocol that allows you to connect to a remote computer via a secure network connection. SSH supports remote command-line login and remote command execution. SCP (Secure CoPy) and SFTP (Secure File Transfer Protocol) are file transfer protocols based on SSH (see wikipedia's ssh page ).
SSH clients
Most modern operating systems like Linux, macOS, and Windows 10 include SSH, SCP, and SFTP clients (part of the OpenSSH package) by default. If not, you can install third-party programs like:
MobaXterm, PuTTY, or FileZilla.
Access from the TU Delft Network
To connect to DAIC from within the TU Delft network (i.e., via eduroam or a wired connection), open a command-line interface (prompt, or terminal, see Wikipedia's CLI page), and run the following command:
$ ssh <YourNetID>@login.daic.tudelft.nl # Or
$ ssh login.daic.tudelft.nl # If your username matches your NetID
<YourNetID>
is your TU Delft NetID. If the username on the machine you are connecting from matches your NetID, you can omit the [<YourNetID>@] part, as in the second command above.
This will log you in to DAIC's login1.daic.tudelft.nl
node for now. Note that this setup might change in the future as the system undergoes migration, potentially reducing the number of login nodes.
Note
Currently DAIC has 3 login nodes: login1.daic.tudelft.nl, login2.daic.tudelft.nl, and login3.daic.tudelft.nl. You can connect to any of these nodes directly as per your needs. For more on the choice of login nodes, see DAIC login nodes.
Note
Upon first connection to an SSH server, you will be prompted to confirm the server’s identity, with a message similar to:
The authenticity of host 'login.daic.tudelft.nl (131.180.183.244)' can't be established.
ED25519 key fingerprint is SHA256:MURg8IQL8oG5o2KsUwx1nXXgCJmDwHbttCJ9ljC9bFM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'login.daic.tudelft.nl' (ED25519) to the list of known hosts.
A distinct fingerprint will be shown for each login node, as below:
SHA256:MURg8IQL8oG5o2KsUwx1nXXgCJmDwHbttCJ9ljC9bFM
SHA256:MURg8IQL8oG5o2KsUwx1nXXgCJmDwHbttCJ9ljC9bFM
SHA256:O3AjQQjCfcrwJQ4Ix4dyGaUoYiIv/U+isMT5+sfeA5Q
Once the server's identity is confirmed, enter your password when prompted (nothing will be printed as you type your password):
The HPC cluster is restricted to authorized users only.
YourNetID@login.daic.tudelft.nl's password:
Next, a welcome message will be shown:
Last login: Mon Jul 24 18:36:23 2023 from tud262823.ws.tudelft.net
#########################################################################
# #
# Welcome to login1, login server of the HPC cluster. #
# #
# By using this cluster you agree to the terms and conditions. #
# #
# For information about using the HPC cluster, see: #
# https://login.hpc.tudelft.nl/ #
# #
# The bulk, group and project shares are available under /tudelft.net/, #
# your windows home share is available under /winhome/$USER/. #
# #
#########################################################################
18:40:16 up 51 days, 6:53, 9 users, load average: 0,82, 0,36, 0,53
Now you can verify your environment with basic commands:
YourNetID@login1:~$ hostname # show the current hostname
login1.hpc.tudelft.nl
YourNetID@login1:~$ echo $HOME # show the path to your home directory
/home/nfs/YourNetID
YourNetID@login1:~$ pwd # show current path
/home/nfs/YourNetID
YourNetID@login1:~$ exit # exit current connection
logout
Connection to login.daic.tudelft.nl closed.
In this example, the user, YourNetID
, is logged in via the login node login1.hpc.tudelft.nl
as can be seen from the hostname
output. The user has landed in the $HOME
directory, as can be seen by printing its value, and checked by the pwd
command. Finally, the exit
command is used to exit the cluster.
Graphical applications
We discourage running graphical applications (via ssh -X) on DAIC login nodes, as GUI applications are not supported on the HPC systems.
Access from outside the university network
Direct access to DAIC from outside the university network is blocked by a firewall. To access DAIC, you have two options:
1. Using the Linux Bastion Server
To connect to DAIC via the Linux Bastion Server:
SSH into the bastion server. The bastion server acts as a gateway to the DAIC cluster.
- If you are an employee or guest, use
linux-bastion.tudelft.nl
. - If you are a student (BSc or MSc) use
student-linux.tudelft.nl
.
$ ssh <YourNetID>@linux-bastion.tudelft.nl # Or
$ ssh linux-bastion.tudelft.nl # If your username matches your NetID
As with DAIC login nodes, the first time you attempt to login to the bastion, you will be asked to confirm the server’s identity. Upon confirmation and entering your password, a welcome screen will be shown:
The authenticity of host 'linux-bastion.tudelft.nl (131.180.123.195)' can't be established.
ED25519 key fingerprint is SHA256:VJUFsQkIebODETsXwczkInnRrpdYYqAZDbsoKP1we+A.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'linux-bastion.tudelft.nl' (ED25519) to the list of known hosts.
YourNetID@linux-bastion.tudelft.nl's password:
[ASCII-art welcome banner]
YourNetID@srv227:~$
Once on the bastion server, SSH into DAIC as shown in SSH access.
YourNetID@srv227:~$ ssh login.daic.tudelft.nl # Or any other login node
Tip
To simplify this procedure, use SSH’s proxy jump feature to access DAIC via the bastion server:
$ ssh -J [<YourNetID>@]linux-bastion.tudelft.nl [<YourNetID>@]login.daic.tudelft.nl
2. Using a VPN
You can also use TU Delft’s EduVPN or OpenVPN (See TU Delft’s Access via VPN recommendations ) to access DAIC directly. Once connected to the VPN, you can ssh to DAIC directly, as in Access from the TU Delft Network.
VPN access trouble?
If you are having trouble accessing DAIC via the VPN, please report an issue via this Self-Service link.
Simplifying SSH with Configuration Files
To simplify SSH connections, you can store configurations in a file on your local machine. The SSH configuration file can be created (or found, if it already exists) in ~/.ssh/config
on Linux/Mac systems, or in C:\Users\<YourUserName>\.ssh\config
on Windows.
For example, on a Linux system, you can have the following lines in the configuration file:
~/.ssh/config
Host daic
HostName login.daic.tudelft.nl # Or any other login node
User <YourNetID>
Host bastion
Hostname linux-bastion.tudelft.nl # If employee/guest. Else, use: student-linux.tudelft.nl instead
User <YourNetID>
PreferredAuthentications password
where:
- The Host keyword starts an SSH configuration block and specifies the name (or pattern of names, like daic in this example) to which the configuration entries will apply.
- The HostName is the actual hostname to log into. Numeric IP addresses are also permitted (both on the command line and in HostName specifications).
- The User is the login username. This is especially important when the username differs between your machine and the remote server/cluster.
You can then connect to DAIC from inside the TU Delft network by just typing the following command:
$ ssh daic
Or, if outside the university network, you can connect via the bastion server:
$ ssh bastion
And, similarly, you can create/modify the configuration file on the bastion server (in ~/.ssh/config) by adding a Host configuration block for DAIC as above, to simplify the connection to DAIC from there.
ssh proxy jump feature
To connect directly from your machine to a DAIC login node (when outside the university network), use the ssh Jump Host option to jump via the bastion server as follows:
$ ssh -J YourNetID@linux-bastion.tudelft.nl YourNetID@login.daic.tudelft.nl # use `student-linux.tudelft.nl` instead if you are a student
For convenience, you can also edit your ssh configuration file, ~/.ssh/config
, on your local computer as follows:
Host daic
Hostname login.daic.tudelft.nl
User <YourNetID>
ProxyJump linux-bastion.tudelft.nl # For employees and guests. If you are a student, use: student-linux.tudelft.nl instead
Where:
- ProxyJump: Specifies the jump server, the bastion in this case.
You can then simply use ssh daic
to log in.
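Since scp and sftp read the same configuration file, the daic alias also simplifies file transfers from outside the university network; for example (the paths are illustrative):
$ scp mylocalfile daic:~/destination_path_on_DAIC/ # copy a file through the configured ProxyJump
$ sftp daic # open an interactive SFTP session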
Note
When using the ProxyJump feature, you will be prompted for your password twice: once for the bastion server, and then for DAIC.
Efficient SSH Connections with SSH Multiplexing
SSH multiplexing allows you to reuse an existing connection for multiple SSH sessions, reducing the time spent entering your password for every new connection. After the first connection is established, subsequent connections will be much faster since the existing control connection is reused.
To enable SSH multiplexing, add the following lines to your SSH configuration file. Assuming a Linux/Mac system, you can add the following lines to ~/.ssh/config
:
~/.ssh/config
Host *
ControlMaster auto
ControlPath /tmp/ssh-%r@%h:%p
where:
- The ControlPath specifies where to store the “control socket” for the multiplexed connections. %r refers to the remote login name, %h refers to the target host name, and %p refers to the destination port. This ensures that SSH uses separate control sockets for different connections.
- The ControlMaster setting activates multiplexing. With the auto setting, SSH will use an existing master connection if available or create a new one when necessary.
This setup will speed up connections after the first one and reduce the need to repeatedly enter your password for each new SSH session.
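Assuming the daic host alias from the configuration above, you can manage the shared control connection explicitly. ControlPersist is an optional extra setting, not part of the configuration shown here:
# Optionally keep the master connection alive for 10 minutes after the last
# session closes, by adding this line to the same Host * block:
#   ControlPersist 10m
$ ssh -O check daic # check whether a master connection to 'daic' is active
$ ssh -O exit daic  # close the master connection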
Note
On Windows you may need to adjust the ControlPath
to match a valid path for your operating system. For example, instead of /tmp/
, you might use a path like C:/Users/<YourUserName>/AppData/Local/Temp/.
Important
SSH public key logins (passwordless login) are not supported on DAIC, because Kerberos authentication is required to access your home directory. You will need to enter your password for each session.
4.4 - Data management
Data Management Guidelines
There are different use cases and quota limits for the different TU Delft network drives. For example, Umbrella
(project storage) is for everybody and everything, while bulk
needs to be cleaned up, migrated and phased out. Always check the TU Delft
Overview data storage
for guidelines on using network drives and quota limits.
4.4.1 - Data transfer
Your Windows Personal Storage and the Project and Group Storage are available on all TU Delft installed machines including the DAIC compute nodes. If possible use one of these for files that you want to access on both your personal computer and the compute nodes. Your Windows Personal Storage and the Project and Group Storage are also accessible off-campus through the TU Delft webdata service
. See the
webdata page
for manuals on using the service with your personal computer.
Mounting folders
Besides the commands below, there are multiple ways to upload and download code and data to and from the central storage. The officially advised way is either a direct mount or sftp.tudelft.nl. Find more information here.
SCP
Both your Linux and Windows Personal Storage and the Project and Group Storage are also available world-wide via an SCP/SFTP client. This is the simplest transfer method via the scp
command, which has the following basic syntax:
$ scp <source_file> <target_destination> # for files
$ scp -r <source_folder> <target_destination> # for folders
For example, to transfer a file from your computer to DAIC:
$ scp mylocalfile [<netid>@]login.daic.tudelft.nl:~/destination_path_on_DAIC/
To transfer a folder (recursively) from your computer to DAIC:
$ scp -r mylocalfolder [<netid>@]login.daic.tudelft.nl:~/destination_path_on_DAIC/
To transfer a file from DAIC to your computer:
$ scp [<netid>@]login.daic.tudelft.nl:~/origin_path_on_DAIC/remotefile ./
To transfer a folder from DAIC to your computer:
$ scp -r [<netid>@]login.daic.tudelft.nl:~/origin_path_on_DAIC/remotefolder ./
The above commands will work from either the university network, or when using EduVPN. If a “jump” via linux-bastion
is needed (see Access from outside university network), modify the above commands by replacing scp with scp -J <netid>@linux-bastion.tudelft.nl
and keep the rest of the command as before:
$ scp -J <netid>@linux-bastion.tudelft.nl <local_file> [<netid>@]login.daic.tudelft.nl:<remote_destination>
$ scp -r -J <netid>@linux-bastion.tudelft.nl <local_folder> [<netid>@]login.daic.tudelft.nl:<remote_destination>
$ scp -J <netid>@linux-bastion.tudelft.nl [<netid>@]login.daic.tudelft.nl:<remote_file> <local_destination>
$ scp -r -J <netid>@linux-bastion.tudelft.nl [<netid>@]login.daic.tudelft.nl:<remote_folder> <local_destination>
$ sftp -J <netid>@linux-bastion.tudelft.nl [<netid>@]login.daic.tudelft.nl
Where:
- Case is important.
- Items between < > brackets are user-supplied values (so replace with your own NetID, file or folder name).
- Items between [ ] brackets are optional: when your username on your local computer is the same as your NetID username, you don’t have to specify it.
- When you specify your NetID username, don’t forget the @ character between the username and the computer name.
Note for students
Please use student-linux.tudelft.nl
instead of linux-bastion.tudelft.nl
as an intermediate server!
Hint
Use quotes when file or folder names contain spaces or special characters.
rsync
rsync
is a robust file copying and synchronization tool commonly used in Unix-like operating systems. It allows you to transfer files and directories efficiently, both locally and remotely. rsync
supports options that enable compression, preserve file attributes, and allow for incremental updates.
Basic Usage
Copy files locally:
rsync [options] source destination
This command copies files and directories from the source to the destination.
Copy files remotely:
rsync [options] source user@remote_host:destination
This command transfers files from a local source to a remote destination.
Options in rsync
Commonly used options in rsync
with DAIC are:
-a
for recursion and to preserve almost everything-z
compress file data during the transfer-v
verbose mode to display information while copying--progress
show progress of files during transfer--no-perms
don’t preserve file permissions
In addition to the commonly used options, rsync
provides several other options for more advanced control and customization during file transfers:
--dry-run
: Perform a trial run without making any changes. This option allows you to see what would be done without actually doing it.--checksum
: Use checksums instead of file size and modification time to determine if files should be transferred. This is more precise but slower.--partial
: Keep partially transferred files and resume them later. This is useful in case of an interrupted transfer.--partial-dir=DIR
: Specify a directory to hold partial transfers. This option works well with--partial
.--bwlimit=KBPS
: Limit the bandwidth used by the transfer to the specified rate in kilobytes per second. Useful for managing network load.--timeout=SECONDS
: Set a maximum wait time in seconds for receiving data. If the timeout is exceeded,rsync
will exit.--no-implied-dirs
: When transferring a directory, this option prevents the creation of implied directories on the destination side that exist in the source but not explicitly specified in the transfer.--files-from=FILE
: Read a list of source files from the specified FILE. This can be useful when you want to transfer specific files.--update
: Skip files that are newer on the destination than the source. This is useful for incremental backups.--ignore-existing
: Skip files that already exist on the destination. Useful when you want to avoid overwriting existing files.--inplace
: Update files in place instead of creating temporary files and renaming them later. This can save disk space and improve speed.--append
: Append data to files instead of replacing them if they already exist on the destination.--append-verify
: Append data and verify it with checksums to ensure integrity.--backup
: Make backups of files that are overwritten or deleted during the transfer. By default, a~
is appended to the backup filename.--backup-dir=DIR
: Specify a directory to store backup files.--suffix=SUFFIX
: Specify a suffix to append to backup files instead of the default~
.--progress
: Displays the progress of the transfer, including the speed and the number of bytes transferred. This is useful for monitoring long transfers and seeing how much data has been copied so far.
These options, along with others, provide additional flexibility and control over your rsync
transfers, allowing you to fine-tune the synchronization process to meet your specific needs.
Examples
Copy data to project drives: use the
--no-perms
option:rsync -av --no-perms </path/to/local/dir> user@login.daic.tudelft.nl:/tudelft.net/staff-umbrella/<project-id>
This command copies files and directories from a local source to a remote destination, preserving file attributes except for permissions.
Synchronize a local directory with a remote directory:
rsync -avz /path/to/local/dir user@remote_host:/path/to/remote/dir
This synchronizes a local directory with a remote directory, using archive mode (
-a
) to preserve file attributes, verbose mode (-v
) for detailed output, and compression (-z
) for efficient transfer.Synchronize a remote directory with a local directory:
rsync -avz user@remote_host:/path/to/remote/dir /path/to/local/dir
This transfers files from a remote directory to a local directory, using the same options as the previous example.
Delete files in the destination that are not present in the source:
rsync -av --delete /path/to/source/dir /path/to/destination/dir
This synchronizes the source and destination directories and deletes files in the destination that are not in the source.
Exclude certain files or directories during transfer:
rsync -av --exclude='*.tmp' /path/to/source/dir /path/to/destination/dir
This synchronizes the source and destination directories, excluding files with the
.tmp
extension.
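Resume a large or interrupted transfer: combine some of the options above, for example (the paths and bandwidth value are placeholders):
rsync -av --partial --progress --bwlimit=5000 /path/to/local/dir user@login.daic.tudelft.nl:/path/to/remote/dir
This keeps partially transferred files so an interrupted transfer can be resumed later, shows progress during the transfer, and limits the bandwidth to about 5 MB/s.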
4.5 - Software
4.5.1 - Available software
General software
Most common general software, like programming languages and libraries, is installed on the DAIC nodes. To check if the program that you need is pre-installed, you can simply try to start it:
$ python
Python 2.7.5 (default, Jun 28 2022, 15:30:04)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()
To find out which binary is used exactly you can use which
command:
$ which python
/usr/bin/python
Alternatively, you can try to locate the program or library using the whereis
command:
$ whereis python
python: /usr/bin/python3.4m-config /usr/bin/python3.6m-x86_64-config /usr/bin/python2.7 /usr/bin/python3.6-config /usr/bin/python3.4m-x86_64-config /usr/bin/python3.6m-config /usr/bin/python3.4 /usr/bin/python3.4m /usr/bin/python2.7-config /usr/bin/python3.6 /usr/bin/python3.4-config /usr/bin/python /usr/bin/python3.6m /usr/lib/python2.7 /usr/lib/python3.4 /usr/lib/python3.6 /usr/lib64/python2.7 /usr/lib64/python3.4 /usr/lib64/python3.6 /etc/python /usr/include/python2.7 /usr/include/python3.4m /usr/include/python3.6m /usr/share/man/man1/python.1.gz
Or, you can check if the package is installed using the rpm -q
command as follows:
$ rpm -q python
python-2.7.5-94.el7_9.x86_64
$ rpm -q python4
package python4 is not installed
You can also search with wildcards:
$ rpm -qa 'python*'
python2-wheel-0.29.0-2.el7.noarch
python2-cryptography-1.7.2-2.el7.x86_64
python34-virtualenv-15.1.0-5.el7.noarch
python-networkx-1.8.1-12.el7.noarch
python-gobject-3.22.0-1.el7_4.1.x86_64
python-gofer-2.12.5-3.el7.noarch
python-iniparse-0.4-9.el7.noarch
python-lxml-3.2.1-4.el7.x86_64
python34-3.4.10-8.el7.x86_64
python36-numpy-f2py-1.12.1-3.el7.x86_64
...
Useful commands on DAIC
For a list of handy commands on DAIC have a look here.
4.5.2 - Modules
In the context of Unix-like operating systems, the module
command is part of the environment modules system, a tool that provides a dynamic approach to managing the user environment. This system allows users to load and unload different software packages or environments on demand. Some often used third-party software (e.g., CUDA, cuDNN, MATLAB) is pre-installed on the cluster as
environment modules
.
Usage
To see or use the available modules, first, enable the software collection:
$ module use /opt/insy/modulefiles
Now, to see all available packages and versions:
$ module avail
---------------------------------------------------------------------------------------------- /opt/insy/modulefiles ----------------------------------------------------------------------------------------------
albacore/2.2.7-Python-3.4 cuda/11.8 cudnn/11.5-8.3.0.98 devtoolset/6 devtoolset/10 intel/oneapi (D) matlab/R2021b (D) miniconda/3.9 (D)
comsol/5.5 cuda/12.0 cudnn/12-8.9.1.23 (D) devtoolset/7 devtoolset/11 (D) intel/2017u4 miniconda/2.7 nccl/11.5-2.11.4
comsol/5.6 (D) cuda/12.1 (D) cwp-su/43R8 devtoolset/8 diplib/3.2 matlab/R2020a miniconda/3.7 openmpi/4.0.1
cuda/11.5 cudnn/11-8.6.0.163 cwp-su/44R1 (D) devtoolset/9 :
...
- D is a label for the default module in case multiple versions are available. E.g.
module load cuda
will loadcuda/12.1
- L means a module is currently loaded
To check the description of a specific module:
$ module whatis cudnn
cudnn/12-8.9.1.23 : cuDNN 8.9.1.23 for CUDA 12
cudnn/12-8.9.1.23 : NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
And to use the module or package, load it as follows:
$ module load cuda/11.2 cudnn/11.2-8.1.1.33 # load the module
$ module list # check the loaded modules
Currently Loaded Modules:
1) cuda/11.2 2) cudnn/11.2-8.1.1.33
Note
For more information about using the module system, runmodule help
.Compilers and Development Tools
The cluster provides several compilers and development tools. The following table lists the available compilers and development tools. These are available in the devtoolset
module:
$ module use /opt/insy/modulefiles
$ module avail devtoolset
---------------------------------------------------------------------------------------------- /opt/insy/modulefiles ----------------------------------------------------------------------------------------------
devtoolset/6 devtoolset/7 devtoolset/8 devtoolset/9 devtoolset/10 devtoolset/11 (L,D)
Where:
L: Module is loaded
D: Default Module
If the avail list is too long consider trying:
"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
$ module whatis devtoolset
devtoolset/11 : Developer Toolset 11 Software Collection
devtoolset/11 : GNU Compiler Collection, GNU Debugger, and other development, debugging, and performance monitoring tools.
$ module load devtoolset/11
$ gcc --version
gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
4.5.3 - Installing software
Basic principles
On a cluster, it’s important that software is available and identical on all nodes, both login and compute nodes (see Workload scheduler). For self-installed software, it’s easier to install the software in one shared location than installing and maintaining the same software separately on every single node. You should therefore install your software on one of the network shares (eg, your $HOME folder or an umbrella or bulk folder) that are accessible from all nodes (see Storage).
As a regular Linux user you don’t have administrator rights. Yet, you can do your normal work, including installing software in a personal folder, without needing administrator rights. Consequently, you don’t need (nor are you allowed) to use the sudo or su commands that are often shown in manuals.
DAIC provides only 8GB of storage in the /home directories, and the project spaces (/tudelft.net/...) are Windows-based, leading to problems installing packages with pip due to file permission errors. However, /tudelft.net/... locations are mounted on all nodes. Therefore, the recommended way of using your own software and environments is to use containerization and to store your containers under /tudelft.net/staff-umbrella/.... Check out the Apptainer tutorial for guidance.
Stop!
Although both Linux flavors Red Hat Enterprise Linux (RHEL, CentOS, Scientific Linux, Fedora) and Debian (Ubuntu) can run the same Linux software, they use completely different package systems for installing software. The available software, packages’ names and package versions might differ, and the package formats and package management tools are incompatible. This means:
- It is not possible to install Ubuntu or Debian .deb packages in CentOS or use apt-get to install software in DAIC. So when installing software, use a manual for CentOS, Red Hat or Fedora.
- If you can only find a manual for Ubuntu, you have to substitute the CentOS versions for any Ubuntu-specific packages or commands.
Managing environments
Conda/Mamba
Conda and Mamba are both package management and environment management tools used primarily in the data science and programming communities. Conda, developed by Anaconda, Inc., allows users to manage packages and create isolated environments for different projects, supporting multiple languages like Python and R. Mamba is a more recent alternative to Conda that offers faster performance and improved dependency solving using the same package repositories as Conda. Both tools help avoid dependency conflicts and simplify the management of software packages and environments.
Use module load miniconda
Miniconda is available as a module and can be loaded as follows:
$ module use /opt/insy/modulefiles # If not already
$ module load miniconda
$ which conda
/opt/insy/miniconda/3.9/bin/conda
Creating a conda environment
To create a new environment you can run conda create
:
$ conda create -n env
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.10.1
latest version: 24.3.0
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /home/nfs/username/.conda/envs/env
Creating a conda environment from a YAML file
Conda allows you to create environments from a YAML file that specifies the packages and their versions for the desired environment. This feature makes it easier to reproduce environments across different machines and share environment configurations with others.
$ conda env create -f environment.yml (-n new-name)
For how to create an environment.yml
file, see Exporting environments
Environment variables
You can set environment variables to install packages and environments in other locations:
CONDA_PREFIX
: This variable points to the active conda environment’s root directory. When an environment is active,CONDA_PREFIX
contains the path to that environment’s root directory.CONDA_ENVS_DIRS
: This variable specifies the directories where conda environments are stored. You can set it to a list of directories (separated by colons on Unix-like systems and semicolons on Windows). Conda will search for and store environments in these directories.CONDA_PKGS_DIRS
: This variable specifies the directories where conda stores downloaded packages. LikeCONDA_ENVS_DIRS
, you can set it to a list of directories. Conda uses these directories as cache locations for package downloads and installations.
Examples
- Set conda environments directory:
$ export CONDA_ENVS_DIRS="/tudelft.net/staff-umbrella/my-project/"
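Similarly, the package cache can be relocated alongside the environments (the path is a placeholder):
$ export CONDA_PKGS_DIRS="/tudelft.net/staff-umbrella/my-project/pkgs" # cache downloaded packages outside the 8GB home quota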
A caveat is that the /tudelft.net
mounts are Windows-based and therefore have compatibility issues with pip
. When you create your conda environments there, you will not be able to use pip
to install packages. It is therefore recommended to keep the conda environments minimal and in your home directory, and to use containerization for larger environments.
List existing environments
You can list environments with
$ conda env list
Activating environments
You can activate an existing environment with conda activate
, for example to install more packages:
$ conda activate env # Activate the newly created environment
Modifying environments
Sometimes you need to add/remove/change packages and libraries in existing environments. First, activate the environment you want to change with conda activate
and then run conda install package-name
or conda remove package-name
. You can also use pip
to install packages inside a conda environment, but for that pip
has to be installed inside the environment. To make sure pip
is installed in your environment, run conda install pip
first.
(env) $ conda install pandas # Add a new package to the active environment
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.10.1
latest version: 24.3.0
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: /home/nfs/sdrwacker/.conda/envs/test
added / updated specs:
- pandas
The following packages will be downloaded:
package | build
---------------------------|-----------------
blas-1.0 | mkl 6 KB
bottleneck-1.3.7 | py312ha883a20_0 140 KB
bzip2-1.0.8 | h5eee18b_5 262 KB
expat-2.6.2 | h6a678d5_0 177 KB
intel-openmp-2023.1.0 | hdb19cb5_46306 17.2 MB
ld_impl_linux-64-2.38 | h1181459_1 654 KB
libffi-3.4.4 | h6a678d5_0 142 KB
libuuid-1.41.5 | h5eee18b_0 27 KB
mkl-2023.1.0 | h213fc3f_46344 171.5 MB
mkl-service-2.4.0 | py312h5eee18b_1 66 KB
mkl_fft-1.3.8 | py312h5eee18b_0 204 KB
mkl_random-1.2.4 | py312hdb19cb5_0 284 KB
ncurses-6.4 | h6a678d5_0 914 KB
numexpr-2.8.7 | py312hf827012_0 149 KB
numpy-1.26.4 | py312hc5e2394_0 11 KB
numpy-base-1.26.4 | py312h0da6c21_0 7.7 MB
openssl-3.0.13 | h7f8727e_0 5.2 MB
pandas-2.2.1 | py312h526ad5a_0 15.4 MB
pip-23.3.1 | py312h06a4308_0 2.8 MB
python-3.12.3 | h996f2a0_0 34.8 MB
pytz-2023.3.post1 | py312h06a4308_0 197 KB
readline-8.2 | h5eee18b_0 357 KB
setuptools-68.2.2 | py312h06a4308_0 1.2 MB
six-1.16.0 | pyhd3eb1b0_1 18 KB
sqlite-3.41.2 | h5eee18b_0 1.2 MB
tbb-2021.8.0 | hdb19cb5_0 1.6 MB
tk-8.6.12 | h1ccaba5_0 3.0 MB
tzdata-2024a | h04d1e81_0 116 KB
wheel-0.41.2 | py312h06a4308_0 131 KB
xz-5.4.6 | h5eee18b_0 651 KB
zlib-1.2.13 | h5eee18b_0 103 KB
------------------------------------------------------------
Total: 266.1 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
blas pkgs/main/linux-64::blas-1.0-mkl
bottleneck pkgs/main/linux-64::bottleneck-1.3.7-py312ha883a20_0
bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_5
ca-certificates pkgs/main/linux-64::ca-certificates-2024.3.11-h06a4308_0
expat pkgs/main/linux-64::expat-2.6.2-h6a678d5_0
intel-openmp pkgs/main/linux-64::intel-openmp-2023.1.0-hdb19cb5_46306
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.38-h1181459_1
libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_0
libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
mkl pkgs/main/linux-64::mkl-2023.1.0-h213fc3f_46344
mkl-service pkgs/main/linux-64::mkl-service-2.4.0-py312h5eee18b_1
mkl_fft pkgs/main/linux-64::mkl_fft-1.3.8-py312h5eee18b_0
mkl_random pkgs/main/linux-64::mkl_random-1.2.4-py312hdb19cb5_0
ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
numexpr pkgs/main/linux-64::numexpr-2.8.7-py312hf827012_0
numpy pkgs/main/linux-64::numpy-1.26.4-py312hc5e2394_0
numpy-base pkgs/main/linux-64::numpy-base-1.26.4-py312h0da6c21_0
openssl pkgs/main/linux-64::openssl-3.0.13-h7f8727e_0
pandas pkgs/main/linux-64::pandas-2.2.1-py312h526ad5a_0
pip pkgs/main/linux-64::pip-23.3.1-py312h06a4308_0
python pkgs/main/linux-64::python-3.12.3-h996f2a0_0
python-dateutil pkgs/main/noarch::python-dateutil-2.8.2-pyhd3eb1b0_0
python-tzdata pkgs/main/noarch::python-tzdata-2023.3-pyhd3eb1b0_0
pytz pkgs/main/linux-64::pytz-2023.3.post1-py312h06a4308_0
readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
setuptools pkgs/main/linux-64::setuptools-68.2.2-py312h06a4308_0
six pkgs/main/noarch::six-1.16.0-pyhd3eb1b0_1
sqlite pkgs/main/linux-64::sqlite-3.41.2-h5eee18b_0
tbb pkgs/main/linux-64::tbb-2021.8.0-hdb19cb5_0
tk pkgs/main/linux-64::tk-8.6.12-h1ccaba5_0
tzdata pkgs/main/noarch::tzdata-2024a-h04d1e81_0
wheel pkgs/main/linux-64::wheel-0.41.2-py312h06a4308_0
xz pkgs/main/linux-64::xz-5.4.6-h5eee18b_0
zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_0
Proceed ([y]/n)? y
....
Exporting environments
You can export versions of all installed packages and libraries inside a conda environment with conda env export
.
It is good practice to keep track of all versions that you have used for a particular experiment by exporting them into a YAML file, typically called environment.yml
:
$ conda env export --no-builds > environment.yml
Install your own mamba/conda
Sometimes the versions provided by module
are outdated and users need their own installation of conda
or mamba
.
A minimal version can be installed as demonstrated in the following:
$ alias install-miniforge='
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh \
&& bash Miniforge3-Linux-x86_64.sh -b \
&& rm -f Miniforge3-Linux-x86_64.sh \
&& eval "$($HOME/miniforge3/bin/conda shell.bash hook)" \
&& conda init \
&& conda install -n base -c conda-forge mamba'
$ cd ~ && install-miniforge
(base) $ # This shows that the 'base' environment is active.
(base) $ which python
~/miniforge3/bin/python
This will already occupy around 500MB of your home directory totalling ~20k files.
$ du -h miniforge3 --max-depth=0
486M miniforge3
$ find miniforge3 -type f | wc -l
20719
Now, you can install your own versions of libraries and programs, or create entire environments as described above.
Stop!
You are limited to 8GB of data in your home directory. Installing a full development environment for PyTorch can easily exceed 12 GB; therefore, it is recommended to install only tools and libraries that you really need on the login nodes via this route. Instead, use Apptainer
to create container files containing all dependencies.
Using binaries
Some programs come as precompiled binaries or are written in a scripting language such as Perl, PHP, Python or shell script. Most of these programs don’t actually need to be “installed” since you can simply run these programs directly. In certain scenarios, you may need to make the program executable first using chmod +x
:
$ ./my-executable # attempting to run the binary `my-executable`
-bash: ./my-executable: Permission denied
$ chmod +x my-executable # making `my-executable` executable, since it fails due to permissions
$ ./my-executable # checking `my-executable` works!
Hello world!
Installing from source
When a pre-made binary of your software is not available, you’ll have to install the software yourself from the source. You may need to set up your Installation environment before following this Installation recipe.
Installation environment
When you are installing software for the very first time, you need to set up your environment. If you have already done this before , you can skip this section and go directly to the Installation recipe section.
To set up your environment, first, add the following lines to your ~/.bash_profile
or, alternatively, download this (bash_profile.txt) as shown in the subsequent commands:
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup settings
export PREFIX="$HOME/.local"
export ACLOCAL_PATH="$PREFIX/share/aclocal${ACLOCAL_PATH:+:$ACLOCAL_PATH}"
export CPATH="$PREFIX/include${CPATH:+:$CPATH}"
export LD_LIBRARY_PATH="$PREFIX/lib64:$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LIBRARY_PATH="$PREFIX/lib64:$PREFIX/lib${LIBRARY_PATH:+:$LIBRARY_PATH}"
export MANPATH="$PREFIX/share/man${MANPATH:+:$MANPATH}"
export PATH="$HOME/bin:$PREFIX/bin:$PATH"
export PERL5LIB="$PREFIX/lib64/perl5:$PREFIX/share/perl5${PERL5LIB:+:$PERL5LIB}"
export PKG_CONFIG_PATH="$PREFIX/lib64/pkgconfig:$PREFIX/share/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
export PYTHONPATH="$PREFIX/lib/python2.7/site-packages${PYTHONPATH:+:$PYTHONPATH}"
Note!
- if you already have some of these settings in your
~/.bash_profile
(or elsewhere), you should combine them so they don’t duplicate the paths. - if you want to use
python3.6
instead ofpython2.7
, you need to set thePYTHONPATH
topython3.6
.
$ cp ~/.bash_profile ~/.bash_profile.bak # back up your file
$ curl -s https://wiki.tudelft.nl/pub/Research/InsyCluster/InstallingSoftware/bash_profile.txt >> ~/.bash_profile # download and append the lines above
Then, clean up any duplicate settings, and:
$ source ~/.bash_profile
$ mkdir -p "$PREFIX"
The line export PREFIX="$HOME/.local"
sets your software installation directory to /home/nfs/<YourNetID>/.local
(which is the default and accessible on all nodes). This is in your personal home directory where you have a space quota of 8GB. However, for software for your research project, you should instead use a project share, for example:
export PREFIX="/tudelft.net/staff-umbrella/project/software"
The other variables will let you use your self-installed programs. You are now ready to install your software!
Installation recipe
Software installation usually just requires you to follow the general installation recipe described below, but you always need to consult the documentation for your software.
- Place the source of the software in a folder under
/tmp
:
$ mkdir /tmp/$USER
$ cd /tmp/$USER
You can sometimes download the software directly from the internet:
$ wget http://host/path/software.tar.gz
$ tar -xzf software.tar.gz
Or, clone the software from a git repository:
$ git clone https://github.com/software
Then:
$ cd software
Note
Note:.tgz
is the same as .tar.gz
, for .tar.bz2
files use tar -xjf software.tar.bz2
.- If the software provides a
configure
script, run it:
$ ./configure --prefix="$PREFIX"
If configure
complains about missing software, you’ll either have to install that software, tell configure
where it is (--with-feature _path_=
) or disable the feature (--disable-feature
).
If your software provides a CMakeLists.txt
file, run cmake
(note: the trailing two dots on the last line are needed exactly as shown):
$ mkdir -p build
$ cd build
$ cmake -DCMAKE_INSTALL_PREFIX="$PREFIX" ..
Again, if cmake
complains about missing software, you’ll either have to install that software or tell cmake
where it is (-DCMAKE_SYSTEM_PREFIX_PATH="/usr/local;/usr;$PREFIX;path"
).
If neither is provided, consult the documentation for dependencies and configuration (specifically for the installation directory).
There is no point in continuing until all reported problems have been fixed.
- Compile the software:
$ make
If compilation is aborted due to an error, Google the error for possible solutions. Again, there is no point in continuing until all reported problems have been fixed.
- Install the software. When you used configure or cmake, you can simply run:
$ make install
When you used neither, you need to use:
$ make prefix="$PREFIX" install
- Your software should now be ready to use, so check it:
$ cd
$ _program_
- When the program works, clean up
/tmp/netid
:
$ rm -r /tmp/$USER
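Putting the recipe together, a typical configure-based installation looks roughly like this from start to finish (the URL and package name are placeholders):
$ mkdir -p /tmp/$USER && cd /tmp/$USER    # build in a temporary folder
$ wget http://host/path/software.tar.gz   # placeholder URL
$ tar -xzf software.tar.gz && cd software
$ ./configure --prefix="$PREFIX"          # configure the installation directory
$ make                                    # compile
$ make install                            # install into $PREFIX
$ cd && rm -r /tmp/$USER                  # clean up the build folder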
4.5.4 - Containerization
Apptainer
Apptainer is a container platform. It allows you to create and run containers that package up pieces of software in a way that is portable and reproducible. You can build a container using Apptainer on your laptop, and then run it on an HPC cluster. Apptainer was created to run complex applications on HPC clusters in a simple, portable, and reproducible way. The template repository linked below contains a template for building an Apptainer (formerly Singularity) container using miniforge
and mamba
(similar to conda). Its examples directory also contains examples for other setups.
Apptainer features
- Verifiable reproducibility and security, using cryptographic signatures, an immutable container image format, and in-memory decryption.
- Integration over isolation by default. Easily make use of GPUs, high speed networks, parallel filesystems on a cluster or server by default.
- Mobility of compute. The single file SIF container format is easy to transport and share.
- A simple, effective security model. You are the same user inside a container as outside, and cannot gain additional privilege on the host system by default. Read more about Security in Apptainer.
Template
The Apptainer template repository maintained by the Research Engineering and Infrastructure Team is a good starting point to create your own apptainers.
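As a minimal sketch (the file names are placeholders), you can build an image from a definition file on a machine where you have build rights, and test it before copying the resulting .sif file to DAIC storage:
$ apptainer build my-container.sif my-container.def # build a SIF image from a definition file
$ apptainer exec my-container.sif python --version  # quick sanity check inside the container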
How to use Apptainer on the cluster with SLURM?
Here is an example how to use the container in a SLURM script.
#!/bin/sh
#SBATCH --job-name="apptainer-job"
#SBATCH --account="my-account"
#SBATCH --partition="general" # Request partition.
#SBATCH --time=01:00:00 # Request run time (wall-clock). Default is 1 minute
#SBATCH --nodes=1                # Request 1 node
#SBATCH --tasks-per-node=1 # Set one task per node
#SBATCH --cpus-per-task=4 # Request number of CPUs (threads) per task.
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH --mem=4GB # Request 4 GB of RAM in total
#SBATCH --mail-type=END # Set mail type to 'END' to receive a mail when the job finishes.
#SBATCH --output=slurm-%x-%j.out # Set name of output log. %j is the Slurm jobId
#SBATCH --error=slurm-%x-%j.err # Set name of error log. %j is the Slurm jobId
export APPTAINER_ROOT="/path/to/container/folder"
export APPTAINER_NAME="my-container.sif"
# If you use GPUs
module use /opt/insy/modulefiles
module load cuda/12.1
# Run the script inside the container:
#   --nv       binds NVIDIA libraries from the host
#   --env-file sources additional environment variables (optional)
#   -B         mounts host file systems inside the container (different for each cluster)
#   The last two lines give the path to the container and the command to execute inside it.
srun apptainer exec \
  --nv \
  --env-file ~/.env \
  -B /home/$USER:/home/$USER \
  -B /tudelft.net/:/tudelft.net/ \
  $APPTAINER_ROOT/$APPTAINER_NAME \
  python script.py
Tutorial
See the Apptainer tutorial.
4.6 - Job submission
Slurm job’s terminology: job, job step, task and CPUs
A slurm job (submitted via sbatch
) can consist of multiple steps in series. Each step (specified via srun
) can run multiple tasks (ie programs) in parallel. Each task gets its own set of CPUs. As an example, consider the workflow and corresponding breakdown shown in fig 2.
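In script form, such a job might look like the following minimal sketch (resource values are illustrative):
#!/bin/sh
#SBATCH --ntasks=2          # up to two tasks can run in parallel within a step
#SBATCH --cpus-per-task=2   # CPUs are allocated in multiples of 2
#SBATCH --mem-per-cpu=1G    # with multiple tasks, request memory per CPU
srun --ntasks=1 echo 'step 1: a single task'
srun --ntasks=2 echo 'step 2: two tasks in parallel'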
In this example, note:
- When you explicitly request 1 CPU per task (
--cpus-per-task=1
), you should also explicitly specify the number of tasks (--ntasks
). Otherwise,srun
may start the task twice in parallel (because CPUs are allocated in multiples of 2) - The default slurm allocation is a single task and single CPU (ie
--ntasks=1 --cpus-per-task=1
). Thus, it is not necessary to explicitly request these to run a single task on a single CPU. - When using multiple tasks, specify
--mem-per-cpu
.
Note
DAIC nodes are dual-threaded, which means that CPUs are automatically allocated in multiples of 2. Thus, use (a multiple of) 2 threads in your job.
4.6.1 - Priorities and waiting times
Slurm’s job scheduling and waiting times
When slurm is not configured for FIFO scheduling, jobs are prioritized in the following order:
- Jobs that can preempt: Not enabled in DAIC
- Jobs with an advanced reservation: See Slurm's Advanced Resource Reservation Guide
- Partition PriorityTier: See Priority tiers
- Job priority: See Priority calculations and QoS priority
- Job ID
Priority tiers
DAIC partitions are tiered:
- The
general
partition is in the lowest priority tier, - Department partitions (eg,
insy
,st
) are in the middle priority tier, and - Partitions for specific groups (eg,
influence
,mmll
) are in the highest priority tier. Those partitions correspond to resources contributed by the respective groups or departments (see Contributing departments).
When resources become available, the scheduler will first look for jobs in the highest priority partition that those resources are in, and start the highest (user) priority jobs that fit within the resources (if any). When resources remain, the scheduler will check the next lower priority tier, and so on. Finally, the scheduler will try to backfill lower (user) priority jobs that fit (if any).
The partition priorities have no impact on resources that are in use, so jobs have to wait until the resources become available.
Partition selection
The purpose of this tiering is to let you submit your jobs to multiple partitions (e.g., --partition=mml,insy,general
), allowing the scheduler to determine where the job can start the soonest. This ensures your job has the highest possible priority across different partitions in the cluster, without negatively impacting your or others’ resource access.
Keep in mind that:
- Resources of all partitions (eg,
st
) are also part of thegeneral
partition (see Fig 1). Thus:- Submitting to the
general
partition allows jobs to use all nodes - Submitting to group-specific partitions alone results in longer waiting times, since the
general
partition has much more resources than any of them (The bigger the resource pool, the more chances a job has to be scheduled or back-filled) - The optimal strategy is to submit to both
general
and group-specific partitions when accessible. This is to skip over higher-priority jobs that would otherwise get started first on resources that are also in the specific partition.
- Submitting to the
- You should only submit jobs to partitions that your account has access to. Submitting jobs to unauthorized partitions (e.g., using
--partition=insy,st
when your submitting account does not have access to both of these) will result in the job remaining in a pending state and generate excessive logging, potentially overloading the Slurm controller nodes.
Warning
Always ensure you are submitting jobs to partitions accessible by your account. You can check your account and partition permissions with the following commands- example output for a user is shown below:
$ sacctmgr show user "$USER" withassoc Format='DefaultAccount,Account' --parsable # Check your account(s)
Def Acct|Account|
ewi-insy-prb|ewi-st|
ewi-insy-prb|ewi-insy-prb|
$ echo "Partition AllowAccounts"; scontrol show partition -a | \
> awk '
> /PartitionName=/ {
> split($1, a, "=");
> partition = a[2]
> }
> /AllowAccounts=/ {
> split($2, b, "=");
> print partition, b[2]
> }
> ' | \
> grep -E 'ALL|ewi-insy-prb' # Check partitions accessible to your *default* account
Partition AllowAccounts
general ALL
insy ewi-insy,ewi-insy-cgv,ewi-insy-cys,ewi-insy-ii,ewi-insy-ii-influence,ewi-insy-mmc,ewi-insy-prb,ewi-insy-prb-dbl,ewi-insy-prb-prlab,ewi-insy-prb-spclab,ewi-insy-prb-visionlab,ewi-insy-reit,ewi-insy-sdm,ewi-insy-sup
This shows that the user can use the ewi-insy-prb
or the ewi-st
accounts.
The second command shows that all accounts can submit to the general
partition and several accounts can submit to the insy
partition.
Replace the ewi-insy-prb
in the grep line above to get the partition details for your specific account.
For the example above, note the following correct and incorrect examples (the labels follow from the account and partition access shown above):

# Correct: ewi-insy-prb has access to both the insy and general partitions
#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,general

# Correct: without --account, the default account (ewi-insy-prb) is used, which has access to both partitions
#SBATCH --partition=insy,general

# Incorrect: ewi-insy-prb does not have access to the st partition
#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,st

# Incorrect: ewi-st does not have access to the insy partition
#SBATCH --account=ewi-st
#SBATCH --partition=insy
Priority calculations
Slurm continually calculates job priorities and schedules the execution of jobs based on its configurations. A few configuration parameters affect priority computations:
SchedulerType
: The type of scheduling used based on available resources, requested resources, and job priorities. On DAIC, slurm is used withbackfill
scheduling mechanism. This mechanism allows low priority jobs to backfill idle resources if doing so does not delay the expected start time of any high priority job (based on resource availability).
Tip
With sched/backfill
, jobs can only be started when the resources that they request fit within the available idle resources. Thus:
- The fewer resources a job requests, the higher the chance that it will fit within the available idle resources.
- The more resources a job requests, the longer it will have to wait before enough resources become available to start. To check how the cluster is configured, you may run:
$ scontrol show config | grep SchedulerType
SchedulerType = sched/backfill
More details are available in Slurm's SchedulerType
PriorityType
: The way priority is computed. On DAIC, amultifactor
computation is applied, where job priority at any given time is a weighted sum of the following factors:- Fairshare: a measure of the amount of resources that a group (ie
account
in slurm terminology) has contributed, and the historical usage of the group and the user. - QOS: the quality of service associated with the job, which is specified with the slurm
--qos
directive (see QoS priority).
- Fairshare: a measure of the amount of resources that a group (ie
Info
The whole idea behind FairShare scheduling in DAIC is to share all the available resources fairly and efficiently with all users (instead of imposing strict limits on how many resources or which hardware users can compute on). The resources in the cluster are contributed in different amounts by different groups (see Contributing departments), and the scheduler makes sure that each group can use a share of the resources relative to what the group contributed. To check how the cluster is configured, you may run:
$ scontrol show config | grep PriorityType
PriorityType = priority/multifactor
$ sprio --weights
JOBID PARTITION PRIORITY SITE FAIRSHARE QOS
Weights 1 20000000 40000000
The following commands are useful for checking prioritization of your own jobs:
Command | Purpose |
---|---|
sprio -j <YourJobID> | Determine the priority of your job |
squeue -j <YourJobID> --start | Request your job’s estimated start time |
sshare -u <YourNetID> | Determine your current fairshare value |
Info
To get more complete priority configurations of a cluster, run the command:
$ scontrol show config | grep ^Priority
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 2-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightAssoc = 0
PriorityWeightFairShare = 20000000
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 40000000
PriorityWeightTRES = (null)
QoS priority
The purpose of the (multiple) QoSs in DAIC is to optimize the throughput of the cluster and to reduce the waiting times for jobs:
- Long jobs block resources for a long time, thus leading to long waiting times and fragmentation of resources.
- Short jobs block resources only for short times, and can more easily fill in the gaps in the scheduling of resources (thus start sooner), and are therefore better for throughput and waiting times.
Thus, DAIC has the following policy:
To stimulate short jobs, the
short
QoS has a higher priority, and allows you to use a larger part of all resources, than themedium
andlong
QoS.To prevent long jobs from blocking all resources in the cluster for long times (thus causing long waiting times), only a certain part of all cluster resources is available to all running
long
QoS jobs (of all users) combined.All running
medium
QoS jobs together can use a somewhat larger part of all resources in the cluster, and all runningshort
QoS jobs combined are allowed to fill the biggest part of the cluster.- These limits are called the QoS group limits.
- When this limit is reached, no new jobs with this QoS can be started, until some of the running jobs with this QoS finish and release some resources.
- The scheduler will indicate this with the reason
QoS Group CPU/memory/GRES limit
.
To prevent one user from single-handedly using all available resources in a certain QoS, there are also limits for the total resources that all running jobs of one user in a specific QoS can use.
- These are called the QoS per-user limits.
- When this limit is reached, no new jobs of this user with this QoS can be started, until some of the running jobs of this user and with this QoS finish and release some resources.
- The scheduler will indicate this with the reason
QoS User CPU/memory/GRES limit
.
These per-group and per-user limits are set by the DAIC user board, and the scheduler strictly enforces these limits. Thus, no user can use more resources than the amount that was set by the user board. Any (perceived) imbalance in the use of resources by a certain QoS or user should not be held against a user or the scheduler, but should be discussed in the user board.
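To see why one of your jobs is still pending, you can print the reason column of squeue; the format string below is only an example, and the reason appears in Slurm's abbreviated form (eg, QOSGrpCpuLimit):
$ squeue -u $USER --format='%.10i %.9P %.8q %.2t %.20R' # %R shows the pending reason, or the node list once running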
4.6.2 - Quality of Service (QoS)
When you submit a job in a slurm-based system, it enters a queue waiting for resources. The partition and Quality of Service (QoS) are the two job parameters slurm uses to assign resources for a job:
- The partition is a set of compute nodes on which a job can be scheduled. In DAIC, the nodes contributed or funded by a certain group are lumped into a corresponding partition (see Contributing departments).
All nodes in DAIC are part of the
general
partition, but other partitions exist for prioritization purposes on select nodes (see Priority tiers). - The Quality of Service is a set of limits that controls what resources a job can use and, therefore, determines the priority level of a job. This includes the run time, CPU, GPU and memory limits on the given partition. Jobs that exceed these limits are automatically terminated (see QoS priority).
For DAIC, Table 1 shows the QoS limits on the general
partition.
Partition | QoS | Priority | Max run time | Jobs per user | CPU limit per QoS | CPU limit per user | GPU limit per QoS | GPU limit per user | Memory limit per QoS | Memory limit per user |
---|---|---|---|---|---|---|---|---|---|---|
general | interactive | high | 1 hour | 1 running | - | 2 | - | 2 | - | 16G |
general | short | normal | 4 hours | 10000 | 3672 (85%) | 2160 (50%) | 109 (85%) | 64 (50%) | 23159G (85%) | 13623G (50%) |
general | medium | medium | 1 ½ day | 2000 | 3456 (80%) | 1512 (35%) | 103 (80%) | 45 (35%) | 21796G (80%) | 9536G (35%) |
general | long | low | 7 days | 1000 | 3240 (75%) | 864 (20%) | 96 (75%) | 25 (20%) | 20434G (75%) | 5449G (20%) |
general | infinite* | none | infinite | 1 running | 32 | - | 2 | - | 250G | - |

*infinite QoS jobs will be killed when compute nodes go down, eg, during maintenance. It is not recommended to submit jobs with this QoS.
Note
The priority of a job is a function of both QoS and previous usage (less is better). Read Priority and waiting times for more information.
See Quality of Service definitions
On DAIC you can check the QoS policies with the sacctmgr
command:
$ sacctmgr list qos
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=1
short 50 00:00:00 cluster DenyOnLimit 1.000000 cpu=3562,gre+ 65536 04:00:00 cpu=2096,gre+ 10000 cpu=1,mem=1M
long 25 00:00:00 cluster DenyOnLimit 1.000000 cpu=3144,gre+ 65536 7-00:00:00 cpu=838,gres+ 1000 cpu=1,mem=1M
infinite 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=32,gres/+ 65536 1 100 cpu=1,mem=1M
interacti+ 100 00:00:00 cluster DenyOnLimit 2.000000 65536 01:00:00 cpu=2,gres/g+ 1 1 cpu=1,mem=1M
student 10 00:00:00 cluster DenyOnLimit 1.000000 cpu=192,gres+ 65536 04:00:00 cpu=2,gres/g+ 1 100 cpu=1,mem=1M
reservati+ 100 00:00:00 cluster DenyOnLimit,RequiresReservation 1.000000 65536 10000 cpu=1,mem=1M
influence 100 00:00:00 cluster DenyOnLimit 1.000000 65536 10000 cpu=1,mem=1M
guest-sho+ 10 00:00:00 cluster DenyOnLimit 1.000000 cpu=200,gres+ 65536 04:00:00 cpu=128,gres+ 100 cpu=1,mem=1M
guest-long 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=200,gres+ 65536 7-00:00:00 cpu=128,gres+ 1 10 cpu=1,mem=1M
medium 35 00:00:00 cluster DenyOnLimit 1.000000 cpu=3352,gre+ 65536 1-12:00:00 cpu=1466,gre+ 2000 cpu=1,mem=1M
How to use QoS in your sbatch
scripts?
In your sbatch.slurm
script you can specify the QoS with #SBATCH --qos=...
option.
Example:
#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --partition=general
#SBATCH --account=ewi-insy-reit
#SBATCH --qos=short # This is how you specify QoS
#SBATCH --time=0:01:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=1GB
#SBATCH --output=slurm-%n-%j.out
#SBATCH --error=slurm-%n-%j.err
srun echo 'Hi, from Slurm!'
sleep 30 # Wait for 30 seconds before exiting.
QoS for reservations
If you have a reservation, you need to specify --qos=reservation and --reservation=<name of your reservation> (see Reservations for details).
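For example (the reservation and partition names here are illustrative, taken from the Reservations section):
$ sbatch --qos=reservation --reservation=icra_iv --partition=cor jobscript.sbatch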
4.6.3 - Partitions
In SLURM, a partition is a scheduling construct that groups nodes or resources based on certain characteristics or policies. Partitions are used to organize and manage resources within a cluster, and they allow system administrators to control how jobs are allocated and executed on different nodes.
See partition definitions
On DAIC, the scontrol command only shows you the general partition. More partitions are available.
$ scontrol show partition
PartitionName=general
AllowGroups=ALL AllowAccounts=ALL DenyQos=influence
AllocNodes=login[1-3],oodtest Default=YES QoS=N/A
DefaultTime=00:01:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=3dgi[1-2],100plus,awi[01-26],cor1,gpu[01-11],grs[1-4],influ[1-6],insy[11-16],tbm5,wis1
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=4064 TotalNodes=59 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=1024 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=0.5,Mem=0.083333333G,GRES/gpu=16.0
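To get an overview of partitions beyond general, the standard sinfo command can be used; with --all it also lists partitions that are hidden or not accessible to your group (a generic Slurm sketch; which partitions you may actually submit to still depends on your account):
$ sinfo --all --summarize     # one summary line per partition, including hidden ones
$ sinfo --partition=general   # nodes and their state in the general partition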
4.6.4 - Interactive jobs
Interactive jobs on compute nodes
To work interactively on a node, e.g., to debug running code or test on a GPU, start an interactive session using sinteractive <compute requirements>. If no parameters are provided, the defaults are applied. <compute requirements> can be specified the same way as sbatch directives within an sbatch script (see Submitting jobs), as in the examples below:
$ hostname # check you are in one of the login nodes
login1.daic.tudelft.nl
$ sinteractive
16:07:20 up 12 days, 4:09, 2 users, load average: 7.06, 7.04, 7.12
$ hostname # check you are in a compute node
insy15
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 general bash SomeNetI R 1:23 1 insy15
$ logout # exit the interactive job
To request a node with certain compute requirements:
$ sinteractive --ntasks=1 --cpus-per-task=2 --mem=4096
16:07:20 up 12 days, 4:09, 2 users, load average: 7.06, 7.04, 7.12
Warning
When you log out from an interactive session, all running processes will be terminated.

Note

Requesting an interactive session is subject to the same resource availability constraints as submitting an sbatch script: you may need to wait until resources are available, just as when you submit an sbatch script.

4.6.5 - Submitting jobs
Job scripts are text files in which the header is a set of directives that specify compute resources, and the remainder is the code that needs to run. All resources and scheduling are specified in the header as #SBATCH directives (see man sbatch for more information). The code can be a set of steps to run in series, or parallel tasks within these steps (see Slurm job's terminology).
The code snippet below is a template script that can be customized to run jobs on DAIC. A useful tool that can be used to streamline the debugging of such scripts is ShellCheck .
#!/bin/sh
#SBATCH --partition=general # Request partition. Default is 'general'
#SBATCH --qos=short # Request Quality of Service. Default is 'short' (maximum run time: 4 hours)
#SBATCH --time=0:01:00 # Request run time (wall-clock). Default is 1 minute
#SBATCH --ntasks=1 # Request number of parallel tasks per job. Default is 1
#SBATCH --cpus-per-task=2 # Request number of CPUs (threads) per task. Default is 1 (note: CPUs are always allocated to jobs per 2).
#SBATCH --mem=1024 # Request memory (MB) per node. Default is 1024MB (1GB). For multiple tasks, specify --mem-per-cpu instead
#SBATCH --mail-type=END # Set mail type to 'END' to receive a mail when the job finishes.
#SBATCH --output=slurm_%j.out # Set name of output log. %j is the Slurm jobId
#SBATCH --error=slurm_%j.err # Set name of error log. %j is the Slurm jobId
/usr/bin/scontrol show job -d "$SLURM_JOB_ID" # check sbatch directives are working
# Remaining job commands go below here. For example, to run a Matlab script named "matlab_script.m", uncomment:
#module use /opt/insy/modulefiles # Use DAIC INSY software collection
#module load matlab/R2020b # Load Matlab 2020b version
#srun matlab < matlab_script.m # Computations should be started with 'srun'.
Note
- DAIC nodes are dual-threaded: CPUs are automatically allocated in multiples of 2, so request (a multiple of) 2 threads in your job.
- Do not enable mails when submitting large numbers (>20) of jobs at once
Job submission
To submit a job script jobscript.sbatch
, log in to DAIC, and:
- To only test:
$ sbatch --test-only jobscript.sbatch
Job 1 to start at 2015-06-30T14:00:00 using 2 processors on nodes insy15 in partition general
- To actually submit the job and do the computations:
$ sbatch jobscript.sbatch
Submitted batch job 2
4.6.6 - Monitoring jobs
- To check your job has actually been submitted:
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 general jobscip SomeNetI R 0:01 1 insy15
- To check the log of your job, use an editor or viewer of your choice (eg, vi, nano, or simply cat):
$ cat slurm-2.out
JobId=2 JobName=jobscript.sbatch
UserId=SomeNetId(123) GroupId=domain users(100513) MCS_label=N/A
Priority=23909774 Nice=0 Account=ewi-insy QOS=short
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2015-06-30T14:00:00 EligibleTime=2015-06-30T14:00:00
AccrueTime=2015-06-30T14:00:00
StartTime=2015-06-30T14:00:01 EndTime=2015-06-30T14:01:01 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2015-06-30T14:01:01 Scheduler=Main
Partition=general AllocNode:Sid=login1:2220
ReqNodeList=(null) ExcNodeList=(null)
NodeList=insy15
BatchHost=insy15
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=1G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
JOB_GRES=(null)
Nodes=insy15 CPU_IDs=26-27 Mem=1024 GRES=
MinCPUsNode=2 MinMemoryNode=1G MinTmpDiskNode=50M
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/nfs/SomeNetId/jobscript.sbatch
WorkDir=/home/nfs/SomeNetId
StdErr=/home/nfs/SomeNetId/slurm_2.err
StdIn=/dev/null
StdOut=/home/nfs/SomeNetId/slurm_2.out
Power=
MailUser=SomeNetId@tudelft.nl MailType=END
Checking slurm jobs
Sometimes, it may be desirable to inspect slurm jobs beyond their status in the queue. For example, to check which script was submitted, or how the resources were requested and allocated. Below are a few useful commands for this purpose:
- See job definition
$ scontrol show job 8580148
JobId=8580148 JobName=jobscript.sbatch
UserId=SomeNetID(123) GroupId=domain users(100513) MCS_label=N/A
Priority=23721804 Nice=0 Account=ewi-insy QOS=short
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:12 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2023-07-10T06:41:57 EligibleTime=2023-07-10T06:41:57
AccrueTime=2023-07-10T06:41:57
StartTime=2023-07-10T06:41:58 EndTime=2023-07-10T06:42:58 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-07-10T06:41:58 Scheduler=Main
Partition=general AllocNode:Sid=login1:19162
ReqNodeList=(null) ExcNodeList=(null)
NodeList=awi18
BatchHost=awi18
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=1G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=1G MinTmpDiskNode=50M
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/nfs/SomeNetID/jobscript.sbatch
WorkDir=/home/nfs/SomeNetID
StdErr=/home/nfs/SomeNetID/slurm_8580148.err
StdIn=/dev/null
StdOut=/home/nfs/SomeNetID/slurm_8580148.out
Power=
MailUser=SomeNetId@tudelft.nl MailType=END
- See statistics of a running job
$ sstat 1
JobID AveRSS AveCPU NTasks AveDiskRead AveDiskWrite
------- ------- ------- ------- ------------ ------------
1.0 426K 00:00.0 1 0.52M 0.01M
- See accounting information of a finished job (also see the --long option)
$ sacct -j 8580148
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8580148 jobscript+ general ewi-insy 2 COMPLETED 0:0
8580148.bat+ batch ewi-insy 2 COMPLETED 0:0
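If the default columns are not enough, specific fields can be selected with sacct's --format option (a sketch using standard sacct field names; adjust the list to what you need):
$ sacct -j 8580148 --format=JobID,JobName,Partition,Elapsed,MaxRSS,State,ExitCode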
- See overall job efficiency of a finished job
$ seff 8580148
Job ID: 8580148
Cluster: insy
User/Group: SomeNetID/domain users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:01:00 core-walltime
Job Wall-clock time: 00:00:30
Memory Utilized: 340.00 KB
Memory Efficiency: 0.03% of 1.00 GB
4.6.7 - Cancelling jobs
- To cancel a given job:
$ scancel <jobID>
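scancel also accepts standard filters to cancel several jobs at once, for example by user or by job state (a generic Slurm sketch):
$ scancel -u SomeNetID                  # Cancel all your jobs (replace SomeNetID with your NetID)
$ scancel -u SomeNetID --state=PENDING  # Cancel only your pending jobs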
Note
It is possible to specify the sbatch
directives, like --mem
, --ntasks
, … etc in the command line as in:
$ sbatch --time=00:02:00 jobscript.sbatch
This specification is generally not recommended for production, as it is less reproducible than specifying within the job script itself.
4.6.8 - Using graphic cards
Jobs on GPU resources
Some DAIC nodes have GPUs of different types that can be used for various compute purposes (see GPUs).
To request a gpu for a job, use the sbatch directive --gres=gpu[:type][:number]
, where the optional [:type]
and [:number]
specify the type and number of the GPUs requested, as in the examples below:
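For instance (the GPU type name is illustrative; see System specifications for the types actually available):
#SBATCH --gres=gpu          # request any single GPU
#SBATCH --gres=gpu:2        # request any two GPUs
#SBATCH --gres=gpu:v100:1   # request one V100 GPU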
Note
For CUDA programs, first load the needed modules (CUDA, cuDNN) before running your code (see Available software).

An example batch script with GPU resources
#!/bin/sh
#SBATCH --partition=general # Request partition. Default is 'general'
#SBATCH --qos=short # Request Quality of Service. Default is 'short' (maximum run time: 4 hours)
#SBATCH --time=0:01:00 # Request run time (wall-clock). Default is 1 minute
#SBATCH --ntasks=1 # Request number of parallel tasks per job. Default is 1
#SBATCH --cpus-per-task=2 # Request number of CPUs (threads) per task. Default is 1 (note: CPUs are always allocated to jobs per 2).
#SBATCH --mem=1024 # Request memory (MB) per node. Default is 1024MB (1GB). For multiple tasks, specify --mem-per-cpu instead
#SBATCH --mail-type=END # Set mail type to 'END' to receive a mail when the job finishes.
#SBATCH --output=slurm_%j.out # Set name of output log. %j is the Slurm jobId
#SBATCH --error=slurm_%j.err # Set name of error log. %j is the Slurm jobId
#SBATCH --gres=gpu:1 # Request 1 GPU
# Measure GPU usage of your job (initialization)
previous=$(/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/tail -n '+2')
/usr/bin/nvidia-smi # Check sbatch settings are working (it should show the GPU that you requested)
# Remaining job commands go below here. For example, to run python code that makes use of GPU resources:
# Uncomment these lines and adapt them to load the software that your job requires
#module use /opt/insy/modulefiles # Use DAIC INSY software collection
#module load cuda/11.2 cudnn/11.2-8.1.1.33 # Load certain versions of cuda and cudnn
#srun python my_program.py # Computations should be started with 'srun'. For example:
# Measure GPU usage of your job (result)
/usr/bin/nvidia-smi --query-accounted-apps='gpu_utilization,mem_utilization,max_memory_usage,time' --format='csv' | /usr/bin/grep -v -F "$previous"
Similarly, to interactively work in a GPU node:
$ hostname # check you are in one of the login nodes
login1.daic.tudelft.nl
$
$ sinteractive --cpus-per-task=1 --mem=500 --time=00:01:00 --gres=gpu:v100:1
Note: interactive sessions are automatically terminated when they reach their time limit (1 hour)!
srun: job 8607665 queued and waiting for resources
srun: job 8607665 has been allocated resources
15:27:18 up 51 days, 3:04, 0 users, load average: 62,09, 59,43, 44,04
SomeNetID@insy11:~$
SomeNetID@insy11:~$ hostname # check you are in one of the compute nodes
insy11.daic.tudelft.nl
SomeNetID@insy11:~$
SomeNetID@insy11:~$ nvidia-smi # check characteristics of GPU
Mon Jul 24 15:37:01 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:88:00.0 Off | 0 |
| N/A 32C P0 40W / 300W| 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
SomeNetID@insy11:~$
SomeNetID@insy11:~$ exit # exit the interactive session
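To check which GPU types are available on which nodes before picking a [:type], a sinfo query on the generic resources (GRES) column can help (a sketch using standard sinfo format specifiers; the exact GRES names depend on the cluster configuration):
$ sinfo --partition=general --format="%N %G"   # node names and their GPUs (GRES)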
4.6.9 - Job arrays
Parallelizing jobs with Job Arrays
There can be scenarios, eg in simulations or benchmarking, where a job script needs to run many times with a different parameter set each time. If done manually, keeping track of the parameter values and corresponding jobIds is cumbersome. Job Arrays are a convenient mechanism for submitting and managing such jobs.
A job array is created by adding the --array=<indexes>
directive to an sbatch script (or in the command line), where <indexes>
can be either a comma separated list of integers, or a range with optional step size, eg, 1-10:2
. The minimum index value is 0, and the maximum is a Slurm configuration parameter (MaxArraySize - 1
).
Within a job array, all jobs have the same SLURM_ARRAY_JOB_ID
, but each job will have its own environment variable SLURM_ARRAY_TASK_ID
that corresponds to the array index value. Additionally, all jobs in the array inherit the same compute resources requirements. In the following examples, arrays of size 2 are created, but with different indexes:
$ sbatch --array=1,4 jobscript.sbatch # Indexes specified as a list, and have values 1 and 4
Submitted batch job 8580151
$
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8580151_1 general jobscrip SomeNetID R 0:01 1 grs4
8580151_4 general jobscrip SomeNetID R 0:01 1 awi18
$ sbatch --array=1-2 jobscript.sbatch # Range specified with default step size = 1. Index have values 1 and 2
Submitted batch job 8580149
$
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8580149_1 general jobscrip SomeNetID R 0:21 1 grs4
8580149_2 general jobscrip SomeNetID R 0:21 1 awi18
Note
To limit the maximum number of simultaneously running jobs in an array, use the % separator, eg --array=1-15%3 to run only 3 tasks at a time.

JobId and environment variables
As shown in the previous section, Parallelizing jobs with job arrays, jobs within an array are assigned special slurm variables. These variables can be exploited for various computational objectives. Among these, SLURM_ARRAY_TASK_ID
is the index of an individual task within the array, and SLURM_ARRAY_JOB_ID
is the slurm jobId of the entire array job.
In the simplest case, you can use the ${SLURM_ARRAY_TASK_ID}
directly in a script to assign parameter values. For example, to run a workflow across a set of images image_1.png
… image_5.png
, you can simply create an array using the sbatch directive --array=1-5
, and then, within your sbatch script, use image_${SLURM_ARRAY_TASK_ID}.png
to indicate the corresponding image.
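A minimal sketch of this pattern (the image files and the process_image.py script are hypothetical placeholders):
#!/bin/bash
#SBATCH --job-name=image-array
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --array=1-5                   # One task per image
#SBATCH --output=slurm-%A_%a.out      # %A is SLURM_ARRAY_JOB_ID, %a is SLURM_ARRAY_TASK_ID

srun python process_image.py image_${SLURM_ARRAY_TASK_ID}.png   # eg, task 3 processes image_3.png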
In more complex scenarios, eg, when the parameters of interest are not mappable to indexes (of a job array), you can use a config file to map the parameters to the job array indexes. For example, let’s assume the following parameters:
$ cat jobarray.config
i Flower Color Origin
1 Rose Red Worldwide
2 Jasmine White Asia
3 Tulip Various Persia&Turkey
4 Orchid Various Worldwide
5 Lily Various Worldwide
Now, you can use these parameters inside a job script as follows:
$ cat jobarray.sbatch
#!/bin/bash
#SBATCH --job-name=JobArrayExample
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-5 # Array with 5 tasks
#SBATCH --output=slurm-%A_%a.out # Set name of output log. %A is SLURM_ARRAY_JOB_ID and %a is SLURM_ARRAY_TASK_ID
#SBATCH --error=slurm-%A_%a.err # Set name of error log. %A is SLURM_ARRAY_JOB_ID and %a is SLURM_ARRAY_TASK_ID
config=jobarray.config # Path to config file
# Obtain parameters from config file:
flower=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $2}' $config)
color=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $3}' $config)
origin=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $4}' $config)
# Use the parameters, eg, print the index and parameter values to a file:
echo "Array task: ${SLURM_ARRAY_TASK_ID}, Flower: ${flower}, color: ${color}, origin: ${origin}" >> output.txt
$
$ sbatch jobarray.sbatch
Submitted batch job 8580317
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8580317_[1-5] general JobArray SomeNetID PD 0:00 1 (Priority)
In this example, slurm created 5 jobs in a job array, each using the same settings (the name JobArrayExample
, the general
partition, short
QoS, 00:01:00
time, 1
task with 1
CPU and 1G
memory, and an output and error file with both array job Id and task id). Each task looks up certain parameter values from a config file leveraging its index via the awk
command.
Note
The command:
flower=$(awk -v ArrayTaskID=$SLURM_ARRAY_TASK_ID '$1==ArrayTaskID {print $2}' $config)
assigns a value to the variable flower
by reading a configuration file ($config
), and printing the value in the second column ({print $2}
) where the first column matches the value of the ArrayTaskID
variable ($1==ArrayTaskID
). The ArrayTaskID
is an awk variable set to the value of the SLURM environment variable SLURM_ARRAY_TASK_ID
.
For more on the awk
utility, see this awk tutorial.
Jobs within a task array run in parallel, and hence, there is no guarantee about their order of execution. This is evident from the output file of this example:
$ cat output.txt
Array task: 2, Flower: Jasmine, color: White, origin: Asia
Array task: 3, Flower: Tulip, color: Various, origin: Persia&Turkey
Array task: 1, Flower: Rose, color: Red, origin: Worldwide
Array task: 5, Flower: Lily, color: Various, origin: Worldwide
Array task: 4, Flower: Orchid, color: Various, origin: Worldwide
Other slurm variables that are set inside a job array are shown in the following table, with values based on the preceding example:
Slurm Environment Variable | Description | Value in example |
---|---|---|
SLURM_ARRAY_JOB_ID | The first job ID of the array. | 8580317 |
SLURM_ARRAY_TASK_ID | The job array index value. | A value in range 1-5 |
SLURM_ARRAY_TASK_COUNT | The number of tasks in the job array. | 5 |
SLURM_ARRAY_TASK_MAX | The highest job array index value. | 5 |
SLURM_ARRAY_TASK_MIN | The lowest job array index value | 1 |
Slurm commands and job arrays
The squeue command reports all jobs currently in the queue. By default, squeue
reports all of the tasks associated with a job array in one line and uses a regular expression to indicate the SLURM_ARRAY_TASK_ID
values. To explicitly print one job array element per line, use the --array
or -r
flag. The following examples highlight the difference, using the same jobarray.sbatch
file from the JobId and environment variables section:
$ sbatch jobarray.sbatch
Submitted batch job 8593299
$
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8593299_[1-5] general JobArray SomeNetID PD 0:00 1 (Priority)
$
$ squeue -r -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8593299_1 general JobArray SomeNetID PD 0:00 1 (Priority)
8593299_2 general JobArray SomeNetID PD 0:00 1 (Priority)
8593299_3 general JobArray SomeNetID PD 0:00 1 (Priority)
8593299_4 general JobArray SomeNetID PD 0:00 1 (Priority)
8593299_5 general JobArray SomeNetID PD 0:00 1 (Priority)
scancel
, on the other hand, can be used to cancel an entire job array by specifying its SLURM_ARRAY_JOB_ID
. Alternatively, to cancel a specific task (or tasks), both its SLURM_ARRAY_JOB_ID
and SLURM_ARRAY_TASK_ID
must be specified, possibly with an index range, as shown in the following examples:
$ sbatch jobarray.sbatch
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8593321_[1-5] general JobArray SomeNetID PD 0:00 1 (Priority)
$
$ scancel 8593321_4 # Cancel task with index 4 in the array
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8593321_[1-3,5] general JobArray SomeNetID PD 0:00 1 (Priority)
$
$ scancel 8593321_[1-3] # Cancel tasks in index range 1-3 in the array
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8593321_5 general JobArray SomeNetID PD 0:00 1 (Priority)
$
$ scancel 8593321 # Cancel all tasks in the array
$ squeue -u SomeNetID # Replace SomeNetId with your NetID
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
$
Note
For more information on job arrays, refer to Slurm Job Array Support.

Troubleshooting Common Issues
Please see the Frequently asked questions on Scheduler problems and Job resources
4.6.10 - Job chains
Deploying dependent jobs (job chains)
In certain scenarios, it might be desirable to condition the execution of a certain job on the status of another job. In such cases, the sbatch directive --dependency=<condition>:<jobID>
can be used, where <condition>
specifies the type of dependency (See table 2), and <jobID>
is the slurm jobID upon which dependency is based. To specify more than one dependency, the , separator indicates that all dependencies must be satisfied, while the ? separator denotes that satisfying any one dependency is sufficient.
For example, assume the slurm job scripts, job_1.sbatch
, … job_3.sbatch
need to run sequentially one after the other. To start this chain, submit the first job and obtain its jobID:
$ sbatch job_1.sbatch
Submitted batch job 8580135
Next, submit the second job to run only if the first job is successful:
$ sbatch --dependency=afterok:8580135 job_2.sbatch
Submitted batch job 8580136
Note
Note that if the first job (with jobID8580135
in the example) fails, the second job (with jobID 8580136
) will not run, but it will remain in the queue. You have to use scancel 8580136
to cancel this job.

Now, to run the third job only after the first two jobs have both run successfully:
$ sbatch --dependency=afterok:8580135,8580136 job_3.sbatch
Submitted batch job 8580140
Alternatively, if the third job is dependent on either job running successfully:
$ sbatch --dependency=afterok:8580135?8580136 job_3.sbatch
Submitted batch job 8580141
Warning
- If the jobs within a chain involve copying data files to a local disk (
/tmp
) on a node, you need to make sure all jobs use the same node (--nodelist=<node>
, for example--nodelist=insy15
)
Argument | Description |
---|---|
after | This job can begin execution after the specified jobs have begun execution |
afterany | This job can begin execution after the specified jobs have terminated. |
aftercorr | A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully |
afternotok | This job can begin execution after the specified jobs have terminated in some failed state |
afterok | This job can begin execution after the specified jobs have successfully executed |
singleton | This job can begin execution after any previously launched jobs sharing the same job name and user have terminated |
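When building chains in a script, it can be convenient to capture each job ID automatically; sbatch's --parsable option prints only the job ID, which can then be passed to the next --dependency (a sketch for the sequential scenario above):
#!/bin/bash
# Submit three jobs so that each starts only after the previous one finished successfully.
# --parsable makes sbatch print just the job ID, so it can be captured in a variable.

jid1=$(sbatch --parsable job_1.sbatch)
jid2=$(sbatch --parsable --dependency=afterok:${jid1} job_2.sbatch)
jid3=$(sbatch --parsable --dependency=afterok:${jid2} job_3.sbatch)

echo "Submitted chain: ${jid1} -> ${jid2} -> ${jid3}"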
4.6.11 - Reservations
Resources reservations
Slurm gives the possibility to reserve one or more compute nodes exclusively for a specific user or group of users. A reservation ensures that the designated node (or nodes) are dedicated solely to the reservation holder’s tasks and are not shared with other users during the reserved period. This feature allows users to plan the execution of future workloads, and accommodates cluster users with special needs beyond the batch system (eg latency measurement scenarios).
Note
Using reservations is in line with the General cluster usage clauses of DAIC users’ agreement. However, please be mindful that reservations are intended to facilitate special needs that cannot be satisfied by the batch system, and should not be requested to guarantee fast throughput for production runs.Requesting a Reservation
To request a reservation for nodes, please use the Request Reservation form. You can request a reservation for an entire compute node (or a group of nodes) if you have contributed this node (or these nodes) to the cluster and you have special needs that need to be accommodated.
General guidelines for reservations’ requests:
- You can be granted a reservation only on nodes from a partition that is contributed by your group (See Partitions to check the name of the partition contributed by your group, and System specifications for a listing of available nodes and their features).
- Please ask for the least amount of resources you need as to minimize impact on other users.
- Plan ahead and request your reservation as soon as possible: Reservations usually ignore running jobs, so any running job on the machine(s) you request will continue to run when the reservation starts. While jobs from other users will not start on the reserved node(s), the resources in use by an already running job at the start time of the reservation will not be available in the reservation until this running job ends. The earlier ahead you request resources, the easier it is to allocate the requested resources.
Using reservations
Once your reservation request is approved and a reservation is placed on the system, you can run your jobs in the reservation by specifying --qos=reservation
along with the following directives to your slurm commands: --reservation=<name>
and --partition=<partition>
. For example, to submit the job job.sbatch
to a reservation named icra_iv
on the cor1
node on the cor
partition use:
$ sbatch --qos=reservation --reservation=icra_iv --partition=cor job.sbatch
Alternatively, it is possible to add the following lines to the job.sbatch
file, and submitting this file as usual:
#SBATCH --qos=reservation
#SBATCH --reservation=icra_iv
#SBATCH --partition=cor
Note
It is possible to submit jobs to a reservation as soon as it has been created. Jobs will start immediately when the reservation becomes available, but jobs already running on the reserved resources will not be cancelled for the reservation to start.

Note
When a reservation is used to run your jobs, remember to also pass the reservation parameters to your srun steps:
$ srun --qos=reservation --reservation=<reservation_name> --partition=<partition_name> <some_script.sh>
To make use of an existing reservation you have to specify --qos=reservation
and --reservation=<reservation-name>
in your sbatch
script.
Viewing reservations
To view all active and future reservations run the scontrol
command as follows:
$ scontrol show reservations
ReservationName=icra_iv StartTime=2023-09-09T00:00:00 EndTime=2023-09-16T00:00:00 Duration=7-00:00:00
Nodes=cor1 NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=cor Flags=
TRES=cpu=64
Users=(null) Groups=(null) Accounts=3me-cor Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
ReservationName=maintenance weekend 2023-10-14 StartTime=2023-10-13T20:00:00 EndTime=2023-10-16T09:00:00 Duration=2-13:00:00
Nodes=3dgi[1-2],100plus,awi[01-26],cor1,gpu[01-11],grs[1-4],influ[1-6],insy[11-12,14-16],tbm5,wis1 NodeCnt=58 CoreCnt=2000 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
TRES=cpu=4000
Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Note
- Jobs can run on a reservation only if explicitly requested, as shown in the Using reservations section.
- Only jobs from the Users or Accounts associated with the reservation (as shown in the scontrol show reservations output) will run on the reservation.
- The STATE of a reservation will show as ACTIVE (instead of INACTIVE) during the reservation window.
4.6.12 - Kerberos
Kerberos Authentication
Kerberos is an authentication protocol which uses tickets to authenticate users (and computers). You automatically get a ticket when you log in with your password on a TU Delft installed computer. You can use this ticket to authenticate yourself without password when connecting to other computers or accessing your files. To protect you from misuse, the ticket expires after 10 hours or less (even when you’re still logged in).
File access
Your Linux and Windows Home directories and the Group and Project shares are located on network fileservers, which allows you to access your files from all TU Delft installed computers. Kerberos authentication is used to enable access to, or protect, your files. Without a valid Kerberos ticket (e.g. when the ticket has expired) you will not be able to access your files but instead you will receive a Permission denied
error.
Lifetime of Kerberos Tickets
Kerberos tickets have a limited valid lifetime (of up to 10 hours) to reduce the risk of abuse, even when you stay logged in. If your tickets expire, you will receive a Permission Denied
error when you try to access your files and a password prompt when you try to connect to another computer. When you want your program to be able to access your files for longer than the valid ticket lifetime, you’ll have to renew your ticket (repeatedly) until your program is done. Kerberos tickets can be renewed up to a maximum renewable life period of 7 days (again to reduce the risk of abuse).
The command klist -5
lists your cached Kerberos tickets together with their expiration time and maximum renewal time:
$ klist -5
Ticket cache: FILE:/tmp/krb5cc_uid_random
Default principal: YourNetID@TUDELFT.NET
Valid starting Expires Service principal
01/01/01 00:00:00 01/01/01 10:00:00 krbtgt/TUDELFT.NET@TUDELFT.NET
renew until 01/08/01 00:00:00
Where:
- Ticket cache: The Kerberos tickets that have been issued to you are stored in a ticket cache file. You can have multiple ticket cache files on the same computer (from different connections, for example) with different tickets and ticket expiration times. Some ticket cache files are automatically removed when you log out. Tip: make sure that you renew the tickets in the right ticket cache file (see the screen example below).
- Default principal: Your identity.
- Service principal: The identity of services that you have gotten tickets for. You always need a Kerberos ticket-granting ticket (krbtgt) in order to obtain other tickets for specific services like accessing files (nfs) or connecting to computers (host).
- Valid starting, Expires: Your ticket is only valid between these times (this period is called the valid lifetime). After this time you will not be able to use the service nor automatically renew the ticket (without a password).
- Renew until: Your ticket can only be renewed without a password up to this time. After this time you will have to obtain a new ticket using your password.
Renewing Kerberos tickets
If you have a valid Kerberos krbtgt
ticket, you can renew it at any time (until it expires) by running the command kinit -R
:
$ kinit -R
$ klist -5
Ticket cache: FILE:/tmp/krb5cc_uid_random
Default principal: YourNetID@TUDELFT.NET
Valid starting Expires Service principal
01/01/01 01:00:00 01/01/01 11:00:00 krbtgt/TUDELFT.NET@TUDELFT.NET
renew until 01/08/01 00:00:00
Note
Renewing the ticket will not change the duration of the valid lifetime, i.e. a krbtgt ticket with a valid lifetime of 1 hour will, after renewal, be valid for another hour.

When the krbtgt ticket has expired or reached its renew until time, you will have to obtain a new ticket by running kinit -r 7d (note the difference in case for the r) and authenticating with your password:
$ kinit -r 7d
Password for YourNetID@TUDELFT.NET:
$ klist -5
Ticket cache: FILE:/tmp/krb5cc_uid_random
Default principal: YourNetID@TUDELFT.NET
Valid starting Expires Service principal
01/01/01 11:00:00 01/01/01 21:00:00 krbtgt/TUDELFT.NET@TUDELFT.NET
renew until 01/08/01 11:00:00
The new ticket will have a valid lifetime of 10 hours and a renewable life of 7 days.
On the TU Delft Linux desktops your Kerberos ticket is refreshed (i.e. replaced by a new ticket) automatically every time you enter your password for unlocking the screen saver.
Tip
Do not disable the screen saver password lock.

On remote computers you have to manually renew your tickets before they expire.
Slurm & Kerberos
- Slurm caches your Kerberos ticket, and uses it to execute your job
- Regularly renew the ticket in Slurm’s cache while your jobs are queued or running:
$ auks -a
Auks API request succeed
- To automatically renew your ticket in Slurm’s cache until you change your NetID password, run the following on the
login1
node:
$ install_keytab
Password for somebody@TUDELFT.NET:
Installed keytab.
You need to rerun this command whenever you change your NetID password (at least every 6 months). Otherwise, the automatic renewal will not work and you will receive a warning e-mail.
Renewal using screen
On the compute nodes, the screen
program has been modified to allow jobs to run unattended for up to 7 days. It creates a private ticket cache (to prevent the cache from being destroyed at logout) and automatically renews your ticket up to the maximum renewable life. For example, start MATLAB in Screen with screen matlab
(the order is important!).
$ screen matlab
Warning: No display specified. You will not be able to display graphics on the screen.
< M A T L A B (R) >
Copyright 1984-2010 The MathWorks, Inc.
Version 7.11.0.584 (R2010b) 64-bit (glnxa64)
August 16, 2010
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>>
For longer jobs you have to manually obtain a new ticket at least every 7 days by running kinit -r 7d
from within screen
(so you use the specific ticket cache file that screen
is using):
- connect to screen (
screen -r
), - create a new window (
Ctrl-a c
), - run
kinit -r 7d
, - exit the window (
exit
) and - detach from screen (
Ctrl-a d
).
$ kinit -r 7d
Password for YourNetID@TUDELFT.NET:
$ klist -5
Ticket cache: FILE:/tmp/krb5cc_uid_private
Default principal: YourNetID@TUDELFT.NET
Valid starting Expires Service principal
01/08/01 09:00:00 01/08/01 19:00:00 krbtgt/TUDELFT.NET@TUDELFT.NET
renew until 01/15/01 09:00:00
$ exit
Tip
Use a repeating reminder (twice a week) in your agenda so you don't forget.

Important

When the end of the renewable life is reached, your tickets expire and your program(s) will return Permission denied errors when trying to access your files. Your program(s) will not be terminated automatically; you still have to terminate the program(s) yourself.

Extra functionality can be provided by the k5start
and krenew
programs. On most computers these are not available by default but can be installed.
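For example, a long-running command can be wrapped in krenew so that the ticket is renewed in the background for as long as the command runs (a sketch, assuming krenew is installed; the script name is a hypothetical placeholder, and man krenew lists the options available on your system):
$ krenew -K 60 ./my_long_computation.sh   # renew the Kerberos ticket every 60 minutes while the command runs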