Slurm - yuuk1's Digital Garden

- QuickStart: [Slurm Workload Manager - Quick Start User Guide](https://slurm.schedmd.com/quickstart.html)

## Architecture

![[Pasted image 20241001230831.png|400]]

As shown in Figure 1, Slurm consists of a slurmd daemon running on each compute node and a central slurmctld daemon running on a management node (with an optional fail-over twin). The slurmd daemons provide fault-tolerant hierarchical communication. The user commands are sacct, sacctmgr, salloc, sattach, sbatch, sbcast, scancel, scontrol, scrontab, sdiag, sh5util, sinfo, sprio, squeue, sreport, srun, sshare, sstat, strigger, and sview. All of these commands can be run from anywhere in the cluster.

![[Pasted image 20241001231048.png|400]]

As shown in Figure 2, the entities managed by the Slurm daemons are:

- nodes: the compute resources in Slurm
- partitions: logical (possibly overlapping) groupings of nodes
- jobs: allocations of resources assigned to a user for a specified amount of time
- job steps: sets of (possibly parallel) tasks within a job

A partition can be thought of as a job queue, and each queue carries constraints such as a job size limit, a job time limit, and the set of users permitted to use it. Jobs are allocated nodes within a partition in priority order until the resources of that partition (nodes, processors, memory, and so on) are exhausted. Once a job has been allocated a set of nodes, the user can launch parallel work as job steps in any configuration within that allocation. For example, a single job step can use every node allocated to the job, or several job steps can each independently use a portion of the allocation.

## Slurm and/or/vs Kubernetes

https://slurm.schedmd.com/SC23/Slurm-and-or-vs-Kubernetes.pdf

### Perspectives - HPC Batch Scheduling

- HPC systems assume system size is fixed
    - And the workload is infinite
    - Queue prioritization is thus critical
- "Slurm is a policy engine" - quote stolen from a colleague
- Slurm manages a number of intertwined HPC system management tasks
    - Job queuing and prioritization - scheduling
    - Job accounting
    - Control user access to compute resources (cgroups, pam_slurm_adopt)
    - Enable large-scale concurrent job launch (MPI, PMIx, nss_slurm, sbcast)
- Jobs assume access to a usable, fully-featured, default Linux environment
    - Containerization - including Slurm's built-in container support - is optional
- Jobs are usually ad-hoc scripts, submitted through the command line
    - Newer features such as Slurm's RESTful API can support more programmatic interaction, but are not yet as widely adopted
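
Returning to the entity model from the Architecture section: the relationship between nodes, partitions, jobs, and job steps can be sketched as a toy model in Python. This is not Slurm's implementation or API. The class names (`Partition`, `Job`), the functions (`admissible`, `schedule`, `job_steps`), and the strict "stop at the first job that does not fit" policy are all assumptions made for illustration; real Slurm scheduling also involves plugins such as backfill, per-CPU and per-memory accounting, and much more.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical model of a partition: a set of nodes plus queue limits."""
    name: str
    nodes: list[str]
    max_nodes_per_job: int   # job size limit
    max_minutes: int         # job time limit
    allowed_users: set[str]  # users permitted to use this partition


@dataclass
class Job:
    """Hypothetical model of a job: a resource request from one user."""
    job_id: int
    user: str
    priority: int
    num_nodes: int
    minutes: int
    allocated: list[str] = field(default_factory=list)


def admissible(job: Job, part: Partition) -> bool:
    """Check the job against the partition's constraints."""
    return (
        job.user in part.allowed_users
        and job.num_nodes <= part.max_nodes_per_job
        and job.minutes <= part.max_minutes
    )


def schedule(jobs: list[Job], part: Partition) -> tuple[list[Job], list[str]]:
    """Allocate nodes in descending priority order until resources run out.

    Simplification: inadmissible jobs are skipped, and scheduling stops at
    the first admissible job that does not fit (no backfill).
    """
    free = list(part.nodes)
    started = []
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if not admissible(job, part):
            continue
        if job.num_nodes > len(free):
            break
        job.allocated = free[: job.num_nodes]
        free = free[job.num_nodes :]
        started.append(job)
    return started, free


def job_steps(job: Job, step_sizes: list[int]) -> list[list[str]]:
    """Carve a job's allocation into job steps of the given node counts."""
    if sum(step_sizes) > len(job.allocated):
        raise ValueError("job steps exceed the job's allocation")
    steps, start = [], 0
    for size in step_sizes:
        steps.append(job.allocated[start : start + size])
        start += size
    return steps


# Usage sketch with made-up node names and users.
debug = Partition("debug", ["n1", "n2", "n3", "n4"], 3, 60, {"alice", "bob"})
queue = [
    Job(1, "alice", priority=10, num_nodes=2, minutes=30),
    Job(2, "bob", priority=50, num_nodes=2, minutes=30),
    Job(3, "carol", priority=99, num_nodes=1, minutes=10),  # not an allowed user
    Job(4, "alice", priority=5, num_nodes=1, minutes=10),   # no nodes left
]
started, free = schedule(queue, debug)
for job in started:
    print(job.job_id, job.user, job.allocated)
```

In this sketch, a job step is just a slice of the job's node list, which mirrors the two cases in the text: `job_steps(job, [2])` is a single step using the whole two-node allocation, while `job_steps(job, [1, 1])` is two steps that each use part of it.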