A Map of Contents (MOC) note for Systems for ML.
## Terminology
- [[GPU]]
- [[RoCE]]
- [[NCCL]]
- [[Rail-Optimized]]
- [[PXN]]
## Collections
- [[ml-systems-papers]]
## Model Training
- [[分散深層学習 - MOC]]
### Paper
- [[2025__arXiv__ByteScale - Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs]]
- [[2025__arXiv__Every FLOP Counts - Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs]]
#### Auto Tuning
- AutoHete: [[2025__arXiv__AutoHete - An Automatic and Efficient Heterogeneous Training System for LLMs]]
- Mist: [[2025__EuroSys__Mist - Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization]]
## Model Serving
### Terminology
- [[LLM推論]]
- [[vLLM]]
- [[LLMサービングの性能評価をどうやるか?]]
### Blog/Slide
- [[gpt-ossモデルのサービングにおけるリクエスト処理性能評価 ― NVIDIA H100・A100・L4の比較 - ペパボ研究所ブログ]]
### Paper
- ServeGen: [[2025__arXiv__ServeGen - Workload Characterization and Generation of Large Language Model Serving in Production]]
- [[2025__VLDB__Approximation-First Timeseries Monitoring Query At Scale]]
- Shapeshifter: [[2025__EuroMLSys__Manage the Workloads not the Cluster Designing a Control Plane for Large-Scale AI Clusters]]
- Taming the Titans: [[2025__arXiv__Taming the Titans - A Survey of Efficient LLM Inference Serving]]
- Tempo: [[2025__arXiv__Tempo - Application-aware LLM Serving with Mixed SLO Requirements]]
- [[2025__ISPASS__Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures]]
- [[2025__arXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]]
- [[2025__ASAP__ReaLLM - A Trace-Driven Framework for Rapid Simulation of Large-Scale LLM Inference]]
- [[2025__arXiv__Hierarchical Prediction-based Management for LMaaS Systems]]
- [[2025__SIGMOD__Database as Runtime - Compiling LLMs to SQL for In-database Model Serving]]
- [[2025__ASPLOS__PIM Is All You Need - A CXL-Enabled GPU-Free System for Large Language Model Inference]]
- [[2025__CLOUD__Mind the Memory Gap Unveiling GPU Bottlenecks in Large Batch LLM Inference]]
- [[2025__arXiv__PipeBoost - Resilient Pipelined Architecture for Fast Serverless LLM Scaling]]
- [[2025__KDD__BurstGPT - A Real-World Workload Dataset to Optimize LLM Serving Systems]]
- [[2025__arXiv__Energy-Aware LLMs - A step towards sustainable AI for downstream applications]]
- [[2025__arXiv__FlashInfer - Efficient and Customizable Attention Engine for LLM Inference Serving]]
- DistServe: [[2024__OSDI__DistServe - Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving]]
- [[2024__arXiv__Towards a Middleware for Large Language Models]]
- [[2024__arXiv__One Queue Is All You Need - Resolving Head-of-Line Blocking in Large Language Model Serving]]
- [[2024__ASPLOS__Queue Management for Large Language Model Serving]]
- [[2024__ICSOC__UELLM - A Unified and Efficient Approach for Large Language Model Inference Serving]]
- [[2024__arXiv__Intelligent Router for LLM Workloads - Improving Performance Through Workload-Aware Scheduling]]
- [[2018__OSDI__Ray - A Distributed Framework for Emerging AI Applications]]
- [Overcoming Challenges in Serving Large Language Models](https://www.usenix.org/conference/srecon23emea/presentation/papapanagiotou)
## Network
- ICCL: [[2025__arXiv__An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters]]
- [[2025__CLOUD__Routing Strategies for RoCE Networks in AI Clouds]]
- PCCL: [[2025__arXiv__The Big Send-off - High Performance Collectives on GPU-based Supercomputers]]
- [[2025__arXiv__An Extensible Software Transport Layer for GPU Networking]]
- [[JANOG55 UECセッション: AIインフラ解説資料 共有ページ]]
- [[GPUクラスタネットワークとその設計思想(Rethinking AI Infrastructure Part 2)]]
- [[GPUネットワーク設計・運用 基礎勉強会 Lossless Ethernet – PFCECN編]]
- [[EthernetベースのGPUクラスタ導入による学びと展望]]
- [[PFNにおけるアクセラレータ間通信の実際 - Preferred Networks Research & Development]]
- [[AI(人工知能)の為のネットワーク - JANOG53 Meeting in Hakata]]
- [[HPCネットワーク基礎(RDMAInfinibandRoCE編)]]
- CyberAgent: [[AIML基盤の400G DCネットワークを構築した話 - JANOG52 Meeting in Nagasaki]]
- Part 3: [[中核はフロー制御をつかさどる「PFC」と「ETS」]]
## MLSys
- [[MLPerf Training]]
## Sakura Internet
- [[生成AI向けパブリッククラウドサービスをつくってみた話 さくらのナレッジ]]
- [[生成AI向け機械学習クラスタ構築のレシピ 北海道石狩編 さくらのナレッジ]]
- [[実際に運用してわかった! 多種GPU混載Kubernetesクラスタの使われ方と運用省力化]]
## GPU
- Models
- [[A100]]
- [[H100]]
- [[H200]]
- [[B200]]
- [[2025__SOSP__LithOS - An Operating System for Efficient Machine Learning on GPUs]]
- [[NVIDIA の最新ハードウェアの分析 B100B200GH200NVL72SuperPod]]
## Storage
- [[NECのAI研究用スーパーコンピュータとDDN Lustreストレージシステム]]
- [[Performance and Data Management at Scale for the AI Data Center - DDN]]
### Paper
- [[2025__ACM-IEEE__EMLIO - Minimizing I-O Latency and Energy Consumption for Large-Scale AI Training]]
## Reliability Engineering
- [[LLM学習中の計算機効率と障害]]
## Telemetry
- [[AI Infra Telemetry - MOC]]
## Books
- [[AI Systems Performance Engineering]]
## Related
- [[ML for Systems - MOC]]
- [[MLOps - MOC]]
- [[PFN AI Infra references]]