Map of Contents note for Systems for ML.

## Terminology

- [[GPU]]
- [[RoCE]]
- [[NCCL]]
- [[Rail-Optimized]]
- [[PXN]]

## Collections

- [[ml-systems-papers]]

## Model Training

- [[分散深層学習 - MOC]]

## Model Serving

### Terminology

- [[LLM推論]]
- [[vLLM]]
- [[LLMサービングの性能評価をどうやるか?]]

### Blog/Slide

- [[gpt-ossモデルのサービングにおけるリクエスト処理性能評価 ― NVIDIA H100・A100・L4の比較 - ペパボ研究所ブログ]]

### Paper

- ServeGen: [[2025__arXiv__ServeGen - Workload Characterization and Generation of Large Language Model Serving in Production]]
- [[2025__VLDB__Approximation-First Timeseries Monitoring Query At Scale]]
- Shapeshifter: [[2025__EuroMLSys__Manage the Workloads not the Cluster Designing a Control Plane for Large-Scale AI Clusters]]
- Taming the Titans: [[2025__arXiv__Taming the Titans - A Survey of Efficient LLM Inference Serving]]
- Tempo: [[2025__arXiv__Tempo - Application-aware LLM Serving with Mixed SLO Requirements]]
- [[2025__arXiv__ByteScale - Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs]]
- [[2025__arXiv__Every FLOP Counts - Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs]]
- [[2025__ISPASS__Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures]]
- [[2025__ArXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]]
- [[2025__ASAP__ReaLLM - A Trace-Driven Framework for Rapid Simulation of Large-Scale LLM Inference]]
- [[2025__arXiv__Hierarchical Prediction-based Management for LMaaS Systems]]
- [[2025__SIGMOD__Database as Runtime - Compiling LLMs to SQL for In-database Model Serving]]
- [[2025__ASPLOS__PIM Is All You Need - A CXL-Enabled GPU-Free System for Large Language Model Inference]]
- [[2025__CLOUD__Mind the Memory Gap Unveiling GPU Bottlenecks in Large Batch LLM Inference]]
- [[2025__ArXiv__PipeBoost - Resilient Pipelined Architecture for Fast Serverless LLM Scaling]]
- [[2025__KDD__BurstGPT - A Real-World Workload Dataset to Optimize LLM Serving Systems]]
- [[2025__ArXiv__Energy-Aware LLMs - A step towards sustainable AI for downstream applications]]
- [[2025_arXiv_FlashInfer - Efficient and Customizable Attention Engine for LLM Inference Serving]]
- DistServe: [[2024__OSDI__DistServe - Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving]]
- [[2024__arXiv__Towards a Middleware for Large Language Models]]
- [[2024__arXiv__One Queue Is All You Need - Resolving Head-of-Line Blocking in Large Language]]
- [[2024__ASPLOS__Queue Management for Large Language Model Serving]]
- [[2024__ICSOC__UELLM - A Unified and Efficient Approach for Large Language Model Inference Serving]]
- [[2024__arXiv__Intelligent Router for LLM Workloads - Improving Performance Through Workload-Aware Scheduling]]
- [[2018__OSDI__Ray A Distributed Framework for Emerging AI Applications]]
- [Overcoming Challenges in Serving Large Language Models](https://www.usenix.org/conference/srecon23emea/presentation/papapanagiotou)

#### Auto Tuning

- AutoHete: [[2025__arXiv__AutoHete - An Automatic and Efficient Heterogeneous Training System for LLMs]]
- Mist: [[2025__EuroSys__Mist - Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization]]

## Network

- ICCL: [[2025__arXiv__An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters]]
- [[2025__CLOUD__Routing Strategies for RoCE Networks in AI Clouds]]
- PCCL: [[2025__arXiv__The Big Send-off - High Performance Collectives on GPU-based Supercomputers]]
- [[2025__arXiv__An Extensible Software Transport Layer for GPU Networking]]
- [[JANOG55 UECセッション: AIインフラ解説資料 共有ページ]]
- [[GPUクラスタネットワークとその設計思想(Rethinking AI Infrastructure Part 2) ]]
- [[GPUネットワーク設計・運用 基礎勉強会 Lossless Ethernet – PFCECN編]]
- [[EthernetベースのGPUクラスタ導入による学びと展望]]
- [[PFNにおけるアクセラレータ間通信の実際 - Preferred Networks Research & Development]]
- [[AI(人工知能)の為のネットワーク - JANOG53 Meeting in Hakata]]
- [[HPCネットワーク基礎(RDMAInfinibandRoCE編)]]
- CyberAgent: [[AIML基盤の400G DCネットワークを構築した話 - JANOG52 Meeting in Nagasaki]]
- [[[3]中核はフロー制御をつかさどる「PFC」と「ETS」]]

## MLSys

- [[MLPerf Training]]

## Sakura Internet

- [[生成AI向けパブリッククラウドサービスをつくってみた話 さくらのナレッジ]]
- [[生成AI向け機械学習クラスタ構築のレシピ 北海道石狩編 さくらのナレッジ]]
- [[実際に運用してわかった! 多種GPU混載Kubernetesクラスタの使われ方と運用省力化]]

## GPU

- Model
    - [[A100]]
    - [[H100]]
    - [[H200]]
    - [[B200]]
- [[2025__SOSP__LithOS - An Operating System for Efficient Machine Learning on GPUs]]
- [[NVIDIA の最新ハードウェアの分析 B100B200GH200NVL72SuperPod]]

## Storage

- [[NECのAI研究用スーパーコンピュータとDDN Lustreストレージシステム]]
- [[Performance and Data Management at Scale for the AI Data Center - DDN]]

### Papers

- [[2025__ACM-IEEE__EMLIO - Minimizing I-O Latency and Energy Consumption for Large-Scale AI Training]]

## Reliability Engineering

- [[LLM学習中の計算機効率と障害]]

## Telemetry

- [[AI Infra Telemetry - MOC]]

## Books

- [[AI Systems Performance Engineering]]

## Related

- [[ML for Systems - MOC]]
- [[MLOps - MOC]]
- [[PFN AI Infra references]]