Map of Contents note for Systems for ML.

## Terminology

- [[GPU]]
- [[RoCE]]
- [[NCCL]]
- [[Rail-Optimized]]
- [[PXN]]

## Collections

- [[ml-systems-papers]]

## Model Training

- [[分散深層学習 - MOC]]

## Model Serving

### Terminology

- [[LLM推論]]
- [[vLLM]]
- [[LLMサービングの性能評価をどうやるか?]]

### Blog/Slide

- [[gpt-ossモデルのサービングにおけるリクエスト処理性能評価 ― NVIDIA H100・A100・L4の比較 - ペパボ研究所ブログ]]

### Paper

- ServeGen: [[2025__arXiv__ServeGen - Workload Characterization and Generation of Large Language Model Serving in Production]]
- [[2025__VLDB__Approximation-First Timeseries Monitoring Query At Scale]]
- Shapeshifter: [[2025__EuroMLSys__Manage the Workloads not the Cluster Designing a Control Plane for Large-Scale AI Clusters]]
- Taming the Titans: [[2025__arXiv__Taming the Titans - A Survey of Efficient LLM Inference Serving]]
- Tempo: [[2025__arXiv__Tempo - Application-aware LLM Serving with Mixed SLO Requirements]]
- [[2025__arXiv__ByteScale - Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs]]
- [[2025__arXiv__Every FLOP Counts - Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs]]
- [[2025__ISPASS__Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures]]
- [[2025__ArXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]]
- [[2025__ASAP__ReaLLM - A Trace-Driven Framework for Rapid Simulation of Large-Scale LLM Inference]]
- [[2025__arXiv__Hierarchical Prediction-based Management for LMaaS Systems]]
- [[2025__SIGMOD__Database as Runtime - Compiling LLMs to SQL for In-database Model Serving]]
- [[2025__ASPLOS__PIM Is All You Need - A CXL-Enabled GPU-Free System for Large Language Model Inference]]
- [[2025__CLOUD__Mind the Memory Gap Unveiling GPU Bottlenecks in Large Batch LLM Inference]]
- [[2025__ArXiv__PipeBoost - Resilient Pipelined Architecture for Fast Serverless LLM Scaling]]
- [[2025__KDD__BurstGPT - A Real-World Workload Dataset to Optimize LLM Serving Systems]]
- [[2025__ArXiv__Energy-Aware LLMs - A step towards sustainable AI for downstream applications]]
- [[2025_arXiv_FlashInfer - Efficient and Customizable Attention Engine for LLM Inference Serving]]
- DistServe: [[2024__OSDI__DistServe - Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving]]
- [[2024__arXiv__Towards a Middleware for Large Language Models]]
- [[2024__arXiv__One Queue Is All You Need - Resolving Head-of-Line Blocking in Large Language]]
- [[2024__ASPLOS__Queue Management for Large Language Model Serving]]
- [[2024__ICSOC__UELLM - A Unified and Efficient Approach for Large Language Model Inference Serving]]
- [[2024__arXiv__Intelligent Router for LLM Workloads - Improving Performance Through Workload-Aware Scheduling]]
- [[2018__OSDI__Ray A Distributed Framework for Emerging AI Applications]]
- [Overcoming Challenges in Serving Large Language Models](https://www.usenix.org/conference/srecon23emea/presentation/papapanagiotou)

#### Auto Tuning

- AutoHete: [[2025__arXiv__AutoHete - An Automatic and Efficient Heterogeneous Training System for LLMs]]
- Mist: [[2025__EuroSys__Mist - Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization]]

## Network

- ICCL: [[2025__arXiv__An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters]]
- [[2025__CLOUD__Routing Strategies for RoCE Networks in AI Clouds]]
- PCCL: [[2025__arXiv__The Big Send-off - High Performance Collectives on GPU-based Supercomputers]]
- [[2025__arXiv__An Extensible Software Transport Layer for GPU Networking]]
- [[JANOG55 UECセッション: AIインフラ解説資料 共有ページ]]
- [[GPUクラスタネットワークとその設計思想(Rethinking AI Infrastructure Part 2) ]]
- [[GPUネットワーク設計・運用 基礎勉強会 Lossless Ethernet – PFCECN編]]
- [[EthernetベースのGPUクラスタ導入による学びと展望]]
- [[PFNにおけるアクセラレータ間通信の実際 - Preferred Networks Research & Development]]
- [[AI(人工知能)の為のネットワーク - JANOG53 Meeting in Hakata]]
- [[HPCネットワーク基礎(RDMAInfinibandRoCE編)]]
- CyberAgent: [[AIML基盤の400G DCネットワークを構築した話 - JANOG52 Meeting in Nagasaki]]
- [[[3]中核はフロー制御をつかさどる「PFC」と「ETS」]]

## MLSys

- [[MLPerf Training]]

## Sakura Internet

- [[生成AI向けパブリッククラウドサービスをつくってみた話 さくらのナレッジ]]
- [[生成AI向け機械学習クラスタ構築のレシピ 北海道石狩編 さくらのナレッジ]]
- [[実際に運用してわかった! 多種GPU混載Kubernetesクラスタの使われ方と運用省力化]]

## GPU

- Model
    - [[A100]]
    - [[H100]]
    - [[H200]]
    - [[B200]]
- [[2025__SOSP__LithOS - An Operating System for Efficient Machine Learning on GPUs]]
- [[NVIDIA の最新ハードウェアの分析 B100B200GH200NVL72SuperPod]]

## Storage

- [[NECのAI研究用スーパーコンピュータとDDN Lustreストレージシステム]]
- [[Performance and Data Management at Scale for the AI Data Center - DDN]]

### Papers

- [[2025__ACM-IEEE__EMLIO - Minimizing I-O Latency and Energy Consumption for Large-Scale AI Training]]

## Reliability Engineering

- [[LLM学習中の計算機効率と障害]]

## Telemetry

- [[AI Infra Telemetry - MOC]]

## Books

- [[AI Systems Performance Engineering]]

## Related

- [[ML for Systems - MOC]]
- [[MLOps - MOC]]
- [[PFN AI Infra references]]