MOC note for [[分散深層学習]] (distributed deep learning).
Parent: [[Systems for ML - MOC]]
## Keywords
- [[LLM学習]]
- [[集合通信]]
- [[分散学習スループットの相場]]
## Algorithms / Orchestration
- [[LLM学習の効率性指標]]
- [[分散深層学習のパラメータ制約]]
### Entries
- [[大規模言語モデル開発を支える分散学習技術 - W&Bマンスリーミートアップ]]
- [[NVIDIA NeMoの分散学習高速化技術 詳細まとめ - Perplexity Pro]]
- [[Megatron-LMの概要と各種パラメータについて(1027日勉強会公開用)]]
- [[Megatron]]
- [[ZeRO & DeepSpeed New system optimizations enable training models with over 100 billion parameters]]
- [[Fixstars セミナー - パフォーマンスエンジニアリングで実現するAIワークロードの高速化実践セミナー]]
- [[NVIDIA B200対応「高火力」を徹底解説!]]
### Papers
- [[2025__ArXiv__Nonuniform-Tensor-Parallelism - Mitigating GPU failure impact for Scaled-up LLM Training]]
- [[2024__APNet__Understanding Communication Characteristics of Distributed Training]]
- FP8-LM: [[2023__FP8-LM Training FP8 Large Language Models]]
- FSDP: [[2023__VLDB__PyTorch FSDP - Experiences on Scaling Fully Sharded Data Parallel]]
- [[2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]]
- [[2021__SC21__Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM]]
- ZeRO-Infinity: [[Zero-infi]]
- [[2020__KDD__DeepSpeed - System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters]]
- ZeRO: [[2019__SC__ZeRO - Memory optimizations Toward Training Trillion Parameter Models]]
- Megatron-LM: [[2019__arXiv__Megatron-LM - Training Multi-Billion Parameter Language Models Using Model Parallelism]]
- Ray: [[2018__OSDI__Ray A Distributed Framework for Emerging AI Applications]]
### Case Studies
## Infrastructure
### Entries
- [Operationalizing ML Training Infra at Meta Scale](https://www.usenix.org/conference/srecon22apac/presentation/bharuka)
- [[How Meta trains large language models at scale]]
- [[GPUクラスタネットワークとその設計思想(Rethinking AI Infrastructure Part 2)]]
- [PFNにおけるアクセラレータ間通信の実際 - Preferred Networks Research & Development](https://tech.preferred.jp/ja/blog/rdma-in-pfn/)
- [[AI-ML基盤における800GbEスイッチ導入とその挑戦 - JANOG56 Meeting in Matsue]]
### Papers
- xDeepServe: [[2025__arXiv__xDeepServe - Model-as-a-Service on Huawei CloudMatrix384]]
- MegaScale-MoE: [[2025__arXiv__MegaScale-MoE Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production]]
- Astral: [[2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]]
- [[2025__ISCA__Insights into DeepSeek-V3 - Scaling Challenges and Reflections on Hardware for AI Architectures]]
- Vela: [[2025__ASPLOS__Vela - A Virtualized LLM Training System with GPU Direct RoCE]]
- [[2025__WANT@ICML__Memory and Bandwidth are All You Need for Fully Sharded Data Parallel]]
- [[2025__CISOSE__Sustainable AI Training via Hardware - Software Co - Design on NVIDIA, AMD, and Emerging GPU Architectures]]
- [[2025__DSN-W__Characterizing Modern GPU Resilience and Impact in HPC Systems A Case Study of A100 GPUs]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[2025__arXiv__SAKURAONE - Empowering Transparent and Open AI Platforms through Private-Sector HPC Investment in Japan]]
- [[2025__arXiv__Characterizing GPU Resilience and Impact on AI-HPC Systems]]
- [[2025__arXiv__Beyond A Single AI Cluster - A Survey of Decentralized LLM Training]]
- [[2025__SC__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]
- [[2025__ArXiv__CRIUgpu - Transparent Checkpointing of GPU-Accelerated Workloads]]
- [[2025__arXiv__Demystifying NCCL - An In-depth Analysis of GPU Communication Protocols and Algorithms]]
- [[2025__arXiv__Compute Can’t Handle the Truth - Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure]]
- [[2024__SC__Exploring GPU-to-GPU Communication - Insights into Supercomputer Interconnects]]
- [[2024__HotNet__I’ve Got 99 Problems But FLOPS Ain’t One]]
- Echo: [[2024__arXiv__Echo - Simulating Distributed Training At Scale]]
- [[2024__ATC__Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism]]
- [[2024__arXiv__Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training]]
- [[2024__SOSP__ReCycle - Resilient Training of Large DNNs using Pipeline Adaptation]]
- [[2024__arXiv__Computational Bottlenecks of Training Small-scale Large Language Models]]
- [[2024__arXiv__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey]]
- [[2023__arXiv__Unicron - Economizing Self-Healing LLM Training at Scale]]
- [[2024__arXiv__Generic and ML Workloads in an HPC Datacenter - Node Energy, Job Failures, and Node-Job Analysis]]
- [[2024__HOTI__Rail-only - A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters]]
- [[2024__SIGCOMM__Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs]]
- [[2024__ICSE__An Empirical Study on Low GPU Utilization of Deep Learning Jobs]]
- [[2024__NSDI__Characterization of Large Language Model Development in the Datacenter]]
- [[2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- [[2023__ICSE__An Empirical Study on Quality Issues of Deep Learning Platform]]
## Data Loader
- [[2025__arXiv__OVERLORD Ultimate Scaling of DataLoader for Multi Source Large Foundation Model Training]]