MOC note for [[分散深層学習]] (distributed deep learning).
Parent: [[Systems for ML - MOC]]
## Keywords
- [[LLM学習]]
- [[集合通信]]
- [[分散学習スループットの相場]]
## Algorithms / Orchestration
- [[LLM学習の効率性指標]]
- [[分散深層学習のパラメータ制約]]
### Entries
- [[大規模言語モデル開発を支える分散学習技術 - W&Bマンスリーミートアップ]]
- [[NVIDIA NeMoの分散学習高速化技術 詳細まとめ - Perplexity Pro]]
- [[Megatron-LMの概要と各種パラメータについて(1027日勉強会公開用)]]
- [[Megatron]]
- [[ZeRO & DeepSpeed New system optimizations enable training models with over 100 billion parameters]]
- [[Fixstars セミナー - パフォーマンスエンジニアリングで実現するAIワークロードの高速化実践セミナー]]
- [[NVIDIA B200対応「高火力」を徹底解説!]]
### Papers
- [[2025__ArXiv__Nonuniform-Tensor-Parallelism - Mitigating GPU failure impact for Scaled-up LLM Training]]
- [[2024__APNet__Understanding Communication Characteristics of Distributed Training]]
- FP8-LM: [[2023__FP8-LM Training FP8 Large Language Models]]
- FSDP: [[2023__VLDB__PyTorch FSDP - Experiences on Scaling Fully Sharded Data Parallel]]
- [[2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]]
- [[2021__SC21__Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM]]
- ZeRO-Infinity: [[Zero-infi]]
- [[2020__KDD__DeepSpeed - System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters]]
- ZeRO: [[2019__SC__ZeRO - Memory optimizations Toward Training Trillion Parameter Models]]
- Megatron-LM: [[2019__arXiv__Megatron-LM - Training Multi-Billion Parameter Language Models Using Model Parallelism]]
- Ray: [[2018__OSDI__Ray A Distributed Framework for Emerging AI Applications]]
### Case Studies
## Infrastructure
### Entries
- [Operationalizing ML Training Infra at Meta Scale](https://www.usenix.org/conference/srecon22apac/presentation/bharuka)
- [[How Meta trains large language models at scale]]
- [[GPUクラスタネットワークとその設計思想(Rethinking AI Infrastructure Part 2)]]
- [PFNにおけるアクセラレータ間通信の実際 - Preferred Networks Research & Development](https://tech.preferred.jp/ja/blog/rdma-in-pfn/)
- [[AI-ML基盤における800GbEスイッチ導入とその挑戦 - JANOG56 Meeting in Matsue]]
### Papers
- xDeepServe: [[2025__arXiv__xDeepServe - Model-as-a-Service on Huawei CloudMatrix384]]
- MegaScale-MoE: [[2025__arXiv__MegaScale-MoE Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production]]
- Astral: [[2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]]
- [[2025__ISCA__Insights into DeepSeek-V3 - Scaling Challenges and Reflections on Hardware for AI Architectures]]
- Vela: [[2025__ASPLOS__Vela - A Virtualized LLM Training System with GPU Direct RoCE]]
- [[2025__WANT@ICML__Memory and Bandwidth are All You Need for Fully Sharded Data Parallel]]
- [[2025__CISOSE__Sustainable AI Training via Hardware - Software Co - Design on NVIDIA, AMD, and Emerging GPU Architectures]]
- [[2025__DSN-W__Characterizing Modern GPU Resilience and Impact in HPC Systems A Case Study of A100 GPUs]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[2025__arXiv__SAKURAONE - Empowering Transparent and Open AI Platforms through Private-Sector HPC Investment in Japan]]
- [[2025__arXiv__Characterizing GPU Resilience and Impact on AI-HPC Systems]]
- [[2025__arXiv__Beyond A Single AI Cluster - A Survey of Decentralized LLM Training]]
- [[2025__SC__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]
- [[2025__ArXiv__CRIUgpu - Transparent Checkpointing of GPU-Accelerated Workloads]]
- [[2025__arXiv__Demystifying NCCL - An In-depth Analysis of GPU Communication Protocols and Algorithms]]
- [[2025__arXiv__Compute Can’t Handle the Truth - Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure]]
- [[2024__SC__Exploring GPU-to-GPU Communication - Insights into Supercomputer Interconnects]]
- [[2024__HotNet__I’ve Got 99 Problems But FLOPS Ain’t One]]
- Echo: [[2024__arXiv__Echo - Simulating Distributed Training At Scale]]
- [[2024__ATC__Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism]]
- [[2024__arXiv__Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training]]
- [[2024__SOSP__ReCycle - Resilient Training of Large DNNs using Pipeline Adaptation]]
- [[2024__arXiv__Computational Bottlenecks of Training Small-scale Large Language Models]]
- [[2024__arXiv__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey]]
- [[2023__arXiv__Unicron - Economizing Self-Healing LLM Training at Scale]]
- [[2024__arXiv__Generic and ML Workloads in an HPC Datacenter - Node Energy, Job Failures, and Node-Job Analysis]]
- [[2024__HOTI__Rail-only - A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters]]
- [[2024__SIGCOMM__Accelerating Model Training in Multi-cluster Environments with Consumer-grade GPUs]]
- [[2024__ICSE__An Empirical Study on Low GPU Utilization of Deep Learning Jobs]]
- [[2024__NSDI__Characterization of Large Language Model Development in the Datacenter]]
- [[2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- [[2023__ICSE__An Empirical Study on Quality Issues of Deep Learning Platform]]
## Data Loader
- [[2025__arXiv__OVERLORD Ultimate Scaling of DataLoader for Multi Source Large Foundation Model Training]]