Megatron-LM - yuuk1's Digital Garden

# Megatron-LM

NVIDIA's state-of-the-art open-source LLM training framework. It combines 3D parallelism (data + tensor + pipeline) to make full use of hardware resources, and provides tensor parallelism and interleaved 1F1B pipeline scheduling. (Source: [[@2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]], §2, §6.1)

- Repository: github.com/NVIDIA/Megatron-LM
- [[MegaScale]] is built on top of it and uses it as the baseline in its production benchmarks. At 175B parameters on 12,288 GPUs, MegaScale reaches 1.34× the MFU of Megatron-LM (55.2% vs 41.2%).
- **Original paper**: arXiv:1909.08053 (2019) by [[Mohammad Shoeybi]] et al. at NVIDIA. It proposes intra-layer tensor parallelism for the Transformer MLP (column-then-row split) and for multi-head attention (split per head), implemented in a few lines of PyTorch with no custom compiler. With 8.3B parameters on 512 GPUs it achieved 76% scaling efficiency and 15.1 PetaFLOPS. It also reports a secondary finding about Pre-LayerNorm in BERT. (Source: [[@2019__arXiv__Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism]])
- In the parallelization-strategy literature it is cited as the representative example of 1-D partitioning in tensor parallelism ([[並列化戦略|parallelization strategies]]).
- The straggler analysis paper (OSDI 2025) uses a customized version of Megatron-LM as its training platform. (Source: [[@2025__OSDI__Understanding Stragglers in Large Model Training Using What-if Analysis]])
- Mycroft (SOSP 2025) lists Megatron-LM as a related entity in the context of reliability debugging for the collective communication layer. (Source: [[@2025__SOSP__Mycroft - Tracing Dependencies in Collective Communication Towards Reliable LLM Training]])
- [[NCCLX]] ([[@2025__arXiv__Collective Communication for 100k+ GPUs]]) describes the TP used for Llama training as Megatron-LM-like, and refers to it when explaining fine-grained overlap via [[CTran]]'s RMA Put. (Source: [[@2025__arXiv__Collective Communication for 100k+ GPUs]], §5.2)
- GPUPerf ([[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]]) uses [[GPT-NeoX]], which integrates [[DeepSpeed]] and Megatron-LM, as its implementation framework, and decomposes its computation into individual operators for performance modeling. (Source: [[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]])
- [[XPUTimer]] (Flare, [[@2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]]) applies non-intrusive instrumentation to four backends, including Megatron, and reports a case study in which a new metric detects unnecessary synchronization caused by a Megatron timer that was enabled by mistake. (Source: [[@2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]])
- [[@2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]] (Korthikanti+ MLSys2023) proposes and implements [[シーケンス並列化|sequence parallelism]] and [[選択的活性化再計算|selective activation recomputation]] on top of Megatron-LM's tensor parallelism. It demonstrates a 29% speedup on the 530B MT-NLG model (MFU 42.1% → 54.2%). [[Vijay Korthikanti]] is first author and [[Bryan Catanzaro]] is corresponding author. (Source: [[@2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]])

## Related

- Sources: [[@2019__arXiv__Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism]] / [[@2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]] / [[@2025__OSDI__Understanding Stragglers in Large Model Training Using What-if Analysis]] / [[@2025__SOSP__Mycroft - Tracing Dependencies in Collective Communication Towards Reliable LLM Training]] / [[@2025__arXiv__Collective Communication for 100k+ GPUs]] / [[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]] / [[@2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]] / [[@2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]]
- Entities: [[MegaScale]] / [[ByteDance]] / [[NCCLX]] / [[GPT-NeoX]] / [[DeepSpeed]] / [[XPUTimer]] / [[Vijay Korthikanti]] / [[Bryan Catanzaro]] / [[Mohammad Shoeybi]]
- Concepts: [[並列化戦略|parallelization strategies]] / [[LLM分散学習|distributed LLM training]] / [[シーケンス並列化|sequence parallelism]] / [[選択的活性化再計算|selective activation recomputation]]
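
The column-then-row MLP split noted in the original-paper entry above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not Megatron-LM's actual code: it uses plain Python lists instead of PyTorch tensors, it simulates the "ranks" with a loop, and it replaces the all-reduce with an element-wise sum. All function names (`mlp_reference`, `mlp_tensor_parallel`, and the helpers) are hypothetical. The point it shows is that splitting the first weight matrix by columns lets each rank apply GeLU locally, and splitting the second by rows leaves only one sum across ranks at the end.

```python
import math
import random


def matmul(a, b):
    """Plain list-of-lists matrix product (a: m x k, b: k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(x):
    """Exact GeLU, computed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def apply_gelu(m):
    return [[gelu(v) for v in row] for row in m]


def split_columns(w, parts):
    """Split w into `parts` equal column blocks (width must divide evenly)."""
    n = len(w[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split w into `parts` equal row blocks (height must divide evenly)."""
    n = len(w) // parts
    return [w[i * n:(i + 1) * n] for i in range(parts)]


def mlp_reference(x, a, b):
    """Unsplit MLP: GeLU(x @ a) @ b."""
    return matmul(apply_gelu(matmul(x, a)), b)


def mlp_tensor_parallel(x, a, b, world_size):
    """Simulated tensor-parallel MLP across `world_size` hypothetical ranks."""
    a_shards = split_columns(a, world_size)  # first GEMM: column split
    b_shards = split_rows(b, world_size)     # second GEMM: row split
    partials = []
    for a_i, b_i in zip(a_shards, b_shards):
        # Each rank sees the full input x but only its own weight shards.
        # GeLU is element-wise, so it can run on the local column block.
        h_i = apply_gelu(matmul(x, a_i))
        partials.append(matmul(h_i, b_i))
    # Stand-in for the all-reduce: sum the partial outputs element-wise.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # batch 2, hidden 4
a = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]  # 4 -> 8
b = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]  # 8 -> 4

ref = mlp_reference(x, a, b)
tp = mlp_tensor_parallel(x, a, b, world_size=2)
max_diff = max(abs(r - t) for rr, tr in zip(ref, tp) for r, t in zip(rr, tr))
print("max abs difference:", max_diff)
```

The two results should agree up to floating-point rounding. Splitting the first matrix by rows instead would require a sum before the nonlinearity, because GeLU of a sum is not the sum of GeLUs, which is why the column split comes first.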