A MOC note on telemetry systems specialized for [[分散深層学習|distributed deep learning]].
## General
- [[分散深層学習|Distributed Deep Learning]]
## [[GPU]]
- [[DCGM]]
- [[DOCA]]
- [[GPUD-COMPARISONS]]
- [[gpud]]
- [[Snooping on your GPU Using eBPF to Build Zero-instrumentation CUDA Monitoring]]
## Network
- [[SONiCのテレメトリー|SONiC Telemetry]]
## Telemetry
- [[LLM学習中の計算機効率と障害|Compute Efficiency and Failures During LLM Training]]
### Case Studies
#### Meta
- [[How Meta trains large language models at scale]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[System@Scale - AI Observability]]
- [[GPU Profiling with BPF at Meta - Riham Selim]]
- [[Network Observability for AI-HPC Training Workflows]]
#### Yunshan Networks
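- [[Unlocking LLM Performance with EBPF - Optimizing Training and Inference Pipelines - KubeCon24 Chaina|Unlocking LLM Performance with eBPF - Optimizing Training and Inference Pipelines - KubeCon24 China]]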
#### Ant Group
- [[Transformers in SRE Land - Evolving to Manage AI Infrastructure at SREcon25 Americas]]
## Tools
- [[Strobelight]]
- [[Kineto]]
- [[Dynolog]]
- [[DeepFlow]]
## Papers
Papers that discuss failure patterns or monitoring in compute environments for distributed LLM training.
- [[2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]]
- FlashRecovery: [[2025__arXiv__FlashRecovery - Fast and Low Cost Recovery from Failures for Large Scale Training of LLMs]]
- ByteRobust: [[2025__SOSP__Robust LLM Training Infrastructure at ByteDance]]
- Astral: [[2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]]
- SkeletonHunter: [[2025__SIGCOMM__SkeletonHunter Diagnosing and Localizing Network Failures in Containerized Large Model Training]]
- Hawkeye: [[2025__SIGCOMM__Hawkeye - Diagnosing RDMA Network Performance Anomalies with PFC Provenance]]
- [[2025__HPCA__Enhancing Large-Scale AI Training Efficiency - The C4 Solution for Real-Time Anomaly Detection and Communication Optimization]]
- OptProphet: [[2025__APNET__Forewarned is Forearmed - Joint Prediction and Classification of Optical Transceiver Failures in Large-Scale LLM Training Clusters]]
- [[2025__IWQoS__eACGM - Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems]]
- [[2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]]
- [[2025__NSDI__Evolution of Aegis - Fault Diagnosis for AI Model Training Service in Production]]
- [[2025__DSN__LLMPrism - Black-box Performance Diagnosis for Production LLM Training Platforms]]
- [[2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- [[2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]]
- [[2025__arXiv__bpftime-super - A GPU observability tool]]
- [[2025__arXiv__Measuring GPU utilization one level deeper]]
- [[2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]]
- [[2024__arXiv__Light-Weight Fault Tolerant Attention for Large Language Model Training]]
- [[2024__arXiv__FlowTracer - A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters]]
- [[2024__INFOCOM__INSERT - In-Network Stateful End-to-End RDMA Telemetry]]
- [[2025__NSDI__Minder - Faulty Machine Detection for Large-scale Distributed Model Training]]
- [[2024__APNET__Hostmesh - Monitor and Diagnose Networks in Rail-optimized RoCE Clusters]]
- [[2024__SIGCOMM__R-Pingmesh - A Service-Aware RoCE Network Monitoring and Diagnostic System]]
- [[2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- [[2023__ISSREW__A Survey of Metrics to Enhance Training Dependability in Large Language Models]]
- [[2022__NSDI__MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters]]
## Simulation / Modeling
- [[2025__NSDI__Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation]]
- ATLAHS: [[2025__arXiv__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]
- Lumos: [[2025__MLSys__Lumos - Efficient Performance Modeling and Estimation for Large-scale LLM Training]]
- [[2025__ASAP__ReaLLM - A Trace-Driven Framework for Rapid Simulation of Large-Scale LLM Inference]]
- [[2025__SC__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]
## Ideas
- [[MetricSifterをHPCクラスタのヒートマップに応用する|Applying MetricSifter to HPC Cluster Heatmaps]]
## Related MOCs
- [[Systems for ML - MOC]]
- [[Telemetry - MOC]]
- [[分散深層学習 - MOC|Distributed Training - MOC]]