MOC note on telemetry systems specialized for [[分散深層学習|distributed deep learning]].

## General

- [[分散深層学習|Distributed deep learning]]

## [[GPU]]

- [[DCGM]]
- [[DOCA]]
- [[GPUD-COMPARISONS]]
- [gpud]
- [[Snooping on your GPU Using eBPF to Build Zero-instrumentation CUDA Monitoring]]

## Network

- [[SONiCのテレメトリー|SONiC telemetry]]

## Telemetry

- [[LLM学習中の計算機効率と障害|Compute efficiency and failures during LLM training]]

### Case Studies

#### Meta

- [[How Meta trains large language models at scale]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[System@Scale - AI Observability]]
- [[GPU Profiling with BPF at Meta - Riham Selim]]
- [[Network Observability for AI-HPC Training Workflows]]

#### Yunshan Networks

- [[Unlocking LLM Performance with EBPF - Optimizing Training and Inference Pipelines - KubeCon24 Chaina]]

#### Ant Group

- [[Transformers in SRE Land - Evolving to Manage AI Infrastructure at SREcon25 Americas]]

## Tool

- [[Strobelight]]
- [[Kineto]]
- [[Dynolog]]
- [[DeepFlow]]

## Papers

Papers that discuss failure patterns and monitoring in compute environments for distributed LLM training.

- [[2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]]
- FlashRecovery: [[2025__arXiv__FlashRecovery - Fast and Low Cost Recovery from Failures for Large Scale Training of LLMs]]
- ByteRobust: [[2025__SOSP__Robust LLM Training Infrastructure at ByteDance]]
- Astral: [[2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]]
- SkeletonHunter: [[2025__SIGCOMM__SkeletonHunter Diagnosing and Localizing Network Failures in Containerized Large Model Training]]
- Hawkeye: [[2025__SIGCOMM__Hawkeye - Diagnosing RDMA Network Performance Anomalies with PFC Provenance]]
- [[2025__HPCA__Enhancing Large-Scale AI Training Efficiency - The C4 Solution for Real-Time Anomaly Detection and Communication Optimization]]
- OptProphet: [[2025__APNET__Forewarned is Forearmed - Joint Prediction and Classification of Optical Transceiver Failures in Large-Scale LLM Training Clusters]]
- [[2025__IWQoS__eACGM - Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems]]
- [[2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]]
- [[2025__NSDI__Evolution of Aegis - Fault Diagnosis for AI Model Training Service in Production]]
- [[2025__DSN__LLMPrism - Black-box Performance Diagnosis for Production LLM Training Platforms]]
- [[2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- [[2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]]
- [[2025__arXiv__bpftime-super - A GPU observability tool]]
- [[2025__arXiv__Measuring GPU utilization one level deeper]]
- [[2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]]
- [[2024__arXiv__Light-Weight Fault Tolerant Attention for Large Language Model Training]]
- [[2024__arXiv__FlowTracer - A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters]]
- [[2024__INFOCOM__INSERT - In-Network Stateful End-to-End RDMA Telemetry]]
- [[2025__NSDI__Minder - Faulty Machine Detection for Large-scale Distributed Model Training]]
- [[2024__APNET__Hostmesh - Monitor and Diagnose Networks in Rail-optimized RoCE Clusters]]
- [[2024__SIGCOMM__R-Pingmesh - A Service-Aware RoCE Network Monitoring and Diagnostic System]]
- [[2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- [[2023__ISSREW__A Survey of Metrics to Enhance Training Dependability in Large Language Models]]
- [[2022__NSDI__MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters]]

## Simulation / Modeling

- [[2025__NSDI__Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation]]
- ATLAHS: [[2025__arXiv__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]
- Lumos: [[2025__MLSys__Lumos - Efficient Performance Modeling and Estimation for Large-scale LLM Training]]
- [[2025__ASAP__ReaLLM - A Trace-Driven Framework for Rapid Simulation of Large-Scale LLM Inference]]
- [[2025__SC__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]

## Ideas

- [[MetricSifterをHPCクラスタのヒートマップに応用する|Applying MetricSifter to HPC cluster heatmaps]]

## Related MOCs

- [[Systems for ML - MOC]]
- [[Telemetry - MOC]]
- [[分散深層学習 - MOC|Distributed Training - MOC]]