AI Infra Telemetry - MOC - yuuk1's Digital Garden

A MOC note on telemetry systems specialized for [[分散深層学習|distributed deep learning]].

## General

- [[分散深層学習|Distributed Deep Learning]]
- [[LLM学習の計算機効率と障害|Compute Efficiency and Failures in LLM Training]]

## [[GPU]]

- [[DCGM]]
- [[DOCA]]
- [[GPUD-COMPARISONS]]
- [gpud]
- [[Snooping on your GPU - Using eBPF to Build Zero-instrumentation CUDA Monitoring]]
- [[GPU Cluster Monitoring GCM]]

## Network

- [[SONiCのテレメトリー|SONiC Telemetry]]

## [[eBPF]]

- [[eunomia - AI GPU eBPF Tracing Documents]]
- [[Lecture 98 - GPU Observability]]

### Case Studies

#### Meta

- [[GPU Cluster Monitoring GCM]]
- [[How Meta trains large language models at scale]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[System@Scale - AI Observability]]
- [[GPU Profiling with BPF at Meta - Riham Selim]]
- [[Network Observability for AI-HPC Training Workflows]]

#### Yunshan Networks

- [[Unlocking LLM Performance with EBPF - Optimizing Training and Inference Pipelines - KubeCon24 Chaina]]

#### Ant Group

- [[Transformers in SRE Land - Evolving to Manage AI Infrastructure at SREcon25 Americas]]

## Tool

- [[Strobelight]]
- [[Kineto]]
- [[Dynolog]]
- [[DeepFlow]]

## Papers

Papers that discuss failure patterns and monitoring in compute environments for distributed LLM training.

### Reliability

- [[2025__SC__Characterizing GPU Resilience and Impact on AI - HPC Systems]]
- [[2025__OSDI__Understanding Stragglers in Large Model Training Using What-if Analysis]]
- ByteRobust: [[2025__SOSP__Robust LLM Training Infrastructure at ByteDance]]
- Astral: [[2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]]
- [[2025__arXiv__Measuring GPU utilization one level deeper]]
- [[2024__arXiv__Light-Weight Fault Tolerant Attention for Large Language Model Training]]
- [[2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- [[2023__ISSREW__A Survey of Metrics to Enhance Training Dependability in Large Language Models]]
- [[2022__NSDI__MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters]]

### Tracing

- eInfer: [[2025__eBPF__eInfer - Unlocking Fine - Grained Tracing for Distributed LLM Inference with eBPF]]
- [[2025__WoSC__GPU Tail Latency Diagnosis for Serverless and HPC Workloads using eBPF]]
- Mycroft: [[2025__SOSP__Mycroft - Tracing Dependencies in Collective Communication Towards Reliable LLM Training]]
- [[2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]]
- [[2025__arXiv__bpftime-super - A GPU observability tool]]
- [[2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]]
- [[2024__INFOCOM__INSERT - In-Network Stateful End-to-End RDMA Telemetry]]
- [[2024__APNET__Hostmesh - Monitor and Diagnose Networks in Rail-optimized RoCE Clusters]]
- [[2024__SIGCOMM__R-Pingmesh - A Service-Aware RoCE Network Monitoring and Diagnostic System]]
- [[2024__arXiv__FlowTracer - A Tool for Uncovering Network Path Usage Imbalance in AI Training Clusters]]
- [[2023__EuroMLSys__Profiling and Monitoring Deep Learning Training Tasks]]

### Failure Management

- [[2025__SC__Fine-grained Automated Failure Management for Extreme-Scale GPU Accelerated Systems]]
- FlashRecovery: [[2025__arXiv__FlashRecovery - Fast and Low Cost Recovery from Failures for Large Scale Training of LLMs]]
- Hawkeye: [[2025__SIGCOMM__Hawkeye - Diagnosing RDMA Network Performance Anomalies with PFC Provenance]]
- [[2025__HPCA__Enhancing Large-Scale AI Training Efficiency - The C4 Solution for Real-Time Anomaly Detection and Communication Optimization]]
- OptProphet: [[2025__APNET__Forewarned is Forearmed - Joint Prediction and Classification of Optical Transceiver Failures in Large-Scale LLM Training Clusters]]

### [[AIOps]]

- Pulse: [[2026__ASPLOS__Fine-grained and Non - intrusive LLM Training Monitoring via Microsecond - level Traffic Measurement]]
- [[2025__ISAV__From Exploration to Explanation - ML-Driven Causal Discovery for Datacenter Reliability at Scale]]
- [[2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]]
- SkeletonHunter: [[2025__SIGCOMM__SkeletonHunter Diagnosing and Localizing Network Failures in Containerized Large Model Training]]
- [[2025__IWQoS__eACGM - Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems]]
- [[2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]]
- [[2025__NSDI__Evolution of Aegis - Fault Diagnosis for AI Model Training Service in Production]]
- [[2025__DSN__LLMPrism - Black-box Performance Diagnosis for Production LLM Training Platforms]]
- [[2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- [[2025__NSDI__Minder - Faulty Machine Detection for Large-scale Distributed Model Training]]

## Simulation / Modeling

- [[2025__NSDI__Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation]]
- ATLAHS: [[2025__arXiv__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]
- Lumos: [[2025__MLSys__Lumos - Efficient Performance Modeling and Estimation for Large-scale LLM Training]]
- [[2025__ASAP__ReaLLM - A Trace-Driven Framework for Rapid Simulation of Large-Scale LLM Inference]]
- [[2025__SC__ATLAHS - An Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage]]

## Ideas

- [[MetricSifterをHPCクラスタのヒートマップに応用する|Applying MetricSifter to HPC Cluster Heatmaps]]

## Related MOCs

- [[Systems for ML - MOC]]
- [[Telemetry - MOC]]
- [[分散深層学習 - MOC|Distributed Training - MOC]]