- [[Observability Conference Tokyo 2025 プロポーザル]]

## References

- [[GPUクラスタ モニタリング・オブザーバビリティ 事例 - OpenAI Deep Research]]
- [[System@Scale - AI Observability]]

## Observability Gaps

- Microburst monitoring
    - [[AI-ML基盤における800GbEスイッチ導入とその挑戦 - JANOG56 Meeting in Matsue]]
- Network or not?
    - [[2024__SIGCOMM__R-Pingmesh - A Service-Aware RoCE Network Monitoring and Diagnostic System]]
- Bottleneck analysis
    - [[Unlocking LLM Performance with EBPF - Optimizing Training and Inference Pipelines - KubeCon24 Chaina]]
    - [[2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]]
- Root-cause investigation of failures and degradation
    - [[Transformers in SRE Land - Evolving to Manage AI Infrastructure at SREcon25 Americas]]
    - [[2023__EuroMLSys__Profiling and Monitoring Deep Learning Training Tasks]]
    - [[2025__OSDI__Understanding Stragglers in Large Model Training Using What-if Analysis]]

ChatGPT

- https://chatgpt.com/c/68ea708f-fb9c-8322-9149-a3e75d9716b5
- https://chatgpt.com/g/g-p-68091498420081918f640bd067b5c174/c/68f5a147-cf3c-8324-967a-858bbe19f3bb
- https://chatgpt.com/g/g-p-68091498420081918f640bd067b5c174/c/68f5add3-3808-8322-bbab-6126be2c7251

## HPC User Interviews