fold-k4-from-2026-06-02-to-2026-06-03-n16

Level-4 fold of 16 log entries spanning 2026-06-02 to 2026-06-03. Dominant themes: laying the foundations for agentic SRE benchmarks and industrial AI SRE, systematizing LLM distributed-training infrastructure, and establishing the three telemetry layers together with introducing time-series foundation models.

## Child Entries

| Date | Op | Title | Page | Summary (extractive) |
|---|---|---|---|---|
| 2026-06-03 | ingest-paper | Minder: Faulty Machine Detection | [[2025__NSDI__Minder]] | Unsupervised identification of faulty machines in training clusters via machine-level similarity + LSTM-VAE + decision tree (precision 0.904, F1 0.893, 3.6 seconds). |
| 2026-06-03 | ingest-paper | SAKURAONE | [[2026__MLSys2026__SAKURAONE]] | Fully open 800 GbE with SONiC + RoCEv2 shows time-to-train of 1.02–1.26× relative to InfiniBand. MFU 38–41%; 42.9% of faults originate from GPUs. Co-authored by the vault owner. |
| 2026-06-03 | ingest-paper | MegaScale | [[2024__NSDI__MegaScale]] | Measured 55.2% MFU on 12,288 GPUs (1.34× Megatron-LM). Recorded as a cross-cutting insight: its 100+ automatic recoveries and straggler diagnosis are distributed-view problems isomorphic to AIOps. |
| 2026-06-03 | ingest-paper | Efficient Training Survey | [[2026__Vicinagearth__Efficient Training Survey]] | Systematizes LLM training infrastructure along three SER axes (Scalability/Efficiency/Reliability) and four layers (infrastructure/parallelism/optimization/fault tolerance). Creates the wiki's first ML systems cluster. |
| 2026-06-03 | ingest-paper | Scaling Telemetry Workloads | [[2025__Kyoto University__Scaling Telemetry Workloads]] | Establishes telemetry in the wiki as three layers: instrumentation (in-kernel flow bundling), storage (HeteroTSDB), and mining. The vault owner's doctoral dissertation. |
| 2026-06-03 | ingest-paper | MetricSifter | [[2024__IEEE Access__MetricSifter]] | Formalizes the problem that irrelevant metrics hinder localization. Isomorphic to the telemetry over-consumption pathology of LLM agents. The wiki's first paper by the vault owner. |
| 2026-06-03 | ingest-paper | Falcon-X | [[2026__arXiv__Falcon-X]] | Realizes cross-variate modeling of heterogeneous multivariate series in a latent prototype space. Best overall on GIFT-Eval, but not evaluated on downstream SRE tasks. Second TSFM source in the wiki. |
| 2026-06-03 | ingest-paper | This Time is Different (Toto) | [[2025__NeurIPS2025__This Time is Different]] | Quantifies that observability data differs statistically from general time series and reaches zero-shot SOTA with a dedicated decoder-only architecture. Pretrained on 2.36 trillion points. The wiki's first time-series foundation model. |
| 2026-06-03 | ingest | Building Bits AI SRE | [[2026__Datadog__Building Bits AI SRE]] | Avoids telemetry over-consumption through hypothesis-driven investigation. Creates RCA (stage 3) in the wiki as [[根本原因分析]] (root cause analysis). TTR reduced by up to 95%. |
| 2026-06-03 | ingest | AI in SRE: Google | [[2026__GoogleSRE__AI in SRE]] | Governs AI-Ops with the SRE AI Autonomy Levels (L0–L4). Separating reasoning from actuation is an industrial implementation of [[Transactional No-Regression]]. The first primary source from industry. |
| 2026-06-03 | ingest-paper | MicroRemed | [[2025__arXiv__MicroRemed]] | First dedicated benchmark to carve out Mitigation, the top level of the AIOps 4-level taxonomy, as "diagnostic report → Ansible playbook generation". Ablation shows reflection > probe. |
| 2026-06-03 | ingest-paper | PAGER | [[@2026__AAAI__PAGER]] | Adds a proactive [[障害予測]] (failure prediction) axis to an otherwise purely reactive wiki. Prediction uses a classical random forest; the LLM is confined to the explanation and conversational interface layer. |
| 2026-06-03 | ingest | STRATUS | [[2025__NeurIPS2025__STRATUS]] | The primary paper formalizes the safety specification [[Transactional No-Regression]]. Claims to exceed SOTA by 1.5× on both the AIOpsLab and ITBench benchmarks. |
| 2026-06-03 | ingest-paper | AIOpsLab | [[2025__MLSys2025__AIOpsLab]] | Decomposes incidents into four sub-problems (detection/localization/RCA/mitigation) and scores each separately. Observes the failure of "fixating on the first hypothesis and pulling too much telemetry". |
| 2026-06-03 | ingest-paper | SREGym | [[2026__arXiv__SREGym]] | E2E success collapses from 60% to 18–28% under noise, low-level faults, and metastable/concurrent/correlated failures. Reports that the greedy approach fixates on the first anomaly. |
| 2026-06-02 | init | LLM wiki layer initialization | - | Initialized the wiki layer with mode=generic, transport=filesystem. Scope is new sources only. |

## Key Outcomes

- Initialized the wiki layer, fixed its scope as "new sources only, existing notes left untouched", and built the `.raw/` + `wiki/{sources,entities,concepts}/` structure (from 2026-06-02 init entry)
- Added three SRE benchmarks (SREGym, AIOpsLab, ITBench) to the wiki from their primary papers and confirmed the limit that frontier agents' E2E success rate collapses from 60% to 18–28% under noise and low-level faults. The failure pattern of "fixating on the first hypothesis and pulling too much telemetry" was observed independently (from 2026-06-03 SREGym, AIOpsLab entries)
- Structured industrial AI SRE in the wiki along two lines: Google (SRE AI Autonomy Levels L0–L4, reasoning/actuation separation) and Datadog (hypothesis-driven investigation, TTR reduced by up to 95%). The pathology that academic benchmarks observed as "agents over-consuming telemetry" is one that Datadog explicitly avoids as a starting point of its product design (from 2026-06-03 Google, Bits AI SRE entries)
- STRATUS's primary paper formalizes the safety specification [[Transactional No-Regression]] and claims 1.5× SOTA on both the AIOpsLab and ITBench benchmarks. Expanded detection/localization/RCA/mitigation from the AIOps 4-level taxonomy into four concept pages: [[根本原因分析]] (root cause analysis), [[障害緩和]] (failure mitigation), [[障害予測]] (failure prediction), and [[Fault Localization]] (from 2026-06-03 STRATUS, MicroRemed, PAGER, Bits AI SRE entries)
- From MetricSifter and the doctoral dissertation, established the three telemetry layers (instrumentation, storage, mining) and made explicit that the design guideline "reduce data at the two ends where context is rich" runs from metric feature reduction through to LLM agents' telemetry over-consumption (from 2026-06-03 MetricSifter, Scaling Telemetry entries)
- Created the LLM distributed-training area around the three SER axes + 4-layer taxonomy. MegaScale measured 55.2% MFU on 12,288 GPUs (1.34× Megatron-LM), and SAKURAONE demonstrated time-to-train of 1.02–1.26× relative to InfiniBand with SONiC + RoCEv2 (from 2026-06-03 Efficient Training Survey, MegaScale, SAKURAONE entries)
- Introduced time-series foundation models to the wiki from two sources: Toto (specialized for observability data, pretrained on 2.36 trillion points, zero-shot SOTA) and Falcon-X (heterogeneous multivariate cross-variate modeling, best on GIFT-Eval) (from 2026-06-03 This Time is Different, Falcon-X entries)

## Cross-entry Themes

- **The "pulling too much telemetry" pathology persists across method generations and organizations**: MetricSifter (2026-06-03) formalized, pre-LLM, how irrelevant metrics hinder localization; AIOpsLab (2026-06-03) and SREGym (2026-06-03) independently observed the same over-consumption in LLM agents; Bits AI SRE (2026-06-03) avoids it by product design through hypothesis-driven investigation. The doctoral dissertation's (2026-06-03) "reduce data at the two ends where context is rich" generalizes as the upstream design guideline (supported by: 2026-06-03 MetricSifter, AIOpsLab, SREGym, Bits AI SRE, Scaling Telemetry entries)
- **Training-cluster diagnosis and production AIOps are isomorphic in their distributed view, but their signal sources are opposite**: MegaScale's straggler diagnosis and Minder's metric-pattern-based fault detection are distributed-view problems isomorphic to distributed tracing and fault localization. In training the signal is deviation from homogeneity, whereas in microservices it is dependency propagation across heterogeneous components (supported by: 2026-06-03 MegaScale, Minder, SAKURAONE entries)
- **undo-and-retry / reflection independently converge as the key to mitigation**: STRATUS's TNR (2026-06-03) formalizes rollback as a safety specification and MicroRemed (2026-06-03) shows reflection > probe by ablation; both independently agree with SREGym's (2026-06-03) finding that "undo-and-retry produces the strongest mitigation" (supported by: 2026-06-03 STRATUS, MicroRemed, SREGym entries)
- **Dominance of hardware-induced faults recurs across multiple clusters**: SAKURAONE's 42.9% of faults originating from GPUs (2026-06-03) and Minder's hardware-dominant 55.8% / ECC 38.9% (2026-06-03) independently report hardware dominance (supported by: 2026-06-03 SAKURAONE, Minder entries)

## Contradictions or Corrections

- AIOpsLab entry: the claim taken from SREGym that "AIOpsLab requires a ReAct loop; non-ReAct agents must be ported" conflicts with the primary paper (only get_action is required; any framework may be used). A callout has been placed on [[AIOpsLab]].
- Scaling Telemetry entry: the existing statement on [[Yuuki Tsubouchi]] that the doctorate was obtained in 2023 conflicts with "March, 2025" on the cover of this dissertation. Corrected to match the primary source.

## Child Pages

- [[2026__arXiv__SREGym - A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios]]
- [[2025__MLSys2025__AIOpsLab - A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds]]
- [[2025__NeurIPS2025__STRATUS - A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds]]
- [[@2026__AAAI__PAGER - Proactive Monitoring Agent for Enterprise AI Assistant]]
- [[2025__arXiv__MicroRemed - Benchmarking LLMs in Microservices Remediation]]
- [[2026__GoogleSRE__AI in SRE - Engineering the Future of Reliable Operations]]
- [[2026__Datadog__Building Bits AI SRE - Autonomous Incident Investigation Agent]]
- [[2025__NeurIPS2025__This Time is Different - An Observability Perspective on Time Series Foundation Models]]
- [[2026__arXiv__Falcon-X - A Time Series Foundation Model for Heterogeneous Multivariate Modeling]]
- [[2024__IEEE Access__MetricSifter - Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications]]
- [[2025__Kyoto University__Scaling Telemetry Workloads in Cloud Applications - Techniques for Instrumentation, Storage, and Mining]]
- [[2026__Vicinagearth__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey]]
- [[2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- [[2026__MLSys2026__SAKURAONE - An Open Ethernet-Based AI HPC System]]
- [[2025__NSDI__Minder - Faulty Machine Detection for Large-scale Distributed Model Training]]
- [[SREGym]]
- [[AIOpsLab]]
- [[ITBench]]
- [[Stratus]]
- [[PAGER]]
- [[MicroRemed]]
- [[Bits AI SRE]]
- [[Toto]]
- [[Falcon-X]]
- [[MetricSifter]]
- [[MegaScale]]
- [[SAKURAONE]]
- [[Minder]]
- [[HeteroTSDB]]
- [[Meltria]]
- [[AIOps]]
- [[agentic SRE]]
- [[SRE Benchmark]]
- [[根本原因分析]]
- [[Fault Localization]]
- [[障害予測]]
- [[障害緩和]]
- [[Transactional No-Regression]]
- [[SRE AI Autonomy Levels]]
- [[Metastable Failure]]
- [[時系列基盤モデル]]
- [[多変量時系列予測]]
- [[LLM分散学習]]
- [[並列化戦略]]
- [[テレメトリ]]
- [[時系列データベース]]
- [[分散トレーシング]]
- [[特徴量削減]]
- [[変化点検知]]
- [[GPUクラスタ運用]]
- [[Model Context Protocol]]
- [[Yuuki Tsubouchi]]
- [[Hirofumi Tsuruta]]
- [[SAKURA Internet]]
- [[ByteDance]]
- [[Datadog]]
- [[Google]]
- [[Microsoft]]
- [[Adobe]]
- [[Carnegie Mellon University]]
- [[Kyoto University]]
- [[Tsinghua University]]
- [[Harvard University]]
- [[Peking University]]
- [[University of Illinois Urbana-Champaign]]

## Related

- [[DragonScale Memory]] - fold-operator spec
- [[log]] - source entries
- [[index]] - vault catalog