fold-k4-from-2026-06-04-to-2026-06-05-n16

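Level-4 fold of 16 log entries spanning 2026-06-04 to 2026-06-05. Dominant themes: large-scale systematization of failure and performance diagnosis for LLM training (parallel ingests of 8+6+7+5 papers); a full pass over agentic time series forecasting in both theory and implementation; and a map of industrial incident-automation patterns and the log-analysis pipeline.

## Child Entries

| Date | Op | Title | Page | Summary (extractive) |
|---|---|---|---|---|
| 2026-06-05 | ingest-paper | Incident automation, 4 papers | [[FLASH]] / [[StepFly]] / [[LLexus]] and others | Three TSG-automation patterns (LLM online / planning up front / both plus parallelism) independently converge on "TSG quality is the rate-limiting factor." The assurance contract generalizes TNR and Google L0–L4 at a higher level. |
| 2026-06-05 | ingest-paper | LLM × DATA Survey | [[@2025__arXiv__A Survey of LLM × DATA]] | DB anomaly diagnosis and cloud RCA share the same pattern: direct prompting / RAG / multi-agent. |
| 2026-06-05 | ingest-paper | LLM4Log | [[@2026__arXiv__LLM4Log]] | Map of the whole pipeline: log generation → parsing → representation → downstream diagnosis. Only 5 of 162 records carry deployment evidence. A hierarchical design that "narrows the information before calling the LLM" is common to every stage. |
| 2026-06-04 | ingest-paper (followup) | PACE (official version) | [[@2025__ISAV__PACE]] | Obtaining the PDF allowed the 6-stage pipeline (Granger causality, graph synthesis) to be detailed. No quantitative accuracy metrics; evaluation is qualitative, based on physical consistency. |
| 2026-06-04 | ingest-paper | Production LLM training, 8 papers | [[Aegis]] / [[SkeletonHunter]] / [[L4]] and others | The "all machines are symmetric" regularity is exploited for different purposes by performance prediction, anomaly detection, and log-outlier detection. Localization diverges by modality: CCL counters / network paths / logs / metric similarity. |
| 2026-06-04 | ingest-paper | GPU failure management, 6 papers | [[@2025__SC__Fine-grained Automated Failure Management]] and others | Cross-cutting organization along four axes: primary detection signals, the 3+1 recovery families, staged mitigation, and instrumentation location. |
| 2026-06-04 | ingest-paper | GPU/eBPF observability, 7 papers | [[Mycroft]] / [[NCCLX]] / [[eGPU]] and others | Overhead and observation targets diverge by when instrumentation is inserted (compile time / runtime PTX / host-side eBPF). Mycroft and NCCLX form the observation side and the mechanism side of CCL, respectively. |
| 2026-06-04 | ingest-paper | PromSketch | [[@2025__VLDB__PromSketch]] | Adds "approximate query processing" as a third axis for TSDBs. EH × sketch tolerates 5% error and cuts latency by up to two orders of magnitude and cost by roughly 400×. |
| 2026-06-04 | ingest-paper ×5 | GPU training infrastructure/network, 5 papers | [[ByteRobust]] / [[R-Pingmesh]] / [[Astral]] and others | Vertical lineage: hardware floor → stragglers → fault-tolerant infrastructure → network monitoring. H100 memory MTBE is 1/3.2 that of the A100. |
| 2026-06-04 | ingest-paper | Cloud Infra Management | [[@2025__OSR__Cloud Infrastructure Management in the Age of AI Agents]] | Puts the IaC trilogy in perspective within "four modalities (SDK/CLI/IaC/ClickOps)." The agent-cloud interface converges with AIOpsLab's ACI. |
| 2026-06-04 | ingest-paper | TimeCopilot | [[@2025__arXiv__TimeCopilot]] | Unified API over multiple TSFMs plus an LLM. MedianEnsemble achieves the best GIFT-Eval CRPS for about $24. Representative of the Workflow paradigm. |
| 2026-06-04 | ingest-paper | TimeSeriesScientist | [[@2025__arXiv__TimeSeriesScientist]] | Without foundation models, beats direct LLM forecasting by 38.2% on average. Ablation removing preprocessing: MAE +41.8%. The opposite pole of the Workflow paradigm. |
| 2026-06-04 | query | Differences between standalone TSFM and VLM integration | [[TSFM単体とVLM統合の本質的差異]] | The VLM-integrated version reuses Toto as a time series encoder rather than as a forecaster. ARFBench accuracy 63.9% (1.2 pp above GPT-5). |
| 2026-06-04 | ingest-paper | OpenRCA / Cloud-OpsBench / AlertGuardian | [[2025__ICLR__OpenRCA]] / [[2026__arXiv__Cloud-OpsBench]] / [[2025__ASE__AlertGuardian]] | An RCA-specific "third benchmark type." The capability ceiling is orders of magnitude lower (Claude 3.5 = 11.34%, Hard 0%). |
| 2026-06-04 | ingest-paper | TSFM Survey, 6 dimensions | [[2025__arXiv__Foundation Models for Time Series - A Survey]] | Surveys the TSFM family with a 6-dimension taxonomy (architecture / patching / objective function / univariate vs. multivariate / probabilistic vs. deterministic / scale). |
| 2026-06-04 | ingest-paper | Cast-R1 | [[2026__arXiv__Cast-R1]] | Representative AgenticRL approach for ATSF. Removing RL causes the largest degradation (NP 24.750→54.631). Removing Chronos-2 takes volatile NP MSE from 22.5 to 55.4. |

## Key Outcomes

- Systematized failure and performance diagnosis for production LLM training from 8+6+7+5=26 primary papers across four parallel ingests. Established the primary detection signals (step time / collective-communication synchronization points / physical metrics / network traffic), the 3+1 recovery families (fast checkpointing / spare machines / idempotent skipping / data-parallel redundancy), and the fault-localization contrast of "pinpoint precisely vs. deliberately cut coarsely (over-exclusion)" (from 2026-06-04 8-, 6-, 7-, and 5-paper batch entries)
- Cast-R1's ablations individually corroborate the claims of the ATSF position paper: removing RL causes the largest degradation (NP 24.750→54.631), removing Chronos-2 alone takes volatile NP MSE from 22.5 to 55.4, and removing all forecasting-model tools takes ETTh1 from 6.062 to 15.993. With TimeCopilot (Workflow / TSFM ensemble) and TimeSeriesScientist (Workflow / no foundation model), implementations of the three paradigms have now been covered (from 2026-06-04 Cast-R1, TimeCopilot, TimeSeriesScientist entries)
- OpenRCA (static telemetry QA) and Cloud-OpsBench (deterministic State Snapshot) exemplify an RCA-specific "third benchmark type," confirming that the capability ceiling is orders of magnitude lower, at Claude 3.5 = 11.34% and Hard 0% (from 2026-06-04 OpenRCA/Cloud-OpsBench/AlertGuardian entry)
- Microsoft's three TSG-automation papers (FLASH/StepFly/LLexus) diverge on "where to put the LLM to work" yet independently converge on "TSG quality is the rate-limiting factor for automation." The assurance contract from the NetOps-AIOps survey generalizes TNR and Google L0–L4 under a single vocabulary (from 2026-06-05 incident automation 4-paper entry)
- LLM4Log maps the entire pipeline of log generation → parsing → representation → downstream diagnosis and generalizes the hierarchical design of "narrowing the information before calling the LLM" as a design principle common to every stage. Only 5 of 162 records carry deployment evidence (from 2026-06-05 LLM4Log entry)
- The 6-dimension TSFM taxonomy gives an overview of the TSFMs the vault had examined individually, making explicit that classification by objective function is a distinct axis and that the criteria for "multivariate" disagree (Falcon-X vs. the survey) (from 2026-06-04 TSFM Survey entry)

## Cross-entry Themes

- **Observation and diagnosis of LLM training infrastructure became the wiki's largest cluster in a single step**: the parallel ingest of 26 papers densely connected the concepts ([[LLM学習モニタリング]]/[[集合通信]]/[[GPUクラスタ運用]]/[[耐障害LLM訓練]]/[[ストラグラー]]/[[GPUレジリエンス]]/[[RDMAネットワーク監視]]) with 100+ entities. The "all machines are symmetric" regularity, the divergence of modalities by instrumentation location, and the dominance of hardware-induced faults emerged as structure that transcends individual papers (supported by: 2026-06-04 8-, 6-, 7-, and 5-paper batch entries)
- **The three paradigms of agentic time series forecasting have been covered end to end**: Cast-R1 (AgenticRL, 2026-06-04), TimeCopilot (Workflow / TSFM ensemble, 2026-06-04), and TimeSeriesScientist (Workflow / no foundation model, 2026-06-04) supply implementations matching the theoretical ATSF classification (previous batch). TimeSeriesScientist illustrates that "the source of forecasting power is the organization of the process, not model scale," whereas Cast-R1 also shows scale dependence (supported by: 2026-06-04 Cast-R1, TimeCopilot, TimeSeriesScientist, TSFM query entries)
- **"Narrow the information, then reason" is established as a design principle for AIOps pipelines**: LLM4Log's whole log pipeline (2026-06-05), MonitorAssistant's restriction to the meta layer (previous batch), AlertGuardian's whole-lifecycle optimization (2026-06-04), and the matching patterns across DB and cloud in LLM × DATA (2026-06-05) together generalize observations from individual sources (supported by: 2026-06-05 LLM4Log, LLM × DATA entries + 2026-06-04 OpenRCA/Cloud-OpsBench/AlertGuardian entry)
- **IaC/cloud management was organized as a cross-modality design space**: the Cloud Infra Management vision paper (2026-06-04) positions the IaC trilogy (previous batch) as one of four modalities, and its agent-cloud interface converges with AIOpsLab and the SRE AI Autonomy Levels (supported by: 2026-06-04 Cloud Infra Management entry)

## Contradictions or Corrections

- Cast-R1: the implementation settings conflict between the main text (8B / 4×A800) and the Appendix (1.7B / RTX 4090D). The main-result numbers in Table 2 match the 4B row of the scaling table. ACM placeholders remain. Recorded on the source page as defects of an unfinished preprint.
- XPUTimer: in arXiv v2 the author list changed and the system was renamed Flare. A contradiction callout is on the source page, and both names are kept in the entity's aliases.
- TSFM Survey: the Toto spec discrepancy (survey: 103M, 1 trillion points vs. vault: 151M, about 2.36 trillion points) reflects different model versions. For Chronos, the survey's first-generation model (T5) and Chronos-2 are different generations.
- Guangba Yu: affiliation discrepancy between SYSU and CUHK. The contradiction callout is retained.

## Child Pages

- [[@2024__MSR__FLASH - A Workflow Automation Agent for Diagnosing Recurring Incidents]]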
- [[@2025__arXiv__StepFly - Agentic Troubleshooting Guide Automation for Incident Diagnosis]]
- [[@2024__OSR__LLexus - an AI agent system for incident management]]
- [[@2026__arXiv__Large Language Models for Agentic NetOps and AIOps - Architectures, Evaluation, and Safety]]
- [[@2025__arXiv__A Survey of LLM × DATA]]
- [[@2026__arXiv__LLM4Log - A Systematic Review of Large Language Model-based Log Analysis]]
- [[@2025__ISAV__From Exploration to Explanation - ML-Driven Causal Discovery for Datacenter Reliability at Scale]]
- [[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]]
- [[@2025__SIGCOMM__SkeletonHunter - Diagnosing and Localizing Network Failures in Containerized Large Model Training]]
- [[@2025__IWQoS__eACGM - Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems]]
- [[@2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]]
- [[@2025__NSDI__Evolution of Aegis - Fault Diagnosis for AI Model Training Service in Production]]
- [[@2025__DSN__LLMPrism - Black-box Performance Diagnosis for Production LLM Training Platforms]]
- [[@2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- [[@2025__SC__Fine-grained Automated Failure Management for Extreme-Scale GPU Accelerated Systems]]
- [[@2026__MLSys2026__Guard - Scalable Straggler Detection and Node Health Management for Large-Scale Training]]
- [[@2025__arXiv__FlashRecovery - Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs]]
- [[@2025__SIGCOMM__Hawkeye - Diagnosing RDMA Network Performance Anomalies with PFC Provenance]]
- [[@2025__HPCA__Enhancing Large-Scale AI Training Efficiency - The C4 Solution]]
- [[@2025__APNET__Forewarned is Forearmed - Joint Prediction and Classification of Optical Transceiver Failures]]
- [[@2025__SOSP__Mycroft - Tracing Dependencies in Collective Communication Towards Reliable LLM Training]]
- [[@2025__arXiv__Collective Communication for 100k+ GPUs]]
- [[@2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]]
- [[@2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]]
- [[@2024__arXiv__Microsecond-scale Dynamic Validation of Idempotency for GPU Kernels]]
- [[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]]
- [[@2025__eBPF__eInfer - Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF]]
- [[@2025__VLDB__Approximation-First Timeseries Monitoring Query At Scale]]
- [[@2024__SIGCOMM__R-Pingmesh - A Service-Aware RoCE Network Monitoring and Diagnostic System]]
- [[@2025__SC__Characterizing GPU Resilience and Impact on AI - HPC Systems]]
- [[@2025__OSDI__Understanding Stragglers in Large Model Training Using What-if Analysis]]
- [[@2025__SOSP__Robust LLM Training Infrastructure at ByteDance]]
- [[@2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]]
- [[@2025__OSR__Cloud Infrastructure Management in the Age of AI Agents]]
- [[@2025__arXiv__TimeCopilot]]
- [[@2025__arXiv__TimeSeriesScientist - A General-Purpose AI Agent for Time Series Analysis]]
- [[TSFM単体とVLM統合の本質的差異]]
- [[2025__ICLR__OpenRCA - Can Large Language Models Locate the Root Cause of Software Failures]]
- [[2026__arXiv__Cloud-OpsBench - A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems]]
- [[2025__ASE__AlertGuardian - Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems]]
- [[2025__arXiv__Foundation Models for Time Series - A Survey]]
- [[2026__arXiv__Cast-R1 - Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting]]
- [[FLASH]]
- [[StepFly]]
- [[LLexus]]
- [[OpenRCA]]
- [[Cloud-OpsBench]]
- [[AlertGuardian]]
- [[Aegis]]
- [[SkeletonHunter]]
- [[L4]]
- [[LLMPrism]]
- [[XPUTimer]]
- [[eACGM]]
- [[Mycroft]]
- [[NCCLX]]
- [[eGPU]]
- [[PromSketch]]
- [[ByteRobust]]
- [[R-Pingmesh]]
- [[Astral]]
- [[Cast-R1]]
- [[TimeCopilot]]
- [[TimeSeriesScientist]]
- [[PACE]]
- [[DyTwin]]
- [[TSG自動化]]
- [[NetOps]]
- [[エージェント運用安全性]]
- [[GPU観測性]]
- [[集合通信]]
- [[動的インストルメンテーション]]
- [[べき等性]]
- [[チェックポイント]]
- [[耐障害LLM訓練]]
- [[ストラグラー]]
- [[GPUレジリエンス]]
- [[RDMAネットワーク監視]]
- [[近似クエリ処理]]
- [[クラウド管理モダリティ]]
- [[ログパース]]
- [[ログ生成]]
- [[エージェント型時系列予測]]
- [[時系列基盤モデル]]
- [[多変量時系列予測]]

## Related

- [[DragonScale Memory]] - fold-operator spec
- [[log]] - source entries
- [[index]] - vault catalog