log - yuuk1's Digital Garden

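## [2026-07-21] ingest-paper | Don't Predict, Prioritize: Rethinking GPU Reliability Assessment
- Source: `.raw/papers/arxiv-2607.15115.pdf` (https://arxiv.org/pdf/2607.15115; official arXiv PDF fetched with `scripts/fetch-paper-pdf.sh`, bibliographic details filled in from the arXiv abs page via WebFetch)
- Summary: [[@2026__arXiv__Don't Predict, Prioritize - Rethinking GPU Reliability Assessment]]
- Pages created: [[@2026__arXiv__Don't Predict, Prioritize - Rethinking GPU Reliability Assessment]], [[Difeng Ma]], [[Yuanwei Lu]], [[Quan Zhou]], [[Daxin Jiang]], [[Jingjing Li]]
- Pages updated: [[Changhua Pei]], [[Gaogang Xie]], [[Zexin Wang]], [[Yibo Zhu]], [[Dan Pei]], [[Chinese Academy of Sciences]], [[University of Chinese Academy of Sciences]], [[Tsinghua University]], [[StepFun]], [[障害予測]], [[GPUレジリエンス]], [[wiki/index.md]], [[wiki/hot.md]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]]
- Key insight: A paper by Difeng Ma, [[Changhua Pei]], and colleagues (Computer Network Information Center, [[Chinese Academy of Sciences]] / [[University of Chinese Academy of Sciences]] / [[StepFun]] / [[Tsinghua University]]) that systematically demonstrates the limits of GPU failure prediction and reformulates the problem as risk ranking (KDD '26 V.2, arXiv 2026-07-16). In time-series prediction experiments across five models (XGBoost, CNN, LSTM, Transformer, MoE), even the best model with an 8-hour observation window reaches an F1 of at most 0.4837. Three statistical analyses (Kendall correlation, SNR analysis, distribution comparison) attribute this to properties of the telemetry itself: workload-dependent non-stationarity, signal diffusion, and distribution overlap. The authors also find that host-level failures follow a concentrated Pareto distribution (fewer than the top 10% of hosts account for more than 30% of critical failures, holding steady at 24-33% across quarters, χ² test p≪10^-10), and they recast the problem as a risk-ranking task with HeaRank, a lightweight MLP-based Learning-to-Rank model. On a production cluster it achieves AUC 0.834 and NDCG@5=0.427 (a 38% improvement over LightGBM Ranker). Over a six-month production deployment (2025-07 to 2026-01), 64% of failures fell on the top 5% of nodes by risk (versus 21% for the existing Health Score system), with estimated savings of about USD 50,000 of GPU time per month. Added to the existing concept [[障害予測]] (failure prediction) the central finding that "where precise time prediction breaks down, reformulation as ranking becomes the alternative paradigm" (including the point that the time-axis parameters of Salfner+ 2010 become counterproductive under ranking). Added to [[GPUレジリエンス]] (GPU resilience) the granularity correspondence between host-level Pareto concentration and component-level weakness distributions, and risk ranking as a third path for operational response. Selected and embedded 7 figures (Kendall correlation workload comparison, SNR box plot, Pareto host failure density, ablation, scheduling architecture, CDF comparison, telemetry pipeline), chosen from 13 embedded raster images; the 12 page-render images were deleted. The CNIC/CAS authors (Changhua Pei, Gaogang Xie, Zexin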
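Wang) belong to the same group as the existing AIOps work (COMET, UModel, and others); StepFun's Yibo Zhu joins on a third research axis following Tiresias (GPU cluster scheduling) and DistServe (LLM inference).

## [2026-07-21] ingest | Tales from the Lunar Module Guidance Computer
- Source: `.raw/articles/tales-2026-07-21.md` (https://www.doneyles.com/LM/Tales.html; direct defuddle parsing via WebFetch failed with `Error: aborted`, so the HTML was fetched with curl using a browser UA and then converted to Markdown as a local file with defuddle parse)
- Summary: [[@2004__AAS__Tales from the Lunar Module Guidance Computer]]
- Pages created: [[@2004__AAS__Tales from the Lunar Module Guidance Computer]], [[Don Eyles]], [[Allan Klumpp]], [[Hal Laning]], [[Apollo Guidance Computer]], [[MIT Instrumentation Laboratory]], [[優先度駆動リアルタイム実行系]], [[リスタート保護]], [[インターフェース仕様の齟齬による障害]], [[制御ループの安定性とタイムラグ補償]]
- Pages updated: [[Margaret Hamilton]], [[べき等性]], [[チェックポイント]], [[根本原因分析]], [[ポストモーテム]], [[wiki/index.md]], [[wiki/hot.md]]
- Key insight: A memoir by [[Don Eyles]], flight software engineer for the Apollo Lunar Module Guidance Computer (AAS 04-064, 2004). Through first-hand testimony it locates the root cause of the Apollo 11 1201/1202 program alarms in the ICD between the rendezvous radar and the ATCA, which specified only "frequency synchronization" and not "phase synchronization" (→ [[インターフェース仕様の齟齬による障害]]), and it disputes the superficial "computer error" attribution in contemporary press coverage. It details how the priority-driven preemptive Executive/Waitlist designed by [[Hal Laning]] (→ [[優先度駆動リアルタイム実行系]]) and waypoint-based restart protection (→ [[リスタート保護]]) avoided catastrophic collapse under resource exhaustion and worked as an unintended fault-tolerance mechanism. On the throttle-oscillation "castellation" problem, the post-hoc analysis by [[Allan Klumpp]] suggests that although the time-lag value in the ICD (0.3 s) was already out of date, the author's under-compensation (0.2 s), chosen on empirical judgment, ended up on the stable side and may have saved the landing (→ [[制御ループの安定性とタイムラグ補償]]). Added to the existing concepts [[べき等性]] (idempotence) and [[チェックポイント]] (checkpointing) the manual waypoint approach used in an environment without dynamic instrumentation, as the opposite pole from modern automated optimization. Added to [[根本原因分析]] (root cause analysis) and [[ポストモーテム]] (postmortem) the observation that this objection to single-cause attribution predates modern SRE culture by half a century. Added a contradiction callout to both this page and the [[Margaret Hamilton]] page on the granularity of personal attribution for the Executive design (organizational leader vs. individual designer). Downloaded with curl and embedded 5 figures (PGNS block diagram, DSKY, rendezvous radar interface diagram, the handwritten action-item memo from when the castellation was discovered, measured throttle-oscillation data).

## [2026-07-21] ingest-paper | Mach: A Pluggable Metrics Storage Engine for the Age of Observability
- Source: `.raw/papers/p12-solleza.pdf` (https://vldb.org/cidrdb/papers/2022/p12-solleza.pdf; official vldb.org cidrdb archive, CIDR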
2022 paper)
- Summary: [[@2022__CIDR__Mach - A Pluggable Metrics Storage Engine for the Age of Observability]]
- Pages created: [[@2022__CIDR__Mach - A Pluggable Metrics Storage Engine for the Age of Observability]], [[Andrew Crotty]], [[Mach]]
- Pages updated: [[Franco Solleza]], [[Nesime Tatbul]], [[Stan Zdonik]], [[Suman Karumuri]], [[Brown University]], [[Carnegie Mellon University]], [[Slack Technologies]], [[時系列データベース]], [[専用データベースシステム]], [[wiki/index.md]], [[wiki/hot.md]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]]
- Key insight: The paper proposing Mach, a pluggable storage engine dedicated to metrics (CIDR 2022), by [[Franco Solleza]], [[Andrew Crotty]], [[Suman Karumuri]], [[Nesime Tatbul]], and [[Stan Zdonik]] ([[Brown University]], [[Carnegie Mellon University]], [[Slack Technologies]], Intel Labs, MIT). Its architecture has multiple independent writer threads that behave in a loosely coupled way (no mutex coordination). Based on the observation that mutex acquisition alone accounts for about 25% of Prometheus's write overhead, it removes the coordination overhead itself. Combined with an append-mostly fast path, bulk compression per active segment, and a short, deterministic snapshot mechanism, preliminary experiments (Rust implementation, compared with Prometheus/InfluxDB/RocksDB) achieve up to 480M f64/s of writes on a single node (about 10x the best existing system), scaling to 1 million data sources, and up to 3x read throughput. Added to the existing concept [[時系列データベース]] (time-series databases) a sixth axis of ingest optimization, "removing coordination itself", and the asymmetric trade-off that "writes can block reads". Added to [[専用データベースシステム]] (specialized database systems) the insight that "what gets removed can be the synchronization primitive itself rather than a feature". Four of the authors (Solleza, Tatbul, Zdonik, Karumuri) are the same people as the co-authors of the existing [[@2021__SIGMOD Record__Towards Observability Data Management at Scale]], and the Slack outage of 2020-05-12 and the scale figures (4B sources/day, 12M samples/s) were confirmed as re-citations of the same facts from that paper (no contradiction). All 8 figures are embedded raster images (data example, time/space dimension diagram, architecture diagram, 2 write/read path diagrams, 3 throughput graphs) and all were reusable, so no PyMuPDF cropping was needed.

## [2026-07-20] ingest-paper | Dremel: Interactive Analysis of Web-Scale Datasets
- Source: `.raw/papers/dremel-vldb2010.pdf` (https://storage.googleapis.com/gweb-research2023-media/pubtools/3293.pdf; PDF link obtained via the official research.google page https://research.google/pubs/dremel-interactive-analysis-of-web-scale-datasets-2/)
- Summary: [[@2010__VLDB__Dremel - Interactive Analysis of Web-Scale Datasets]]
- Pages created: [[@2010__VLDB__Dremel - Interactive Analysis of Web-Scale Datasets]], [[Sergey Melnik]], [[Andrey Gubarev]], [[Jing Jing Long]], [[Geoffrey Romer]], [[Shiva Shivakumar]], [[Matt Tolton]], [[Theo Vassilakis]], [[MapReduce]], [[Protocol Buffers]], [[ネスト型カラムナストレージ]]
- Pages updated: [[Google]], [[列指向OLAPデータベース]], [[並列データベース]], [[wiki/index.md]], [[wiki/hot.md]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]]
- Key insight: The paper proposing the interactive query system Dremel (VLDB 2010), by Sergey Melnik and colleagues ([[Google]], Inc.). It combines a columnar storage representation based on repetition levels and definition levels, which losslessly decomposes nested records into columns and reassembles them, with a multi-level serving tree derived from web search engines, and runs aggregation queries over trillion-row tables in seconds. It states explicitly that the design complements MapReduce rather than replacing it. In a 3000-node experiment it reads only about 0.5 TB where MR-on-records reads 87 TB, and it cuts execution time by two orders of magnitude (hours → minutes → seconds). Created the new concept [[ネスト型カラムナストレージ]] (nested columnar storage). Added to the existing concept [[列指向OLAPデータベース]] (columnar OLAP databases) the cross-cutting insight that "the columnar extension to nested data was already reached once in 2010, and the Snowflake VARIANT type of 2016 is a re-arrival from the relational DBMS side". Added to [[並列データベース]] (parallel databases) the cross-cutting insight that "the multi-level serving tree from web search was applied as a third means of parallelization that DeWitt/Gray had not anticipated". By coincidence, [[@2004__OSDI__MapReduce - Simplified Data Processing on Large Clusters]], ingested in parallel on the same day, already existed in this wiki as the MapReduce this paper refers to, so the entity [[MapReduce]] was created to cross-reference both papers. All figures are vector drawings (the 87 embedded raster images are all tiny fragments used for field decoration and unusable as whole figures), so 6 figures (Figure 1, 2, 3, 7, 9, 10) were extracted with PyMuPDF caption-coordinate cropping.

## [2026-07-20] ingest-paper | MapReduce: Simplified Data Processing on Large Clusters
- Source: `.raw/papers/dean.pdf` (https://www.usenix.org/legacy/events/osdi04/tech/full_papers/dean/dean.pdf; official USENIX archive. The conference page https://www.usenix.org/conference/osdi-04/mapreduce-simplified-data-processing-large-clusters returned 403 to WebFetch, so it was fetched as a fallback with curl using a browser UA)
- Summary: [[@2004__OSDI__MapReduce - Simplified Data Processing on Large Clusters]]
- Pages created: [[@2004__OSDI__MapReduce - Simplified Data Processing on Large Clusters]]
- Pages updated: [[Jeffrey Dean]], [[Sanjay Ghemawat]], [[Google]], [[Google File System]], [[タスク並列フレームワーク]], [[wiki/index.md]], [[wiki/hot.md]], [[wiki/sources/_index.md]]
- Key insight: The paper proposing MapReduce (OSDI '04), by [[Jeffrey Dean]] and [[Sanjay Ghemawat]] ([[Google]]). With map/reduce
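as the only two functions a user writes, the paper presents a programming model for expressing parallel distributed computation on large clusters, together with a fault-tolerant implementation characterized by centralized master scheduling, fault tolerance through task re-execution, [[Google File System]] locality optimization, and a backup-task mechanism for mitigating stragglers (disabling it in the sort benchmark increases run time by 44%). As of August 2004 it ran in production at a scale of 29,423 jobs and 3,288 TB of input per month. Added three cross-cutting insights to the existing concept [[タスク並列フレームワーク]] (task-parallel frameworks; previously based on the single source Ray, OSDI 2018): the contrast between MapReduce as the origin of the BSP static DAG model and Ray's separated GCS design; the divergence in fault-tolerance strategy between whole-task re-execution and lineage-based partial recomputation originating in Spark RDDs; and the backup-task mechanism as a development of eager scheduling. This makes the lineage across multiple sources explicit for the first time. All figures and tables are vector drawings (0 embedded raster images), so 5 items (Figures 1-4 and Table 1) were extracted with PyMuPDF caption-coordinate cropping.

## [2026-07-20] ingest-paper | The Snowflake Elastic Data Warehouse
- Source: `.raw/papers/2026_Unknown_The_Snowflake_Elastic_Data_Warehouse.pdf` (local file input. The hints in the file name, year 2026 and author unknown, are wrong; the copyright notice and DOI in the body identify it as a SIGMOD 2016 paper)
- Summary: [[@2016__SIGMOD__The Snowflake Elastic Data Warehouse]]
- Pages created: [[@2016__SIGMOD__The Snowflake Elastic Data Warehouse]], [[Snowflake Computing]], [[Benoit Dageville]], [[Thierry Cruanes]], [[Marcin Zukowski]]
- Pages updated: [[Amazon Web Services]], [[シェアードナッシング]], [[並列データベース]], [[データパーティショニング]], [[列指向OLAPデータベース]], [[wiki/index.md]], [[wiki/hot.md]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]]
- Key insight: Snowflake's industry paper introducing the "multi-cluster, shared-data architecture", which separates storage (S3) and compute into loosely coupled services. Its hallmark is design consistency: with table-file immutability at the core, Snapshot Isolation, time travel, cloning, and online upgrade all derive from a single principle. The existing concepts [[シェアードナッシング]] (shared-nothing) and [[並列データベース]] (parallel databases) carried the open question of "how to classify cloud-native disaggregated architectures". It turned out that the Snowflake paper itself goes no further than coining the self-applied term "multi-cluster, shared-data" and that a scholarly four-way classification is not yet established, so the open questions on both pages were updated. All figures are vector drawings (0 embedded raster images), so Figures 1-6 were extracted with PyMuPDF caption-coordinate cropping.

## [2026-07-20] ingest-paper | Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3
- Source: `.raw/papers/shardstore-sosp21.pdf`
- Summary: [[@2021__SOSP__Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3]]
- Pages created: [[@2021__SOSP__Using Lightweight Formal Methods to Validate a Key-Value Storage Node in Amazon S3]], [[ShardStore]], [[James Bornholt]], [[軽量形式手法]]
- Pages updated: [[Amazon Web Services]], [[LSMツリー]], [[wiki/index.md]], [[wiki/hot.md]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]]
- Key insight: Defined as a new concept the "lightweight formal methods" approach, which validates ShardStore, the new storage node for Amazon S3, using a reference model written in the same language as the implementation (Rust), property-based testing, and stateless model checking (Loom/Shuttle). Added to the existing concept [[LSMツリー]] (LSM tree), in contrast with Bigtable/Cassandra, the design in which ShardStore places shard data in extents outside the LSM tree and separates crash consistency from the LSM tree through a declarative Dependency type. Figure 1 (on-disk layout) and Figure 2 (dependency graph) were obtained with PyMuPDF caption-coordinate cropping because there were 0 embedded raster images. Figures 3 and 4 (code listings) and Figures 5 and 6 (tables) were not rendered as images but transcribed into the body in structured form.

## [2026-07-20] ingest-paper (duplicate detected, figures updated) | B-Trees Are Back: Engineering Fast and Pageable Node Layouts
- Source: `/Users/y-tsubouchi/Downloads/2026_Unknown_B_Trees_Back_Engineering_Fast.pdf` (exact MD5 match with, and duplicate of, `.raw/papers/1-3709664.pdf` ingested on 2026-06-14)
- Summary: [[@2025__SIGMOD__B-Trees Are Back - Engineering Fast and Pageable Node Layouts]]
- Pages updated: [[@2025__SIGMOD__B-Trees Are Back - Engineering Fast and Pageable Node Layouts]]
- Key insight: A duplicate detection rather than a new ingest (the source page, entities, and concepts already existed). At the user's choice only the figures were updated: Figures 2, 3, 4, 5 (proposed-method section) and Figures 1 and 17 (experimental-results section) were re-captured with PyMuPDF caption-coordinate cropping at high resolution and with accurate captions, and Figure 6 (semi/fully dense leaf) and Figure 14 (leaf layout transitions) were newly added. These replaced the old generic captions ("shows ..."), the low-quality images, and the unreferenced orphan page-render images.

## [2026-07-20] ingest-paper | Aurora DSQL: Scalable, Multi-Region OLTP
- Source: `.raw/papers/arxiv-2607.13276.pdf` (https://arxiv.org/abs/2607.13276)
- Summary: [[@2026__arXiv__Aurora DSQL - Scalable, Multi-Region OLTP]]
- Pages created: [[@2026__arXiv__Aurora DSQL - Scalable, Multi-Region OLTP]], [[Aurora DSQL]]
- Pages updated: [[Marc Brooker]], [[Amazon Aurora (Database)]], [[分散SQLデータベース]], [[地理分散SQLデータベース]], [[分散トランザクション]], [[分散コンセンサス回避]], [[クォーラムベースレプリケーション]]
- Key insight: Aurora DSQL splits Query Processor, Adjudicator, Journal, Crossbar, and Storage into a disaggregated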
architecture that combines coordination-free reads via MVCC with write buffering via OCC and coordinates across regions only at commit time, which is how it delivers multi-region distributed SQL. It is positioned as a fourth lineage of coordination reduction, following the pessimistic locking of Spanner/CockroachDB, the log reduction of classic Aurora, and the cross-AZ restriction of Aurora Limitless. Added as cross-cutting insights to five existing concepts: the commit protocol among multiple Adjudicators is a new pattern that separates "atomicity of voting" from "atomicity of commit", and 2-of-3 erasure coding across Journals improves both latency variance and availability. All figures and tables are vector drawings (0 embedded raster images), so Figures 1, 3, 4, 6, 9, and 10 were extracted with PyMuPDF caption-coordinate cropping.

## [2026-07-20] ingest-paper | LLM hallucinations in the wild: Large-scale evidence from non-existent citations
- Source: `.raw/papers/arxiv-2605.07723.pdf` (https://arxiv.org/abs/2605.07723)
- Summary: [[@2026__arXiv__LLM hallucinations in the wild]]
- Pages created: [[@2026__arXiv__LLM hallucinations in the wild]], [[LLMのハルシネーション]], [[Zhenyue Zhao]], [[Yihe Wang]], [[Toby Stuart]], [[Mathijs De Vaan]], [[Paul Ginsparg]], [[Yian Yin]]
- Pages updated: [[Cornell University]], [[University of California, Berkeley]], [[Tsinghua University]], [[@2023__arXiv__GPT-4 Technical Report]]
- Key insight: A research team from Cornell University, UCLA, Tsinghua University, and UC Berkeley uses academic citations as a verifiable target. It audits 111 million references from arXiv, bioRxiv, SSRN, and PubMed Central and, from the difference in the unmatched-citation rate before and after the arrival of LLMs, estimates at population scale 146,932 hallucinated citations in 2025 alone. The contamination is not concentrated in a few heavily contaminated papers but spread thinly across many papers. arXiv moderation lets 78.8% of hallucinated citations through, and 85.3% survive the bioRxiv→PMC publication transition, showing that existing quality control is not keeping pace with the growth.

## [2026-07-20] ingest-slides | 30分でわかるデータ指向アプリケーションデザイン (Data Engineering Study #18)
- Source: `https://speakerdeck.com/xerial/30fen-dewakarudetazhi-xiang-apurikesiyondezain-data-engineering-study-number-18` (37 pages)
- Visual pages: `.raw/slides/2023__DataEngineeringStudy__30bun-de-wakaru-data-shikou-application-design/pages/`
- Media: none
- Summary: [[@2023__DataEngineeringStudy__30分でわかるデータ指向アプリケーションデザイン]]
- Pages created: [[@2023__DataEngineeringStudy__30分でわかるデータ指向アプリケーションデザイン]], [[Taro L. Saito]], [[導出データ]]
- Pages updated: [[Amazon Aurora (Database)]], [[DuckDB]], [[分散トランザクション]]
- Key insight: A talk by Taro L. Saito, supervising translator of 『データ指向アプリケーションデザイン』 (the Japanese edition of Designing Data-Intensive Applications), that reorganizes the five years of developments since the original's publication along the original's framework. It places classic Amazon Aurora's (SIGMOD 2018) avoidance of 2PC through a gossip protocol within [[分散トランザクション]] (distributed transactions) as part of the lineage of "coordination avoidance through architecture rather than protocol optimization". The shift in what an RDBMS "table" means, from a snapshot to derived data, was set up as the new concept [[導出データ]] (derived data).
- Note: The helper scripts of the wiki-ingest-slides skill (`scripts/fetch-slide-deck.sh` and others) did not exist in the repository, so the PDF link was extracted manually from the SpeakerDeck page and `pdftotext`/`pdftoppm` were run directly to perform equivalent processing.

## [2026-07-20] ingest-slides | Query Rewriting and Optimization (DiDi Course #8)
- Source: `https://blobs.duckdb.org/slides/DiDi-08.pdf` (36 pages)
- Visual pages: `.raw/slides/DiDi-08-Query-Rewriting-Optimization/pages/`
- Media: none
- Summary: [[@2026__DiDi__Query Rewriting and Optimization]]
- Pages created: [[@2026__DiDi__Query Rewriting and Optimization]], [[クエリオプティマイザ]], [[結合順序最適化]], [[クエリ非相関化]]
- Pages updated: [[Torsten Grust]], [[Universität Tübingen]], [[DuckDB]]
- Key insight: DuckDB's query optimizer applies more than 30 optimization passes in a single direction without fixpoint iteration. For the combinatorial explosion of join-order search (Catalan numbers) it uses the DPhyp dynamic programming algorithm of Moerkotte & Neumann, and for correlated subqueries it uses a systematic DEPENDENT_JOIN rewrite (query decorrelation) based on Neumann & Kemper's Unnesting Arbitrary Queries.

## [2026-07-20] ingest-slides | Vectorized Query Execution (DiDi Course #7)
- Source: `https://blobs.duckdb.org/slides/DiDi-07.pdf` (31 pages)
- Visual pages: `.raw/slides/DiDi-07/pages/`
- Media: none
- Summary: [[@2026__DiDi__Vectorized Query Execution]]
- Pages created: [[@2026__DiDi__Vectorized Query Execution]]
- Pages updated: [[Torsten Grust]], [[Universität Tübingen]], [[DuckDB]], [[SIMDベクトル処理]], [[分岐予測]], [[パイプライン処理]]
- Key insight: DuckDB treats super-specific code generation for every combination of type × physical representation of vector operations as the ideal, but to avoid combinatorial explosion it takes a two-stage approach: conversion to a unified representation (data vector + selection vector) and compile-time code generation with C++ templates. This design decision can be traced concretely as a call chain in the actual DuckDB 1.4 source code (`ExpressionExecutor`→`VectorOperations::Equals`→`BinaryExecutor::ExecuteGenericLoop`).

## [2026-07-20] ingest-slides | Query Execution Plans and Pipelining (DiDi Course #6)
- Source: `https://blobs.duckdb.org/slides/DiDi-06.pdf` (17 pages)
- Visual pages: `.raw/slides/DiDi-06/pages/`
- Media: none
- Summary: [[@2026__DiDi__Query Execution Plans and Pipelining]]
- Pages created: [[@2026__DiDi__Query Execution Plans and Pipelining]], [[クエリ実行プラン]], [[プッシュ型パイプライン実行]]
- Pages updated: [[Torsten Grust]], [[Universität Tübingen]], [[DuckDB]], [[並列データベース]]
- Key insight: The DeWitt/Gray (1992) classification into pipeline parallelism and partition parallelism takes concrete form in DuckDB as an operator-level implementation inside a single-process DBMS (a chain of trivially parallel operators plus the three sink phases Sink/Combine/Finalize). Skew is avoided not by dynamic load balancing but by design choices (morsel granularity, pipeline decomposition). Some embedded images had a page-mapping offset (page-016→page-017), which was found and fixed during source verification.

## [2026-07-20] ingest-slides | The ART of Indexing (DiDi Course #5)
- Source: `https://blobs.duckdb.org/slides/DiDi-05.pdf` (22 pages)
- Visual pages: `.raw/slides/didi-05-the-art-of-indexing/pages/`
- Media: none
- Summary: [[@2026__DiDi__The ART of Indexing]]
- Pages created: [[@2026__DiDi__The ART of Indexing]], [[Adaptive Radix Tree]], [[Zonemap]]
- Pages updated: [[Torsten Grust]], [[Universität Tübingen]], [[DuckDB]], [[B-Tree]]
- Key insight: ART and Zonemap are deliberately different solutions to the same problem of "avoiding a full row scan". A Zonemap is an always-on, row-group-granularity, nearly free skip filter; ART is an opt-in, row-granularity structure that carries real memory and maintenance costs. DuckDB's two-tier index design follows from the fact that neither alone covers the full range of selectivity.

## [2026-07-20] ingest-slides | Sorting Large Tables (DiDi Course #4)
- Source: `https://blobs.duckdb.org/slides/DiDi-04.pdf` (11 pages)
- Visual pages: `.raw/slides/didi-04-sorting-large-tables/pages/`
- Media: none
- Summary: [[@2026__DiDi__Sorting Large Tables]]
- Pages created: [[@2026__DiDi__Sorting Large Tables]], [[外部マージソート]], [[キー正規化]]
- Pages updated: [[Torsten Grust]], [[Universität Tübingen]], [[DuckDB]]
- Key insight: In phase ➊, DuckDB's two-phase merge sort normalizes keys to a fixed length (the `FixedSortKey` struct), which enables fixed-length integer comparison (`LessThan`) that is faster than variable-length byte-string comparison. The runs, generated with a combination of Vergesort, Ska Sort, and Pattern-defeating QuickSort, are then T-way merged in phase ➋.

## [2026-07-20] ingest-slides | Managing Memory + Grouped Aggregation (DiDi Course #3)
- Source:
`https://blobs.duckdb.org/slides/DiDi-03.pdf` (16 pages)
- Visual pages: `.raw/slides/DuckDB-DiDi-03-Memory-GroupedAgg/pages/`
- Media: none
- Summary: [[@2026__DiDi__Managing Memory + Grouped Aggregation]]
- Pages created: [[@2026__DiDi__Managing Memory + Grouped Aggregation]], [[アウトオブコア処理]], [[ハッシュベースグループ集約]]
- Pages updated: [[Torsten Grust]], [[Universität Tübingen]], [[DuckDB]]
- Key insight: DuckDB's external grouped aggregation uses a paged intermediate data structure as the storage format for hash table entries. As a result it implements no GROUP BY-specific spilling logic and delegates to the memory manager's general out-of-core mechanism. The slides end just before the explanation of Phase 2 (per-partition aggregation); the continuation is likely covered in a later session of the same series.

## [2026-07-20] ingest-slides | The Query Performance Spectrum (DiDi Course #2)
- Source: `https://blobs.duckdb.org/slides/DiDi-02.pdf`
- Visual pages: `.raw/slides/duckdb-didi-02-query-performance-spectrum/pages/`
- Media: none
- Summary: [[@2026__DiDi__The Query Performance Spectrum]]
- Pages created: [[@2026__DiDi__The Query Performance Spectrum]]
- Pages updated: [[Torsten Grust]], [[DuckDB]], [[列指向OLAPデータベース]]
- Key insight: The run time of one simple aggregation query (sum of the quantity column of TPC-H lineitem) varies by 40x or more depending on implementation language and technique alone, from 1.60 s with awk to 0.04 s with C + mmap + multithreading. For DuckDB's SQL implementation (about 0.45 s), user time far exceeds real time, which shows that it parallelizes internally.

## [2026-07-20] ingest-slides | Welcome & Setup (Design and Implementation of DuckDB Internals, Lecture 1)
- Source: `https://blobs.duckdb.org/slides/DiDi-01.pdf`
- Visual pages: `.raw/slides/DiDi-01-Welcome-Setup/pages/`
- Media: none
- Summary: [[@2026__DuckDB__Welcome & Setup (DiDi Course, Lecture 1)]]
- Pages created: [[@2026__DuckDB__Welcome & Setup (DiDi Course, Lecture 1)]], [[Torsten Grust]], [[DuckDB Labs]]
- Pages updated: [[DuckDB]], [[Hannes Mühleisen]], [[Mark Raasveldt]]
- Key insight: The opening session, which introduces the overall scope of the 15-week lecture series "Design and Implementation of DuckDB Internals (DiDi)" by Torsten Grust (University of Tübingen). It introduces DuckDB's "zero copy" in-process design and the fact that the name comes from Wilbur, a duck kept by Hannes Mühleisen.

## [2026-07-20] ingest-paper | DuckDB: an Embeddable Analytical Database
- Source: `.raw/papers/28800.pdf` (SIGMOD '19, 4 pages, DOI
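10.1145/3299869.3320212)
- Summary: [[@2019__SIGMOD__DuckDB - an Embeddable Analytical Database]]
- Pages created: [[@2019__SIGMOD__DuckDB - an Embeddable Analytical Database]], [[Mark Raasveldt]], [[Hannes Mühleisen]], [[CWI]], [[DuckDB]], [[MonetDBLite]]
- Pages updated: [[列指向OLAPデータベース]]
- Key insight: A SIGMOD '19 demonstration paper from CWI ([[Mark Raasveldt]], [[Hannes Mühleisen]]). Starting from the problem that embedded databases such as SQLite are designed for OLTP and therefore perform poorly for OLAP, it presents DuckDB, a database designed from scratch for embedded analytical use that combines a parser, a cost-based optimizer, a vectorized interpreted execution engine, serializable MVCC derived from HyPer, and DataBlocks storage. The text confirms that the motivation was the problems caused by its predecessor [[MonetDBLite]] not being purpose-built, and that not adopting JIT compilation was a decision to prioritize portability. Added to the existing concept [[列指向OLAPデータベース]] (columnar OLAP databases), as cross-cutting insights, the difference in deployment form between server-process (ClickHouse) and embedded (DuckDB) systems, and "result-set transfer cost" as a performance axis specific to embedded systems. There are 0 embedded raster images, so no figures were embedded.
- Open questions: Does any quantitative evaluation compare embedded (DuckDB) and server-process (ClickHouse) columnar OLAP under identical conditions?

## [2026-07-20] ingest-paper | Niyama: Breaking the Silos of LLM Inference Serving
- Source: `.raw/papers/arxiv-2503.22562.pdf` (arXiv:2503.22562, submitted 2025-03-28, 12 pages)
- Summary: [[@2025__arXiv__Niyama - Breaking the Silos of LLM Inference Serving]]
- Pages created: [[@2025__arXiv__Niyama - Breaking the Silos of LLM Inference Serving]], [[Kanishk Goel]], [[Jayashree Mohan]], [[Nipun Kwatra]], [[Ravi Shreyas Anupindi]], [[Ramachandran Ramjee]], [[Sarathi-Serve]]
- Pages updated: [[Microsoft Research]], [[vLLM]], [[LLM推論]], [[Prefill-Decode分離]], [[LLMサービング管理]]
- Key insight: The user supplied the URL of a Microsoft Research publication page (title "QoServe: Breaking the Silos of LLM Inference Serving", accepted at ASPLOS 2026). Comparing abstracts confirmed it is the same paper as arXiv:2503.22562 (original title "Niyama" when submitted on 2025-03-28), and the arXiv version, from which the PDF and figures are easier to obtain, was ingested as the primary copy. The design removes the inefficiency of existing LLM serving, which depends on interactive/batch silos, through QoS-driven co-scheduling (dynamic chunking, hybrid EDF/SRPF prioritization, eager relegation) that extends the chunked-prefill scheduler of [[Sarathi-Serve]]. Whereas [[Prefill-Decode分離]] (prefill-decode disaggregation) follows the line of "separate and optimize each side", Niyama follows the opposite line of "keep them colocated and redistribute slack across shared infrastructure"; this contrast is captured across three concept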
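pages (LLM推論, Prefill-Decode分離, LLMサービング管理), where it was recorded as a cross-cutting insight. For figures, the vector-drawn graph (Figure 7) was cropped by caption coordinates with PyMuPDF and, together with 3 embedded raster images (Figure 1 right panel, Figure 3 architecture, Figure 6 dynamic chunking illustration), 4 images in total were embedded in the source page.
- Open questions: The interpolation parameter α of the hybrid prioritization is evaluated as a static deployment parameter; the effectiveness of tuning it automatically as load changes is unverified. The paper contains no direct comparison between Niyama (colocated QoS co-scheduling) and DistServe (physical separation).

## [2026-07-20] ingest | In-House LLM Serving at Netflix
- Source: `.raw/articles/in-house-llm-serving-at-netflix-2026-07-20.md` (Netflix TechBlog; fetched with curl+defuddle after working around a UA-based 403)
- Summary: [[@2026__Netflix TechBlog__In-House LLM Serving at Netflix]]
- Pages created: [[@2026__Netflix TechBlog__In-House LLM Serving at Netflix]], [[Triton Inference Server]], [[制約付きデコーディング]]
- Pages updated: [[Netflix]], [[vLLM]], [[TensorRT-LLM]], [[NVIDIA]], [[LLM推論]]
- Key insight: A production case study from the Netflix AI Platform team on operating LLM inference in house on top of the existing JVM-integrated serving system and Model Scoring Service (MSS)/Triton Inference Server. It reports the following: the summer 2026 migration from [[TensorRT-LLM]] to [[vLLM]] was justified by operational fit (custom model support, debuggability, research-to-production migration cost) rather than by performance benchmarks; the operational challenges of choosing between Triton's Python and vLLM backends and keeping versions aligned; a patch for the missing response_format in the OpenAI-compatible API; when to use Red-Black versus Versioned deployment strategies; and the elimination of tail latency by moving from vLLM V0 (GIL-bound per-request logits processor) to V1 (batch-level API). Downloaded in WebP format and embedded 3 figures (overall serving architecture, and the comparison of V0 sequential vs. V1 batched execution of the logits processor).
- Open questions: How long will the coexistence of Triton's vLLM and Python backends remain necessary as an escape hatch for custom models? Is the insufficient granularity of V1's BatchUpdate for partial prefill and preemption a constraint that generalizes to the logits-processor equivalents in other inference engines?

## [2026-07-20] ingest-paper | FailSafe: High-performance Resilient Serving
- Source: `.raw/papers/arxiv-2511.14116.pdf` (arXiv:2511.14116, submitted 2025-11-18, 13 pages)
- Summary: [[@2025__arXiv__FailSafe - High-performance Resilient Serving]]
- Pages created: [[@2025__arXiv__FailSafe - High-performance Resilient Serving]], [[Ziyi Xu]], [[耐障害LLMサービング]]
- Pages updated: [[Zhiqiang Xie]], [[Swapnil Gandhi]], [[Christos Kozyrakis]], [[Stanford University]], [[Shanghai Jiao Tong University]], [[ReCycle]], [[テンソル並列]], [[KVキャッシュ管理]], [[耐障害LLM訓練]]
- Key insight: The MLSys 2026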
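Oral talk page (https://mlsys.org/virtual/2026/oral/3856) was the starting point. The paper listed there under the title "RaidServe" was inaccessible because of the Cloudflare browser check on OpenReview, so a cross-author search on the arXiv API (Kozyrakis, Xie) was used to find "FailSafe: High-performance Resilient Serving" (arXiv:2511.14116), which was then ingested. The core novelty is a framework that splits GPU failures in tensor-parallel LLM serving into "recovery overhead" (KVCache recomputation, weight reloading) and "persistent imbalance overhead" (the granularity constraint on splitting attention heads across an irregular number of GPUs). The contrast in design philosophy with [[ReCycle]], the fault-tolerant training system presented at SOSP '24 by some of the same authors (Swapnil Gandhi, Christos Kozyrakis; Stanford), namely exploiting redundancy in training vs. evening out load distribution in serving, was recorded as a cross-cutting insight in the new concept [[耐障害LLMサービング]] (fault-tolerant LLM serving). For figures, embedded images with page numbers were extracted with pdfimages, and 7 representative figures (Cyclic KVCache Placement, Hybrid Attention, On-demand Weight Recovery, throughput under fault injection, throughput-latency curve, breakdown of balancing contributions, recovery latency CDF) were embedded in the source page.
- Open questions: How does an evaluation limited to a single node (8 GPUs, within NVLink) generalize to multi-node settings and whole-node failures? The authors' remark (§6) that MoE expert parallelism is more tolerant of partial GPU loss than TP is not verified in this paper itself.

## [2026-07-20] ingest-paper (duplicate detected, existing page updated) | RaidServe: High-performance Resilient Serving
- Source: `.raw/papers/80_RaidServe_High_performance_.pdf` (local PDF, 15 pages, typeset with the MLSys submission template)
- Summary: [[@2025__arXiv__FailSafe - High-performance Resilient Serving]] (no new page created; existing page updated)
- Pages created: none
- Pages updated: [[@2025__arXiv__FailSafe - High-performance Resilient Serving]]
- Key insight: The PDF supplied by the user, `80_RaidServe_High_performance_.pdf`, turned out to be another draft of the same paper as the arXiv paper already ingested earlier on 2026-07-20 ([[@2025__arXiv__FailSafe - High-performance Resilient Serving]], arXiv:2511.14116), with body text matching word for word (the only differences are the system name swap RaidServe↔FailSafe and the typesetting). The PDF footer reads "Proceedings of the 8th MLSys Conference ...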
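2025", but the paper does not appear in dblp's list of MLSys 2025 accepted papers, so this footer was judged likely to be boilerplate from the submission template. To avoid creating a duplicate source page, this PDF was recorded as an additional source in the frontmatter of the existing page, and the note on the MLSys correspondence was corrected with this verification result.
- Open questions: The actual venue (MLSys 2026 Oral, or a different peer-reviewed track) remains unconfirmed.

## [2026-07-20] ingest | Kimi K3: Open Frontier Intelligence
- Source: `.raw/articles/kimi-k3-2026-07-20.md` (URL: https://www.kimi.com/blog/kimi-k3; kimi.com is outside the sandbox's network allowlist, so direct retrieval with curl/defuddle was not possible, and the entry is based on a structured summary from WebFetch)
- Summary: [[@2026__Moonshot AI__Kimi K3 - Open Frontier Intelligence]]
- Pages created: [[@2026__Moonshot AI__Kimi K3 - Open Frontier Intelligence]], [[Kimi K3]], [[Kimi Delta Attention]], [[Attention Residuals]], [[Stable LatentMoE]]
- Pages updated: [[Moonshot AI]], [[Kimi Linear]], [[Mixture-of-Experts]]
- Key insight: [[Kimi K3]] (2.8 trillion total parameters, 1 million token context), announced by [[Moonshot AI]] on 2026-07-17, is billed as "the world's first open 3T-class model". The main technical continuity is that [[Kimi Delta Attention]] (KDA, a variant of Gated DeltaNet with improved channel-wise gating), introduced in [[Kimi Linear]] (48B, 2025-10), is combined with 512-head MLA and scaled up to the 2.8T class, making this a demonstration of roughly 58x scaling from 48B to 2.8T. [[Stable LatentMoE]] (16 of 896 experts active, sparsity 56) extends Kimi K2 (384 experts / 8 active, sparsity 48); its name resembles NVIDIA's LatentMoE (Nemotron 3), but whether the mechanisms are the same is unknown. Attention Residuals (AttnRes), Per-Head Muon, Quantile Balancing, and Sigmoid Tanh Unit (SiTU) are mentioned in the article by name only, and the concrete formulas and algorithms are not public (the technical report is scheduled for release on 2026-07-27). The source page notes explicitly that the source is an official blog (marketing-oriented) and that, because the entry rests on an AI-extracted summary, its rigor as a primary source is limited.
- Open questions: Is Stable LatentMoE the same design as NVIDIA's LatentMoE? What is the concrete mechanism of Attention Residuals? How does Per-Head Muon relate to MuonClip/Sharded Muon/NorMuon? What is Kimi K3's active parameter count (the article gives only the 2.8T total and no active count)? Verification after the technical report is released (scheduled for 2026-07-27) is essential.

## [2026-07-20] ingest-paper | Adversarial dynamical systems characterize when data-driven learning succeeds or fails
- Source: `.raw/papers/2026_Colbrook_Adversarial_dynamical_systems_characterize_when.pdf` (local PDF, 18 pages, Nature Communications (2026) 17:5397, DOI: 10.1038/s41467-026-74220-8)
- Summary: [[@2026__NatCommun__Adversarial dynamical systems characterize when data-driven learning succeeds or fails]]
- Pages created: [[@2026__NatCommun__Adversarial dynamical systems characterize when data-driven learning succeeds or fails]], [[Matthew J. Colbrook]], [[Igor Mezić]], [[Alexei Stepanenko]], [[Koopman作用素]], [[可解性複雑性指標]]
- Pages updated: [[UC Santa Barbara]], [[University of Cambridge]] (promoted from a lint stub to a full page)
- Key insight: A theoretical study by Matthew J. Colbrook (University of Cambridge), Igor Mezić (UC Santa Barbara), and colleagues. For the problem of learning the spectrum of the Koopman operator from data, it constructs adversarial dynamical systems to prove an impossibility result: unless two conditions hold together, measure preservation and a modulus of continuity, no single-limit algorithm (including probabilistic ones) can guarantee convergence with probability above 50%. When the conditions hold, an optimal algorithm with error guarantees can be constructed. The paper's central contribution is a complete classification of problem complexity through the Solvability Complexity Index (SCI), as a match between upper bounds (convergent algorithms) and lower bounds (impossibility). Applied to Arctic sea ice concentration data (1979-2021), it detects with error guarantees "hidden decaying modes" (concentrated in the Barents and Kara Seas) that the existing method EDMD buries under a large number of spurious eigenvalues, and it achieves long-term forecasts that beat the deep learning model IceNet and the dynamical model SEAS5 at far lower computational cost (under 1 second of training on a laptop). The paper also offers, as discussion, an analogy with LLM hallucination (the similarity between the Koopman operator of an adversarial system having a continuous frequency distribution and outputs diverging under tiny prompt changes); this was recorded separately as a speculative argument, not a proof of a rigorous correspondence. Koopman作用素 (Koopman operator) and 可解性複雑性指標 (Solvability Complexity Index) are concepts new to this wiki with only thin connections to existing concepts (an independent new area). Selected and embedded 6 figures: Fig.1 (method comparison, RAGE theorem), Fig.2 (hidden Arctic sea ice modes), Fig.4 (forecast benchmark), Fig.5/Fig.6 (SCI classification hierarchy), and Fig.9 (proof idea of the adversarial construction).

## [2026-07-20] ingest | Statistical Detection of LLM-Generated Text: Building an AIGC Classifier with TF-IDF+SVM
- Source: `.raw/articles/llm-classifier-2026-07-20.md` (URL: https://blog.lyc8503.net/en/post/llm-classifier/; WebFetch returned 403, so it was fetched with curl using a browser UA and cleaned up with defuddle)
- Summary: [[AI生成テキスト分類器]]
- Pages created: [[AI生成テキスト分類器]], [[lyc8503]], [[AITextDetector]], [[AI生成テキスト検知]]
- Pages updated: none
- Key insight: A personal blog post ([[lyc8503]]). Having found perplexity-based detection of AI-generated text impractical, the author switched to `TF-IDF` + `LinearSVC`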
sentence-level classification and built [[AITextDetector]], which combines by majority vote a binary classifier for each of 7 LLMs (gemini, qwen, GLM-5, kimi25, glm47, doubao, deepseek-v3.2). It reaches about 85% sentence-level accuracy and generalizes to unseen models with a detection rate of about 70% or higher. On real Lofter data the false positive rate is very low, 0.04% at a 60% threshold, while 32.22% of the platform's trending posts score above 50% on the AI score, a result that suggests undisclosed AI-generated content is widespread. The post also verifies that evasion by round-trip translation or by prompts that strip the "AI feel" has only a minor effect. The existing vault centers on the AIOps/failure-diagnosis domain, and this is the first source in the new topic area of AI-generated text detection, so connections to existing concepts are thin (an independent new concept was started).

## [2026-07-18] ingest-paper | The Too-Much-Talent Effect: Team Interdependence Determines When More Talent Is Too Much Versus Not Enough
- Source: `.raw/papers/The-too-much-talent-effect_-Team-interdependence-determines-when.pdf` (local PDF, 29 pages, version published in the August 2014 issue of Psychological Science, DOI: 10.1177/0956797614537280)
- Summary: [[@2014__PsychSci__The Too-Much-Talent Effect - Team Interdependence Determines When More Talent Is Too Much or Not Enough]]
- Pages created: [[@2014__PsychSci__The Too-Much-Talent Effect - Team Interdependence Determines When More Talent Is Too Much or Not Enough]], [[Roderick I. Swaab]], [[Michael Schaerer]], [[Eric M. Anicich]], [[Richard Ronay]], [[Adam D. Galinsky]], [[INSEAD]], [[過剰人材効果]], [[タスク相互依存性]]
- Pages updated: [[Columbia University]], [[Vrije Universiteit Amsterdam]], [[Singapore Management University]]
- Key insight: INSEAD's Roderick I.
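Swaab is first author of this empirical organizational-behavior study (Psychological Science, 2014). Using archival data from football (FIFA), basketball (NBA), and baseball (MLB), it shows that the relationship between the proportion of top talent and team performance differs with the level of task interdependence. In highly interdependent football and basketball an inverted-U curve appears, with the effect on performance turning negative beyond a talent ratio of 50%, whereas in less interdependent baseball performance keeps increasing monotonically. The paper's central contribution is a mediation analysis on NBA play-by-play data which establishes statistically, with the Sobel test and bootstrapping, that a decline in intra-team coordination mediates the too-much-talent effect. The affiliation in the author line (INSEAD) and the cover page of the SMU institutional repository (which gives Schaerer's affiliation as Singapore Management University) disagree, so both were recorded as facts on the Michael Schaerer entity page. Embedded image extraction with pdf.js yielded 5 of the 6 figures (Figures 1-4 and 6); only the vector-drawn Figure 5 (path diagram of the mediation model) was filled in with PyMuPDF caption-coordinate cropping, and all 6 figures were embedded in the source page.

## [2026-07-18] ingest-paper | MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
- Source: `.raw/papers/arxiv-2605.11333.pdf` (arXiv 2605.11333v3, full paper, 18 pages) + `.raw/slides/mlcommons-chakra-mlsys2026/mlcommons-chakra-mlsys2026.pdf` (MLSys 2026 presentation slides, 25 pages, all pages rendered as images)
- Summary: [[@2026__MLSys2026__MLCommons Chakra - Advancing Performance Benchmarking and Co-design using Standardized Execution Traces]]
- Pages created: [[@2026__MLSys2026__MLCommons Chakra - Advancing Performance Benchmarking and Co-design using Standardized Execution Traces]], [[MLCommons Chakra]], [[MLCommons]], [[Georgia Institute of Technology]], [[Tushar Krishna]], [[Srinivas Sridharan]], [[ASTRA-sim]], [[実行トレース]]
- Pages updated: [[NVIDIA]], [[AMD]], [[vLLM]], [[Prefill-Decode分離]], [[KVキャッシュ管理]]
- Key insight: The user originally asked for the ContextPilot paper to be ingested, giving `https://mlsys.org/virtual/2026/oral/3742` (paper: `https://openreview.net/pdf?id=s2WcSv2Hzt`), but this was a mix-up: WebFetch confirmed that this URL actually points to the MLCommons Chakra paper (29 authors including Srinivas Sridharan and Tushar Krishna), and the request was corrected. The OpenReview PDF could not be fetched directly because of Cloudflare protection, so a title search on the arXiv API identified the preprint (2605.11333), which was fetched instead. Centered on Chakra ET, a standard execution-trace representation for distributed AI/ML workloads, the paper reports an MLCommons-endorsed ecosystem with host/device trace merging via the Trace Linker/Converter and three uses: trace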
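analysis/replay/simulation-emulation as the three uses of the MLCommons-endorsed ecosystem it reports. Two points were connected to the existing [[Prefill-Decode分離]] and [[KVキャッシュ管理]] concepts: Chakra is standardized through a working group with more than 40 participating companies and organizations, and its vLLM integration quantifies MoE token routing, KV cache offloading, and KV transfer across Prefill-Decode disaggregation. Six embedded/cropped figures from the paper and three original diagrams from the slides (PyTorch trace-collection implementation flow, Hardware-in-the-Loop emulation measurements, six-step Chakra ecosystem diagram) were embedded in the source page.

## [2026-07-18] ingest-paper | OpsMem: Dual-Memory Reasoning with Cross-Memory Resonance for Failure Diagnosis
- Source: `.raw/papers/arxiv-2607.11357.pdf` (arXiv 2607.11357v1, full paper, 6 pages)
- Summary: [[@2026__arXiv__OpsMem - Dual-Memory Reasoning with Cross-Memory Resonance for Failure Diagnosis]]
- Pages created: [[@2026__arXiv__OpsMem - Dual-Memory Reasoning with Cross-Memory Resonance for Failure Diagnosis]], [[OpsMem]], [[Rongchen Gao]], [[Qingyi Guo]], [[Yaoliang Wu]]
- Pages updated: [[Yongqian Sun]], [[Yu Luo]], [[Wenwei Gu]], [[Shenglin Zhang]], [[Dan Pei]], [[Qiuai Fu]], [[Nankai University]], [[Tsinghua University]], [[Huawei Technologies]], [[エージェントメモリ]], [[仮説駆動RCA]], [[LLMによる根本原因分析]]
- Key insight: Reports OpsMem, a dual-memory framework for failure diagnosis from Nankai University, Tsinghua University, and Huawei Technologies. It dynamically couples short-term memory (STM, following GoS's belief-state abstraction) with long-term memory (LTM, a graph of patterns, cases, and procedures) through cross-memory resonance. On a dataset of 120 real-world microservice failures at Huawei it outperforms every baseline, both agentic reasoning (ReAct, GoS) and knowledge-augmented (GoS+VectorRAG/GraphRAG/LinearRAG), by +6.66 to 25.00 pt in Match over the strongest baseline. It adds a new axis to the existing [[エージェントメモリ]] concept, "trigger conditions that recompute retrieval on every state change", and two items to the [[LLMによる根本原因分析]] concept: "the shift from static RAG to state-conditioned dynamic retrieval" and the "experience distiller" as a new category of LLM role differentiation. The existing unresolved contradiction, in which co-author Wenwei Gu appears under the same name at both CUHK ([[LLMPrism]]) and Nankai, was reinforced with this paper as the ninth continuing co-authored paper on the Nankai side. Because pdf.js embedded-image extraction could not recover the figures (vector-drawn architecture diagrams), Figures 1-5 of the paper were all cut out with PyMuPDF caption-coordinate crops and embedded in the source page.

## [2026-07-18] ingest-paper | ContextPilot: Fast Long-Context Inference via Context Reuse
- Source: `.raw/papers/arxiv-2511.03475.pdf` (arXiv 2511.03475v4, full paper, 21 pages) + `.raw/slides/contextpilot-mlsys2026/contextpilot-mlsys2026.pdf` (MLSys 2026 presentation slides, 25 pages, all pages rendered to images)
- Summary: [[@2026__MLSys2026__ContextPilot - Fast Long-Context Inference via Context Reuse]]
- Pages created: [[@2026__MLSys2026__ContextPilot - Fast Long-Context Inference via Context Reuse]], [[ContextPilot]]
- Pages updated: [[University of Edinburgh]], [[LMCache]], [[CacheBlend]], [[Mem0]], [[KVキャッシュ管理]]
- Key insight: Fetching the PDF via OpenReview failed on Cloudflare's browser verification (Turnstile), so an arXiv API search identified the preprint (2511.03475), which was fetched instead. The MLSys 2026 slide URL (`mlsys.org/media/mlsys-2026/Slides/...`) could be downloaded directly. The paper reports that it sidesteps a dilemma in existing methods, namely the low reuse rate of exact-match prefix caching (RadixCache, LMCache) versus the accuracy loss (9 to 11%) of approximate KV matching (CacheBlend), with a design that aligns, deduplicates, and annotates priority order at the level of context blocks (retrieved documents, agent memory, and the like) rather than KV values. The accuracy drop caused by alignment is small (0.1 to 3.3%), which is supported by a replication of the DEmO order-sensitivity study showing that a modern LLM (GPT-5.1) has largely lost its sensitivity to input order. The discrepancy between CacheBlend's own reported accuracy loss (0.01 to 0.03 in F1/Rouge-L) and this paper's observation (9 to 11% degradation) was recorded as a contradiction callout on both the CacheBlend entity and source pages. One embedded attention-map figure from the paper and six original diagrams from the slides (existing-method trade-offs, overview of the three mechanisms, context-index distance function, effect of alignment and annotation, system architecture, results summary) were embedded in the source page.

## [2026-07-17] ingest | Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs
- Source: `.raw/articles/large-scale-ep-2025-05-05.md` (LMSYS Blog, URL ingest)
- Summary: [[@2025__LMSYS Blog__Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs]]
- Pages created: [[@2025__LMSYS Blog__Deploying DeepSeek with PD Disaggregation and Large-Scale Expert Parallelism on 96 H100 GPUs]], [[DeepGEMM]], [[EPLB]]
- Pages updated: [[SGLang]], [[DeepEP]], [[DeepSeek-V3]], [[LMSYS]], [[Prefill-Decode分離]], [[Mixture-of-Experts]], [[並列化戦略]], [[負荷分散]]
- Key insight: The SGLang team deployed a DeepSeek-V3-class model on 96 H100 GPUs (12 nodes) with PD Disaggregation plus large-scale Expert Parallelism and, as an open-source implementation, came close to the figures in DeepSeek's official blog for the first time (up to 3.3x on Prefill and up to 5.2x on Decode versus a TP16 baseline; 94% of the official profile on Prefill and roughly on par on Decode with half the number of nodes). Dense FFN layers use DP rather than TP because of a hardware constraint (the intermediate dimension of 18,432 cannot satisfy alignment under TP32); DeepEP's Normal and Low-Latency Dispatch modes are used selectively under PD disaggregation; and EPLB 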
raises throughput by 1.49x for Prefill and 2.54x for Decode. These three points were connected to the existing [[Mixture-of-Experts]], [[並列化戦略]], [[Prefill-Decode分離]], and [[負荷分散]] concepts.

## [2026-07-17] ingest-paper (update) | Machine Learning Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput
- Source: `.raw/papers/arxiv-2502.06982.pdf` (arXiv 2502.06982v2, full paper, 13 pages) + `.raw/slides/mlsys2026-3734/slides.pdf` (previously ingested, MLSys 2026 presentation slides, 32 pages)
- Summary: [[@2026__MLSys2026__Machine Learning Fleet Efficiency - Improving TPU Systems at Scale with ML Productivity Goodput]]
- Pages created: none (updates to existing source/concept pages)
- Pages updated: [[@2026__MLSys2026__Machine Learning Fleet Efficiency - Improving TPU Systems at Scale with ML Productivity Goodput]], [[ML Productivity Goodput]]
- Key insight: At the initial ingestion on 2026-07-02 the paper PDF could not be fetched because of OpenReview's Cloudflare protection, so the slides were the only source; this time the arXiv version (2502.06982) was located and the full paper was fetched. This revealed that Program Goodput's "predicted step time" is computed by static analysis of the HLO graph (independent of compiler decisions). Details were added, including the concrete result that communication-computation overlap yields 1.38x throughput and 72% FLOPS utilization for a 500B-parameter LLM on 1024 TPU chips, and the XTAT autotuner (evaluated on 150 models). Seven figures in total were embedded in the source page: embedded figures from the paper (accelerator-type trends, system stack, MPG formula) and original diagrams from the slides (three-column comparison of pitfalls in conventional metrics, roofline vs fusion explainer, integrated timeline of four runtime optimizations).

## [2026-07-17] ingest-paper | A New Golden Age for Computer Architecture
- Source: `.raw/papers/cacm19golden-age.pdf` (CACM, February 2019, by John L. Hennessy and David A. Patterson; the CACM version of the 2017 ACM Turing Lecture)
- Summary: [[@2019__CACM__A New Golden Age for Computer Architecture]]
- Pages created: [[@2019__CACM__A New Golden Age for Computer Architecture]], [[John L. Hennessy]], [[RISC-V]], [[ドメイン固有アーキテクチャ]], [[ムーアの法則とデナードスケーリングの終焉]]
- Pages updated: [[David A. 
Patterson]], [[Google]], [[VLIW]], [[メモリウォール]]
- Key insight: Looks back on ISA history from the IBM System/360 to RISC-V from the standpoint of authors who themselves developed RISC-I and MIPS, and quantifies how the end of Moore's Law and Dennard scaling lowered the annual performance growth of general-purpose processors in stages: 22%/year in the CISC era → 52%/year in the RISC era → 23%/year in the multicore era → 12%/year in the Amdahl era → a projected 3%/year. It presents three paths to the next golden age: domain-specific architectures (Google TPU v1 being 29x faster than a general-purpose CPU with over 80x the energy efficiency), open ISAs (RISC-V), and agile hardware development. It was connected to the existing [[VLIW]] and [[メモリウォール]] concepts by supplying quantitative backing, such as the history of the Itanium/EPIC failure and the waste from speculative execution (19% on average).

## [2026-07-16] ingest-slides | LLM高速化(勉強会)
- Source: `.raw/slides/llm-kosokuka-benkyoukai/llm-kosokuka-benkyoukai.pdf` (SpeakerDeck, 50 pages in total, author SuperHotDog)
- Visual pages: `.raw/slides/llm-kosokuka-benkyoukai/pages/`
- Media: none (no audio or video was provided)
- Summary: [[@2026__SpeakerDeck__LLM高速化(勉強会)]]
- Pages created: [[@2026__SpeakerDeck__LLM高速化(勉強会)]], [[SuperHotDog]], [[PagedAttention]], [[Speculative Decoding]], [[CUDAGraph]]
- Pages updated: [[vLLM]], [[KVキャッシュ管理]], [[FlashAttention]], [[Grouped-Query Attention]], [[Multi-Head Latent Attention]], [[線形注意]], [[スライディングウィンドウアテンション]], [[Prefill-Decode分離]], [[GPU最適化]], [[カーネルフュージョン]], [[混合精度訓練]]
- Key insight: Study-group material on accelerating LLM inference. It covers, end to end, algorithmic speedups (KVCache, FlashAttention, PagedAttention, Speculative Decoding), implementation with CUDA/Triton/CuTe, architectural techniques (GQA/MLA/Sliding/Linear Attention), quantization (including the Ozaki Scheme), the Nsight profiler, CUDAGraph, and vLLM internals, and demonstrates in a hands-on session with Qwen2.5-0.5B a 15.88x speedup from plain Transformers inference to vLLM inference. It was connected to the existing academically grounded concepts (FlashAttention, KVキャッシュ管理, GQA, MLA, and others) by supplying quantitative backing, such as the KVCache size-estimation formula and worked calculations of the MLA low-rank compression ratio.

## [2026-07-16] ingest | ISC26 Recap
- Source: `.raw/articles/isc26-recap-2026-07-16.md` (Glenn K. Lockwood Blog, URL ingest, 5 images attached)
- Summary: [[@2026__Glenn K. Lockwood Blog__ISC26 Recap]]
- Pages created: [[@2026__Glenn K. Lockwood Blog__ISC26 Recap]], [[LineShine]], [[Top500]], [[IO500]], [[Sugon]], [[ParaStor F9000]], [[Yutong Lu]], [[James Lin]], [[Weicheng Huang]], [[主権AI]]
- Pages updated: [[Glenn K. 
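Lockwood]], [[Lustre]], [[Shanghai Jiao Tong University]], [[ヨーロッパのAI主権]]
- Key insight: A trip report on ISC 2026. China's all-CPU (Arm) supercomputer [[LineShine]] took first place on the Top500, and Sugon's ParaStor F9000 surpassed DAOS on the IO500, showing the maturity of the Chinese HPC stack in both compute and storage. It also reports that the US government's cutoff of foreign nationals' access to Anthropic models on June 12, 2026 triggered worldwide momentum for investment in sovereign AI infrastructure.

## [2026-07-16] ingest-paper | AI 2040: Plan A — The Deal
- Source: `.raw/papers/AI-2040.pdf` (AI Futures Project, 90 pages; provided directly by the user as a local file)
- Summary: [[@2026__AI Futures Project__AI 2040 - Plan A - The Deal]]
- Pages created: [[@2026__AI Futures Project__AI 2040 - Plan A - The Deal]], [[AI Futures Project]], [[Daniel Kokotajlo]], [[AI国際検証レジーム]], [[権力集中リスク]]
- Pages updated: [[知能爆発]], [[テイクオフ速度論争]]
- Key insight: A policy-scenario document from [[AI Futures Project]] (the same team behind the existential-risk warning scenario "AI 2027", published in 2025). In timeline form, it depicts a success scenario in which "Plan A", an international verification regime consisting of full research transparency, compute declarations, training pauses, and mutually assured compute destruction (MACD), deliberately postpones the default trajectory until 2040 (the default being a hard-takeoff-like assumption that superintelligence is reached within the year after AI R&D becomes fully automated in 2030). Its distinctive features: it centers on "power-concentration risk" (the risk of an irreversible dictatorship in which a few individuals or companies effectively control an army of superintelligences), treated as independent of loss of control through misalignment; the authors themselves compare the alternative plans B (Sabotage), C (Slowdown), D (Race), and S (Shutdown) with probability estimates; it analyzes the probability of detecting a covert Chinese AGI program (the probability of reaching TED-AI undetected stays below 10% through 2043); and it self-critically discloses the worst failure mode that recurred in the authors' own tabletop exercises (approval of a flawed safety case). To the existing concepts [[知能爆発]] and [[テイクオフ速度論争]] it connects a new perspective that shifts the framing from the theoretical question of "whether and how an intelligence explosion happens" to a governance control variable: "when and at what speed it is allowed to happen".

## [2026-07-15] ingest-paper | Scalable and Energy-Efficient AI: System-Level Profiling of NVIDIA GPU Clusters for Distributed LLM Training
- Source: `.raw/papers/mdpi-ai7070232.pdf` (*AI* (MDPI) 2026, 7(7), 232, DOI: 10.3390/ai7070232, published 2026-06-23; mdpi.com's Akamai bot protection refused automated retrieval, so a PDF supplied manually by the user was ingested)
- Summary: [[@2026__AI__Scalable and Energy-Efficient AI - System-Level Profiling of NVIDIA GPU Clusters for Distributed LLM Training]]
- Pages created: [[@2026__AI__Scalable and Energy-Efficient AI - System-Level Profiling of NVIDIA GPU Clusters for Distributed LLM Training]], [[Muhammad Ali Shafique]], [[Imran Latif]], [[Hayat Ullah]], [[Alex C. 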
Newkirk]], [[Arslan Munir]], [[Kansas State University]], [[Johnson Controls]], [[Florida Atlantic University]], [[Lawrence Berkeley National Laboratory]], [[GPUエネルギー効率]]
- Pages updated: [[@2026__IPDPS__Beyond Throughput - Performance and Energy Insights of LLM Inference Across AI Accelerators]]
- Key insight: An empirical study that compares single-node 8×NVIDIA H100 and 8×NVIDIA B200 under controlled conditions for DDP training of five LLMs (Mistral-7B-v0.3, LLaMA-3.1-8B, Mistral-NeMo-Base-2407, Gemma-2-27B, Qwen2.5-32B) and three VLMs (X-CLIP, EVL, Vita-CLIP). B200 achieves 1 to 6% higher GPU utilization, up to 15% shorter training time, and up to 32% higher TFLOPs/GPU, yet falls below H100 on TFLOPs/kW and tokens-per-kilojoule for all five LLMs, which demonstrates "compute–energy misalignment" through measurement. Facility-scale modeling (2000 nodes / 5000 nodes) shows that B200 incurs an annual energy-cost excess of +$0.62M under medium load and +$4.26M under high load. The newly created concept [[GPUエネルギー効率]] records, as a cross-cutting finding with the existing [[@2026__IPDPS__Beyond Throughput - Performance and Energy Insights of LLM Inference Across AI Accelerators]] (inference phase), that the proposition "a throughput advantage does not imply an energy-efficiency advantage" has been confirmed independently in both the training and inference phases.

## [2026-07-15] ingest-paper | Can Large Language Models Generate Observability-Aware Code?
- Source: `.raw/papers/arxiv-2607.05785v1.pdf` (arXiv:2607.05785v1, submitted 2026-07-07)
- Summary: [[@2026__arXiv__Can Large Language Models Generate Observability-Aware Code?]]
- Pages created: [[@2026__arXiv__Can Large Language Models Generate Observability-Aware Code?]], [[Yongliang Tao]], [[Pengfei Gao]], [[Zhiyu Fan]], [[Jue Zhang]]
- Pages updated: [[Hongyu Zhang]], [[Chongqing University]], [[Minghua Ma]], [[Qingwei Lin]], [[Saravan Rajmohan]], [[Si Qin]], [[Liqun Li]], [[Yu Kang]], [[Microsoft]], [[オブザーバビリティ]], [[コーディングエージェント評価]], [[ログ生成]], [[障害注入]], [[バイブコーディング]]
- Key insight: The first systematic study to evaluate empirically the observability of code generated by coding agents (GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash) along two axes: a source-level reconstruction study on 1,223 instances from 18 repositories (Position F1, KeyBag F1), and a runtime evaluation in which 200 agent-generated microservices were deployed on Kubernetes and 13 fault types were injected with Chaos Mesh (1,615 cases, Fault Signals Rate). Across all prompting strategies and all models, Position F1 consistently exceeds KeyBag F1, showing that reproducing "what to record" is systematically weaker than reproducing "where to instrument". It confirms a Quantity over Quality phenomenon, in which explicit instruction more than doubles the amount generated (2.1→4.9 statements/instance) at the cost of Precision, and finds that few-shot prompting improves both metrics, driven by Recall. At runtime FSR stays at 4.95 to 13.99%, which quantifies with a conservative metric that "logs are generated in volume but lack explicit fault-specific semantics". A lightweight observability-oriented skill extracted from about 200 real failure-fix commits improves FSR, Position F1, and KeyBag F1, but the effect is limited (GPT-5.5: FSR +8.67pp, Claude Opus 4.8: +0.99pp, Gemini 3.5 Flash: +2.54pp).

## [2026-07-15] ingest | Recursive Self-Improvement (LessWrong)
- Source: `.raw/articles/recursive-self-improvement-2026-07-15.md` (LessWrong, 2008-12-01)
- Summary: [[@2008__LessWrong__Recursive Self-Improvement]]
- Pages created: [[@2008__LessWrong__Recursive Self-Improvement]], [[Eliezer Yudkowsky]], [[Robin Hanson]], [[I. J. 
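Good]], [[知能爆発]], [[テイクオフ速度論争]], [[リソースオーバーハング]]
- Pages updated: [[Recursive Self-Improvement]]
- Key insight: The original argument for the "AI go FOOM" thesis that [[Eliezer Yudkowsky]] presented in 2008. It decomposes causality into five levels (metacognitive/cognitive/metaknowledge/knowledge/object level), defines as "true recursion" the phenomenon in which the metacognitive and object levels become identical the moment an AI is tasked with improving its own memory-retrieval algorithm, and argues that "directly rewriting one's own source code" must be clearly distinguished from "inventing agriculture". Its mathematical argument holds that folding a complex optimization chain back onto itself through recursion should in theory yield "either flatlining or explosion", and that a soft takeoff requires the narrow condition of "an exactly convenient law of diminishing returns". Contrasted with the harness-level incremental self-improvement (indirect RSI) reported in 2026 by [[@2026__Lil'Log__Harness Engineering for Self-Improvement]], this brought an 18-year round trip between theory and practice into the [[Recursive Self-Improvement]] concept.

## [2026-07-15] ingest-paper | Speculations Concerning the First Ultraintelligent Machine
- Source: `.raw/papers/Good1964.pdf` (Advances in Computers, Vol. 6, Academic Press, 1965)
- Summary: [[@1965__AdvComput__Speculations Concerning the First Ultraintelligent Machine]]
- Pages created: [[@1965__AdvComput__Speculations Concerning the First Ultraintelligent Machine]]
- Pages updated: [[知能爆発]], [[Recursive Self-Improvement]], [[I. J. Good]]
- Key insight: The 1965 primary paper in which [[I. J. Good]] first explicitly formulated the "intelligence explosion", starting from his definition of an "ultraintelligent machine". The existing wiki already mentioned Good's 1965 paper (entity [[I. J. 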
Good]], concepts [[知能爆発]] and [[Recursive Self-Improvement]]), but the paper itself had not been ingested and was explicitly marked "未 ingest" (not yet ingested); this entry fills that gap. It records three cross-cutting findings: the paper's own discussion of the intelligence explosion is only a few paragraphs long and lacks a mechanistic explanation; most of the paper is devoted to a separate speculation, a unified account of memory, recall, and semantics through a "subassembly theory" that modifies Hebb's cell-assembly theory; and there is a division of roles between Good (formulating the concept) and Yudkowsky 43 years later (formulating the mechanism).

## [2026-07-15] ingest | Harness Engineering for Self-Improvement (Lil'Log)
- Source: `.raw/articles/harness-engineering-for-self-improvement-2026-07-15.md` (Lil'Log, 2026-07-04)
- Summary: [[@2026__Lil'Log__Harness Engineering for Self-Improvement]]
- Pages created: [[@2026__Lil'Log__Harness Engineering for Self-Improvement]], [[Lilian Weng]], [[Recursive Self-Improvement]], [[ハーネス自己進化]], [[進化的探索によるエージェント設計]]
- Pages updated: [[Harness Engineering]], [[コンテキストエンジニアリング]], [[Andrej Karpathy]]
- Key insight: A unifying review that positions the near-term path to [[Recursive Self-Improvement]] (RSI) not as direct rewriting of model weights but as an indirect loop of improving the training pipeline and the deployment system (the harness). It connects existing [[Harness Engineering]] practice (OpenAI, Anthropic) to three lines of academic research: ACE/MCE (turning context engineering into evolving playbooks), Meta-Harness/Self-Harness/AHE (self-evolution of the harness code itself), and ADAS/AFlow/AlphaEvolve (evolutionary search). It records as a cross-cutting finding that the observation from STOP (2023), "recursive improvement degrades with weak models", is still implicitly assumed by each of the 2026 methods.

## [2026-07-15] ingest-paper | Valet: Efficient Data Placement on Modern SSDs
- Source: `.raw/papers/2026_Unknown_Valet_Efficient_Data_Placement_Modern.pdf` (ACM Symposium on Cloud Computing, SoCC '25, 2025-11-19 to 21)
- Summary: [[@2025__SoCC__Valet - Efficient Data Placement on Modern SSDs]]
- Pages created: [[@2025__SoCC__Valet - Efficient Data Placement on Modern SSDs]], [[Devashish R. Purandare]], [[Peter Alvaro]], [[Avani Wildani]], [[Darrell D. E. Long]], [[Ethan L. 
Miller]], [[Valet]], [[MongoDB]], [[CacheLib]], [[zenfs]], [[f2fs]], [[Pure Storage]], [[ホスト誘導データ配置]], [[シムレイヤー]], [[ゾーン名前空間SSD]]
- Pages updated: [[UC Santa Cruz]], [[Emory University]], [[Cloudflare]], [[RocksDB]], [[LSMツリー]]
- Key insight: Shows, with three applications of different characteristics (RocksDB, MongoDB, CacheLib), that an LD_PRELOAD-based userspace shim layer alone, with no changes to the application, file system, or kernel, can inject placement hints along two axes, affinity and lifetime. This achieves 2 to 6x the throughput and up to 6x lower tail latency than f2fs, matching an application-specific solution (zenfs) while being more broadly applicable. The novelty lies in presenting the two axes of affinity/lifetime, rather than temperature, as a general theory of data placement, and in demonstrating both heuristic and learning-based (mini KMeans) hint generation on the same architecture. Because all figures are vector-drawn in the original, they were extracted with PyMuPDF caption-coordinate crops (pdf.js embedded-raster extraction could not recover them).

## [2026-07-15] ingest-paper | The Anatomy of a Large-Scale Hypertextual Web Search Engine
- Source: `.raw/papers/Brin98Anatomy.pdf` (Computer Networks and ISDN Systems 30 (1998) 107-117 / WWW7 1998)
- Summary: [[@1998__Computer Networks__The Anatomy of a Large-Scale Hypertextual Web Search Engine]]
- Pages created: [[@1998__Computer Networks__The Anatomy of a Large-Scale Hypertextual Web Search Engine]], [[Sergey Brin]], [[Lawrence Page]], [[PageRank]]
- Pages updated: [[Stanford University]], [[Google]]
- Key insight: Google's founding paper as a search engine. Built around link-structure-derived PageRank (`PR(A)=(1-d)+d·ΣPR(Ti)/C(Ti)`, the random-surfer model) and the indexing of anchor text under the link target, it reports measured performance of indexing 24 million pages in under one week (compressed repository 53.5GB, full inverted index 37.2GB, 108.7GB in total). It introduces information retrieval and PageRank as a new domain in this wiki. The PDF is scan-derived with heavy OCR noise (author names are misrecognized as, for example, "S. Brftz. L. 
Pup" and similar), so it was quoted with a note to that effect. The Fig. 1 architecture diagram was recovered by direct image extraction with PyMuPDF (by xref).

## [2026-07-14] ingest-slides | 言語モデルの内部機序：解析と解釈 (NLP2025 tutorial)
- Source: `.raw/slides/NLP_2025_interpretability_tutorial__E68F90E587BAE78988_-/NLP_2025_interpretability_tutorial__E68F90E587BAE78988_-.pdf` (SpeakerDeck, 2025-03-10)
- Visual pages: `.raw/slides/NLP_2025_interpretability_tutorial__E68F90E587BAE78988_-/pages/` (144 pages in total)
- Media: none (no transcript)
- Summary: [[@2025__SpeakerDeck__言語モデルの内部機序：解析と解釈]]
- Pages created: [[@2025__SpeakerDeck__言語モデルの内部機序：解析と解釈]], [[Benjamin Heinzerling]], [[横井祥]], [[小林悟郎]], [[理化学研究所]], [[東北大学]], [[国立国語研究所]], [[SAE]], [[活性化パッチング]], [[言語モデルのプロービング]]
- Pages updated: [[Anthropic]], [[機構的解釈性]], [[プラトン的表現仮説]], [[モデル表現収束]], [[ロジットレンズ]], [[帰納ヘッド]], [[アテンションヘッド]]
- Key insight: A tutorial that organizes the understanding of language-model internals into a three-stage framework, "analysis of internal representations → analysis of computation → mapping to language, the world, and knowledge (interpretation)", and then goes as far as methodological self-reflection: it undermines the underlying assumption of "locality and one-to-one correspondence" with counterexamples, namely SAE feature absorption, multiple equivalent circuits, and the coexistence of multiple actual computational mechanisms.

## [2026-07-14] ingest-paper | OpenRCA 2.0: From Outcome Labels to Causal Process Supervision
- Source: `.raw/papers/arxiv-2606.27154.pdf` (arXiv:2606.27154, 2026)
- Summary: [[@2026__arXiv__OpenRCA 2.0 - From Outcome Labels to Causal Process Supervision]]
- Pages created: [[@2026__arXiv__OpenRCA 2.0 - From Outcome Labels to Causal Process Supervision]], [[Yifan Yang]], [[Jin'ao Shang]], [[Qisheng Lu]], [[Rui Wang]], [[Songhan Zhang]], [[Yuzhong Zhang]], [[Boxi Yu]]
- Pages updated: [[Aoyang Fang]], [[Pinjia He]], [[Junjielong Xu]], [[The Chinese University of Hong Kong, Shenzhen]], [[OpenRCA]], [[RCA評価設計]], [[因果発見]], [[障害注入]]
- Key insight: To address the limitation of existing RCA benchmarks that carry only root-cause labels, OpenRCA 2.0 builds the first cross-system benchmark (500 instances) that also carries verified causal propagation paths, using PAVE, a stepwise causal labeling procedure that exploits the known intervention do(v_root) at fault-injection time. Across 11 state-of-the-art LLMs it shows a 14.5pp gap between AnySvc (76.0%), which only requires naming the correct service, and Path Reachability (61.5%), which requires backing the answer with a verified path, and it defines this gap as "ungrounded diagnosis". It quantifies, without human labels or an LLM judge, that outcome-only evaluation hides this failure mode, using only graph-shape metrics (Path Reachability, Node 
F1, Edge F1); this is the paper's novelty.

## [2026-07-14] ingest-paper | A Survey of DevOps Concepts and Challenges
- Source: `.raw/papers/Leite-et-al.-2019---A-Survey-of-DevOps-Concepts-and-Challenges.pdf` (ACM Computing Surveys, Vol. 52, No. 6, Article 127, 2019)
- Summary: [[@2019__ACM CSUR__A Survey of DevOps Concepts and Challenges]]
- Pages created: [[@2019__ACM CSUR__A Survey of DevOps Concepts and Challenges]], [[Leonardo Leite]], [[Carla Rocha]], [[Fabio Kon]], [[Paulo Meirelles]], [[University of São Paulo]], [[University of Brasília]], [[Federal University of São Paulo]]
- Pages updated: [[DevOps]], [[Dejan Milojicic]], [[Hewlett Packard Labs]]
- Key insight: This 2019 ACM CSUR survey states explicitly that DevOps still lacks a widely agreed definition after nearly ten years of discussion, builds a conceptual framework of four categories (process/people/delivery/runtime) with a Grounded Theory-like method, and points out that earlier DevOps SLRs had neglected delivery/runtime (the technical implications). It classifies DevOps tools into seven categories mapped to actors, goals, and related concepts, and had already incorporated Site Reliability Engineering into the literature as an evolution of the operations-engineer role.

## [2026-07-14] ingest | The Origins of DevOps: What's in a Name?
- Source: `.raw/articles/the-origins-of-devops-whats-in-a-name-2026-07-14.md` (devops.com, 2018-01-25)
- Summary: [[@2018__devops.com__The Origins of DevOps - What's in a Name]]
- Pages created: [[@2018__devops.com__The Origins of DevOps - What's in a Name]], [[Paul Hammond]], [[Gene Kim]], [[Kevin Behr]], [[George Spafford]]
- Pages updated: [[DevOps]], [[Patrick Debois]], [[Andrew Clay Shafer]], [[John Allspaw]]
- Key insight: Confirmed that the facts about the origins of DevOps (the Agile Infrastructure BoF, the Flickr talk at Velocity 2009, the founding of Devopsdays) agree between [[@2026__mizzy.org__DevOpsとは何だったのか]] and this independent source. Added the supplementary fact that the 2013 book The Phoenix Project (Gene Kim, Kevin Behr, George Spafford) popularized DevOps in the form of a business novel.

## [2026-07-14] ingest | DevOpsとは何だったのか
- Source: `.raw/articles/devops-towa-nan-datta-noka-2026-07-14.md` (mizzy.org, 2026-07-13)
- Summary: [[@2026__mizzy.org__DevOpsとは何だったのか]]
- Pages created: [[@2026__mizzy.org__DevOpsとは何だったのか]], [[DevOps]], [[Patrick Debois]], [[John Willis]], [[Andrew Clay Shafer]], [[Gosuke Miyashita]]
- Pages updated: [[SRE]], [[DORA]], [[プラットフォームエンジニアリング]], [[ChatOps]], [[John Allspaw]]
- Key insight: DevOps was originally "a cultural movement to dissolve the organizational divide between development and operations departments", but it came to be consumed through an Automation-heavy reading of CAMS and as tool names and job titles, and was eventually decomposed into the independent fields of Infrastructure as Code, CI/CD, DORA, and ChatOps (the 2025 renaming of DORA being symbolic). The same "reduction from culture to job title and technology layer" is now being repeated for SRE through the formulation "class SRE implements DevOps".

## [2026-07-13] ingest-paper | Bridging Edge and Cloud: A Knowledge-Enhanced Framework for Efficient Time Series Anomaly Detection
- Source: `.raw/papers/Bridging_Edge_and_Cloud_A_Knowledge-Enhanced_Framework_for_Efficient_Time_Series_Anomaly_Detection.pdf` (IEEE TSC 2025)
- Summary: [[@2025__TSC__Bridging Edge and Cloud - A Knowledge-Enhanced Framework for Efficient Time Series Anomaly Detection]]
- Pages created: [[@2025__TSC__Bridging Edge and Cloud - A Knowledge-Enhanced Framework for Efficient Time Series Anomaly Detection]], [[RefinedEdge]], [[Jiacheng Zhang]], [[Guohua Liu]], [[Shiqi Chen]], [[Yutong Chen]]
- Pages updated: [[Shenglin Zhang]], [[Yongqian Sun]], 
[[Dan Pei]], [[Minghua Ma]], [[Chenyu Zhao]], [[Nankai University]], [[Alibaba Cloud]], [[異常検知]], [[知識蒸留]], [[モデル圧縮]], [[Edge-cloud Collaboration]]
- Key insight: RefinedEdge compresses a multivariate time-series anomaly detection model to an edge-deployable size (under 0.15M parameters) while matching or exceeding the accuracy of a large cloud-trained model (7M parameters) through Aggregated Compression + Knowledge Refinement, and shows that Reciprocal Edge-Cloud Updating brings a significant improvement only on datasets with concept drift.

## [2026-07-13] ingest-paper | From Chaos to Clarity: Log-based Kernel Panic Root Cause Analysis for Large-Scale Cloud Services
- Source: `.raw/papers/LogSage.pdf` (FCS 2025)
- Summary: [[@2025__FCS__From Chaos to Clarity - Log-based Kernel Panic Root Cause Analysis for Large-Scale Cloud Services]]
- Pages created: [[@2025__FCS__From Chaos to Clarity - Log-based Kernel Panic Root Cause Analysis for Large-Scale Cloud Services]], [[Tianyu Cui]]
- Pages updated: [[Shenglin Zhang]], [[Yongqian Sun]], [[Yicheng Sui]], [[Zeyu Che]], [[Nankai University]], [[ByteDance]], [[ログ解析]], [[根本原因分析]], [[グラフベースRCA]], [[LLMによる根本原因分析]]
- Key insight: LogSage (informal name) decomposes kernel-panic RCA into two challenges, "extracting sparse fault-indicating logs" and "long-range dependencies between logs". With GraphSAGE plus active learning and LLM summarization, it beats LogKG by 15.5 to 20.3 pt F1 on 20,000 production cases at ByteDance, and it has been deployed in production for more than six months.

## [2026-07-13] ingest-paper | A Comprehensive Benchmark and Empirical Study of Trace Anomaly Detection
- Source: `.raw/papers/TSC3622122.pdf` (IEEE TSC 2025)
- Summary: [[@2025__TSC__A Comprehensive Benchmark and Empirical Study of Trace Anomaly Detection]]
- Pages created: [[@2025__TSC__A Comprehensive Benchmark and Empirical Study of Trace Anomaly Detection]], [[Minyi Shao]], [[Kaiwen Yang]], [[Xingda Li]], [[Dongbiao He]], [[Yanbiao Li]], [[トレース異常検知]]
- Pages updated: [[Yongqian Sun]], [[Nankai University]]
- Key insight: No single algorithm is consistently best for trace anomaly detection across all datasets (GTrace, TraceVAE, and PUTraceAD trade places depending on conditions). TADBench is presented as the first cross-cutting benchmark that recommends an algorithm with a decision tree over four characteristics: trace depth, span count, service count, and anomaly ratio.

## [2026-07-13] ingest-paper | PerfScout: An Adaptive Workload Generator in Software Performance Testing
- Source: 
`.raw/papers/PerfScout_ICSE_26_Camera_Ready.pdf` (ICSE-SEIP '26)
- Summary: [[@2026__ICSE-SEIP__PerfScout - An Adaptive Workload Generator in Software Performance Testing]]
- Pages created: [[@2026__ICSE-SEIP__PerfScout - An Adaptive Workload Generator in Software Performance Testing]], [[Qingliang Zhang]], [[Yimin Zuo]], [[Bowen Deng]], [[Xiao Xiong]], [[Mengyao Li]], [[Huandong Zhuang]], [[Ruiyuan Wan]]
- Pages updated: [[Yongqian Sun]], [[Shenglin Zhang]], [[Dan Pei]], [[Xidao Wen]], [[Nankai University]], [[Huawei Cloud]], [[BizSeer]], [[Alban Siffer]], [[Tsinghua University]], [[Wenwei Gu]], [[定常性モデル]], [[適応的ワークロード生成]]
- Key insight: PerfScout is a framework that fully automates workload generation for performance testing by integrating three modules: SPOT (extreme value theory), ADF/KPSS (local stationarity testing), and PPO (reinforcement learning). Deployed in production at Huawei Cloud for nine months, it beat all baselines on the harmonic mean (HM) and demonstrated an 87% reduction in test time on a representative case.

## [2026-07-13] ingest-paper | When LLMs Listen to Experts: Accurate Failure Diagnosis in Operating Systems
- Source: `.raw/papers/icse2026-seip-paper13.pdf` (ICSE-SEIP '26)
- Summary: [[@2026__ICSE-SEIP__When LLMs Listen to Experts - Accurate Failure Diagnosis in Operating Systems]]
- Pages created: [[@2026__ICSE-SEIP__When LLMs Listen to Experts - Accurate Failure Diagnosis in Operating Systems]], [[OScope]], [[Yuxin Sun]], [[Li Shi]], [[Cheng Huang]], [[Guodong Yang]], [[Luping Wang]]
- Pages updated: [[Yongxin Zhao]], [[Wenwei Gu]], [[Yongqian Sun]], [[Shenglin Zhang]], [[Dan Pei]], [[Liping Zhang]], [[Nankai University]], [[Alibaba Group]], [[Tsinghua University]], [[TSG自動化]], [[マルチモーダル障害診断]]
- Key insight: OScope resolves semantic mismatch in symptom descriptions with an independently fine-tuned Knowledge Aligner (TSG retrieval accuracy AC@5 0.75→0.9) and combines it with chunk-by-chunk sequential verification of SOP guides (Report Validator). In Alibaba's production OS failure diagnosis it reaches AC@5=0.901 and cuts average diagnosis time from 112 minutes to 1.5 minutes.

## [2026-07-13] ingest-paper | Aloha: Localizing Batch Failures in Large-scale Cloud Systems via Contrast Analysis and Human-in-the-Loop Agent
- Source: `.raw/papers/Yujia__Aloha_to_FSE_26.pdf` (FSE Companion '26)
- Summary: [[@2026__FSE Companion__Aloha - Localizing Batch Failures in 
Large-scale Cloud Systems via Contrast Analysis and Human-in-the-Loop Agent]]
- Pages created: [[@2026__FSE Companion__Aloha - Localizing Batch Failures in Large-scale Cloud Systems via Contrast Analysis and Human-in-the-Loop Agent]], [[Yujia Wu]], [[Jinghuan Ren]], [[バッチ障害診断]]
- Pages updated: [[Shenglin Zhang]], [[Yongqian Sun]], [[Chaoyun Zhang]], [[Liqun Li]], [[Wenwei Gu]], [[Qingwei Lin]], [[Dongmei Zhang]], [[Saravan Rajmohan]], [[Chetan Bansal]], [[Minghua Ma]], [[Nankai University]], [[Microsoft]], [[Fault Localization]]
- Key insight: Aloha (FSE Companion '26) argues that for contrast-analysis-based batch failure diagnosis "the practical barrier is the usability gap, not the algorithm". It integrates FTA-derived eligibility checks, an executable verification toolkit, and RAG-based strategy selection in a human-in-the-loop design, outperforming CONAN on ACC@5 (0.9370 vs. 0.6963) and cutting diagnosis time from about 10 hours to about 0.5 hours.

## [2026-07-13] ingest-paper | FoundRoot: Towards Foundation Model for Root Cause Analysis via Structured Deep Thinking
- Source: `.raw/papers/foundroot_camera_ready.pdf` (ICSE '26)
- Summary: [[@2026__ICSE__FoundRoot - Towards Foundation Model for Root Cause Analysis via Structured Deep Thinking]]
- Pages created: [[@2026__ICSE__FoundRoot - Towards Foundation Model for Root Cause Analysis via Structured Deep Thinking]], [[Yuzhuo Yang]], [[構造化深層思考]]
- Pages updated: [[Zhe Xie]], [[Zeyan Li]], [[Xiao He]], [[Shenglin Zhang]], [[Longlong Xu]], [[Tieying Zhang]], [[Jianjun Chen]], [[Rui Shi]], [[Dan Pei]], [[Tsinghua University]], [[ByteDance]], [[Nankai University]], [[根本原因分析]], [[LLMによる根本原因分析]], [[検証可能報酬による強化学習]], [[Fault Localization]]
- Key insight: By internalizing structured deep thinking (metric scan → propagation analysis → reflection → ranking) in an LLM through warm-up SFT + DAPO, FoundRoot outperforms prompt-only decomposition (w/ Workflow) and structuring without RL (SFT Only/SFT+SFT), improving MRR by 4.5% to 48.6% on all four zero-shot RCA datasets.

## [2026-07-13] ingest-paper | LLM-Assisted Joint Ticket and Log Analysis for Incident Triage in Intelligent and Connected Vehicles
- Source: `.raw/papers/Ruowei__InsightTriage_to_ASE26.pdf` (ASE'26 submission version)
- Summary: [[@2026__ASE__LLM-Assisted Joint Ticket and Log Analysis for Incident Triage in Intelligent and 
Connected Vehicles]]
- Pages created: [[@2026__ASE__LLM-Assisted Joint Ticket and Log Analysis for Incident Triage in Intelligent and Connected Vehicles]], [[Weiguo Li]]
- Pages updated: [[Ruowei Fu]], [[Shenglin Zhang]], [[Wenwei Gu]], [[Yongqian Sun]], [[Dan Pei]], [[Nankai University]], [[インシデントトリアージ]], [[オンコール自動化]]
- Key insight: The same authors, Ruowei Fu / Shenglin Zhang (Nankai University), follow their ByteDance-domain work (OncallX, CoTriage) by proposing InsightTriage for the Huawei/ICV (in-vehicle) domain. Through LLM-driven automatic construction of a component knowledge base plus contrastive-learning-based log retrieval, they demonstrate by ablation (removing the log retriever lowers Weighted F1 by 19.2%) the effectiveness of a design that treats logs as primary evidence.

## [2026-07-13] ingest-paper | Bridging the Delay: Lag-Aware Spatio-Temporal Causal Inference for Microservice Root Cause Analysis
- Source: `.raw/papers/LagRCA_4.24.pdf` (FSE Companion '26)
- Summary: [[@2026__FSE Companion__Bridging the Delay - Lag-Aware Spatio-Temporal Causal Inference for Microservice Root Cause Analysis]]
- Pages created: [[@2026__FSE Companion__Bridging the Delay - Lag-Aware Spatio-Temporal Causal Inference for Microservice Root Cause Analysis]], [[Junhua Kuang]], [[Yimeng Zhang]], [[Jintao Feng]], [[Jingyu Wang]], [[Liping Zhang]], [[LagRCA]], [[遅延認識時空間因果推論]]
- Pages updated: [[Shenglin Zhang]], [[Yongqian Sun]], [[Dan Pei]], [[Nankai University]], [[Alibaba Group]], [[Tsinghua University]], [[Sibo Xia]], [[Wenwei Gu]], [[Wei Li]], [[因果推論ベースRCA]], [[Fault Localization]], [[根本原因分析]], [[グラフベースRCA]]
- Key insight: LagRCA demonstrates that 81.5% of microservice failure propagation in production data is asynchronous (delays of two minutes or more), and that explicitly modeling this time lag (skeleton/strength separation + lag-conditioned attention) substantially outperforms existing RCA methods that assume synchronous aggregation.

## [2026-07-13] ingest-paper | Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
- Source: `.raw/papers/Can-Language-Models-Go-Beyond-Coding-Assessing-the-Capability-of-Language-Models-to-Build-Real-World-Systems.pdf`
- Summary: [[@2026__nkcs.iops.ai__Can Language Models Go Beyond Coding - Assessing the Capability of Language Models to Build Real-World Systems]]
- Pages created: [[@2026__nkcs.iops.ai__Can Language Models Go Beyond Coding - Assessing the Capability of Language Models to Build Real-World Systems]], [[Build-bench]], [[Open Build Service]], [[Weilin Jin]], [[クロスISAマイグレーション]], [[自動ビルド修復]]
- Pages updated: [[Chenyu Zhao]], [[Shenglin Zhang]], [[Yongqian Sun]], [[Dan Pei]], [[Chaoyun Zhang]], [[Qingwei Lin]], [[Chetan Bansal]], [[Saravan Rajmohan]], [[Minghua Ma]], [[Nankai University]], [[Peking University]], [[Tsinghua University]], [[Microsoft]], [[エージェント型コーディング]]
- Key insight: Without agentic tool use and iterative feedback, GPT-5's success rate is only 6.13%, but in Build-bench's iterative-loop environment it reaches 63.19% (10.3x), demonstrating that cross-ISA build repair requires dynamic tool orchestration and a verifiable feedback loop.

## [2026-07-13] ingest-paper | Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
- Source: `.raw/papers/Debugging-the-Debuggers-Failure-Anchored-Structured-Recovery-for-Software-Engineering-Agents.pdf`
- Summary: [[@2026__arXiv__Debugging the Debuggers - Failure-Anchored Structured Recovery for Software Engineering Agents]]
- Pages created: [[@2026__arXiv__Debugging the Debuggers - Failure-Anchored Structured Recovery for Software Engineering Agents]], [[Yihang Lin]], [[Zhimin Chen]]
- Pages updated: [[Chenyu Zhao]], [[Shenglin Zhang]], [[Wenwei Gu]], [[Yongqian Sun]], [[Dan Pei]], [[Chetan Bansal]], [[Saravan Rajmohan]], [[Minghua Ma]], [[AIOpsLab]], [[エージェント修復]]
- Key insight: PROBE demonstrates a "diagnosis–recovery gap" in which the gain in diagnostic accuracy (+43.58pt) far exceeds the gain in recovery rate (+12.45pt). This converges on the same structural claim as the rapid decay of feedback adherence observed by AgentTether, a follow-up study from the same author group: a correct diagnosis is a necessary but not a sufficient condition for actionable recovery.

## [2026-07-13] ingest-paper | Collaborative Knowledge Distillation and Reinforcement Learning for Automated Ticket Triage in Large-Scale Production Systems
- Source: `.raw/papers/Ruowei__Triage_to_TOSEM.pdf` (TOSEM submission version)
- Summary: [[@2026__nkcs.iops.ai__Collaborative Knowledge Distillation and Reinforcement Learning for Automated Ticket Triage in Large-Scale Production Systems]]
- Pages created: [[@2026__nkcs.iops.ai__Collaborative Knowledge Distillation and Reinforcement Learning for Automated Ticket Triage in Large-Scale Production Systems]], [[Yang Zhang (ByteDance)]], [[Xin Wu (ByteDance)]], [[Feng Wang (ByteDance)]], [[Zeyu Che]], [[Xiaozhou Liu (ByteDance)]], [[知識蒸留]]
- Pages updated: [[Ruowei Fu]], [[Yu Zhang (ByteDance)]], [[ByteDance]], [[Yongqian Sun]], [[Nankai University]], [[Wenwei Gu]], [[Shenglin Zhang]], [[オンコール自動化]], [[インシデントトリアージ]]
- Key insight: The same authors (Ruowei Fu and Shenglin Zhang, with the ByteDance STE team), following the knowledge-graph-augmentation approach of their earlier OncallX, independently deployed a contrasting approach to production in CoTriage: SLM fine-tuning via knowledge distillation + self-reinforcement + DPO. This indicates that no definitive approach to ticket triage has yet been settled.

## [2026-07-13] ingest-paper | Large Language Models Can Provide Accurate and Interpretable Incident Triage
- Source: `.raw/papers/ISSRE24_LLM4triage.pdf` (version published directly by the authors on the Microsoft Research site; the DOI version is behind the IEEE Xplore paywall)
- Summary: [[@2024__ISSRE__Large Language Models Can Provide Accurate and Interpretable Incident Triage]]
- Pages created: [[@2024__ISSRE__Large Language Models Can Provide Accurate and Interpretable Incident Triage]], [[Ze Li]], [[Jianhui Li]], [[Chinese Academy of Sciences]], [[インシデントトリアージ]]
- Pages updated: [[Zexin Wang]], [[Minghua Ma]], [[Chetan Bansal]], [[Qingwei Lin]], [[Dongmei Zhang]], [[Yu Kang]], [[Chaoyun Zhang]], [[Saravan Rajmohan]], [[Murali Chintalapati]], [[Changhua Pei]], [[Gaogang Xie]], [[Microsoft]], [[インシデント管理]], [[インシデントTTM予測]]
- Key insight: COMET uses an LLM (GPT-3.5/GPT-4) to extract keywords from logs and recommends a team through embedding-similarity search. Comparative experiments (Tables I and II) show that, as the input representation for triage, filtered logs (TrimmedLogs) beat raw logs and raw discussion text, and keywords in turn beat generated summaries. Microsoft 
ran it in production on two large-scale cloud services for more than six months, raising online ACC@1 from 0.47 to 0.61 and cutting TTM by 35%. An ablation (Table VI) also confirms quantitatively that feeding the output of a rule-based analyzer (AutoAnalysis) to the LLM as auxiliary input is effective even when that output is inaccurate.

## [2026-07-13] ingest-paper | Integrating Large Language Models into Security Incident Response
- Source: `.raw/papers/soups2025-kramer.pdf`
- Summary: [[@2025__SOUPS__Integrating Large Language Models into Security Incident Response]]
- Pages created: [[@2025__SOUPS__Integrating Large Language Models into Security Incident Response]], [[Diana Kramer]], [[Lambert Rosique]], [[Ajay Narotam]], [[Elie Bursztein]], [[Patrick Gage Kelley]], [[Kurt Thomas]], [[Allison Woodruff]], [[LLMインシデント要約]]
- Pages updated: [[Google]], [[インシデントレポート執筆]], [[インシデントレスポンスAIレベル]]
- Key insight: Autonomous summaries of security incidents by Gemini 1.5 Flash are rated below human summaries, 61% to 39% (defect rates of 35% for completeness and 42% for factuality), whereas collaborative (AI-assisted) summaries, in which a human edits an AI draft, beat human-only summaries 77% to 11%: an asymmetric result. That the evaluation flips with the degree of human involvement, even with the same model and prompt, lends quantitative support to the existing [[インシデントレスポンスAIレベル]] concept (the IR2/IR3 autonomy discussion). The paper also reports a gap between the summary authors' own subjective assessment (opinions split on whether quality improved) and independent third-party evaluation (which rated the AI-assisted summary higher 77% of the time).

## [2026-07-13] ingest-paper | AgentTether: Graph-Guided Diagnosis and Runtime Intervention for Reliable LLM Agent Operations
- Source: `.raw/papers/arxiv-2607.06273.pdf`
- Summary: [[@2026__arXiv__AgentTether - Graph-Guided Diagnosis and Runtime Intervention for Reliable LLM Agent Operations]]
- Pages created: [[@2026__arXiv__AgentTether - Graph-Guided Diagnosis and Runtime Intervention for Reliable LLM Agent Operations]], [[Chenyu Zhao]], [[エージェント修復]]
- Pages updated: [[Shenglin Zhang]], [[Dan Pei]], [[Chetan Bansal]], [[Saravan Rajmohan]], [[Minghua Ma]], [[Wenwei Gu]], [[Yongqian Sun]], [[Nankai University]], [[Tsinghua University]], [[Microsoft]], [[エージェントシステム運用]], [[グラフベースRCA]]
- Key insight: Diagnoses failed LLM agent runs with a graph of Transition Units (the Critical Transition Graph) and couples post-hoc graph-guided diagnosis with guarded runtime intervention (Check→Decide→Inject), addressing the problem that one-shot diagnostic feedback decays during re-execution (adherence drops below 50% at tool-call step 13). It repairs 59.04% (Qwen3.7-max) and 65.12% (GPT-5.4) of the τ-bench Banking tasks that failed on the first attempt. Wenwei Gu's author affiliation (Nankai 
University) conflicts with the existing LLMPrism entry (CUHK), so a contradiction callout was added to the entity page (possibly a different person with the same name; unconfirmed).

## [2026-07-13] ingest | 価値はスケールしない。発酵する。(安宅和人)
- Source: `.raw/articles/kaz-ataka-value-doesnt-scale-it-ferments-2026-07-13.md`
- Summary: [[@2026__hatenablog__価値はスケールしない、発酵する。]]
- Pages created: [[@2026__hatenablog__価値はスケールしない、発酵する。]], [[安宅和人]], [[Dan Hill]], [[堀河屋野村]], [[四資本の時計]], [[価値生成の膜モデル]], [[地域の乳化剤]], [[テロワール(味わうことのできる時間)]], [[存続可能性から生成する力へ]]
- Pages updated: none (first introduction of a new domain, so no existing pages are referenced)
- Key insight: Growth and degrowth arguments both wrongly assume that all value runs on the same single clock as economic capital; the real question is on what timescale value matures. The article presents distinct temporalities for economic capital (compound interest), cultural capital (fermentation), relational capital (aging), and natural capital (circulation); value generation through a "membrane" that is neither full mixing nor full separation; and a redefinition of terroir as "time that can be tasted" rather than the character of a place. It introduces the new domains of regional revitalization, cultural capital, and degrowth into a wiki that had centered on SRE and infrastructure.

## [2026-07-13] ingest | Cognitive Work of Hypothesis Exploration During Anomaly Response
- Source: `.raw/articles/cognitive-work-of-hypothesis-exploration-during-anomaly-response-2026-07-13.md` (full text retrieved via the Wayback Machine because of a Cloudflare 403)
- Summary: [[@2019__ACMQueue__Cognitive Work of Hypothesis Exploration During Anomaly Response]]
- Pages created: [[@2019__ACMQueue__Cognitive Work of Hypothesis Exploration During Anomaly Response]], [[Marisa R. Grayson]], [[Mile Two]], [[SNAFUcatchers Consortium]], [[アノマリー応答]]
- Pages updated: [[David D. Woods]], [[Richard I. 
Cook]], [[仮説駆動RCA]], [[ヒンドサイトバイアス]], [[レジリエンスエンジニアリング]]
- Key insight: Analyzes four cases from the [[SNAFUcatchers Consortium]] incident case database using process tracing and visualizes how the hypothesis exploration space evolves over time during anomaly response (diverging and converging around the line of commitment). "Above the Line, Below the Line" (Cook) and "Managing the Hidden Costs of Coordination" (Maguire), from the same ACM Queue issue, have not been ingested yet.

## [2026-07-13] ingest | Failure is inevitable: Learning from a large outage at Datadog
- Source: `.raw/articles/rethinking-reliability-2026-07-13.md`
- Summary: [[@2025__Datadog Engineering Blog__Failure is inevitable - Learning from a large outage and building for reliability in depth at Datadog]]
- Pages created: [[@2025__Datadog Engineering Blog__Failure is inevitable - Learning from a large outage and building for reliability in depth at Datadog]], [[グレースフルデグレーデーション]], [[Rob Thomas]], [[Maciej Kowalewski]]
- Pages updated: [[Datadog]], [[Laura de Vesine]], [[インシデント管理]], [[ソフトウェア耐障害性]]
- Key insight: A design that "prioritizes data completeness over partial visibility" produces a square-wave pattern that makes a 50–60% partial failure look like a 100% outage. Shifting to graceful degradation yielded quantified results: 30% fewer major incidents, and mitigation time improved by 10% at the median and 50% at the 95th percentile.

## [2026-07-13] ingest-slides | Oncall: An Equal-Opportunity Waste of Time
- Source: `.raw/slides/srecon22emea-oconnor-oncall/srecon22emea-oconnor-oncall.pdf`
- Visual pages: `.raw/slides/srecon22emea-oconnor-oncall/pages/` (10 pages)
- Media: none (no transcript)
- Summary: [[@2022__SREcon22EMEA__Oncall - An Equal-Opportunity Waste of Time]]
- Pages created: [[@2022__SREcon22EMEA__Oncall - An Equal-Opportunity Waste of Time]], [[Dave O'Connor]], [[Twilio]]
- Pages updated: [[SRE組織変革]]
- Key insight: "Toxic exceptionalism", which complicates on-call by treating it as SRE's exclusive domain, locks SRE into the role of "fancy-ops". O'Connor's argument that SRE should prove its value to stakeholders through engineering multiplier effects re-examines, at the level of individual and team attitudes, the same structural problem as the dissolution of Facebook SRO (where a centralized team became a crutch that kept engineering teams from becoming self-reliant).

## [2026-07-13] ingest | 6 Reasons You Don't Need an SRE Team
- Source: `.raw/articles/6reasons-2026-07-13.md`
- Summary: [[6 Reasons You Don't Need an SRE Team]]
- Pages created: [[6 Reasons You Don't Need an SRE Team]], [[Gerro Wadat]], 
[[カーゴカルトSRE]]
- Pages updated: [[SRE]]
- Key insight: The SRE model is a product of Google's specific context (2004, unprecedented scale, no existing tooling, unlimited capital); "cargo-cult SRE" that imitates it without that context risks hiding an organization's real reliability problems.

## [2026-07-10] ingest-paper | Failure Trends in a Large Disk Drive Population
- Source: `.raw/papers/4445.pdf`
- Summary: [[@2007__FAST__Failure Trends in a Large Disk Drive Population]]
- Pages created: [[@2007__FAST__Failure Trends in a Large Disk Drive Population]], [[Eduardo Pinheiro]], [[Wolf-Dietrich Weber]], [[ハードディスク信頼性]]
- Pages updated: [[Luiz André Barroso]], [[データセンター信頼性]], [[障害予測]]
- Key insight: An empirical study of more than 100,000 production HDDs at Google. Strong SMART signals exist (scan errors 39×, offline reallocations 21×), yet more than 56% of failed drives showed no strong SMART signal at all, which quantifies an accuracy ceiling for per-drive failure prediction. In moderate ranges, temperature and utilization correlate with failure more weakly than conventional wisdom held.

## [2026-07-08] ingest-paper | Benchmarking the Overhead of Distributed Tracing Agents
- Source: `.raw/papers/3777884.3797004.pdf`
- Summary: [[@2026__ICPE__Benchmarking the Overhead of Distributed Tracing Agents]]
- Pages created: [[@2026__ICPE__Benchmarking the Overhead of Distributed Tracing Agents]], [[David Georg Reichelt]], [[Wilhelm Hasselbring]], [[MooBench]], [[Kieker]], [[トレーシングオーバーヘッド]]
- Pages updated: [[分散トレーシング]], [[継続的プロファイリング]]
- Key insight: A uniform comparison of seven Java tracing agents finds Kieker fastest (133.92 ns/depth) and OpenTelemetry slow for an industry standard (315.28 ns/depth), while Pinpoint and Scouter have span-loss bugs. OpenTelemetry's overhead comes mainly from copying a HashMap on every call, copying the ArrayBasedContext stack, and excessive metadata management, so implementation improvements could reduce it substantially.

## [2026-07-07] ingest-paper | VAST AI Operating System
- Source: `.raw/papers/vast-ai-operating-system.pdf`
- Summary: [[@2025__VAST Data__VAST AI Operating System]]
- Pages created: [[DASEアーキテクチャ]], [[@2025__VAST Data__VAST AI Operating System]]
- Pages updated: [[コンピュートストレージ分離]], [[分散メッセージブローカ]], [[VAST Data]]
- Key insight: The DASE architecture uses "stateless CNodes + NVMe-oF shared SSDs" to eliminate namespace partitioning and unifies Object Store, Database, Event Broker, and vector database on a single pool. InsightEngine keeps RAG permissions, lifecycle, and auditing structurally consistent, but every performance figure is vendor self-reported and should be treated critically.

## [2026-07-06] ingest-paper | INTFusion: Unifying 
Network and Host Telemetry in Data Center Networks
- Source: `.raw/papers/1571262346.pdf`
- Summary: [[@2026__IFIP Networking__INTFusion - Unifying Network and Host Telemetry in Data Center Networks]]
- Pages created: [[Leonardo Alberro]], [[Matias Richart]], [[Eduardo Grampin]], [[Universidad de la República]], [[インバンドネットワークテレメトリ]], [[@2026__IFIP Networking__INTFusion - Unifying Network and Host Telemetry in Data Center Networks]]
- Pages updated: [[テレメトリ]], [[ネットワーク監視]], [[データセンター輻輳制御]]
- Key insight: "Edge-terminated INT", which offloads the INT source and sink to a smartNIC and fuses them per flow with host eBPF, removes the fragmentation between network and application telemetry while minimizing dependence on switches. The flowlet abstraction and two-tier export are the main design innovations.

## [2026-07-06] ingest-paper | Beyond Throughput: Performance and Energy Insights of LLM Inference Across AI Accelerators
- Source: `.raw/papers/Beyond_Throughput_Performance_and_Energy_Insights_of_LLM_Inference_Across_AI_Accelerators.pdf`
- Summary: [[@2026__IPDPS__Beyond Throughput - Performance and Energy Insights of LLM Inference Across AI Accelerators]]
- Pages created: [[Giacomo Brunetta]], [[Cerebras]], [[SambaNova]], [[AIアクセラレータ]], [[@2026__IPDPS__Beyond Throughput - Performance and Energy Insights of LLM Inference Across AI Accelerators]]
- Pages updated: [[LLM推論]], [[テンソル並列]], [[Mixture-of-Experts]]
- Key insight: Dataflow accelerators show an order-of-magnitude advantage over GPUs at small batch sizes, but GPUs are far more energy efficient. For inference, DP > TP is the general rule (~100% vs 60% scaling), with the exception that models exceeding 80% of VRAM need TP.

## [2026-07-06] ingest-paper | POSTER: Vedrfolnir: RDMA Network Performance Anomalies Diagnosis in Collective Communications
- Source: `.raw/papers/3744969.3748396.pdf`
- Summary: [[@2025__SIGCOMM__POSTER - Vedrfolnir - RDMA Network Performance Anomalies Diagnosis in Collective Communications]]
- Pages created: [[Yuxuan Chen]], [[Xiheng Li]], [[Fangzheng Jiao]], [[Chunming Hu]]
- Pages updated: [[Menghao Zhang]], [[Hawkeye]], [[RDMAネットワーク監視]], [[集合通信]]
- Key insight: A wait graph that decomposes collective communication algorithms into steps makes co-flow dependencies visible and cuts telemetry by 98% relative to [[Hawkeye]]. It is the first to make the host-side critical path, which single-flow monitoring cannot see, a diagnostic axis at "algorithm step" granularity.

## 
[2026-07-06] ingest-paper | ARGUS: Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters
- Source: `.raw/papers/arxiv-2606.20374.pdf`
- Summary: [[@2026__arXiv__ARGUS - Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters]]
- Pages created: [[@2026__arXiv__ARGUS - Production-Scale Tracing and Performance Diagnosis for over 10,000-GPU Clusters]], [[Jiasheng Zhou]]
- Pages updated: [[LLM学習モニタリング]], [[GPU観測性]], [[ストラグラー]], [[Tencent]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], [[wiki/sources/_index]], [[wiki/entities/_index]]
- Key insight: Fail-slow in LLM training becomes fully visible only when three layers (CPU call stacks, framework semantics, and GPU kernels) are instrumented independently. Because stragglers spread to and converge across other ranks through bubble transfer in pipeline parallelism and gradient-synchronization alignment, observing a single rank cannot identify the root rank.

## [2026-07-06] ingest-paper | KRCA: An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI
- Source: `.raw/papers/arxiv-2607.01788.pdf`
- Summary: [[@2026__ASE__KRCA - An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI]]
- Pages created: [[@2026__ASE__KRCA - An Efficient Root Cause Analysis System in Hyper-Scale Microservice Systems via Agentic AI]], [[Jiamin Jiang]]
- Pages updated: [[根本原因分析]], [[LLMによる根本原因分析]], [[因果発見]], [[Yongqian Sun]], [[Dan Pei]], [[Kuaishou Technology]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], [[wiki/sources/_index]], [[wiki/entities/_index]]
- Key insight: At hyper-scale (more than 200,000 services), the premise of LLM-based RCA that the search space is of tractable size no longer holds, so a stage that narrows the candidates to three services through API-level drill-down becomes essential. Causal discovery from time-series statistics drops sharply to 20% or below beyond 20 metrics, but fixing the skeleton structure in advance from the metrics' semantic information keeps LLM inference accuracy above 60%.

## [2026-07-06] ingest-paper | A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis
- Source: `.raw/papers/arxiv-2606.29193.pdf`
- Summary: [[@2026__arXiv__A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis]]
- Pages created: [[@2026__arXiv__A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure 
Diagnosis]], [[Yuanhong Cai]]
- Pages updated: [[RCA評価設計]], [[SRE Benchmark]], [[Changhua Pei]], [[Dan Pei]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], [[wiki/sources/_index]], [[wiki/entities/_index]]
- Key insight: Existing RCA benchmarks are outcome-oriented and score only the final answer. With a reasoning-process evaluation paradigm (three axes: Localization, Identification, Reason) and two label formats (key-evidence and causal-chain), this large-scale, competition-validated benchmark is the first that can tell accidental correct answers produced by keyword matching apart from systematic, evidence-based reasoning.

## [2026-07-06] ingest | 博士論文を書くということ (北村匡平)
- Source: `.raw/articles/na8026bd18753-2026-07-01.md`
- Summary: [[博士論文を書くということ]]
- Pages created: [[博士論文を書くということ]], [[北村匡平]], [[日本の博士教育]]
- Pages updated: [[wiki/index]], [[wiki/hot]], [[wiki/log]]
- Key insight: Treating the doctoral dissertation not as "the final form of one's research" but as "a summary of the research at a particular point in time (the first major milestone)" avoids the indefinite postponement seen in humanities doctoral programs. Continuity from the master's work and early submission to peer-reviewed venues are the shortcut to completion.

## [2026-07-05] ingest-paper | Self-Hosted WebAssembly Runtime for Runtime-Neutral Checkpoint/Restore in Edge–Cloud Continuum
- Source: `.raw/papers/2025__Mid4CC__Self-Hosted_WebAssembly_Runtime_for_Runtime_Neutral_Checkpoint_Restore.pdf`
- Summary: [[@2025__Mid4CC__Self-Hosted WebAssembly Runtime for Runtime-Neutral Checkpoint-Restore in Edge-Cloud Continuum]]
- Pages created: [[@2025__Mid4CC__Self-Hosted WebAssembly Runtime for Runtime-Neutral Checkpoint-Restore in Edge-Cloud Continuum]], [[Self-Hosted WebAssembly Runtime]], [[Chiwawa]], [[Wizard]], [[CRIU]]
- Pages updated: [[WebAssembly]], [[ランタイム中立チェックポイント]], [[Application Checkpointing]], [[VM Migration]], [[Edge-cloud Collaboration]], [[Yuki Nakata]], [[Katsuya Matsubara]], [[Future University Hakodate]], [[SAKURA internet Inc.]], [[WasmEdge]], [[WAMR]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: Using a self-hosted WebAssembly runtime as an intermediate layer achieves C/R that is neutral to both the runtime and the optimization strategy without modifying the host runtime, and enables live migration with a consistently small 1076 KB execution state whether the host is wasmtime, WAMR, or WasmEdge.

## [2026-07-05] ingest-paper | Seamless Self-Healing in WebAssembly Container Orchestration with Runtime-Neutral Checkpointing 
- Source: `.raw/papers/Seamless_Self-Healing_in_WebAssembly_Container_Orchestration_with_Runtime-Neutral_Checkpointing.pdf`
- Summary: [[@2025__CANDARW__Seamless Self-Healing in WebAssembly Container Orchestration with Runtime-Neutral Checkpointing]]
- Pages created: [[@2025__CANDARW__Seamless Self-Healing in WebAssembly Container Orchestration with Runtime-Neutral Checkpointing]], [[ランタイム中立チェックポイント]], [[ホットリスタート]], [[動的ランタイム切り替え]], [[セルフヒーリング]], [[Yuzuki Saito]]
- Pages updated: [[WebAssembly]], [[チェックポイント]], [[コンテナオーケストレーション]], [[Katsuya Matsubara]], [[Yuki Nakata]], [[Daigo Fujii]], [[Future University Hakodate]], [[SAKURA internet Inc.]], [[WasmEdge]], [[WAMR]]
- Key insight: Runtime-neutral checkpointing extends failure recovery for Wasm containers into hot restart and memory-pressure relief into dynamic runtime switching, enabling self-healing without Pod eviction.

## [2026-07-06] ingest-paper | A Checkpoint/Restore Mechanism with Interoperability Among Distinctive WebAssembly Interpreters
- Source: `.raw/papers/apsys24posters-final73.pdf`
- Summary: [[@2024__APSys__A Checkpoint-Restore Mechanism with Interoperability Among Distinctive WebAssembly Interpreters]]
- Pages created: [[@2024__APSys__A Checkpoint-Restore Mechanism with Interoperability Among Distinctive WebAssembly Interpreters]], [[Wasm3]]
- Pages updated: [[WebAssembly]], [[ランタイム中立チェックポイント]], [[Application Checkpointing]], [[VM Migration]], [[Edge-cloud Collaboration]], [[チェックポイント]], [[Daigo Fujii]], [[Katsuya Matsubara]], [[Yuki Nakata]], [[Future University Hakodate]], [[SAKURA internet Inc.]], [[WasmEdge]], [[WAMR]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: Between a standard interpreter and a fast interpreter, converting the program counter, control stack, and value stack makes checkpoint/restore possible across heterogeneous Wasm interpreters. The core technique maps an execution point in the fast interpreter's custom code to a relative address in the Wasm bytecode and converts the stack layout using type information.

---
type: meta
title: "Operation Log"
date: 2026-06-02 18:46
tags:
  - 2026/06/02
  - meta
  - 2026/06/18
  - 2026/06/19
  - 2026/06/21
  - 2026/06/20
  - 2026/06/17
  - 2026/06/16
  - 2026/06/15
  - 2026/06/23
  - 2026/06/24
  - 2026/06/25
  - 2026/06/26
  - 2026/06/27
  - 2026/06/28
  - 2026/06/29
  - 2026/06/30
  - 2026/07/01
  - 2026/07/02
  - 2026/07/04
  - 2026/07/05
  - log
  - enrich-source
status: evergreen
related:
  - "[[index]]"
  - "[[hot]]"
  - "[[overview]]"
created: 2026-06-02
updated: 2026-07-05
---

## [2026-07-05] ingest-paper | Stateful VM Migration Among Heterogeneous WebAssembly Runtimes for Efficient Edge-cloud Collaborations
- Source: `.raw/papers/3642968.3654816.pdf`
- Summary: [[@2024__EdgeSys__Stateful VM Migration Among Heterogeneous WebAssembly Runtimes for Efficient Edge-cloud Collaborations]]
- Pages created: [[@2024__EdgeSys__Stateful VM Migration Among Heterogeneous WebAssembly Runtimes for Efficient Edge-cloud Collaborations]], [[WebAssembly]], [[VM Migration]], [[Edge Computing]], [[Edge-cloud Collaboration]], [[Application Checkpointing]], [[Daigo Fujii]], [[WasmEdge]], [[WAMR]]
- Pages updated: [[チェックポイント]], [[Yuki Nakata]], [[Katsuya Matsubara]], [[Future University Hakodate]], [[SAKURA internet Inc.]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: Stateful VM migration across heterogeneous runtimes, between WasmEdge and WAMR, is feasible by converting instruction addresses into function index + offset and restoring the stack based on type information; dirty-memory detection makes checkpointing 30–100× faster than CRIU.

## [2026-07-05] ingest-paper | Reducing Attack Surface with Container Transplantation for Lightweight Sandboxing
- Source: `.raw/papers/3609510.3609820.pdf`
- Summary: [[@2023__APSys__Reducing Attack Surface with Container Transplantation for Lightweight Sandboxing]]
- Pages created: [[@2023__APSys__Reducing Attack Surface with Container Transplantation for Lightweight Sandboxing]], [[Container Transplantation]], [[Capability-based Security]], [[Capsicum]], [[Lightweight Sandboxing]], [[Shintaro Suzuki]], [[gVisor]], [[Kata Containers]], [[FreeBSD]], [[Linux]], [[Linuxulator]]
- Pages updated: [[コンテナ仮想化]], [[Yuki Nakata]], 
[[Katsuya Matsubara]], [[SAKURA internet Inc.]], [[Future University Hakodate]], [[Docker]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: Transplanting Linux containers onto the FreeBSD kernel and applying Capsicum transparently avoids attacks on Linux-kernel-specific vulnerabilities while keeping performance overhead below gVisor's (22% worse than runC on the UnixBench system call overhead test).

## [2026-07-05] ingest-paper | Concentrated Isolation for Container Networks Toward Application-aware Sandbox Tailoring
- Source: `.raw/papers/2026_Unknown_Concentrated_isolation_container_networks_toward.pdf`
- Summary: [[@2021__UCC__Concentrated Isolation for Container Networks Toward Application-aware Sandbox Tailoring]]
- Pages created: [[@2021__UCC__Concentrated Isolation for Container Networks Toward Application-aware Sandbox Tailoring]], [[Sandbox Tailoring]], [[コンテナネットワーク分離]], [[Para-passthrough Hypervisor]], [[Yuki Nakata]], [[Katsuya Matsubara]], [[Ryosuke Matsumoto (SAKURA internet)|Ryosuke Matsumoto]], [[Future University Hakodate]], [[SAKURA internet Inc.]]
- Pages updated: [[コンテナ仮想化]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: Subaco, a BitVisor-based para-passthrough hypervisor, defends against L2/L3/L4 packet spoofing attacks and network resource attacks while keeping startup time on par with runC, and Sandbox Tailoring eases the trade-off between container performance and robustness.

## [2026-07-04] ingest-paper | Extending Applications Safely and Efficiently
- Source: `.raw/papers/osdi25-zheng-yusheng.pdf`
- Summary: [[@2025__OSDI__Extending Applications Safely and Efficiently]]
- Pages created: [[@2025__OSDI__Extending Applications Safely and Efficiently]], [[Extension Interface Model]], [[Yanpeng Hu]], [[Xiaozheng Lai]], [[Dan Williams]], [[Andi Quinn]], [[Redis]], [[FUSE]], [[OpenSSL]]
- Pages updated: [[eBPF]], [[BPF]], [[uprobe]], [[Yusheng Zheng]], [[Tong Yu]], [[Yiwei Yang]], [[bpftime]], [[eunomia-bpf]], [[DeepFlow]], [[Nginx]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], 
[[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: With EIM and bpftime, user-space application extensions that combine eBPF-style verification with hardware-assisted in-process isolation achieve high performance, with 2% overhead on Nginx.

## [2026-07-04] ingest | The GPU Observability Gap: Why We Need eBPF on GPU devices
- Source: `.raw/articles/the-gpu-observability-gap-why-we-need-ebpf-on-gpu-devices-2026-07-04.md`
- Summary: [[@2025__eunomia.dev__The GPU Observability Gap - Why We Need eBPF on GPU devices]]
- Pages created: [[@2025__eunomia.dev__The GPU Observability Gap - Why We Need eBPF on GPU devices]], [[eGPU]], [[PTX 注入]]
- Pages updated: [[GPU観測性]], [[eBPF]], [[bpftime]], [[eunomia-bpf]], [[Yusheng Zheng]], [[Tong Yu]], [[Yiwei Yang]]
- Key insight: Lays out a direction for closing the GPU observability gap: running eBPF inside GPU kernels through PTX/SPIR-V injection with bpftime.

## [2026-07-04] ingest | CUDA Events - eBPF-based CUDA API Tracing
- Source: `.raw/articles/cuda-events-2026-07-04.md`
- Summary: [[@2026__eunomia.dev__CUDA Events - eBPF-based CUDA API Tracing]]
- Pages created: [[@2026__eunomia.dev__CUDA Events - eBPF-based CUDA API Tracing]], [[CUDA API トレース]], [[CUDA]], [[uprobe]], [[yunwei37]]
- Pages updated: [[eBPF]], [[GPU観測性]], [[動的計装]], [[eunomia-bpf]], [[bpftime]], [[libbpf]], [[NVIDIA]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: An implementation example that traces CUDA API calls without source changes by attaching eBPF uprobes to `libcudart.so`. It makes concrete a two-layer structure: visibility at the CPU-side entry points, plus instrumentation inside the GPU via bpftime/eGPU.

## [2026-07-04] ingest | デジタルネイチャーの十年：計算的物質化から発酵する共在へ
- Source: `.raw/articles/n8157a439a58d-2026-07-04.md`
- Summary: [[@2026__note__デジタルネイチャーの十年 - 計算的物質化から発酵する共在へ]]
- Pages created: [[@2026__note__デジタルネイチャーの十年 - 計算的物質化から発酵する共在へ]], [[デジタル発酵]], [[デジタル蒸留]], [[Homo Convivium]], [[アクセシビリティ]], [[null2]], [[xDiversity]], [[Digital Nature Group]]
- Pages updated: [[計算機自然]], [[マタギドライヴ]], [[批判的デジタルネイチャー]], [[落合陽一]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], 
[[wiki/log]]
- Key insight: Digital Nature is extended from an ontology of computational materialization into a relational theory in which, after generative AI, signs, bodies, memory, institutions, and environments ferment.

## [2026-07-04] gap-analysis | wiki structural gap analysis
- Source: wiki-lens Gap Finder report (2026-07-04)
- Pages created: `wiki/meta/gap-report-2026-07-04.md`
- Key insight: Concept co-citation analysis identified 10 real gaps. The strongest signal is the gap between LLMによる根本原因分析 and インシデント管理 (11 co-citations). Three Tier 0 bridge candidates were DOI-verified: @2024__ASE__MRCA (score 14.46), @2023__TSC__DiagFusion, and @2019__ICSE__Incident Triage. Lancet (MLSys 2024) is the Tier 1 recommendation for the gap between MoE and 集合通信. Twelve intentional separations (different MOC lineages) and eight weak signals were excluded.

## [2026-07-04] ingest | 計算機自然からマタギドライヴへ - 自然の再審と脱人間知性的文明論の10年
- Source: `.raw/articles/n6d470a8f0f75-2026-07-04.md`
- Summary: [[@2026__note__計算機自然からマタギドライヴへ - 自然の再審と脱人間知性的文明論の10年]]
- Pages created: [[@2026__note__計算機自然からマタギドライヴへ - 自然の再審と脱人間知性的文明論の10年]], [[落合陽一]], [[計算機自然]], [[マタギドライヴ]], [[批判的デジタルネイチャー]], [[主体なき美の美学]], [[ヌルのテトラレンマ]]
- Pages updated: [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/log]]
- Key insight: Computational nature (計算機自然) is reformulated from a technical vision of "fusing computation and nature" into a movement that internalizes the untranslatability of the concept of nature, marginal existence, and critiques concerning environment, power, and the body.

## [2026-07-03] ingest-paper | Artificial intelligence tools expand scientists' impact but contract science's focus
- Source: `.raw/papers/nature-s41586-025-09922-y.txt` (text extracted from HTML; the PDF could not be retrieved because of the Nature paywall)
- Summary: [[@2026__Nature__Artificial intelligence tools expand scientists' impact but contract science's focus]]
- Pages created: [[@2026__Nature__Artificial intelligence tools expand scientists' impact but contract science's focus]], [[Qianyue Hao]], [[Fengli Xu]], [[Yong Li]], [[James Evans]], [[AIと科学の集中化]]
- Pages updated: [[AI研究自動化]] (cross-cutting findings and open questions now include the Hao et al. 
observations), sources/_index, entities/_index, concepts/_index, index, hot, log
- Key insight: AI tools accelerate individual productivity and careers while shrinking the topical diversity of science as a whole; the paper demonstrates this divergence between "individual rationality and collective consequences" on a large dataset of 41.3 million records.

## [2026-07-02] ingest-paper | PLaMo 2 Technical Report
- Source: `.raw/papers/arxiv-2509.04897.pdf` (29 pages, arXiv 2509.04897v2)
- Summary: [[@2025__arXiv__PLaMo 2 Technical Report]]
- Pages created: [[@2025__arXiv__PLaMo 2 Technical Report]], [[Preferred Networks]], [[PLaMo 2]]
- Pages updated: [[ハイブリッドアテンションアーキテクチャ]], [[スライディングウィンドウアテンション]], [[状態空間モデル]], [[モデル圧縮]], [[LLM推論]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/concepts/_index]], [[wiki/entities/_index]]
- Key insight: PLaMo 2 starts with an efficient Mamba + SWA configuration and, once the limits of long-range retrieval become apparent, moves to the equivalent of full attention through CPT. A hybrid architecture is thus not a fixed design but a design target in which efficiency and retrieval performance are switched by training stage.

## [2026-07-02] ingest-paper | XProf: An Open, Scalable and Extensible Profiling System for the Modern ML Stack
- Source: `.raw/papers/mlsys2026-xprof-slides.pdf` (MLSys 2026 presentation slides PDF; the paper PDF could not be retrieved because of OpenReview's Cloudflare protection)
- Summary: [[@2026__MLSys2026__XProf - An Open, Scalable and Extensible Profiling System for the Modern ML Stack]]
- Pages created: [[@2026__MLSys2026__XProf - An Open, Scalable and Extensible Profiling System for the Modern ML Stack]], [[MLプロファイリング]], [[Rooflineモデル]], [[Robert Hundt]], [[OpenXLA]]
- Pages updated: [[Google]] (added an ML profiling section as the developer of XProf), each index, hot, log
- Key insight: TraceMe's "deferred correlation + lock-free + thread-local" design achieves under 0.3% overhead at kilobyte-order trace volumes. The same design philosophy as Hindsight in distributed tracing, "generate all data + collect lazily", also shows up in ML system instrumentation.

## [2026-07-02] ingest-paper | Machine Learning Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput
- Source: `.raw/papers/3734_MKeQyls.pdf` (32-page slides PDF, MLSys 2026 Industry Track; the paper PDF could not be retrieved because of OpenReview's Cloudflare protection)
- Summary: [[@2026__MLSys2026__Machine Learning Fleet Efficiency - Improving TPU Systems at Scale with ML Productivity Goodput]]
- Pages created: [[@2026__MLSys2026__Machine Learning Fleet 
Efficiency - Improving TPU Systems at Scale with ML Productivity Goodput]], [[ML Productivity Goodput]], [[Arissa Wongpanich]], [[Vijay Janapa Reddi]], [[Borg]]
- Pages updated: [[Google]] (added an ML fleet efficiency section), [[GPUクラスタ運用]] (added cross-cutting findings, open questions, and related sources), each index, hot, log
- Key insight: Unless the definition of "useful work" is decomposed into three layers (Scheduling, Runtime, Program), the location of the bottleneck stays hidden even in a highly utilized fleet; none of the conventional metrics (Capacity, Occupancy, Duty Cycle) provides this decomposition.

## [2026-07-02] ingest-paper | The Case for Learned Index Structures
- Source: `.raw/papers/arxiv-1712.01208.pdf` (30 pages, arXiv 1712.01208v3)
- Summary: [[@2017__arXiv__The Case for Learned Index Structures]]
- Pages created: [[@2017__arXiv__The Case for Learned Index Structures]], [[Learned Index]], [[Alex Beutel]], [[Ed H. Chi]], [[Neoklis Polyzotis]]
- Pages updated: [[B-Tree]], [[Tim Kraska]], [[Jeffrey Dean]], [[Google]], [[MIT]], [[wiki/sources/_index]], [[wiki/concepts/_index]], [[wiki/entities/_index]], [[wiki/index]], [[wiki/hot]]
- Key insight: The B-Tree is not merely a classical structure to be replaced by a learned index; it is reinterpreted as a regression tree that approximates the CDF and survives as an RMI fallback and as a component of hybrid designs.

## [2026-07-02] ingest-paper | Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki
- Source: `.raw/papers/arxiv-2605.25480.pdf` (15 pages, arXiv 2605.25480v2)
- Summary: [[@2026__arXiv__Retrieval as Reasoning]]
- Pages created: [[@2026__arXiv__Retrieval as Reasoning]], [[Retrieval-as-Reasoning]], [[Haoliang Ming]]
- Pages updated: [[LLM Wikiパターン]], [[LLM向け情報検索]], [[Tencent]], [[wiki/index]], [[wiki/hot]], [[wiki/sources/_index]], [[wiki/concepts/_index]], [[wiki/entities/_index]]
- Key insight: Karpathy's abstract LLM Wiki pattern was operationalized for the first time by LLM-Wiki (Ming et al., Tencent, 2026) and validated on multi-hop QA. The improvement comes not from "a stronger similarity function" but from "changing the contract between knowledge and the agent".

## [2026-07-02] wiki-query deep | インシデント対応の教科書
- Query: Compile a textbook that organizes the literature on SRE incident response systematically, from fundamentals to applications
- Pages read: [[インシデント管理]], [[Incident Commander]], [[インシデント調査戦略]], [[インシデント認識論]], [[インシデント重大度評価]], [[障害緩和]], [[クラウド障害ライフサイクル]], [[変更起因インシデント]], [[人的要因]], [[Common Grounding]], [[Followship]], 
[[Handover Communications]], [[ChatOps]], [[アンインシデント]], [[インシデントメトリクス]], [[インシデント対応成熟度モデル]], [[インシデントシミュレーション]], [[インシデント後の人的回復]], [[オンコールストレス管理]], [[インシデントレスポンスAIレベル]]
- Page created: [[wiki/questions/インシデント対応の教科書]]
- Structure: 9 parts, 19 chapters (Fundamentals / Command / Investigation and Diagnosis / Mitigation / People / Measurement / Organization / Training / AI and the Future) + appendices (glossary / recommended reading / source mapping)
- Key insight: Excludes postmortems (covered by a companion volume) and integrates the lifecycle from detection to mitigation along the axes of ICS command structure, epistemological investigation methods, followership, human factors, organizational maturity, and AI automation.

## [2026-07-01] ingest-slides | Epistemology of Incident Management
- Source: `.raw/slides/srecon26americas-kingsman-epistemology/srecon26americas-kingsman-epistemology.pdf`
- Visual pages: `.raw/slides/srecon26americas-kingsman-epistemology/pages/` (all 49 pages read)
- Media: `.raw/slides/srecon26americas-kingsman-epistemology/transcript.md` (YouTube English auto-captions, 1018 lines)
- Summary: [[@2026__SREcon26Americas__Epistemology of Incident Management]]
- Pages created: [[@2026__SREcon26Americas__Epistemology of Incident Management]], [[Jack Kingsman]], [[インシデント認識論]]
- Pages updated: [[Atlassian]], [[インシデント管理]], [[仮説駆動RCA]]
- Key insight: Redefining each phase of incident response as a "question of knowledge" equips evidence gathering, exploration, hypothesis formation, and testing with epistemological tools. "Incidents are all about knowledge."

## [2026-07-01] ingest | Modern Microprocessors: A 90-Minute Guide (Jason Patterson, lighterra.com)
- Source: `.raw/articles/modernmicroprocessors-2026-07-01.md`
- Summary: [[Modern-Microprocessors-A-90-Minute-Guide|Modern Microprocessors: A 90-Minute Guide]]
- Pages created: [[Modern-Microprocessors-A-90-Minute-Guide|Modern Microprocessors: A 90-Minute Guide]], [[パイプライン処理]], [[スーパースカラー実行]], [[分岐予測]], [[アウトオブオーダー実行]], [[VLIW]], [[同時マルチスレッディング]], [[SIMDベクトル処理]], [[メモリ階層とキャッシュ]], [[メモリウォール]], [[Brainiac設計]], [[チップレット]], [[AMD]]
- Pages updated: none
- Key insight: Processor performance is determined by IPC, not "clock frequency". Three limits (the power wall, the ILP wall, and the memory wall) govern every trade-off in modern architectures.

## [2026-07-01] ingest-slides | Your System Has Recovered from an Incident, but Have Your Developers? 
- Source: `.raw/slides/srecon18americas-woo-developer-recovery/srecon18americas-woo-developer-recovery.pdf`
- Visual pages: `.raw/slides/srecon18americas-woo-developer-recovery/pages/` (39 pages, 5 of them copied to `wiki/sources/_attachments/srecon18americas-woo-developer-recovery/`)
- Media: `.raw/slides/srecon18americas-woo-developer-recovery/media/audio.m4a` + `.raw/slides/srecon18americas-woo-developer-recovery/transcript.md` (YouTube `AttVD__QrAo`; Whisper transcription, 810 lines retrieved)
- Summary: [[@2018__SREcon18Americas__Your System Has Recovered from an Incident, but Have Your Developers]]
- Pages created: [[@2018__SREcon18Americas__Your System Has Recovered from an Incident, but Have Your Developers]], [[Jaime Woo]], [[インシデント後の人的回復]]
- Pages updated: [[オンコールストレス管理]], [[人的要因]]
- Key insight: Even after the system has recovered, 42.5% of engineers carry high stress and 80% receive almost no peer support; self-compassion interventions can be trained deliberately.

## [2026-07-01] ingest-slides | Tales from the VOID: The Scary Truth About Incident Metrics
- Source: `.raw/slides/srecon22americas-nash-incident-metrics/srecon22americas-nash-incident-metrics.pdf`
- Visual pages: `.raw/slides/srecon22americas-nash-incident-metrics/pages/` (29 pages, 7 of them copied to `wiki/sources/_attachments/srecon22americas-nash-incident-metrics/`)
- Media: none (no transcript)
- Summary: [[@2022__SREcon22Americas__Tales from the VOID - The Scary Truth About Incident Metrics]]
- Pages created: [[@2022__SREcon22Americas__Tales from the VOID - The Scary Truth About Incident Metrics]]
- Pages updated: [[Courtney Nash]], [[Verica]], [[インシデントメトリクス]], [[ポストモーテム]], wiki/sources/_index.md, wiki/index.md, wiki/hot.md
- Key insight: The empirical distribution of 1,856 incidents in the VOID independently demonstrates that MTTR is statistically non-robust and shows with real data that duration and severity are uncorrelated (23 h 11 min with zero customer impact vs 21 min rated Critical). The talk is positioned as a precursor to "Far from the Shallows" at SREcon23 Americas.

## [2026-07-01] enrich | Added cross-cutting findings from Modernizing Incident Response with LLMs, RAG, and the MCP to related concepts
- Source: [[@2025__SREcon25EMEA__Modernizing Incident Response with LLMs, RAG, and the MCP]]
- Scope: not marked as touched in the previous ingest, these 7 
concepts were investigated further and cross-cutting findings were appended
- Pages updated: [[ReAct]] / [[LLM評価]] / [[オンコール自動化]] / [[時系列マルチモーダルLLM]] / [[コンテキストエンジニアリング]] / [[インシデントレスポンスAIレベル]] / [[認知的徒弟制]]
- Key insight: Amazon's industrial implementation (preference for ReAct, a Promptfoo evaluation flywheel, a common interface to counter on-call knowledge silos, time series supplied as images, embedded annotations of organizational vocabulary, MCP approval gates, and shared understanding through common reasoning) added axes of comparison with academic findings and other industrial implementations to several concept pages.

## [2026-07-01] ingest-slides | The Critical Resource Is You: Practical Destressing for On-Call Engineers
- Source: `.raw/slides/srecon26americas-long-destressing/srecon26americas-long-destressing.pdf`
- Visual pages: `.raw/slides/srecon26americas-long-destressing/pages/` (43 pages, 9 of them copied to `wiki/sources/_attachments/srecon26americas-long-destressing/`)
- Media: none (transcript not retrieved)
- Summary: [[@2026__SREcon26Americas__The Critical Resource Is You - Practical Destressing for On-Call Engineers]]
- Pages created: [[Beth Adele Long]], [[Continuous Re-integration]], [[オンコールストレス管理]]
- Pages updated: [[人的要因]]
- Key insight: The ANS has a self-correcting capability, but the Ordinary Mind suppresses it. Bodily interventions (Body Scan / Breath / Movement / Boredom) provide the workaround.

## [2026-07-01] ingest-slides | The Un-Incident: Extracting Value from the Gray Area of Incident Response
- Source: `.raw/slides/2025__SREcon25EMEA__The-Un-Incident/2025__SREcon25EMEA__The-Un-Incident.pdf`
- Visual pages: `.raw/slides/2025__SREcon25EMEA__The-Un-Incident/pages/` (26 pages, 8 of them copied to `wiki/sources/_attachments/2025__SREcon25EMEA__The-Un-Incident/`)
- Media: none (transcript not retrieved)
- Summary: [[@2025__SREcon25EMEA__The Un-Incident]]
- Pages created: [[アンインシデント]], [[Andreas Deuschl]], [[Dynatrace]]
- Pages updated: [[インシデント管理]]
- Key insight: Some 30–60% of learning opportunities lie "before the entrance" to the incident management lifecycle, and the Gray Zone Playbook (four types + a cycle) can systematize them.

## [2026-07-01] ingest-slides | Modernizing Incident Response with LLMs, RAG, and the MCP
- Source: `.raw/slides/srecon25emea-papapanagiotou-mcp-incident-response/srecon25emea-papapanagiotou-mcp-incident-response.pdf`
- Visual pages: `.raw/slides/srecon25emea-papapanagiotou-mcp-incident-response/pages/` (70 pages, 7 of which were placed in 
`wiki/sources/_attachments/srecon25emea-papapanagiotou-mcp-incident-response/` にコピー) - Media: `.raw/slides/srecon25emea-papapanagiotou-mcp-incident-response/transcript.md`(YouTube THX_qkVLMPw を Whisper で文字起こし) - Summary: [[@2025__SREcon25EMEA__Modernizing Incident Response with LLMs, RAG, and the MCP]] - Pages created: [[Theofilos Papapanagiotou]] - Pages updated: [[Amazon]], [[Model Context Protocol]], [[agentic SRE]], [[RAGベースクラウド運用支援]] - Key insight: Amazon の産業実装は MCP を「人間とエージェント共通のツールハンドル」として使い、IAM ロール分離で権限だけを変える認証設計を採る。時系列データを画像としてエージェントに渡すことで人間に近い水準の推論精度を得たという具体例も加わった。 ## [2026-07-01] ingest-slides | Storytelling as an Incident Management Skill - Source: `.raw/slides/srecon24americas-devesine-storytelling/srecon24americas-devesine-storytelling.pdf` - Visual pages: `.raw/slides/srecon24americas-devesine-storytelling/pages/`(18ページ、うち4枚を `wiki/sources/_attachments/srecon24americas-devesine-storytelling/` にコピー) - Media: `.raw/slides/srecon24americas-devesine-storytelling/transcript.md`(Whisper 音声文字起こし) - Summary: [[@2024__SREcon24Americas__Storytelling as an Incident Management Skill]] - Pages created: [[@2024__SREcon24Americas__Storytelling as an Incident Management Skill]] - Pages updated: [[Laura de Vesine]], [[Datadog]], [[インシデントストーリー]], [[ポストモーテム]] - Key insight: 人物中心の物語(「英雄の旅」)と因果論理中心の物語を目的別に使い分けるという整理、および「対応中の協調的ストーリーテリング」という新しい適用フェーズ、5段階「エンゲージングなポストモーテム」構成を追加した。 ## [2026-07-01] ingest-video | Incident Groundhog Day - Source: `https://www.usenix.org/conference/srecon24emea/presentation/silatani` (YouTube: `AMDB0OV1cVs`) - Transcript: `.raw/videos/AMDB0OV1cVs/transcript.md` (1967行、YouTube 自動字幕 VTT → dedup 変換) - Frames: `.raw/videos/AMDB0OV1cVs/frames/` (31フレーム、10枚を `wiki/sources/_attachments/srecon24emea-silatani-groundhog-day/` にコピー) - Summary: [[@2024__SREcon24EMEA__Incident Groundhog Day]] - Pages created: [[Hamed Silatani]], [[Uptime Labs]], [[インシデントシミュレーション]] - Pages updated: [[インシデント重大度評価]], [[Incident Commander]] - Key insight: 
In a 20-person experiment, resolution time and experience were uncorrelated. Time invested in discussing severity shortened resolution time, and the difference between Solo Artist and Band Member behavior patterns was the main dividing line.

## [2026-07-01] ingest-slides | Incident Management Metrics that Matter
- Source: `.raw/slides/srecon25americas-de-vesine-incident-management-metrics/srecon25americas-de-vesine-incident-management-metrics.pdf`
- Visual pages: `.raw/slides/srecon25americas-de-vesine-incident-management-metrics/pages/` (all 49 pages checked)
- Media: none (no transcript)
- Summary: [[@2025__SREcon25Americas__Incident Management Metrics that Matter]]
- Pages created: [[wiki/sources/@2025__SREcon25Americas__Incident Management Metrics that Matter]], [[wiki/entities/Jamie Luck]], [[wiki/concepts/インシデントメトリクス]]
- Pages updated: [[wiki/entities/Laura de Vesine]] (role updated, talk added), [[wiki/entities/Datadog]] (talk added), [[wiki/concepts/インシデント管理]] (cross-cutting findings added)
- Key insight: MTTR is dominated by statistical noise and creates perverse incentives. An incident management process should define its goals first and then be measured with a set of metrics across 8 dimensions that measure those goals directly. In a mature organization, a rising MTTR is a healthy sign.

## [2026-07-01] ingest-slides | From 4 Hours to 8 Minutes with AI Agents that Transform SRE Incident Response
- Source: `.raw/slides/srecon25emea_slides-jausovec/srecon25emea_slides-jausovec.pdf`
- Visual pages: `.raw/slides/srecon25emea_slides-jausovec/pages/` (all 17 pages checked)
- Media: none (no transcript; the demo is not recorded in the slides)
- Summary: [[@2025__SREcon25EMEA__From 4 Hours to 8 Minutes with AI Agents that Transform SRE Incident Response]]
- Pages created: [[Peter Jausovec]], [[Solo.io]], [[kagent]]
- Pages updated: [[インシデントレスポンスAIレベル]], [[エージェントシステム運用]]
- Key insight: The four capability stages of the AIRE framework (Operational Knowledge / Awareness / Investigation / Resolution) correspond to an industrial implementation model of IR3 to IR4 in the IR Levels, and MCP is emerging as the standard layer for sharing tools across multiple agents.

## [2026-07-01] ingest-video | Embracing the Multi-Party Dilemma: Incident Response Across Company Boundaries
- Source: URL https://www.usenix.org/conference/srecon23emea/presentation/butt (YouTube ID Veq7VUbPwWo resolved via yt-dlp. The helper exited with url-only, so the yt-dlp/ffmpeg/whisper fallback was run manually)
- Transcript: `.raw/videos/Veq7VUbPwWo/transcript.md` (whisper automatic transcription, 1778-line VTT → 147 lines in 30-second buckets, read in full)
- Frames: `.raw/videos/Veq7VUbPwWo/frames/` (20 frames, 7 attached and visually checked)
- Summary: [[@2023__SREcon23EMEA__Embracing the Multi-Party Dilemma - Incident Response Across Company Boundaries]]
- Pages created: [[Alex Elman]], [[SentinelOne]], [[Multi-Party Dilemma]]
- Pages updated: [[Sarah Butt]], [[Indeed]], [[Laura Maguire]], [[David D. Woods]], [[John Allspaw]], [[Richard I. Cook]], wiki/index.md, wiki/hot.md, wiki/sources/_index.md, wiki/entities/_index.md, wiki/concepts/_index.md
- Key insight: In incident response across organizational boundaries, a spontaneous transient organization forms between the two bureaucracies of customer and vendor, and under time pressure it shifts to a polycentric governance model in which decision-making authority moves from the bureaucrats to the holders of expertise on the ground. This is supported by a case in which deep two-way information sharing with a CDN vendor avoided a risk of triggering a retry storm that neither side could have discovered alone.

## [2026-07-01] ingest-slides | Hard Choices, Tight Timelines: A Closer Look at Tradeoff Decisions during Incidents
- Source: `.raw/slides/2024__SREcon24Americas__Skip-Level-Tradeoff-Decisions/2024__SREcon24Americas__Skip-Level-Tradeoff-Decisions.pdf` (USENIX SREcon24 Americas; Dr. Laura Maguire (Trace Cognitive Engineering/OSU) and Courtney Nash (The VOID); 2024-03-19. The official page https://www.usenix.org/conference/srecon24americas/presentation/maguire includes "(Skip-Level)" in the title, but the title on the slides does not, so both are listed in aliases)
- Visual pages: `.raw/slides/2024__SREcon24Americas__Skip-Level-Tradeoff-Decisions/pages/` (61 pages, all visually checked)
- Media: no transcript (the official page provides no video link)
- Summary: [[@2024__SREcon24Americas__Hard Choices, Tight Timelines - A Closer Look at Tradeoff Decisions during Incidents]]
- Pages created: [[トレードオフ意思決定]]
- Pages updated: [[Laura Maguire]], [[Courtney Nash]], wiki/index.md, wiki/sources/_index.md, wiki/entities/_index.md, wiki/concepts/_index.md, wiki/hot.md
- Key insight: Large incident-report databases such as The Void can record the "outcomes" of decisions, but the "reasoning process" is structurally hard to capture. Compensating for this limit with the vignette (hypothetical scenario) method reveals that the trade-off axes people prioritize differ across levels of the organizational hierarchy (skip-level): senior leaders prioritize business continuity, reputation, and legal risk, while responders prioritize understanding system state, recovery speed, and cognitive load.

## [2026-07-01] ingest-video | What Is Incident Severity, but a Lie Agreed Upon?
- Source: URL https://www.usenix.org/conference/srecon24americas/presentation/ruppe (USENIX SREcon24 Americas, Em Ruppe, Jeli/PagerDuty. The official page returned 403 to WebFetch, so it was fetched via a curl+UA fallback; the embedded YouTube URL https://www.youtube.com/watch?v=3LwApIPFrTo was used as the actual source)
- Transcript: `.raw/videos/srecon24americas-ruppe-incident-severity/transcript.md` (YouTube auto-captions, 1634 lines, read in full)
- Frames: `.raw/videos/srecon24americas-ruppe-incident-severity/frames/` (17 frames, all visually checked)
- Summary: [[@2024__SREcon24 Americas__What Is Incident Severity, but a Lie Agreed Upon?]]
- Pages created: none (reused the existing entities [[Emily Ruppe]], [[Jeli]], and [[PagerDuty]])
- Pages updated: [[Emily Ruppe]], [[Jeli]], [[PagerDuty]], [[インシデント重大度評価]], wiki/index.md, wiki/sources/_index.md
- Key insight: Protracted arguments over severity are not a flaw in the severity design itself but a symptom (a canary) of organizational problems such as underestimation, overestimation, insufficient explanation, and organizational immaturity. This complements the static critique by Nash and Allspaw that "severity is a social coordination artifact" with a dynamic operational practice that uses severity friction as material for discovering organizational issues.

## [2026-07-01] ingest-video | The Incident Is The Way: Using Your Incidents to Win Reliability Investment
- Source: URL https://www.usenix.org/conference/srecon23emea/presentation/mccarthy (USENIX SREcon23 EMEA, Niall McCarthy, Afterpay. The official page returned 403 to WebFetch, so it was fetched via a curl+UA fallback; the embedded YouTube URL https://www.youtube.com/watch?v=aaaA7gS_EvQ was used as the actual source)
- Transcript: `.raw/videos/srecon23emea-mccarthy-incident-is-the-way/transcript.md` (YouTube auto-captions, 54 blocks, read in full)
- Frames: `.raw/videos/srecon23emea-mccarthy-incident-is-the-way/frames/` (22 frames, all visually checked)
- Summary: [[@2023__SREcon23EMEA__The Incident Is The Way - Using Your Incidents to Win Reliability Investment]]
- Pages created: [[Niall McCarthy]], [[Afterpay]]
- Pages updated: [[インシデント重大度評価]], wiki/index.md, wiki/sources/_index.md, wiki/entities/_index.md, wiki/concepts/_index.md, wiki/hot.md
- Key insight: Severity assessment based on availability (error rate, response time) misses damage to correctness. Using the outcome (the harm users actually suffered) rather than the engineer's intent as the basis for severity judgments gives a concrete prescription for the known critique that severity is "a social coordination artifact and negotiable".

## [2026-07-01] ingest-slides | The World Blew Up But We're All Okay: Managing a massive-scale
incident at Datadog
- Source: URL https://www.usenix.org/conference/srecon23emea/presentation/de-vesine (USENIX SREcon23 EMEA, Laurent Bernaille and Laura de Vesine, Datadog)
- Slides: `.raw/slides/srecon23emea-datadog-outage/srecon23emea-datadog-outage.pdf` (76 pages; all images under pages/ visually checked)
- Transcript: `.raw/slides/srecon23emea-datadog-outage/transcript.md` (Whisper audio transcription, 584 lines, successfully generated from `media/audio.m4a`. YouTube auto-captions kept as a supplement)
- Summary: [[@2023__SREcon23EMEA__The World Blew Up but We're All Okay - How We Managed a Massive-scale Incident at Datadog]]
- Pages created: [[Laurent Bernaille]]
- Pages updated: [[Laura de Vesine]], [[Datadog]], [[Kubernetes]], [[インシデント管理]], wiki/index.md, wiki/sources/_index.md, wiki/entities/_index.md, wiki/concepts/_index.md, wiki/hot.md
- Key insight: Even with an explicit design policy of "no global network, configuration, or control plane", the shared foundation of a fleet-wide common OS distribution can itself become a de facto global failure propagation path. The talk also showed that even at the scale of more than 500 people and a Zoom call that 493 people joined and left over 14 hours, an organization can get through with a minimal skeleton of IC rotation plus self-organizing workstreams, together with trust, a blameless culture, and the ability to improvise.

## [2026-07-01] ingest-video | If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident Response Using ICS
- Source: URL https://www.youtube.com/watch?v=aOP796AlOKE (USENIX SREcon23 Americas, Thai Wood)
- Transcript: `.raw/videos/srecon23amer-wood-incident-command-system/transcript.md` (YouTube auto-captions, 67 blocks, read in full)
- Frames: `.raw/videos/srecon23amer-wood-incident-command-system/frames/` (12 frames, all visually checked)
- Summary: [[@2023__SREcon23Americas__If I Can Do It on an Ambulance - Scalable Incident Response Using ICS]]
- Pages created: [[Thai Wood]], [[Resilience Roundup]]
- Pages updated: [[Incident Commander]], [[ダッシュボードとランブックの運用]], [[GameDay]], [[Richard I.
Cook]], wiki/index.md, wiki/sources/_index.md, wiki/entities/_index.md, wiki/hot.md
- Key insight: Wood's epistemological critique that "runbooks cannot buy safety" complements Douch's operational prescription that "runbooks should be inherently temporary"; taken together, they converge on the conclusion "do not treat runbooks as permanent assets".

## [2026-07-01] ingest-video | Incident Commanders
- Source: `.raw/videos/srecon23amer-granda-incident-commanders/` (YouTube: https://www.youtube.com/watch?v=VLGxGrNnWrY, USENIX SREcon23 Americas, Vanessa Huerta Granda and Emily Ruppe, Jeli)
- Transcript: `.raw/videos/srecon23amer-granda-incident-commanders/transcript.md` (YouTube English captions, all 531 lines read)
- Frames: `.raw/videos/srecon23amer-granda-incident-commanders/frames/` (12 frames, all visually checked. The names of the two speakers were confirmed in frame-003)
- Summary: [[@2023__SREcon23Americas__Incident Commanders]]
- Pages created: [[インシデントアナリスト]]
- Pages updated: [[Vanessa Huerta Granda]], [[Emily Ruppe]], [[Jeli]], [[Incident Commander]], [[インシデント管理]]
- Speaker identity check: Vanessa Huerta Granda in this talk was confirmed to be the same person as the entity for the existing SREcon25/26 talks (while at [[Enova]]), based on the name shown in frame-003 and the matching Jeli affiliation. Because this talk is from an earlier career stage in 2023 (while at [[Jeli]]), no new entity was created and the existing page was updated.
- Key insight: The IC (Incident Commander) and the incident analyst are "similar-looking but distinct skill sets"; when one person holds both roles, the IC ends up also running the post-incident review and is likely to miss sociotechnical factors.

## [2026-07-01] ingest-slides | An Organizational Response to Incidents
- Source: `.raw/slides/2023-srecon-maguire-organizational-response-incidents/2023-srecon-maguire-organizational-response-incidents.pdf` (USENIX SREcon23 Americas, Laura Maguire, Jeli)
- Visual pages: `.raw/slides/2023-srecon-maguire-organizational-response-incidents/pages/` (101 pages, all visually checked)
- Media: no transcript (audio and video are only referenced on the official USENIX page and no direct link was confirmed; based on slide images only)
- Summary: [[@2023__SREcon23Americas__An Organizational Response to Incidents]]
- Pages created: [[Followship]]
- Pages updated: [[Laura Maguire]], [[Jeli]], [[Incident Commander]], [[Joint Activity]], [[Common Grounding]]
- Key insight: Maguire's own use of the term "adaptive choreography" in her definition of Followship confirms that the "Adaptive Choreography (Response Trio)" cited by Matt Davis in a separate talk rests on the same theoretical pillar. At the same time, Maguire herself uses the concept not as a "role-assignment model" but with a broader scope, as "the coordinated behavior of all responders other than the IC", which reveals a difference in emphasis between the secondary citation and the primary source.

## [2026-07-01] ingest-slides | Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
- Source: `.raw/slides/sre23amer-travaglini-norad-near-miss/sre23amer-travaglini-norad-near-miss.pdf` (USENIX SREcon23 Americas, Nick Travaglini, Honeycomb.io)
- Visual pages: `.raw/slides/sre23amer-travaglini-norad-near-miss/pages/` (34 pages, all visually checked)
- Media: `.raw/slides/sre23amer-travaglini-norad-near-miss/transcript.md` (Whisper not run; YouTube auto-caption fallback)
- Summary: [[@2023__SREcon23Americas__Epic Incidents of History - The 1979 NORAD Nuclear Near Miss]]
- Pages created: [[Nick Travaglini]], [[Honeycomb.io]]
- Pages updated: [[Vannevar Bush]], [[複雑システム障害論]], [[根本原因分析]], [[人的要因]]
- Key insight: Applying the Distant-Proximal / Blunt-Sharp model of Walker, Woods, and Rayo (2016) to the historical case of the 1979 NORAD false alarm supports the proposition in [[複雑システム障害論]] and [[根本原因分析]] that "the search for a single root cause is structurally untenable" as a general rule not limited to software systems.

## [2026-07-01] ingest-slides | Handover Communications in Software Operations: Findings from the Field
- Source: `.raw/slides/srecon23americas-todd-handover-communications/srecon23americas-todd-handover-communications.pdf` (USENIX SREcon23 Americas, Chad Todd, CrowdStrike)
- Visual pages: `.raw/slides/srecon23americas-todd-handover-communications/pages/` (38 pages, all visually checked)
- Media: `.raw/slides/srecon23americas-todd-handover-communications/transcript.md` (Whisper transcript, 335 lines)
- Summary: [[@2023__SREcon23Americas__Handover Communications in Software Operations - Findings from the Field]]
- Pages created: [[Chad Todd]], [[CrowdStrike]], [[Lund University]], [[David D.
Woods]], [[Emily Patterson]], [[Gary Klein]], [[Handover Communications]]
- Pages updated: [[Joint Activity]], [[Common Grounding]], [[レジリエンスエンジニアリング]]
- Key insight: Because Todd (SREcon23 Americas) explicitly attributes both Joint Activity and Common Ground to Klein et al. (2005), this could be cross-checked against the bibliographic information (Klein, Feltovich, Bradshaw, Woods 2005) already identified on the existing [[Common Grounding]] page, resolving some of the open questions on [[Joint Activity]].

## [2026-07-01] ingest-video | Dashboards and Runbooks: Scrapbooking for Engineers
- Source: URL (`https://www.usenix.org/conference/srecon22apac/presentation/douch`). The video itself requires a USENIX login, so it was obtained from the same video on YouTube (`https://www.youtube.com/watch?v=llDMcZLTPSc`)
- Transcript: `.raw/videos/llDMcZLTPSc/transcript.md` (YouTube auto-captions, English)
- Frames: `.raw/videos/llDMcZLTPSc/frames/` (20 frames, all visually checked)
- Summary: [[@2022__SREcon22APAC__Dashboards and Runbooks - Scrapbooking for Engineers]]
- Pages created: [[Colin Douch]], [[ダッシュボードとランブックの運用]]
- Pages updated: [[Cloudflare]]
- Key insight: Recorded in [[ダッシュボードとランブックの運用]] that the three-class taxonomy of runbooks (automatable / free-form / worthless) and the principle that "a good runbook should be inherently temporary" share, through the issue of tunnel vision caused by pre-instrumented telemetry, the same structure as the discussion of symptom-based alerting ([[アクショナブルアラート]]).

## [2026-07-01] ingest-slides | When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field
- Source: `.raw/slides/srecon21-butt-systems-flatline/srecon21-butt-systems-flatline.pdf`
- Visual pages: `.raw/slides/srecon21-butt-systems-flatline/pages/` (14 pages)
- Media: `.raw/slides/srecon21-butt-systems-flatline/transcript.md` (Whisper transcription)
- Summary: [[@2021__SREcon21__When Systems Flatline - Enhancing Incident Response with Learnings from the Medical Field]]
- Pages created: [[Sarah Butt]]
- Pages updated: [[Salesforce]], [[Incident Commander]]
- Key insight: Three concepts from the medical field (algorithm-guided decision making, rapid stabilization, and standardized checklists) reinforced the cross-cutting findings on [[Incident Commander]] on three points: Goldfuss's Nrrd chatbot (removing dependence on individuals), the lineage of the WHO checklist culture, and a layer distinct from Collins's Warm Blanket Fallacy (decision-making discipline).

## [2026-07-01] ingest-slides | Evolution of Incident Management at Slack
- Source:
`.raw/slides/srecon21-chapman-incident-mgmt-slack/srecon21-chapman-incident-mgmt-slack.pdf`
- Visual pages: `.raw/slides/srecon21-chapman-incident-mgmt-slack/pages/` (41 pages)
- Media: `.raw/slides/srecon21-chapman-incident-mgmt-slack/transcript.md` (Whisper transcription of YouTube audio, 301 lines)
- Summary: [[@2021__SREcon21__Evolution of Incident Management at Slack]]
- Pages created: [[Brent Chapman]]
- Pages updated: [[Slack Technologies]], [[PagerDuty]], [[インシデント管理]], [[Incident Commander]]
- Key insight: The process by which the very person who designed Google iMAG rebuilt ICS practice at Slack can be traced concretely as individual solutions to the 7 challenges of the Major IC (Area Command, per-pillar rotation, and so on).

## [2026-07-01] ingest-slides | The Math behind the Incident Aftermath: A Practical Guide to Measuring Incident Impacts (SREcon22 APAC)
- Source: `.raw/slides/srecon22apac-patel-incident-impact/srecon22apac-patel-incident-impact.pdf`
- Visual pages: `.raw/slides/srecon22apac-patel-incident-impact/pages/` (34 pages)
- Media: none (the talk video requires a USENIX login and was not obtained; no transcript)
- Summary: [[@2022__SREcon22APAC__The Math behind the Incident Aftermath]]
- Pages created: [[Ashish Patel]], [[Sriram Srinivasan]], [[PayPal]], [[インシデント影響測定]]
- Key insight: With FCI (Failed Customer Interactions), the customer impact of an incident can be quantified as the sum of the deviation from baseline predicted traffic and the count of explicit errors, and then converted into an Availability metric and a 5-axis segmentation.

## [2026-07-01] ingest-slides | Incident Response in Unfamiliar Sociotechnical Systems (SREcon20 Americas)
- Source: `.raw/slides/srecon20americas_slides_collins/srecon20americas_slides_collins.pdf`
- Visual pages: `.raw/slides/srecon20americas_slides_collins/pages/` (16 pages)
- Media: none (no transcript)
- Summary: [[@2020__SREcon20Americas__Incident Response in Unfamiliar Sociotechnical Systems]]
- Pages created: [[Morgan Collins]], [[Salesforce]]
- Pages updated: [[Incident Commander]]
- Key insight: The experience of a seasoned Incident Commander does not guarantee success in unfamiliar cross-organizational response (the Warm Blanket Fallacy). On the origin of ICS, the region and period differ from the existing source (Goldfuss, 2016), so both accounts were recorded side by side in a contradiction callout.

## [2026-07-01] ingest-video | Incident Response @ FB, Facebook's SEV Process
-
Source: https://www.usenix.org/conference/srecon16europe/program/presentation/eason
- Transcript: `.raw/videos/srecon16europe-eason-fb-sev-process/transcript.md`
- Frames: `.raw/videos/srecon16europe-eason-fb-sev-process/frames/`
- Summary: [[@2016__SREcon16__Incident Response @ FB, Facebook's SEV Process]]
- Pages created: [[@2016__SREcon16__Incident Response @ FB, Facebook's SEV Process]], [[Gareth Eason]]
- Pages updated: [[Facebook]], [[Jay Parikh]], [[Pedro Canahuati]], [[Incident Commander]], [[インシデント重大度評価]], [[クロスインシデント分析]]
- Key insight: Facebook's IMOC had established the core definition of the Incident Commander as someone who "does not do the technical fixing" (blame umbrella / human mutex) in the same year as, and independently of, New Relic (Goldfuss, 2016); and FB's 2016 warning about metrics gaming preceded by about 9 years Granda's (2025, Enova) insight that "numbers are meaningless without context".

## [2026-07-01] ingest-slides | You Can't Stop Fires with an Ambulance
- Source: `.raw/slides/srecon18asia_slides_chamberlain/srecon18asia_slides_chamberlain.pdf`
- Visual pages: `.raw/slides/srecon18asia_slides_chamberlain/pages/` (23 pages)
- Media: `.raw/slides/srecon18asia_slides_chamberlain/transcript.md` (Whisper)
- Summary: [[@2018__SREcon18Asia__You Can't Stop Fires with an Ambulance]]
- Pages created: [[Piers Chamberlain]], [[Xero]], [[Klaxon]], [[Multivac]], [[Report Card]]
- Pages updated: [[アラート管理]], [[クロスインシデント分析]]
- Key insight: Klaxon's detection based on customer page hit rates adds a new point of intervention to the existing lineage of symptom-based alerting, one that "treats customer observation as the primary signal". Chamberlain's solo manual tallying without a dedicated team shows by contrast that Granda's three elements of cross-incident analysis secure not "discovery" itself but "the continuity and scale of discovery".

## [2026-07-01] ingest-slides | Fixing On-Call When Nobody Thinks It's (Too) Broken
- Source: `.raw/slides/srecon19americas-lykke-oncall/srecon19americas-lykke-oncall.pdf`
- Visual pages: `.raw/slides/srecon19americas-lykke-oncall/pages/` (34 pages)
- Media: `.raw/slides/srecon19americas-lykke-oncall/transcript.md` (YouTube auto-caption fallback)
- Summary: [[@2019__SREcon19 Americas__Fixing On-Call When Nobody Thinks It's (Too) Broken]]
- Pages created: [[Tony Lykke]], [[Hudson River Trading]]
- Pages updated: [[アラート疲労]]
- Key insight:
The combination of a minimal technical change (adding only a filter layer), heavy over-investment in communication, and quantitative visualization of buy-in using git shortlog is a concrete counterexample to the observation that few existing alert-fatigue cases integrate incentive design with technical intervention. It also revealed a new side effect: "reducing alerts itself causes anxiety about silence".

## [2026-07-01] enrich-question | SLI/SLO textbook: enriching Chapter 8 "Applications and Extensions"
- Summary: [[wiki/questions/SLI-SLO教科書]]
- Pages updated: [[wiki/questions/SLI-SLO教科書]]
- Took stock of unused sources across the whole wiki and newly integrated 3 that fit Chapter 8: [[@2024__SRENext2024__Enabling Client-side SLO]] (Luup's client-side SLO case; rationale for adopting p75, Time Slice SLO, Multi-tiered SLOs), [[@2025__SREcon25Americas__Is the S in SRE for Security]] (proposes Security Level Objectives; new Section 8.11), and [[@2021__SREcon21__Beyond-Goldilocks-Reliability]] (critique of Goldilocks Reliability using a stationarity model; new Section 8.12).
- Also deepened existing sections: added the λ parameter trade-off and conflict classification to Section 8.3 (SLO diffusion), added the Luup 2024 case to Section 8.7 (IoT and mobility), and fully expanded Section 8.9 (carbon-aware SLOs) with CASCA's anonymization experiment, the GDS/RLDS/RDS comparison, and declarative reconfiguration speed.
- New sections: 8.11 Applications to the security domain, 8.12 Stationarity model. The numbering of existing 8.1 to 8.10 is kept (to protect cross-references). Appendix A (literature timeline), Appendix B (source list), and the frontmatter (sources/related) were updated in sync.
- Excluded: [[@2026__SREcon26 Americas__Taming the Unpredictable - Reliability in Chaos]] (little mention of SLOs) and [[@2019__SREcon19 Americas__Latency SLOs Done Right]] (closer to Chapter 2, out of scope).

## [2026-07-01] ingest-slides | nrrd 911 ic me: The Incident Commander Role (Alice Goldfuss, SREcon16 Americas, 2016)
- Source: `.raw/slides/2016__SREcon16__nrrd-911-ic-me-The-Incident-Commander-Role/2016__SREcon16__nrrd-911-ic-me-The-Incident-Commander-Role.pdf`
- Visual pages: `.raw/slides/2016__SREcon16__nrrd-911-ic-me-The-Incident-Commander-Role/pages/` (51 pages)
- Media: `.raw/slides/2016__SREcon16__nrrd-911-ic-me-The-Incident-Commander-Role/transcript.md` (Whisper audio transcription, 223 lines)
- Summary: [[@2016__SREcon16__nrrd 911 ic me - The Incident Commander Role]]
- Pages created: [[@2016__SREcon16__nrrd 911 ic me - The Incident Commander Role]] / [[Alice Goldfuss]]
- Pages updated: [[New Relic]] / [[Incident Commander]]
- Key insight: Practical evidence from the early days of ICS (New Relic 2012→2016). The earliest source in which the concrete ROI of "3 days → 3 hours" and chatbot automation with Hubot/Nrrd can be contrasted with the evolution toward a dedicated team 10 years later (Granda 2026).

## [2026-07-01] ingest-paper
| Software Engineering (Barry W. Boehm, 1976)
- Source: `.raw/papers/boehm-sw-eng-paper.pdf`
- Summary: [[@1976__IEEE-TC__Software Engineering]]
- Pages created: [[Barry W. Boehm]], [[TRW Systems and Energy Group]], [[ソフトウェアライフサイクル]], [[ソフトウェア要件工学]], [[ソフトウェア保守]]
- Pages updated: `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`, `wiki/index.md`, `wiki/hot.md`
- Key insight: The diagnosis that, as of 1976, software maintenance accounted for about 70% of lifecycle cost and that Area 2 (requirements, design, testing, and maintenance of application software) had almost no scientific basis still holds largely unchanged 50 years later.

## [2026-07-01] ingest-video | Incident Management and Chatops @ Netflix Feat Scorebot (Al Tobey, SREcon16, 2016)
- Source: `https://www.usenix.org/conference/srecon16/program/presentation/tobey`
- Audio: `.raw/videos/srecon16-tobey-incident-chatops/audio.m4a` (about 20 minutes / 1207 seconds)
- Frames: `.raw/videos/srecon16-tobey-incident-chatops/frames/` (12 frames)
- Transcript: `.raw/videos/srecon16-tobey-incident-chatops/whisper/` (Whisper small model processing in progress)
- Summary: [[@2016__SREcon16__Incident Management and Chatops @ Netflix Feat Scorebot]]
- Pages created: [[Al Tobey]], [[ChatOps]]
- Pages updated: [[Netflix]], [[インシデント管理]]
- Key insight: Scorebot (Netflix 2015-2016) is among the earliest practices of automating incident management operations through ChatOps, and is positioned alongside the machine-learning diagnosis of SAS (Microsoft 2011-2013) as one of the two pillars of pre-LLM industrial AIOps.

## [2026-07-01] ingest-slides | Unified Theory of SRE (Emil Stolarsky, SREcon22 EMEA, 2022)
- Source: `.raw/slides/srecon22emea-stolarsky-unified-theory-sre/srecon22emea-stolarsky-unified-theory-sre.pdf`
- Visual pages: `.raw/slides/srecon22emea-stolarsky-unified-theory-sre/pages/` (48 pages)
- Media: `.raw/slides/srecon22emea-stolarsky-unified-theory-sre/transcript.md` (YouTube auto-captions, 696 lines)
- Summary: [[@2022__SREcon22 EMEA__Unified Theory of SRE]]
- Pages created: [[@2022__SREcon22 EMEA__Unified Theory of SRE]] / [[Emil Stolarsky]]
- Pages updated: [[SRE]]
- Key insight: The SRE Book was written in the specific context of Google, which has 2400+ infrastructure engineers, and applying it uncritically to a startup (Default Dead) amounts to cargo culting. SRE organizational design, technology selection, SLO adoption, incident review, and on-call all need to be fundamentally rebuilt to fit the organization's scale.
## [2026-07-01] ingest-video | Notes from Production Engineering (Pedro Canahuati, SREcon15, 2015)
- Source: https://www.usenix.org/conference/srecon15/program/presentation/canahuati (YouTube: ugkkza3vKbc)
- Transcript: `.raw/videos/ugkkza3vKbc/transcript.md` (converted from auto-captions)
- Frames: `.raw/videos/ugkkza3vKbc/frames/` (39 frames)
- Summary: [[@2015__SREcon15__Notes from Production Engineering]]
- Pages created: [[@2015__SREcon15__Notes from Production Engineering]] / [[Pedro Canahuati]] / [[Jay Parikh]]
- Pages updated: [[Facebook]] / [[SRE組織変革]] / [[ポストモーテム]]
- Key insight: Facebook's SRO (2010-2014) is a textbook example of the organizational pattern in which "a centralized on-call team acts as a crutch and blocks engineering teams from becoming self-reliant"; it was disbanded on March 31, 2014 through a data-driven phased transition. The physical visualization of handing out T-shirts bearing the slogan "FIX MORE, WHINE LESS" is recorded as a tactic for embedding a postmortem culture.

## [2026-07-01] ingest-video | Keys to SRE (Ben Treynor Sloss, SREcon14, 2014)
- Source: https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre (YouTube: n4Wf14e2jxQ)
- Transcript: `.raw/videos/n4Wf14e2jxQ/transcript.md`
- Frames: not obtained (video download in progress)
- Summary: [[@2014__SREcon14__Keys to SRE]]
- Pages created: [[@2014__SREcon14__Keys to SRE]]
- Pages updated: [[Ben Treynor Sloss]] / [[SRE]] / [[エラーバジェット]] / [[ポストモーテム]]
- Key insight: The 2014 "launch on black" rule was presented as an operational practice 2 years before the SRE Book, and "blameless postmortems" had already been stated publicly in 2014, which positions it as a principle preceding Gallego's theoretical refinement (2016-2018).

## [2026-06-30] ingest-paper | Towards Intelligent Incident Management: Why We Need It and How We Make It
- Source: `.raw/papers/zchen_esecfse2020_towards.pdf.pdf`
- Summary: [[@2020__ESEC-FSE__Towards Intelligent Incident Management - Why We Need It and How We Make It]]
- Pages created: [[@2020__ESEC-FSE__Towards Intelligent Incident Management - Why We Need It and How We Make It]]
- Pages updated: [[Zhuangbin Chen]] / [[Qingwei Lin]] / [[インシデント管理]] / [[AIOps]] / [[グレイ障害]] / [[サービス依存グラフ]]
- Key insight: One of the earliest industrial AIOps empirical studies of the pre-LLM era: using more than 2 years of Microsoft production data, it showed that TTB (the time to notify all impacted services) is comparable to TTM and identified the incompleteness of downstream dependencies as the root cause.

## [2026-06-30]
ingest-paper | Software Analytics for Incident Management of Online Services: An Experience Report (ASE 2013)
- Source: `.raw/papers/ase13experience-p022-p-19538-6242493-19510-preprint.pdf`
- Summary: [[@2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]
- Pages created: [[@2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]] / [[Rui Ding]] / [[Qiang Fu]] / [[Tao Xie]]
- Pages updated: [[Jian-Guang Lou]] / [[Qingwei Lin]] / [[Dongmei Zhang]] / [[インシデント管理]] / [[ログベース障害診断]]
- Key insight: An experience report from the earliest industrial AIOps. Its three lessons (shifting to a problem-driven approach, the need for HITL design, and building trust in stages) carry through to LLM-agent-based IM research in the 2020s.

## [2026-06-30] ingest-paper | ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems (ASE 2024)
- Source: `.raw/papers/ART24_to_ASE.pdf`
- Summary: [[@2024__ASE__ART - A Unified Unsupervised Framework for Incident Management in Microservice Systems]]
- Pages created: [[@2024__ASE__ART - A Unified Unsupervised Framework for Incident Management in Microservice Systems]] / [[Mingyu Mao]]
- Pages updated: [[Yongqian Sun]] / [[Binpeng Shi]] / [[Sibo Xia]] / [[Shenglin Zhang]] / [[Dan Pei]] / [[Minghua Ma]] / [[マルチモーダル障害診断]] / [[Fault Localization]] / [[AIOps]] / [[wiki/index]] / [[wiki/hot]]
- Key insight: By learning with SSL a unified representation, the "deviation vector (ILD/SLD)", shared by AD, FT, and RCL, a label-free integration of the 3 tasks can outperform specialized supervised methods. The order in which dependencies are modeled, CHA→TEM→CAL (fine-grained → coarse-grained), has a decisive effect on performance.

## [2026-06-30] ingest-paper | Xpert: Empowering Incident Management with Query Recommendations via Large Language Models (ICSE 2024)
- Source: `.raw/papers/arxiv-2312.11988.pdf`
- Summary: [[@2024__ICSE__Xpert - Empowering Incident Management with Query Recommendations via Large Language Models]]
- Pages created: [[@2024__ICSE__Xpert - Empowering Incident Management with Query Recommendations via Large Language Models]] / [[Zhihao Yang]] / [[DSLクエリ推薦]]
- Pages updated: [[インシデント管理]] / [[LLMによる根本原因分析]] / [[wiki/index]] / [[wiki/hot]]
- Key insight:
The first demonstration of DSL query recommendation for incident management. LLM ICL surpassed fine-tuned small models with 7 examples, and Xcore quantified KQL quality that BLEU/METEOR cannot capture, from three perspectives: syntax, sub-components, and output schema.

## [2026-06-30] ingest-paper | AI Assistants for Incident Lifecycle in a Microservice Environment: A Systematic Literature Review
- Source: `.raw/papers/arxiv-2410.04334.pdf`
- Summary: [[@2024__arXiv__AI Assistants for Incident Lifecycle in a Microservice Environment - A Systematic Literature Review]]
- Pages created: [[Dahlia Ziqi Zhou]] / [[Marios Fokaefs]] / [[York University]] / [[@2024__arXiv__AI Assistants for Incident Lifecycle in a Microservice Environment - A Systematic Literature Review]]
- Pages updated: [[インシデント管理]] / [[根本原因分析]] / [[異常検知]] / [[LLMによる根本原因分析]]
- Key insight: In this SLR covering 2021 to 2024, the Detect phase is the largest at 54.8%. Prepare and Post-incident together account for only 12.9%, quantifying the research gap. LLM methods were the most common at 38.7%, but only 5 of 31 studies ran a user study, an imbalance in evaluation that remains a problem.

## [2026-06-30] ingest-paper | X-lifecycle Learning for Cloud Incident Management using LLMs
- Source: `.raw/papers/arxiv-2404.03662.pdf`
- Summary: [[@2024__FSE__X-lifecycle Learning for Cloud Incident Management using LLMs]]
- Pages created: [[Aditya Singh]] / [[@2024__FSE__X-lifecycle Learning for Cloud Incident Management using LLMs]]
- Pages updated: [[Drishti Goel]] / [[Fiza Husain]] / [[Anjaly Parayil]] / [[Saravan Rajmohan]] / [[Supriyo Ghosh]] / [[Xuchao Zhang]] / [[Chetan Bansal]] / [[インシデント管理]] / [[根本原因分析]] / [[クラウドモニタリング]]
- Key insight: Supplementing with X-lifecycle data from multiple SDLC stages (service dependencies, functional descriptions) improves LLM RCA, but only information that semantically corresponds to the task is effective, and combining it with in-context examples is essential.

## [2026-06-30] ingest-paper | FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems (ICSE-SEIP 2024)
- Source: `.raw/papers/arxiv-2402.17583.pdf`
- Summary: [[@2024__ICSE-SEIP__FaultProfIT - Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems]]
- Pages created: [[@2024__ICSE-SEIP__FaultProfIT - Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems]], [[障害パターンプロファイリング]]
- Pages
updated: [[Junjie Huang]], [[Michael R. Lyu]], [[Zhuangbin Chen]], [[Jinyang Liu]], [[Yichen Li]], [[Jiazhen Gu]], [[Zhihan Jiang]], [[ポストモーテム]], [[障害傾向分析]], `wiki/index`, `wiki/hot`, `wiki/log`, `wiki/sources/_index`, `wiki/concepts/_index`
- Key insight: Six months of production operation demonstrated that automating fault pattern profiling is the most direct way to overcome "severity bias" (the problem that manual postmortems concentrate on S1-S3). Graphormer's global attention over the 5-level hierarchy graph substantially outperforms the local convolutions of GCN and GAT; capturing the taxonomy structure as a whole translates directly into classification accuracy.

## [2026-06-30] ingest-paper | Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems (EuroSys 2023)
- Source: `.raw/papers/csi-failures.pdf`
- Summary: [[@2023__EuroSys__Fail through the Cracks - Cross-System Interaction Failures in Modern Cloud Systems]]
- Pages created: [[@2023__EuroSys__Fail through the Cracks - Cross-System Interaction Failures in Modern Cloud Systems]], [[クロスシステムインタラクション障害]], [[Lilia Tang]], [[Chaitanya Bhandari]], [[Indranil Gupta]]
- Pages updated: [[Tianyin Xu]], [[Purdue University]], [[分散システム障害]], [[クラウドインシデント]], `wiki/index`, `wiki/hot`, `wiki/log`, `wiki/sources/_index`, `wiki/entities/_index`, `wiki/concepts/_index`
- Key insight: CSI failures are a new category in which "a failure occurs even though both systems are correct according to their specifications". They cause 20% of cloud incidents, and because all existing fault-tolerance mechanisms protect the inside of a single system, they are powerless against CSI failures. That connector modules (less than 5% of the codebase) account for 86% of the fixes points to a practical entry point for cross-system testing.

## [2026-06-30] ingest-paper | Metastable Failures in Distributed Systems (HotOS 2021)
- Source: `.raw/papers/hotos21-s11-bronson.pdf`
- Summary: [[@2021__HotOS__Metastable Failures in Distributed Systems]]
- Pages created: [[@2021__HotOS__Metastable Failures in Distributed Systems]], [[Nathan Bronson]], [[Abutalib Aghayev]], [[Aleksey Charapko]], [[Timothy Zhu]], [[Rockset]], [[The Pennsylvania State University]], [[University of New Hampshire]]
- Pages updated: [[メタ安定障害]] (added cross-cutting findings, open questions, and the 3-state definition), `wiki/index`, `wiki/hot`, `wiki/log`, `wiki/sources/_index`, `wiki/entities/_index`, `wiki/concepts/_index`
- Key insight: The core of this paper is the paradox that "reliability-enhancing features such as retries, caches, and redundancy become a breeding ground for the sustaining effect". Together with the proposal of [[グレイ障害]] (HotOS 2017), it is one of the two pillars that formalized the "invisible failure" patterns of distributed systems. The gap between hidden capacity and advertised capacity is the seed of accidents, and a link imbalance failure that had gone unresolved for more than 2 years was fixed with a one-line change.

## [2026-06-30] ingest-paper | Gray Failure: The Achilles' Heel of Cloud-Scale Systems (HotOS 2017)
- Source: `.raw/papers/Huang-et-al.-2017---Gray-failure---The-achilles-heel-of-cloud-scale-systems.pdf`
- Summary: [[@2017__HotOS__Gray Failure - The Achilles' Heel of Cloud-Scale Systems]]
- Pages created: [[@2017__HotOS__Gray Failure - The Achilles' Heel of Cloud-Scale Systems]], [[Jacob R. Lorch]], [[Murali Chintalapati]], [[Randolph Yao]], [[差分可観測性]]
- Pages updated: [[Peng Huang]], [[Chuanxiong Guo]], [[Lidong Zhou]], [[Yingnong Dang]], [[Johns Hopkins University]], [[グレイ障害]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], `wiki/sources/_index`, `wiki/entities/_index`, `wiki/concepts/_index`
- Key insight: The root cause of the "paradox that high redundancy lowers availability" is that an Observer built on a fail-stop assumption misses gray failures. The remedy is multidimensional monitoring: moving from a single heartbeat to a set of probes that approximate what the application observes. The concept of differential observability, proposed in 2017, has been inherited by most subsequent implementations, including SuperBench, GrayScope, and Harp.

## [2026-06-30] ingest-paper | mTCP: a Highly Scalable User-level TCP Stack for Multicore Systems (NSDI 2014)
- Source: `.raw/papers/nsdi14-paper-jeong.pdf`
- Summary: [[@2014__NSDI__mTCP - a Highly Scalable User-level TCP Stack for Multicore Systems]]
- Pages created: [[@2014__NSDI__mTCP - a Highly Scalable User-level TCP Stack for Multicore Systems]], [[EunYoung Jeong]], [[Dongsu Han]], [[ユーザーレベルTCPスタック]]
- Pages updated: [[KyoungSoo Park]], [[KAIST]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], `wiki/sources/_index`, `wiki/entities/_index`, `wiki/concepts/_index`
- Key insight: A user-level implementation that integrates packet I/O with system call batching can, without modifying the kernel, achieve performance exceeding the sum of the individual optimizations in prior work. The key is to amortize the cost of context switches over multiple events through bidirectional batching.

## [2026-06-30] ingest-paper | An Updated Performance Comparison of Virtual Machines and Linux Containers (ISPASS 2015)
- Source: `.raw/papers/Felter-et-al.-2015---An-updated-performance-comparison-of-virtual-machines-and-linux-containers.pdf`
- Summary: [[@2015__ISPASS__An Updated Performance
Comparison of Virtual Machines and Linux Containers]] - Pages created: [[@2015__ISPASS__An Updated Performance Comparison of Virtual Machines and Linux Containers]], [[Wes Felter]], [[コンテナ仮想化]] - Pages updated: [[Docker]], [[IBM Research]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], `wiki/sources/_index`, `wiki/entities/_index`, `wiki/concepts/_index` - Key insight: コンテナはほぼ全ベンチマークで KVM と同等以上の性能。KVM の主要コストはランダム I/O(QEMU 経由で 50% 低下)とネットワーク遅延(+30µs/トランザクション)。Docker の落とし穴は AUFS と NAT。 ## [2026-06-30] ingest | dsync: Efficient Block-wise Synchronization of Multi-Gigabyte Binary Data (LISA13) - Source: `.raw/articles/knauth-2026-06-30.md` - Summary: [[@2013__LISA__dsync - Efficient Block-wise Synchronization of Multi-Gigabyte Binary Data]] - Pages created: [[@2013__LISA__dsync - Efficient Block-wise Synchronization of Multi-Gigabyte Binary Data]], [[Thomas Knauth]], [[ブロックレベル差分同期]] - Pages updated: [[Christof Fetzer]], [[TU Dresden]], [[ファイルレベル同期]] - Key insight: rsync のチェックサムベース事後検出をカーネル内ブロック追跡に置き換えることで最大 100 倍の同期高速化。同じ LISA13 の Marc Merlin 論文（[[ファイルレベル同期]]）と対比すると「同期粒度（ファイル vs ブロック）」という設計軸の両極が見える。 ## [2026-06-30] ingest-paper | Scaling Memcache at Facebook - Source: `.raw/papers/nsdi13-final170_update.pdf` - Summary: [[@2013__NSDI__Scaling Memcache at Facebook]] - Pages created: [[分散キャッシュ]], [[Rajesh Nishtala]], [[@2013__NSDI__Scaling Memcache at Facebook]] - Pages updated: [[Facebook]], [[一貫性ハッシュ法]], [[Incast]], [[結果整合性]] - Key insight: リースメカニズムが thundering herd 時のピーク DB クエリを 17K/s → 1.3K/s に削減。Gutter プールで障害時失敗率 99% 削減。キャッシュ整合性を「調整可能なパラメータ」として扱うベストエフォート方針が、性能・可用性と整合性のバランスを実現した。 ## [2026-06-30] ingest | netmap: A Novel Framework for Fast Packet I/O (USENIX ATC '12) - Source: `.raw/articles/netmap-fast-packet-io-2026-06-30.md` - Summary: [[@2012__USENIX-ATC__netmap A Novel Framework for Fast Packet IO]] - Pages created: [[@2012__USENIX-ATC__netmap A Novel Framework for Fast Packet IO]], [[Luigi Rizzo]], [[netmap]], [[カーネルバイパスネットワーキング]], [[ゼロコピーネットワーキング]] - Pages 
updated: [[wiki/index.md]] - Key insight: 共有メモリリング + プリアロケーション + バッチ syscall の三原則で、専用ハードウェアも OS 変更も不要のまま 10 Gbit/s 線速（14.88 Mpps）を達成。カーネル保護機構を捨てずにゼロコピーを実現した点が DPDK との本質的差異。 ## [2026-06-30] ingest | SSLShader: Cheap SSL Acceleration with Commodity Processors (NSDI 2011) - Source: `.raw/articles/sslshader-cheap-ssl-acceleration-commodity-processors-2026-06-30.md` - Summary: [[@2011__NSDI11__SSLShader - Cheap SSL Acceleration with Commodity Processors]] - Pages created: [[@2011__NSDI11__SSLShader - Cheap SSL Acceleration with Commodity Processors]], [[Keon Jang]], [[Sangjin Han]], [[Seungyeop Han]], [[Sue Moon]], [[SSL TLS アクセラレーション]] - Pages updated: [[KyoungSoo Park]]（SSLShader 追記）, [[KAIST]]（SSLShader・新著者追記） - Key insight: 2011 年時点でコモディティ GPU による RSA アクセラレーションが最速 CPU 比 22〜31 倍を達成し、高価な専用アプライアンスに匹敵することを実証した。[[KyoungSoo Park]] グループの「ハードウェア限界突破」研究系譜の第一弾。 ## [2026-06-30] ingest-paper | Live Upgrading Thousands of Servers from an Ancient Red Hat Distribution to 10 Year Newer Debian Based One (LISA '13) - Source: `.raw/papers/lisa13-merlin.pdf` - Summary: [[@2013__LISA__Live Upgrading Thousands of Servers from an Ancient Red Hat Distribution to 10 Year Newer Debian Based One]] - Pages created: [[@2013__LISA__Live Upgrading Thousands of Servers from an Ancient Red Hat Distribution to 10 Year Newer Debian Based One]], [[Marc Merlin]], [[Richard Gooch]], [[ファイルレベル同期]], [[ライブアップグレード]] - Pages updated: [[Google]](インフラ管理セクション追記) - Key insight: パッケージマネージャーを迂回するファイルレベル同期は「いかなる状態からも回復できる」べき等性によってフリート管理の信頼性基盤となり、その基盤の上で段階的 rpm→deb 変換・ELF バイナリパッチという創意ある手法により OS 全体をフラグデーなしで入れ替えることができた。 ## [2026-06-30] ingest | Mackerelを支える時系列データベース技術 (yuuk.io 2015) - Source: `.raw/articles/high-performance-graphite-2026-06-30.md` - Summary: [[@2015__yuuk.io__High-Performance-Graphite]] - Pages created: [[Graphite]] - Pages updated: [[Mackerel]], [[Yuuki Tsubouchi]], [[時系列データベース]] - Key insight: whisper の RRD 型（固定サイズ・ローリング・精度劣化）と carbon-cache のマルチコア限界が [[HeteroTSDB]] 移行の設計的動機として連なる一次記録。 ## [2026-06-30] 
ingest | ウェブシステムの運用自律化に向けた構想 (yuuk.io 2017) - Source: `.raw/articles/the-concept-of-autonomous-web-system-2026-06-30.md` - Summary: [[@2017__yuuk.io__ウェブシステムの運用自律化に向けた構想]] - Pages created: [[@2017__yuuk.io__ウェブシステムの運用自律化に向けた構想]], [[Experimentable Infrastructure]] - Pages updated: [[Yuuki Tsubouchi]], [[Hatena]], [[SRE]] - Key insight: 「SRE = 信頼性を制約条件として費用を最小にする最適化問題」という定義と、観測・制御・実験の3軸で構成される Experimentable Infrastructure の初出。2017年時点の Yuuki Tsubouchi の自律運用ビジョンの起点。 ## [2026-06-30] ingest-article | 2015年Webサーバアーキテクチャ序論（yuuk.io 2015） - Source: `.raw/articles/2015-webserver-architecture-2026-06-30.md` - Summary: [[@2015__yuuk.io__2015年Webサーバアーキテクチャ序論]] - Pages created: [[Webサーバアーキテクチャ]] - Pages updated: [[C10K問題]], [[epoll]], [[Yuuki Tsubouchi]] - Key insight: Web サーバアーキテクチャの 5 モデル分類(シリアル・プリフォーク・マルチスレッド・イベント駆動・ハイブリッド)を体系化し、C10K問題・epoll との設計レイヤーでの連接を確立した。 ## [2026-06-30] ingest-article | Webシステムにおけるデータベース接続アーキテクチャ概論（yuuk.io 2015） - Source: `.raw/articles/architecture-of-database-connection-2026-06-30.md` - Summary: [[@2015__yuuk.io__architecture-of-database-connection]] - Pages created: [[データベース接続モデル]], [[コネクションプーリング]], [[PgBouncer]], [[Pgpool]] - Pages updated: [[Yuuki Tsubouchi]] - Key insight: RDBMSは元々少数クライアントとのステートフル通信向けで多数クライアントとはインピーダンスミスマッチがある。PostgreSQL=プロセスモデルでプロキシ型プーリング必須、MySQL=スレッドモデルで都度接続が多い、という2大DBの設計差が接続戦略を大きく左右する。 ## [2026-06-30] ingest-article | Linux マルチコアスケールカーネルチューニング（yuuk.io 2015） - Source: `.raw/articles/linux-networkstack-tuning-rfs-2015-03-31.md` - Summary: [[@2015__yuuk.io__linux-networkstack-tuning-rfs]] - Pages created: [[@2015__yuuk.io__linux-networkstack-tuning-rfs]], [[RFS（Receive Flow Steering）]], [[RPS（Receive Packet Steering）]], [[RSS（Receive Side Scaling）]] - Pages updated: [[Yuuki Tsubouchi]], `wiki/index.md`, `wiki/hot.md` - Key insight: RFS は Linux 2.6.35+ でハードウェア非依存にシングルキュー NIC の CPU 偏りを解消する最もコスパの高いカーネルチューニング手法。 ## [2026-06-30] re-ingest-article | サーバーレスアーキテクチャ再考 (yuuk.io blog 2019) - Source: 
`.raw/articles/rethinking-serverless-architecture-2019-09-11.md` - Summary: [[@2019__yuuk.io__Rethinking-Serverless-Architecture]] - Pages created: [[Knative]], [[OpenFaaS]] - Pages updated: [[@2019__yuuk.io__Rethinking-Serverless-Architecture]]（後記「サーバーレスデータベース展望」追記） - Key insight: 記事末尾の @tzkb 議論から生まれた「DB のデータ構造を BaaS+FaaS に分解する」アイデアは、その後の Neon・Aurora Serverless 等のサーバーレス指向クラウドネイティブ DB と方向性が一致しており、2019 年時点での先見性がある。 ## [2026-06-30] wiki-ingest | 工学としてのSRE再訪 — SRE NEXT 2024 登壇後記 (yuuk.io blog) - Source: `.raw/articles/srenext2024-blog-2026-06-30.md` - Summary: [[@2024__yuuk.io__SRE-NEXT-2024]] - Pages created: なし（既存ページに補完追記） - Pages updated: [[@2024__yuuk.io__SRE-NEXT-2024]]（イベント統計・準備プロセス・Gist リンク追記）、[[アラート疲労]]（オープンチャレンジとの接続を横断的知見に追記） - Key insight: SRE コミュニティが 2016〜2026 年に多数の解法を報告してきた「オオカミ少年アラート問題」が、2024 年時点でも SRE の工学的未解決課題と位置づけられている。技術・組織的知識の蓄積と個々の組織への定着の間にギャップが存在することを示す。 ## [2026-06-30] ingest-paper | An AI system to help scientists write expert-level empirical software (Nature 2026) - Source: `.raw/papers/arxiv-2509.06503.pdf` - Summary: [[@2026__Nature__An AI system to help scientists write expert-level empirical software]] - Pages created: [[@2026__Nature__An AI system to help scientists write expert-level empirical software]], [[LLMドリブンコード探索]], [[スコアリング可能タスク]], [[Michael P. 
Brenner]] - Pages updated: [[コードLLM]], [[DeepMind]], [[Google Research]] - Key insight: ERA（LLM + PUCT 木探索）が複数の科学ドメイン（scRNA-seq・COVID-19 疫学・時系列・地理空間・神経科学・数値積分）で人手の最高水準を超えた初のシステム。「同一 LLM 推論コストなら Best-of-N より木探索が有効」という命題を 5 種の LLM で実証（Table 1）。 ## [2026-06-30] ingest-paper | Towards end-to-end automation of AI research (Nature 2026) - Source: `.raw/papers/s41586-026-10265-5.pdf` - Summary: [[@2026__Nature__Towards end-to-end automation of AI research]] - Pages created: [[@2026__Nature__Towards end-to-end automation of AI research]], [[AI研究自動化]], [[エージェント型科学探索]], [[自動査読]], [[Chris Lu]], [[Cong Lu]], [[Robert Tjarko Lange]], [[Yutaro Yamada]], [[Shengran Hu]], [[Jakob Foerster]], [[David Ha]], [[Jeff Clune]] - Pages updated: [[Sakana AI]] - Key insight: 完全AI生成論文がトップML会議のワークショップ(ICLR 2025 ICBINB, 採択率70%)の正式査読を通過した初事例。自動査読者の均衡精度が人間と同等(69% vs 66%)。基盤モデル世代と計算量の両軸でスケーリングする。 ## [2026-06-30] ingest-slides | Enabling Client-side SLO (SRE NEXT 2024, Wataru Tsuda) - Source: `.raw/slides/20240804_SRENEXT2024/20240804_SRENEXT2024.pdf` - Visual pages: `.raw/slides/20240804_SRENEXT2024/pages/` (41 ページ全ページ画像確認済み) - Media: none（transcript なし） - Summary: [[@2024__SRENext2024__Enabling Client-side SLO]] - Pages created: [[@2024__SRENext2024__Enabling Client-side SLO]] - Pages updated: [[Wataru Tsuda]], [[Luup]], [[SLI-SLO段階的導入]] - Key insight: クライアントサイド SLO は BLE 操作・Firestore 直接通信という「API を迂回するユーザー体験」を捕捉するために必要。p75 SLI の根拠を Core Web Vitals（LCP Good = 75%ile）から借用することで PdM/SWE との合意コストを下げられる。 ## [2026-06-30] ingest-slides | Practices for Making Alerts Actionable (SRE NEXT 2020, Sohei Iwahori) - Source: `.raw/slides/practices-for-making-alerts-actionable/practices-for-making-alerts-actionable.pdf` - Visual pages: `.raw/slides/practices-for-making-alerts-actionable/pages/` (41 ページ、p.1-24 画像確認、p.25-41 テキスト抽出で補完) - Media: none（transcript なし） - Summary: [[@2020__SRENext2020__Practices for Making Alerts Actionable]] - Pages created: [[@2020__SRENext2020__Practices for Making Alerts Actionable]] - Pages 
updated: [[Sohei Iwahori]], [[GREE, Inc]], [[アクショナブルアラート]], [[アラート疲労]] - Key insight: 振り分けの3段階（Slack通知のみ/JIRA自動起票/PagerDuty）と定型アクション自動化（Alert Operator）の組み合わせが、月300件超のピークを月180件前後に安定させた（約4割削減）。 ## [2026-06-30] ingest-slides | 電動マイクロモビリティのシェアサービス「LUUP」におけるEnabling SLOの実践 (SRE NEXT 2023, Wataru Tsuda) - Source: `.raw/slides/20230929_SRENEXT2023/20230929_SRENEXT2023.pdf` - Visual pages: `.raw/slides/20230929_SRENEXT2023/pages/` (35 ページ全件確認) - Media: none（transcript なし） - Summary: [[@2023__SRENext2023__電動マイクロモビリティのシェアサービス「LUUP」におけるEnabling SLOの実践]] - Pages created: [[@2023__SRENext2023__電動マイクロモビリティのシェアサービス「LUUP」におけるEnabling SLOの実践]], [[Wataru Tsuda]], [[Luup]] - Pages updated: [[サービスレベル目標]]（IoT CMC 概念 × CUJ 拡張・Enabling SLO 組織パターンの横断的知見追記） - Key insight: IoT 向け SLI 設計では CUJ ではなく CMC（Critical Machine Communication）が起点になる——「マシンが期待通りに動作できる状態」を計測する Luup 独自概念で、物理デバイス SRE の SLI 定義の核にある。 ## [2026-06-30] ingest-slides | Who owns the Service Level? (SRE NEXT 2022, 近藤武士) - Source: `.raw/slides/srenext2022-chaspy-who-owns-service-level/srenext2022-chaspy-who-owns-service-level.pdf` - Visual pages: `.raw/slides/srenext2022-chaspy-who-owns-service-level/pages/` (79 ページ全件確認) - Media: none（transcript なし） - Summary: [[@2022__SRENext2022__Who owns the Service Level?]] - Pages created: [[@2022__SRENext2022__Who owns the Service Level?]], [[近藤武士]], [[Recruit]], [[スタディサプリ]] - Pages updated: [[サービスレベル目標]]（「定義・観察」と「行動」の分離という新前提条件追記）, [[エラーバジェット]]（Error Budget Policy 行動定着の組織的前提・リリースストップ幻想の 2 知見追記） - Key insight: SLI/SLO の定義・観察文化は醸成できても Error Budget Policy に従った行動まで至らない失敗の根本原因は「非機能要求への予算・権限が開発チームになかった」組織的制約であり、技術戦略グループによる 1:1:1 予算配分という制度的変更で初めて解決された——測定技術の問題でなく組織設計の問題。 ## [2026-06-30] ingest-slides | プロダクトオーナーとしてSLOに向き合う〜Mackerelチームの事例〜 (SRE NEXT 2023, 渡辺起) - Source: `.raw/slides/2023__SRENext2023__PO-to-SLO-Mackerel/2023__SRENext2023__PO-to-SLO-Mackerel.pdf` - Visual pages: `.raw/slides/2023__SRENext2023__PO-to-SLO-Mackerel/pages/` (39 ページ全件確認) - Media: none（transcript なし） - Summary: 
[[@2023__SRENext2023__プロダクトオーナーとしてSLOに向き合う〜Mackerelチームの事例〜]] - Pages created: [[@2023__SRENext2023__プロダクトオーナーとしてSLOに向き合う〜Mackerelチームの事例〜]], [[渡辺起]] - Pages updated: [[Mackerel]]（SLO 導入事例・チーム構成・DORA 位置・Error Budget Policy の実態を追記）, [[サービスレベル目標]]（PO 視点の横断的知見・ユーザー主語定義・仮値スタートパターン追記）, [[エラーバジェット]]（「最初は最も緩いアクション」パターンの3社横断知見追記） - Key insight: Mackerel PO が SLO 導入で得た最大の恩恵は「判断が減ること」——数値基準によりチームが自律判断できる状態を作ることが、プロダクトオーナーにとっての SLO の主動機であると PO 目線で明示した初期事例。 ## [2026-06-30] ingest-slides | Measuring Availability the Player Focused Way (SREcon25 Americas, Maxfield Stewart) - Source: `.raw/slides/sre25amer_slides-stewart/sre25amer_slides-stewart.pdf` - Visual pages: `.raw/slides/sre25amer_slides-stewart/pages/` (50 ページ全件確認) - Media: none（transcript なし） - Summary: [[@2025__SREcon25Americas__Measuring Availability the Player Focused Way - How Riot Games Changed Its Availability Culture]] - Pages created: [[@2025__SREcon25Americas__Measuring Availability the Player Focused Way - How Riot Games Changed Its Availability Culture]], [[Maxfield Stewart]], [[Riot Games]], [[Derek Defields]], [[Player Journey]] - Pages updated: [[サービスレベル目標]]（CCU 重み付き可用性計測・CEO OKR 定着手法の横断的知見追記） - Key insight: CCU 重み付きの可用性指標であるプレイヤー分（Player Minutes）を用いて「プレイヤー体験（Player Journey）」単位で SLO を定義し、CEO OKR に接続したことで、Riot Games の可用性は 3 年で 97-98% から 99% に改善した。 ## [2026-06-30] ingest-slides | DO, RE, Me: Measuring the Effectiveness of Site Reliability Engineering (SREcon22 Americas, Dave Stanke) - Source: `.raw/slides/2022__SREcon22Americas__DO-RE-Me/2022__SREcon22Americas__DO-RE-Me.pdf` - Visual pages: `.raw/slides/2022__SREcon22Americas__DO-RE-Me/pages/` (49 ページ全件確認) - Media: none（transcript なし） - Summary: [[@2022__SREcon22Americas__DO RE Me - Measuring the Effectiveness of Site Reliability Engineering]] - Pages created: [[@2022__SREcon22Americas__DO RE Me - Measuring the Effectiveness of Site Reliability Engineering]] - Pages updated: [[Dave Stanke]], [[DORA]], [[SRE]] - Key insight: DORA 2021 が SRE を初めて定量調査し、信頼性が 
Software Delivery Performance のビジネス成果への影響を乗算的に増幅する「force multiplier」であることを実証した。 ## [2026-06-30] ingest-slides | Is the S in SRE for "Security"? (SREcon25 Americas, John Benninghoff) - Source: `.raw/slides/2025__SREcon25Americas__Is-the-S-in-SRE-for-Security/2025__SREcon25Americas__Is-the-S-in-SRE-for-Security.pdf` - Visual pages: `.raw/slides/2025__SREcon25Americas__Is-the-S-in-SRE-for-Security/pages/` (29 ページ全件確認) - Media: none（transcript なし） - Summary: [[@2025__SREcon25Americas__Is the S in SRE for Security]] - Pages created: [[@2025__SREcon25Americas__Is the S in SRE for Security]], [[John Benninghoff]], [[Security Differently]], [[Safety-II]], [[Security Level Objectives]] - Pages updated: (なし) - Key insight: セキュリティのトップ2コントロール（攻撃面管理・パッチ頻度）はSREの中核業務と同一。Safety-IIの「分布の右シフト」モデルがセキュリティ = 組織パフォーマンスの一側面という捉え直しを支える。 ## [2026-06-30] ingest-slides | How to SRE When Everything is Already on Fire (SREcon19 EMEA, Alex Hidalgo + Alex Lee) - Source: `.raw/slides/2019__SREcon19EMEA__How-to-SRE-When-Everythings-Already-on-Fire/2019__SREcon19EMEA__How-to-SRE-When-Everythings-Already-on-Fire.pdf` - Visual pages: `.raw/slides/2019__SREcon19EMEA__How-to-SRE-When-Everythings-Already-on-Fire/pages/` (105 ページ全件確認) - Media: none（transcript なし） - Summary: [[@2019__SREcon19EMEA__How to SRE When Everything is Already on Fire]] - Pages created: [[@2019__SREcon19EMEA__How to SRE When Everything is Already on Fire]], [[Alex Hidalgo]], [[Alex Lee]] - Pages updated: [[Squarespace]], [[アラート疲労]], [[サービスレベル目標]], [[エラーバジェット]], [[ポストモーテム]] - Key insight: エラーバジェット枯渇が「全力対処の組織的許可証」として機能した実例と、ICS による 37 時間インシデント管理を実証した。 ## [2026-06-30] ingest-slides | Beyond Sequential: A Recipe for Async Pipeline Observability and Alerting (SREcon25 Americas) - Source: `.raw/slides/2025__SREcon25Americas__Beyond-Sequential/2025__SREcon25Americas__Beyond-Sequential.pdf` - Visual pages: `.raw/slides/2025__SREcon25Americas__Beyond-Sequential/pages/` (50 pages) - Media: 
`.raw/slides/2025__SREcon25Americas__Beyond-Sequential/transcript.md`（YouTube 自動字幕 EN から変換、機械精度） - Summary: [[@2025__SREcon25Americas__Beyond Sequential - A Recipe for Async Pipeline Observability and Alerting]] - Pages created: [[@2025__SREcon25Americas__Beyond Sequential - A Recipe for Async Pipeline Observability and Alerting]], [[Jash Mistry]], [[Gabriela Medvetska]] - Pages updated: [[eBay]], [[サービスレベル目標]], [[エラーバジェット]], [[イベントベースSLO]], [[アラート疲労]] - Key insight: 非同期パイプラインでは RETRY を Valid Events から除外することが可用性 SLI の設計基盤となる。レイテンシ SLI は複数ホップの累積時間を Prometheus histogram に記録する end-to-end 計装が必要。マルチウィンドウ・マルチバーンレートアラートに加え、メトリクス欠損を独立検知するデータ損失アラートを必ず併置する。 ## [2026-06-30] ingest-slides | 9 Things You Should Do When Starting to Use SLOs (SREcon23 EMEA) - Source: `.raw/slides/2023__SREcon23EMEA__9-Things-You-Should-Do-When-Starting-to-Use-SLOs/2023__SREcon23EMEA__9-Things-You-Should-Do-When-Starting-to-Use-SLOs.pdf` - Visual pages: `.raw/slides/2023__SREcon23EMEA__9-Things-You-Should-Do-When-Starting-to-Use-SLOs/pages/` (40 pages) - Media: transcript なし（audio.m4a 生成済み、Whisper 未完了） - Summary: [[@2023__SREcon23EMEA__9 Things You Should Do When Starting to Use SLOs]] - Pages created: [[@2023__SREcon23EMEA__9 Things You Should Do When Starting to Use SLOs]], [[Sal Furino]], [[SLODLC]] - Pages updated: [[サービスレベル目標]], [[SLI-SLO段階的導入]] - Key insight: SLO 導入の 9 アドバイスを 3 カテゴリで体系化。「成功率 > エラー率」の SLI 設計原則、ステークホルダー別時間窓（24h/14D/Monthly）、SLODLC の 5 フェーズライフサイクル、「Observability Without Action is Just Storage」の格言。 ## [2026-06-30] ingest-slides | Run, Walk, Crawl, or How We Failed Our Way to SLO Readiness (SREcon25 EMEA) - Source: `.raw/slides/2025__SREcon25EMEA__Run-Walk-Crawl/2025__SREcon25EMEA__Run-Walk-Crawl.pdf` - Visual pages: `.raw/slides/2025__SREcon25EMEA__Run-Walk-Crawl/pages/` (51 pages) - Media: transcript なし（YouTube 動画 URL 未取得） - Summary: [[@2025__SREcon25EMEA__Run Walk Crawl or How We Failed Our Way to SLO Readiness]] - Pages created: [[@2025__SREcon25EMEA__Run Walk Crawl or How We 
Failed Our Way to SLO Readiness]], [[Rob Durst]], [[Spring Health]] - Pages updated: [[サービスレベル目標]], [[SLI-SLO段階的導入]] - Key insight: SLO 導入は社会技術的問題であり、オブザーバビリティ基盤が揃っていても「所有権・標準プロセス・保護時間」の 3 条件が欠ければ何度でも頓挫する。4 条件「SLO 準備度チェックリスト」は段階的導入の前提条件診断ツールとして機能する。 ## [2026-06-30] ingest-slides | Not All Minutes Are Equal: The Secret behind SLO Adoption Failure (SREcon23 Americas) - Source: `.raw/slides/2023__SREcon23Americas__Not-All-Minutes-Are-Equal/2023__SREcon23Americas__Not-All-Minutes-Are-Equal.pdf` - Visual pages: `.raw/slides/2023__SREcon23Americas__Not-All-Minutes-Are-Equal/pages/` (40 pages) - Media: audio.m4a あり。transcript なし（Whisper 失敗・YouTube 字幕 HTTP 429） - Summary: [[@2023__SREcon23Americas__Not-All-Minutes-Are-Equal]] - Pages created: [[@2023__SREcon23Americas__Not-All-Minutes-Are-Equal]], [[Michael Goins]], [[Troy Koss]], [[イベントベースSLO]] - Pages updated: [[Capital One]], [[エラーバジェット]] - Key insight: 時間スライス SLO は分をリクエスト数に関係なく 1 票として扱うため、ピーク時の深刻インシデントがバジェット消費に反映されず、イベントベース集計に切り替えると深刻度との比例関係が回復する。 ## [2026-06-30] ingest-slides | Measuring Reliability: What Got Us Here Won't Get Us There (SREcon22 EMEA) - Source: `.raw/slides/2022__SREcon22EMEA__Measuring-Reliability/2022__SREcon22EMEA__Measuring-Reliability.pdf` - Visual pages: `.raw/slides/2022__SREcon22EMEA__Measuring-Reliability/pages/` (42 pages) - Media: none（transcript なし、YouTube 動画 URL 取得不可） - Summary: [[@2022__SREcon22EMEA__Measuring Reliability - What Got Us Here Won't Get Us There]] - Pages created: [[@2022__SREcon22EMEA__Measuring Reliability - What Got Us Here Won't Get Us There]] - Pages updated: [[Štěpán Davidovič]] / [[サービスレベル目標]] / [[エラーバジェット]] - Key insight: SLI/SLO モデルは「信頼性測定」ではなく特定の問いへの回答モデルであり、ステークホルダー（オンコール〜CEO）ごとに全く異なる時間窓と対象 SLI 数が必要。現場では既に SLO ウィンドウと目標値を無視したアドホックモデルを構築しており、これを形式化することが次のステップ。 ## [2026-06-30] ingest-slides | HPC Downtime Budgets: Moving SRE Practice to the Rest of the World (SREcon16 Europe) - Source: 
`.raw/slides/2016__SREcon16Europe__Downtime-Budgets/2016__SREcon16Europe__Downtime-Budgets.pdf` - Visual pages: `.raw/slides/2016__SREcon16Europe__Downtime-Budgets/pages/` (37 pages) - Media: `.raw/slides/2016__SREcon16Europe__Downtime-Budgets/transcript.md`（YouTube 自動字幕変換、機械精度） - Summary: [[@2016__SREcon16Europe__HPC Downtime Budgets]] - Pages created: [[@2016__SREcon16Europe__HPC Downtime Budgets]], [[Cory Lueninghoener]], [[Los Alamos National Laboratory]] - Pages updated: [[エラーバジェット]]（HPC 適応横断的知見・Wolf 余剰時間問い追記） - Key insight: エラーバジェットの本質は「リクエスト失敗率」という単位にはなく「リソースを追跡して意思決定に使う」構造にある。HPC では「時間」単位で同じ原則が機能する。 ## [2026-06-30] ingest-slides | SLX: An Extended SLO Framework to Expedite Incident Recovery (SREcon21) - Source: `.raw/slides/2021__SREcon21__SLX-Extended-SLO-Framework/2021__SREcon21__SLX-Extended-SLO-Framework.pdf` - Visual pages: `.raw/slides/2021__SREcon21__SLX-Extended-SLO-Framework/pages/` (40 pages) - Media: transcript なし（audio.m4a は生成済み、Whisper 未完了、YouTube 字幕フォールバックも未取得） - Summary: [[@2021__SREcon21__SLX - An Extended SLO Framework to Expedite Incident Recovery]] - Pages created: [[@2021__SREcon21__SLX - An Extended SLO Framework to Expedite Incident Recovery]], [[Qian Ding]], [[Xuan Zhang (Ant Group)]] - Pages updated: [[Ant Group]], [[サービスレベル目標]], [[異常検知]] - Key insight: SLO は検知には強いが調査（Investigation）には向かない——SLF/SLD の拡張と SLX Graph で「時系列相関のある異常 SLO 依存チェーン」を自動絞り込み、調査フェーズの認知負荷を下げる。 ## [2026-06-30] ingest-slides | Principled Performance Analytics (SREcon22 Americas) - Source: `.raw/slides/sre22amer-desai-principled-performance-analytics/sre22amer-desai-principled-performance-analytics.pdf` - Visual pages: `.raw/slides/sre22amer-desai-principled-performance-analytics/pages/` (40 pages) - Media: transcript 処理中（YouTube URL あり: `https://www.youtube.com/watch?v=zOu5cLBu4LI`、Whisper 未完了） - Summary: [[@2022__SREcon22Americas__Principled Performance Analytics]] - Pages created: [[@2022__SREcon22Americas__Principled Performance Analytics]], [[2σ手法]], [[Brent 
Bryan]] - Pages updated: [[Narayan Desai]], [[定常性モデル]], [[サービスレベル目標]] - Key insight: SLO は「エラー認識は人間の集積判断」という構造的問題から根本的に実現不可能——代替の 2σ手法は較正不要かつコホート間結合可能で、GCP 本番で SLO より 18 時間先行する障害検知を実証した。2021 年定常性モデルの数理実装編。 ## [2026-06-30] ingest-slides | Going from 30 to 30 Million SLOs (SREcon22 EMEA) - Source: `.raw/slides/srecon22emea-palcuie-30-to-30m-slos/srecon22emea-palcuie-30-to-30m-slos.pdf` - Visual pages: `.raw/slides/srecon22emea-palcuie-30-to-30m-slos/pages/` (28 pages) - Media: transcript なし（動画 URL 未取得） - Summary: [[@2022__SREcon22EMEA__Going-from-30-to-30-Million-SLOs]] - Pages created: [[@2022__SREcon22EMEA__Going-from-30-to-30-Million-SLOs]], [[Alex Palcuie]] - Pages updated: [[サービスレベル目標]], [[SLI-SLO段階的導入]] - Key insight: 集計 SLO は大規模プロバイダを守るが個別顧客を守らない——「5 エラーのルール」と per-customer SLO で Rachel Kroll "Your nines are not my nines" 問題に本番規模で対処した GCE の実践。 ## [2026-06-30] enrich-source | Latency and Availability Error Budgets Done Right at Scale (transcript 補完) - Source: `.raw/slides/2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale/transcript.md`（Whisper 自動生成、バックグラウンドタスク完了後に取得） - Pages updated: [[@2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale]]（口頭説明・補足セクション追加） - Key insight: SLI 違反は都度・SLO 違反はバジェット枯渇時 1 回という重要な区別を口頭で明言。Zendesk は Datadog を使用。"EB は責任追及でなく優先順位付けのため"を再確認。 ## [2026-06-30] ingest-slides | Squish Level Objectives (SREcon20 Americas) - Source: `.raw/slides/2020__SREcon20Americas__Squish-Level-Objectives/2020__SREcon20Americas__Squish-Level-Objectives.pdf` - Visual pages: `.raw/slides/2020__SREcon20Americas__Squish-Level-Objectives/pages/` (41 pages) - Media: `.raw/slides/2020__SREcon20Americas__Squish-Level-Objectives/transcript.md`（YouTube 自動字幕 449 行、英語、機械精度） - Summary: [[@2020__SREcon20Americas__Squish Level Objectives]] - Pages created: [[@2020__SREcon20Americas__Squish Level Objectives]], [[Dave Stanke]] - Pages updated: [[サービスレベル目標]], [[SLI-SLO段階的導入]] - Key insight: SLO Policy の Rationale 
フィールドにユーザー行動データを結び付けることで、技術閾値を「顧客行動観察に基づく設計判断」として文書化する最も直接的な実装例が示された。 ## [2026-06-30] ingest-slides | Beyond Goldilocks Reliability (SREcon21) - Source: `.raw/slides/2021__SREcon21__Beyond-Goldilocks-Reliability/2021__SREcon21__Beyond-Goldilocks-Reliability.pdf` - Visual pages: `.raw/slides/2021__SREcon21__Beyond-Goldilocks-Reliability/pages/` (23 pages) - Media: YouTube 字幕 429 エラー・Whisper 未完。transcript なし。 - Summary: [[@2021__SREcon21__Beyond-Goldilocks-Reliability]] - Pages created: [[@2021__SREcon21__Beyond-Goldilocks-Reliability]], [[定常性モデル]] - Pages updated: [[Narayan Desai]], [[SREの工学化]] - Key insight: Goldilocks Reliability（閾値監視）の荷重仮定を分析し、代替として定常性（Stationarity）モデルを提唱——「ちょうどいい閾値の設定」から「通常状態からの逸脱の検知」へ信頼性観測のパラダイムを転換する。 ## [2026-06-30] ingest-slides | Latency and Availability Error Budgets Done Right at Scale (SREcon20 Americas) - Source: `.raw/slides/2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale/2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale.pdf` - Visual pages: `.raw/slides/2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale/pages/` (37 pages) - Media: YouTube 字幕 429 エラー・Whisper 未取得。transcript なし。 - Summary: [[@2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale]] - Pages created: [[@2020__SREcon20Americas__Latency-and-Availability-Error-Budgets-Done-Right-at-Scale]], [[Zendesk]] - Pages updated: [[Fred Moyer]], [[エラーバジェット]], [[サービスレベル目標]] - Key insight: SLI/SLO/EB を `[Metric Identifier][Operator][Metric Value]` 等の機械解析可能な公式に固定し、OR 結合の複合 SLI で可用性とレイテンシを単一 EB で管理——マルチサービス構成では依存先 ER が上位層に積算されて自身の EB を超えて観測される問題を図示した。 ## [2026-06-30] ingest-slides | Avoiding Goodhart's Law - Use SLO's as Tools Not Cudgels (SREcon20 Americas) - Source: `.raw/slides/2020__SREcon20Americas__Avoiding-Goodharts-Law/2020__SREcon20Americas__Avoiding-Goodharts-Law.pdf` - Visual pages: `.raw/slides/2020__SREcon20Americas__Avoiding-Goodharts-Law/pages/` (35 
pages) - Media: Whisper 文字起こし処理中（YouTube `iKjKFeTSJGs`） - Summary: [[@2020__SREcon20Americas__Avoiding Goodhart's Law]] - Pages created: [[@2020__SREcon20Americas__Avoiding Goodhart's Law]], [[Marco Coulter]], [[AppDynamics]] - Pages updated: [[グッドハートの法則]], [[サービスレベル目標]], [[SLI-SLO段階的導入]] - Key insight: SLO が「棍棒」化するとグッドハートの法則が発動しゲーミングが起きる——3 次元 SLI/SLO/SLA フレームワーク（Code/Infra/CX）と行動ベース CX SLI、パフォーマンスカーブ SLO、反復的 SLO 交渉プロセスで対処する。 ## [2026-06-30] ingest-video | The Map Is Not the Territory (SREcon19 EMEA, Narayan Desai) - Source: https://www.youtube.com/watch?v=NW4OOpQ3nz8 (YouTube 自動字幕 en-orig / 低解像度動画から 12 フレーム抽出) - Transcript: `.raw/videos/youtube-NW4OOpQ3nz8/transcript.md` (1175 cue) - Frames: `.raw/videos/youtube-NW4OOpQ3nz8/frames/frame-001 ～ 012.jpg` - Summary: [[@2019__SREcon19EMEA__The Map Is Not the Territory - How SLOs Lead Us Astray, and What We Can Do about It]] - Pages created: [[@2019__SREcon19EMEA__The Map Is Not the Territory - How SLOs Lead Us Astray, and What We Can Do about It]], [[Narayan Desai]] - Pages updated: [[サービスレベル目標]], [[エラーバジェット]] - Key insight: SLO のテール管理適用が「サンドバッギング」を引き起こすこと、および SLO Algebra が 2019 年時点でも未解決の重要課題であることを明示した講演として wiki 化。 ## [2026-06-30] enrich-source | Latency SLOs Done Right (SREcon19 EMEA, Heinrich Hartmann) - Source: `.raw/slides/srecon19emea-latency-slos-done-right/srecon19emea-latency-slos-done-right.pdf` - Visual pages: `.raw/slides/srecon19emea-latency-slos-done-right/pages/` (33 ページ全件再確認) - Media: none (YouTube 字幕フォールバック試みたが別トークのビデオを誤取得・削除済み) - Summary: [[@2019__SREcon19 EMEA__Latency SLOs Done Right]] - Pages created: (なし。2026-06-19 に初回作成済み) - Pages updated: [[@2019__SREcon19 EMEA__Latency SLOs Done Right]], [[Heinrich Hartmann]], [[ヒストグラムメトリクス]] - Key insight: HDR ログリニアヒストグラムの実装詳細(46,081 ビン・スパース符号化・300b/ヒストグラム)と代替要約系譜(circllhist 2013・HDR 2015・t-digest 2015・DD-Sketch 2019)、ベンチマーク正確値(circllhist が精度・速度の両立で最優)を追加。manifest 未記録だったエントリを追加。 ## [2026-06-29] ingest-slides | SLOs for Data-Intensive Services (SREcon19 EMEA) - 
Source: `.raw/slides/srecon19emea-fouquet-slo-data-intensive/srecon19emea-fouquet-slo-data-intensive.pdf` - Visual pages: `.raw/slides/srecon19emea-fouquet-slo-data-intensive/pages/` (29 pages; p.26–29 は API 制限のためテキスト抽出で補完) - Media: none (transcript なし。YouTube 字幕取得不可) - Summary: [[@2019__SREcon19EMEA__SLOs for Data-Intensive Services]] - Pages created: [[@2019__SREcon19EMEA__SLOs for Data-Intensive Services]], [[Yoann Fouquet]], [[データ品質SLO]] - Pages updated: [[Booking.com]], [[サービスレベル目標]], [[SLI-SLO段階的導入]] - Key insight: 可用性・レイテンシだけでは検索サービスのステークホルダーは無関心。データ品質 SLO（一貫性・新鮮性・完全性・耐久性）を追加して初めて SLO が意思決定と自動化の根拠になった。 ## [2026-06-29] ingest-slides | Extending the Error Budget Model to Security and Feature Freshness (SREcon19 Americas) - Source: `.raw/slides/2019__SREcon19Americas__Extending-the-Error-Budget-Model/2019__SREcon19Americas__Extending-the-Error-Budget-Model.pdf` - Visual pages: `.raw/slides/2019__SREcon19Americas__Extending-the-Error-Budget-Model/pages/` (51 ページ全件確認) - Media: `.raw/slides/2019__SREcon19Americas__Extending-the-Error-Budget-Model/media/audio.m4a` (transcript なし — Whisper 未生成) - Summary: [[@2019__SREcon19Americas__Extending the Error Budget Model to Security and Feature Freshness]] - Pages created: [[wiki/sources/@2019__SREcon19Americas__Extending the Error Budget Model to Security and Feature Freshness]], [[wiki/concepts/脆弱性バジェット]], [[wiki/concepts/フィーチャーフレッシュネス]], [[wiki/entities/Jim Thomson]], [[wiki/entities/David Laing]] - Pages updated: [[wiki/concepts/エラーバジェット]](横断的知見), [[wiki/entities/Pivotal Software]] - Key insight: SLI/SLO/ポリシーのエラーバジェット構造はドメイン非依存の汎用モデルであり、セキュリティ(脆弱性バジェット: 30 日 SLO)とフィーチャーフレッシュネス(レガシーバジェット: 90 日アップグレード)に拡張できる。Equifax 侵害への 30 日 SLO 有効性は具体的な事例論証として重要。 ## [2026-06-29] ingest-slides | Latency SLOs Done Right (SREcon19 Americas) - Source: `.raw/slides/srecon19americas-moyer-latency-slos/srecon19americas-moyer-latency-slos.pdf` - Visual pages: `.raw/slides/srecon19americas-moyer-latency-slos/pages/` (50 pages) - Media: none 
(transcript なし) - Summary: [[@2019__SREcon19 Americas__Latency SLOs Done Right]] - Pages created: [[@2019__SREcon19 Americas__Latency SLOs Done Right]], [[Fred Moyer]] - Pages updated: [[Circonus]], [[サービスレベル目標]], [[ヒストグラムメトリクス]] - Key insight: パーセンタイル平均化は ~200% の誤差を生む。ログ・カウンタ・ヒストグラムの 3 手法による正しい SLO 計算体系が、同社の SREcon19 EMEA・Americas 両発表で独立確認された。 ## [2026-06-29] ingest-slides | Case Study: Implementing SLOs for a New Service (SREcon19 Americas) - Source: `.raw/slides/2019__SREcon19Americas__Implementing-SLOs-for-a-New-Service/2019__SREcon19Americas__Implementing-SLOs-for-a-New-Service.pdf` - Visual pages: `.raw/slides/2019__SREcon19Americas__Implementing-SLOs-for-a-New-Service/pages/` (23 pages) - Media: none（Whisper 失敗・YouTube 字幕フォールバック未実行） - Summary: [[@2019__SREcon19Americas__Case Study - Implementing SLOs for a New Service]] - Pages created: [[@2019__SREcon19Americas__Case Study - Implementing SLOs for a New Service]], [[Arnaud Lawson]], [[Squarespace]] - Pages updated: [[サービスレベル目標]], [[エラーバジェット]], [[SLI-SLO段階的導入]] - Key insight: ストレージサービスへの SLO 実装では耐久性 SLI が追加で必要。プローバーによる能動的計測が新規サービスへの SLO 導入を支える。エラーバジェットは SLO 設定と同時に計算・文書化することで運用指針になる。 ## [2026-06-29] ingest-slides | Quantifying Empathy Through Service Level Objectives (SREcon18 Asia/Pacific) - Source: `.raw/slides/2018__SREcon18Asia__Quantifying-Empathy-with-SLOs/2018__SREcon18Asia__Quantifying-Empathy-with-SLOs.pdf` - Visual pages: `.raw/slides/2018__SREcon18Asia__Quantifying-Empathy-with-SLOs/pages/` (152 ページ) - Media: `.raw/slides/2018__SREcon18Asia__Quantifying-Empathy-with-SLOs/transcript.md`（YouTube 自動字幕、1104 行） - Summary: [[@2018__SREcon18Asia__Quantifying Empathy Through Service Level Objectives]] - Pages created: [[@2018__SREcon18Asia__Quantifying Empathy Through Service Level Objectives]] / [[Ketan Gangatirkar]] / [[Indeed]] - Pages updated: [[サービスレベル目標]]（共感ギャップ・6 フレーバー・S 字曲線しきい値を横断的知見に追記） - Key insight: SLO 設計における「ユーザー共感の欠如」を共感ギャップ（Empathy gap）として定式化。ユーザー幸福の 6 フレーバー（#ARFCAapBof）と S 
字曲線による痛みのしきい値特定で解決する 5 ステップフレームワークを提示。 ## [2026-06-29] ingest-slides | SLOs and SLIs in the Real World: A Deep Dive (SREcon18 Europe/EMEA) - Source: `.raw/slides/2018__SREcon18Europe__Real-World-SLOs-and-SLIs/2018__SREcon18Europe__Real-World-SLOs-and-SLIs.pdf` - Visual pages: `.raw/slides/2018__SREcon18Europe__Real-World-SLOs-and-SLIs/pages/` (29 ページ) - Media: `.raw/slides/2018__SREcon18Europe__Real-World-SLOs-and-SLIs/media/flaming.mp3`（Whisper 文字起こし処理中） - Summary: [[@2018__SREcon18Europe__SLOs and SLIs in the Real World - A Deep Dive]] - Pages created: [[@2018__SREcon18Europe__SLOs and SLIs in the Real World - A Deep Dive]] - Pages updated: [[サービスレベル目標]] / [[Matthew Flaming]] / [[Elisa Binette]] - Key insight: Americas 版（2018-03）の EMEA 再演。音声収録を初取得。ケイパビリティ駆動 SLI/SLO 設計・ハードシャード per-shard SLO・複合 SLO の実演が主体。 ## [2026-06-29] ingest-slides | How Atlassian Is Tackling Error Budgets, Agile Style - Source: `.raw/slides/2018__SREcon18Asia__How-Atlassian-Is-Tackling-Error-Budgets-Agile-Style/2018__SREcon18Asia__How-Atlassian-Is-Tackling-Error-Budgets-Agile-Style.pdf` - Visual pages: `.raw/slides/2018__SREcon18Asia__How-Atlassian-Is-Tackling-Error-Budgets-Agile-Style/pages/` (47 ページ) - Media: none (transcript なし) - Summary: [[@2018__SREcon18Asia__How Atlassian Is Tackling Error Budgets, Agile Style]] - Pages created: [[@2018__SREcon18Asia__How Atlassian Is Tackling Error Budgets, Agile Style]] / [[Gui Vieiro]] / [[Atlassian]] - Pages updated: [[エラーバジェット]]（アジャイル導入・可視化・Not So Good Result の透明化を横断的知見に追記）/ [[sources/_index]] / [[entities/_index]] / [[index]] / [[hot]] - Key insight: エラーバジェット導入の入口は「開発停止トリガー」ではなく「週次 SLO 達成率の可視化と共有」であり、トリガー閾値を段階的に引き締めることで組織の受容を促せる。改善しない判断をプロセスで「承認」する透明化もエラーバジェットの成果である。 ## [2026-06-29] ingest-slides | SLOs and SLIs in the Real World: A Deep Dive - Source: `.raw/slides/2018__SREcon18Americas__Real-World-SLOs-SLIs-Deep-Dive/2018__SREcon18Americas__Real-World-SLOs-SLIs-Deep-Dive.pdf` - Visual pages: 
`.raw/slides/2018__SREcon18Americas__Real-World-SLOs-SLIs-Deep-Dive/pages/` (25 pages)
- Media: none (YouTube video available: https://www.youtube.com/watch?v=4cFl-Ge0x4g; transcript not yet obtained)
- Summary: [[@2018__SREcon18Americas__SLOs and SLIs in the Real World - A Deep Dive]]
- Pages created: [[@2018__SREcon18Americas__SLOs and SLIs in the Real World - A Deep Dive]] / [[Matthew Flaming]] / [[Elisa Binette]] / [[New Relic]]
- Pages updated: [[サービスレベル目標]] (added 4 cross-cutting insights and 4 open questions) / [[sources/_index]] / [[entities/_index]] / [[index]]
- Key insight: A 7-step recipe that inserts capabilities (functions) as an intermediate concept gives a concrete starting point for deciding what should become an SLI. With hard sharding, an aggregate SLO hides failures, so per-shard SLOs are essential.

## [2026-06-29] wiki-query | Projection MLP 学習の仕組み
- Question: What does training the Projection MLP in Toto-1.0-QA-Experimental mean, and how is it carried out?
- Pages created: [[Projection-MLP-学習の仕組み]]
- Pages updated: [[index]]
- Key insight: Training the Projection MLP is "the process of automatically discovering, by backpropagation against ground-truth QA pairs, a transformation function that bridges the different coordinate systems of Toto and Qwen3-VL". The VLM does not understand time series innately; it learns the correlation "this vector pattern corresponds to this answer" through 3-stage training (synthetic SFT → real-data SFT → RLVR).

## [2026-06-29] ingest-slides | Error Budgets and Risks (SREcon15, 2015)
- Source: `.raw/slides/2015__SREcon15__Error-Budgets-and-Risks/2015__SREcon15__Error-Budgets-and-Risks.pdf`
- Visual pages: `.raw/slides/2015__SREcon15__Error-Budgets-and-Risks/pages/` (26 pages)
- Media: `.raw/slides/2015__SREcon15__Error-Budgets-and-Risks/transcript.md` (Whisper automatic transcription, generated from MP3)
- Summary: [[@2015__SREcon15__Error Budgets and Risks]]
- Pages created: [[Marc Alvidrez]]
- Pages updated: [[エラーバジェット]]
- Key insight: Error budgets were discovered in practice, from the realization that "we were exceeding the SLA by so much that we were wasting opportunities"; they were not a formal methodological invention. Confirmed by Alvidrez's own spoken account.

## [2026-06-29] ingest-video | Service Levels and Error Budgets (SREcon16)
- Source: https://www.usenix.org/conference/srecon16/program/presentation/jones (YouTube `iOoxtpVBQ4I`)
- Transcript: `.raw/videos/iOoxtpVBQ4I/transcript.md` (auto-caption VTT → Markdown conversion, 463 segments, about 23 minutes)
- Frames: none (video could not be obtained)
- Summary: [[@2016__SREcon16__Service Levels and Error 
Budgets]]
- Pages created: [[Chris Jones]], [[@2016__SREcon16__Service Levels and Error Budgets]]
- Pages updated: [[Niall Murphy]], [[サービスレベル目標]], [[エラーバジェット]]
- Key insight: An SRE Book author stated directly both the redefinition that "SRE's job is to maximize product velocity, not availability" and the idea of using error budgets through continuous burn-rate monitoring rather than bang-bang control.

## [2026-06-29] ingest | Effective Harnesses for Long-Running Agents (Anthropic 2025) + Harness Design for Long-Running Application Development (Anthropic 2026)
- Sources: `.raw/articles/effective-harnesses-for-long-running-agents-2025-11-26.md` / `.raw/articles/harness-design-long-running-apps-2026-03-24.md`
- URLs: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents / https://www.anthropic.com/engineering/harness-design-long-running-apps
- Summary: [[@2025__Anthropic Engineering Blog__Effective Harnesses for Long-Running Agents]] / [[@2026__Anthropic Engineering Blog__Harness Design for Long-Running Application Development]]
- Pages created: [[@2025__Anthropic Engineering Blog__Effective Harnesses for Long-Running Agents]], [[@2026__Anthropic Engineering Blog__Harness Design for Long-Running Application Development]], [[Justin Young]], [[マルチコンテキストウィンドウエージェント]]
- Pages updated: [[Prithvi Rajasekaran]], [[Harness Engineering]], [[ループエンジニアリング]]
- Key insight: Two practice reports from Anthropic. The 2025 article tackles the problem of "agents that lose their memory with each context window" by separating two roles, Initializer and Coding, and by using a JSON feature list. The 2026 article presents generator/evaluator separation (to counter self-evaluation bias) and the concept of "load-bearing assumptions" (荷重仮定: harness components should be removed as models improve). Both complement the "harness engineering" that OpenAI proposed around the same time.

## [2026-06-29] ingest | Harness engineering: leveraging Codex in an agent-first world (OpenAI 2026)
- Source: `.raw/articles/harness-engineering-openai-2026-06-29.md`
- URL: https://openai.com/ja-JP/index/harness-engineering/
- Summary: [[OpenAI-Harness-Engineering]]
- Pages created: [[OpenAI-Harness-Engineering]], [[Symphony]], [[Harness Engineering]]
- Pages updated: [[OpenAI]]
- Key insight: Proposes the concept of "harness engineering" from an experiment in which 3 people generated 1 million lines of production code in 5 months with 0 hand-written lines. Argues that designing the environment around the agent (a slimmer AGENTS.md, mechanically enforced dependencies, feedback loops, GC tasks) will create the next productivity gap.

## [2026-06-29] ingest-paper | Memory in the Age of AI Agents
- Source: `.raw/papers/arxiv-2512.13564.pdf`
- URL: https://arxiv.org/abs/2512.13564
- Summary: [[@2025__arXiv__Memory in the Age of AI Agents]]
- Pages created: [[エージェントメモリ]], [[Yuyang Hu]], [[MemGPT]], [[Mem0]]
- Pages updated: [[コンテキストエンジニアリング]], [[National University of Singapore]]
- Key insight: The first systematic taxonomy to classify agent memory uniformly along 3 axes: form (token-level / parametric / latent), function (factual / experiential / working), and dynamics (formation / evolution / retrieval). Context engineering is the input-design side of agent memory, while agent memory extends its scope to autonomous accumulation, so the two are complementary. RL-driven autonomous memory management is the next frontier.

## [2026-06-29] ingest | VictoriaMetrics vs Prometheus (Jorijn Blog)
- Source: `.raw/articles/victoriametrics-vs-prometheus-2026-06-29.md`
- URL: https://jorijn.com/en/blog/victoriametrics-vs-prometheus/
- Summary: [[@2025__Jorijn-Blog__VictoriaMetrics vs Prometheus]]
- Pages created: [[MetricsQL]], [[Jorijn Schrijvershof]], [[PromLabs]]
- Pages updated: [[VictoriaMetrics]], [[Prometheus]]
- Key insight: On a cardinality explosion, VictoriaMetrics degrades gracefully to "slow inserts" instead of crashing with OOM, and HA is configured with the single `-replicationFactor` flag. Prometheus is justified when an existing stable stack, CNCF governance requirements, or PromQL portability are mandatory. MetricsQL is 74% PromQL-compatible (PromLabs evaluation); this causes no practical problem in a VictoriaMetrics-only environment but can become a liability in environments that span multiple backends.

## [2026-06-29] ingest-slides | How We Foster "Reliability" in Diversity
- Source: `.raw/slides/sre-next-2022/sre-next-2022.pdf`
- Visual pages: `.raw/slides/sre-next-2022/pages/` (50 pages)
- Media: none (no transcript)
- Summary: [[@2022__SRE NEXT__How We Foster Reliability in Diversity]]
- Pages created: [[ダイナミックケイパビリティ]], [[組織の信頼性マインドセット]]
- Pages updated: [[SRE組織変革]], [[Narimichi Takamura]], [[Topotal]]
- Key insight: Mapping the 5 steps of SRE practice onto the iceberg model (Level 1/2/3) makes explicit that "adopting practices (Level 1/2) and changing values (Level 3) require different efforts", and shows that formulating MVV is an effective approach to Level 3.

## [2026-06-29] ingest-slides | 小さくはじめるSLI/SLO ～育てながら組織に定着させる実践知～
- Source: `.raw/slides/road-to-sre-next-kobe/road-to-sre-next-kobe.pdf`
- Visual pages: `.raw/slides/road-to-sre-next-kobe/pages/` (48 pages) - 
Media: none (no transcript)
- Summary: [[@2026__Road to SRE NEXT 2026 神戸__小さくはじめるSLI-SLO 育てながら組織に定着させる実践知]]
- Pages created: [[SLI-SLO段階的導入]]
- Pages updated: [[サービスレベル目標]], [[エラーバジェット]], [[Narimichi Takamura]], [[Topotal]]
- Key insight: Organizes the difficulties of SLI/SLO into 3 axes (definition, operation, adoption) and defines a 5-level maturity model for each, so that it serves as a starting point for discussing where the organization stands and where it wants to go.

## [2026-06-29] ingest-slides | 組織的なインシデント対応を目指して〜成熟度評価と改善のステップ〜
- Source: `.raw/slides/sre-next-2024/sre-next-2024.pdf`
- Visual pages: `.raw/slides/sre-next-2024/pages/` (39 pages)
- Media: none
- Summary: [[@2024__SRE NEXT 2024__組織的なインシデント対応を目指して]]
- Pages created: [[インシデント対応成熟度モデル]]
- Pages updated: [[インシデント管理]], [[Incident Commander]], [[Narimichi Takamura]], [[Topotal]], [[SRE NEXT]]
- Key insight: Repurposes Google SRE's reliability mindset as an organizational assessment axis, and structures the practical lesson that "introducing an IC has prerequisites" as a maturity model of 3 phases × 9 processes × 4 levels.

## [2026-06-29] ingest-slides | Rethinking Incident Response: Context-Aware AI in Practice
- Source: `.raw/slides/Incident_Buddy_AI_Edition/Incident_Buddy_AI_Edition.pdf`
- Visual pages: `.raw/slides/Incident_Buddy_AI_Edition/pages/` (29 pages)
- Media: none (no transcript)
- Summary: [[@2025__SRE NEXT 2025__Rethinking Incident Response - Context-Aware AI in Practice]]
- Pages created: [[インシデントレスポンスAIレベル]]
- Pages updated: [[インシデント管理]], [[AIOps]], [[Waroom]], [[Ryota Yoshikawa]]
- Key insight: Proposes an IR0 to IR5 framework modeled on the SAE driving-automation levels. MCP plus coding agents have made IR2 to IR3 realistic, while the OpenRCA (11%) and AIOpsLab RCA (14%) results indicate that RCA and mitigation are still at the research stage.

## [2026-06-29] query-deep | ポストモーテムの教科書
- [[wiki/questions/ポストモーテムの教科書]] (new question): Draws on 25+ sources and 15+ concepts across the wiki to organize the theory, practice, and research of postmortems and incident analysis into a 25-chapter textbook. It covers beginner fundamentals (definitions, the three pillars, process); theoretical foundations in accident models, Cook's 18 propositions, and human factors; practice in facilitation, incident report writing, and process comparison; the "from repair to learning" paradigm shift (Repeat Incident Fallacy, Incident Legalism); the critique of MTTR, and TTX metrics; advanced methods such as incident stories, cross-incident analysis, and incident archaeology; and AI automation and open research problems.

## [2026-06-29] ingest | CoT Monitoring: Where Does a Hot Safety Problem Come From? 
- Source: `.raw/articles/cot-monitoring-history-2026-06-29.md`
- Summary: [[@2026__SAILBlog__CoT-Monitoring-Where-Does-a-Hot-Safety-Problem-Come-From]]
- Pages created: [[Peter Hase]], [[Christopher Potts]], [[CoTモニタリング]], [[@2026__SAILBlog__CoT-Monitoring-Where-Does-a-Hot-Safety-Problem-Come-From]]
- Pages updated: [[Chain-of-Thought Prompting]], [[Dan Hendrycks]]
- Key insight: CoT monitoring is the convergence of 2 lineages, monitoring frameworks (Hendrycks 2021) and CoT explainability (Ling 2017 / Camburu 2018), and OpenAI o1 was the catalyst that ended an 18-month gap.

## [2026-06-29] ingest-paper | On-demand Container Loading in AWS Lambda
- Source: `.raw/papers/atc23-brooker.pdf`
- Summary: [[@2023__ATC__On-demand Container Loading in AWS Lambda]]
- Pages created: [[@2023__ATC__On-demand Container Loading in AWS Lambda]], [[Marc Brooker]], [[AWS Lambda]], [[Firecracker]], [[コンテナ起動高速化]], [[収束暗号化]], [[イレイジャーコーディング]], [[メタ安定障害]]
- Pages updated: wiki/index.md, wiki/hot.md, wiki/log.md, wiki/sources/_index.md, wiki/concepts/_index.md, wiki/entities/_index.md
- Attachments: `fig05-deduplication-unique-chunks-cdf.png`, `fig07-cache-tier-hit-rates.png`, `fig08-l2-cache-hit-rate-cdf.png`, `fig09-erasure-vs-parallel-latency-cdf.png`, `fig10-l2-server-get-put-latency.png`, `fig11-local-agent-read-latency-cdf.png`
- Key insight: The core of faster container startup is exploiting 3 properties: sparsity (only 6.4% of the data is needed at startup), commonality (80% contain zero unique chunks), and cacheability. Convergent encryption enables wide-area deduplication without key sharing, and 4-of-5 erasure coding delivers low tail latency without retries. Operating a high-hit-rate cache safely requires preparing for metastable failures.

## [2026-06-29] ingest-paper | Project Silica: Towards Sustainable Cloud Archival Storage in Glass
- Source: `.raw/papers/ProjectSilica-SOSP23.pdf`
- Summary: [[@2023__SOSP__Project Silica - Towards Sustainable Cloud Archival Storage in Glass]]
- Pages created: [[@2023__SOSP__Project Silica - Towards Sustainable Cloud Archival Storage in Glass]], [[Antony Rowstron]], [[Project Silica]], [[アーカイバルストレージ]], [[ガラスストレージ]], [[ネットワーク符号化]]
- Pages updated: wiki/index.md, wiki/hot.md, wiki/log.md, wiki/sources/_index.md, wiki/concepts/_index.md, 
wiki/entities/_index.md
- Attachments: `fig04-shuttle-prototype.png`, `fig05a-iops-throughput.png`, `fig06-read-drive-utilization.png`, `fig07a-congestion-management.png`, `fig07b-power-savings.png`
- Key insight: In real cloud archival workloads, 58.7% of I/O operations are small reads of 4 MiB or less, which is fundamentally at odds with tape design (which assumes large sequential access). The WORM nature of glass media removes scrubbing, refresh, and garbage collection from the architecture, and because codes never need updating, network coding with very large group sizes, impossible in existing systems, becomes feasible.

---

## [2026-06-29] ingest-paper | In Search of an Understandable Consensus Algorithm
- Source: `.raw/papers/atc14-paper-ongaro.pdf`
- Summary: [[@2014__ATC__In Search of an Understandable Consensus Algorithm]]
- Pages created: [[@2014__ATC__In Search of an Understandable Consensus Algorithm]], [[Diego Ongaro]], [[John Ousterhout]], [[分散コンセンサス]], [[複製ステートマシン]], [[リーダー選出]]
- Pages updated: [[分散コンセンサス回避]] (added a comparison of joint consensus vs quorum sets to cross-cutting insights)
- Attachments: `fig01-replicated-state-machine.png`, `fig02-raft-algorithm-summary.png`, `fig03-raft-safety-properties.png`, `fig04-server-states.png`, `fig05-terms.png`, `fig07-log-inconsistencies.png`, `fig14-leader-election-perf.png`
- Key insight: The root of Paxos's difficulty lies in its single-decree decomposition. Raft was designed around just 2 principles, decomposing the problem (leader election / log replication / safety) and reducing the state space, and in the user study 33 of 43 participants scored higher on the Raft quiz than on the Paxos quiz. Randomized timeouts were judged the more understandable design compared with a ranking scheme because randomization treats the uncertainty as "all choices are equivalent", which shrinks the state space even though it introduces nondeterminism.

## [2026-06-28] ingest-paper | CockroachDB: The Resilient Geo-Distributed SQL Database
- Source: `.raw/papers/2026_Unknown_CockroachDB_Resilient_Geo_Distributed_SQL.pdf`
- Summary: [[@2020__SIGMOD__CockroachDB - The Resilient Geo-Distributed SQL Database]]
- Pages created: [[@2020__SIGMOD__CockroachDB - The Resilient Geo-Distributed SQL Database]], [[CockroachDB]], [[Cockroach Labs]], [[Rebecca Taft]], [[地理分散SQLデータベース]], [[ハイブリッド論理クロック]]
- Pages updated: [[Spanner]], [[分散トランザクション]], [[外部一貫性]]
- Attachments: `fig01-global-cluster.png`, `fig02-parallel-commits-performance.png`, `fig03-distributed-hash-join.png`, `fig04-throughput-per-vcpu.png`, `fig05-tpcc-scaling.png`, 
`fig06-multiregion-availability.png`, `fig07-ycsb-vs-spanner.png`
- Key insight: Whereas Spanner achieves external consistency with TrueTime commit wait, CRDB eliminates commit wait entirely with HLC + Read Refresh while still guaranteeing single-key linearizability. Without specialized hardware, it reaches 98.8% efficiency on TPC-C with 100,000 warehouses (in contrast to Aurora's 7.3%).

---

## [2026-06-29] ingest-paper | Aurora PostgreSQL Limitless Database: Building a Highly Scalable OLTP Database
- Source: `.raw/papers/2026_Unknown_Aurora_PostgreSQL_Limitless_Database_Building.pdf`
- Summary: [[@2026__SIGMOD Companion__Aurora PostgreSQL Limitless Database - Building a Highly Scalable OLTP Database]]
- Pages updated: [[@2026__SIGMOD Companion__Aurora PostgreSQL Limitless Database - Building a Highly Scalable OLTP Database]], [[分散トランザクション]], [[分散SQLデータベース]]
- Attachments added: `fig02-architecture.png`, `fig03-2pc-protocol.png`, `fig04-nopm-comparison.png`, `fig05-latency-comparison.png`, `fig06a-acu-r1.png`, `fig06b-acu-r2.png`
- Key insight: Replaces PostgreSQL's xid-based snapshots with time-based MVCC built on Amazon Time Sync, achieving external consistency without a centralized transaction manager. Persisting 2PC coordinator state on the lead shard, rather than on the stateless routers, keeps the routers cost-efficient.

---

## [2026-06-29] ingest-paper | Batch ingest of 5 foundational Data Center Networking papers

### Fat-Tree (SIGCOMM 2008)
- Source: `.raw/papers/fattree-sigcomm08.pdf`
- Summary: [[@2008__SIGCOMM__A Scalable Commodity Data Center Network Architecture]]
- Pages created: [[@2008__SIGCOMM__A Scalable Commodity Data Center Network Architecture]], [[Mohammad Al-Fares]], [[Amin Vahdat]], [[データセンターネットワークトポロジ]], [[ECMP]]
- Pages updated: [[マルチプレーンClosトポロジ]], [[AIデータセンタートポロジ]]
- Key insight: A k-ary Fat-Tree built uniformly from inexpensive commodity GigE switches only (27,648 hosts at k=48) achieved a 77% cost reduction over conventional hierarchical designs and became the starting point for modern data center network design.

### VL2 (SIGCOMM 2009)
- Source: `.raw/papers/vl2-sigcomm09.pdf`
- Summary: [[@2009__SIGCOMM__VL2 - A Scalable and Flexible Data Center Network]]
- Pages created: [[@2009__SIGCOMM__VL2 - A Scalable and Flexible Data Center Network]], [[Albert Greenberg]], [[VL2]], [[Valiant Load Balancing]]
- Pages updated: [[James Hamilton]], [[負荷分散]], 
[[データセンター輻輳制御]], [[マルチプレーンClosトポロジ]]
- Key insight: The high volatility of data center traffic is exactly what justifies VLB's randomization: simple random spreading is nearly equivalent in practice to complex adaptive traffic engineering (a 5% difference in utilization of the most heavily loaded link).

### Hedera (NSDI 2010)
- Source: `.raw/papers/hedera-nsdi10.pdf`
- Summary: [[@2010__NSDI__Hedera - Dynamic Flow Scheduling for Data Center Networks]]
- Pages created: [[@2010__NSDI__Hedera - Dynamic Flow Scheduling for Data Center Networks]], [[Barath Raghavan]], [[Sivasankar Radhakrishnan]], [[フロースケジューリング]]
- Pages updated: [[Mohammad Al-Fares]], [[Amin Vahdat]], [[ECMP]]
- Key insight: ECMP loses up to 60.8% of bisection bandwidth on workloads dominated by elephant flows. Hedera achieves 96% of optimal bisection bandwidth with a centralized scheduler and Simulated Annealing.

### PortLand (SIGCOMM 2009)
- Source: `.raw/papers/portland-sigcomm09.pdf`
- Summary: [[@2009__SIGCOMM__PortLand - A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric]]
- Pages created: [[@2009__SIGCOMM__PortLand - A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric]], [[Radhika Niranjan Mysore]], [[データセンターL2ファブリック]]
- Pages updated: [[Amin Vahdat]], [[データセンターネットワーク信頼性]]
- Key insight: Exploiting the observation that "the data center topology is known and fixed" makes the trio of PMAC, LDP, and the fabric manager possible, simultaneously achieving L2 semantics, loop freedom, O(n) fault convergence, and transparent VM migration.

### DCTCP (SIGCOMM 2010)
- Source: `.raw/papers/dctcp-sigcomm10.pdf`
- Summary: [[@2010__SIGCOMM__Data Center TCP (DCTCP)]]
- Pages created: [[@2010__SIGCOMM__Data Center TCP (DCTCP)]], [[Mohammad Alizadeh]], [[Incast]]
- Pages updated: [[Albert Greenberg]], [[データセンター輻輳制御]]
- Key insight: The idea of extracting multi-bit congestion information from the sequence of single-bit ECN marks solved the data center's 3 major impairments (Incast, queue buildup, buffer pressure) at once, with 30 lines of TCP code changes and 1 switch parameter. Its successor DCQCN extends this idea to rate-based control and NIC hardware implementation.

## [2026-06-28] ingest-paper | Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes (SIGMOD 2018)
- Source: `.raw/papers/aurora-sigmod-18.pdf`
- Summary: [[@2018__SIGMOD__Amazon Aurora - On Avoiding Distributed Consensus for I Os, Commits, and Membership Changes]]
- Pages created: [[@2018__SIGMOD__Amazon Aurora - On Avoiding Distributed Consensus for I Os, 
Commits, and Membership Changes]], [[分散コンセンサス回避]]
- Pages updated: [[クォーラムベースレプリケーション]], [[クラッシュリカバリ]], [[Write-Ahead Logging (WAL)]], [[Alexandre Verbitski]], [[Amazon Aurora (Database)]], `wiki/index.md`, `wiki/hot.md`, `wiki/log.md`, `wiki/sources/_index.md`, `wiki/concepts/_index.md`
- Key insight: A hierarchy of monotonically increasing LSN consistency points (SCL/PGCL/VCL), combined with quorum sets and epochs, avoids distributed consensus (2PC/Paxos) "in most situations". The core idea is to exploit the fact that "the database is the manager of state" and rely on domain-specific invariants (LSN monotonicity, no rejected writes) instead of general-purpose consensus.

## [2026-06-28] ingest-paper | F1: A Distributed SQL Database That Scales (VLDB 2013)
- Source: `.raw/papers/p1068-shute.pdf`
- Summary: [[@2013__VLDB__F1 - A Distributed SQL Database That Scales]]
- Pages created: [[@2013__VLDB__F1 - A Distributed SQL Database That Scales]], [[Jeff Shute]], [[分散SQLデータベース]]
- Pages updated: [[分散トランザクション]], [[Google]], `wiki/index.md`, `wiki/hot.md`, `wiki/log.md`, `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`
- Key insight: Overturned the conventional wisdom that "scalability and SQL consistency are a trade-off" in production at 100 TB and five nines. Data locality from a hierarchical schema, together with a parallel/asynchronous ORM, "hid the 50-150 ms commit latency" and kept user-perceived latency on par with MySQL.

## [2026-06-28] ingest-paper | Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database
- Source: `.raw/papers/amazon-memorydb-a-fast-and-durable-memory-first-cloud-database.pdf`
- Summary: [[@2024__SIGMOD__Amazon MemoryDB - A Fast and Durable Memory-First Cloud Database]]
- Pages created: [[@2024__SIGMOD__Amazon MemoryDB - A Fast and Durable Memory-First Cloud Database]], [[Amazon MemoryDB]], [[Yacine Taleb]], [[インメモリデータベース]], [[ストレージ計算分離]]
- Pages updated: [[Amazon Web Services]], `wiki/index.md`, `wiki/hot.md`, `wiki/log.md`, `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`
- Key insight: Separating durability into a multi-AZ transaction log delivers both full Redis API compatibility and 11 9s durability. Write-behind logging plus client blocking guarantees strong consistency even on Redis, which has no MVCC. Off-box snapshots eliminate the availability impact of BGSave.

## [2026-06-28] ingest-paper | Spanner: Google's Globally Distributed Database (OSDI 2012 / TOCS 2013) 
- Source: `.raw/papers/1974.pdf`
- Summary: [[@2013__TOCS__Spanner - Google's Globally Distributed Database]]
- Pages created: [[@2013__TOCS__Spanner - Google's Globally Distributed Database]], [[James C. Corbett]], [[外部一貫性]], [[TrueTime]], [[分散トランザクション]]
- Pages updated: [[Jeffrey Dean]], [[Sanjay Ghemawat]], `wiki/index.md`, `wiki/hot.md`, `wiki/log.md`, `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`
- Key insight: By combining TrueTime (uncertainty intervals derived from GPS and atomic clocks) with commit wait, the paper proves external consistency mathematically. Spanner was the first system to provide serializable, externally consistent transactions at global scale, at an overhead of commit wait ≥ 2ε with ε typically 4 ms.

## [2026-06-28] ingest-paper | Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases (SIGMOD 2017)
- Source: `.raw/papers/aurora-sigmod-17.pdf`
- Summary: [[@2017__SIGMOD__Amazon Aurora - Design Considerations for High Throughput Cloud-Native Relational Databases]]
- Pages created: [[@2017__SIGMOD__Amazon Aurora - Design Considerations for High Throughput Cloud-Native Relational Databases]], [[Amazon Aurora (Database)]], [[クォーラムベースレプリケーション]], [[コンピュートストレージ分離]]
- Pages updated: [[OLTPシステムアーキテクチャ]], [[Write-Ahead Logging (WAL)]], [[クラッシュリカバリ]], [[分散ストレージ]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]]
- Key insight: "The log is the database": by sending only redo logs over the network and letting the storage tier generate data pages asynchronously, Aurora removed the network bottleneck of cloud OLTP and achieved both crash recovery within 10 seconds and a 35x throughput improvement.

## [2026-06-28] ingest-paper | 縮約，網羅，減算：科学者の仕事とは何か (岡ノ谷一夫, 認知科学 2021)
- Source: `.raw/papers/jcss-2021-okanoya.pdf`
- Summary: [[@2021__認知科学__縮約，網羅，減算：科学者の仕事とは何か]]
- Pages created: [[@2021__認知科学__縮約，網羅，減算：科学者の仕事とは何か]], [[岡ノ谷一夫]], [[東京大学]], [[縮約]], [[網羅]], [[減算]]
- Pages updated: `wiki/index.md`, `wiki/hot.md`, `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`
- Key insight: Organizes scientific methodology in the machine learning era as a three-way opposition of reduction (縮約), exhaustive coverage (網羅), and subtraction (減算). "The scientist's job is to build explanatory systems that humans can understand", and even after the rise of exhaustive measurement, pursuing reduction and subtraction in parallel remains indispensable. Points out the risk that reductions presented by AI overlook singular points.

## [2026-06-28] ingest | 
CPU Utilization is Wrong (Brendan Gregg, 2017)
- Source: `.raw/articles/cpu-utilization-is-wrong-2026-06-28.md` (URL: https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html)
- Summary: [[@2017__brendangregg.com__CPU Utilization is Wrong]]
- Pages created: [[@2017__brendangregg.com__CPU Utilization is Wrong]], [[Brendan Gregg]], [[CPU利用率]], [[Instructions Per Cycle]]
- Pages updated: [[ハードウェアカウンタ]] (added a cross-cutting insight on the IPC / %CPU divergence)
- Key insight: %CPU is "non-idle time" and does not distinguish computation from waiting on memory. IPC (Instructions Per Cycle) is the better indicator of real processing efficiency, and IPC < 1.0 suggests the workload is likely memory bound. Because of the CPU-DRAM gap, much of today's "high CPU utilization" is really time spent waiting on DRAM.

## [2026-06-28] ingest-paper | Characterizing Cloud Computing Hardware Reliability (SoCC 2010)
- Source: `.raw/papers/socc088-vishwanath.pdf` (https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/socc088-vishwanath.pdf)
- Summary: [[@2010__SoCC__Characterizing Cloud Computing Hardware Reliability]]
- Pages created: [[@2010__SoCC__Characterizing Cloud Computing Hardware Reliability]], [[Kashi Venkatesh Vishwanath]], [[Nachiappan Nagappan]]
- Pages updated: [[データセンター信頼性]] (added 3 cross-cutting insights), [[障害予測]] (added 1 cross-cutting insight)
- Key insight: The strongest predictors of failure are metadata-like environmental attributes such as data center name and manufacturer name, while server age, rack position, and workload are not significant. The paradox that contextual information outperforms individual component features also carries over to failure-prediction design in the LLM era.

## [2026-06-28] ingest | The SPACE of Developer Productivity (ACM Queue 2021)
- Source: `.raw/articles/the-space-of-developer-productivity-2026-06-28.md` (URL: https://queue.acm.org/detail.cfm?id=3454124; reconstructed from WebSearch and secondary sources because of a Cloudflare block)
- Paper: [[Nicole Forsgren]] et al., ACM Queue Vol.19 No.1, Feb 2021
- Summary: [[@2021__ACMQueue__The SPACE of Developer Productivity]]
- Pages created: [[@2021__ACMQueue__The SPACE of Developer Productivity]], [[開発者生産性]], [[Margaret-Anne Storey]]
- Pages updated: [[SPACE]] (added cross-cutting insights and sources), [[Nicole Forsgren]] (added sources)
- Key insight: Activity (A) is "the most visible and most dangerous dimension": commit counts and PR counts are not proxies for productivity, and measuring across at least 3 dimensions is required. The wiki now reflects this explicit warning from the original paper.

## [2026-06-28] wiki-query | ポストモーテム文献横断ナラティブ
- [[ポストモーテムと事後分析の文献横断ナラティブ]] 
(new question): Integrates the literature on postmortems and post-incident analysis in the style of a CS paper Introduction. Cites across 6 concept pages: [[ポストモーテム]], [[根本原因分析]], [[インシデント管理]], [[運用障害分析]], [[インシデント考古学]], [[複雑システム障害論]].

## [2026-06-28] ingest-video | The Power of Stories
- Source: https://www.youtube.com/watch?v=Nd0xfNmkgRI (USENIX SREcon26 Americas)
- Transcript: `.raw/videos/srecon26-hochstein-power-of-stories/transcript.md` (generated from YouTube auto-captions, en-orig)
- Frames: none (frames could not be obtained)
- Summary: [[@2026__SREcon26Americas__The Power of Stories]]
- Pages created: [[Lorin Hochstein]], [[Airbnb]], [[逸脱の正常化]], [[@2026__SREcon26Americas__The Power of Stories]]
- Pages updated: [[インシデントストーリー]], [[インシデントレポート執筆]], wiki/index.md, wiki/hot.md, wiki/sources/_index.md, wiki/entities/_index.md, wiki/concepts/_index.md
- Key insight: The usefulness of an incident story is determined by 2 conditions: anomalous + immutable. Normalization of deviance is also ever-present in SREs' everyday tuning of alert thresholds.

## [2026-06-28] ingest-paper | Incident Metrics in SRE: Critically Evaluating MTTR and Friends
- Source: `.raw/papers/IncidentMeticsInSre.pdf` (36 pages)
- Summary: [[@2021__OReilly__Incident Metrics in SRE]]
- Pages created: [[@2021__OReilly__Incident Metrics in SRE]], [[Štěpán Davidovič]]
- Pages updated: [[TTXメトリクス]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/index]], [[wiki/hot]], [[wiki/log]]
- Key insight: MTTR does not work as a metric for evaluating incident improvement, and alternative statistics (median, geometric mean, percentiles) do not solve the problem. The crux is not data quality but the inherent properties of few incidents and high variance. Even with data at Google's scale, results do not fit within practical confidence intervals.

## [2026-06-28] ingest-slides | Unlock High-Frequency Deployments without Blowing Up Prometheus
- Source: `.raw/slides/2026__SREcon26Americas__Unlock-High-Frequency-Deployments-without-Blowing-Up-Prometheus/2026__SREcon26Americas__Unlock-High-Frequency-Deployments-without-Blowing-Up-Prometheus.pdf` (35 pages)
- Visual pages: `.raw/slides/2026__SREcon26Americas__Unlock-High-Frequency-Deployments-without-Blowing-Up-Prometheus/pages/`
- Media: `.raw/slides/2026__SREcon26Americas__Unlock-High-Frequency-Deployments-without-Blowing-Up-Prometheus/transcript.md` (YouTube 
auto-captions)
- Summary: [[@2026__SREcon26Americas__Unlock High-Frequency Deployments without Blowing Up Prometheus]]
- Pages created: [[@2026__SREcon26Americas__Unlock High-Frequency Deployments without Blowing Up Prometheus]], [[Ganesh Vernekar]], [[Reddit]], [[Prometheusシリーズチャーン]], [[Prometheus TSDB]]
- Pages updated: [[Prometheus]]
- Key insight: The root cause of Prometheus OOMs under high-frequency Kubernetes deployments is structural: stale series accumulate while waiting for the 2-hour HEAD flush. Stale-series compaction solves this by flushing from RAM to disk ahead of time, but it increases query-time merge overhead, so it should be limited to a "protective" role and considered only when the stale-series ratio exceeds 0.5. Choosing the threshold inevitably takes trial and error.

## [2026-06-28] ingest-slides | Reliability Equilibrium: The Hidden Playbook behind SRE Influence
- Source: `.raw/slides/sre26amer_slides_barteneva/sre26amer_slides_barteneva.pdf` (60 pages)
- Visual pages: `.raw/slides/sre26amer_slides_barteneva/pages/`
- Media: none (no transcript)
- Summary: [[@2026__SREcon26Americas__Reliability Equilibrium - The Hidden Playbook behind SRE Influence]]
- Pages created: [[@2026__SREcon26Americas__Reliability Equilibrium - The Hidden Playbook behind SRE Influence]], [[Daria Barteneva]], [[ゲーム理論とSRE]]
- Pages updated: [[Microsoft Azure]]
- Key insight: Many SRE failures are coordination failures and can be diagnosed systematically with 5 game-theoretic types (Prisoner's Dilemma, Stag Hunt, public goods, Bayesian, evolutionary). SRE tools (SLOs, error budgets, canaries) can be reinterpreted as mechanism design, that is, as designing incentives toward a "good equilibrium"; this view gives the organizational strategy of reliability engineering a new vocabulary.

## [2026-06-28] ingest-paper | Loop Engineering: The Anthropic Playbook for Designing Systems That Prompt Your Agents
- Source: `.raw/papers/Loop-Engineering-IEEE.pdf` (11 pages; a reformatted version of the Osmani Orange Book by HuaShu)
- Summary: [[@2026__Working Note__Loop Engineering - The Anthropic Playbook for Designing Systems That Prompt Your Agents]]
- Pages created: [[@2026__Working Note__Loop Engineering - The Anthropic Playbook for Designing Systems That Prompt Your Agents]], [[Addy Osmani]], [[Prithvi Rajasekaran]], [[Steve Kaliski]]
- Pages updated: [[ループエンジニアリング]]
- Key insight: Makes the "5 moves / 6 parts / 4 costs" framework of loop engineering explicit. The practical novelty is that the 5 failure patterns (Nodding/Amnesiac/Manual/Blind/Tangled Loop) map one-to-one onto the moves and work as a diagnostic tool, and that generator/evaluator separation is "a matter of structure, not of wording".

## [2026-06-28] ingest-slides | Postmortem as a textbook
- Source: `.raw/slides/Postmortem_as_a_textbook/Postmortem_as_a_textbook.pdf`
- Visual pages: `.raw/slides/Postmortem_as_a_textbook/pages/` (26 pages)
- Media: none (no transcript)
- Summary: [[@2023__SpeakerDeck__Postmortem as a textbook]]
- Pages created: [[@2023__SpeakerDeck__Postmortem as a textbook]], [[KATO Toshiya]], [[LINE株式会社]]
- Pages updated: [[ポストモーテム]] (added sources; added insights on SRE-led writing meetings to cross-cutting insights)
- Key insight: Postmortems written only by the people directly involved have 5 structural problems, such as omitting the information non-experts need in order to understand them. Having SREs facilitate a 30-minute pre-meeting improves quality and shortens the organization-wide review meeting at the same time.

## [2026-06-28] ingest-video | The Ironies of AI²
- Source: YouTube https://www.youtube.com/watch?v=cvcGIr4a2Dk (official page: https://www.usenix.org/conference/srecon26americas/presentation/reed)
- Transcript: `.raw/videos/srecon26-ironies-of-ai/transcript.md` (converted from YouTube auto-generated English captions, 135 lines)
- Frames: `.raw/videos/srecon26-ironies-of-ai/frames/` (17 frames extracted)
- Summary: [[@2026__SREcon26Americas__The Ironies of AI²]]
- Pages created: [[@2026__SREcon26Americas__The Ironies of AI²]], [[Chime]]
- Pages updated: [[J Paul Reed]], [[自動化のアイロニー]], [[Joint Activity]]
- Key insight: Collaboration with AI during incidents goes poorly because AI lacks the 3 properties that underpin coordination in a Joint Cognitive System (Directed Attention, Redirectability, Interpredictability). "Asking AI for explanations rather than recommendations" substantially mitigates the degradation in human performance caused by AI errors.

## [2026-06-28] ingest-slides | Beyond Loss and Accuracy: Closing the Observability Gaps in AI Training with TrainCheck
- Source: `.raw/slides/sre26amer_slides_jiang/sre26amer_slides_jiang.pdf`
- Visual pages: `.raw/slides/sre26amer_slides_jiang/pages/` (32 pages)
- Media: none (no transcript)
- Summary: [[@2026__SREcon26Americas__Beyond Loss and Accuracy - Closing the Observability Gaps in AI Training with TrainCheck]]
- Pages created: [[Ryan Huang]]
- Pages updated: [[Yuxuan Jiang]], [[TrainCheck]], [[DLトレーニングサイレントエラー]], [[訓練不変条件]], [[MLモデル監視]]
- Key insight: TrainCheck, which closes the "activity vs. correctness" gap in AI training monitoring with SRE discipline, was shown concretely to detect silent failures from the first iteration in the real-world BLOOM and MPS cases.

## [2026-06-28] ingest-slides | Human Factors in the Age of AI Ops: Re-Engineering Trust Between Humans and Machines (SREcon26 Americas)
- Source: `.raw/slides/2026__SREcon26Americas__Human-Factors-in-the-Age-of-AI-Ops/2026__SREcon26Americas__Human-Factors-in-the-Age-of-AI-Ops.pdf`
- Visual pages: `.raw/slides/2026__SREcon26Americas__Human-Factors-in-the-Age-of-AI-Ops/pages/` (63 pages)
- Media: none (no transcript)
- Summary: [[@2026__SREcon26Americas__Human Factors in the Age of AI Ops]]
- Pages created: [[@2026__SREcon26Americas__Human Factors in the Age of AI Ops]], [[Eddie Redick]], [[CTC Ops]]
- Pages updated: [[SRE AI Autonomy Levels]], [[アラート疲労]], [[人的要因]]
- Key insight: Technical autonomy (Google L0-L4) and organizational trust acceptance (Trust Spectrum) are two orthogonal axes, and an industry survey provides quantitative evidence that 60% of the industry remains at the Observe stage.

## [2026-06-28] ingest-slides | Executing Chaos Engineering in Production at a Critical Financial Institution (SREcon26 Americas)
- Source: `.raw/slides/2026__SREcon26Americas__Executing-Chaos-Engineering-in-Production/2026__SREcon26Americas__Executing-Chaos-Engineering-in-Production.pdf`
- Visual pages: `.raw/slides/2026__SREcon26Americas__Executing-Chaos-Engineering-in-Production/pages/` (17 pages)
- Media: none (transcript not obtained; video URL not obtained)
- Summary: [[@2026__SREcon26Americas__Executing Chaos Engineering in Production at a Critical Financial Institution]]
- Pages created: [[カオスエンジニアリング]], [[GameDay]], [[Bradesco]], [[Leonardo Marques]], [[Luiz Siqueira]], [[EasyPerform]]
- Pages updated: none
- Key insight: Chaos engineering practice in the production environment of a financial institution. The recognition that "scaling the reliability framework without growing the SRE team requires automation" was the starting point of phase 2. Achieved a 73% reduction in MTTD and a 22% improvement in MTTR.

## [2026-06-28] ingest-slides | AI Agents for Incident Investigation: The Good, The Bad, and The Ugly (SREcon26 Americas)
- Source: `.raw/slides/sre26amer-budichenko/sre26amer-budichenko.pdf`
- Visual pages: `.raw/slides/sre26amer-budichenko/pages/` (17 pages)
- Media: none (no transcript)
- Summary: 
[[@2026__SREcon26Americas__AI Agents for Incident Investigation - The Good, The Bad, and The Ugly]]
- Pages created: [[Vladyslav Budichenko]], [[Vocaly AI]]
- Pages updated: [[LLMによる根本原因分析]], [[インシデント調査戦略]], [[エージェント運用安全性]], [[エージェントシステム運用]]
- Key insight: Production RCA accuracy of 11.34%, prompt injection +540%, and a practical formulation of the trust-for/verify framework.

## [2026-06-28] ingest-slides | So You Want a New Incident Commander (SREcon26 Americas)
- Source: `.raw/slides/sre26amer_slides_huerta-granda/sre26amer_slides_huerta-granda.pdf`
- Visual pages: `.raw/slides/sre26amer_slides_huerta-granda/pages/` (25 pages)
- Media: none (transcript not obtained; video URL not obtained)
- Summary: [[@2026__SREcon26 Americas__So You Want a New Incident Commander]]
- Pages created: [[Incident Commander]]
- Pages updated: [[Vanessa Huerta Granda]] (added the SREcon26 talk and practical knowledge on IC programs), [[Enova]] (added the connection to the IC program), [[インシデント管理]] (added cross-cutting insights on the IC role and the 3 team types)
- Key insight: IC is not a badge for the strongest engineer but a sociotechnical leadership skill, and the universal requirement is less the structure (the 3 types) than stating explicitly that "the IC role is a priority and part of the job".

## [2026-06-28] ingest-slides | インシデントキーメトリクスによるインシデント対応の改善
- Source: `.raw/slides/sre-kaigi-2025/sre-kaigi-2025.pdf`
- Visual pages: `.raw/slides/sre-kaigi-2025/pages/` (56 pages)
- Media: `.raw/slides/sre-kaigi-2025/transcript.md` (converted from YouTube auto-captions, ja)
- Summary: [[@2025__SRE Kaigi 2025__インシデントキーメトリクスによるインシデント対応の改善]]
- Pages created: [[Narimichi Takamura]], [[TTXメトリクス]]
- Pages updated: [[Topotal]], [[Waroom]], [[インシデント管理]]
- Key insight: Shows quantitatively, through Monte Carlo simulation, that MTTR does not work as a metric for evaluating improvement, and presents 11 TTX metrics as a replacement together with an implementation of automatic collection in Waroom.

## [2026-06-28] ingest-slides | 1年間のポストモーテム運用とそこから生まれたツール sre-advisor
- Source: `.raw/slides/srenext2022-fujiwara-postmortem-sre-advisor/presentation.pdf`
- Visual pages: `.raw/slides/srenext2022-fujiwara-postmortem-sre-advisor/pages/` (32 pages)
- Media: `.raw/slides/srenext2022-fujiwara-postmortem-sre-advisor/transcript.md` (converted from YouTube auto-captions, ja)
- Summary: [[@2022__SRENEXT2022__1年間のポストモーテム運用とそこから生まれたツール sre-advisor]]
- Pages created: [[藤原俊一郎]]
- Pages updated: [[面白法人カヤック]], [[ポストモーテム]] - 
Key insight: Codifying the recurring configuration flaws found in postmortem reviews as sre-advisor realized a loop of "incident → postmortem → early detection → prevention". A concrete example of turning lessons learned into guardrails.

## [2026-06-28] ingest-slides | Learning from Incidents at Scale; Actually Doing Cross-Incident Analysis
- Source: YouTube auto-captions (https://www.youtube.com/watch?v=Q69WND8YHag)
- Media: `.raw/slides/2025__SREcon25Americas__Learning-from-Incidents-at-Scale/transcript.md`
- Slides PDF: not obtained (the USENIX page requires sign-in)
- Summary: [[@2025__SREcon25 Americas__Learning from Incidents at Scale - Actually Doing Cross-Incident Analysis]]
- Pages created: [[クロスインシデント分析]], [[Vanessa Huerta Granda]], [[Enova]]
- Pages updated: [[ポストモーテム]], [[Jeli]]
- Key insight: Making cross-incident analysis a "self-sustaining program" requires 3 elements: a dedicated team, structured artifacts, and a link to organizational planning. Inviting stakeholders from across departments was the single most important change. Metrics such as MTTR are meaningless without context.

## [2026-06-28] ingest-slides | The Case of the Misnamed Cities: CAST Analysis of a Google Maps Incident
- Source: `.raw/slides/srecon26americas-barroso-cast/srecon26americas-barroso-cast.pdf`
- Visual pages: `.raw/slides/srecon26americas-barroso-cast/pages/` (113 pages including duplicated animation frames; 15 key pages read closely)
- Media: none (no transcript)
- Summary: [[@2026__SREcon26Americas__The Case of the Misnamed Cities - CAST Analysis of a Google Maps Incident]]
- Pages created: [[@2026__SREcon26Americas__The Case of the Misnamed Cities - CAST Analysis of a Google Maps Incident]], [[Ruben Barroso]], [[Nancy G. 
Leveson]], [[CAST]]
- Pages updated: [[事故モデル]], [[根本原因分析]]
- Key insight: Chronology is not causality. Event selection in RCA passes through a subjective filter of what is familiar and politically acceptable. By analyzing control structures and mental models, CAST can reach non-event and organizational factors.

## [2026-06-28] ingest-slides | Mean Time to WTF: Why Developer Experience Frameworks Belong in Your Incident Retrospectives
- Source: `.raw/slides/srecon26-forsgren/srecon26-forsgren.pdf`
- Visual pages: `.raw/slides/srecon26-forsgren/pages/` (all 37 pages checked)
- Media: none (no transcript)
- Summary: [[@2026__SREcon26 Americas__The WTF Problem - Developer Experience as a Reliability Property]]
- Pages created: none (already created on 2026-06-16)
- Pages updated: [[@2026__SREcon26 Americas__The WTF Problem - Developer Experience as a Reliability Property]] (corrected date_published; updated the limitations section)
- Key insight: Friction in SRE work (cognitive load, tool friction, process friction) is a system property of reliability and is amplified exponentially by AI adoption. Measuring MTWTF (time from alert to situational understanding) as a leading indicator detects operational degradation before MTTR does.

## [2026-06-28] enrich-source | Human Observability of Incident Response (transcript addendum)
- Transcript: `.raw/slides/srecon23americas-davis-human-observability/transcript.md` (transcribed with Whisper)
- Addendum: added a new 「口頭説明・補足」 (spoken explanation and supplements) section to the source page (mandala of attention/awareness, design rationale for the conductor, Complexity vs. Complication, coordination, surprise, origin of the Gamelan name, track record of the Wheel of Expertise, a marketer's testimony, the deeper definition of Listening)
- Key insight: The transcript clarified the origin of the "Gamelan" name (iterative improvisational development) and the concrete ROI of the Wheel of Expertise (a session held two weeks earlier anticipated an incident).

## [2026-06-28] ingest-slides | Human Observability of Incident Response
- Source: `.raw/slides/srecon23americas-davis-human-observability/srecon23americas-davis-human-observability.pdf`
- Visual pages: `.raw/slides/srecon23americas-davis-human-observability/pages/` (all 39 pages checked)
- Media: none (YouTube caption fetch failed; only media.url saved) → Whisper transcript added later
- Summary: [[@2023__SREcon23Americas__Human Observability of Incident Response]]
- Pages created: [[Joint Activity]], [[Common Grounding]], [[Practice of Practice]], [[Matt Davis]], [[Pauline Oliveros]], [[Derek Bailey]]
- Pages updated: [[Laura Maguire]], [[Richard I. 
Cook]], [[人的要因]], [[レジリエンスエンジニアリング]], [[インシデント管理]]
- Key insight: Human observability in incident response, where participants observe one another's state, fatigue, and attention, is an observation problem independent of technical observability, and it can be systematized with improvisation theory (Joint Activity, Common Grounding, Practice of Practice).

## [2026-06-28] ingest-slides | Far from the Shallows: What We Can Learn From Deeper Incident Stories
- Source: `.raw/slides/srecon23amer-nash-far-from-shallows/` (79 YouTube video frames + auto-generated caption transcript)
- Visual pages: `.raw/slides/srecon23amer-nash-far-from-shallows/pages/` (all 79 pages checked)
- Media: `.raw/slides/srecon23amer-nash-far-from-shallows/transcript.md` (converted from YouTube auto-generated VTT captions)
- Summary: [[@2023__SREcon23Americas__Far from the Shallows]]
- Pages created: [[Courtney Nash]], [[Verica]], [[インシデントストーリー]]
- Pages updated: [[インシデント重大度評価]], [[根本原因分析]], [[Jens Rasmussen]]
- Key insight: Severity and Duration are uncorrelated (The Void data); designating a Root Cause distorts causality in complex systems in three ways; incident stories serve as an alternative framework to shallow data.

## [2026-06-28] ingest-slides | Turning an Incident Report into a Design Issue with TLA+
- Source: `.raw/slides/srecon23-incident-report-tla-plus/srecon23-incident-report-tla-plus.pdf`
- Visual pages: `.raw/slides/srecon23-incident-report-tla-plus/pages/` (all 23 pages checked)
- Media: none (no transcript)
- Summary: [[@2023__SREcon23Americas__Turning an Incident Report into a Design Issue with TLA+]]
- Pages created: [[Finn Hackett]], [[Markus A. 
Kuppe]], [[Joshua Rowe]], [[Azure CosmosDB]], [[TLA+]]
- Pages updated: [[結果整合性]], [[ポストモーテム]]
- Key insight: A workflow that produces the design-level insight an incident report cannot document, as a counterexample from the TLA+ model checker. It pinned down the root cause: Azure CosmosDB Session Consistency does not guarantee consistency across multiple clients that do not share a session token.

## [2026-06-28] ingest-slides | Incident Archeology: Finding Value in the Paperwork and Narratives of the past
- Source: `.raw/slides/srecon23amer-byrum-incident-archaeology/srecon23amer-byrum-incident-archaeology.pdf`
- Visual pages: `.raw/slides/srecon23amer-byrum-incident-archaeology/pages/` (27 pages, all checked)
- Media: no transcript
- Summary: [[@2023__SREcon23Americas__Incident Archeology - Finding Value in the Paperwork and Narratives of the past]]
- Pages created: [[インシデント考古学]], [[Clint Byrum]], [[Spotify]]
- Pages updated: [[ポストモーテム]]
- Key insight: Findings nobody was looking for (80% during business hours, 30% change-induced, 75% of time fields left at their defaults) proved more valuable to the organization than findings from the hypotheses set in advance.

## [2026-06-28] ingest-slides | The Repeat Incident Fallacy: What Jurassic Park Can Teach Us about Incidents
- Source: `.raw/slides/srecon22emea__ruppe__repeat-incident-fallacy/srecon22emea__ruppe__repeat-incident-fallacy.pdf`
- Visual pages: `.raw/slides/srecon22emea__ruppe__repeat-incident-fallacy/pages/` (24 pages, all checked)
- Media: no transcript
- Summary: [[@2022__SREcon22EMEA__The Repeat Incident Fallacy - What Jurassic Park Can Teach Us about Incidents]]
- Pages created: [[@2022__SREcon22EMEA__The Repeat Incident Fallacy - What Jurassic Park Can Teach Us about Incidents]], [[Emily Ruppe]], [[Laura Maguire]]
- Pages updated: [[ポストモーテム]] (added the Repeat Incident Fallacy and the four-way convergence), [[レジリエンスエンジニアリング]] (added the "cardio" practice and evolving sociotechnical systems), [[Jeli]] (added Emily Ruppe as a member)
- Key insight: Replacing the pledge "never again" with "Insights from the Past = Options in the Future", the framing that past insights expand future options, joined the wiki as a four-way convergence alongside Gallego, Lund, and Partington.

## [2026-06-28] ingest-slides | A Post Incident Review Review
- Source: `.raw/slides/srecon22apac-partington/srecon22apac-partington.pdf`
- Visual 
pages: `.raw/slides/srecon22apac-partington/pages/` (53 pages; p.29 rejected by the API)
- Media: no transcript (USENIX audio not obtained)
- Summary: [[@2022__SREcon22APAC__A Post Incident Review Review]]
- Pages created: [[@2022__SREcon22APAC__A Post Incident Review Review]], [[Tom Partington]], [[ANZx]], [[J Paul Reed]], [[John Allspaw]], [[Jeli]], [[Sidney Dekker]], [[James Reason]], [[Jens Rasmussen]]
- Pages updated: [[ポストモーテム]] (added ANZx's track record in practice, learning>fixing, and Record vs Report to the cross-cutting findings), [[事故モデル]] (added the link between the Rasmussen Safety Model and PIR styles), [[人的要因]] (added Mechanistic Reasoning and Dekker's Tunnel), [[レジリエンスエンジニアリング]] (added Safety I→II and STELLA/Woods' Theorem)
- Key insight: Running PIRs with no root cause, no action items, and no MTTx in a 1,000+ person organization in a highly regulated industry, with recurrences reported as rare, joined the wiki as the strongest practical example directly contradicting the conventional postmortem premise that incidents recur without remediation items.

## [2026-06-28] ingest-slides | Running Excellent Retrospectives: Talking for Humans
- Source: `.raw/slides/srecon19americas-eckhardt/srecon19americas-eckhardt.pdf`
- Visual pages: `.raw/slides/srecon19americas-eckhardt/pages/` (56 pages)
- Media: no transcript (YouTube 429 error; Whisper not installed)
- Summary: [[@2019__SREcon19Americas__Running Excellent Retrospectives - Talking for Humans]]
- Pages created: [[@2019__SREcon19Americas__Running Excellent Retrospectives - Talking for Humans]], [[Lex Neva]], [[Fastly]]
- Pages updated: [[Courtney Eckhardt]] (added the Americas talk), [[レトロスペクティブファシリテーション]] (added three cross-cutting findings on perceptual learning, humor management, and control of the emotional environment; added sources/related), [[人的要因]] (added the Lake Washington case to the cross-cutting findings), [[ポストモーテム]] (added sources)
- Key insight: At SREcon19 Americas and Asia/Pacific, the same speaker, Courtney Eckhardt, treats the same theme with different design philosophies. The Americas version adopts perceptual learning (a hands-on tutorial) and foregrounds the "three jobs of facilitation" framing and the systematic management of humor. The Lake Washington floating bridge case also enriched the Human Factors concept as a concrete example of the proposition that a retrospective is incomplete without human context even when the physical causal chain is complete.

## [2026-06-28] ingest-slides | Principled Identification of "Root Causes" Using Techniques from Safety Engineering
- Source: `.raw/slides/srecon22emea__devesine__root-causes/srecon22emea__devesine__root-causes.pdf`
- Visual pages: 
`.raw/slides/srecon22emea__devesine__root-causes/pages/` (23 pages)
- Media: `.raw/slides/srecon22emea__devesine__root-causes/transcript.md` (YouTube auto-generated captions, English)
- Summary: [[@2022__SREcon22 EMEA__Principled Identification of Root Causes Using Techniques from Safety Engineering]]
- Pages created: [[@2022__SREcon22 EMEA__Principled Identification of Root Causes Using Techniques from Safety Engineering]], [[Laura de Vesine]]
- Pages updated: [[根本原因分析]] (added cross-cutting findings on the redefinition of the terms root cause and trigger and on the trigger whack-a-mole pathology), [[事故モデル]] (added the contrast between the System/Environment boundary model and the Swiss cheese model)
- Key insight: The distinction "root cause = a vulnerability of the system" and "trigger = a worst-case environmental condition" offers a middle ground that SRE engineers can actually work with, between Will Gallego's "abandon the concept of root cause" and Cook's "causes are constructed". It shows that existing academic knowledge from safety engineering, the System/Environment boundary, transfers directly to SRE incident analysis.

## [2026-06-28] ingest-slides | Ditch the Template: How to Write Incident Reports They Want To Read
- Source: `.raw/slides/srecon22emea-nolan-ditch-template/srecon22emea-nolan-ditch-template.pdf`
- Visual pages: `.raw/slides/srecon22emea-nolan-ditch-template/pages/` (36 pages)
- Media: no transcript (video is embed-only and cannot be downloaded). Supplemented with a blog post (Container Solutions Blog, 2023-03-31) as reference material.
- Summary: [[@2022__SREcon22 EMEA__Ditch the Template - How to Write Incident Reports They Want To Read]]
- Pages created: [[@2022__SREcon22 EMEA__Ditch the Template - How to Write Incident Reports They Want To Read]], [[Laura Nolan]], [[Stanza Systems]], [[インシデントレポート執筆]]
- Pages updated: [[ポストモーテム]] (added two cross-cutting findings: criticism of the template format and the transfer of expertise)
- Key insight: "The value of an IR lies in learning, not in process." Dropping the template and writing a narrative addresses, from the angle of IR document quality, the same hollowing-out problem as earlier postmortem process arguments (Gallego, Larson, Lund), and complements them.

## [2026-06-28] ingest-slides | Retrospectives for Humans (a crash course)
- Source: `.raw/slides/srecon19apac-eckhardt/srecon19apac-eckhardt.pdf`
- Visual pages: `.raw/slides/srecon19apac-eckhardt/pages/` (47 pages)
- Media: `.raw/slides/srecon19apac-eckhardt/transcript.md` (YouTube auto-generated captions)
- Summary: [[@2019__SREcon19 Asia__Retrospectives for Humans (a crash course)]]
- Pages created: [[Courtney Eckhardt]], 
[[Heroku]], [[レトロスペクティブファシリテーション]]
- Pages updated: [[ポストモーテム]] (contributing factor discovery; three-way convergence on facilitator language), [[人的要因]] (Miller's Law; three-way convergence on human error)
- Key insight: The language shift "Why/You→How/What" is a core principle governing the depth of learning in postmortems, on which Eckhardt, Lund, and Gallego converged independently. Miller's Law provides its epistemological basis.

## [2026-06-28] ingest-slides | Getting More out of Postmortems and Making Them Less Painful to Do
- Source: `.raw/slides/srecon19apac-rizqi/srecon19apac-rizqi.pdf`
- Visual pages: `.raw/slides/srecon19apac-rizqi/pages/` (51 pages)
- Media: `.raw/slides/srecon19apac-rizqi/transcript.md` (YouTube auto-generated captions, 1355 lines)
- Summary: [[@2019__SREcon19Asia__Getting More out of Postmortems and Making Them Less Painful to Do]]
- Pages created: [[Ashar Rizqi]], [[Blameless]]
- Pages updated: [[ポストモーテム]] (added the six elements, referenceability, Slack-based async, guilds, and open questions)
- Key insight: Systematizes six elements of postmortem success drawn from more than 300 companies. Referenceability remains an unsolved problem today. Lightweight asynchronous PMs on Slack are a practical tactic that encourages on-time completion.

## [2026-06-27] ingest-slides | Accident Models in Post Mortems
- Source: `.raw/slides/2016__SREcon16Europe__Accident-Models-in-Post-Mortems/2016__SREcon16Europe__Accident-Models-in-Post-Mortems.pdf`
- Visual pages: `.raw/slides/2016__SREcon16Europe__Accident-Models-in-Post-Mortems/pages/` (100 pages)
- Media: none (no transcript)
- Summary: [[@2016__SREcon16Europe__Accident Models in Post Mortems]]
- Pages created: [[Nathan Hoffman]], [[Miriam Lautner]], [[事故モデル]]
- Pages updated: [[Will Gallego]], [[Etsy]], [[ポストモーテム]]
- Key insight: "Human error" is a label that marks a dead end in analysis. Safety is an emergent property, and causes are constructed rather than discovered (Dekker). Questions in seven debriefing categories reconstruct the perspective of those involved step by step.

## [2026-06-27] ingest-slides | What Brought Us Down? 
Outage Trend Analysis at Google
- Source: `.raw/slides/srecon15__lueder__incident-analysis/srecon15__lueder__incident-analysis.pdf`
- Visual pages: `.raw/slides/srecon15__lueder__incident-analysis/pages/` (30 pages)
- Media: `.raw/slides/srecon15__lueder__incident-analysis/media/lueder.mp3` (audio obtained; Whisper failed, so no transcript)
- Summary: [[@2015__SREcon15__What Brought Us Down - Outage Trend Analysis at Google]]
- Pages created: [[Sue Lueder]], [[障害傾向分析]], [[インシデント重大度評価]]
- Pages updated: [[インシデント管理]], [[根本原因分析]], [[ポストモーテム]], [[Google]]
- Key insight: Systematizes a cross-outage analysis program with a GQM feedback model. Google's first public case as of 2015, disclosing an eight-phase timeline (Incident Duration = Detect → Resolve), nine root-cause categories, and remediation opportunities in three directions (Stop/Faster/Culture).

## [2026-06-27] ingest-slides (amendment) | A Tale of Two Postmortems: transcript supplement
- Transcript obtained via Whisper background processing (581 lines, from audio.m4a)
- Source page update: added a spoken-explanation section (individual interview technique, discipline of facilitation language, expanded definition of System, citation of the Debriefing Facilitation Guide)
- ポストモーテム concept update: added three cross-cutting findings (individual interviews / How-language discipline / redefining the meeting goal)

## [2026-06-27] ingest-slides | A Tale of Two Postmortems: A Human Factors View
- Source: `.raw/slides/srecon19apac_lund_postmortem/srecon19apac_lund_postmortem.pdf`
- Visual pages: `.raw/slides/srecon19apac_lund_postmortem/pages/` (45 pages)
- Media: `.raw/slides/srecon19apac_lund_postmortem/transcript.md` (Whisper, 581 lines)
- Summary: [[@2019__SREcon19 Asia__A Tale of Two Postmortems - A Human Factors View]]
- Pages created: [[Tanner Lund]], [[人的要因]], [[レジリエンスエンジニアリング]]
- Pages updated: [[ポストモーテム]]
- Key insight: Dekker's four purposes (epistemological, preventive, moral, existential) systematize what an SRE postmortem is for, and structurally explain why the conclusion "human error" halts analysis.

## [2026-06-27] ingest-slides | Architecting a Technical Post Mortem
- Source: `.raw/slides/srecon18americas-gallego/srecon18americas-gallego.pdf`
- Visual pages: `.raw/slides/srecon18americas-gallego/pages/` (33 pages)
- Media: none (no transcript; YouTube `UlIfDdoK6EQ` not yet fetched)
- Summary: [[@2018__SREcon18 Americas__Architecting a Technical Post Mortem]]
- 
Pages created: [[Will Gallego]], [[@2018__SREcon18 Americas__Architecting a Technical Post Mortem]]
- Pages updated: [[ポストモーテム]], [[根本原因分析]], [[Etsy]]
- Key insight: Gallego's refinement from "blameless" to "blame-aware", together with his explicit rejection of the term "root cause" in PMs, supplies the practitioner perspective missing from existing writing on postmortem culture (Google SRE Book / mixi / Hatena).

## [2026-06-27] ingest-paper | Failures and Fixes: A Study of Software System Incident Response
- Source: `.raw/papers/Sillito-and-Kutomi-2020---Failures-and-Fixes---A-Study-of-Software-System-Incident-Response.pdf`
- Summary: [[@2020__arXiv__Failures and Fixes - A Study of Software System Incident Response]]
- Pages created: [[Jonathan Sillito]], [[Esdras Kutomi]], [[Brigham Young University]], [[インシデント調査戦略]], + 1 source page
- Pages updated: [[インシデント管理]], [[根本原因分析]], [[オペラビリティ]], [[変更起因インシデント]]
- Key insight: A 2020 qualitative study of 30 incidents serves as the origin of the problems that LLM-era AIOps papers attack individually (fragility of threshold-based detection, insufficient monitoring of the monitoring support tools themselves, confusion of correlation and causation during investigation, procedural asymmetry of configuration changes). Added to the wiki as a foundational reference for tracing back the problem settings of AlertGuardian, Bian Que, and TSGuard.

## [2026-06-27] ingest | Postmortem practice guides: 5-source batch ingest
- Sources: `.raw/articles/incident-response-to-reliability-2026-06-27.md`, `.raw/articles/pagerduty-post-mortem-process-2026-06-27.md`, `.raw/articles/datadog-incident-postmortem-best-practices-2026-06-27.md`, `.raw/articles/mixi-fault-handling-postmortem-2026-06-27.md`, `.raw/articles/hatena-incident-information-sharing-2026-06-27.md`
- Summaries: [[@ReadME__Will Larson__Move Past Incident Response to Reliability]], [[@PagerDuty__Post-Mortem Process]], [[@2021__Datadog Blog__Best Practices for Writing Incident Postmortems]], [[@mixi developers__インフラ障害対応とポストモーテム]], [[@2018__Hatena Developer Blog__社内障害情報共有のススメ]]
- Pages created: [[ポストモーテム]], [[Will Larson]], [[PagerDuty]] + 5 source pages
- Pages updated: [[インシデント管理]], [[Datadog]], [[Hatena]]
- Key insight: Compares postmortem practice across English-language practices at Google, PagerDuty, and Datadog and Japanese company practices at mixi and Hatena. Consolidates four practical findings into the concept page: Incident Legalism (the mechanism by which the process becomes a formality), a four-way classification of recurrence-prevention measures (prevent/detect/mitigate/fix), living documents, and lateral learning through company-wide sharing.

## 
[2026-06-27] ingest-paper | Do Not Blame Users for Misconfigurations
- Source: `.raw/papers/Xu-et-al.-2013---Do-not-blame-users-for-misconfigurations.pdf`
- Summary: [[@2013__SOSP__Do Not Blame Users for Misconfigurations]]
- Pages created: [[@2013__SOSP__Do Not Blame Users for Misconfigurations]], [[設定ミス脆弱性]], [[Yuanyuan Zhou]], [[Shankar Pasupathy]], [[NetApp]]
- Pages updated: [[設定マイニング]], [[Tianyin Xu]], [[Ding Yuan]]
- Key insight: 80% of misconfigurations are silent (violations or ignored settings), far more than those that crash. The white-box approach (SPEX), which lets developers automatically infer configuration constraints from source code, was established in 2013 as one of the two major lineages alongside black-box mining.

## [2026-06-27] ingest | OBI docs, OBI Header Enrichment, GenAI Observability, OTel Collector Survey, Japanese Survey, Log Dedup Processor (6 OTel sources batch)
- Source: `.raw/articles/obi-opentelemetry-ebpf-instrumentation-2026-06-27.md`, `.raw/articles/obi-http-header-enrichment-2026-06-27.md`, `.raw/articles/genai-observability-opentelemetry-2026-06-27.md`, `.raw/articles/otel-collector-follow-up-survey-2026-06-27.md`, `.raw/articles/otel-japanese-community-survey-2026-06-27.md`, `.raw/articles/log-deduplication-processor-2026-06-27.md`
- Summary: [[@2026__OTelDocs__OBI - OpenTelemetry eBPF Instrumentation]], [[@2026__OTelBlog__OBI HTTP Header Enrichment]], [[@2026__OTelBlog__GenAI Observability with OpenTelemetry]], [[@2026__OTelBlog__OTel Collector Follow-up Survey]], [[@2026__OTelBlog__Japanese Community Survey]], [[@2026__OTelBlog__Log Deduplication Processor]]
- Pages created: [[@2026__OTelDocs__OBI - OpenTelemetry eBPF Instrumentation]], [[@2026__OTelBlog__OBI HTTP Header Enrichment]], [[@2026__OTelBlog__GenAI Observability with OpenTelemetry]], [[@2026__OTelBlog__OTel Collector Follow-up Survey]], [[@2026__OTelBlog__Japanese Community Survey]], [[@2026__OTelBlog__Log Deduplication Processor]], [[OBI]], [[ゼロコード計装]], [[GenAI オブザーバビリティ]], [[ログ重複排除]]
- Pages updated: [[OpenTelemetry]], [[eBPF]], [[オブザーバビリティ]], [[テレメトリ]]
- Key insight: OBI extends eBPF-based zero-code instrumentation to GenAI providers. In the Japanese OTel community, traces are the most-used signal at 93% (diverging from the international pattern). Collector deployments are growing in scale and becoming hybrid with VMs, making configuration management and stability the top priorities. Unlike sampling, the log deduplication processor preserves information while eliminating redundant storage.

## [2026-06-27] ingest | OTel-Arrow Phase 2: From Efficient Transport to Efficient Telemetry Pipelines
- Source: `.raw/articles/otel-arrow-phase-2-2026-06-27.md`
- Summary: [[@2026__OTelBlog__OTel-Arrow-Phase-2]]
- Pages created: [[@2026__OTelBlog__OTel-Arrow-Phase-2]], [[OTel-Arrow]], [[OTAP]], [[Apache-Arrow]]
- Pages updated: [[OpenTelemetry]]
- Key insight: Keeping Apache Arrow's columnar format as the internal representation of the telemetry pipeline eliminates serialization overhead and achieves 20x the throughput of OTLP on a single core (2.47M vs 121K logs/sec). An evolution from Phase 1 (wire protocol) to Phase 2 (internal representation across the whole pipeline).

## [2026-06-27] ingest-paper | Cloud Atlas, Chain-of-Event, IRLLS, HeMiRCA, MicroIRC, RCA Outliers, RCInvestigator, GrayScope, SynthoDiag, MicroDig (10 papers batch)
- Source: `.raw/papers/arxiv-2407.08694.pdf`, `.raw/papers/Chain-of-Event_Interpretable-Root-Cause-Analysis-for-MicroservicesFSE24-Camera-Ready.pdf`, `.raw/papers/Xie-et-al.-2024---Microservice-root-cause-analysis-with-limited-observability-through-intervention-recognition-in-the-latent-space.pdf`, `.raw/papers/2026_Unknown_HeMiRCA_Fine_Grained_Root_Cause.pdf`, `.raw/papers/2026_Zhu_Microirc_Instance_Level_Root_Cause.pdf`, `.raw/papers/neurips-2025-rca-outliers.pdf`, `.raw/papers/RCInvestigator_Towards_Better_Investigation_of_Anomaly_Root_Causes_in_Cloud_Computing_Systems.pdf`, `.raw/papers/FSE_24_GrayScope.pdf`, `.raw/papers/FSE_2024_SynthoDiag.pdf`, `.raw/papers/Diagnosing_Performance_Issues_for_Large-Scale_Microservice_Systems_With_Heterogeneous_Graph.pdf`
- Summary:
  - **(1) Cloud Atlas** ([[@2024__arXiv__Cloud Atlas - Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight]]): Zhiqiang Xie et al. (Stanford/CMU/Microsoft Research). Uses an LLM to synthesize a causal graph automatically from system documentation and localize faults. Accuracy on par with manually built graphs.
  - **(2) Chain-of-Event** ([[@2024__FSE__Chain-of-Event - Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal 
Graph]]): Zhenhe Yao et al. (Tsinghua/CAS/eBay). Converts multimodal observability data into events and performs interpretable RCA with a weighted event causal graph. Integrates SRE operational knowledge directly.
  - **(3) IRLLS** ([[@2024__KDD__Microservice Root Cause Analysis with Limited Observability]]): Intervention recognition in the latent space under limited observability.
  - **(4) HeMiRCA** ([[@2024__TOSEM__HeMiRCA - Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources]]): Zhouruixing Zhu et al. (CUHK-Shenzhen/CUHK). Finds an anomaly-aware monotonic correlation between heterogeneous trace and metric data and performs hierarchical RCA with Spearman correlation. Service-level top-1 of 82.7%.
  - **(5) MicroIRC** ([[@2026__Elsevier__MicroIRC - Instance-level Root Cause Localization for Microservice Systems]]): Yuhan Zhu et al. (Wuhan University/CSIRO). GNN-based RCA at instance-level granularity. A dual graph combining a call graph and a metrics graph.
  - **(6) RCA Outliers** ([[@2025__NeurIPS__Root Cause Analysis of Outliers with Missing Structural Knowledge]]): William Roy Orchard et al. (Cambridge/MPI/Amazon). Theoretical guarantees for single-sample RCA when the causal graph is unknown. Under a polytree structure, RCA is possible from marginal anomaly scores alone.
  - **(7) RCInvestigator** ([[@2026__TVCG__RCInvestigator - Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems]]): Shuhan Liu et al. (Zhejiang/Microsoft). A visual analytics system for human-machine collaborative RCA. A four-stage workflow: build → monitoring → reasoning → conclusion.
  - **(8) GrayScope** ([[@2024__FSE__Illuminating the Gray Zone - Non-Intrusive Gray Failure Localization in Server Operating Systems]]): Shenglin Zhang et al. (Nankai/Tsinghua/Huawei). Non-intrusively localizes gray failures (partial or intermittent failures) in server operating systems. Combines expert knowledge with causal learning. AC@5 of 90%.
  - **(9) SynthoDiag** ([[@2024__FSE__SynthoDiag - Fault Diagnosis for Test Alarms in Microservices through Multi-source Data]]): Shenglin Zhang et al. (Nankai/Huawei Cloud/Tsinghua). Multi-source fault diagnosis for test alarms. Two stages: fault classification and localization.
  - **(10) MicroDig** ([[@2024__TSC__MicroDig - Diagnosing Performance Issues for Large-Scale Microservice Systems With Heterogeneous Graph]]): Lei Tao et al. (Nankai/Tsinghua/Tencent). Performance issue diagnosis using a heterogeneous graph that accounts for the mismatch between causal relationships and call relationships.
- Pages created: [[@2024__arXiv__Cloud Atlas - Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight]], [[@2024__FSE__Chain-of-Event - Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph]], [[@2024__FSE__Illuminating 
the Gray Zone - Non-Intrusive Gray Failure Localization in Server Operating Systems]], [[@2024__FSE__SynthoDiag - Fault Diagnosis for Test Alarms in Microservices through Multi-source Data]], [[@2024__TOSEM__HeMiRCA - Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources]], [[@2024__TSC__MicroDig - Diagnosing Performance Issues for Large-Scale Microservice Systems With Heterogeneous Graph]], [[@2025__NeurIPS__Root Cause Analysis of Outliers with Missing Structural Knowledge]], [[@2026__Elsevier__MicroIRC - Instance-level Root Cause Localization for Microservice Systems]], [[@2026__TVCG__RCInvestigator - Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems]], [[Zhiqiang Xie]], [[Yujia Zheng]], [[Lizi Ottens]], [[Wenxiao Chen]], [[Huai Jiang]], [[Liangfei Su]], [[GrayScope]], [[Di Weng]], [[Yingcai Wu]], [[CSIRO Data61]], [[テスト障害診断]], [[情報理論的異常スコア]], [[単一サンプルRCA]]
- Pages updated: [[@2024__KDD__Microservice Root Cause Analysis with Limited Observability]], [[根本原因分析]], [[グラフベースRCA]], [[因果推論ベースRCA]], [[介入的因果学習]], [[Interactive AIOps]], [[仮説駆動RCA]], [[ログベース障害診断]], [[グラフニューラルネットワーク]], [[知識グラフ]], [[Dan Pei]], [[Shenglin Zhang]], [[Qingwei Lin]], [[Jonathan Mace]], [[Christos Kozyrakis]], [[Kun Zhang]], [[Zhenhe Yao]], [[Pinjia He]], [[Zhouruixing Zhu]], [[Xiaohui Nie]], [[Zeyan Li]], [[Tencent]], [[Wuhan University]], [[Cheryl Lee]]
- Key insight: Microservice RCA is expanding along multiple axes: multimodal input (metrics + traces + logs), instance-level granularity, theoretical causal guarantees (single-sample RCA), human-collaborative visual analytics, gray failures, and test alarms. Automatic LLM synthesis of causal graphs (Cloud Atlas) is a breakthrough on the cost of manual graph construction.

# Operation Log

## [2026-06-27] enrich-source | Embedded representative figure screenshots in 9 source notes
- Source: `.raw/papers/dsn2017-datacenter-hardware-failures.pdf`, `.raw/papers/socc2016-cos.pdf`, `.raw/papers/imc2018-facebook-network-errors.pdf`, `.raw/papers/hotos2019-azure-software-failures.pdf`, `.raw/papers/sosp2011-hardware-errors.pdf`, `.raw/papers/nsdi2020-omegagen.pdf`, `.raw/papers/datacenter-scale-temperature-impact.pdf`, 
`.raw/papers/empirical-kubernetes-operator-bugs.pdf`, `.raw/papers/taxdc.pdf`
- Pages updated: [[@2017__DSN__What Can We Learn from Four Years of Data Center Hardware Failures]], [[@2016__SoCC__Why Does the Cloud Stop Computing - Lessons from Hundreds of Service Outages]], [[@2018__IMC__A Large Scale Study of Data Center Network Reliability]], [[@2019__HotOS__What Bugs Cause Production Cloud Incidents]], [[@2011__SOSP__An Empirical Study on Configuration Errors in Commercial and Open Source Systems]], [[@2020__NSDI__Understanding, Detecting and Localizing Partial Failures in Large System Software]], [[@2013__ACM TOS__Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures]], [[@2024__ISSTA__An Empirical Study on Kubernetes Operator Bugs]], [[@2016__ASPLOS__TaxDC - A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems]]
- Key insight: Following Step 0.5 of wiki-ingest-paper, cropped the regions of representative figures and tables from the PDFs, saved them to `wiki/sources/_attachments/`, and added them to the relevant body sections as Obsidian embeds.

## [2026-06-27] ingest-paper | G-Cause, FaultInsight, LoFI, iKnow, SparseRCA, Interventional Causal Learning, ResilienceGuardian (7 papers batch)
- Sources:
  - `.raw/papers/G-Cause---Parameter-free-Global-Diagnosis-for-Hyperscale-Web-Service-Infrastructures.pdf`
  - `.raw/papers/2026_Unknown_FaultInsight_Interpreting_Hyperscale_Data_Center.pdf`
  - `.raw/papers/LoFI.pdf`
  - `.raw/papers/iKnow.pdf`
  - `.raw/papers/SparseRCA__Unsupervised_Root_Cause_Analysis_in_Sparse_Microservice_Testing_Traces__ISSRE24_Camera_Ready_.pdf`
  - `.raw/papers/957000a141.pdf`
  - `.raw/papers/1570994962-final.pdf`
- Summary:
  - [[@2024__ICWS__G-Cause - Parameter-free Global Diagnosis for Hyperscale Web Service Infrastructures]]
  - [[@2024__KDD__FaultInsight - Interpreting Hyperscale Data Center Host Faults]]
  - [[@2024__ISSRE__LoFI - Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis]]
  - [[@2025__ASE__iKnow - an Intent-Guided Chatbot for Cloud Operations with 
Retrieval-Augmented Generation]]
  - [[@2024__ISSRE__SparseRCA - Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces]]
  - [[@2024__DSN-S__Fault Localization Using Interventional Causal Learning for Cloud-Native Applications]]
  - [[@2024__ISSRE__Guardian of the Resiliency - Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient]]
- Pages created: 7 source pages, 16 entity pages ([[SparseRCA]], [[FaultInsight]], [[LoFI]], [[iKnow]], [[CausalBench]], [[ResilienceGuardian]], [[Xinrui Jiang]], [[Meng Ma]], [[Ping Wang]], [[Tingzhu Bi]], [[Junjie Huang]], [[Saurabh Jha]], [[Guanglei He]], [[Zhihan Jiang]], [[Guangba Yu]], [[Zhenhe Yao]]), 6 concept pages ([[障害注入]], [[運用障害分析]], [[介入的因果学習]], [[障害耐性劣化変更検知]], [[OpsQA]], [[RAGベースクラウド運用支援]])
- Pages updated: [[根本原因分析]]
- Key insight: With sparse traces from test environments, causal discovery based on statistical metrics breaks down, and pattern-based decomposition of exclusive latency is effective. For hyperscale web infrastructure, parameter-free global diagnosis (G-Cause) is effective.

## [2026-06-27] ingest-slides | AIスパコン「さくらONE」のオブザーバビリティ
- Source: `.raw/slides/o11yconjp2025/o11yconjp2025.pdf` (62 pages)
- Visual pages: `.raw/slides/o11yconjp2025/pages/`
- Media: `.raw/slides/o11yconjp2025/transcript.md` (YouTube auto-generated captions, Japanese)
- Summary: [[@2025__O11yConTokyo2025__AIスパコン「さくらONE」のオブザーバビリティ]]
- Pages created: [[@2025__O11yConTokyo2025__AIスパコン「さくらONE」のオブザーバビリティ]]
- Pages updated: [[GPU観測性]], [[LLM学習モニタリング]], [[SAKURAONE]], [[坪内佑樹]]
- Key insight: Because of responsibility boundaries, observability for an AI supercomputer service has to start from resource analysis, and an "observability gap" exists relative to the cloud-native field. The concrete configuration of the OTel + Grafana pipeline is disclosed, and the aim is to close the gap with eBPF-based zero-code GPU instrumentation and continuous RoCE monitoring with R-Pingmesh.

## [2026-06-27] rewrite-source | Restructured 9 source notes to match the existing format
- Source: `.raw/papers/dsn2017-datacenter-hardware-failures.pdf`, `.raw/papers/socc2016-cos.pdf`, `.raw/papers/imc2018-facebook-network-errors.pdf`, `.raw/papers/hotos2019-azure-software-failures.pdf`, `.raw/papers/sosp2011-hardware-errors.pdf`, `.raw/papers/nsdi2020-omegagen.pdf`, `.raw/papers/datacenter-scale-temperature-impact.pdf`, 
`.raw/papers/empirical-kubernetes-operator-bugs.pdf`, `.raw/papers/taxdc.pdf`
- Pages updated: [[@2017__DSN__What Can We Learn from Four Years of Data Center Hardware Failures]], [[@2016__SoCC__Why Does the Cloud Stop Computing - Lessons from Hundreds of Service Outages]], [[@2018__IMC__A Large Scale Study of Data Center Network Reliability]], [[@2019__HotOS__What Bugs Cause Production Cloud Incidents]], [[@2011__SOSP__An Empirical Study on Configuration Errors in Commercial and Open Source Systems]], [[@2020__NSDI__Understanding, Detecting and Localizing Partial Failures in Large System Software]], [[@2013__ACM TOS__Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures]], [[@2024__ISSTA__An Empirical Study on Kubernetes Operator Bugs]], [[@2016__ASPLOS__TaxDC - A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems]]
- Key insight: Replaced the previous append-style expansion and rewrote each note into a consistent body matching the existing well-developed source notes, covering paper information, overview, problem setting, analysis method, main results, novelty, discussion, strengths/weaknesses, related pages, and sources.

## [2026-06-27] ingest-paper | A Survey on Failure Analysis and Fault Injection in AI Systems
- Source: `.raw/papers/2026_Unknown_A_Survey_Failure_Fault_Injection.pdf`
- Summary: [[@2025__TOSEM__A Survey on Failure Analysis and Fault Injection in AI Systems]]
- Pages created: [[@2025__TOSEM__A Survey on Failure Analysis and Fault Injection in AI Systems]], [[Roberto Natella]]
- Pages updated: [[Guangba Yu]], [[Pengfei Chen]], [[Michael R. 
Lyu]], [[Zibin Zheng]], [[Gou Tan]], [[障害注入]], [[運用障害分析]]
- Key insight: In each of the six layers of an AI system (Service / Model / Framework / Toolkit / Platform / Infrastructure), the gap between failure analysis and fault injection takes a different form. In particular, failures in the core communication of distributed AI training, such as NCCL, NVLink, and InfiniBand, are not covered at all by existing FI tools.

## [2026-06-27] ingest-paper | PreServe: Intelligent Management for LMaaS Systems via Hierarchical Prediction
- Source: `.raw/papers/PreServe.pdf`
- Summary: [[@2026__ICSE__PreServe - Intelligent Management for LMaaS Systems via Hierarchical Prediction]]
- Pages created: [[@2026__ICSE__PreServe - Intelligent Management for LMaaS Systems via Hierarchical Prediction]], [[LLMサービング管理]]
- Pages updated: [[LLM推論]], [[Zhihan Jiang]], [[Yujie Huang]], [[Guangba Yu]], [[Junjie Huang]], [[Jiazhen Gu]], [[Michael R. Lyu]]
- Key insight: The cold start of LLM instances (tens to hundreds of seconds) makes reactive scaling ineffective. The paper shows that a two-tier structure of look-ahead prediction with mLSTM plus response-length prediction with DistilBERT becomes a new design principle for LMaaS management.

## [2026-06-27] ingest-paper | Fault localization and root cause analysis: 11-paper batch ingest
- Source: `.raw/papers/Accurate_and_Interpretable_Log_Fault_Diagnosis_using_Large_Language_Models.pdf`, `.raw/papers/arxiv-2502.15728.pdf`, `.raw/papers/arxiv-2503.23051.pdf`, `.raw/papers/arxiv-2501.11545.pdf`, `.raw/papers/pdf.pdf` (OpenReview sCS9nrEXIS), `.raw/papers/DejaVu-paper.pdf`, `.raw/papers/2026_Unknown_Making_Fault_Localization_Online_Service.pdf`, `.raw/papers/No_More_Data_Silos_Unified_Microservice_Failure_Diagnosis_With_Temporal_Knowledge_Graph.pdf`, `.raw/papers/Ren-et-al.-2024---SLIM---A-scalable-and-interpretable-light-weight-fault-localization-algorithm-for-imbalanced-data-in-microservice.pdf`, `.raw/papers/Han-et-al.-2024---The-potential-of-one-shot-failure-root-cause-analysis---Collaboration-of-the-large-language-model-and-small-classifier.pdf`, `.raw/papers/arxiv-2412.02239.pdf`
- Summary: Converted 11 papers on fault localization and root cause analysis (RCA) into wiki pages in one batch using parallel agents.
- Pages created: [[@2025__nkcs.iops.ai__Accurate and Interpretable Log-Based Fault Diagnosis using Large Language Models]], [[@2025__arXiv__BSODiag - 
A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems]], [[@2025__arXiv__COCA - Generative Root Cause Analysis for Distributed Systems with Code Knowledge]], [[@2025__arXiv__RADICE - Causal Graph Based Root Cause Analysis for System Performance Diagnostic]], [[@2025__AAAI Workshop AICT__Causal Discovery for Cloud Microservice Architectures]], [[@2022__ESEC FSE__Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems]], [[@2025__TOSEM__Making Fault Localization in Online Service Systems More Actionable and Interpretable]], [[@2024__TSC__No More Data Silos - Unified Microservice Failure Diagnosis With Temporal Knowledge Graph]], [[@2024__ASE__SLIM - A scalable and interpretable light-weight fault localization algorithm for imbalanced data in microservice]], [[@2024__ASE__The Potential of One-Shot Failure Root Cause Analysis - Collaboration of the Large Language Model and Small Classifier]], [[@2024__arXiv__FaaSRCA - Full Lifecycle Root Cause Analysis for Serverless Applications]]
- Pages updated: [[Fault Localization]], [[根本原因分析]], [[因果発見]], [[因果推論ベースRCA]], [[LLMによる根本原因分析]], [[マルチモーダル障害診断]], [[グラフベースRCA]], [[サービス依存グラフ]], [[マイクロサービスコールグラフ]], [[サーバーレスアーキテクチャ]], [[ドメイン別RCA]], plus 47 entity pages and 20 concept pages
- Key insight: FL/RCA research from 2019 to 2025 is expanding along multiple axes: single modality → multimodal, ranking → causal subgraphs, supervised → one-shot/unsupervised, and microservices only → serverless/cloud infrastructure/DB. Redefining the granularity of the problem, as with code knowledge (COCA), TKG (UniDiag), and failure units (DéjàVu), has become the main contribution pattern.

## [2026-06-27] enrich-source | Expanded source pages for papers on datacenter reliability, cloud failures, and distributed-system failures
- Source: `.raw/papers/dsn2017-datacenter-hardware-failures.pdf`, `.raw/papers/socc2016-cos.pdf`, `.raw/papers/imc2018-facebook-network-errors.pdf`, `.raw/papers/hotos2019-azure-software-failures.pdf`, `.raw/papers/sosp2011-hardware-errors.pdf`, `.raw/papers/nsdi2020-omegagen.pdf`, `.raw/papers/datacenter-scale-temperature-impact.pdf`, `.raw/papers/empirical-kubernetes-operator-bugs.pdf`, `.raw/papers/taxdc.pdf`
- Pages 
updated: [[@2017__DSN__What Can We Learn from Four Years of Data Center Hardware Failures]], [[@2016__SoCC__Why Does the Cloud Stop Computing - Lessons from Hundreds of Service Outages]], [[@2018__IMC__A Large Scale Study of Data Center Network Reliability]], [[@2019__HotOS__What Bugs Cause Production Cloud Incidents]], [[@2011__SOSP__An Empirical Study on Configuration Errors in Commercial and Open Source Systems]], [[@2020__NSDI__Understanding, Detecting and Localizing Partial Failures in Large System Software]], [[@2013__ACM TOS__Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures]], [[@2024__ISSTA__An Empirical Study on Kubernetes Operator Bugs]], [[@2016__ASPLOS__TaxDC - A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems]]
- Key insight: Each source page that was thin at the initial ingest was expanded, based on the extracted PDF text, to a granularity where the study design, classification axes, main results, practical implications, and limitations can all be read.

## [2026-06-27] ingest-slides | SREのためのテレメトリー技術の探究
- Source: `.raw/slides/yapcfukuoka2025-telemetry-for-sre/yapcfukuoka2025-telemetry-for-sre.pdf`
- Visual pages: `.raw/slides/yapcfukuoka2025-telemetry-for-sre/pages/` (69 pages)
- Media: none (no transcript)
- Auxiliary: さくらのナレッジ article https://knowledge.sakura.ad.jp/48582/
- Summary: [[@2025__YAPC Fukuoka 2025__SREのためのテレメトリー技術の探究]]
- Pages created: [[@2025__YAPC Fukuoka 2025__SREのためのテレメトリー技術の探究]]
- Pages updated: [[坪内佑樹]], [[さくらインターネット研究所]], [[テレメトリ]], [[Scaling Telemetry Workloads]], [[SREの工学化]], [[GPU観測性]]
- Key insight: The three-layer model of the doctoral dissertation (instrumentation → storage → analysis) is re-presented for a general audience as "the thought process of extracting core concepts", and the talk presents a closed feedback loop concept from collect-first to use-first along with its extension to AI for SRE / Observability for AI Systems.

## [2026-06-27] ingest-paper | LLMRCA: Multilevel Root Cause Analysis for LLM Applications Using Multimodal Observability Data
- Source: `.raw/papers/Tan-et-al.-2026---LLMRCA---Multilevel-root-cause-analysis-for-LLM-applications-using-multimodal-observability-data.pdf`
- Summary: [[@2026__TOSEM__LLMRCA - Multilevel Root Cause Analysis for LLM Applications Using Multimodal
Observability Data]]
- Pages created: [[@2026__TOSEM__LLMRCA - Multilevel Root Cause Analysis for LLM Applications Using Multimodal Observability Data]], [[LLMRCA]], [[Gou Tan]]
- Pages updated: [[根本原因分析]], [[LLMによる根本原因分析]], [[マルチモーダル障害診断]]
- Key insight: Achieves multilevel RCA specialized for LLM applications through integrated use of multimodal observability data (traces, logs, metrics).

## [2026-06-27] ingest-paper | MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
- Source: `.raw/papers/arxiv-2603.02032.pdf`
- Summary: [[@2026__FSE__MetaRCA - A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge]]
- Pages created: [[@2026__FSE__MetaRCA - A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge]], [[MetaRCA]]
- Pages updated: [[根本原因分析]], [[因果推論ベースRCA]]
- Key insight: An RCA framework that generalizes to unseen systems by means of meta causal knowledge.

## [2026-06-27] ingest-paper | CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
- Source: `.raw/papers/arxiv-2605.04478.pdf`
- Summary: [[@2026__PPoPP__CCL-D - A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training]]
- Pages created: [[@2026__PPoPP__CCL-D - A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training]]
- Pages updated: [[集合通信]]
- Key insight: A system that diagnoses slow and hang anomalies of collective communication in large-scale model training with high precision.

## [2026-06-27] ingest-paper | KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems
- Source: `.raw/papers/arxiv-2506.04569.pdf`
- Summary: [[@2025__arXiv__KPIRoot+ - An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems]]
- Pages created: [[@2025__arXiv__KPIRoot+ - An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems]]
- Pages updated: [[異常検知]], [[根本原因分析]]
- Key insight: Anomaly detection and RCA
integrated into an end-to-end framework and applied to large-scale cloud systems.

## [2026-06-27] ingest-paper | Towards LLM-Based Failure Localization in Production-Scale Networks
- Source: `.raw/papers/sigcomm25-bian.pdf`
- Summary: [[@2025__SIGCOMM__Towards LLM-Based Failure Localization in Production-Scale Networks]]
- Pages created: [[@2025__SIGCOMM__Towards LLM-Based Failure Localization in Production-Scale Networks]], [[BiAn]], [[Guyue Liu]]
- Pages updated: [[Fault Localization]], [[Ennan Zhai]]
- Key insight: LLM-based failure localization in production-scale networks.

## [2026-06-27] ingest-paper | Robust Root Cause Diagnosis using In-Distribution Interventions
- Source: `.raw/papers/arxiv-2505.00930.pdf`
- Summary: [[@2025__ICLR__Robust Root Cause Diagnosis using In-Distribution Interventions]]
- Pages created: [[@2025__ICLR__Robust Root Cause Diagnosis using In-Distribution Interventions]], [[TWIST]]
- Pages updated: [[因果推論ベースRCA]], [[根本原因分析]]
- Key insight: Improves the robustness of causal-inference-based RCA through in-distribution interventions.

## [2026-06-27] ingest-paper | ThinkFL: Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning
- Source: `.raw/papers/arxiv-2504.18776.pdf`
- Summary: [[@2026__ACM TOSEM__ThinkFL - Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning]]
- Pages created: [[@2026__ACM TOSEM__ThinkFL - Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning]]
- Pages updated: [[Fault Localization]], [[LLMによる根本原因分析]]
- Key insight: Self-refining failure localization via reinforcement fine-tuning.

## [2026-06-27] ingest-paper | eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization
- Source: `.raw/papers/arxiv-2504.11505.pdf`
- Summary: [[@2025__arXiv__eARCO - Efficient Automated Root Cause Analysis with Prompt Optimization]]
- Pages created: [[@2025__arXiv__eARCO - Efficient Automated Root Cause Analysis with Prompt Optimization]], [[eARCO]], [[PromptWizard]]
- Pages updated: [[LLMによる根本原因分析]], [[根本原因分析]]
- Key insight: Efficient automated RCA through prompt optimization.

## [2026-06-27] ingest-paper | GALA: Can
Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis
- Source: `.raw/papers/arxiv-2508.12472.pdf`
- Summary: [[@2025__arXiv__GALA - Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis]]
- Pages created: [[@2025__arXiv__GALA - Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis]], [[GALA]], [[RCAEval]]
- Pages updated: [[根本原因分析]], [[LLMによる根本原因分析]]
- Key insight: Elevating RCA with graph-augmented LLM agentic workflows.

## [2026-06-27] ingest-paper | GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization
- Source: `.raw/papers/p1939-tang.pdf`
- Summary: [[@2024__VLDB__GPTuner - A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization]]
- Pages created: [[@2024__VLDB__GPTuner - A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization]], [[Jiale Lao]], [[Mingjie Tang]]
- Pages updated: [[データベースノブチューニング]]
- Key insight: By having an LLM read the manuals to build structured knowledge and combining it with Coarse-to-Fine Bayesian optimization, GPTuner finds good configurations 16x faster than prior methods that rely only on runtime feedback (e.g., OtterTune). Having the LLM itself structure the knowledge is a qualitative shift from DB-BERT's text + RL approach.

## [2026-06-27] ingest-paper | openGauss: An Autonomous Database System
- Source: `.raw/papers/vldb21-opengauss.pdf`
- Summary: [[@2021__VLDB__openGauss - An Autonomous Database System]]
- Pages created: [[@2021__VLDB__openGauss - An Autonomous Database System]], [[openGauss]], [[Raftログ診断]]
- Pages updated: [[データベース自律診断]], [[データベースノブチューニング]], [[Guoliang Li]], [[Xuanhe Zhou]]
- Key insight: The first comprehensive autonomous framework to integrate learning-based optimization techniques (MCTS query rewriting, Tree-LSTM cost estimation, DRL plan generation) into a real open-source database. Unlike bolt-on tuning tools, it takes a built-in approach that consistently automates diagnosis, monitoring, and tuning inside the DBMS.

## [2026-06-27] ingest-paper | Automatic Database Management System Tuning Through Large-scale Machine Learning
- Source: `.raw/papers/van-aken-etal-parameters.pdf`
- Summary: [[@2017__SIGMOD__Automatic Database Management System Tuning Through Large-scale Machine Learning]]
- Pages created: [[@2017__SIGMOD__Automatic
Database Management System Tuning Through Large-scale Machine Learning]], [[OtterTune]], [[Dana Van Aken]], [[Andrew Pavlo]]
- Pages updated: [[データベースノブチューニング]]
- Key insight: Automatically optimizes DBMS configurations with a three-stage pipeline (factor analysis → Lasso → Gaussian process) and, by reusing past tuning sessions, produces DBA-grade configurations within 60 minutes. The later DB-BERT, GPTuner, and AgentTune all build on this pipeline structure, making it foundational work for the database knob tuning field.

## [2026-06-27] ingest-paper | Database anomaly diagnosis and RCA: batch of 8 papers
- Source: `.raw/papers/Anomaly_Diagnosis_with_Siamese_Discrepancy_Networks_in_Distributed_Cloud_Databases.pdf`, `.raw/papers/AIDB25_4.pdf`, `.raw/papers/p1169-ouyang.pdf`, `.raw/papers/vista-amazon-rds.pdf`, `.raw/papers/FSE_2023_LIZHI.pdf`, `.raw/papers/2026_Unknown_BALANCE_Bayesian_Linear_Attribution_Root.pdf`, `.raw/papers/grano-rca.pdf`, `.raw/papers/2019SIGMOD-ExplainIt.pdf`
- Summary: [[@2025__ICDE__Anomaly Diagnosis with Siamese Discrepancy Networks in Distributed Cloud Databases]], [[@2025__AIDB__AutoDebugger - Efficient Root Cause Analysis for Anomaly Jobs]], [[@2025__VLDB__RCRank - Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems]], [[@2023__Amazon Science__Vista - Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS]], [[@2023__FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System]], [[@2023__PACMMOD__BALANCE - Bayesian Linear Attribution for Root Cause Localization]], [[@2019__VLDB__GRANO - Interactive Graph-based Root Cause Analysis for Cloud-Native Distributed Data Platform]], [[@2019__SIGMOD__ExplainIt!
- A Declarative Root-cause Analysis Engine for Time Series Data]]
- Pages created: the 8 sources above + many entities + 4 concepts ([[Sparkジョブ異常診断]], [[グラフベースRCA]], [[宣言的RCA]], [[データベース性能トラブルシューティング]])
- Pages updated: [[根本原因分析]], [[異常検知]], [[データベース自律診断]], [[データベース O&M]], [[マルチモーダル障害診断]], [[帰属手法]], [[Fault Localization]], [[AIOps]], [[Interactive AIOps]], [[仮説駆動RCA]], [[因果発見]], [[サービス依存グラフ]], and others
- Key insight: Across the 8 DB anomaly diagnosis papers, the input modality for RCA expands from single metrics → topology + metrics → multimodal integration, while in industrial deployment the key is securing interpretability through a three-stage detection → RCA → resolution pipeline (Vista) or Bayesian attribution (BALANCE). In database-centric systems, conventional performance analysis techniques are hard to apply directly (FSE 2023).

## [2026-06-27] ingest-paper | DB-BERT: a Database Tuning Tool that "Reads the Manual"
- Source: `.raw/papers/arxiv-2112.10925.pdf`
- Summary: [[@2022__SIGMOD__DB-BERT - a Database Tuning Tool that Reads the Manual]]
- Pages created: [[@2022__SIGMOD__DB-BERT - a Database Tuning Tool that Reads the Manual]], [[Immanuel Trummer]], [[NLPベースDBチューニング]]
- Pages updated: [[データベースノブチューニング]], [[データベース O&M]], [[Cornell University]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]]
- Key insight: By combining manual comprehension with BERT and runtime-feedback learning with Double DQN, this is the first demonstration of outperforming existing methods in all experiments without manual parameter selection or value-range specification. It showed that approaches using NLP or runtime feedback in isolation are each incomplete.

## [2026-06-27] ingest-paper | Automatic Database Knob Tuning: A Survey & Automatic Configuration Tuning on Cloud Database: A Survey
- Source: `.raw/papers/tuning-survey.pdf`, `.raw/papers/arxiv-2404.06043.pdf`
- Summary: [[@2023__TKDE__Automatic Database Knob Tuning - A Survey]], [[@2024__arXiv__Automatic Configuration Tuning on Cloud Database - A Survey]]
- Pages created: [[@2023__TKDE__Automatic Database Knob Tuning - A Survey]], [[@2024__arXiv__Automatic Configuration Tuning on Cloud Database - A Survey]], [[Xinyang Zhao]], [[Limeng Zhang]], [[M.
Ali Babar]], [[University of Adelaide]]
- Pages updated: [[Guoliang Li]], [[Xuanhe Zhou]], [[Tsinghua University]], [[データベースノブチューニング]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The knob tuning pipelines presented independently by Tsinghua University (TKDE 2023) and the University of Adelaide (arXiv 2024) are converging on four stages, and the application domains of BO and RL separate according to the number of iterations. Cloud-specific safety and adaptivity constraints stand out as the difference between the two surveys.

## [2026-06-27] ingest | The Morning Paper on Operability
- Source: `.raw/articles/the-morning-paper-on-operability-2026-06-27.md`
- Summary: [[@2016__blog.acolyer__The Morning Paper on Operability]]
- Pages created: [[@2016__blog.acolyer__The Morning Paper on Operability]], [[Adrian Colyer]], [[オペラビリティ]]
- Pages updated: [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: Drawing on Colyer's review across 400+ papers, shows that operational robustness consists of a four-stage chain: design → visibility → debugging → feedback. All six papers ingested just before (Mystery Machine, Failure Sketching, Delta Debugging, HDD, DEMi, FDD) fit within this framework.

## [2026-06-26] ingest-video | Introducing the Reliability Map – r9y.dev
- Source: https://www.youtube.com/watch?v=e_OD4AnxICg (USENIX SREcon22 APAC)
- Transcript: .raw/videos/e_OD4AnxICg/transcript.md (YouTube auto-generated captions en-orig, 1090 lines)
- Frames: .raw/videos/e_OD4AnxICg/frames/ (12 frames, 120-second interval)
- Summary: [[@2022__SREcon22 APAC__Introducing the Reliability Map – r9y.dev]]
- Pages created: [[@2022__SREcon22 APAC__Introducing the Reliability Map – r9y.dev]], [[Aaron Bowden]], [[Reliability Map (r9y.dev)]]
- Pages updated: [[SRE]]
- Key insight: A roadmap of SRE capabilities is built in the order: context extraction → capability selection (adopt/build/buy) → two kinds of insight, tactical and strategic. Inspired by game tech trees, the r9y.dev map fills the "what to acquire next" gap between the SRE Book's "what to believe" and the SRE Workbook's "how to implement".

## [2026-06-26] ingest-slides | エンジニアのためのSRE論文への招待
- Source: .raw/slides/srenext2023-yuukit-sre-papers/srenext2023-yuukit-sre-papers.pdf
- Visual pages: .raw/slides/srenext2023-yuukit-sre-papers/pages/ (35 pages)
- Media: .raw/slides/srenext2023-yuukit-sre-papers/media/audio.m4a (original audio of the YouTube recording; not used as evidence for content because no transcription has been generated)
- Summary: [[@2023__SRE NEXT
2023__エンジニアのためのSRE論文への招待]]
- Pages created: [[@2023__SRE NEXT 2023__エンジニアのためのSRE論文への招待]], [[SRE論文]]
- Pages updated: [[坪内佑樹]], [[SRE NEXT]]
- Key insight: By not confining SRE papers to a single field, exploring them through a combination of international conferences, search, and citation networks, and splitting reading into stages from skimming to close reading and note-taking, papers on not-yet-widespread technology can be connected to engineers' ideas for implementation and application.

## [2026-06-26] ingest-slides | AIOps研究録―SREのためのシステム障害の自動原因診断
- Source: `.raw/slides/srenext2022-yuukit/srenext2022-yuukit.pdf`
- Visual pages: `.raw/slides/srenext2022-yuukit/pages/` (54 pages)
- Media: `.raw/slides/srenext2022-yuukit/transcript.md` (YouTube Japanese auto-captions; limited accuracy for proper nouns)
- Summary: [[@2022__SRE NEXT 2022__AIOps研究録―SREのためのシステム障害の自動原因診断]]
- Pages created: [[@2022__SRE NEXT 2022__AIOps研究録―SREのためのシステム障害の自動原因診断]], [[TSifter]]
- Pages updated: [[坪内佑樹]], [[Meltria]], [[AIOps]], [[因果推論ベースRCA]], [[時系列クラスタリング]], [[自動化の皮肉]]
- Key insight: After separating SLO-based symptom alerts from cause diagnosis, input reduction via anomaly detection and time-series clustering needs to be designed as preprocessing for causal graph generation. Input reduction is not only a speedup but also a boundary design that preserves the path from cause to symptom.

## [2026-06-26] ingest-paper | Debugging, performance analysis, and feedback: batch ingest of 6 papers
- Source: `.raw/papers/delta-debugging.pdf`, `103-kasikci.pdf`, `osdi14-paper-chow.pdf`, `hdd.pdf`, `nsdi16-paper-scott.pdf`, `2026_Unknown_Runtime_metric_meets_developer_building.pdf`
- Summary: Batch conversion into wiki pages of 6 papers: the delta debugging lineage (ddmin/dd → HDD → DEMi), Facebook's end-to-end performance analysis (Mystery Machine), automated diagnosis of production failures (Failure Sketching), and integration of runtime feedback for developers (FDD).
- Pages created:
  - [[@2002__IEEE TSE__Simplifying and Isolating Failure-Inducing Input]], [[@2006__ICSE__HDD - Hierarchical Delta Debugging]], [[@2014__OSDI__The Mystery Machine - End-to-end Performance Analysis of Large-scale Internet Services]], [[@2015__SOSP__Failure Sketching - A Technique for Automated Root Cause Diagnosis of In-Production Failures]], [[@2015__Onward!__Runtime Metric Meets Developer - Building Better Cloud Applications using Feedback]], [[@2016__NSDI__Minimizing Faulty Executions of Distributed Systems]]
  - entities: [[Andreas Zeller]], [[Ghassan Misherghi]], [[Zhendong Su]], [[Michael Chow]], [[David Meisner]], [[Jason Flinn]], [[Thomas F.
Wenisch]], [[George Candea]], [[Jürgen Cito]], [[Philipp Leitner]], [[Harald C. Gall]], [[Colin Scott]], [[Scott Shenker]], [[George Necula]], [[Gist]]
  - concepts: [[デルタデバッギング]], [[階層的デルタデバッギング]], [[障害スケッチング]], [[フィードバック駆動開発]], [[分散実行最小化]]
- Key insight: ddmin (2002) → HDD (2006) → DEMi (2016) form a lineage of test-case minimization from general-purpose → structured → distributed. Failure Sketching is a complementary technique that diagnoses "untestable production failures" through cooperative analysis. FDD, which integrates operational metrics into development feedback, is a forerunner of today's DevOps discussions.

## [2026-06-26] ingest-paper | The First 50 Years of Software Reliability Engineering
- Source: `.raw/papers/arxiv-1902.06140.pdf`
- Summary: [[@2019__arXiv__The First 50 Years of Software Reliability Engineering - A History of SRE with First Person Accounts]]
- Pages created: [[James J. Cusick]], [[John Musa]], [[Norman F. Schneidewind]], [[Martin L. Shooman]], [[ISSRE]]
- Pages updated: [[Michael R. Lyu]], [[SREの工学化]], [[ソフトウェア信頼性工学]]
- Key insight: A 50-year history that backs up, with interviews with the founders, the development stages Hudson (1967) → NATO (1968) → Jelinski-Moranda/Shooman (1971) → Musa (1975) → systematization (1987-1996) → expansion into new areas (2000s-2010s).

## [2026-06-26] ingest-paper | Software Reliability Engineering: A Roadmap
- Source: `.raw/papers/Lyu-2007---Software-Reliability-Engineering---A-Roadmap.pdf`
- Summary: [[@2007__FOSE__Software Reliability Engineering - A Roadmap]]
- Pages created: [[ソフトウェア信頼性工学]], [[ソフトウェア信頼性成長モデル]]
- Pages updated: [[Michael R.
Lyu]], [[ソフトウェア耐障害性]], [[Design for Reliability]], [[SREの工学化]]
- Key insight: A roadmap that systematizes the four stages of the fault lifecycle (prevention, removal, tolerance, prediction) and the four components of the SRE process, and presents five axes of future directions.

## [2026-06-26] ingest-slides | From Sysadmins to (almost) Flying Unicorns
- Source: `.raw/slides/srecon23emea-herail/srecon23emea-herail.pdf`
- Visual pages: `.raw/slides/srecon23emea-herail/pages/` (64 pages)
- Media: none (video and transcript not obtained)
- Summary: [[@2023__SREcon23 EMEA__From Sysadmins to (almost) Flying Unicorns]]
- Pages created: [[Guillaume Hérail]], [[Gilberto Müller]], [[Sony Interactive Entertainment]], [[SRE組織変革]]
- Pages updated: [[SRE]]
- Key insight: The combination of TOS (an interrupt-absorbing layer) and CFT (SRE participation at the design stage) structurally resolved both the problem of SRE sitting outside the feedback loop and the problem of toil. Running the Reliability Meetup for 2.5 years (22 sessions, 600+ participants) contributed to spreading a reliability culture.

## [2026-06-26] ingest-paper | Database/distributed system anomaly diagnosis: batch of 6 papers
- Source: `.raw/papers/p1176-ma.pdf`
- Summary: [[@2020__PVLDB__Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases]]
- Pages created: [[iSQUAD]], [[間欠的遅延クエリ]]
- Pages updated: [[Minghua Ma]], [[Dan Pei]], [[Shenglin Zhang]], [[異常検知]], [[データベース自律診断]]
- Key insight: iSQUAD diagnoses intermittent slow queries caused by external factors using TOPIC clustering and a Bayesian case model, achieving an F1 of 80.4%.

## [2026-06-26] ingest-paper | OS Pre-trained Transformer
- Source: `.raw/papers/OSprey.pdf`
- Summary: [[@2024__arXiv__OS Pre-trained Transformer - Predicting Query Latencies across Changing System Contexts]]
- Pages created: [[Tim Kraska]], [[OSprey]], [[クエリレイテンシ予測]]
- Pages updated: [[Samuel Madden]], [[MIT CSAIL]]
- Key insight: By factorizing workload-specific and system-specific effects, OSprey generalizes to multiple systems after training on a single system.

## [2026-06-26] ingest-paper | Multivariate Log-based Anomaly Detection for Distributed Database
- Source: `.raw/papers/arxiv-2406.07976.pdf`
- Summary: [[@2024__KDD__Multivariate Log-based Anomaly Detection for Distributed Database]]
- Pages created: [[Apache IoTDB]], [[ログベース異常検知]]
- Pages updated: [[Lingzhe Zhang]], [[Tong Jia]], [[Ying Li]], [[Peking University]], [[異常検知]]
- Key insight: Anomaly detection from single-node logs is insufficient for distributed databases; the paper demonstrates that integrating multi-node logs improves accuracy by about
12%.

## [2026-06-26] ingest-paper | DBPA - A Benchmark for Transactional Database Performance Anomalies
- Source: `.raw/papers/2026_Unknown_DBPA_Benchmark_Transactional_Database_Performance.pdf`
- Summary: [[@2023__PACMMOD__DBPA - A Benchmark for Transactional Database Performance Anomalies]]
- Pages created: [[データベース性能異常ベンチマーク]]
- Pages updated: [[Bin Cui]], [[Shiyue Huang]], [[ZTE Corporation]], [[データベース自律診断]]
- Key insight: DBPA systematizes deterministic procedures for reproducing OLTP performance anomalies, addressing the data-scarcity bottleneck of ML-based diagnosis in an orthogonal way.

## [2026-06-26] ingest-paper | LogDB - Multivariate Log-based Failure Diagnosis for Distributed Databases
- Source: `.raw/papers/arxiv-2505.01676.pdf`
- Summary: [[@2025__arXiv__LogDB - Multivariate Log-based Failure Diagnosis for Distributed Databases]]
- Pages created: (updated the entities/concepts shared with the MultiLog agent)
- Pages updated: [[Lingzhe Zhang]], [[Tong Jia]], [[Ying Li]], [[ログベース異常検知]], [[異常検知]]
- Key insight: LogDB, an extension of MultiLog, achieves failure diagnosis that is robust across different workloads and anomaly types through per-node log feature extraction and compression plus aggregation at a master node.

## [2026-06-26] ingest-paper | Towards Close-To-Zero Runtime Collection Overhead
- Source: `.raw/papers/Zhang-et-al.-2025---Towards-close-to-zero-runtime-collection-overhead...-sed-anomaly-diagnosis-on-system-faults-for-distributed-storage-system.pdf`
- Summary: [[@2025__IEEE TSC__Towards Close-To-Zero Runtime Collection Overhead - Raft-Based Anomaly Diagnosis on System Faults for Distributed Storage System]]
- Pages created: [[RBAD]], [[Raftログ診断]]
- Pages updated: [[Lingzhe Zhang]], [[Tong Jia]], [[Ying Li]], [[分散ストレージ]], [[異常検知]]
- Key insight: Raft logs are available with zero collection overhead as a byproduct of the consensus protocol, and the method achieves anomaly diagnosis accuracy that exceeds monitoring-database methods by 15.38% and log-based methods by 53.10%.

## [2026-06-26] ingest-paper | Ultra Ethernet's Design Principles and Architectural Innovations
- Source: `.raw/papers/arxiv-2508.08906.pdf`
- Summary: [[@2025__arXiv__Ultra Ethernet's Design Principles and Architectural Innovations]]
- Pages created: [[@2025__arXiv__Ultra Ethernet's Design Principles and Architectural Innovations]], [[Ultra Ethernet]]
- Pages updated: [[RDMA]], [[RoCE設計課題]], [[Torsten Hoefler]]
- Key insight: Grounded in the asymmetry of "compute 1,000x, bandwidth 100x", UE 1.0 is the formal next-generation standard succeeding RoCEv2, one that makes "computationally expensive" mechanisms such as packet spraying, selective acknowledgment, and zero-RTT PDC establishment reasonable in silicon. It is the first paper-level exposition written by the specification's own authors.

## [2026-06-26] ingest | How Complex Systems Fail (Cook 1998)
- Source: `.raw/articles/how-complexsystems-fail-2026-06-26.md`
- URL: https://how.complexsystems.fail/
- Summary: [[@1998__CtL__How Complex Systems Fail]]
- Pages created: [[@1998__CtL__How Complex Systems Fail]], [[Richard I. Cook]], [[複雑システム障害論]], [[潜在的障害]], [[ヒンドサイトバイアス]]
- Pages updated: [[根本原因分析]], [[Metastable Failure]], [[自動化の皮肉]]
- Key insight: Cook (1998) formulated root cause attribution as "a social act, not a technical one" and systematized, in 18 propositions, that complex systems always run in degraded mode while harboring latent failures. It is a fundamental re-questioning of AIOps RCA research as a whole.

## [2026-06-26] ingest-slides | とあるSREの博士「過程」
- Source: `.raw/slides/srenext2025/srenext2025.pdf`
- Visual pages: `.raw/slides/srenext2025/pages/`
- Media: `.raw/slides/srenext2025/transcript.md` (YouTube Japanese auto-captions)
- Summary: [[@2025__SRE NEXT 2025__とあるSREの博士「過程」]]
- Pages created: (none; no new entities/concepts)
- Pages updated: [[坪内佑樹]], [[SRE NEXT]], [[SREの工学化]]
- Key insight: Completes the talk trilogy SRE NEXT 2024 "工学としての SRE" → IOTS2025 "サイバネティクスの夢" → SRE NEXT 2025 "博士『過程』". The talk revealed that the "sense of holding one's own body of specialized technical knowledge within oneself" and the "ability to connect with other academic fields", both gained in the doctoral program, are the foundation for turning SRE into engineering and into systems theory.

## [2026-06-26] ingest-slides | SONiC Scale-Up Working Group から探る Scale-Up や Ultra Ethernet 機能の実装方法
- Source: `.raw/slides/sonic-jp-2026-ebiken/sonic-jp-2026-ebiken.pdf`
- Visual pages: `.raw/slides/sonic-jp-2026-ebiken/pages/`
- Media: none (no transcript)
- Summary: [[@2026__SONiC Workshop Japan 2026__SONiC Scale-Up Working Group から探る Scale-Up や Ultra Ethernet 機能の実装方法]]
- Pages created: [[海老澤健太郎]], [[Arrcus]]
- Pages updated: [[RDMA]], [[オープンネットワーキング]]
- Key insight: The comparison table of four schemes on 12 axes (p.6), RoCEv2 versus UE Transport/Falcon/MRC, shows that only RoCEv2 assumes a lossless fabric, uses Go-Back-N, and lacks OoO support, and that the transition to next-generation Ethernet is taking concrete shape at the implementation level. SONiC/SAI is implementing LLR, CBFC, and LLDP from UE spec v1.0.2, and the open networking stack is extending from Scale-Out into the Scale-Up domain as well.

##
[2026-06-26] enrich-source | LLM Wiki (Karpathy Gist): strengthen source/concept by re-reading the full gist
- Source: `https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f`
- Summary: [[@2026__GitHub Gist__LLM Wiki]]
- Pages updated: [[@2026__GitHub Gist__LLM Wiki]] (added the IDE metaphor, use cases, the index vs log division of roles, and tool recommendations), [[LLM Wikiパターン]] (added the Tolkien Gateway example and the query feedback principle)
- Pages created: [[Memex]] (newly created because the `[[Memex]]` link pointed to no page; organizes the Bush-Wiener-Karpathy lineage)
- Key insight: "Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase" was missing from the existing pages. The existing ingest (2026-06-19) got the conceptual core right but dropped the tips, use cases, and the IDE metaphor.

## [2026-06-26] ingest-slides | 再帰化への認知的転回 + なめらかなシステムと運用維持の終わらぬ未来
- Source: `.raw/slides/pepabo2022-the-turn-to-recursive-system/pepabo2022-the-turn-to-recursive-system.pdf`
- Source: `.raw/slides/dicomo2025-coherently-fittable-system/dicomo2025-coherently-fittable-system.pdf`
- Visual pages: `.raw/slides/pepabo2022-the-turn-to-recursive-system/pages/` (27 pages)
- Visual pages: `.raw/slides/dicomo2025-coherently-fittable-system/pages/` (69 pages)
- Media: none (no transcript)
- Summary: [[@2022__ペパボテックカンファレンス__再帰化への認知的転回]], [[@2025__DICOMO2025__なめらかなシステムと運用維持の終わらぬ未来]]
- Pages created: [[再帰化]], [[エフェクチュエーション]], [[@2022__ペパボテックカンファレンス__再帰化への認知的転回]], [[@2025__DICOMO2025__なめらかなシステムと運用維持の終わらぬ未来]]
- Pages updated: [[なめらかなシステム]], [[基礎情報学]], [[セルフクラフト]], [[三宅悠介]]
- Key insight: The theory of smooth systems (なめらかなシステム) has developed in three stages: 2018 (definition) → 2022 (recursivization (再帰化): making the implementation direction concrete) → 2025 (a shift toward purpose-generating smoothness). DICOMO2025 envisioned a change of viewpoint "from subjects to relationships" and AI agent networks that translate meaning and mediate relationships.

## [2026-06-26] ingest-paper | ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
- Source: `.raw/papers/arxiv-2405.14009.pdf`
- Summary: [[@2024__SOSP__ReCycle - Resilient Training of Large DNNs using Pipeline Adaptation]]
- Pages created: [[@2024__SOSP__ReCycle - Resilient Training of Large DNNs using Pipeline Adaptation]], [[Swapnil Gandhi]], [[Christos Kozyrakis]], [[ReCycle]]
- Pages updated: [[Stanford University]],
[[パイプライン並列化]], [[耐障害LLM訓練]]
- Key insight: Exploits the data-parallel redundancy and pipeline bubbles inherent in hybrid-parallel training to recover from failures without spare servers. Combining decoupled backpropagation with a staggered optimizer, it achieves up to 1.46× higher throughput than Oobleck and up to 1.64× higher than Bamboo.

## [2026-06-26] ingest-paper | Alibaba HPN: A Data Center Network for Large Language Model Training
- Source: `.raw/papers/Qian-et-al.-2024---Alibaba-HPN---A-data-center-network-for-large-language-model-training.pdf`
- Summary: [[@2024__SIGCOMM__Alibaba HPN - A Data Center Network for Large Language Model Training]]
- Pages created: [[@2024__SIGCOMM__Alibaba HPN - A Data Center Network for Large Language Model Training]]
- Pages updated: [[Alibaba HPN]], [[Kun Qian]], [[Alibaba Cloud]], [[LLM分散学習]], [[耐障害LLM訓練]], [[Rail-Optimizedトポロジ]], [[RDMA]]
- Key insight: To handle the periodic, bursty, high-volume nature of LLM training traffic, HPN combines a two-tier dual-plane design, rail optimization, and non-stacked dual ToR to eliminate ECMP hash polarization and fit 15K GPUs in a single Pod. Over 8 months of production operation it achieved a 14.9% throughput improvement over DCN+ and zero ToR single points of failure.

## [2026-06-26] ingest-paper | なめらかなシステムを目指して
- Source: `.raw/papers/dicomo2018-proceeding-antipop.pdf`
- Summary: [[@2018__DICOMO2018__なめらかなシステムを目指して]]
- Pages created: [[@2018__DICOMO2018__なめらかなシステムを目指して]], [[栗林健太郎]], [[コンテキスト・アウェアネス]], [[基礎情報学]]
- Pages updated: [[なめらかなシステム]], [[三宅悠介]], [[Ryosuke Matsumoto]], [[GMOペパボ]], [[サイバネティクス]]
- Key insight: Ingested the primary source paper for [[なめらかなシステム]]. This is where the original 2018 definition is found: it derives requirement (2), "the user's context is formed after the fact", from [[基礎情報学]] (HACS) and integrates it with context awareness. The design, which treats developers and operators symmetrically with "users", carries within it the potential to connect to the SRE context.

## [2026-06-26] ingest-paper | Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
- Source: `.raw/papers/sc24-gpu-gpu-interconnect.pdf`
- Summary: [[@2024__SC__Exploring GPU-to-GPU Communication - Insights into Supercomputer Interconnects]]
- Pages created: [[@2024__SC__Exploring GPU-to-GPU Communication - Insights into Supercomputer Interconnects]], [[Tiziano De Matteis]], [[Zebin Ren]], [[Animesh Trivedi]], [[Duncan Roweth]]
- Pages updated: [[Daniele De Sensi]], [[Lorenzo Pichetti]], [[Flavio Vella]], [[Torsten Hoefler]], [[集合通信]],
[[HPCインターコネクトベンチマーク]]
- Key insight: Measuring up to 4,096 GPUs on three European supercomputers, the paper quantifies for the first time three structural asymmetries: default settings are far from optimal on every system; *CCL wins for intra-node collectives, but MPI is up to 10x faster for inter-node point-to-point; and on InfiniBand Dragonfly+, network noise degrades allreduce by up to 50%.

## [2026-06-26] ingest-paper | Unicron: Economizing Self-Healing LLM Training at Scale
- Source: `.raw/papers/arxiv-2401.00134.pdf`
- Summary: [[@2024__arXiv__Unicron - Economizing Self-Healing LLM Training at Scale]]
- Pages created: [[@2024__arXiv__Unicron - Economizing Self-Healing LLM Training at Scale]], [[Tao He (Alibaba)]], [[Jingren Zhou]], [[Unicron]], [[弾性LLM訓練]]
- Pages updated: [[Kun Qian]], [[Alibaba Group]], [[耐障害LLM訓練]]
- Key insight: Unicron shifts the goal from "minimizing the downtime of an individual task" to "maximizing WAF across multiple tasks cluster-wide", and by achieving self-healing while fully inheriting Megatron's training efficiency, demonstrates a design whose advantage grows as failure frequency rises.

## [2026-06-26] ingest-paper | I've Got 99 Problems But FLOPS Ain't One
- Source: `.raw/papers/hotnets24-333.pdf`
- Summary: [[@2024__HotNets__I've Got 99 Problems But FLOPS Ain't One]]
- Pages created: [[@2024__HotNets__I've Got 99 Problems But FLOPS Ain't One]], [[AIデータセンタートポロジ]], [[Costin Raiciu]], [[University Politehnica of Bucharest]]
- Pages updated: [[Broadcom]], [[LLM分散学習]], [[データセンター輻輳制御]], [[LLMスケーリング則]]
- Key insight: Through a resource-quantity analysis of a 103.8T model derived from scaling laws, shows quantitatively that at million-GPU scale the biggest bottleneck is the network (especially scale-up bandwidth), not FLOPs, and that a move to multi-plane, multi-rail, multipath transport is unavoidable.

## [2026-06-26] ingest-paper | Generic and ML Workloads in an HPC Datacenter
- Source: `.raw/papers/arxiv-2409.08949.pdf`
- Summary: [[@2024__ICPADS__Generic and ML Workloads in an HPC Datacenter]]
- Pages created: [[@2024__ICPADS__Generic and ML Workloads in an HPC Datacenter]], [[HPCワークロード特性化]], [[Xiaoyu Chu]], [[Alexandru Iosup]], [[SURF]], [[Ivona Brandic]]
- Pages updated: [[GPUクラスタ運用]], [[Vrije Universiteit Amsterdam]]
- Key insight: On a national-scale mixed HPC system (ML + generic), ML jobs account for 15% of nodes and 9% of submissions yet consume 39% of energy, and 50% of total cluster energy is spent on jobs that do not complete: an HPC-scale demonstration of the principle that "the cost of failure should be measured by resources consumed, not by job count".

## [2026-06-26] ingest-paper | An Empirical Study on
Quality Issues of Deep Learning Platform
- Source: `.raw/papers/quality-issues-icse2023.pdf`
- Summary: [[@2023__ICSE__An Empirical Study on Quality Issues of Deep Learning Platform]]
- Pages created: [[@2023__ICSE__An Empirical Study on Quality Issues of Deep Learning Platform]], [[DLプラットフォーム品質問題]]
- Pages updated: [[Yanjie Gao]], [[Hongyu Zhang]], [[Microsoft Research]]
- Key insight: The first comprehensive empirical study, analyzing 360 quality issues on Platform-X, Microsoft's internal DL platform. User-side faults are the largest category at 43.34%, of which 15.00% stem from buggy code, yet simple Job Resubmission alone mitigates 34.72% of all issues, revealing an asymmetry.

## [2026-06-26] ingest-slides | 工学としてのSRE再訪
- Source: `.raw/slides/srenext2024_yuuk1/srenext2024_yuuk1.pdf`
- Visual pages: `.raw/slides/srenext2024_yuuk1/pages/`
- Media: `.raw/slides/srenext2024_yuuk1/transcript.md` (YouTube auto-captions, ja)
- Summary: [[@2024__SRE NEXT 2024__工学としてのSRE再訪]]
- Pages created: [[SREの工学化]], [[Mark Burgess]]
- Pages updated: [[坪内佑樹]], [[さくらインターネット研究所]], [[SRE NEXT]], [[Topotal]], [[自動化の皮肉]], [[サイバネティクス]]
- Key insight: Traces SRE's "craft → engineering" transition historically from USENIX LISA in 1987 to the SREcon integration in 2024, and presents six open challenges and a map of connections to academic fields. Together with IOTS2025's "engineering → systems theory", it confirms the two-stage structure of Tsubouchi's intellectual lineage.

## [2026-06-26] ingest-slides | SREはサイバネティクスの夢をみるか
- Source: `.raw/slides/iots2025_presentation/iots2025_presentation.pdf`
- Visual pages: `.raw/slides/iots2025_presentation/pages/` (137 pages)
- Media: `.raw/slides/iots2025_presentation/transcript.md`
- Summary: [[@2025__IOTS2025__SREはサイバネティクスの夢をみるか]]
- Pages created: [[@2025__IOTS2025__SREはサイバネティクスの夢をみるか]], [[坪内佑樹]], [[さくらインターネット研究所]], [[自動化の皮肉]], [[なめらかなシステム]]
- Pages updated: [[サイバネティクス]], [[セルフクラフト]], [[テレメトリ]], [[特徴量削減]], [[Fault Localization]], [[サービスレベル目標]], [[HeteroTSDB]], [[MetricSifter]], [[三宅悠介]]
- Key insight: Reinterprets SRE through cybernetics (feedback loops, second-order, emergence) and presents a shift from reductionist reliability engineering to a holistic view of systems. Surveys the three contributions of the doctoral dissertation across three layers, measurement → storage → analysis, and looks ahead to SRE in the age of AI agents by way of the ironies of automation, smooth systems (なめらかなシステム), and self-craft (セルフクラフト).

## [2026-06-26] ingest-paper | Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
- Source:
`.raw/papers/arxiv-2507.04786.pdf`
- Summary: [[@2025__IEEE__Demystifying NCCL - An In-depth Analysis of GPU Communication Protocols and Algorithms]]
- Pages created: [[@2025__IEEE__Demystifying NCCL - An In-depth Analysis of GPU Communication Protocols and Algorithms]], [[ATLAHS]], [[Siyuan Shen]], [[Zhiyi Hu]]
- Pages updated: [[NCCL]], [[Torsten Hoefler]], [[集合通信]]
- Key insight: The first systematic account of the internals of the NCCL "black box": the Simple/LL/LL128 trade-offs are asymmetric between intra-node and inter-node (LL128 is the most stable across all sizes on intra-node NVLink; Simple is fastest for large inter-node messages), and the 2k-1 step structure of Ring AllReduce and the SM-asymmetric pipeline of Tree AllReduce are common premises of Mycroft's trace accuracy and VCCL's SM-free design.

## [2026-06-26] ingest-paper | An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters
- Source: `.raw/papers/arxiv-2510.00991.pdf`
- Summary: [[@2026__arXiv__An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters]]
- Pages created: [[wiki/sources/@2026__arXiv__An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters]], [[Mingjun Zhang]]
- Pages updated: [[Infrawaves]], [[Menghao Zhang]], [[集合通信]], [[耐障害LLM訓練]], [[RDMAネットワーク監視]]
- Key insight: With three mechanisms (SM-free P2P, primary-backup QPs, and a sliding-window RDMA monitor), VCCL demonstrates in 24K-GPU production that the CCL layer can autonomously resolve SM contention, NIC failures, and network anomalies respectively; the design philosophy is to raise the CCL from a "black box" to a "self-optimizing, self-diagnosing layer".

## [2026-06-26] ingest-paper | Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks
- Source: `.raw/papers/osdi25-jiang.pdf`
- Summary: [[@2025__OSDI__Training with Confidence - Catching Silent Errors in Deep Learning Training with Automated Proactive Checks]]
- Pages created: [[Yuxuan Jiang]], [[Peng Huang]], [[TrainCheck]], [[OrderLab]], [[DLトレーニングサイレントエラー]], [[訓練不変条件]]
- Pages updated: [[University of Michigan]], [[Heisenbug]]
- Key insight: Silent errors in DL training are invisible in loss and accuracy, but can be detected within one iteration by automatically inferring and continuously checking training invariants (weight consistency, API call order, etc.); the choice of observation granularity determines detectability, a
Heisenbug-style insight, realized here specifically for DL.

## [2026-06-25] wiki-query | QoA three axes / Zadka cost model / anti-pattern prevention
- Query 1: Detailed explanation of the three axes of QoA (Quality of Alerts): indicativeness / precision / handleability
- Query 2: Detailed explanation of the Zadka SREcon22 cost model (anti-quality formulation, three-way alarm classification, four-interval latency decomposition)
- Query 3: How to prevent QoA anti-patterns (three layers: design-time Avoidance / operations-time Reaction / automated detection and improvement)
- Pages created:
  - [[QoA-3軸-詳細解説]]
  - [[Zadka-コストモデル-詳細]]
  - [[QoAアンチパターン-防ぎ方]]
- Pages updated: [[index]]

## [2026-06-25] ingest-slides | Symptom-based Alerting for Machine Learning
- Source: `.raw/slides/srecon23emea-weichbrodt/srecon23emea-weichbrodt.pdf`
- Visual pages: `.raw/slides/srecon23emea-weichbrodt/pages/`
- Media: `.raw/slides/srecon23emea-weichbrodt/media/audio.m4a`, supplemented with YouTube auto-generated captions (English)
- Summary: [[@2023__SREcon23 EMEA__Symptom-based Alerting for Machine Learning]]
- Pages created: [[Lina Weichbrodt]], [[MLモデル監視]]
- Pages updated: [[アラート管理]], [[アクショナブルアラート]], [[アラート疲労]]
- Key insight: A framework that carries SRE's symptom-based alerting over to ML and assigns three tiers of monitoring priority, working backward from the output side. Against the ML-monitoring literature's heavy focus on input data drift, it takes the practical stance of prioritizing output-distribution monitoring as a 'catch-all' method.

## [2026-06-25] wiki-query | Created an academic/practitioner map of alerting research
- Source: [[アラーティングの進歩-年代別]]
- Summary: Created [[アラーティング学術実務マップ]]. Maps each era into three categories: [A] academic / [P] practitioner / [H] industrial research. Organizes the division-of-labor pattern over time (naming the problem → establishing norms → designing methods → validating → putting into practice) as a table. Also identifies four points of disconnect between academia and practice.

## [2026-06-24] ingest-paper | LLM-Powered Multi-Agent Collaboration for Intelligent Industrial On-Call Automation
- Source: `.raw/papers/Ruowei__OncallX_to_ASE_25.pdf`
- Summary: [[@2025__ASE__LLM-Powered Multi-Agent Collaboration for Intelligent Industrial On-Call Automation]]
- Pages created: [[@2025__ASE__LLM-Powered Multi-Agent Collaboration for Intelligent Industrial On-Call Automation]], [[Ruowei Fu]], [[OncallX]], [[オンコール自動化]]
- Pages updated: [[Shenglin Zhang]], [[ByteDance]], [[Nankai University]], [[マルチエージェント協調]], [[LLMによる根本原因分析]], [[インシデント管理]]
- Key insight: OncallX automates the entire on-call lifecycle with three modules, 'user intent enhancement → tree-search multi-agent QA → KG-augmented triage', and is the first industrial integration case, demonstrating a 789x speedup in response and a 50x speedup in triage over two months in ByteDance production.

## [2026-06-24] wiki-query |
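現代の理想的なアラーティング: added the four-phase compression model
- Target: [[現代の理想的なアラーティング]]
- Summary: Added a model that compresses the 12 intervention points (including half-integer layers) of Principle 5, 'staged noise removal', along the firing-time axis into four phases: A assurance / B suppression / C delivery optimization / D resolution support. Inserted as a supplement, leaving the original detailed table in place.

## [2026-06-24] wiki-query | 現代の理想的なアラーティング-判断モデル
- Target: [[現代の理想的なアラーティング-判断モデル]]
- Summary: Restructured [[現代の理想的なアラーティング]] as a separate note in decision-model form. Defines alerting not as 'a mechanism for notifying people' but as a reliability control loop that routes signals to a page, a ticket, autonomous handling, diagnostic information, or deletion, based on customer impact, urgency, responder, automatability, and the risk of a miss.
- Structure: vocabulary separation, core principles, seven-stage lifecycle, paging decision table, quality model, organizational model, maturity model, audit checklist, open issues.
- Pages referenced: [[現代の理想的なアラーティング]], [[アラーティングの進歩-年代別]], [[アラート管理]], [[アクショナブルアラート]], [[Quality of Alerts]], [[サービスレベル目標]], [[Prometheusルールリント]], [[Adaptive Paging]], [[認知的徒弟制]], [[agentic SRE]]

## [2026-06-24] ingest-paper | ClickHouse - Lightning Fast Analytics for Everyone
- Source: `.raw/papers/p3731-schulze.pdf`
- Summary: [[@2024__PVLDB__ClickHouse - Lightning Fast Analytics for Everyone]]
- Pages created: [[@2024__PVLDB__ClickHouse - Lightning Fast Analytics for Everyone]], [[ClickHouse]], [[ClickHouse Inc|ClickHouse Inc.]], [[Robert Schulze]], [[Alexey Milovidov]], [[列指向OLAPデータベース]]
- Pages updated: [[LSMツリー]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: ClickHouse's MergeTree* is a flat variant of the LSM tree in which all parts are treated as equal, and it achieves high OLAP throughput through payload-specific merge strategies (replacing / aggregating / TTL) and WAL-less direct writes.

## [2026-06-24] wiki-query | 現代の理想的なアラーティング
- Target: [[現代の理想的なアラーティング]]
- Summary: Integrated the cross-cutting findings from the 45 sources of [[アラーティングの進歩-年代別]] and 14 pages under wiki/concepts to define 'modern ideal alerting'.
- Structure: integrated definition (one sentence) + 7 design principles (symptom-first, actionable, pre-firing assurance, right recipient, multi-layer intervention, measurable quality, agentic readiness) + 4 facets of organizational conditions (incentives, social agreement, capability building, culture) + 5 unreached goals + a table of 10 judgment criteria
- Pages referenced: [[アラーティングの進歩-年代別]], [[アラート管理]], [[アクショナブルアラート]], [[Quality of Alerts]], [[アラート疲労]], [[アラートポリューション]], [[アラート抑制]], [[サービスレベル目標]], [[Prometheusルールリント]], [[Adaptive Paging]], [[認知的徒弟制]], [[Warningアラート]], [[アラート相関]], [[agentic SRE]], and others

## [2026-06-24] wiki-query | アラーティングの進歩-年代別: major update
- Target: [[アラーティングの進歩-年代別]]
- Summary: Following the bulk ingest of SREcon slides, videos, and articles on 2026-06-23,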
the chronological review was fully updated.
- Added sources: 15 (SREcon16 through SREcon23, SRE NEXT 2023, Cloudflare Blog 2022)
- Key changes:
  - Added a practitioner ecosystem (Treat, Rabenstein, Chen/Baidu, Jalleda/Zynga, Cloudflare, Alibaba) to §4
  - Added a new §5.5 (Adaptive Paging 2019, LinkedIn MAD spike detection 2021)
  - Substantially expanded §6 (pint, Zadka cost model, alert pollution, cognitive apprenticeship, prepalert, Runbook governance)
  - Extended §10 from 'four major currents' to 'five major currents' (fifth: human and organizational intervention as an independent axis)
  - Updated the intervention points of the first current in §10 from '8 layers' to 'more than 10 layers'
  - Added open question (f) to §11 (integrated design of technical and human interventions)
  - Added 4 related entries and 15 sources to the frontmatter

## [2026-06-24] ingest-slides | Bulk ingest of 7 SREcon slide decks

Ingested 7 SREcon presentation slide decks in bulk. Industrial cases from 2015–2025 on anomaly detection, changepoint detection, time-series mining, and alert management.

- Source: `.raw/slides/sre19apac_slides_chen_golden_signals/` (34 pages)
- Summary: [[@2019__SREcon19 Asia__Anomaly Detection on Golden Signals]]
- Pages created: source + entity (Yu Chen updated)
- Key insight: Builds Baidu's anomaly detection on golden signals using STL decomposition and per-day-of-week normalization.
- Source: `.raw/slides/srecon15europe_slides_clegg/` (21 pages)
- Summary: [[@2015__SREcon15 Europe__Signatures, Patterns, and Trends - Timeseries Data Mining at Etsy]]
- Pages created: source + [[Andrew Clegg]], [[Etsy]] + [[時系列類似度検索]]
- Key insight: A 2015 time-series mining practice that integrated SAX/DTW into Kale/Skyline.
- Source: `.raw/slides/srecon24emea_slides-shubin/` (87 pages)
- Summary: [[@2024__SREcon24 EMEA__Anomaly Detection in Time Series from Scratch Using Statistical Analysis]]
- Pages created: source + [[Ivan Shubin]], [[Booking.com]]
- Pages updated: [[異常検知]]
- Key insight: A quantitative result showing that MAD reduces the standard-deviation inflation caused by past incidents from 78% to 31%. The Granomaly service.
- Source: `.raw/slides/sre25amer_slides-neidel/` (32 pages)
- Summary: [[@2025__SREcon25 Americas__Using Statistical Techniques to Automatically Detect Game-Breaking Issues]]
- Pages created: source + [[Ian Neidel]], [[Open Connect]]
- Pages updated: [[変化点検知]]
- Key insight: Changepoint detection for Netflix game QoE.
- Source: `.raw/slides/sre25amer_slides-cirella/` (27 pages)
- Summary: [[@2025__SREcon25 Americas__Stopping Performance Regression via Changepoint Detection]]
- Pages created: source
+ [[Joseph Cirella]], [[Shanthini Velan]]
- Pages updated: [[変化点検知]]
- Key insight: Bloomberg uses the PELT algorithm to automatically detect performance regressions in CI/CD pipelines.
- Source: `.raw/slides/srecon17asia_slides_wang_0/` (18 pages)
- Summary: [[@2017__SREcon17 Asia__Smart Monitoring System for Anomaly Detection on Business Trends in Alibaba]]
- Pages created: source + entity (Zhaogang Wang, Alibaba Group updated)
- Pages updated: [[異常検知]]
- Key insight: Anomaly detection on business trends using N-σ adaptive thresholds and STL decomposition.
- Source: `.raw/slides/srecon15_slides_qu/` (35 pages)
- Summary: [[@2015__SREcon15__Smart Monitor System For Automatic Anomaly Detection at Baidu]]
- Pages created: source + [[Xianping Qu]]
- Key insight: A monitoring platform made up of four modules: BNS, BMS, DMP, and Transfer.

## [2026-06-23] ingest-slides | Automatic Metric Screening for Service Diagnosis
- Source: `.raw/slides/srecon18americas_slides_chen/srecon18americas_slides_chen.pdf`
- Visual pages: `.raw/slides/srecon18americas_slides_chen/pages/` (15 pages)
- Media: `.raw/slides/srecon18americas_slides_chen/media/audio.m4a` (transcript not generated)
- Summary: [[@2018__SREcon18 Americas__Automatic Metric Screening for Service Diagnosis]]
- Pages created: [[@2018__SREcon18 Americas__Automatic Metric Screening for Service Diagnosis]]
- Pages updated: [[Yu Chen (Baidu)]], [[Baidu]], [[Fault Localization]], [[RCA入力選別]], [[特徴量削減]]
- Key insight: Baidu's automatic metric screening does not depend on golden-metric configuration; it recommends 'the set of instances to look at next + the set of anomalous metrics' using KDE-based anomaly scoring, DBSCAN clustering, and digest ranking. It can be positioned as pre-LLM statistical RCA input selection that predates the FluxRank paper.

## [2026-06-23] ingest-slides | A Practical Guide to Monitoring and Alerting with Time Series at Scale
- Source: `.raw/slides/srecon17_americas_slides_wilkinson/srecon17_americas_slides_wilkinson.pdf`
- Visual pages: `.raw/slides/srecon17_americas_slides_wilkinson/pages/` (82 pages)
- Media: none (official MP3 available, transcript not generated)
- Summary: [[@2017__SREcon17 Americas__A Practical Guide to Monitoring and Alerting with Time Series at Scale]]
- Pages created: [[@2017__SREcon17 Americas__A Practical Guide to Monitoring
and Alerting with Time Series at Scale]]
- Pages updated: [[Jamie Wilkinson]], [[Prometheus]], [[アラート管理]], [[アクショナブルアラート]], [[サービスレベル目標]], [[ヒストグラムメトリクス]]
- Key insight: The maintenance cost of monitoring must be kept sublinear in service scale. Static thresholds break easily under differences in capacity and workload; alert conditions recast in terms of time, distributions, and SLO violations, together with Prometheus's labeled time series, recording rules, and topology-based aggregation, are the building blocks for reducing maintenance cost.

## [2026-06-23] ingest-slides | Alerting for Distributed Systems - A Tale of Symptoms and Causes, Signals and Noise
- Source: `.raw/slides/srecon16europe_slides_rabenstein/srecon16europe_slides_rabenstein.pdf`
- Visual pages: `.raw/slides/srecon16europe_slides_rabenstein/pages/` (30 pages)
- Media: none
- Summary: [[@2016__SREcon16 Europe__Alerting for Distributed Systems - A Tale of Symptoms and Causes, Signals and Noise]]
- Pages created: [[Björn Rabenstein]], [[SoundCloud]]
- Pages updated: [[Prometheus]], [[アラート管理]], [[アクショナブルアラート]], [[Prometheusルールリント]], [[サービスレベル目標]]
- Key insight: Because causes and symptoms are only loosely coupled in distributed systems, pages should be limited to symptoms and imminent service problems, while causes are handled as tickets, informational notifications, and investigation-level granularity. Prometheus-style time-series alerting goes beyond static thresholds and can page on projected future exhaustion, but paging rules must be kept simple and robust.

## [2026-06-23] ingest-slides | Less Alarming Alerts! 
- Source: `.raw/slides/srecon16_slides_treat/srecon16_slides_treat.pdf`
- Visual pages: `.raw/slides/srecon16_slides_treat/pages/` (55 pages)
- Media: `.raw/slides/srecon16_slides_treat/media/treat.mp3` (transcript not generated)
- Summary: [[@2016__SREcon16__Less Alarming Alerts]]
- Pages created: [[Robert Treat]], [[OmniTI]]
- Pages updated: [[アラート管理]], [[アクショナブルアラート]], [[アラート疲労]]
- Key insight: As early as 2016, Treat was proposing pre-firing governance: restrict alerts to 'pages that wake a person up', and treat any alert whose business impact, remediation steps, recipient, and preventability cannot be explained as a bad alert to be deleted, downgraded to a notification, or fixed.

## [2026-06-23] ingest-slides | Draining the Flood - A Combat against Alert Fatigue
- Source: `.raw/slides/srecon17asia-chen-draining-the-flood/srecon17asia-chen-draining-the-flood.pdf`
- Visual pages: `.raw/slides/srecon17asia-chen-draining-the-flood/pages/` (20 pages)
- Media: none
- Summary: [[@2017__SREcon17 Asia__Draining the Flood - A Combat against Alert Fatigue]]
- Pages created: [[Argus (Baidu)]]
- Pages updated: [[Yu Chen (Baidu)]], [[アラート管理]], [[アラート疲労]], [[アラート集約]]
- Key insight: In 2017 Baidu rolled out four measures at once (grouping, calibration, escalation, auto-remediation) and achieved an 85% reduction. Calibrating severity by attention rate is an idea that predates AlertRank by three years.

## [2026-06-23] ingest-slides | Anomaly Detection in Infrequently Occurred Patterns
- Source: `.raw/slides/srecon17amer-wang-anomaly-infrequent/srecon17amer-wang-anomaly-infrequent.pdf`
- Visual pages: `.raw/slides/srecon17amer-wang-anomaly-infrequent/pages/` (15 pages)
- Media: `.raw/slides/srecon17amer-wang-anomaly-infrequent/transcript.md` (Whisper transcription)
- Summary: [[@2017__SREcon17Americas__Anomaly Detection in Infrequently Occurred Patterns]]
- Pages created: [[Dong Wang]], [[Baidu]]
- Pages updated: [[異常検知]]
- Key insight: An industrial case in which a two-stage method for holiday traffic that is infrequent and falls on non-fixed dates (find similar days via k-means clustering on CDFs, then layer real-time ratio correction on top) outperforms conventional methods such as Holt-Winters.

## [2026-06-23] ingest-slides | Want to Solve Over-Monitoring and Alert Fatigue? Create the Right Incentives! 
- Source: `.raw/slides/srecon17emea-jalleda-over-monitoring/srecon17emea-jalleda-over-monitoring.pdf`
- Visual pages: `.raw/slides/srecon17emea-jalleda-over-monitoring/pages/` (61 pages)
- Media: none (yt-dlp failed, no transcript)
- Summary: [[@2017__SREcon17 Europe__Want to Solve Over-Monitoring and Alert Fatigue - Create the Right Incentives]]
- Pages created: [[Kishore Jalleda]], [[Zynga]], [[アラート疲労]]
- Pages updated: [[アラート管理]], [[アラートポリューション]], [[アクショナブルアラート]]
- Key insight: The root cause of alert fatigue is misaligned incentives; the Clean Room initiative, which makes SRE support conditional on an alert budget, achieved a 90% reduction in false alarms.

## [2026-06-23] ingest-video | Monitoring Cloudflare's Planet-Scale Edge Network
- Source: https://www.youtube.com/watch?v=lHtY7TUsLzk (SREcon17 Europe, 2017-09-01)
- Transcript: `.raw/videos/srecon17europe-bostock-monitoring-cloudflare/transcript.md` (YouTube auto-generated)
- Frames: `.raw/videos/srecon17europe-bostock-monitoring-cloudflare/frames/` (12 frames)
- Summary: [[@2017__SREcon17 Europe__Monitoring Cloudflare's Planet-Scale Edge Network]]
- Pages created: [[@2017__SREcon17 Europe__Monitoring Cloudflare's Planet-Scale Edge Network]], [[Matt Bostock]]
- Pages updated: [[Cloudflare]], [[Prometheus]], [[アラート管理]]
- Key insight: Cloudflare places an independent Prometheus in each of its 116 PoPs and aggregates to the core data centers via federation, a reliability design that puts monitoring in the same failure domain as what it monitors. It promoted 'alert on symptoms, not causes' and 'alert on services, not machines' as organizational principles, marking the start of an upstream shift that led five years later to the development of pint (rule-health assurance).

## [2026-06-23] ingest-video | Introduction to Alibaba Monitoring System
- Source: https://www.youtube.com/watch?v=yXaV6Eedrzo (SREcon18 Asia, 2018-06-06)
- Transcript: `.raw/videos/youtube-yXaV6Eedrzo/transcript.md` (YouTube auto-generated en-orig)
- Frames: `.raw/videos/youtube-yXaV6Eedrzo/frames/` (12 frames)
- Summary: [[@2018__SREcon18 Asia__Introduction to Alibaba Monitoring System]]
- Pages created: [[@2018__SREcon18 Asia__Introduction to Alibaba Monitoring System]], [[Ren Xinchi]], [[Hammurabi]], [[ビジネスモニタリング]]
- Pages updated: [[Alibaba Group]], [[アラート管理]]
- Key insight:
Alibaba treats the business layer as the most important of its four monitoring layers and manages KPIs and P1–P4 priorities centrally in the CMDB Hammurabi. It expresses customer impact quantitatively with five golden elements (total count, success count, success rate, response time, failure count) and achieves fast recovery by overlaying change information (70% of failures are caused by changes).

## [2026-06-23] ingest | Monitoring our Monitoring (Cloudflare Blog)
- Source: `.raw/articles/monitoring-our-monitoring-2026-06-23.md`
- Summary: [[@2022__Cloudflare-Blog__Monitoring-our-Monitoring]]
- Pages created: [[wiki/sources/@2022__Cloudflare-Blog__Monitoring-our-Monitoring]], [[Cloudflare]], [[pint]], [[Prometheusルールリント]]
- Pages updated: [[Prometheus]], [[アラート管理]]
- Key insight: Systematizes a 'zeroth intervention point': preventing the 'silent failure of monitoring' caused by Prometheus's empty-query problem by using pint's CI + daemon watchdog pattern.

## [2026-06-23] ingest-slides | Spike Detection in Alert Correlation at LinkedIn
- Source: `.raw/slides/srecon21_slides_singh/srecon21_slides_singh.pdf`
- Visual pages: `.raw/slides/srecon21_slides_singh/pages/` (30 pages)
- Media: `.raw/slides/srecon21_slides_singh/transcript.md` (Whisper audio transcription)
- Summary: [[@2021__SREcon21__Spike Detection in Alert Correlation at LinkedIn]]
- Pages created: [[Nishant Singh]], [[アラート相関]]
- Pages updated: [[LinkedIn]], [[異常検知]]
- Key insight: Using the modified Z-score (MAD-based), a simple statistical method that needs no ML, it separates transient spikes from the recommendations of the alert correlation system, achieving a false-positive rate below 1% and a 30–40% reduction in toil.

## [2026-06-23] ingest-slides | Dark Sky Camping: Reducing Alert Pollution with Modern Observability Practices
- Source: `.raw/slides/sre22amer-smith-dark-sky-camping/sre22amer-smith-dark-sky-camping.pdf`
- Visual pages: `.raw/slides/sre22amer-smith-dark-sky-camping/pages/` (36 pages)
- Media: `.raw/slides/sre22amer-smith-dark-sky-camping/transcript.md` (YouTube auto-generated English captions)
- Summary: [[@2022__SREcon22 Americas__Dark Sky Camping - Reducing Alert Pollution with Modern Observability Practices]]
- Pages created: [[Kristin Smith]], [[Campspot]], [[アラートポリューション]]
- Pages updated: [[アラート管理]], [[サービスレベル目標]], [[オブザーバビリティ]]
- Key insight: Adding more alerts degrades the signal-to-noise ratio, and the root cause is the psychological coupling of 'monitoring = safety'. OpenTelemetry auto-instrumentation was completed in 4 hours; the biggest barrier is cognitive bias about migration cost.

## [2026-06-23] ingest-paper | Sakana Fugu Technical Report - 
Source: `.raw/papers/Fugu_technical_report.pdf`
- Summary: [[@2026__Sakana AI__Sakana Fugu Technical Report]]
- Pages created: [[@2026__Sakana AI__Sakana Fugu Technical Report]], [[集合知]]
- Pages updated: [[マルチエージェント協調]], [[Sakana AI]], [[Yujin Tang]], [[Stefan Nielsen]], [[Edoardo Cetin]]
- Key insight: Fugu/Fugu-Ultra is the first system to demonstrate, at the level of a public production release, that orchestration can serve as a scaling axis. Surpassing model classes not included in the worker pool (Mythos Preview, Fable 5) is the strongest evidence that this scaling axis is independent.

## [2026-06-23] ingest-slides | Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
- Source: `.raw/slides/srecon23-americas-cruz-cognitive-apprenticeship/srecon23-americas-cruz-cognitive-apprenticeship.pdf`
- Visual pages: `.raw/slides/srecon23-americas-cruz-cognitive-apprenticeship/pages/`
- Media: `.raw/slides/srecon23-americas-cruz-cognitive-apprenticeship/transcript.md` (from YouTube auto-generated English captions)
- Summary: [[@2023__SREcon23 Americas__Cognitive Apprenticeship in Practice with Alert Triage Hour of Power]]
- Pages created: [[@2023__SREcon23 Americas__Cognitive Apprenticeship in Practice with Alert Triage Hour of Power]], [[Paige Cruz]], [[Chronosphere]], [[認知的徒弟制]]
- Pages updated: [[アラート管理]], [[アクショナブルアラート]]
- Key insight: Alert triage is not an innate skill; it can be taught systematically in structured meetings through the six stages of cognitive apprenticeship. Collective KEEP/TUNE/DELETE decisions are a 'human-side alert hygiene' intervention point, orthogonal to technical interventions, that should be added to the existing AIM classification.

## [2026-06-23] ingest-slides | Are We All on the Same Page? 
Let's Fix That
- Source: `.raw/slides/srecon19emea-mineiro-adaptive-paging/srecon19emea-mineiro-adaptive-paging.pdf`
- Visual pages: `.raw/slides/srecon19emea-mineiro-adaptive-paging/pages/`
- Media: `.raw/slides/srecon19emea-mineiro-adaptive-paging/transcript.md` (from YouTube auto-generated English captions)
- Summary: [[@2019__SREcon19 EMEA__Are We All on the Same Page - Lets Fix That]]
- Pages created: [[@2019__SREcon19 EMEA__Are We All on the Same Page - Lets Fix That]], [[Luis Mineiro]], [[Zalando SE]], [[Adaptive Paging]]
- Pages updated: [[アラート管理]], [[アクショナブルアラート]], [[分散トレーシング]]
- Key insight: Adaptive Paging resolves the 'fixed recipient' problem of symptom-based alerting by using the causal relationships in distributed traces, and adds an orthogonal intervention point, 'recipient routing', to the existing intervention points of alert management (manipulating count, severity, and content). It is a production implementation three years ahead of FSF (IEEE CLOUD 2022), which academically formalized the same kind of dynamic causal inference over span trees.

## [2026-06-23] ingest | AIのモデル崩壊と多様性 (joisino)
- Source: `.raw/articles/joisino-collapse-2026-06-22.md`
- URL: https://joisino.hatenablog.com/entry/collapse
- Summary: [[joisino-モデル崩壊と多様性-2026]]
- Pages created: [[joisino-モデル崩壊と多様性-2026]], [[モデル崩壊]]
- Pages updated: [[佐藤竜馬]], [[wiki/index.md]]
- Key insight: Repeated training on AI-generated data causes distribution contraction, a two-stage cascade that spreads even to human thinking patterns. An upper bound of π²/6 ≈ 1.645x on the increase in loss (for linear models) is the only mathematical mitigation guarantee, and scaling alone cannot solve the problem.

## [2026-06-23] ingest-slides | Modeling Alert Quality
- Source: `.raw/slides/srecon22-zadka-modeling-alert-quality/srecon22-zadka-modeling-alert-quality.pdf`
- Visual pages: `.raw/slides/srecon22-zadka-modeling-alert-quality/pages/`
- Media: `.raw/slides/srecon22-zadka-modeling-alert-quality/transcript.md` (from YouTube auto-generated captions)
- Summary: [[@2022__SREcon22 Americas__Modeling Alert Quality]]
- Pages created: [[@2022__SREcon22 Americas__Modeling Alert Quality]], [[Moshe Zadka]]
- Pages updated: [[Quality of Alerts]], [[アラート管理]]
- Key insight: A practical framework that quantifies alert quality as cost (anti-quality = alerting cost + non-alerting cost). It complements the academic three QoA axes of Yang+ DSN2022, published in the same year.

## [2026-06-23] ingest-paper | mABC: Multi-Agent Blockchain-inspired Collaboration for Root Cause Analysis in Micro-Services Architecture
- Source:
`.raw/papers/2024.findings-emnlp.232.pdf`
- Summary: [[@2024__EMNLP Findings__mABC - Multi-Agent Blockchain-inspired Collaboration for Root Cause Analysis in Micro-Services Architecture]]
- Pages created: [[@2024__EMNLP Findings__mABC - Multi-Agent Blockchain-inspired Collaboration for Root Cause Analysis in Micro-Services Architecture]], [[Wei Zhang (Beihang)]], [[Hongcheng Guo]], [[Cloudwise]]
- Pages updated: [[Beihang University]], [[LLMによる根本原因分析]], [[マルチエージェント協調]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]], [[wiki/index.md]]
- Key insight: Designing the division of roles among multiple agents yields performance gains beyond those from scaling model size, and blockchain voting contributes more to solution quality (human evaluation) than to accuracy.

## [2026-06-23] ingest-slides | Warningアラートを放置しない！アラート駆動でログやメトリックを自動収集する仕組みによる恩恵
- Source: `.raw/slides/warning-alert-driven-log-metric-collection/warning-alert-driven-log-metric-collection.pdf`
- Visual pages: `.raw/slides/warning-alert-driven-log-metric-collection/pages/`
- Media: `.raw/slides/warning-alert-driven-log-metric-collection/transcript.md` (YouTube auto-generated Japanese captions) / `.raw/slides/warning-alert-driven-log-metric-collection/media/audio.m4a`
- Summary: [[@2023__SRE NEXT__Warningアラートを放置しない！アラート駆動でログやメトリックを自動収集する仕組みによる恩恵]]
- Pages created: [[@2023__SRE NEXT__Warningアラートを放置しない！アラート駆動でログやメトリックを自動収集する仕組みによる恩恵]], [[池田将士]], [[面白法人カヤック]], [[prepalert]], [[Warningアラート]]
- Pages updated: [[Mackerel]], [[SRE NEXT]], [[アラート管理]], [[エラーバジェット]], [[サービスレベル目標]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/index]], [[wiki/hot]]
- Key insight: Even though Warning alerts do not demand a response as immediate as Critical ones, they indicate SLO and error-budget consumption and long-tail reliability degradation. Automatically attaching the logs and metrics from the moment of firing with prepalert reduces missing evidence at review time and moves low-severity alerts into an investigable state instead of leaving them unattended.

## [2026-06-23] ingest-slides | Runbookに何を書き、どのようにアラートを振り分けるか？
- Source: `.raw/slides/runbook-alert-triage/runbook-alert-triage.pdf`
- Visual pages: `.raw/slides/runbook-alert-triage/pages/`
- Media: `.raw/slides/runbook-alert-triage/transcript.md` (YouTube auto-generated Japanese captions) /
`.raw/slides/runbook-alert-triage/media/audio.m4a`
- Summary: [[@2023__SpeakerDeck__Runbookに何を書き、どのようにアラートを振り分けるか]]
- Pages created: [[@2023__SpeakerDeck__Runbookに何を書き、どのようにアラートを振り分けるか]], [[Sohei Iwahori]], [[GREE, Inc]]
- Pages updated: [[SRE NEXT]], [[アクショナブルアラート]], [[アラート管理]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/index]], [[wiki/hot]]
- Key insight: A Runbook is not merely a procedure document; it preserves the Why, background, and decision inputs behind an alert and so supports actionability after the alert fires. In addition, requiring the notification channel, response timing, scope, and corresponding Runbook to be stated when an alert is added makes it an upstream control that secures agreement on the action before firing.

## [2026-06-23] ingest-paper | JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis
- Source: `.raw/papers/arxiv-2606.19407.pdf`
- Summary: [[@2026__arXiv__JustDiag! A Diagnostic Justification Engine for Accountable Root Cause Analysis]]
- Pages created: [[診断的正当化]], [[Tingzhu Bi]], [[Xinrui Jiang]], [[Xun Zhang]], [[Pengcheng Su]], [[Congjie He]], [[Jinglin Li]], [[Meng Ma]], [[Beijing University of Posts and Telecommunications]]
- Pages updated: [[Ping Wang]], [[LLMによる根本原因分析]], [[仮説駆動RCA]], [[RCA評価設計]]
- Key insight: For LLM-based RCA systems, the Process Score (accountability quality) can be evaluated independently of the Outcome Score (final-answer quality), and existing methods (RCAgent: 9.5, Flow-of-Action: 9.3) score markedly low on accountability quality. 'Calibrated non-closure (stalled)' becomes a design advantage as a way to avoid false certainty.

## [2026-06-23] ingest-paper | Rethinking the Role of Efficient Attention in Hybrid Architectures
- Source: `.raw/papers/arxiv-2606.15378.pdf` (arXiv:2606.15378v1 [cs.CL] 2026-06-13)
- Summary: [[@2026__arXiv__Rethinking the Role of Efficient Attention in Hybrid Architectures]]
- Pages created: [[@2026__arXiv__Rethinking the Role of Efficient Attention in Hybrid Architectures]], [[ハイブリッドアテンションアーキテクチャ]], [[Zhiyuan Liu]], [[Xu Han]], [[Chaojun Xiao]], [[OpenBMB]]
- Pages updated: [[NoPE]], [[線形注意]], [[Tsinghua University]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/index]]
- Key insight: Efficient attention acts as an optimization prior for full attention, and large-window SWA delays the formation of retrieval heads in full attention (Large-Window
Laziness). Applying NoPE to the full-attention layers is an effective remedy for this problem.

## [2026-06-23] ingest-paper | Learning to Orchestrate Agents in Natural Language with the Conductor
- Source: `.raw/papers/arxiv-2512.04388.pdf` (ICLR 2026; arXiv:2512.04388)
- Summary: [[@2026__ICLR__Learning to Orchestrate Agents in Natural Language with the Conductor]]
- Pages created: [[Sakana AI]], [[Stefan Nielsen]], [[Edoardo Cetin]], [[Yujin Tang]], [[マルチエージェント協調]]
- Pages updated: [[テスト時計算スケーリング]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]]
- Key insight: With end-to-end RL mediated by natural language, a 7B Conductor achieves SOTA results surpassing GPT-5. It proposes a new test-time scaling axis (recursive topology): increasing the number of coordinated agent calls.

## [2026-06-23] ingest | Agentic RL Frameworks and Best Practices - Cameron R. Wolfe
- Source: `.raw/articles/agentic-rl-2026-06-23.md`
- Summary: [[Agentic-RL-Cameron-Wolfe-2026]]
- Pages created: [[Agentic-RL-Cameron-Wolfe-2026]], [[Cameron-R-Wolfe|Cameron R. Wolfe]]
- Pages updated: [[エージェント型強化学習]], [[index]], [[hot]], [[log]]
- Key insight: Two frameworks are new: ToRL (tool use emerges under RL-Zero, call rate 40→80%) and AgentGym-RL (higher-order behaviors emerge under ScalingInter-RL's three-phase curriculum). Across the five frameworks, three principles are independently rediscovered: step-level trajectories, asynchronous decoupling, and per-task normalization.

## [2026-06-23] ingest | Europe 2031 - ARQ Foundation policy scenario
- Source: `.raw/articles/europe2031-ai-2026-06-23.md`
- Summary: [[europe2031-ai|Europe 2031]]
- Pages created: [[europe2031-ai]], [[ARQ Foundation]], [[ASML]], [[ヨーロッパのAI主権]], [[コンピュート格差]]
- Pages updated: [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: Systematizes, through a 2025–2031 scenario, the paradox that digital-sovereignty regulation, meant to protect sovereignty, instead increases vulnerability by cutting off access to frontier AI.

## [2026-06-23] ingest | Loop Engineering - sairahul1 X thread
- Source: `.raw/articles/sairahul1-loop-engineering-2026-06-23.md`
- Summary: [[sairahul1-Loop-Engineering-2026]]
- Pages created: [[sairahul1-Loop-Engineering-2026]], [[ループエンジニアリング]], [[Boris Cherny]], [[Peter Steinberger]], [[Sai Rahul]]
- Pages updated: [[wiki/index]], [[wiki/log]],
[[wiki/hot]]
- Key insight: [[Peter Steinberger]] (OpenAI) and [[Boris Cherny]] (head of Claude Code at Anthropic) said at the same time, 'don't send prompts, design loops'. The cost problem (500K–2M tokens per fleet loop) is being eased by DeepSeek V4 and others.

## [2026-06-23] ingest-paper | Four foundational LLM papers in one batch (InstructGPT / Chinchilla / Sparsely-Gated MoE / ReAct)
- Source: `.raw/papers/arxiv-2203.02155.pdf`, `.raw/papers/arxiv-2203.15556.pdf`, `.raw/papers/arxiv-1701.06538.pdf`, `.raw/papers/arxiv-2210.03629.pdf`
- Summary: [[@2022__NeurIPS__Training language models to follow instructions with human feedback]], [[@2022__arXiv__Training Compute-Optimal Large Language Models]], [[@2017__ICLR__Outrageously Large Neural Networks The Sparsely-Gated Mixture-of-Experts Layer]], [[@2023__ICLR__ReAct Synergizing Reasoning and Acting in Language Models]]
- Pages created: [[@2022__NeurIPS__Training language models to follow instructions with human feedback]], [[@2022__arXiv__Training Compute-Optimal Large Language Models]], [[@2017__ICLR__Outrageously Large Neural Networks The Sparsely-Gated Mixture-of-Experts Layer]], [[@2023__ICLR__ReAct Synergizing Reasoning and Acting in Language Models]], [[Long Ouyang]], [[Jordan Hoffmann]], [[DeepMind]], [[Azalia Mirhoseini]], [[Geoffrey Hinton]], [[Jeffrey Dean]], [[Quoc V. Le]], [[Shunyu Yao]], [[Karthik Narasimhan]], [[Princeton University]], [[Jared Kaplan]], [[人間フィードバックからの強化学習]], [[指示チューニング]], [[アライメント]], [[計算最適訓練]], [[条件付き計算]], [[負荷分散]], [[ReAct]]
- Pages updated: [[Mixture-of-Experts]], [[Chain-of-Thought Prompting]], [[スケーリング則]], [[Google Brain]], [[OpenAI]], [[Noam Shazeer]]
- Key insight: Reading across the four foundational LLM papers of 2017–2023: (1) RLHF and CoT developed independently, but their combination has become the de facto standard; (2) Chinchilla's scaling law overturned Kaplan et al.
2020 and showed that 'data should be scaled in equal proportion with the model'; (3) the MoE design principles of Shazeer 2017 (top-k gating, auxiliary loss, bandwidth bottleneck) are directly inherited by DeepSeek-V3 and others in 2024–2026; (4) ReAct adds the grounding of tool use to CoT's reasoning ability and establishes the prototype of agentic AI.

## [2026-06-22] ingest | The Big LLM Architecture Comparison (Sebastian Raschka)
- Source: `.raw/articles/the-big-llm-architecture-comparison-2026-06-22.md` (https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison)
- Summary: [[The-Big-LLM-Architecture-Comparison|The Big LLM Architecture Comparison]]
- Pages created: [[The-Big-LLM-Architecture-Comparison|The Big LLM Architecture Comparison]] / [[Multi-Head Latent Attention]] / [[Grouped-Query Attention]] / [[スライディングウィンドウアテンション]] / [[NoPE]] / [[QK-Norm]] / [[Gated DeltaNet]] / [[Sebastian Raschka]] / [[Gemma 3]] / [[Gemma 4]] / [[Qwen3]] / [[Qwen3-Next]] / [[GPT-OSS]] / [[SmolLM3]] / [[Mistral 3]] / [[Kimi Linear]] / [[Arcee AI Trinity Large]] / [[Xiaomi MiMo-V2-Flash]] / [[OLMo 2]]
- Pages updated: [[Mixture-of-Experts]] (shared-expert debate, spread of SWA+MoE, coarse-grained vs fine-grained design split) / [[マルチトークン予測]] (evolution of inference-time MTP use in Qwen3-Next and Nemotron 3 Super) / [[線形注意]] (resurgence in late 2025, MiniMax-M2's retreat, Kimi Linear's rebuttal)
- Key insight: Although performance comparisons of MLA vs GQA show MLA ahead (DeepSeek-V2 ablations), GQA remains the majority choice because of implementation complexity. Whether to adopt shared experts and whether linear attention is effective are both important open questions on which design decisions currently diverge.

## [2026-06-21] ingest | Datadog Bits AI SRE GA Announcement (Kai Xin Tai, 2025-06-10)
- Source: `.raw/articles/bits-ai-sre-2026-06-21.md` (https://www.datadoghq.com/blog/bits-ai-sre/)
- Summary: [[@2025__Datadog__Introducing Bits AI SRE]]
- Pages created: [[@2025__Datadog__Introducing Bits AI SRE]]
- Pages updated: [[Bits AI SRE]] (post-GA expansion, Bits AI Dev Agent preview), [[agentic SRE]] (expansion from investigation to remediation, proactive triggers, findings on memory across investigations), [[Datadog]] (new source added)
- Key insight: Bits AI Dev Agent (generating PRs with code fixes) is the first public industrial implementation in which agentic SRE expands from 'read-centric' to 'with write permissions'. Expanding triggers to Watchdog stories and synthetic tests points toward proactive investigation.

## [2026-06-21] wiki-refactor | concepts health check and duplicate consolidation
- Summary: Ran a health check across all of `wiki/concepts`; merged 動的インストルメンテーション into [[動的計装]]
and merged サービス依存性発見 and ネットワークサービス依存性発見 into [[ネットワーク依存性発見]]. The old titles are preserved as aliases, and old links on the concept/index side were updated to the canonical names.
- Pages updated: [[動的計装]], [[ネットワーク依存性発見]], [[Fault Localization]], [[concepts/_index]], [[index]]
- Pages deleted: `wiki/concepts/動的インストルメンテーション.md`, `wiki/concepts/サービス依存性発見.md`, `wiki/concepts/ネットワークサービス依存性発見.md`
- Compression: Compressed [[Fault Localization]] from 170 lines to 100 and, keeping it as the parent concept, branched the detailed points out to [[根本原因分析]], [[RCA評価設計]], [[ログ解析]], [[LLM学習モニタリング]], and [[RDMAネットワーク監視]].
- Verification: concept files 349→346, files over 100 lines 44→41, missing required headings 36→35, empty `sources` 39→38, `lint-stub` 0, targeted old links 0, unresolved wikilinks in the 3 targeted concepts 0. Remaining debt: the existing index diff 55/26 and the 35 existing missing required headings are separate work.

## [2026-06-21] ingest-paper | LLM-Enhanced Failure Localization in Microservices (LocaleXpert)
- Source: `.raw/papers/LLM-Enhanced_Failure_Localization_in_Microservices_Integrating_Multi-Modal_Data_and_Expert_Interpretation.pdf`
- Summary: [[@2026__TSC__LLM-Enhanced Failure Localization in Microservices - Integrating Multi-Modal Data and Expert Interpretation]]
- Pages created: [[@2026__TSC__LLM-Enhanced Failure Localization in Microservices - Integrating Multi-Modal Data and Expert Interpretation]], [[Zhouruixing Zhu]]
- Pages updated: [[根本原因分析]], [[マルチモーダル障害診断]], [[LLMによる根本原因分析]]
- Key insight: Combining an LLM with a statistical failure-localization module outperforms the LLM alone in both accuracy and interpretability.

## [2026-06-21] ingest-paper | Time Series as Language (UniTok)
- Source: `.raw/papers/arxiv-2606.09861.pdf`
- Summary: [[@2026__arXiv__Time Series as Language - A Universal Tokenizer for General-Purpose Time Series Foundation Models]]
- Pages created: [[@2026__arXiv__Time Series as Language - A Universal Tokenizer for General-Purpose Time Series Foundation Models]], [[Yunhao Zhang]], [[Ruiying Qi]], [[Jiale Zheng]], [[Jianfeng Zhang (Huawei)]], [[Lujia Pan]], [[Junchi Yan]], [[Huawei Noah's Ark Lab]]
- Pages updated: [[時系列基盤モデル]]
- Key insight: Through VQ-VAE-based discrete tokenization, realizes a general-purpose TSFM that handles forecasting, generation, and classification uniformly via NTP.

## [2026-06-21] ingest-paper | MRCA: Metric-level Root Cause Analysis
- Source:
`.raw/papers/2026_Unknown_MRCA_Metric_level_Root_Cause.pdf`
- Summary: [[@2024__ASE__MRCA - Metric-level Root Cause Analysis for Microservices via Multi-Modal Data]]
- Pages created: [[@2024__ASE__MRCA - Metric-level Root Cause Analysis for Microservices via Multi-Modal Data]], [[Yidan Wang]]
- Pages updated: [[根本原因分析]], [[マルチモーダル障害診断]]
- Key insight: Drilling root causes down to the metric level, rather than stopping at the service level, makes the results more actionable for operators.

## [2026-06-21] ingest-paper | Holistic Root Cause Analysis for Cloud-Native Systems
- Source: `.raw/papers/Holistic_Root_Cause_Analysis_for_Failures_in_Cloud-Native_Systems_Through_Observability_Data.pdf`
- Summary: [[@2024__TSC__Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data]]
- Pages created: [[@2024__TSC__Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data]], [[Yongqi Han]], [[Qingfeng Du]], [[Tongji University]], [[Di-Matrix]]
- Pages updated: [[根本原因分析]], [[マルチモーダル障害診断]]
- Key insight: Integrates the three modalities of metrics, logs, and traces on a causal graph, overcoming the limits of single-modality methods.

## [2026-06-21] ingest-paper | Medicine: Multimodal Adaptive Optimization for Failure Diagnosis
- Source: `.raw/papers/ASE_24_Medicine.pdf`
- Summary: [[@2024__ASE__Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization]]
- Pages created: [[@2024__ASE__Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization]], [[Lei Tao]], [[Zhengdan Li]]
- Pages updated: [[根本原因分析]], [[マルチモーダル障害診断]], [[Minghua Ma]], [[Shenglin Zhang]], [[Dan Pei]]
- Key insight: Applies adaptive weighting to each modality, solving the problem of less informative modalities being drowned out by the dominant one.

## [2026-06-21] ingest-paper | ChangeLLM: Multimodal Change Assessment
- Source: `.raw/papers/ChangeLLM.pdf`
- Summary: [[@2025__FSE__A Multimodal Intelligent Change Assessment Framework for Microservice Systems Based on Large Language Models]]
- Pages created: [[@2025__FSE__A Multimodal Intelligent Change Assessment Framework for Microservice Systems Based on Large
Language Models]], [[Yuchi Ma]], [[Qiuai Fu]]
- Pages updated: [[変更起因インシデント]], [[LLMによる根本原因分析]], [[Chetan Bansal]], [[Pinjia He]]
- Key insight: A change-impact assessment pipeline that uses RAG to retrieve past change cases and mitigate the LLM's cold-start problem.

## [2026-06-21] ingest-paper | Interpretable Failure Localization (DeepHunt)
- Source: `.raw/papers/2026_Unknown_Interpretable_Failure_Localization_Microservice_Systems.pdf`
- Summary: [[@2025__TOSEM__Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder]]
- Pages created: [[@2025__TOSEM__Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder]]
- Pages updated: [[根本原因分析]], [[Nankai University]]
- Key insight: Derives anomaly scores from the reconstruction error of a graph autoencoder and provides interpretable attribution via an attention mechanism.

## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-17–2026-06-18)
- Location: wiki/folds/fold-k4-from-2026-06-17-to-2026-06-18-n16.md
- Range: 2026-06-17 to 2026-06-18
- Children: 16 log entries

## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-16–2026-06-17)
- Location: wiki/folds/fold-k4-from-2026-06-16-to-2026-06-17-n16.md
- Range: 2026-06-16 to 2026-06-17
- Children: 16 log entries

## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-16)
- Location: wiki/folds/fold-k4-from-2026-06-16-to-2026-06-16-n16.md
- Range: 2026-06-16 to 2026-06-16
- Children: 16 log entries

## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-15–2026-06-16)
- Location: wiki/folds/fold-k4-from-2026-06-15-to-2026-06-16-n16.md
- Range: 2026-06-15 to 2026-06-16
- Children: 16 log entries

## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-14–2026-06-15)
- Location: wiki/folds/fold-k4-from-2026-06-14-to-2026-06-15-n16.md
- Range: 2026-06-14 to 2026-06-15
- Children: 16 log entries

## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-14 b2)
- Location: wiki/folds/fold-k4-from-2026-06-14-to-2026-06-14-n16-b2.md
- Range: 2026-06-14 to 2026-06-14
- Children: 16 log entries

##
[2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-14) - Location: wiki/folds/fold-k4-from-2026-06-14-to-2026-06-14-n16.md - Range: 2026-06-14 to 2026-06-14 - Children: 16 log entries ## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-08–2026-06-14) - Location: wiki/folds/fold-k4-from-2026-06-08-to-2026-06-14-n16.md - Range: 2026-06-08 to 2026-06-14 - Children: 16 log entries ## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-06–2026-06-08) - Location: wiki/folds/fold-k4-from-2026-06-06-to-2026-06-08-n16.md - Range: 2026-06-06 to 2026-06-08 - Children: 16 log entries ## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-06) - Location: wiki/folds/fold-k4-from-2026-06-06-to-2026-06-06-n16.md - Range: 2026-06-06 to 2026-06-06 - Children: 16 log entries ## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-05–2026-06-06) - Location: wiki/folds/fold-k4-from-2026-06-05-to-2026-06-06-n16.md - Range: 2026-06-05 to 2026-06-06 - Children: 16 log entries ## [2026-06-21] fold | batch-exponent-k4 rollup of 16 entries (2026-06-05 b2) - Location: wiki/folds/fold-k4-from-2026-06-05-to-2026-06-05-n16-b2.md - Range: 2026-06-05 to 2026-06-05 - Children: 16 log entries ## [2026-06-20] ingest-paper | TraceRank, LogCluster, LogKG, FSF, Nezha, Eadro (6 papers batch) - Source: `.raw/papers/Yu-et-al.-2021---TraceRank-*.pdf`, `.raw/papers/Lin-et-al.-2016---Log-clustering-*.pdf`, `.raw/papers/LogKG.pdf`, `.raw/papers/CLOUD22.pdf`, `.raw/papers/Nezha-*.pdf`, `.raw/papers/arxiv-2302.05092.pdf` - Summary: [[@2021__JSEP__TraceRank - Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems]], [[@2016__ICSE-C__Log Clustering Based Problem Identification for Online Service Systems]], [[@2023__TSC__LogKG - Log Failure Diagnosis through Knowledge Graph]], [[@2022__IEEE CLOUD__Localizing and Explaining Faults in Microservices Using Distributed Tracing]], 
[[@2023__ESEC-FSE__Nezha - Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data]], [[@2023__arXiv__Eadro - An End-to-End Troubleshooting Framework for Microservices on Multi-source Data]]
- Pages created: 6 source pages, 12 entity pages, 2 concept pages ([[ログクラスタリング]], [[知識グラフ]])
- Pages updated: [[根本原因分析]], [[Fault Localization]], [[分散トレーシング]], [[ログ解析]], [[マルチモーダル障害診断]], [[異常検知]], [[The Chinese University of Hong Kong]], [[Saurabh Jha]]
- Key insight: Across the six papers (2016–2023), a clear axis of increasing resolution emerges: service-level localization (TraceRank) → code-region-level root cause identification (Nezha) → unified detection and localization (Eadro). On the log side, structuring progresses from "compress, then match against knowledge" (LogCluster) to knowledge-graph reasoning (LogKG).

## [2026-06-20] ingest-paper | Energy statistics: A class of statistics based on distances
- Source: `.raw/papers/Szkely-and-Rizzo-2013---Energy-statistics---A-class-of-statistics-based-on-distances.pdf`
- Summary: [[@2013__JSPI__Energy statistics - A class of statistics based on distances]]
- Pages created: [[@2013__JSPI__Energy statistics - A class of statistics based on distances]], [[エネルギー統計]], [[距離相関]], [[Gábor J. Székely]], [[Maria L. Rizzo]]
- Pages updated: [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: Energy distance is equivalent to a weighted L2 distance between characteristic functions and is uniquely determined by the axioms of rotation invariance and scale equivariance. Distance covariance (dCov) is zero if and only if the variables are independent, and it coincides with Brownian covariance in arbitrary dimension.

## [2026-06-20] fold | batch-exponent-k4 rollup of 16 entries (2026-06-05)
- Location: wiki/folds/fold-k4-from-2026-06-05-to-2026-06-05-n16.md
- Range: 2026-06-05 to 2026-06-05
- Children: 16 log entries

## [2026-06-20] ingest-paper | Odin: Microsoft's Scalable Fault-Tolerant CDN Measurement System
- Source: `.raw/papers/nsdi18-calder.pdf`
- Summary: [[@2018__NSDI__Odin - Microsoft's Scalable Fault-Tolerant CDN Measurement System]]
- Pages created: [[@2018__NSDI__Odin - Microsoft's Scalable Fault-Tolerant CDN Measurement System]], [[Matt Calder]], [[Ethan Katz-Bassett]], [[Ganesh Ananthanarayanan]], [[Ratul Mahajan]], [[Columbia University]], [[Intentionet]], [[CDN計測システム]], [[エニキャストルーティング]]
- Pages updated: [[Microsoft]], [[Jitendra Padhye]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: By embedding measurement in its own applications, a first-party CDN can reach user coverage far beyond that of third-party measurement platforms (98% of ASes, 85% of /24s). Anycast sends 20% of requests to a front end (FE) that is at least 25 ms worse than the optimal one; a hybrid CDN control scheme that detects this in real time from Odin's measurements and corrects it with unicast patches is effective in production.

## [2026-06-20] fold | batch-exponent-k4 rollup of 16 entries (2026-06-04–2026-06-05)
- Location: wiki/folds/fold-k4-from-2026-06-04-to-2026-06-05-n16.md
- Range: 2026-06-04 to 2026-06-05
- Children: 16 log entries

## [2026-06-20] ingest-paper | Practitioners' Expectations on Automated Fault Localization
- Source: `.raw/papers/issta16.pdf` (MD5: d322f7635d8a61a0a7dafe2db9bc3a18)
- Summary: [[@2016__ISSTA__Practitioners' Expectations on Automated Fault Localization]]
- Pages created: [[@2016__ISSTA__Practitioners' Expectations on Automated Fault Localization]], [[Xin Xia]], [[Pavneet Singh Kochhar]], [[Shanping Li]], [[Singapore Management University]]
- Pages updated: [[David Lo]], [[Fault Localization]]
- Key insight: A 2016 survey of 386 practitioners quantified the "barrier" to FL adoption: none of the 15 papers from 2011–2015 simultaneously met all five conditions (Top-5 ranking, 75% success rate, 100 kLOC scale, results within 1 minute, and a rationale for the verdict). This "quantification of the research-practice gap" is a precedent in SE FL for RCA evaluation design.

## [2026-06-20] ingest-paper | Five foundational distributed tracing papers (batch)
- Sources: Pinpoint (DSN 2002), Magpie (HotOS IX 2003), lprof (OSDI 2014), Pivot Tracing (SOSP 2015), Canopy (SOSP 2017)
- Pages created: [[@2002__DSN__Pinpoint - Problem Determination in Large, Dynamic Internet Services]], [[@2003__HotOS__Magpie - Online Modelling and Performance-aware Systems]], [[@2014__OSDI__lprof - A Non-intrusive Request Flow Profiler for Distributed Systems]], [[@2015__SOSP__Pivot Tracing - Dynamic Causal Monitoring for Distributed Systems]], [[@2017__SOSP__Canopy - An End-to-End Performance Tracing And Analysis System]], [[Pinpoint]], [[Magpie]], [[lprof]], [[Pivot Tracing]], [[Canopy]], [[Mike Y. Chen]], [[Emre Kıcıman]], [[Armando Fox]], [[Eric Brewer]], [[Paul Barham]], [[Rebecca Isaacs]], [[Richard Mortier]], [[Xu Zhao]], [[Ding Yuan]], [[Michael Stumm]], [[Jonathan Mace]], [[Ryan Roelke]], [[Rodrigo Fonseca]], [[Jonathan Kaldor]], [[Scuba]], [[動的計装]], [[リクエストモデリング]], [[非侵入プロファイリング]]
- Pages updated: [[分散トレーシング]], [[根本原因分析]], [[Stanford University]], [[Microsoft Research]], [[University of Toronto]], [[Brown University]], [[Facebook]], [[Meta]]
- Key insight: Consolidates the 2002–2017 lineage of distributed tracing across the five papers. From statistical correlation (Pinpoint) → event-based request extraction (Magpie) → non-intrusive log reconstruction (lprof) → dynamic causal instrumentation (Pivot Tracing) → hyperscale production operation (Canopy), both the intrusiveness of instrumentation and the causal depth of analysis deepen step by step.

## [2026-06-20] fold | batch-exponent-k4 rollup of 16 entries (2026-06-03–2026-06-04)
- Location: wiki/folds/fold-k4-from-2026-06-03-to-2026-06-04-n16.md
- Range: 2026-06-03 to 2026-06-04
- Children: 16 log entries

## [2026-06-20] fold | batch-exponent-k4 rollup of 16 entries (2026-06-02–2026-06-03)
- Location: wiki/folds/fold-k4-from-2026-06-02-to-2026-06-03-n16.md
- Range: 2026-06-02 to 2026-06-03
- Children: 16 log entries

## [2026-06-20] ingest-paper | BARO - Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
- Source: `.raw/papers/arxiv-2405.09330.pdf` (MD5: 61a2035a2d1d87009e0dd860b1f8e232)
- Summary: [[@2024__FSE__BARO - Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection]]
- Pages created: [[@2024__FSE__BARO - Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection]]
- Pages updated: [[Luan Pham]], [[Huong Ha]], [[Hongyu Zhang]], [[変化点検知]], [[根本原因分析]], [[因果推論ベースRCA]]
- Key insight: BARO, which needs no causal graph, outperforms every causal-graph method on three benchmarks because its design (median/IQR) is insensitive to errors in the anomaly detection time. This makes explicit a new design requirement alongside the edge-direction estimation bottleneck.

## [2026-06-20] ingest-paper | Localizing Failure Root Causes in a Microservice through Causality Inference
- Source: `.raw/papers/paper-IWQOS2020-MicroCause.pdf`
- Summary: [[@2020__IWQoS__Localizing Failure Root Causes in a Microservice through Causality Inference]]
- Pages created: [[Yuan Meng]], [[Ruru Zhang]], [[Zhilong Hu]], [[Yiyin Zhang]], [[Chenyang Jia]], [[Zhaogang Wang]], [[@2020__IWQoS__Localizing Failure Root Causes in a Microservice through Causality Inference]]
- Pages updated: [[因果推論ベースRCA]], [[Dan Pei]], [[Shenglin Zhang]], [[Yongqian Sun]]
- Key insight: PCTS (PCMCI-based) resolves the isolated-subgraph problem of PC, and TCORW's partial-correlation design eliminates the confounder misjudgments of correlation-based random walks. Note, however, that on a large-scale benchmark (Pham+ ASE 2024) some evaluations exceeded 2 hours, so the small-scale results may be optimistic.

## [2026-06-20] fold | batch-exponent-k4 rollup of 16 entries (2026-06-18–2026-06-19)
- Location: wiki/folds/fold-k4-from-2026-06-18-to-2026-06-19-n16.md
- Range: 2026-06-18 to 2026-06-19
- Children: 16 log entries

## [2026-06-20] fold | batch-exponent-k4 rollup of 16 entries
- Location: wiki/folds/fold-k4-from-2026-06-19-to-2026-06-20-n16.md
- Range: 2026-06-19 to 2026-06-20
- Children: 16 log entries

## [2026-06-20] ingest-paper | Ten foundational microservice and DB RCA papers (batch)
- Source: `.raw/papers/arxiv-2306.11417.pdf`, `.raw/papers/2013SIGMETRICS13_Root20Cause20Detection20in20a20Service-Oriented20Architecture.pdf`, `.raw/papers/2018-CloudRanger.pdf`, `.raw/papers/liuping-camera-ready.pdf`,
`.raw/papers/2019WWW_CFB5-diagnosis.pdf`, `.raw/papers/FluxInfer.pdf`, `.raw/papers/2020WWW_AutoMap.pdf`, `.raw/papers/Wu-et-al.-2021---MicroDiag---Fine-grained-Performance-Diagnosis-for-Microservice-Systems.pdf`, `.raw/papers/Hu-et-al.-2022---TS-InvarNet---Anomaly-Detection-and-Localization-based-on-Tempo-spatial-KPI-Invariants-in-Distributed-Services.pdf`, `.raw/papers/ZengTLSG14.pdf`
- Summary: Turned the microservice and DB RCA lineage MonitorRank (2013) → CloudRanger (2018) → FluxRank/AutoMAP/ε-Diagnosis (2019-2020) → TS-InvarNet/MicroDiag (2021-2022) → PyRCA (2023) into wiki pages from primary sources
- Pages created: [[@2013__SIGMETRICS__Root Cause Detection in a Service-Oriented Architecture]], [[@2014__CNSM__Mining Temporal Lag from Fluctuating Events for Correlation and Root Cause Analysis]], [[@2018__CCGrid__CloudRanger - Root Cause Identification for Cloud Native Systems]], [[@2019__ISSRE__FluxRank - A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation]], [[@2019__WWW__ε-Diagnosis - Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms]], [[@2020__IPCCC__FluxInfer - Automatic Diagnosis of Performance Anomaly for Online Database System]], [[@2020__WWW__AutoMAP - Diagnose Your Microservice-based Web Applications Automatically]], [[@2021__CloudIntelligence__MicroDiag - Fine-grained Performance Diagnosis for Microservice Systems]], [[@2022__ICWS__TS-InvarNet - Anomaly Detection and Localization based on Tempo-spatial KPI Invariants in Distributed Services]], [[@2023__arXiv__PyRCA - A Library for Metric-based Root Cause Analysis]], + 30+ entity pages
- Pages updated: [[Dan Pei]], [[Pengfei Chen]], [[Shenglin Zhang]], [[Guangba Yu]], [[Sun Yat-sen University]], [[Stanford University]], [[Peking University]], [[Minghua Ma]]
- Key insight: The "personalized random walk on a call graph" proposed by MonitorRank (2013) was reused by CloudRanger, AutoMAP, MicroRCA, and others, becoming the standard pipeline for pre-LLM-era RCA. FluxInfer abandons direction estimation, switches to an undirected graph plus PageRank, and outperforms eight PC-family methods. PyRCA consolidates these methods into a unified library.

## [2026-06-20] ingest-paper | A Tutorial on Kernel Density Estimation and Recent Advances
- Source: `.raw/papers/arxiv-1704.03924.pdf`
- Summary: [[@2017__arXiv__A Tutorial on Kernel Density Estimation and Recent Advances]]
- Pages created: [[@2017__arXiv__A Tutorial on Kernel Density Estimation and Recent Advances]], [[Yen-Chi Chen]], [[カーネル密度推定]]
- Pages updated: [[University of Washington]], [[密度ベースクラスタリング]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: Bandwidth selection is the single key parameter governing KDE estimation accuracy, and the tutorial organizes, on theoretical grounds, three strategies for handling bias when constructing confidence bands (undersmoothing, bias correction, or ignoring the bias). Its extension to estimating geometric and topological features of the density complements the theoretical foundations of density-based clustering.

## [2026-06-20] ingest-paper | DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
- Source: `.raw/papers/shimizu11a.pdf`
- Summary: [[@2011__JMLR__DirectLiNGAM - A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model]]
- Pages created: [[@2011__JMLR__DirectLiNGAM - A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model]], [[Shohei Shimizu]], [[Aapo Hyvärinen]], [[Kenneth Bollen]], [[Osaka University]], [[University of Helsinki]]
- Pages updated: [[因果発見]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: Removes ICA-LiNGAM's dependence on iterative search (sensitivity to initial values, no convergence guarantee) by identifying exogenous variables one at a time, which guarantees convergence in a fixed number of steps. However, the misestimation caused by latent confounders on the sociology data concretely illustrates the practical impact of violating a causal discovery method's model assumptions.

## [2026-06-20] ingest-paper | A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
- Source: `.raw/papers/KDD96-037.pdf`
- Summary: [[@1996__KDD__A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise]]
- Pages created: [[@1996__KDD__A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise]], [[Martin Ester]], [[Hans-Peter Kriegel]], [[Jörg Sander]]
- Pages updated: [[密度ベースクラスタリング]],
[[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: A formal, density-based definition of clusters (core points, density-reachability, density-connectivity) makes it possible to discover arbitrarily shaped clusters without specifying the number of clusters in advance.

## [2026-06-20] ingest-paper | Density-Based Clustering Based on Hierarchical Density Estimates
- Source: `.raw/papers/Campello-et-al.-2013---Density-Based-Clustering-Based-on-Hierarchical-Density-Estimates.pdf`
- Summary: [[@2013__PAKDD__Density-Based Clustering Based on Hierarchical Density Estimates]]
- Pages created: [[@2013__PAKDD__Density-Based Clustering Based on Hierarchical Density Estimates]], [[Ricardo J.G.B. Campello]], [[Davoud Moulavi]], [[クラスタ安定性]]
- Pages updated: [[密度ベースクラスタリング]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: Resolves DBSCAN's global density threshold constraint through a hierarchy and a stability measure, formulates the extraction of clusters with differing densities as an optimization problem, and gives a globally optimal solution.

## [2026-06-20] ingest-paper | k-Shape: Efficient and Accurate Clustering of Time Series
- Source: `.raw/papers/18_kShape_RH_Paparrizos.pdf`
- Summary: [[@2016__SIGMOD Record__k-Shape - Efficient and Accurate Clustering of Time Series]]
- Pages created: [[@2016__SIGMOD Record__k-Shape - Efficient and Accurate Clustering of Time Series]], [[Luis Gravano]]
- Pages updated: [[John Paparrizos]], [[時系列クラスタリング]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: SBD, which is based on normalized cross-correlation, matches the accuracy of cDTW while being more than an order of magnitude faster, and the choice of clustering method matters as much as the choice of distance measure.

## [2026-06-19] ingest-paper | Selective review of offline change point detection methods
- Source: `.raw/papers/arxiv-1801.00718.pdf`
- Summary: [[@2020__Signal Processing__Selective review of offline change point detection methods]]
- Pages created: [[@2020__Signal Processing__Selective review of offline change point detection methods]], [[Charles Truong]], [[Laurent Oudre]], [[Nicolas Vayatis]], [[ruptures]]
- Pages updated: [[変化点検知]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[index]],
[[hot]], [[log]]
- Key insight: A survey that classifies offline change point detection along three unified axes: 13 cost functions, 5 search methods, and constraints. AIOps practice has converged on $c_{L_2}$ + Pelt, but a comparison within AIOps of the many alternative cost functions, such as kernel methods and rank statistics, remains unexplored.

## [2026-06-19] ingest-paper | Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods
- Source: `.raw/papers/p4380-paparrizos.pdf`
- Summary: [[@2025__PVLDB__Time-Series Clustering - A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods]]
- Pages created: [[@2025__PVLDB__Time-Series Clustering - A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods]], [[John Paparrizos]], [[UCR Time Series Archive]], [[時系列クラスタリング]]
- Pages updated: [[The Ohio State University]], [[時系列基盤モデル]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: A comprehensive evaluation of 84 methods on 128 datasets demonstrates that no time-series clustering method outperforms the decade-old k-Shape with statistical significance. Every method, including deep learning and foundation models (CHRONOS, OFA, MOMENT), is at best on par with k-Shape, and the conclusions of earlier benchmarks were an "illusion of progress" caused by buggy implementations and unfair settings.

## [2026-06-19] ingest | The Software Development Lifecycle Is Dead - Boris Tane
- Source: `.raw/articles/the-software-development-lifecycle-is-dead-2026-06-19.md`
- Summary: [[@2026__Boris Tane Blog__The Software Development Lifecycle Is Dead]]
- Pages created: [[@2026__Boris Tane Blog__The Software Development Lifecycle Is Dead]], [[Boris Tane]], [[コンテキストエンジニアリング]], [[AIネイティブ開発]]
- Pages updated: [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: AI agents did not accelerate the SDLC; they dismantled it. Monitoring (observability) is the only phase that survives, and context engineering becomes the new differentiator.

## [2026-06-19] ingest-paper | D'ya like DAGs? A Survey on Structure Learning and Causal Discovery
- Source: `.raw/papers/arxiv-2103.02582.pdf`
- Summary: [[@2022__CSUR__D'ya Like DAGs - A Survey on Structure Learning and Causal Discovery]]
- Pages created: [[@2022__CSUR__D'ya Like DAGs - A Survey on Structure Learning and Causal Discovery]], [[Matthew J. Vowels]], [[Necati Cihan Camgoz]], [[Richard Bowden]]
- Pages updated: [[University of Surrey]], [[因果発見]], [[因果推論ベースRCA]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[index]], [[hot]]
- Key insight: By systematizing the continuous-optimization paradigm (NOTEARS and later), the survey makes clear that the DAG-GNN/NOTEARS family used in RCA has been evaluated only on fewer than 100 variables. Contrasting it with Glymour 2019 surfaces two scalability bottlenecks in causal discovery: the super-exponential search space versus O(d³) matrix operations.

## [2026-06-19] ingest-paper | Review of Causal Discovery Methods Based on Graphical Models
- Source: `.raw/papers/Glymour-et-al-2019-Review-of-Causal-Discovery-Methods.pdf`
- Summary: [[@2019__Frontiers in Genetics__Review of Causal Discovery Methods Based on Graphical Models]]
- Pages created: [[@2019__Frontiers in Genetics__Review of Causal Discovery Methods Based on Graphical Models]], [[Clark Glymour]], [[Kun Zhang]], [[Peter Spirtes]], [[因果発見]]
- Pages updated: [[Carnegie Mellon University]], [[因果推論ベースRCA]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]]
- Key insight: The three families of causal discovery (constraint-based, score-based, and FCM-based) can be positioned along a "scalability vs. identifiability" trade-off. The review provides a basis for systematically understanding the theoretical assumptions and limits of PC, LiNGAM, and Granger as used in RCA, in particular the risk that preprocessing destroys non-Gaussianity.

## [2026-06-19] ingest-paper | Signal propagation in complex networks
- Source: `.raw/papers/Signal-propagation-in-complex-networks.pdf`
- Summary: [[@2023__Physics Reports__Signal propagation in complex networks]]
- Pages created: [[@2023__Physics Reports__Signal propagation in complex networks]], [[Peng Ji]], [[Jürgen Kurths]], [[Matjaž Perc]], [[University of Maribor]], [[複雑ネットワーク]], [[信号伝播]]
- Pages updated: [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]]
- Key insight:
The geometry of signal propagation is determined by both topology and nonlinear interactions. Static approximations of time-varying networks cannot accurately reflect the true propagation patterns.

## [2026-06-19] ingest-paper | Anomaly Detection: A Survey
- Source: `.raw/papers/Chandola-et-al.-2009---Anomaly-detection---A-survey.pdf`
- Summary: [[@2009__CSUR__Anomaly Detection - A Survey]]
- Pages created: [[@2009__CSUR__Anomaly Detection - A Survey]], [[Varun Chandola]], [[Arindam Banerjee]], [[Vipin Kumar]], [[University of Minnesota]]
- Pages updated: [[異常検知]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: Chandola 2009 provides a foundational taxonomy that compares point, contextual, and collective anomalies and six families of techniques in terms of their "assumptions". The practical anomalies and context-dependent false positives of modern AIOps can be read as extensions of this classical definition of contextual anomalies to the operations domain.

## [2026-06-19] ingest | System@Scale: AI Observability
- Source: `.raw/articles/systemscale-ai-observability-2026-06-19.md`
- Summary: [[@2023__SystemAtScale__AI Observability]]
- Pages created: [[@2023__SystemAtScale__AI Observability]] / [[Valentin Andrei]] / [[Dynolog]] / [[LibAsicMon]] / [[Kineto]]
- Pages updated: [[GPU観測性]] / [[Meta]]
- Key insight: Meta's four-layer AI observability stack (Dynolog → Kineto/Gpusnoop → analysis platform → fleet dashboards) offers the perspective of "stacking multiple tools systematically", in contrast to the research community's focus on "lowering the overhead of a single profiler". Evaluating the whole fleet with two metrics, FLOPs/sec and rDevice hour/Byte, is also distinctive.

## [2026-06-19] enrich | Updated the Inami trilogy reflection with Karpathy's "LLM Wiki"
- Source: [[@2026__GitHub Gist__LLM Wiki]] (Andrej Karpathy, 2026-04-04)
- Pages created: [[@2026__GitHub Gist__LLM Wiki]], [[Andrej Karpathy]], [[Vannevar Bush]], [[LLM Wikiパターン]]
- Pages updated: [[Human-out-of-the-loop]], [[サイバネティクス]], [[個人的知識蓄積の意味-稲見3部作から]]
- Key insight: To the same question for which Inami found meaning in "attunement through writing", Karpathy answers from the opposite premise: "bookkeeping is a bottleneck that the LLM should take over". This raises a new question: "Is the medium of attunement writing, or choosing?" The Bush (1945)–Wiener (1948)–Karpathy (2026) lineage was also confirmed.

## [2026-06-19] wiki-query | The meaning of personal knowledge accumulation: reflections from the Inami trilogy
- Query: "Going forward, is there any meaning in an individual accumulating, as in an LLM Wiki, knowledge discovered by the world and by AI, and in humans coming to understand it?"
- Sources consulted: [[@2026__note.com__科学の終焉と、新しい科学の始まり]] / [[@2026__note.com__Out of the Blue]] / [[@2026__note.com__ループのボトルネックは、人間だ]] + 6 related concept pages
- Filed as:
[[個人的知識蓄積の意味-稲見3部作から]]

## [2026-06-19] ingest-slides | AI/ML基盤における800GbEスイッチ導入とその挑戦
- Source: `.raw/slides/janog56-800gbe-cycloud/janog56-800gbe-cycloud.pdf`
- Visual pages: `.raw/slides/janog56-800gbe-cycloud/pages/` (47 pages)
- Media: none (no transcript)
- Summary: [[@2025__JANOG56__AI ML基盤における800GbEスイッチ導入とその挑戦]]
- Pages created: [[サイバーエージェント]], [[CIU]], [[小障子尚太朗]], [[疋田紅樹]], [[Juniper QFX5240]], [[Rail-Optimizedトポロジ]], [[マルチベンダーLosslessネットワーク]]
- Pages updated: [[集合通信]] (added the NCCL_CROSS_NIC=0 finding), [[データセンター輻輳制御]] (added the Ingress hashing + DLB finding), [[GPUクラスタ運用]] (added the wiring mismatch problem with heterogeneous servers)
- Key insight: In a mixed Broadcom (QFX5240) + Mellanox (SN4700) fabric, simply toggling AR/DLB on or off does not resolve congestion; combining ingress interface hashing on the spines with DLB on the leaves achieves almost twice the default bandwidth. Confining ring paths within a leaf with NCCL_CROSS_NIC=0 is also effective.

## [2026-06-19] batch-ingest | Masahiko Inami's essay trilogy "Science, AI, and the Loop" (note.com)
- Source: `.raw/articles/kagaku-no-shuuen-part1-2026-06-19.md` / `.raw/articles/out-of-the-blue-part2-2026-06-19.md` / `.raw/articles/loop-bottleneck-part3-2026-06-19.md`
- Summary: [[@2026__note.com__科学の終焉と、新しい科学の始まり]] / [[@2026__note.com__Out of the Blue]] / [[@2026__note.com__ループのボトルネックは、人間だ]]
- Pages created (source × 3): [[@2026__note.com__科学の終焉と、新しい科学の始まり]], [[@2026__note.com__Out of the Blue]], [[@2026__note.com__ループのボトルネックは、人間だ]]
- Pages created (entity × 10): [[稲見昌彦]], [[東京大学先端科学技術研究センター]], [[ノーバート・ウィーナー]], [[マックス・テグマーク]], [[舘暲]], [[ゴットフリート・ライプニッツ]], [[ジェンスン・フアン]], [[ティモシー・リアリー]], [[ヘレン・ケラー]], [[VPL社]]
- Pages created (concept × 13): [[Human-out-of-the-loop]], [[サイバネティクス]], [[アロスタシス]], [[inside the loops]], [[See-through]], [[Feel-through]], [[光学迷彩]], [[拡張現実感]], [[調律]], [[バイブコーディング]], [[テレイグジスタンス]], [[情報顕微鏡]], [[モナド論]]
- Key insight: The theme running through all three parts, "Where do humans go once they are out of the loop?", unfolds from Part 1 (theory) → Part 2 (sensory augmentation) → Part 3 (practice + philosophy). Cybernetics and Human-out-of-the-loop underlie the entire series as its theoretical foundation.

## [2026-06-19] ingest-slides | Latency SLOs Done Right
- Source: `.raw/slides/srecon19emea-latency-slos-done-right/srecon19emea-latency-slos-done-right.pdf`
- Visual pages:
`.raw/slides/srecon19emea-latency-slos-done-right/pages/`
- Media: none
- Summary: [[@2019__SREcon19 EMEA__Latency SLOs Done Right]]
- Pages created: [[@2019__SREcon19 EMEA__Latency SLOs Done Right]], [[Heinrich Hartmann]], [[Circonus]], [[ヒストグラムメトリクス]]
- Pages updated: [[サービスレベル目標]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: A latency SLO is not a matter of monitoring percentile time series; it is a matter of counting, over the set of all events in the period, the fraction of requests within the threshold. Logs, counters, and histograms are the implementation paths.

## [2026-06-19] ingest-paper | Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis
- Source: `.raw/papers/arxiv-2407.01710.pdf`
- Summary: [[@2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- Pages created: [[@2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- Pages updated: [[Shenglin Zhang]], [[マルチモーダル障害診断]], [[根本原因分析]], [[sources/_index]], [[index]], [[hot]]
- Key insight: Systematizes, at the scale of 98 papers, the multimodal evolution from result fusion → model fusion → feature fusion and the classical pipeline of the PC algorithm plus random walk. Integrating LLMs with knowledge graphs is identified as the next important direction.

## [2026-06-19] ingest-paper | Graph-based Incident Aggregation for Large-Scale Online Service Systems
- Source: `.raw/papers/arxiv-2108.12179.pdf`
- Summary: [[@2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]]
- Pages created: [[@2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]], [[GRLIA]], [[OpsPAI]], [[Xuemin Wen]], [[Xiao Ling]]
- Pages updated: [[アラート集約]], [[インシデント管理]], [[サービス依存グラフ]], [[グレイ障害]], [[Zhuangbin Chen]], [[Jinyang Liu]], [[Yuxin Su]], [[Hongyu Zhang]], [[Yongqiang Yang]], [[Michael R. Lyu]], [[Huawei Cloud]], [[The Chinese University of Hong Kong]], [[University of Newcastle]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]]
- Key insight: GRLIA not only groups textually dissimilar incidents; it also uses KPI trends to fill in "silent nodes" that never appear in the incident stream because of monitor thresholds or fault tolerance, thereby improving the failure-impact graph itself, which sits upstream of representation learning.

## [2026-06-18] ingest-slides | AIスーパーコンピュータにおけるLLM学習処理性能の計測と可観測性
- Source: `.raw/slides/ai-supercomputer-llm-benchmarking-and-observability/ai-supercomputer-llm-benchmarking-and-observability.pdf`
- Visual pages: `.raw/slides/ai-supercomputer-llm-benchmarking-and-observability/pages/`
- Media: `.raw/slides/ai-supercomputer-llm-benchmarking-and-observability/transcript.md`
- Summary: [[@2025__SpeakerDeck__AIスーパーコンピュータにおけるLLM学習処理性能の計測と可観測性]]
- Pages created: [[@2025__SpeakerDeck__AIスーパーコンピュータにおけるLLM学習処理性能の計測と可観測性]]
- Pages updated: [[Yuuki Tsubouchi]], [[SAKURA Internet]], [[SAKURAONE]], [[R-Pingmesh]], [[GPU観測性]], [[LLM学習モニタリング]], [[RDMAネットワーク監視]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: The difficulty of AI supercomputer observability comes not only from fine-grained GPU/RDMA instrumentation but also from the responsibility boundary that keeps the cloud provider out of user code and application logs. The next challenge is to map OTel + Grafana resource analysis back onto training spans, collective communication, and RoCE paths. The added transcript supplements this: in LLM development, teams choose configurations that finish training reliably rather than only those with peak performance, and on failure, reserved nodes and checkpoint recovery form the realistic operational boundary.

## [2026-06-18] ingest-article | Introducing Contextual Retrieval
- Source: `.raw/articles/contextual-retrieval-2024-09-19.md`
- Summary: [[@2024__Anthropic Engineering Blog__Introducing Contextual Retrieval]]
- Pages created: [[@2024__Anthropic Engineering Blog__Introducing Contextual Retrieval]], [[Daniel Ford]]
- Pages updated: [[文脈付き検索]] (seed → developing; added primary-source figures, BM25, reranking, and cross-cutting findings), [[Anthropic]] (added source, corrected figures), [[sources/_index]], [[entities/_index]], [[index]], [[log]]
- Key insight: Context loss from chunking is the main bottleneck of RAG. BM25 lexical matching gives higher accuracy than embeddings alone. The techniques (Contextual Embeddings + BM25 + reranking) have cumulative effects, for a total 67% reduction. The baseline failure rate is 5.7% (this conflicts with the 5.0% in a secondary source; the primary source is treated as authoritative).

## [2026-06-18] ingest-slides | Scaling KV Caches for LLMs: How LMCache + NIXL Handle Network and Storage Heterogeneity
- Source: `.raw/slides/LMCache20and20NIXL/LMCache20and20NIXL.pdf`
- Visual pages: `.raw/slides/LMCache20and20NIXL/pages/`
- Media: none
- Summary: [[@2025__PyTorchConference__Scaling KV Caches for LLMs - How LMCache + NIXL Handle Network and Storage Heterogeneity]]
- Pages created: [[@2025__PyTorchConference__Scaling KV Caches for LLMs - How LMCache + NIXL Handle Network and Storage Heterogeneity]], [[Moein Khazraee]]
- Pages updated: [[Junchen Jiang]], [[LMCache]], [[NIXL]], [[KVキャッシュ管理]], [[LLM推論]], [[Prefill-Decode分離]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: KV cache optimization has become a design problem that covers not only page/chunk granularity inside the GPU but also the transfer control plane: registering DRAM/VRAM/BLK/FILE/OBJ, exchanging remote metadata, and posting asynchronous Xfer requests.

## [2026-06-18] ingest-paper | FlashAttention (arXiv:2205.14135)
- Source: `.raw/papers/arxiv-2205.14135.pdf`
- Summary: [[@2022__arXiv__FlashAttention - Fast and Memory-Efficient Exact Attention with IO-Awareness]]
- Pages created: [[@2022__arXiv__FlashAttention - Fast and Memory-Efficient Exact Attention with IO-Awareness]] / [[Tri Dao]]
- Pages updated: [[Together AI]] / [[FlashAttention]] / [[カーネルフュージョン]] / [[GPU最適化]] / [[LLM推論]]
- Key insight: Tiling, online softmax, and recomputation eliminate HBM reads and writes of the N×N attention matrix, speeding up exact attention by 2-4× with IO complexity O(N²d²M⁻¹).

## [2026-06-18] ingest-paper | FlashAttention-2 (arXiv:2307.08691)
- Source: `.raw/papers/arxiv-2307.08691.pdf`
- Summary: [[@2023__arXiv__FlashAttention-2 - Faster Attention with Better Parallelism and Work Partitioning]]
- Pages created: [[@2023__arXiv__FlashAttention-2 - Faster Attention with Better Parallelism and Work Partitioning]]
- Pages updated: [[FlashAttention]] / [[LLM推論]]
- Key insight: Reducing non-MMA FLOPs and parallelizing over the sequence length raises A100 utilization from 25-35% to 50-73% and reaches 225 TFLOP/s.

## [2026-06-18] ingest-paper | FlashAttention-3 (arXiv:2407.08608)
- Source: `.raw/papers/arxiv-2407.08608.pdf`
- Summary: [[@2024__arXiv__FlashAttention-3 - Fast and Accurate Attention with Asynchrony and Low-precision]]
- Pages created: [[@2024__arXiv__FlashAttention-3 - Fast and Accurate Attention with Asynchrony and Low-precision]] / [[Jay Shah]]
- Pages updated: [[Together AI]] / [[FlashAttention]] / [[テンソルコア]] / [[LLM推論]]
- Key insight: Warp specialization on Hopper, together with FP8 block quantization and incoherent processing, simultaneously achieves 740 TFLOP/s (75% utilization) and a 2.6× improvement in numerical error.

## [2026-06-18] ingest-paper | FlashAttention-4 (arXiv:2603.05451)
- Source: `.raw/papers/arxiv-2603.05451.pdf`
- Summary: [[@2026__arXiv__FlashAttention-4 - Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling]]
- Pages created: [[@2026__arXiv__FlashAttention-4 - Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling]]
- Pages updated: [[FlashAttention]] / [[テンソルコア]] / [[カーネルフュージョン]] / [[GPU最適化]] / [[LLM推論]]
- Key insight: Identifies the asymmetric scaling problem in which the exponential unit becomes the bottleneck on Blackwell, and reaches 1613 TFLOP/s (71%) with a software-emulated exponential, TMEM, and CuTe-DSL.

## [2026-06-18] ingest-paper | AIBrix (arXiv:2504.03648)
- Source: `.raw/papers/arxiv-2504.03648.pdf`
- Summary: [[@2025__arXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]]
- Pages created: [[@2025__arXiv__AIBrix - Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure]]
- Pages updated: [[AIBrix]] / [[LLM推論]] / [[KVキャッシュ管理]] / [[Prefill-Decode分離]]
- Key insight: A Kubernetes + Ray hybrid hosts inference engines in a vendor-neutral way. The distributed KV cache yields a 50% throughput improvement and a 70% latency reduction.

## [2026-06-18] ingest-paper | GPT-4 Technical Report (arXiv:2303.08774)
- Source: `.raw/papers/arxiv-2303.08774.pdf`
- Summary: [[@2023__arXiv__GPT-4 Technical Report]]
- Pages created: [[@2023__arXiv__GPT-4 Technical Report]]
- Pages updated: [[OpenAI]] / [[LLMスケーリング則]] / [[LLM評価]] / [[RLHF誤誘導]] / [[index]] / [[sources/_index]]
- Key insight: The two major contributions of the GPT-4 Technical Report are predictable scaling (predicting performance in advance from 1/1,000 to 1/10,000 of the compute) and quantifying the calibration degradation caused by RLHF (ECE 0.007 → 0.074). The architecture is not disclosed.

---

## [2026-06-18] wiki-query | TSFM, TS-MLLM, and Toto-Qwen3-VL comparison + Transformer basics
- Question: What are the commonalities, differences, and architectural essence of time-series foundation models, time-series multimodal LLMs, and Toto-Qwen3-VL?
- Summary: Starting from the basics of self-attention, patching, encoder-decoder, tokenization, and pretraining/RLVR, compares the three along five axes: signal flow, origin of the TS encoder, handling of the multivariate dimension, number of modalities, and training pipeline. Lays out the division of roles ("the bearer of forecast accuracy vs. the bearer of reasoning") and the integrated picture of a two-stage TSFM → TS-MLLM stack (demonstrated by Toto-Qwen3-VL)
- Pages created: [[TSFM-TSMLLM-TotoQwen3VL-比較と基礎]] (`wiki/questions/`)
- Pages updated: [[index]]

---

## [2026-06-18] ingest-paper | Five KV cache and GPU cluster papers
- Source: `.raw/papers/arxiv-2506.02634.pdf`, `.raw/papers/nsdi22-paper-weng.pdf`, `.raw/papers/arxiv-2405.16444.pdf`, `.raw/papers/arxiv-2503.16525.pdf`, `.raw/papers/arxiv-2412.10319.pdf`
- Summary: Batch-ingested five papers: production KV cache workload characterization (KVCache Cache in the Wild), workload analysis of heterogeneous GPU clusters (MLaaS in the Wild), non-prefix KV cache reuse for RAG (CacheBlend), multi-tenant KV cache reuse (KVShare), and a KV-cache-centric benchmark of long-context methods (SCBench)
- Pages created: [[@2026__arXiv__KVCache Cache in the Wild - Characterizing and Optimizing KVCache Cache at a Large Cloud Provider]], [[@2022__NSDI__MLaaS in the Wild - Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters]], [[@2025__EuroSys__CacheBlend - Fast Large Language Model Serving for RAG with Cached Knowledge Fusion]], [[@2025__arXiv__KVShare - An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse]], [[@2025__ICLR__SCBench - A KV Cache-Centric Analysis of Long-Context Methods]], [[Xingda Wei]], [[Jinbo Han]], [[Qizhen Weng]], [[Alibaba PAI]], [[Alibaba GPU Cluster Trace]], [[Jiayi Yao]], [[Junchen Jiang]], [[CacheBlend]], [[Huan Yang]], [[KVShare]], [[Central South University]], [[Yucheng Li]], [[Huiqiang Jiang]], [[SCBench]], [[University of Chicago]], [[University of Surrey]]
- Pages updated: [[KVキャッシュ管理]], [[LLM推論]], [[GPUクラスタスケジューリング]], [[Tsinghua University]], [[Microsoft Research]], [[Alibaba Group]], [[Yuhan Liu]]
- Key insight: Production KV cache hit rates stay at 54-62%, versus more than 80% on synthetic benchmarks, so workload-aware eviction is essential. Non-prefix selective recomputation for RAG and multi-tenant serving evolves from CacheBlend (at prefill time) → KVShare (handling attention drift at decode time). SCBench shows that sub-O(n) memory methods break down in multi-turn settings and that the entire KV cache lifecycle needs to be evaluated.

## [2026-06-18] enrich-source | Enriched From Attention to Disaggregation
- Source: [[@2025__arXiv__From Attention to Disaggregation - Tracing the Evolution of LLM Inference]] (re-scanned all 22 pages of `.raw/papers/arxiv-2511.07422.pdf`)
- Pages updated: [[@2025__arXiv__From Attention to Disaggregation - Tracing the Evolution of LLM Inference]] (table of 6 optimizations, GPU memory hierarchy, Monolithic vs Disaggregated comparison, PEARL parallel Speculative Decoding, control/data plane details of DistServe/AIBrix/NVIDIA Dynamo, performance figures, key citations from the 25 references)
- Pages created: [[NVIDIA Dynamo]] (distinguished from Amazon [[Dynamo]]), [[AIBrix]] (cloud-native control plane)
- Key insight: In the figure captions for KV Cache, Continuous Batching, PagedAttention, and RadixAttention, the authors state explicitly that the paper's CAP interpretation is "a metaphor for logical resource allocation within a single, tightly coupled system". Treat it as **design vocabulary**, not rigorous distributed-systems theory. The comparison of three archetypes (research-first / cloud-native / full-stack hardware co-design), on the other hand, has high practical value as a map of PD disaggregation research.

## [2026-06-18] ingest-paper | Mooncake - A KVCache-centric Disaggregated Architecture for LLM Serving
- Source: `.raw/papers/arxiv-2407.00079.pdf`
- Summary: [[@2024__arXiv__Mooncake - A KVCache-centric Disaggregated Architecture for LLM Serving]]
- Pages created: [[@2024__arXiv__Mooncake - A KVCache-centric Disaggregated Architecture for LLM Serving]], [[Ruoyu Qin]], [[Zheming Li]], [[Weiran He]], [[Mingxing Zhang]], [[Yongwei Wu]], [[Weimin Zheng]], [[Xinran Xu]]
- Pages updated: [[KVキャッシュ管理]], [[Prefill-Decode分離]], [[LLM推論]], [[Mooncake]], [[Moonshot AI]], [[Tsinghua University]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: In Kimi's overloaded production MaaS, CPU/DRAM is promoted to the first tier of the KVCache, and KVCache-centric scheduling improves TTFT by 15× compared with random scheduling. The load oscillation inherent to PD disaggregation cannot be eliminated without predicting future load.

## [2026-06-18] ingest-slides | A study on accelerating LLM inference using KV cache sharing with IOWN APN
- Source: `.raw/slides/mpls2025-02-02-tanaka/mpls2025-02-02-tanaka.pdf`
- Visual pages: `.raw/slides/mpls2025-02-02-tanaka/pages/`
- Media: none
- Summary: [[@2025__MPLSJapan__A study on accelerating LLM inference using KV cache sharing with IOWN APN]]
- Pages created: [[@2025__MPLSJapan__A study on accelerating LLM inference using KV cache sharing with IOWN APN]], [[田仲顕至]], [[NTT]], [[IOWN APN]]
- Pages updated: [[KVキャッシュ管理]], [[LLM推論]], [[AI Greenferencing]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: KV cache sharing extends from memory/storage management within a cluster to the design of wide-area inference infrastructure that ties together distributed small data centers over a low-latency, high-bandwidth network such as IOWN APN.

## [2026-06-18] ingest-paper | 15 papers on communication, scheduling, and network infrastructure for distributed deep learning
- Source: `.raw/papers/nsdi22-paper-romero.pdf`, `.raw/papers/arxiv-2410.21680.pdf`, `.raw/papers/sigcomm24-final246-acmpaginated.pdf`, `.raw/papers/gpu-util-icse2024.pdf`, `.raw/papers/nsdi19-gu.pdf`, `.raw/papers/rdma_sigcomm2016.pdf`, `.raw/papers/haidar_fp16_sc18.pdf`, `.raw/papers/arxiv-2209.01346.pdf`, `.raw/papers/arxiv-2302.03337.pdf`, `.raw/papers/google-34926.pdf`, `.raw/papers/google-35154.pdf`, `.raw/papers/nsdi20-paper-mahajan.pdf`, `.raw/papers/2024_EthernetHu.pdf`, `.raw/papers/arxiv-2307.12169.pdf`, `.raw/papers/p523.pdf`
- Summary: Batch-ingested 15 papers in total: collective communication optimization (MSCCL), GPU cluster scheduling (Tiresias/Themis), an empirical study of GPU utilization, large-scale RDMA deployment (Microsoft/Meta), congestion control (DCQCN), RoCE design challenges, network topologies (Dragonfly/HammingMesh/Rail-only), mixed-precision training (FP16 Tensor Core), Ethernet benchmarking, and ML cluster reliability (HPCA'25, already in the wiki)
- Pages created: [[@2022__NSDI__Accelerating Collective Communication in Data Parallel Training across Deep Learning Frameworks]], [[@2024__SIGCOMM__RDMA over Ethernet for Distributed AI Training at Meta Scale]], [[@2024__ICSE__An Empirical Study on Low GPU Utilization of Deep Learning Jobs]], [[@2019__NSDI__Tiresias - A GPU Cluster Manager for Distributed Deep Learning]], [[@2016__SIGCOMM__RDMA over Commodity Ethernet at Scale]], [[@2018__SC__Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers]], [[@2008__ISCA__Technology-Driven, Highly-Scalable Dragonfly Topology]], [[@2009__IEEE-Micro__Cost-Efficient Dragonfly Topology for
Large-Scale Systems]], [[@2015__SIGCOMM__Congestion Control for Large-Scale RDMA Deployments]], [[@2020__NSDI__Themis - Fair and Efficient GPU Cluster Scheduling]], [[@2023__IEEE Computer__Datacenter Ethernet and RDMA - Issues at Hyperscale]], [[@2022__SC__HammingMesh - A Network Topology for Large-Scale Deep Learning]], [[@2023__arXiv__Rail-only - A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters]], [[@2024__SC-W 2024__Benchmarking Ethernet Interconnect for HPC AI workloads]], [[Dragonflyトポロジ]], [[データセンター輻輳制御]], [[RoCE設計課題]], [[HPCインターコネクトベンチマーク]] - Pages updated: [[RDMA]], [[GPUクラスタスケジューリング]], [[Fat-Tree]], [[集合通信]], [[混合精度訓練]], [[GPUクラスタ運用]] - Key insight: RDMA の大規模展開は DCQCN(2015)→Microsoft 全 DC 展開(2016)→Meta AI 訓練(2024)と進化し、AI ワークロード固有の要求が輻輳制御の再設計を促した。ネットワークトポロジも Fat-Tree 一辺倒から Dragonfly/HammingMesh/Rail-only へ多様化し、ワークロード特化設計がコスト削減の鍵となっている。 ## [2026-06-18] ingest-paper | LLM inference KV cache management and disaggregation - Source: `.raw/papers/arxiv-2309.06180.pdf`, `.raw/papers/arxiv-2510.09665.pdf`, `.raw/papers/arxiv-2511.07422.pdf`, `.raw/papers/724be4472168f31ba1c9ac630f15dec8-Paper-Conference.pdf`, `.raw/papers/arxiv-2408.08147.pdf`, `.raw/papers/arxiv-2404.14294.pdf` - Summary: [[@2023__SOSP__Efficient Memory Management for Large Language Model Serving with PagedAttention]], [[@2025__arXiv__LMCache - An Efficient KV Cache Layer for Enterprise-Scale LLM Inference]], [[@2025__arXiv__From Attention to Disaggregation - Tracing the Evolution of LLM Inference]], [[@2024__NeurIPS__SGLang - Efficient Execution of Structured Language Model Programs]], [[@2024__arXiv__P-D-Serve - Serving Disaggregated Large Language Model at Scale]], [[@2024__arXiv__A Survey on Efficient Inference for Large Language Models]] - Pages created: [[@2023__SOSP__Efficient Memory Management for Large Language Model Serving with PagedAttention]], [[@2025__arXiv__LMCache - An Efficient KV Cache Layer for Enterprise-Scale LLM Inference]], [[@2025__arXiv__From 
Attention to Disaggregation - Tracing the Evolution of LLM Inference]], [[@2024__NeurIPS__SGLang - Efficient Execution of Structured Language Model Programs]], [[@2024__arXiv__P-D-Serve - Serving Disaggregated Large Language Model at Scale]], [[@2024__arXiv__A Survey on Efficient Inference for Large Language Models]], [[KVキャッシュ管理]], [[SGLang]], [[P-D-Serve]], [[Woosuk Kwon]], [[Yuhan Liu]], [[Srinivasa Rao Aravilli]], [[Yibo Jin]], [[Zixuan Zhou]], [[Tensormesh Inc]], [[Infinigence-AI]], [[Capital One]] - Pages updated: [[LLM推論]], [[Prefill-Decode分離]], [[vLLM]], [[LMCache]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]] - Key insight: KV キャッシュ最適化は vLLM の GPU 内ページ化から、SGLang の prefix 木再利用、LMCache の階層ストレージ/転送、P/D-Serve の本番 RoCE D2D 転送へ拡張し、LLM 推論の中心制御対象になった。 ## [2026-06-18] ingest-paper | LLM inference serving: DistServe + Taming the Titans - Source: `.raw/papers/osdi24-zhong-yinmin.pdf`, `.raw/papers/acl-2025-inlg-main-32.pdf` - Summary: [[@2024__OSDI__DistServe - Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving]], [[@2025__INLG__Taming the Titans - A Survey of Efficient LLM Inference Serving]] - Pages created: [[@2024__OSDI__DistServe - Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving]], [[@2025__INLG__Taming the Titans - A Survey of Efficient LLM Inference Serving]], [[Prefill-Decode分離]], [[DistServe]], [[Yinmin Zhong]], [[Shengyu Liu]], [[Junda Chen]], [[Jianbo Hu]], [[Xuanzhe Liu]], [[Ranran Zhen]], [[Juntao Li]], [[Yixin Ji]], [[Zhenlin Yang]], [[Tong Liu]], [[Min Zhang]], [[Qingrong Xia]], [[Xinyu Duan]], [[Zhefeng Wang]], [[Baoxing Huai]], [[Soochow University]], [[UC San Diego]], [[StepFun]] - Pages updated: [[LLM推論]], [[Peking University]], [[Huawei Cloud]], [[Yibo Zhu]], [[Xin Jin]], [[Hao Zhang]], [[vLLM]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]] - Key insight: PD 分離は単なる配置技法ではなく、TTFT/TPOT の二重 SLO を満たす 
Goodput 最適化問題として定式化される。2025 年時点では、推論サービングはインスタンス・クラスタ・新興シナリオを跨ぐ階層的な運用設計問題へ広がっている。

## [2026-06-18] ingest-slides | 推論基盤のパフォーマンス検証と最適化戦略
- Source: `.raw/slides/performance_verification_and_optimization_strategy_for_inference/performance_verification_and_optimization_strategy_for_inference.pdf`
- Visual pages: `.raw/slides/performance_verification_and_optimization_strategy_for_inference/pages/`
- Media: none
- Summary: [[@2026__SpeakerDeck__推論基盤のパフォーマンス検証と最適化戦略]]
- Pages created: [[@2026__SpeakerDeck__推論基盤のパフォーマンス検証と最適化戦略]]
- Pages updated: [[LLM推論]], [[サービスレベル目標]], [[道下幹也]], [[SAKURA Internet]], [[高火力 PHY]], [[vLLM]], [[LMCache]], [[Mooncake]], [[sources/_index]], [[concepts/_index]], [[entities/_index]], [[index]], [[hot]]
- Key insight: Optimization of an LLM inference platform should center on SLO/SLA and goodput. Under the same 4-GPU budget, PD disaggregation keeps ITL tail latency stable, and KV cache reuse/sharing via Mooncake Store cuts TTFT by up to roughly 1.75x, but the cost of loading the cache remains an unresolved design issue.

## [2026-06-17] ingest-paper | FFTrainer: Fast Failover in LLM Training
- Source: `.raw/papers/arxiv-2512.03644.pdf`
- Summary: [[@2025__arXiv__FFTrainer Fast Failover in Large Language Model Training with Almost Free State Management]]
- Pages created: [[@2025__arXiv__FFTrainer Fast Failover in Large Language Model Training with Almost Free State Management]], [[FFTrainer]], [[Bohan Zhao]], [[Wei Xu]], [[耐障害LLM訓練]]
- Pages updated: [[チェックポイント]], [[LLM分散学習]]
- Key insight: Near-zero-overhead checkpointing (< 3%) that uses idle bandwidth in the training network, together with checkpoint razor (compressing checkpoint size to 1/10 or less), enables per-iteration checkpointing and shortens failure recovery from tens of minutes to tens of seconds.

## [2026-06-17] ingest-paper | Cassini: Network-Aware Job Scheduling
- Source: `.raw/papers/nsdi24-rajasekaran.pdf`
- Summary: [[@2024__NSDI__Cassini Network-Aware Job Scheduling in Machine Learning Clusters]]
- Pages created: [[@2024__NSDI__Cassini Network-Aware Job Scheduling in Machine Learning Clusters]], [[Cassini]], [[Sudarsanan Rajasekaran]], [[Manya Ghobadi]], [[Aditya Akella]], [[ネットワーク対応スケジューリング]]
- Pages updated: [[GPUクラスタスケジューリング]]
- Key insight: Integrating GPU placement with network flow scheduling improves JCT by up to 1.6x. The paper demonstrates that interference between ring-allreduce flows is the main bottleneck.

## [2026-06-17] ingest-paper | Understanding Communication Characteristics of Distributed Training
- Source: `.raw/papers/ai-workload-apnet24.pdf`
- Summary: [[@2024__APNet__Understanding Communication Characteristics of Distributed Training]]
- Pages created: [[@2024__APNet__Understanding Communication Characteristics of Distributed Training]], [[Kai Chen (HKUST)]], [[iSING Lab]]
- Pages updated: [[集合通信]], [[並列化戦略]]
- Key insight: Presents the first systematic measured profile of 3D parallelism: AllReduce within TP occupies 55 to 85% of bandwidth, DP AllReduce is highly bursty, and PP consumes little bandwidth but is latency-sensitive.

## [2026-06-17] ingest-paper | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
- Source: `.raw/papers/p3848-huang.pdf`
- Summary: [[@2023__VLDB__PyTorch FSDP Experiences on Scaling Fully Sharded Data Parallel]]
- Pages created: [[@2023__VLDB__PyTorch FSDP Experiences on Scaling Fully Sharded Data Parallel]], [[Yanli Zhao]], [[ZeROパラメータシャーディング]]
- Pages updated: [[並列化戦略]], [[LLM分散学習]]
- Key insight: Demonstrates at industrial scale communication coalescing via FlatParameter, an 18% throughput gain on GPT-175B from backward prefetching, and up to a 5x gain on T5-11B from the rate limiter.

## [2026-06-17] ingest-paper | Reducing Activation Recomputation in Large Transformer Models
- Source: `.raw/papers/80083951326cf5b35e5100260d64ed81-Paper-mlsys2023.pdf`
- Summary: [[@2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]]
- Pages created: [[@2023__MLSys__Reducing Activation Recomputation in Large Transformer Models]], [[Vijay Korthikanti]], [[選択的活性化再計算]], [[シーケンス並列化]], [[再マテリアライゼーション]]
- Pages updated: [[Megatron-LM]], [[Bryan Catanzaro]], [[Mohammad Shoeybi]]
- Key insight: Selectively recomputes only the activations other than QKV, cutting activation memory by 5x on a 530B-parameter model while holding recomputation overhead to one third of full recomputation.

## [2026-06-17] ingest-paper | FP8-LM: Training FP8 Large Language Models
- Source: `.raw/papers/arxiv-2310.18313.pdf`
- Summary: [[@2023__arXiv__FP8-LM Training FP8 Large Language Models]]
- Pages created: [[@2023__arXiv__FP8-LM Training FP8 Large Language Models]], [[Houwen Peng]], [[Han Hu]], [[混合精度訓練]] 
- Pages updated: [[LLM分散学習]]
- Key insight: First systematic validation of LLM pretraining in FP8. Applies FP8 to the forward pass, FP16/BF16 to the backward pass, and FP8 with precision compensation to gradients, achieving a 42% memory reduction and 64% faster training on GPT-175B at accuracy on par with BF16.

## [2026-06-17] ingest-paper | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
- Source: `.raw/papers/sc_megatron_lm.pdf`
- Summary: [[@2021__SC__Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM]]
- Pages created: [[@2021__SC__Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM]], [[Deepak Narayanan]], [[Matei Zaharia]], [[PTD-P]]
- Pages updated: [[Megatron-LM]], [[並列化戦略]], [[LLM分散学習]]
- Key insight: Proposes PTD-P, which combines pipeline, tensor, and data parallelism (3D parallelism), and demonstrates that a 1-trillion-parameter model can be trained on 3072 A100 GPUs at 502 petaFLOP/s (MFU 52%).

## [2026-06-17] ingest-paper | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Source: `.raw/papers/arxiv-1910.02054.pdf`
- Summary: [[@2020__SC__ZeRO Memory Optimizations Toward Training Trillion Parameter Models]]
- Pages created: [[@2020__SC__ZeRO Memory Optimizations Toward Training Trillion Parameter Models]], [[Samyam Rajbhandari]], [[ZeROメモリ最適化]], [[ZeROオプティマイザ]]
- Pages updated: [[DeepSpeed]], [[LLM分散学習]]
- Key insight: Proposes Stages 1 to 3, which progressively partition optimizer states, gradients, and parameters across GPUs, making 100-billion-parameter training feasible (the paper trains up to 13B parameters without model parallelism).

## [2026-06-17] ingest-paper | HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
- Source: `.raw/papers/osdi20-zhao_hanyu.pdf`
- Summary: [[@2020__OSDI__HiveD Sharing a GPU Cluster for Deep Learning with Guarantees]]
- Pages created: [[@2020__OSDI__HiveD Sharing a GPU Cluster for Deep Learning with Guarantees]], [[HiveD]], [[Hanyu Zhao]], [[OpenPAI]], [[共有異常]], [[Virtual Private Cluster]]
- Pages updated: [[GPUクラスタスケジューリング]]
- Key insight: Identifies the "sharing anomaly" in multi-tenant GPU clusters (a tenant waits longer than it would in a private cluster even while within quota) and provides a provable sharing-safety guarantee through VCs plus buddy cell allocation.

## [2026-06-17] ingest-paper | DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters
- Source: 
`.raw/papers/2026_Unknown_DeepSpeed.pdf`
- Summary: [[@2020__KDD__DeepSpeed System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters]]
- Pages created: [[@2020__KDD__DeepSpeed System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters]], [[Jeff Rasley]], [[Yuxiong He]]
- Pages updated: [[DeepSpeed]]
- Key insight: Two-page abstract of the KDD 2020 tutorial. Gives an overview of ZeRO training models with 100 to 200 billion parameters up to 10x faster, and of the 44-minute BERT pretraining record.

## [2026-06-17] ingest-paper | PipeDream: Generalized Pipeline Parallelism for DNN Training
- Source: `.raw/papers/sosp_pipedream.pdf`
- Summary: [[@2019__SOSP__PipeDream Generalized Pipeline Parallelism for DNN Training]]
- Pages created: [[@2019__SOSP__PipeDream Generalized Pipeline Parallelism for DNN Training]], [[PipeDream]]
- Pages updated: [[パイプライン並列化]], [[並列化戦略]]
- Key insight: With the 1F1B pipeline schedule and weight stashing, cuts memory by 2x relative to GPipe and trains VGG-16 5.3x faster than data parallelism.

## [2026-06-17] ingest-paper | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Source: `.raw/papers/arxiv-1909.08053.pdf`
- Summary: [[@2019__arXiv__Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism]]
- Pages created: [[@2019__arXiv__Megatron-LM Training Multi-Billion Parameter Language Models Using Model Parallelism]], [[Mohammad Shoeybi]], [[テンソル並列]]
- Pages updated: [[Megatron-LM]], [[並列化戦略]]
- Key insight: Proposes intra-layer tensor parallelism by partitioning the matrices of the MLP and self-attention blocks, holding communication to two AllReduce operations per layer in each pass. Sustains 15.1 PetaFLOPs on 512 V100 GPUs with an 8.3-billion-parameter Transformer, a 76% scaling efficiency relative to the single-GPU baseline (not 76% of theoretical peak).

## [2026-06-17] ingest-paper | GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- Source: `.raw/papers/arxiv-1811.06965.pdf`
- Summary: [[@2019__NeurIPS__GPipe Easy Scaling with Micro-Batch Pipeline Parallelism]]
- Pages created: [[@2019__NeurIPS__GPipe Easy Scaling with Micro-Batch Pipeline Parallelism]], [[GPipe]], [[Yanping Huang]], [[Quoc V. Le]], [[パイプライン並列化]]
- Pages updated: [[LLM分散学習]]
- Key insight: Makes pipeline parallelism practical by combining micro-batch splitting with rematerialization. The bubble overhead is O((K−1)/(M+K−1)) for K partitions and M micro-batches, and becomes negligible as M grows.

## [2026-06-17] ingest-paper | Ray: A Distributed Framework for Emerging AI Applications
- Source: `.raw/papers/osdi18-moritz.pdf`
- Summary: [[@2018__OSDI__Ray A Distributed Framework for Emerging AI Applications]]
- Pages created: [[@2018__OSDI__Ray A Distributed Framework for Emerging AI Applications]], [[Ray]], [[Philipp Moritz]], [[タスク並列フレームワーク]], [[動的タスクグラフ]]
- Pages updated: [[Ion Stoica]], [[University of California, Berkeley]]
- Key insight: A distributed framework that unifies the task-parallel and actor models. With a dynamic task graph, the GCS, and a bottom-up distributed scheduler, it handles more than one million tasks per second at 1.8 ms latency.

## [2026-06-17] ingest-slides | AI時代に向けたクラウドにおける信頼性エンジニアリングの未来構想
- Source: `.raw/slides/dicomo2022/dicomo2022.pdf`
- Visual pages: `.raw/slides/dicomo2022/pages/`
- Media: none
- Summary: [[@2022__DICOMO__AI時代に向けたクラウドにおける信頼性エンジニアリングの未来構想]]
- Pages created: [[@2022__DICOMO__AI時代に向けたクラウドにおける信頼性エンジニアリングの未来構想]], [[Interactive AIOps]], [[セルフクラフト]]
- Pages updated: [[AIOps]], [[SRE]], [[サービスレベル目標]], [[自動化のアイロニー]], [[Yuuki Tsubouchi]], [[Hirofumi Tsuruta]], [[sources/_index]], [[concepts/_index]], [[index]], [[hot]]
- Key insight: As early as 2022, the talk extended SRE's reliability-control philosophy toward user-driven [[セルフクラフト]] (self-craft) in the 2040s, and presented [[Interactive AIOps]] (experimentability + interpretability) as the preceding stage of engineer-AI collaboration.

## [2026-06-17] ingest-paper | Two follow-up papers to Ironies of Automation (Baxter+ ECCE2012 / Strauch IEEE-THMS2017)
- Source: `.raw/papers/ECCE2012_baxter_ironies.pdf`, `.raw/papers/roniesofutomationtillnresolvedfterllheseears_4830.pdf`
- Summary: [[@2012__ECCE__The Ironies of Automation Still Going Strong at 30]], [[@2017__IEEE THMS__Ironies of Automation - Still Unresolved After All These Years]]
- Pages created: [[Gordon Baxter]], [[John Rooksby]], [[Barry Strauch]], [[University of St Andrews]], [[National Transportation Safety Board]]
- Pages updated: [[自動化のアイロニー]], [[Lisanne Bainbridge]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]]
- Key 
insight: The ironies of Bainbridge (1983) have remained structurally unresolved for more than 40 years, and the domains they apply to keep expanding. Strauch systematized new ironies (skill masking, repetition of the same errors, excess functionality), and Baxter et al. identified a new irony in which the low cost of the cloud lets quality be bypassed.

## [2026-06-17] ingest | ペパボ研究所 gpt-ossモデルのサービング性能評価(三宅悠介)
- Source: `.raw/articles/gpt-oss-serving-2025-08-18.md`
- Summary: [[@2025__ペパボ研究所__gpt-ossモデルのサービング性能評価]]
- Pages created: [[三宅悠介]], [[GMOペパボ]]
- Pages updated: [[vLLM]], [[LLM推論]], [[sources/_index]], [[index]], [[hot]]
- Key insight: Parallel scaling is effective only on H100, the number of output tokens dominates throughput, and reasoning effort matters as much as the choice of model size.

---

## [2026-06-17] ingest-paper | Batch of 4 microservice benchmark/dataset papers (DeathStarBench + Smith+ + OSS-MS + TrainTicketTrace)
- Sources:
  - `.raw/papers/arxiv-1905.11055.pdf` (md5: obtained, 16 pages)
  - `.raw/papers/arxiv-2306.05895.pdf` (md5: obtained, 7 pages)
  - `.raw/papers/2026_Unknown_A_Dataset_Microservices_Open_Source.pdf` (md5: obtained, 6 pages)
  - `.raw/papers/TrainTicketTrace_A_Multi-Fault_Distributed_Dataset_for_Microservice_Fault_Detection_and_Localization.pdf` (md5: obtained, 8 pages)
- Summary: [[@2019__ASPLOS__An Open-Source Benchmark Suite for Cloud and IoT Microservices]] / [[@2023__arXiv__Benchmarks for End-to-End Microservices Testing]] / [[@2024__MSR__A Dataset of Microservices-based Open-Source Projects]] / [[@2026__SANER-C__TrainTicketTrace - A Multi-Fault Distributed Dataset for Microservice Fault Detection and Localization]]
- Pages created (sources): [[@2019__ASPLOS__An Open-Source Benchmark Suite for Cloud and IoT Microservices]] / [[@2023__arXiv__Benchmarks for End-to-End Microservices Testing]] / [[@2024__MSR__A Dataset of Microservices-based Open-Source Projects]] / [[@2026__SANER-C__TrainTicketTrace - A Multi-Fault Distributed Dataset for Microservice Fault Detection and Localization]]
- Pages created (entities): [[Christina Delimitrou]] / [[Yu Gan]] / [[Cornell University]] / [[Davide Taibi]] / [[Tomas Cerny]] / [[University of Oulu]] / [[Baylor University]] / [[eShopOnContainers]] / [[EvoMaster]] / [[World of Code]] / [[Software Competence Center Hagenberg]] / 
[[Pirmin Urbanke]] / [[Stefan Fischer]] / [[Dario Amoroso d'Aragona]] / [[Alexander Bakhtin]] / [[Tampere University]] - Pages created (concepts): [[マイクロサービスベンチマーク]] - Pages updated: [[DeathStarBench]] / [[Train-Ticket]] / [[マイクロサービスアーキテクチャ]] / [[マイクロサービスコールグラフ]] / [[分散トレーシング]] / [[Fault Localization]] / [[障害注入]] / [[sources/_index]] / [[index]] / [[hot]] / `.raw/.manifest.json` - Key insight: マイクロサービス研究の benchmark/dataset カタログとして 4 本を一望できる枠組みが揃った。DeathStarBench(2019、6 アプリ + 自前 trace 0.1% overhead)が学術ベンチの原典、Smith+(2023、Selenium + Gatling test suite)がテスト benchmark、Amoroso+(2024、378 件 OSS-MS dataset)が大規模カタログ、TrainTicketTrace(2026、42 services × 9 fault × 3 modality)が fault localization dataset の現代版。Train-Ticket が 3 本に共通の benchmark system として登場し、microservice 研究の **de facto 共通基盤**化が裏付けられた。EvoMaster の生成テストはすべての seeded fault を test では見落としたが trace/metric/log には痕跡が残るという観察は、test layer と observability layer の段の切れ目を示し、後者で fault detection 研究を進めるべき方向性を示唆する。 ## [2026-06-17] ingest-paper | Time-RA(ACL Findings 2026)— TSAD 生成型推論タスク + RATs40K - Source: `.raw/papers/arxiv-2507.15066.pdf`(md5: 8ad5638639ef270b90f0c0d24a1f721e、27 pages) - Summary: [[@2026__ACL Findings__Time-RA - Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback]] - Pages created: - [[@2026__ACL Findings__Time-RA - Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback]] - [[Yiyuan Yang]] - Pages updated: [[Qingsong Wen]] / [[Zichuan Liu]] / [[時系列推論]] / [[時系列異常検知ベンチマーク]] / [[時系列マルチモーダルLLM]] / [[sources/_index]] / [[index]] / [[hot]] - Key insight: TSAD を「二値識別」から「生成型推論(検知+分類+因果説明)」へ転換した TIME-RA と、実世界 10 ドメイン約 4 万件の RATs40K。SFT + LoRA で fine-tune した Qwen2.5-7B がプラグアンドプレイで未見ドメインに転用可能であることを初めて実証。視覚化は分類より推論一貫性向上に安定的に寄与する。 ## [2026-06-17] ingest-paper | LLMAD(KDD 2025)+ ChatTS(VLDB 2025)を同時取り込み - Sources: `.raw/papers/Liu-et-al.-2024---Large-Language-Models-can-Deliver-Accurate-and-Interpretable-Time-Series-Anomaly-Detection.pdf`(md5: 92f204e58e5a16066500e5ff5fdf1eb2、20 pages) / 
`.raw/papers/p2385-xie.pdf`(md5: aea42f734779575e627e15a61e546fae、14 pages) - Summary: [[@2025__KDD__Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection]] / [[@2025__VLDB__ChatTS - Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning]] - Pages created: - [[@2025__KDD__Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection]] / [[@2025__VLDB__ChatTS - Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning]] - [[LLMAD]] / [[ChatTS]] / [[AnoCoT]] / [[TSEvol]] / [[Anomaly Transformer]] / [[Qwen2.5-14B-Instruct]] - [[Jun Liu (UCAS)]] / [[Jiaxu Qian]] / [[Xiao He]] / [[Jianjun Chen]] / [[Rui Shi]] / [[University of Chinese Academy of Sciences]] / [[Zhejiang University of Technology]] - [[時系列マルチモーダルLLM]] - Pages updated: - 既存 entity: [[Microsoft]] / [[Tsinghua University]] / [[BNRist]] / [[BizSeer]] / [[ByteDance]] / [[Dan Pei]] / [[Qingwei Lin]] / [[Minghua Ma]] / [[Chaoyun Zhang]] / [[Si Qin]] / [[Chetan Bansal]] / [[Dongmei Zhang]] / [[Saravan Rajmohan]] / [[Zhe Xie]] / [[Zeyan Li]] / [[Longlong Xu]] / [[Xidao Wen]] / [[Tieying Zhang]] - 既存 concept(横断的知見・未解決の問い更新): [[異常検知]] / [[LLM時系列アプローチ]] / [[時系列異常検知ベンチマーク]] - 索引: [[sources/_index]] / [[entities/_index]] / [[concepts/_index]] / [[index]] / [[hot]] / [[log]] / `.raw/.manifest.json` - Key insights: - [[LLMAD]] は「LLM を直接判定器として使う」路線で、Prompting + 履歴 ICL(FastDTW)+ AnoCoT で平均 Best F1=0.759、TFAD(0.725)を上回り年間 $65.70。常時稼働の検知に LLM が重すぎる制約に対し、1 分粒度サンプリングなら実用域に届くことを示し、Microsoft 内の異常検知 LLM 化研究の 2 路線(検知器 vs メタ層)が併走している事実を明確化。 - [[ChatTS]] は時系列を画像同等のネイティブモダリティとして扱う初の TS-MLLM。属性プールから生成した完全合成データのみで Qwen2.5-14B を SFT し、GPT-4o vision を alignment +46.0% / reasoning +25.8% で凌駕。[[LLM時系列アプローチ]] の 5 既存路線(Prompting/Quantization/Aligning/Vision/Tool)に第 6 路線として [[時系列マルチモーダルLLM]] を追加する必要性を実証。 - 両論文は共に「Prompting/MLLM で時系列を扱う」という同じ問題意識から異なる解(LLMAD: 単変量 + 解釈、ChatTS: 多変量 + 推論)を出した姉妹路線。両者を統合する経路(ChatTS の 
rulebook 適用 × LLMAD の AnoCoT)は未着手。 ## [2026-06-17] wiki-query 追補 | Ganatra+ ESEC/FSE2023 を年代別レビューに統合 - Page updated: [[アラーティングの進歩-年代別]] - Source incorporated: [[@2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]](Microsoft、 ESEC/FSE 2023) - 統合箇所: - §1 motivation に 27.25% アウテージ率 / 10.7× TTD / 3.75× TTM を追加 - §6 (2022〜2023) に Ganatra+ パラグラフ追加 — 既存研究の「ある alert をどう良くするか」に対し「そもそも alert が無い」を直交軸として可視化。 6 カテゴリ(Missing monitor/alert 40.41%・Missing/improper signal 18.13%・Incorrect alerting logic 12.78%・Improper coverage 10.02%・Buggy monitor 5.87%・Others 6.39%)とサービス特性 5 軸相関を要約 - §10 通時的潮流の「介入点の細分化」に零番目(monitor 存在判定)を追加し、 6+1 層 → 8 層構造に拡張 - §11 未解決の問いに (e) 検知失敗の事業者横断再測定 + intelligent monitoring framework 未実装を追加 - Key cross-link: Yang+ DSN2022(既存アラートの 6 アンチパターン)と Ganatra+ FSE2023(アラート不在の 6 カテゴリ)が**相補的タクソノミ対**を成す。 AlertGuardian 2025 の rule refinement が両者の motivation を引き継ぐ ## [2026-06-17] wiki-query | アラーティングの進歩 — 年代別レビュー - Question: 「アラーティングの進歩について、年代別にまとめてください。」 - Mode: Deep(hot/index 不読み・concept 10 本精読 + sources 年代別棚卸し) - Page created: [[アラーティングの進歩-年代別]] — 1980s 商用 NMS から 2026 agentic SRE まで、5+1 介入点(評価/抑制/フィルタ/集約/ランキング/RCA/handler)の層分化を論文 introduction 形式で再構成。 Jiang+ ICAC2009 → Tang+ NOMS2012 → Lin+ KDD2014 → Google SRE Book 2016 → Wilkinson SREcon18 → 2020 5 介入点同時開花(AlertStorm/AlertRank/DEAR/DeepIP)→ Yang+ DSN2022 アンチパターン → TraceArk/DyAlert 2023 動的グラフ世代 → 2024 LLM 役割 3 分化(COLA/Zha/MonitorAssistant)→ 2025 AlertGuardian ライフサイクル一括 + SkyNet の LLM 不採用境界 + ProAlert 教師なし化 → 2026 Google AI in SRE の 3 段 agentic を貫いて記述。 - Pages updated: [[wiki/index]] §Questions、[[wiki/log]] - Key observations: - **5+1 介入点が層分化した**: 「閾値+通知」が一体だった 2007 から、2024 までに監視評価/抑制/フィルタリング/集約/ランキング/RCA/autonomous handler の 6+1 層に分解された。 end-to-end 統合運用報告はまだない。 - **保証様式が 12 年で緩んだ**: Tang+ 2012 Theorem 1(数学的存在保証)→ Bhukar+ 2024(教師あり上界に到達する経験的近似)。教師あり ML 成熟の副作用として保証必要条件が「証明可能性」から「再現可能近似性能」へシフト。 - **集約アルゴリズム 3 段世代交代**: ペア類似度(2014)→ 動的グラフ表現学習(2023)→ 
教師なしトポロジセマンティクス(2025、ProAlert)。「接続性のみ」から「伝播のしやすさの semantics 学習」への軸シフト。 - **LLM 採用境界が SkyNet で明文化**: severe failure では Syslog 10M/15min が 20M トークン context 超過 + hallucination 許容不可で LLM を意図的に不採用。次世代 LLM の context 拡張で境界は動的に再定義される。 - **agentic 時代に「アラート」の意味論が変質中**: 人間通知用から autonomous handler 入力用へ。Google AI in SRE 2026 の 3 段(TimesFM 動的閾値 → alerting agent → autonomous handler)が示すこの変化を理論化する研究はまだない。 ## [2026-06-17] ingest-paper x4 | GLM family — 起点(ACL 2022) → GLM-4.5 → GLM-5 → GLM-OCR - Sources: - `.raw/papers/arxiv-2103.10360.pdf`(MD5: cdbcb4e448ccbdbac1947e548abe86d5、16p、ACL 2022) - `.raw/papers/arxiv-2508.06471.pdf`(MD5: c0413d46e4a971f8f42e5f2d03b50274、26p、arXiv 2025) - `.raw/papers/arxiv-2602.15763.pdf`(MD5: ff79e2abd52115089bb5373f82460c50、40p、arXiv 2026) - `.raw/papers/arxiv-2603.10910.pdf`(MD5: e624933ba66dfe5081d82ad1f9128d92、17p、arXiv 2026) - Summaries: [[@2022__ACL__GLM - General Language Model Pretraining with Autoregressive Blank Infilling]] / [[@2025__arXiv__GLM-4.5 - Agentic Reasoning and Coding Foundation Models]] / [[@2026__arXiv__GLM-5 - From Vibe Coding to Agentic Engineering]] / [[@2026__arXiv__GLM-OCR Technical Report]] - Pages created: 4 source + 1 サブソース([[@2026__Cursor__CursorBench - How Cursor Evaluates Model Quality]]、GLM-5 評価で参照される) + entity([[Zhengxiao Du]] / [[Yujie Qian]] / [[Ming Ding]] / [[Jiezhong Qiu]] / [[Zhilin Yang]] / [[Jie Tang]] / [[Zhipu AI]] / [[BAAI]] / [[Wenmeng Yu]] / [[Xiaotao Gu]] / [[CursorBench]] / [[Cursor]] / [[SWE-Bench-Verified]] / [[Naman Jain]]) + concept(GLM 系統では [[自己回帰空白埋め]] / [[2D位置符号化]] / [[スパン破壊]] / [[事前学習目的設計]] / [[言語モデル事前学習]]、GLM-4.5/5 系統では [[エージェント型コーディング]] / [[非同期エージェントRL]] / [[DSA]]、GLM-OCR では [[光学文字認識]] / [[文書理解]] / [[ビジョン言語モデル]]) - Pages updated: [[Tsinghua University]](GLM 起点論文の所属追記) / [[MIT CSAIL]](Yujie Qian の所属で GLM 関与追記) / [[Shanghai Qi Zhi Institute]](Zhilin Yang の所属で追記) / [[Xiao Liu]](GLM 共著者として first_mentioned 更新) / [[Mixture-of-Experts]](GLM-4.5/5 の MoE 設計を横断的知見追加) / [[マルチトークン予測]](DeepSeek-V3 と 
GLM-OCR の MTP 設計比較を横断的知見追加) / [[オープンLLM開発]](Zhipu AI 系統の追加) / [[コーディングエージェント評価]](CursorBench 関連) / [[sources/_index]] / [[wiki/index]] / [[wiki/hot]] / [[wiki/log]] / `.raw/.manifest.json` - Key insights: - **GLM 系統が単一論文ファミリーとして wiki に揃った**: 2022 ACL の自己回帰空白埋め目的関数(NLU/生成統一)→ 2025 GLM-4.5(ARC 統合・ハイブリッド推論モード・深さ優先設計)→ 2026 GLM-5(DSA + 非同期エージェント RL + コワーク能力)→ 2026 GLM-OCR(0.9B 小型 VLM で 235B モデル超え)。4 年間の単一研究グループ([[Zhipu AI]] / [[Tsinghua University]] [[Jie Tang]] 系)の漸進的進化が一望できる - **GLM-OCR が OCR の MTP 親和性を示した**: [[DeepSeek-V3]] の MTP(汎用テキスト用、$D=1$、独立 Transformer ブロック)に対し、GLM-OCR は**パラメータ共有ドラフトヘッド**で実装し、OCR の構造トークン局所性(表タグ・Markdown 構文)を活かし平均 5.2 トークン/ステップを達成。**OCR ドメインが MTP の効果を最大化するタスク特性を持つ**ことを実証 - **小型モデルでフロンティアモデル超え**: GLM-OCR の 0.9B が OmniDocBench v1.5 で 235B Qwen3-VL や Gemini-3 Pro を上回る 1 位を獲得。タスク特化型小型 VLM + 段階訓練 + GRPO RL のアプローチが汎用大型 VLM を超える可能性を示す - **DSA(DeepSeek Sparse Attention)が GLM-5 で次世代スパーシティ手法として採用**: [[Lightning Attention]](MiniMax-M1)・MoE エキスパートスパーシティ(Kimi K2)に続く第三のスパーシティ軸として DSA が 744B 規模で実装され、28.5T トークン訓練を可能にする - **非同期 RL インフラの台頭**: GLM-5 の slime フレームワーク(生成と訓練の分離)は MiniMax-M2 の Forge(Windowed FIFO + 接頭辞木マージ)と独立に同じ問題意識(長期エージェントロールアウトの GPU 利用率向上)に到達。エージェント RL インフラが独立した研究分野として確立しつつある - **Cursor との連携が GLM-5 評価で明示**: GLM-5 の coding 評価で [[CursorBench]] を用い、Composer 2.5 / Kimi K2.5 等の産業モデルとの比較を行う。学術 LLM ペーパーが産業ベンチマークを引用する流れの一例 ## [2026-06-17] ingest | CursorBench - How Cursor Evaluates Model Quality (Cursor Blog) - Source: `.raw/articles/cursorbench-2026-06-17.md` - Summary: [[@2026__Cursor__CursorBench - How Cursor Evaluates Model Quality]] - Pages created: [[@2026__Cursor__CursorBench - How Cursor Evaluates Model Quality]], [[コーディングエージェント評価]] - Pages updated: [[CursorBench]], [[Naman Jain]], [[Cursor]], [[SWE-Bench-Verified]] - Key insight: Cursor が CursorBench 3.1 のハイブリッド評価手法を公開。OpenAI は SWE-bench Verified 報告を停止(未解決問題の 60% にテスト欠陥)、内部ベンチマーク + オンライン評価への業界移行を示唆。 ## [2026-06-17] ingest-paper | アラート管理・時系列異常検知 10 本一括(NOMS2012-FSE2025) - Sources(10): - 
`.raw/papers/noms2012-situation.pdf`(MD5: 83a2ac7d1db5491fdf45f664da6bfed1、9p) - `.raw/papers/CIKM18-AlertR.pdf`(MD5: b080948f601693083336ca692f16b27f、9p) - `.raw/papers/2026_Unknown_Online_summarizing_alerts_semantic_behavior.pdf`(MD5: a754a1c843cd3c75178fbcb4f21f0efc、12p) - `.raw/papers/arxiv-2501.14170.pdf`(arXiv 2501.14170、20p) - `.raw/papers/arxiv-2502.17812.pdf`(MD5: e2a96c8f8244ef4b5d3adb2863b6d278、12p) - `.raw/papers/2026_Unknown_Ranking_importance_alerts_problem_determination.pdf`(MD5: e529bf8feb9302e4f52f0bcb0612ed13、10p) - `.raw/papers/kdd17p1067.pdf`(MD5: 702b37f74518e39425b2382aa204026e、9p) - `.raw/papers/cloud_20_dear.pdf`(MD5: 9dc8218a2b0420b16ac5dde8dbe98f8f、8p) - `.raw/papers/2026_Unknown_Alert_Summarization_Online_Service_Systems.pdf`(MD5: db8c192e842c5056cd1df2d4570c68a4、23p) - `.raw/papers/2026_Unknown_ChangeRCA_Finding_Root_Causes_Software.pdf`(23p) - Summaries: [[@2012__NOMS__Optimizing System Monitoring Configurations for Non-Actionable Alerts]] / [[@2018__CIKM__Collaborative Alert Ranking for Anomaly Detection]] / [[@2022__ICSE__Online Summarizing Alerts through Semantic and Behavior Information]] / [[@2025__arXiv__ARGOS - Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models]] / [[@2025__arXiv__Can Multimodal LLMs Perform Time Series Anomaly Detection]] / [[@2009__ICAC__Ranking the Importance of Alerts for Problem Determination in Large Computer Systems]] / [[@2017__KDD__Anomaly Detection in Streams with Extreme Value Theory]] / [[@2020__CLOUD__DEAR - Distributed Evaluation of Alerting Rules]] / [[@2025__FSE__Alert Summarization for Online Service Systems by Validating Propagation Paths of Faults]] / [[@2024__FSE__ChangeRCA - Finding Root Causes from Software Changes in Large Online Systems]] - Pages created(10 source + ~50 entity): [[Liang Tang]] / [[Tao Li]] / [[Florian Pinel]] / [[Larisa Shwartz]] / [[Genady Grabarnik]] / [[Florida International University]] / [[IBM T.J. 
Watson Research Center]] / [[St. John's University]] / [[Ying Lin]] / [[Zhengzhang Chen]] / [[Cheng Cao]] / [[Lu-An Tang]] / [[Wei Cheng]] / [[Zhichun Li]] / [[Kai Zhang (Temple University)]] / [[University of Houston]] / [[Temple University]] / [[Yile Gu]] / [[Yigong Hu]] / [[Baris Kasikci]] / [[Xiongxiao Xu]] / [[Haoran Wang]] / [[Yueqing Liang]] / [[Yue Zhao]] / [[Kai Shu]] / [[Illinois Institute of Technology]] / [[Emory University]] / [[University of Southern California]] / [[Guofei Jiang]] / [[Haifeng Chen]] / [[Kenji Yoshihira]] / [[Akhilesh Saxena]] / [[NEC Laboratories America]] / [[Alban Siffer]] / [[Pierre-Alain Fouque]] / [[Alexandre Termier]] / [[Christine Largouet]] / [[Amossys]] / [[Inria]] / [[IRISA]] / [[Univ. Rennes 1]] / [[AgroCampus]] / [[Mathias Mormul]] / [[Pascal Hirmer]] / [[Christoph Stach]] / [[Bernhard Mitschang]] / [[University of Stuttgart]] / [[Yuang He]] / [[Zilong He]] / [[Qiuyu Yan]] / [[Yu Luo (Tencent)]] / [[Fangyuan Li]] - Pages updated: [[IBM Research]] / [[Philip S. 
Yu]] / [[University of Illinois Chicago]] / [[Amazon]] / [[Kai Zhang]] / [[Jia Chen (Fudan)]] / [[Peng Wang (Fudan)]] / [[Wei Wang (Fudan)]] / [[Fudan University]] / [[Yifan Xiong]] / [[Jonathan Mace]] / [[Yuting Jiang]] / [[Peng Cheng]] / [[University of Washington]] / [[Microsoft Research]] / [[Guangba Yu]] / [[Pengfei Chen]] / [[Sun Yat-sen University]] / [[Tencent]] / [[Zibin Zheng]] / [[アラート管理]] / [[アラート集約]] / [[アラートストーム]] / [[アラート抑制]] / [[アラートフィルタリング]] / [[時系列異常検知]] / [[変更起因インシデント]] / [[根本原因分析]] / [[sources/_index]] / [[wiki/index]] / [[wiki/hot]] / `.raw/.manifest.json` - Key insights: - **Fudan アラート集約三部作の系譜が確定**: [[@2022__ICSE__Online Summarizing Alerts through Semantic and Behavior Information|OAS]](Chen+ ICSE2022、semantic+behavior の教師あり深層学習)→ [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems|DyAlert]](Chen+ ASE2023、動的グラフでアラート伝播モデル化)→ [[@2025__FSE__Alert Summarization for Online Service Systems by Validating Propagation Paths of Faults|ProAlert]](Chen+ FSE2025、教師なしで伝播パスのセマンティクスへ昇格)。同一 Fudan グループによる 3 年スパンの漸進的進化が一望できる - **EVT が現代アラートストーム検知のルーツ**: [[@2017__KDD__Anomaly Detection in Streams with Extreme Value Theory|SPOT/DSPOT]](Siffer+ KDD2017)が分布仮定不要・閾値不要のストリーム異常検知を確立し、これが [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ ICSE-SEIP 2020]] の Alert Storm 検知器の統計的根拠となった - **アラートランキング系統の対照構造**: [[@2009__ICAC__Ranking the Importance of Alerts for Problem Determination in Large Computer Systems|Jiang+ ICAC2009]](不変条件 + NTV ピアレビュー、教師なし) / [[@2018__CIKM__Collaborative Alert Ranking for Anomaly Detection|CAR(CIKM2018)]](Pitman-Yor 階層ベイズ + エンティティ埋め込み、教師なし統一最適化) / [[@2020__ISSRE__AlertRank - Automatically and Adaptively Identifying Severe Alerts for Online Service Systems|AlertRank(ISSRE2020)]](XGBoost、教師あり incremental)の 3 ルーツが「教師あり vs 教師なし」「最適化 vs 探索」軸で系統樹を描く - **IBM 系研究のアラート品質ロードマップ**: [[@2012__NOMS__Optimizing System Monitoring Configurations for Non-Actionable Alerts|Tang+ 
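NOMS2012]] (offline static rule optimization + delay) → [[@2024__ICSE-SEIP__Dynamic Alert Suppression Policy for Noise Reduction in AIOps|Bhukar+ ICSE-SEIP 2024]] (dynamic online suppression policy, unsupervised statistical learning). Over 12 years the work shifted from static to dynamic and from rule deployment to policy learning.
  - **Two paradigms for dividing roles between LLMs and TSAD**: [[@2025__arXiv__ARGOS - Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models|ARGOS(Gu+ arXiv2025)]] uses the LLM **only for rule generation at training time** and runs inference with the resulting rules (achieving explainability, reproducibility, and autonomy together, with inference latency improved by up to 34.3x). [[@2025__arXiv__Can Multimodal LLMs Perform Time Series Anomaly Detection|VisualTimeAnomaly(Xu+ arXiv2025)]] uses an MLLM as **the detector at inference time**; it beats numerical models at coarse granularity (range/variate) but falls far behind on point-wise anomalies (F1 capped at 8.12%). Ways of integrating LLMs into TSAD thus split fundamentally into "rule extraction at training time" and "detection at inference time".
  - **ChangeRCA formulates RCCA as a new concept**: [[@2024__FSE__ChangeRCA - Finding Root Causes from Software Changes in Large Online Systems|Yu+ FSE2024]] elevates the problem from the existing ACD (Abnormal Change Detection, judging how abnormal a change is) to RCCA (Root Cause Change Analysis, identifying the defective change among multiple changes). On WeChat production plus 81 kinds of simulation it reaches a Top-1 Hit Rate of 85% and reduces TTI by 90%.
  - **DEAR presents a "move where evaluation happens" approach**: [[@2020__CLOUD__DEAR - Distributed Evaluation of Alerting Rules|Mormul+ CLOUD2020]] uses the BET (binary expression tree) intermediate representation to distribute alert rule evaluation to VMs automatically. It is positioned as a "pre-firing accuracy improvement" intervention point, independent of "post-firing filtering" (Voutsas+ JCC2023 / Bhukar+ ICSE-SEIP2024).

---

## [2026-06-17] ingest-paper | Harp: Improving VPC Network Availability via Efficient Failure Detection and Rerouting in Tencent Cloud
- Source: `.raw/papers/nsdi26-hu-jiayu.pdf`
- Summary: [[@2026__NSDI__Harp - Improving VPC Network Availability via Efficient Failure Detection and Rerouting in Tencent Cloud]]
- Pages created: [[@2026__NSDI__Harp - Improving VPC Network Availability via Efficient Failure Detection and Rerouting in Tencent Cloud]] / [[Jiayu Hu]] / [[Feng Jin]] / [[Kai Zhang]] / [[VPCネットワーク可用性]]
- Pages updated: [[Tencent]] / [[Fudan University]] / [[グレイ障害]] / [[ネットワーク監視]]
- Key insight: Combining deterministic ECMP path control via UDP source ports with embedded in-band probes achieves sub-second VPC failure recovery without special hardware, in Tencent Cloud's production environment of several hundred thousand machines (downtime reduced by 78-99.97%).

---

## [2026-06-17] ingest-paper | Alert management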
3 本(Zha+ Electronics 2024 / VOCE FASE 2025 / SkyNet SIGCOMM 2025) - Sources: - `.raw/papers/2024__Electronics__LLM-Alert-Aggregation.pdf`(MD5: 7410c9606224d2598c09a3517be5427b) - `.raw/papers/978-3-031-90900-9_4.pdf`(MD5: e273f964f1f3bd342120ae21d88764e8) - `.raw/papers/sigcomm25-skynet.pdf`(MD5: a1a167024e384fe8b0a09a02f9429642) - Summaries: [[@2024__Electronics__Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs]] / [[@2025__FASE__VOCE - A Virtual On-Call Engineer for Automated Alert Incident Analysis Using a Large Language Model]] / [[@2025__SIGCOMM__SkyNet - Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures]] - Pages created(24): 3 source + 17 entity([[Junjie Zha]] / [[Xinwen Shan]] / [[Jiaxin Lu]] / [[Jiajia Zhu]] / [[Zihan Liu]] / [[State Grid Jiangsu Electric Power]] / [[Jia Chen (Fudan)]] / [[Xiaolei Chen]] / [[Jie Shi]] / [[Peng Wang (Fudan)]] / [[Wei Wang (Fudan)]] / [[Bo Yang]] / [[Huanwu Hu]] / [[Yifan Li]] / [[Tao Lin (Alibaba)]] / [[node2vec]] / [[Sentence-BERT]] / [[FT-tree]] / [[Eigenvector Centrality]]) + 4 concept([[アラートインシデント分析]] / [[LLMによる根本原因分析]] / [[サービス依存グラフ]] / [[ネットワーク監視]]) - Pages updated: [[アラート集約]](LLM 役割 3 分化・LLM 採用境界・3 段階階層化・時間順仮定否定の 4 横断的知見を追記) / [[アラートストーム]](severe failure 第三カテゴリと alert 内分類軸の独立性を追記) / [[Drain]](stub から実体ページに昇格) / [[Fudan University]](VOCE 参照追加) / [[Ennan Zhai]](SkyNet 共著追加) / [[Alibaba Cloud]](SkyNet 研究記述追加) / [[Dennis Cai]](SkyNet 共著追加) / [[sources/_index]] / [[entities/_index]] / [[concepts/_index]] / [[wiki/index]] / [[wiki/hot]] / `.raw/.manifest.json` - Key insights: - **LLM 採用/不採用の境界が failure severity × スケールで引かれる**: Zha+ 2024 と VOCE は cloud service スケール(数万 alerts、context 収容可能)で LLM ハイブリッドを選択。SkyNet は severe failure × 10⁵ デバイス × Syslog 10M/15min で LLM 不採用を選択し §2.3 で論理的根拠を明文化(context 超過 + ハルシネーション + ブラックボックス) - **LLM の "RCA 内役割" の 3 分化**: 外部知識リーダー(COLA、SOP)/ グラフマッパー(Zha+ 2024、SDG)/ 多因子分析+因果推論器(VOCE、System Topology + CoT)。同じ "LLM × アラート" でも入力知識と問いの粒度で別系統 - 
**「時間順=原因」仮定の独立否定**: VOCE Table 2(Order = 45.34%)と SkyNet §7.3(BGP link break が先、Syslog hardware error が遅延)で、Microsoft 系の eWarn ら時系列 RCA 系の暗黙仮定を実データで複数論文が反証 - **集約は時間 → 空間 → 因果の 3 段構造に収束**: Zha+(τ=15min → node2vec+SBERT → LLM × SDG)、SkyNet(timeout → location 階層 → SOP/manual)、VOCE(alert linking → source 内 → 隣接 source 間)で同形だが、各段の中身が dispatch される ## [2026-06-16] ingest-slides | Reliability in the Age of AI: Engineering for AI Velocity - Source: `.raw/slides/reliability-in-the-age-of-ai-engineering-for-ai-velocity/reliability-in-the-age-of-ai-engineering-for-ai-velocity.pdf` - Visual pages: `.raw/slides/reliability-in-the-age-of-ai-engineering-for-ai-velocity/pages/` (27 pages) - Media: none - Summary: [[@2026__SpeakerDeck__Reliability in the Age of AI - Engineering for AI Velocity]] - Pages created: [[@2026__SpeakerDeck__Reliability in the Age of AI - Engineering for AI Velocity]], [[Ryota Yoshikawa]], [[Topotal]], [[Waroom]] - Pages updated: [[SRE]], [[agentic SRE]], [[SRE AI Autonomy Levels]], [[サービスレベル目標]], [[エラーバジェット]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[wiki/index]], [[wiki/hot]], `.raw/.manifest.json` - Key insight: AI 時代の信頼性課題は「開発速度が上がる」こと自体ではなく、生成物の品質管理・本番での観測・SRE 判断のスケールが同時に追いつかなくなる点にある。SLI/SLO とエラーバジェットは、AI サービス固有 SLI と AI 補助承認ポリシーの制御信号へ拡張される。 ## [2026-06-16] ingest-paper x9 | アラート管理 9 論文一括取り込み(アラートストーム・抑制・集約・RCA・アクショナブル) - Sources: - `.raw/papers/2020__ICSE-SEIP__Zhao-Alert-Storm.pdf`(Zhao+ ICSE-SEIP 2020 - Understanding and Handling Alert Storm) - `.raw/papers/2020__ISSRE__AlertRank.pdf`(Zhao+ ISSRE 2020 - AlertRank) - `.raw/papers/arxiv-2309.07230.pdf`(Chakraborty+ arXiv 2023 - ESRO) - `.raw/papers/2024__JSS__Chen-Dynamic-Graph-Alert-Link.pdf`(Chen+ ASE 2023 - DyAlert; slug は誤りで実 venue は ASE 2023) - `.raw/papers/2023__CSCN__Voutsas-Filtering-Alerts.pdf`(Voutsas+ JCC 2023 - Filtering Alerts; slug 誤り) - `.raw/papers/2023__ICSE-SEIP__Zhang-Alert-Identification.pdf`(Zeng+ ICSE-SEIP 2023 - TraceArk) - 
`.raw/papers/2024__SAC__Bhukar-Dynamic-Alert-Suppression.pdf`(Bhukar+ ICSE-SEIP 2024 - Dynamic-X-Y; slug の SAC は誤り) - `.raw/papers/2024__CCGRID__AlertRCA.pdf`(Yu+ CCGRID 2024 - AlertRCA) - `.raw/papers/2024__ISSRE__SuperAgg.pdf`(Yuan+ ISSRE 2024 - SuperAgg) - Summaries: [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]]、[[@2020__ISSRE__AlertRank - Automatically and Adaptively Identifying Severe Alerts for Online Service Systems]]、[[@2023__arXiv__ESRO - Experience Assisted Service Reliability against Outages]]、[[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]]、[[@2023__JCC__Filtering Alerts on Cloud Monitoring Systems]]、[[@2023__ICSE-SEIP__TraceArk - Towards Actionable Performance Anomaly Alerting for Online Service Systems]]、[[@2024__CCGRID__AlertRCA - Causality Enhanced Graph Representation Learning for Alert-Based Root Cause Analysis]]、[[@2024__ICSE-SEIP__Dynamic Alert Suppression Policy for Noise Reduction in AIOps]]、[[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers]] - Pages created: source 9、entities 30+(主要: [[Nengwen Zhao]]・[[Junjie Chen]]・[[Yiru Chen]]・[[Sarthak Chakraborty]]・[[Yuqun Zhang]]・[[Zhaoyang Yu]]・[[Yuan Yuan]]・[[Tongqing Zhou]]・[[Karan Bhukar]]・[[Fotios Voutsas]]・[[John Violos]]・[[Aris Leivadeas]]・[[Saravan Rajmohan]]・[[Yongqian Sun]]・[[Zhen Dong]]・[[Xin Peng]]・[[Shubham Agarwal]]・[[Shaddy Garg]]・[[Shiv Saini]]・[[ESRO]]・[[AlertRank]]・[[AlertRCA]]・[[TraceArk]]・[[SuperAgg]]・[[Alibaba Group]]・[[Fudan University]]・[[IIT Kanpur]]・[[Stevens Institute of Technology]]・[[École de Technologie Supérieure]]・[[Netdata]]・[[National University of Defense Technology]]・[[BizSeer]])、concepts 3([[アラートストーム]]・[[アラート抑制]]・[[アクショナブルアラート]]) - Pages updated: [[アラート管理]](4 つの横断的知見・新しい問い 3 つ)、[[アラート集約]](4 つの横断的知見・新しい問い 2 つ)、[[アラートアンチパターン]](3 つの横断的知見・新しい問い 2 つ)、[[Quality of Alerts]](3 つの横断的知見・新しい問い 2 つ)、[[アラートフィルタリング]](2 つの横断的知見・新しい問い 3 つ)、[[Dan 
Pei]]、[[Qingwei Lin]]、[[Pooja Aggarwal]]、[[Saravan Rajmohan]]、[[Rohan Arora]]、[[IBM Research]]、[[Alibaba Group]]、[[sources/_index]]、[[entities/_index]]、[[concepts/_index]]、[[wiki/index]]、[[wiki/hot]]、`.raw/.manifest.json` - Key insight: 9 本は「抑制(発火前)・フィルタリング(クリック行動)・集約(クラスタリング/グラフ表現)・ランキング(severity/actionability)・RCA(アラートのみ)」の 5 介入点に分化し、Yu+ JNCA2024 の 3 プロセス分類では捕捉しきれない解像度に到達。HPC の連続的アラート過負荷とクラウドの断続的アラートストームは別問題で集約戦略が異なる(EVT 変化点検知 vs Apriori 階層パターン)。アラートのみで RCA を完結する系統(AlertRCA・ESRO)が手作業ルールに基づく Groot を上回り、観測データの中で「アラート系列」が他モダリティ不要なほどの信号密度を持つことを示した。 ## [2026-06-16] ingest-paper x5 | アラート管理・集約・予測の系譜 5 論文一括取り込み - Sources: - `.raw/papers/arxiv-2204.09670.pdf`(Yang+ DSN 2022 - Anti-patterns of Alerts) - `.raw/papers/2024__ICSE-SEIP__Kuang-Knowledge-aware-Alert-Aggregation.pdf`(Kuang+ ICSE-SEIP 2024 - COLA) - `.raw/papers/arxiv-2501.03547.pdf`(Singal+ arXiv 2025 - KIMetrix) - `.raw/papers/2014__KDD__Lin-Unveiling-clusters-of-events.pdf`(Lin+ KDD 2014 - Pivotal Alert Clustering) - `.raw/papers/2019__WWW__AirAlert-Chen-Outage-Prediction-Diagnosis.pdf`(Chen+ WWW 2019 - AirAlert) - Summaries: [[@2022__DSN__Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems]]、[[@2024__ICSE-SEIP__Knowledge-aware Alert Aggregation in Large-scale Cloud Systems - a Hybrid Approach]]、[[@2025__arXiv__Metric Criticality Identification for Cloud Microservices]]、[[@2014__KDD__Unveiling Clusters of Events for Alert and Incident Management in Large-Scale Enterprise IT]]、[[@2019__WWW__Outage Prediction and Diagnosis for Cloud Service Systems]] - Pages created (sources 5、entities 24、concepts 7): - sources: 上記 5 - entities (筆頭ほか新規): [[Tianyi Yang]]、[[Jiacheng Shen]]、[[Yuxin Su]]、[[Xiaoxue Ren]]、[[Jinxi Kuang]]、[[Jinyang Liu]]、[[Jiazhen Gu]]、[[Lan Yu]]、[[Rui Tan]]、[[Akanksha Singal]]、[[Kaustabha Ray]]、[[Divya Pathak]]、[[Felix George]]、[[Mudit Verma]]、[[Pratibha Moogi]]、[[IIIT Delhi]]、[[Derek Lin]]、[[Rashmi Raghu]]、[[Vivek Ramamurthy]]、[[Jin Yu]]、[[Regunathan 
Radhakrishnan]], [[Joseph Fernandez]], [[Pivotal Software]], [[Visa Inc]], [[Yujun Chen]], [[Hang Dong]]
  - concepts: [[Quality of Alerts]], [[アラートアンチパターン]], [[アラート集約]], [[COLA]], [[KIMetrix]], [[情報量基準メトリクス選定]], [[AirAlert]]
- Pages updated: [[アラート管理]], [[障害予測]], [[Michael R. Lyu]], [[Yongqiang Yang]], [[Junjie Huang]], [[Renyi Zhong]], [[Zengyin Yang]], [[IBM Research]], [[Qingwei Lin]], [[Hongyu Zhang]], [[Dongmei Zhang]], [[Yu Kang]]
- Key insights:
  - (Yang+ 2022 & Kuang+ 2024) The same CUHK + Huawei Cloud collaboration completed a two-stage problem-discovery → solution loop two years apart: "demonstrating the limits of SOPs" (77.8% of surveyed OCEs rated SOPs as Limited Help) → "reusing SOPs with an LLM" (COLA reaches F1 0.901-0.930).
  - (Lin+ 2014 & Kuang+ 2024) Alert aggregation and incident aggregation split into separate lineages, and the degree of text structure determines the method (semi-structured → Jaccard + graph-cut; unstructured → NMF/LLM). "Structure-preserving visualization" (2014) and "LLM explanation" (2024) are solutions, ten years apart, to the same problem of raising OCE acceptance.
  - (Chen+ 2019 & Yang+ 2022) AirAlert's Bayesian network + XGBoost hybrid achieves F1 53-88% on service-level outages, a setting where Simple Spike collapses (F1 7-11%). Its architecture, "a lightweight ML predictor plus structural dependency learning", precedes PAGER (2026) by seven years.
  - (Singal+ 2025) First to formulate the Informative Metric Subset Problem (NP-complete); entropy + mutual information + AIMD + topology awareness yields coverage exceeding SelectKBest/mRMR/Boruta/Max Weighted Clique. Establishes a line of research that automates metric selection itself as the stage "before" alert definition.
  - (Cross-cutting) The five papers span the full sequence of operations engineering stages, "alert anti-pattern identification (2022) → metric selection (2025) → alert/incident aggregation (2014, 2024) → outage prediction (2019)". At every stage the design pressure over "what to automate and what to leave to humans" converges on the same answer: "LLMs as the interface layer, lightweight ML/statistical methods as the prediction core".

## [2026-06-16] ingest-video | How We Debug 1000s of Databases with AI - Annie Zhou & Sophie Zhang, SREcon26 Americas (Databricks)
- Source: URL only (video download failed): https://www.youtube.com/watch?v=ibJ-MUgJyS0
- Transcript: `.raw/videos/youtube-ibJ-MUgJyS0/transcript.md` (converted from YouTube auto-generated captions, 100 sentences)
- Frames: none (video not downloaded)
- Summary: [[@2026__SREcon26 Americas__How We Debug 1000s of Databases with AI]]
- Pages created: [[@2026__SREcon26 Americas__How We Debug 1000s of Databases with AI]], [[Annie Zhou]], [[Sophie Zhang (Databricks)]], [[Databricks]], [[Storax]]
- Pages updated: [[agentic SRE]], [[データベース O&M]], [[データベース自律診断]]
- Key insight: AI
adoption presupposed prior tool centralization and empathy with users, and the approval gate was implemented in a workflow engine (Temporal) rather than in the model's internals.

## [2026-06-16] ingest-slides | A Theory and Practice of Alerting with Service Level Objectives - Jamie Wilkinson, SREcon18 Asia
- Source: `.raw/slides/srecon18asia-wilkinson-slo-alerting/srecon18asia-wilkinson-slo-alerting.pdf`
- Visual pages: `.raw/slides/srecon18asia-wilkinson-slo-alerting/pages/` (29 pages)
- Media: `.raw/slides/srecon18asia-wilkinson-slo-alerting/transcript.md` (Whisper small model, MP3 37MB, 406 lines)
- Summary: [[@2018__SREcon18 Asia__A Theory and Practice of Alerting with Service Level Objectives]]
- Pages created: [[@2018__SREcon18 Asia__A Theory and Practice of Alerting with Service Level Objectives]], [[Jamie Wilkinson]]
- Pages updated: [[エラーバジェット]], [[サービスレベル目標]]
- Key insight: An early 2018 formulation that implements SLO burn-rate alerting with the concrete Prometheus expression `delta(errors[1h]) > budget/burn_period`; together with the SRE Workbook from the same period, it gives two independent confirmations of the approach.

## [2026-06-16] ingest-slides | The WTF Problem - Nicole Forsgren, SREcon26 Americas
- Source: `.raw/slides/srecon26-forsgren/srecon26-forsgren.pdf`
- Visual pages: `.raw/slides/srecon26-forsgren/pages/` (37 pages)
- Media: `.raw/slides/srecon26-forsgren/media/download.mkv` (transcript: none, Whisper failed)
- Summary: [[@2026__SREcon26 Americas__The WTF Problem - Developer Experience as a Reliability Property]]
- Pages created: [[@2026__SREcon26 Americas__The WTF Problem - Developer Experience as a Reliability Property]], [[Nicole Forsgren]], [[Abi Noda]], [[DORA]], [[SPACE]], [[MTWTF]]
- Pages updated: [[SRE]] (cross-cutting insights added)
- Key insight: DX is not an "emotional issue" but a system reliability property that belongs to SRE; measuring it with the leading indicator MTWTF captures early signs of MTTR degradation. Friction is amplified in the AI era, so it has to be addressed in advance.

## [2026-06-16] ingest-paper | TimeGPT-1 (Garza, Challu, Mergenthaler-Canseco, Nixtla, arXiv:2310.03589)
- Source: `.raw/papers/arxiv-2310.03589.pdf`
- Summary: [[@2023__arXiv__TimeGPT-1]]
- Pages created: [[@2023__arXiv__TimeGPT-1]], [[Cristian Challu]], [[Max Mergenthaler-Canseco]], [[Nixtla]]
- Pages updated: [[Azul Garza]], [[時系列基盤モデル]],
[[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/index]], [[wiki/hot]] - Key insight: TimeGPT-1(2023)が「ゼロショット汎化・推論速度・コスト」という TSFM 競争軸を設定した起点であり、後続の Chronos・TimesFM・Toto はすべてこの枠内で競合している。 ## [2026-06-16] ingest | 佐藤竜馬「ジョイジョイジョイ」13 記事一括バッチ（2024-09〜2026-03、joisino.hatenablog.com） - Sources (13): - `.raw/articles/joisino-rinna-2024-09-30.md` → [[joisino-トランスフォーマーはRNN-2024]] - `.raw/articles/joisino-negation-2024-12-18.md` → [[joisino-否定文理解-2024]] - `.raw/articles/joisino-superai-2025-01-15.md` → [[joisino-超人的AIと認知不能情報-2025]] - `.raw/articles/joisino-theory-2025-03-17.md` → [[joisino-機械学習理論入門-2025]] - `.raw/articles/joisino-physics-2025-03-24.md` → [[joisino-言語モデルの物理学-2025]] - `.raw/articles/joisino-anna-2025-05-20.md` → [[joisino-アンナカレーニナの法則-2025]] - `.raw/articles/joisino-mislead-2025-06-23.md` → [[joisino-人間を騙すAI-2025]] - `.raw/articles/joisino-eureka-2025-08-28.md` → [[joisino-面白さ優先分類器-2025]] - `.raw/articles/joisino-kimoi-2025-10-27.md` → [[joisino-LLMのキモい算術-2025]] - `.raw/articles/joisino-onedata-2025-11-25.md` → [[joisino-訓練データ1個推論性能倍-2025]] - `.raw/articles/joisino-zeh-2026-01-26.md` → [[joisino-LLMの能力の穴-2026]] - `.raw/articles/joisino-llmsort-2026-02-09.md` → [[joisino-LLMでソート-2026]] - `.raw/articles/joisino-cognition-2026-03-16.md` → [[joisino-LLMと言葉の感じ方-2026]] - Pages created（source 13）: - [[joisino-トランスフォーマーはRNN-2024]], [[joisino-否定文理解-2024]], [[joisino-超人的AIと認知不能情報-2025]], [[joisino-機械学習理論入門-2025]], [[joisino-言語モデルの物理学-2025]], [[joisino-アンナカレーニナの法則-2025]], [[joisino-人間を騙すAI-2025]], [[joisino-面白さ優先分類器-2025]], [[joisino-LLMのキモい算術-2025]], [[joisino-訓練データ1個推論性能倍-2025]], [[joisino-LLMの能力の穴-2026]], [[joisino-LLMでソート-2026]], [[joisino-LLMと言葉の感じ方-2026]] - Pages created（entity 4）: - [[Zeyuan Allen-Zhu]], [[Yuanzhi Li]], [[Yann LeCun]], [[Meta FAIR]] - Pages created（concept ~50）: - [[Transformer]], [[RNN]], [[線形注意]], [[状態空間モデル]], [[カーネル法]], [[文脈内学習]], [[Physics of Language Models]], [[知識操作]], [[知識容量スケーリング則]], [[文脈自由文法]], [[LLM算術機構]], [[ヒューリスティックの束]], 
[[ロジットレンズ]], [[否定文理解]], [[テキスト埋め込み]], [[自然言語推論]], [[文脈付き検索]], [[ゼロエラー境界]], [[LLM評価]], [[LLM能力スパース性]], [[LLMアプリケーション信頼性]], [[AI検証可能性]], [[敵対的摂動]], [[帰属手法]], [[プラトン的表現仮説]], [[モデル表現収束]], [[モデル縫合]], [[暗黙的正則化]], [[アンサンブル学習]], [[ビジョン言語モデル]], [[汎化誤差バウンド]], [[集中不等式]], [[PAC学習]], [[カバリングナンバー]], [[深層学習の汎化]], [[1サンプルRLVR]], [[検証可能報酬による強化学習]], [[強化ファインチューニング]], [[報酬ハッキング]], [[RLHF誤誘導]], [[スコファンシ]], [[LLM自己検証]], [[LLMランキング]], [[LLM比較器]], [[pairwiseランキング]], [[一対比較ランキング]], [[面白さ優先分類]], [[好奇心駆動学習]], [[LLM意味表象]], [[認知意味論]], [[プロトタイプ意味論]] - Pages updated: - [[佐藤竜馬]], [[Anthropic]], [[機構的解釈性]], [[LLM向け情報検索]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]] - Key insights: - **Transformer ↔ RNN**: 線形注意で固定次元 RNN として書き下せる。訓練=並列・推論=定メモリの両モード切替が可能。 - **LLM 内部表象は「ヒューリスティックの束」**: 算術・知識記憶・知識操作は MLP ニューロンの粗い条件判定の積み重ね。[[ロジットレンズ]] が共通の解析手段。 - **次トークン予測の限界が複数視点で浮上**: 否定文理解の構造的限界、典型度順位相関の低さ([[LLM意味表象]] vs [[認知意味論]])、能力スパース性([[ゼロエラー境界]])、RLHF 誤誘導([[RLHF誤誘導]])。 - **学習理論 → 表現収束 → モデル算術の連結**: [[汎化誤差バウンド]] の崩壊と[[暗黙的正則化]]、[[プラトン的表現仮説]]、[[1サンプルRLVR]] の高品質少データ仮説が、「強いモデルはどれも似てくる」共通テーマで繋がる。 - Method notes: - 取得: defuddle parse --json で 13 URL を取得（hatenablog はサンドボックス外ネットワーク経由）。 - 並列度: claude-obsidian:wiki-ingest agent ×13 を並列起動。9 件は JSON 完了、4 件はサブエージェントが mid-work で停止したが、source ページ自体は全 13 件が作成済みであることをファイルシステム検証で確認。 - 規約: [[conventions]] §4 では `@YYYY__SOURCE__Title.md` だが、joisino 系は既存 3 件 (`joisino-LLMアテンションと外挿-2025` ほか) との一貫性を優先し `joisino-<Japanese-summary>-<year>.md` 形式で統一。@2025__joisino__絶対に分かる機械学習理論.md は ingest 後にリネーム。 ## [2026-06-16] ingest | 佐藤竜馬 — ICLR 2024 GNN 動向 & モデルパラメータ算術（joisino.hatenablog.com、2 記事バッチ） - Source 1: `.raw/articles/joisino-iclr2024-gnn-2024-05-15.md` - Source 2: `.raw/articles/joisino-model-parameter-arithmetic-2024-01-09.md` - Summary 1: [[joisino-ICLR-2024-GNN]] - Summary 2: [[joisino-モデルパラメータ算術-2024]] - Pages created: [[グラフニューラルネットワーク]], [[GNN同変性]], [[タスクベクトル]], [[モデルパラメータ算術]], [[joisino-ICLR-2024-GNN]], [[joisino-モデルパラメータ算術-2024]] - Pages 
updated: [[佐藤竜馬]], [[sources/_index]], [[concepts/_index]], [[entities/_index]], [[index]]
- Key insight: The two articles connect through a shared theme, "metanetworks that handle the permutation symmetry of MLPs via GNN equivariance", which sits at the intersection of model parameter arithmetic and graph learning.

## [2026-06-16] ingest | Ryoma Sato - LLM Attention and Extrapolation (joisino.hatenablog.com)
- Source: `.raw/articles/joisino-llm-attention-heads-2025-09-29.md`
- Summary: [[joisino-LLMアテンションと外挿-2025]]
- Pages created: [[joisino-LLMアテンションと外挿-2025]], [[佐藤竜馬]], [[アテンションヘッド]], [[帰納ヘッド]], [[機構的解釈性]], [[関数ベクトル]], [[反復ヘッド]]
- Pages updated: [[National Institute of Informatics]], [[entities/_index]], [[concepts/_index]], [[sources/_index]], [[index]], [[hot]]
- Key insight: LLM attention heads differentiate into seven functional types (grammar, sink, sequential, retrieval, induction, function vector, iteration) that emerge naturally from training optimization. Even when a model extrapolates at the surface level, it only interpolates at the meta level of algorithms.

## [2026-06-16] ingest-video | Michelle Brush - Taming the Unpredictable: Reliability in Chaos
- Source: `.raw/videos/youtube-DqpcVQIs3G8/media/video.mp4`
- Transcript: `.raw/videos/youtube-DqpcVQIs3G8/transcript.md`
- Frames: `.raw/videos/youtube-DqpcVQIs3G8/frames/`
- Summary: [[@2026__SREcon26 Americas__Taming the Unpredictable - Reliability in Chaos]]
- Pages created: [[@2026__SREcon26 Americas__Taming the Unpredictable - Reliability in Chaos]], [[Michelle Brush]]
- Pages updated: [[SRE]], [[agentic SRE]], [[LLMアプリケーション信頼性]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]]
- Key insight: AI agents speed up SRE work but also increase system complexity, so generic mitigations, experimentation, risk-first development, and continuous verification need to be placed at the center of SRE.

## [2026-06-16] ingest-paper | Goldschmidt+2014 IEEE CLOUD - Scalability and robustness evaluation of time-series databases
- Source: `.raw/papers/Goldschmidt2014-IEEE-CLOUD-preprint.pdf`
- Summary: [[@2014__IEEE CLOUD__Scalability and Robustness of Time-Series Databases for Cloud-Native Monitoring of Industrial Processes]]
- Pages created: [[@2014__IEEE CLOUD__Scalability and Robustness of Time-Series Databases for Cloud-Native Monitoring of Industrial Processes]], [[Thomas Goldschmidt]], [[Anton Jansen]], [[Heiko Koziolek]], [[Jens Doppelhamer]], [[Hongyu Pei Breivold]], [[ABB Corporate
Research]], [[時系列データベース]], [[時系列データベースベンチマーク]], [[クラウドモニタリング]], [[OpenTSDB]], [[KairosDB]], [[Databus]]
- Pages updated: [[entities/_index]], [[concepts/_index]]
- Key insight: KairosDB (built on Cassandra) scaled almost linearly, reaching up to 403,500 values/second on 36 nodes; OpenTSDB could not be benchmarked reproducibly because HBase ran out of memory; Databus reached only about 1/10 of KairosDB's throughput.

## [2026-06-16] ingest-paper | Malviya+2014 ICDE - Main-memory OLTP recovery with command logging
- Source: `.raw/papers/Malviya-et-al.-2014---Rethinking-main-memory-OLTP-recovery.pdf`
- Summary: [[@2014__ICDE__Rethinking Main Memory OLTP Recovery]]
- Pages created: [[@2014__ICDE__Rethinking Main Memory OLTP Recovery]], [[Nirmesh Malviya]], [[Ariel Weisberg]], [[Samuel Madden]], [[Michael Stonebraker]], [[MIT CSAIL]], [[コマンドロギング]], [[VoltDB]], [[H-Store]]
- Pages updated: [[メインメモリデータベース]], [[concepts/_index]], [[entities/_index]]
- Key insight: ARIES physiological logging incurs non-negligible overhead in high-throughput main-memory OLTP. Command logging (recording only the transaction name and parameters) achieves 1.5× higher throughput on TPC-C, but recovery takes 1.5-5× longer.

## [2026-06-16] ingest-paper | Wu+2021 ISSRE - Identifying root-cause metrics with PatternMatcher
- Source: `.raw/papers/wch_ISSRE-1.pdf`
- Summary: [[@2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]]
- Pages created: [[@2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]], [[Canhua Wu]], [[Nengwen Zhao]], [[Dan Pei]], [[Tsinghua University]], [[BizSeer]], [[@2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems|PatternMatcher]], [[根本原因分析]], [[異常検知]]
- Pages updated: [[AIOps]], [[concepts/_index]], [[entities/_index]]
- Key insight: Root-cause metrics should satisfy two requirements, "anomalousness" and "interpretability (classification into 13 patterns)". PatternMatcher combines anomaly pattern classification with a 1-D CNN (F1=0.98) and weighted ranking to reach Avg@3=0.91, and was deployed in production at a real commercial bank.

## [2026-06-16] ingest-paper | Lu+2022 CCGrid - CauseRank: causal-inference-based performance diagnosis for OLTP databases
- Source: `.raw/papers/CCGrid2022-CauseRank.pdf`
- Summary: [[@2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP
Database Systems]]
- Pages created: [[@2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems]], [[Xianglin Lu]], [[Zeyan Li]], [[Shenglin Zhang]], [[Nankai University]], [[@2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems|CauseRank]], [[因果推論ベースRCA]], [[OLTPシステムアーキテクチャ]]
- Pages updated: [[Dan Pei]], [[Tsinghua University]], [[BizSeer]], [[concepts/_index]], [[entities/_index]]
- Key insight: CauseRank, an unsupervised method combining G-GES (causal discovery whose nodes are groups of metrics) with COPP (causality-oriented personalized PageRank), achieved 82.5% top-3 accuracy and MAR 2.13 on 97 Oracle production cases, substantially outperforming existing methods (e.g. MicroCause at MAR 3.95).

## [2026-06-16] ingest-paper | Xin+2022 arXiv - CausalRCA: fine-grained root cause localization for microservices
- Source: `.raw/papers/Xin-et-al.-2022---CausalRCA---Causal-Inference-based-Precise-Fine-grained-Root-Cause-Localization-for-Microservice-Applications.pdf`
- Summary: [[@2022__arXiv__CausalRCA - Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications]]
- Pages created: [[@2022__arXiv__CausalRCA - Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications]], [[Ruyue Xin]], [[Peng Chen]], [[Zhiming Zhao]], [[University of Amsterdam]], [[Xihua University]], [[@2022__arXiv__CausalRCA - Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications|CausalRCA]]
- Pages updated: [[因果推論ベースRCA]], [[根本原因分析]], [[マイクロサービスアーキテクチャ]], [[concepts/_index]], [[entities/_index]]
- Key insight: CausalRCA builds a weighted DAG with DAG-GNN (gradient-based causal structure learning) and ranks nodes with PageRank. It overcomes the linearity assumptions and ambiguity constraints of PC/GES/LiNGAM and achieves an average AC@3=0.719 on fine-grained root cause localization (an average 17% improvement over baselines).

## [2026-06-16] ingest-paper | Zhang+2015 TKDE - In-Memory Big Data Management and Processing: A Survey
- Source: `.raw/papers/zhang-tkde2015.pdf`
- Summary: [[@2015__TKDE__In-Memory Big Data Management and Processing - A Survey]]
- Pages created: [[@2015__TKDE__In-Memory Big Data Management and Processing -
A Survey]], [[Hao Zhang]], [[Gang Chen]], [[Beng Chin Ooi]], [[Kian-Lee Tan]], [[Meihui Zhang]], [[Singapore University of Technology and Design]]
- Pages updated: [[National University of Singapore]], [[Zhejiang University]], [[メインメモリデータベース]], [[index]], [[sources/_index]], [[entities/_index]], [[hot]]
- Key insight: A textbook-style survey (28 pages, 290 references) built around the claim that "overheads that were negligible in disk-based systems (system calls, the network stack, cache-line straddling) become the new bottlenecks in in-memory environments". It systematizes the central thesis that keeping data memory-resident alone yields only a few-fold speedup, and that a 100-fold speedup appears only after removing the heavyweight components (locking, WAL, B-trees, buffer management) that make up over 90% of the overhead. The end state of concurrency control is not a single correct answer but "a mix of lightweight mechanisms + partitioned single-threaded execution + HTM", and data overflow is organized as a three-way contrast of "user space / kernel space / hybrid".

## [2026-06-16] ingest | The C10K Problem (Dan Kegel, 1999)
- Source: `.raw/articles/c10k-2026-06-16.md`
- Summary: [[C10K-Problem]]
- Pages created: [[C10K-Problem]], [[C10K問題]], [[epoll]], [[kqueue]], [[Dan Kegel]], [[nginx]]
- Pages updated: [[index]], [[sources/_index]], [[hot]]
- Key insight: The 1999 insight that "whether a server can handle 10,000 concurrent connections is decided by the choice of I/O strategy, not by hardware" encouraged the design of Linux epoll and BSD kqueue and became the foundation of today's nginx / libuv / Tokio.

## [2026-06-16] ingest-paper | Tsubouchi+2022 IPSJ JIP - TCP/UDP socket-based dependency discovery (in-kernel flow bundling)
- Source: `.raw/papers/tsubouchi-ipsjjip-2022.pdf`
- Summary: [[@2022__IPSJ JIP__Low Overhead TCP-UDP Socket-based Tracing for Discovering Network Services Dependencies]]
- Pages created: [[Masahiro Furukawa]], [[ネットワーク依存性発見]]
- Pages updated: [[Yuuki Tsubouchi]], [[Ryosuke Matsumoto]], [[go-conntracer-bpf]], [[eBPF]], [[サービストポロジ]], [[index]], [[hot]]
- Key insight: A quantitative demonstration of in-kernel flow bundling: simply excluding the ephemeral port from the flow key changes what the number of transferred flows depends on from the "number of connections" to the "number of services", which keeps CPU overhead bounded by the number of services.

## [2026-06-16] ingest | Netflix Service Topology - From service silos to a unified real-time dependency map
- Source: `.raw/articles/netflix-service-topology-2026-05-29.md`
- Summary: [[@2026__Netflix TechBlog__From Silos to Service Topology - Why Netflix Built a Real-Time Service Map]]
- Pages created: [[サービストポロジ]], [[リアルタイム依存性マップ]], [[ブラスト半径]], [[IPCメトリクス]], [[Netflix]], [[Apache Pekko]]
- Pages updated: [[eBPF]],
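[[Apache Kafka]], [[index]], [[hot]]
- Key insight: Fusing three independent graphs (eBPF, IPC metrics, distributed traces) lets each compensate for the others in instrumentation coverage and level of detail; the roadmap has AI agents traversing the topology to perform automated RCA in the future.

## [2026-06-16] question | Created a new design proposal for a multimodal observability foundation model

Using wiki-query(deep), cross-referenced the characteristics of observability data (MELT), Transformer, TSFM, multimodal failure diagnosis, and LLM time-series approaches, and saved a new foundation model design proposal, **MELT-FM (Metrics-Events-Logs-Traces Foundation Model)**, as [[multimodal-observability-foundation-model]].
- **Core of the idea**: a gap in existing research. [[Toto]]/[[Falcon-X]] cover M only and forecasting only; [[TVDiag]]/[[TAMO]]/[[SCELM]] cover M+L+T but without pretraining and for diagnosis only; [[UModel]] assigns semantics but has no model; ARFBench covers M+QA only. A TSFM that natively pretrains on M+L+T+E together, preserves scale and semantics, and supports forecasting, detection, RCA, and QA on a single foundation remains unexplored.
- **Core of the novelty**: ① PathAttn (promotes the trace tree to a Transformer factor, a third axis alongside [[Falcon-X]]'s time × variate attention and [[Chronos-2]]'s Group Attention); ② baking [[UModel]]'s semantic grounding into pretraining as special tokens; ③ a corpus with four synchronized modalities collected via eBPF zero instrumentation, plus Multimodal-Mixup.
- **References built upon**: [[時系列基盤モデル]], [[マルチモーダル障害診断]], [[オブザーバビリティ]], [[テレメトリ]], [[Transformer]], [[LLM時系列アプローチ]], [[Contiguous Patch Masking]], [[エージェント型時系列予測]].
- **Open questions**: scaling laws for path-oriented data; delegating modality preference to post-training; whether the eBPF corpus can be published (differential privacy); semantic discrimination of "normal abrupt changes"; orthogonality with the multi-resolution design of [[@2025__arXiv__Cisco Time Series Model Technical Report|Cisco TSM]]; the competitive relationship with ATSF.

## [2026-06-16] ingest-paper batch | Batch ingest of 7 classic papers on distributed tracing, dependency discovery, and MTSAD

A parallel ingest brought seven papers into the wiki: **the ancestors of distributed tracing (X-Trace, Dapper), the lineage of service/network dependency discovery (Sherlock, Orion, the NSDMiner extension), causality-based RCA of the microservices era (Sieve), and a cold-start solution for MTSAD (JumpStarter)**. The foundational lineage "distributed systems observability → AIOps", which had not yet been ingested into `papers/`, is now covered by primary sources.
- **(1) X-Trace** [[@2007__NSDI__X-Trace - A Pervasive Network Tracing Framework]] ([[Rodrigo Fonseca]], [[George Porter]], [[Randy H.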
Katz|Randy Katz]], [[Scott Shenker]], [[Ion Stoica]]; [[University of California, Berkeley|UC Berkeley]] / [[ICSI]]; NSDI 2007): A design that fully describes the causal tree with just two principles (in-band propagation of task identifiers, out-of-band collection of reports) and two primitives, `pushDown()` and `pushNext()`. Its hallmark is incremental deployability: each administrative domain may apply its own collection and disclosure policy. **The direct ancestor of Dapper, Zipkin, and OpenTelemetry**.
- **(2) Dapper** [[@2010__Google__Dapper - A Large-Scale Distributed Systems Tracing Infrastructure]] ([[Benjamin H. Sigelman]], [[Luiz André Barroso]], [[Mike Burrows]] et al.; [[Google]]; 2010): Reconciles three design goals (low overhead, application-level transparency, ubiquitous deployment) through instrumentation of common libraries plus 1/1024 adaptive sampling. In production at Google for more than two years. Its data model of **spans / trace trees / annotations** established the de facto standard behind OpenTracing, W3C Trace Context, and OpenTelemetry.
- **(3) Sherlock** [[@2007__SIGCOMM__Towards Highly Reliable Enterprise Network Services via Inference of Multi-level Dependencies]] ([[Paramvir Bahl]], [[Ranveer Chandra]], [[Albert Greenberg]], [[Srikanth Kandula]], [[David Maltz]], [[Ming Zhang (Microsoft Research)|Ming Zhang]]; [[Microsoft Research]]; SIGCOMM 2007): Combines the Inference Graph (three states up/troubled/down plus multi-level dependencies), automatic dependency discovery from co-occurrence probabilities in packet traces, and Ferret inference to reach **90.66% fault localization accuracy**, **about 30 percentage points above** the 58.61% of the two-layer Shrink. Across 358 components of Microsoft's production network, **87% of faults concentrate in 16 components**. The representative source for fault localization based on service dependency inference.
- **(4) Orion** [[@2008__OSDI__Automating Network Application Dependency Discovery - Experiences, Limitations, and New Solutions]] ([[Xu Chen]], [[Ming Zhang (Microsoft Research)|Ming Zhang]], [[Z. Morley Mao]], [[Paramvir Bahl]]; [[University of Michigan]] / [[Microsoft Research]]; OSDI 2008): Discovers dependencies through "delay-spike-based analysis" using only packet headers and timing information (no payload parsing). **Cuts false positives by 10–95% relative to Sherlock and by 94–99% relative to eXpose**. As Sherlock's successor, it represents the next stage of passive-observation dependency discovery.
- **(5) NSDMiner extension (Peddycord+ LISA12)** [[@2012__LISA__On the Accurate Identification of Network Service Dependencies in Distributed Systems]] ([[Barry Peddycord III]], [[Peng Ning]], [[Sushil Jajodia]]; [[NC State University]] / [[George Mason University]]; LISA 2012): Replaces NSDMiner's ratio-based ranking with a **logarithm-based** one to cut false positives substantially, infers rarely used services from similar clusters, and automatically detects load-balancing and backup clusters to reduce the output candidates by **25–50%**. One of the high-water marks of the passive-observation dependency discovery lineage.
- **(6) Sieve** [[@2017__arXiv__Sieve - Actionable Insights from Monitored Metrics in Microservices]] ([[Jörg Thalheim]] et al.; [[TU Dresden]]; Middleware 2017 / arXiv:1709.06686): A two-stage platform: k-Shape clustering for a **10–100× reduction in metric dimensionality**, then Granger causality to estimate inter-component dependencies. Implemented on OpenStack/ShareLatex with **overhead reductions of 80% CPU / 90% storage / 50% network**. Demonstrates applications to autoscaling and RCA. An early foundation for causality-based RCA in the microservices era.
- **(7) JumpStarter** [[@2021__USENIX-ATC__Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems]] ([[Minghua Ma]] et al.; [[Sangfor Technologies]]; USENIX ATC 2021): Combines **compressed sensing (CS)**, shape-based clustering, and outlier-resistant sampling to deliver MTSAD that is **training-free with a 20-minute initialization**, beating the SOTA with an average F1 of **94.12%** over three datasets. A design-level answer to the "10–100 days of initialization" problem of learning-based MTSAD.
- **Cross-cutting findings**: X-Trace (2007) → Dapper (2010) are continuous in **how the causal tree is represented and where sampling sits**; Dapper §1 states explicitly that it is "conceptually similar to Magpie and X-Trace". Sherlock (SIGCOMM 2007) → Orion (OSDI 2008) → Peddycord+ (LISA 2012) are three consecutive generations of **passive-observation dependency discovery**, each advancing by incrementally reducing false positives against ground truth. Sieve (2017) is a fourth lineage that **solves the same dependency inference problem on the metrics side with Granger causality**, showing that in the microservices era the problem domain shifted from "between network services" to "between metric time series". JumpStarter (2021) is an independent axis that addresses the **initialization problem of anomaly detection** rather than dependency discovery, but it shares with Sieve the idea of **sidestepping the cold start of learning-based methods by design**.
- **Pages created**: 7 sources (above) + entities ([[Rodrigo Fonseca]] (updated), [[George Porter]], [[Randy H. Katz|Randy Katz]], [[Ion Stoica]], [[Scott Shenker]], [[University of California, Berkeley|UC Berkeley]], [[ICSI]], [[Benjamin H. Sigelman]], [[Luiz André Barroso]], [[Mike Burrows]], [[Paramvir Bahl]], [[Ranveer Chandra]], [[Albert Greenberg]], [[Srikanth Kandula]], [[David Maltz]], [[Ming Zhang (Microsoft Research)|Ming Zhang]], [[Xu Chen]], [[Z. Morley Mao]], [[University of Michigan]], [[Barry Peddycord III]], [[Peng Ning]], [[Sushil Jajodia]], [[NC State University]], [[George Mason University]], [[Jörg Thalheim]], [[TU Dresden]], [[Minghua Ma]], [[Sangfor Technologies]]) + concepts ([[トレースメタデータ伝搬]], [[因果トレーシング]], [[トレースコンテキスト]], [[低オーバーヘッドインストルメンテーション]], [[サービス依存性推論]], [[Inference Graph]], [[多層依存性]], [[ネットワーク障害管理]], [[受動観測ベース依存性推論]], [[トラフィック相関分析]], [[ネットワーク依存性発見|ネットワークサービス依存性発見]], [[メトリクス削減]], [[因果推論ベースRCA]], [[圧縮センシング異常検知]]).
- **Pages updated**: [[分散トレーシング]], [[トレースサンプリング]], [[Fault Localization]], [[ネットワーク依存性発見|サービス依存性発見]], [[根本原因分析]], [[マイクロサービスアーキテクチャ]], [[異常検知]], [[多変量時系列予測]] + all indexes + manifest.

---

## [2026-06-15] ingest-paper batch | Batch ingest of 6 foundational papers on observability and distributed databases

A parallel ingest brought **six classic foundational papers** on observability (dynamic instrumentation, packet filtering, tracing, sampling) and on distributed time series systems and databases into the wiki. The lineage behind current research on eBPF, DTrace, and distributed tracing is now covered by primary sources.

- **(1) BPF** [[@1993__USENIX__The BSD Packet Filter A New Architecture for User-level Packet Capture]] ([[Steven McCanne]], [[Van Jacobson]]; [[LBNL]]; USENIX Winter 1993): An in-kernel VM with CFG-based packet filter evaluation achieves a speedup of more than 20× over the then-dominant stack-based CSPF. It is the foundation of `tcpdump`/`libpcap` and the direct ancestor of the later **eBPF (extended BPF)**.
- **(2) DTrace** [[@2004__USENIX-ATC__Dynamic Instrumentation of Production Systems]] ([[Bryan Cantrill]], [[Michael Shapiro]], [[Adam Leventhal]]; [[Sun Microsystems]]; USENIX ATC 2004): Achieves dynamic instrumentation in production with **zero probe effect**. Probes are NOPs while disabled and instructions are rewritten only when a probe is enabled, which keeps observation overhead effectively at zero, and the D scripting language allows flexible aggregation. The conceptual and implementation source of eBPF, perf, SystemTap, and bpftrace.
- **(3) Weighted Sampling of Execution Traces** [[@2018__SoCC__Weighted Sampling of Execution Traces - Capturing More Needles and Less Hay]] ([[Pedro Las-Casas]], [[Jonathan Mace]], [[Rodrigo Fonseca]]; [[Microsoft Research]] / Brown; SoCC 2018; DOI:10.1145/3267809.3267841): Uniform sampling of distributed traces misses anomalous and rare paths; this work raises the retention rate of rare traces through **weighted sampling** driven by features of the trace structure. It underlies the current directions in trace sampling research (edge sampling, cardinality awareness, SLO-aware sampling). The PDF could not be retrieved, so the ingest used only the abstract and metadata (`confidence: medium`).
- **(4) Parallel Database Systems** [[@1992__CACM__Parallel Database Systems The Future of High Performance Database Systems]] ([[David DeWitt]], [[Jim Gray]]; CACM 35(6):85–98; 1992; DOI:10.1145/129888.129894): The classic manifesto that systematized the advantages of the **shared-nothing** architecture for parallel databases and **data partitioning** (range/hash/round-robin). The theoretical groundwork for later scalable analytics platforms such as MapReduce, Spark, Snowflake, BigQuery, and Monarch.
- **(5) TSM-Bench** [[@2023__PVLDB__TSM-Bench - Benchmarking Time Series Database Systems for Monitoring Applications]] ([[Abdelouahab Khelifati]], [[Mourad Khayati]], [[Djellel Difallah]], [[Philippe Cudré-Mauroux]]; [[NYU Abu Dhabi]] / [[University of Fribourg]]; PVLDB Vol.16): A time series database benchmark specialized for monitoring workloads. Provides evaluation axes for **[[Monarch]], [[Gorilla]], and observability workloads**.
- **(6) Anomaly Detection in Time Series: A Comprehensive Evaluation (TimeEval)** [[@2022__PVLDB__Anomaly Detection in Time Series - A Comprehensive Evaluation]] ([[Phillip Wenig]], [[Sebastian Schmidl]] et al.; [[Hasso Plattner Institute]]; PVLDB Vol.15): A large-scale evaluation framework for time series anomaly detection covering 71 algorithms × 976 datasets, released together with **[[GutenTAG]]**. A representative foundational benchmark for [[異常検知]] in this wiki.
- **Cross-cutting findings**: BPF and DTrace stand side by side as independent inventions of the same design principle, **dynamic instrumentation on an in-kernel VM** (BPF: dedicated to packet filtering, generalized later by eBPF; DTrace: aimed at general-purpose dynamic instrumentation from the start). Las-Casas+ 2018 is a third lineage that tackles the same "lower the cost of observation" problem **on the sampling side**, and as of 2018 it anticipated the convergence of today's eBPF-based observability stacks with sampling-aware tracers (Jaeger / Tempo / Datadog). The shared-nothing and partitioning ideas of DeWitt+Gray 1992 connect directly, roughly three decades later, to the evaluation axes of Monarch (2020, zone partitioning) and TSM-Bench (2023, monitoring database evaluation).
- **Pages created**: 6 sources ([[@1993__USENIX__The BSD Packet Filter A New Architecture for User-level Packet Capture]] / [[@2004__USENIX-ATC__Dynamic Instrumentation of Production Systems]] / [[@2018__SoCC__Weighted Sampling of Execution Traces - Capturing More Needles and Less Hay]] / [[@1992__CACM__Parallel Database Systems The Future of High Performance Database Systems]] / [[@2023__PVLDB__TSM-Bench - Benchmarking Time Series Database Systems for Monitoring Applications]] / [[@2022__PVLDB__Anomaly Detection in Time Series - A Comprehensive Evaluation]]) + entities ([[Steven McCanne]], [[LBNL]], [[Bryan Cantrill]], [[Michael Shapiro]], [[Adam Leventhal]], [[Sun Microsystems]], [[David DeWitt]], [[Abdelouahab Khelifati]], [[Mourad Khayati]], [[Djellel Difallah]], [[Philippe Cudré-Mauroux]], [[NYU Abu Dhabi]], [[Hasso Plattner Institute]], [[Phillip Wenig]], [[Sebastian Schmidl]], [[GutenTAG]]) + concepts ([[DTrace]], [[BPF]], [[プローブ効果]], [[カーネル内VM]], [[パケットフィルタリング]], [[シェアードナッシング]], [[データパーティショニング]], [[並列データベース]], [[時系列データベースベンチマーク]], [[時系列データ生成]], [[時系列異常検知ベンチマーク]]).
- **Pages updated**: [[Jonathan Mace]], [[Pedro Las-Casas]], [[Rodrigo Fonseca]], [[Jim Gray]], [[University of Wisconsin]], [[eBPF]], [[トレースサンプリング]], [[専用データベースシステム]], [[時系列データベース]], [[異常検知]], [[動的計装|動的インストルメンテーション]] + all indexes and manifest.

---

## [2026-06-15] wiki-query | Toto 2.0 comparison; LM vs TSFM decoder-only explainer

Saved two question-and-answer exchanges from the session to `wiki/questions/`. **Pages created**: 2 questions ([[Toto-2アーキテクチャ比較-他TSFMとの特徴]], [[LM-vs-TSFM-decoder-only-差異]]). **Pages updated**: [[index]] (2 entries added to the Questions section).

---

## [2026-06-15] ingest-paper | Monarch: Google's Planet-Scale In-Memory Time Series Database

- Source: `.raw/papers/6348.pdf` (PVLDB 13(12):3181–3194, Colin Adams et al., Google LLC)
- Summary: [[@2020__VLDB__Monarch - Google's Planet-Scale In-Memory Time Series Database]]
- Pages created: [[Borgmon]]
- Pages updated: [[Monarch]], [[時系列データベース]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]], [[log]], [[manifest]]
- Key insight: For a planet-scale monitoring TSDB, keeping data in memory is a design necessity because it avoids circular dependencies. The combination of FHI (suppressing 99.5% of fanout) and query pushdown (95% of queries completing within a zone) is the core technique that keeps the monitoring system operable at planet scale.

## [2026-06-15] ingest-paper | Chronos-2: From Univariate to Universal Forecasting

- Source: `.raw/papers/arxiv-2510.15821.pdf` (arXiv:2510.15821, 2025-10-17; Ansari+, AWS AI Labs et al.)
- Summary: [[@2025__arXiv__Chronos-2 - From Univariate to Universal Forecasting]]
- Pages created: [[Chronos-2]], [[Abdul Fatir Ansari]], [[Oleksandr Shchur]], [[Danielle C. Maddix]], [[Syama Sundar Rangapuram]], [[Michael Bohlke-Schneider]], [[George Karypis]], [[fev-bench]], [[GIFT-Eval]]
- Pages updated: [[Amazon Web Services]], [[AWS AI Labs]], [[Yuyang Wang]], [[University of California, San Diego]], [[時系列基盤モデル]], [[多変量時系列予測]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The first general-purpose TSFM that uses group attention to handle univariate, multivariate, and covariate-informed forecasting uniformly in zero-shot fashion. Instilling multivariate ICL through synthetic data (the multivariatizer) is empirical evidence that universal forecasting can work without real multivariate observations. A direct successor that effectively removes the "univariate only" limitation of Chronos (2024).

## [2026-06-15] ingest-paper | Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

- Source: `.raw/papers/arxiv-2409.16040.pdf` (arXiv:2409.16040, 2024-09-24 / ICLR 2025; Shi+, Princeton/Xiaohongshu et al.)
- Summary: [[@2025__ICLR__Time-MoE - Billion-Scale Time Series Foundation Models with Mixture of Experts]]
- Pages created: [[Xiaoming Shi]], [[Shiyu Wang]], [[Yuqi Nie]], [[Ming Jin]], [[Qingsong Wen]], [[Princeton University]], [[Tianjin University]], [[Griffith University]], [[Xiaohongshu Inc]], [[University of Freiburg]], [[Time-300B]]
- Pages updated: [[時系列基盤モデル]], [[多変量時系列予測]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: Scales a sparse-MoE decoder-only TSFM to 2.4B parameters (1.1B activated), cutting training cost by 78% and inference cost by 39% relative to a dense model with the same number of activated parameters. It set the premises for later TSFMs (the [[Toto]] family, [[Chronos-2]], and others) in two ways: an empirical validation of scaling laws for TSFMs, and the release of Time-300B, the largest public corpus at 309B points across 9 domains.

## [2026-06-15] ingest-paper | Chronos: Learning the Language of Time Series

- Source: `.raw/papers/arxiv-2403.07815.pdf` (arXiv:2403.07815, 2024-03-12; Ansari+, AWS AI Labs et al.)
- Summary: [[@2024__arXiv__Chronos Learning the Language of Time Series]]
- Pages created: [[Lorenzo Stella]], [[Yuyang Wang]], [[AWS AI Labs]], [[Andrew Gordon Wilson]], [[時系列トークナイゼーション]]
- 
Pages updated: [[Amazon Web Services]], [[時系列基盤モデル]], [[LLM時系列アプローチ]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: A **minimalist TSFM recipe**: mean scaling plus uniform quantization maps time series values to vocabulary tokens, and T5/GPT-2 architectures are pretrained on them unchanged. Initializing from LLM weights did not beat random initialization, leaving the negative finding that transferring an LLM's language knowledge gives no advantage on time series. Synthetic data generation with TSMixup and KernelSynth is the starting point of the synthetic-data-dependent line followed by later papers (including the multivariatizer of Chronos-2).

## [2026-06-15] wiki-query | Toto 2.0 vs 1.0 diff; quantile loss and interval forecasting

- Operation: saved the Q&A from the wiki-query session to questions/
- Pages created:
  - `wiki/questions/Toto-2.0-vs-1.0-差分.md`: version comparison (model size, inference scheme, output head, optimizer, scaling)
  - `wiki/questions/分位点損失と区間予測.md`: a chained Q&A covering how quantile loss works, what independent training means, the model structure, and the benefits of interval forecasting
- Updated: `wiki/index.md` (2 entries added to the Questions section)

## [2026-06-15] ingest-paper | From Pre-training to Post-training: A Survey on Time Series Foundation Models (Liu+ techRxiv 2026)

- Source: `.raw/papers/techrxiv.176978429.902358012Fv2.pdf`
- Summary: [[@2026__techRxiv__From Pre-training to Post-training - A Survey on Time Series Foundation Models]]
- Pages created: [[@2026__techRxiv__From Pre-training to Post-training - A Survey on Time Series Foundation Models]], [[Zhen Liu]], [[Qianli Ma]], [[Min Wu]], [[South China University of Technology]], [[Institute for Infocomm Research]]
- Pages updated: [[時系列基盤モデル]], [[強化ファインチューニング]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]]
- Key insight: The axis of TSFM surveys has widened from a pre-training focus to a three-dimensional taxonomy that includes post-training (SFT, collaborative PLC-MLC-HLC, and reinforcement in reasoning and non-reasoning forms). The survey documents an early stage in which GRPO, LoRA, and KD techniques, developed mainly for NLP and code generation, cross over into the TSFM domain. This wiki's findings on [[強化ファインチューニング]] (fragility of credit assignment, robustness of rule-based rewards, two-stage SFT+RL) can be reused as RL design guidance for TSFMs.

## [2026-06-15] ingest-paper | A Decoder-Only Foundation Model for Time-Series Forecasting (original TimesFM paper)

- Source: `.raw/papers/arxiv-2310.10688.pdf`
- Summary: [[@2024__arXiv__A Decoder-Only Foundation Model for Time-Series Forecasting]]
- Pages created: [[@2024__arXiv__A Decoder-Only Foundation Model for Time-Series Forecasting]]
- Pages updated: [[TimesFM]], [[Abhimanyu Das]], [[Rajat Sen]], [[Google Research]], [[時系列基盤モデル]], [[スケーリング則]], wiki/sources/_index, wiki/index, wiki/hot, wiki/log
- Key insight: A 200M decoder-only Transformer, with output patches longer than input patches (fewer autoregressive steps) and pretraining on about 100B points (Google Trends / Wikipedia / synthetic), reaches zero-shot accuracy on par with supervised SOTA. Model error decreases monotonically across 17M/70M/200M, forming the prehistory of TSFM scaling laws.

## [2026-06-15] ingest-paper | One Fits All - Power General Time Series Analysis by Pretrained LM (FPT)

- Source: `.raw/papers/arxiv-2302.11939.pdf`
- Summary: [[@2023__NeurIPS__One Fits All - Power General Time Series Analysis by Pretrained LM]]
- Pages created: [[@2023__NeurIPS__One Fits All - Power General Time Series Analysis by Pretrained LM]], [[Frozen Pretrained Transformer]]
- Pages updated: [[Tian Zhou]], [[Rong Jin]], [[Liang Sun]], [[時系列基盤モデル]], [[LLM時系列アプローチ]], [[多変量時系列予測]], [[異常検知]], wiki/sources/_index, wiki/concepts/_index, wiki/index, wiki/hot, wiki/log
- Key insight: Freezing GPT-2's self-attention and feedforward layers and training only the positional embeddings (FPT) yields SOTA on 7 time series tasks. **Transfer from image pretraining (BEiT) also works**, which shows that cross-domain knowledge transfer is general and not limited to language → time series. As theoretical grounding, the paper argues that gradient minimization of self-attention is equivalent to PCA.

## [2026-06-15] ingest-paper | Large Language Models Are Zero-Shot Time Series Forecasters (LLMTime)

- Source: `.raw/papers/arxiv-2310.07820.pdf`
- Summary: [[@2023__NeurIPS__Large Language Models Are Zero-Shot Time Series Forecasters]]
- Pages created: [[@2023__NeurIPS__Large Language Models Are Zero-Shot Time Series Forecasters]], [[Nate Gruver]], [[Marc Finzi]], [[Shikai Qiu]], [[Andrew Gordon Wilson]]
- Pages updated: [[LLMTime]], [[New York University]], [[Carnegie Mellon University]], [[LLM時系列アプローチ]], [[時系列基盤モデル]], [[スケーリング則]], wiki/sources/_index, wiki/entities/_index, wiki/index, wiki/hot, wiki/log
- Key insight: With nothing more than tokenizing numbers as digit sequences, GPT-3 and LLaMA-2 70B match or beat dedicated models such as ARIMA, TCN, and N-HiTS at zero-shot time series forecasting (29 datasets from Darts/Monash/Informer). Extrapolation works because **the LLM's simplicity bias (an Occam's razor prior) and repetition bias line up structurally with seasonality and trend**. GPT-4 does worse than GPT-3 because of RLHF and a tokenization change, the first quantification of alignment damaging uncertainty calibration.

## [2026-06-15] ingest-paper | PromptCast - A New Prompt-based Learning Paradigm for Time Series Forecasting

- Source: `.raw/papers/arxiv-2210.08964.pdf`
- Summary: [[@2022__arXiv__PromptCast - A New Prompt-based Learning Paradigm for Time Series Forecasting]]
- Pages created: [[@2022__arXiv__PromptCast - A New Prompt-based Learning Paradigm for Time Series Forecasting]], [[Hao Xue]], [[Flora Salim]], [[PISA]]
- Pages updated: [[University of New South Wales]], [[LLM時系列アプローチ]], [[時系列基盤モデル]], [[多変量時系列予測]], wiki/sources/_index, wiki/entities/_index, wiki/index, wiki/hot, wiki/log
- Key insight: Template-based conversion of numeric sequences into natural-language sentences recasts time series forecasting as a sentence-to-sentence task, on which pretrained language models (Bigbird/Bart/LED) achieve RMSE and MAE equal to or better than numeric-only models (Transformer/Informer/Autoformer). Zero-shot generalization far exceeds that of the numeric models, and PISA (311,932 instances in three subsets: temperature, electricity, human mobility) becomes one of the earliest LLM × time series benchmarks.

## [2026-06-15] ingest-paper | Toto 2.0: Time Series Forecasting Enters the Scaling Era

- Source: `.raw/papers/arxiv-2605.20119.pdf`
- Summary: [[@2026__arXiv__Toto 2.0 - Time Series Forecasting Enters the Scaling Era]]
- Pages created: [[@2026__arXiv__Toto 2.0 - Time Series Forecasting Enters the Scaling Era]], [[Eden Belouadah]], [[Marc Cenac]], [[Xunyi Zhao]], [[Viktoriya Zhukova]], [[Othmane Abou-Amal]], [[NorMuon]], [[TIME]]
- Pages updated: [[Toto]], [[BOOM]], [[GIFT-Eval]], [[u-μP]], [[Contiguous Patch Masking]], [[スケーリング則]], [[時系列基盤モデル]], wiki/sources/_index, wiki/entities/_index, wiki/index, wiki/log, wiki/hot
- Key insight: Toto 2.0 is the first to demonstrate a reliable scaling law for TSFMs (monotonic improvement from 4M to 2.5B); NorMuon resolves the problems that arise in combination with pinball loss; and u-μP is applied to a TSFM for the first time.

## [2026-06-15] ingest | Large Language Models for Time Series Data (時系列データのための大規模言語モデル; Zenn, tsurubee)

- Source: `.raw/articles/zenn-llm-for-time-series-2024-07-10.md`
- Summary: [[@2024__Zenn__tsurubee__LLM-for-Time-Series]]
- Pages created: [[@2024__Zenn__tsurubee__LLM-for-Time-Series]], [[Hirofumi Tsuruta|tsurubee]], [[LLM時系列アプローチ]]
- Pages updated: [[時系列基盤モデル]], [[SAKURA Internet]], [[index]], [[hot]], [[log]]
- Key insight: Classifies LLM × time series work into five approaches: Prompting, Quantization, Aligning, Vision, and Tool. One Fits All (frozen GPT-2, training only the positional embeddings) showed that pretraining in other domains (language, images) transfers to time series, and thereby became the prehistory of dedicated TSFM research.

## [2026-06-15] ingest-paper | Production-Grounded Benchmarks for AI Code Optimization (Datadog blog, DODO)

- Source: `.raw/articles/dodo-production-grounded-code-optimization-2026-06-08.md`
- Summary: [[@2026__Datadog__Production-Grounded Benchmarks for AI Code Optimization]]
- Pages created: [[@2026__Datadog__Production-Grounded Benchmarks for AI Code Optimization]], [[DODO]], [[Junaid Ahmed]], [[Piotr Bejda]], [[本番接地型ベンチマーク]]
- Pages updated: [[Datadog]], [[エージェント型コーディング]], [[継続的プロファイリング]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[wiki/index]], [[wiki/hot]], [[wiki/log]]
- Key insight: Generating production-grounded benchmarks from CPU profiles plus real invocations captured by Live Debugger surfaced an optimization opportunity that synthetic benchmarks cannot see (a fast path in NormalizeTags that depends on the share of uppercase characters in the input) and cut the CPU cost of a mature Go service by 8%+.

## [2026-06-15] ingest | Toto 2.0: Time Series Forecasting Enters the Scaling Era (Datadog blog, arXiv:2605.20119)

- Source: `.raw/articles/toto-2-2026-06-15.md`
- Summary: [[@2026__Datadog__Toto-2.0-Time-Series-Forecasting-Enters-the-Scaling-Era]]
- Pages created:
  - [[@2026__Datadog__Toto-2.0-Time-Series-Forecasting-Enters-the-Scaling-Era]]
  - [[Emaad Khwaja]]
  - [[Gerald Woo]]
  - [[Chris Lettieri]]
  - [[David Asker]]
  - [[u-μP]]
  - [[Contiguous Patch Masking]]
- Pages updated:
  - [[Toto]] (added a v2.0 section; extended aliases/related)
  - [[Datadog]] (added a mention of Toto 2.0)
  - [[Ameet Talwalkar]] (noted authorship on Toto 2.0)
  - [[時系列基盤モデル]] (added scaling-era findings and the source)
  - [[index]] (total +7, source +1)
  - [[hot]]
- Key insight: An observability-focused TSFM delivers the first full-fledged scaling demonstration: monotonic improvement from 4M to 2.5B with no saturation. Single-pass inference via CPM and transfer via u-μP improve accuracy and latency at the same time.

## [2026-06-15] ingest-paper-batch | Three foundational 2022/2024 papers on causal-inference-based RCA and LLM-based RCA

Three papers ingested in parallel (the two pillars of causal-inference RCA plus one LLM-based RCA system running in production). Each subagent stopped partway after hitting its tool-use limit, so the post-processing (common concepts, index, log, hot, manifest) was completed serially on the main thread.

### Chen+ EuroSys 2024: Automatic Root Cause Analysis via Large Language Models for Cloud Incidents (RCACopilot)

- Source: `.raw/papers/arxiv-2305.15778.pdf` (md5 6053891aed97c3a055dd21c3c23c8ae9)
- Summary: [[@2024__EuroSys__Automatic Root Cause Analysis via Large Language Models for Cloud Incidents]]
- Pages created:
  - [[@2024__EuroSys__Automatic Root Cause Analysis via Large Language Models for Cloud Incidents]]
  - [[Yinfang Chen]] / [[RCACopilot]]
- Pages updated:
  - [[根本原因分析]], [[RCA入力選別]], [[TSG自動化]] (added to cross-cutting findings: a three-way contrast placing RCACopilot beside CIRCA/RCD, the information-spectrum problem, and the handler abstraction as an ancestor of TSG automation; added "autonomous handler generation" and "handler DSL standardization" to open questions)
  - `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/index.md`, `wiki/log.md` (prepended)
- Key insight: The only end-to-end LLM-based cloud RCA system with a production record of 30+ teams and more than four years at Microsoft Transport. Per-alert-type handlers (DAGs of scope/query/mitigate action nodes) automatically collect multi-source diagnostic information → GPT-4 summarizes 2,000 tokens into 120–140 words → FastText embeddings + time-weighted k-NN (time decay coefficient 0.01) + few-shot CoT predict the root cause category. Micro-F1=0.766 / Macro-F1=0.533. The "information spectrum" problem: as the results for diagnostic information only (0.766), alert information only (0.379), and both mixed (0.525) show, too much information harms RCA as much as too little. It belongs to a different lineage from Ahmed+ ICSE 2023 in that it is an integrated system of handlers, automatic collection of diagnostic information, and LLM compression. In production operation, empirical handlers plus LLM compression are ahead of the formal causal inference of CIRCA/RCD.

### Li+ KDD 2022: Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition (CIRCA)

- Source: `.raw/papers/KDD22-CIRCA.pdf` (md5 9f4821d9792f73ec77b3b717534abfab)
- Summary: [[@2022__KDD__Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition]]
- Pages created:
  - [[@2022__KDD__Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition]]
  - [[Mingjie Li]] / [[Kanglin Yin]] / [[Xiaohui Nie]] / [[Wenchi Zhang]] / [[Kaixin Sui]] / [[CIRCA]]
- Pages updated:
  - [[Dan Pei]] (added CIRCA to authored papers)
  - [[因果推論ベースRCA]] (added to cross-cutting findings: CIRCA's formulation via the Pearl Causal Hierarchy, the contributions of the Structural Graph + RHT + descendant adjustment, and the design contrast with RCD along "domain knowledge vs scalability"; added "validating descendant adjustment on other systems" and "limits of estimating L2 interventional knowledge from observations" to open questions)
  - [[根本原因分析]] (three-paper comparison of CIRCA/RCD/RCACopilot), [[RCA評価設計]] (the D_O oracle design; the degradation of CIRCA under multi-source input)
  - 
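`wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/index.md`, `wiki/log.md`
- Key insight: The first paper to formulate RCA as an "intervention recognition (IR)" task within Pearl's Causal Hierarchy (Theorem 3.4). Corollary 3.3 proves that IR belongs to L2 interventional knowledge, which shows that the counterfactual analysis (L3) of Sage (ASPLOS 2021) is unnecessary. It builds a structural graph from architectural knowledge (the call graph plus the four meta-metrics Traffic/Saturation/Latency/Errors), scores residuals with an SVR-based RHT, and uses descendant adjustment to correct for incomplete observation of the normal distribution. On 99 Oracle DB cases it reaches AC@1=0.404 (+25% over NSigma at 0.323) in 0.578 seconds. Published in the same year as RCD (2022), the two form a pair with contrasting designs: CIRCA = domain knowledge and accuracy; RCD = no domain knowledge and scalability.

### Ikram+ NeurIPS 2022: Root Cause Analysis of Failures in Microservices through Causal Discovery (RCD)

- Source: `.raw/papers/c9fcd02e6445c7dfbad6986abee53d0d-Paper-Conference.pdf` (md5 f2cfb3ec4919243c008d707f68906790)
- Summary: [[@2022__NeurIPS__Root Cause Analysis of Failures in Microservices through Causal Discovery]]
- Pages created:
  - [[@2022__NeurIPS__Root Cause Analysis of Failures in Microservices through Causal Discovery]]
  - [[Azam Ikram]] / [[Saurabh Bagchi]] / [[Murat Kocaoglu]] / [[Sarthak Chakraborty]] / [[Adobe Research]] / [[Purdue University]] / [[Sock Shop]] / [[RCD]]
- Pages updated:
  - [[因果推論ベースRCA]] (added to cross-cutting findings: RCD's modeling of failures as soft interventions, the use of distribution invariance via the F-NODE, the synergy of hierarchical and localized learning, and the observational limits in three AWS production outages (the latent-variable failure in Outage B; the cause lying outside the instrumentation scope in Outage C); added "Ψ-FCI extension (handling confounding)" and "optimizing the partition parameter γ" to open questions)
  - `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/index.md`, `wiki/log.md`
- Key insight: A causal-inference-based RCA method published in 2022 in parallel with CIRCA. Modeling a failure as a soft intervention on the root cause node complements CIRCA's intervention recognition (both put L2 interventional knowledge to practical use, but CIRCA scores the change in the interventional distribution directly, whereas RCD uses that change as a "filter" to localize the graph search). With the F-NODE ($F=0$ normal / $F=1$ failure), it runs conditional independence tests of $X \perp\!\!\!\perp F \mid \mathrm{Pa}_X$ to rule out nodes that are not the root cause. By avoiding learning the full causal graph, it handles 500 synthetic nodes in 22 seconds (about 400× faster than Ψ-PC, which takes over 150 minutes). It is the first causal-inference-based RCA work to publish case studies on production cloud outages (three at AWS); in Outage B it demonstrates a failure mode in which the Memcached hit ratio acts as a latent variable and the top-1 result misses, making the limits of the causal sufficiency assumption explicit as a boundary condition.

## [2026-06-15] ingest-paper | Remil+ arXiv 2024: AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review

- Source: `.raw/papers/arxiv-2404.01363.pdf` (md5 9e3b9de8ba73965e93ed8bf6459c8631)
- Summary: [[@2024__arXiv__AIOps Solutions for Incident Management]]
- Pages created:
  - [[@2024__arXiv__AIOps Solutions for Incident Management]]
  - [[Youcef Remil]]
  - [[Anes Bendimerad]]
  - [[Romain Mathonat]]
  - [[Mehdi Kaytoue]]
  - [[University of Lyon]]
  - [[INSA Lyon]]
  - [[CNRS]]
  - [[Infologic]]
- Pages updated:
  - [[AIOps]] (added to cross-cutting findings: the six-capability model, descriptive vs predictive models, the contamination zone, the three axes of interpretability, and the independent confirmation by two surveys of the skew in research density; added the related questions to open questions)
  - [[インシデント管理]] (added to cross-cutting findings: a reorganization along the 4-phase × 9-task procedure and the vertical axis of the four-layer Maintenance Strata; added classification/deduplication as independent tasks and the gap at the Business layer to open questions)
  - [[障害予測]] (added to cross-cutting findings: a restructuring that bundles offline and online under the Prevention capability, and the link between `Δt_p` and operational metrics; added the gap in LLM-era SDP to open questions)
  - `wiki/index.md`, `wiki/sources/_index.md`, `wiki/entities/_index.md`
- Key insight: Remil+ 2024 surveys AIOps for incident management independently of Notaro+ 2021 (TIST). Its six-capability model (Perception/Prevention/Detection/Location/Action/Interaction) and 4-phase × 9-task procedure provide an independent axis that subdivides and complements the structure this wiki has so far viewed through the AIOpsLab 4-level taxonomy. Main findings: (1) it differs from Notaro+ 2021 and Zhang+ 2015 by treating classification, deduplication, and correlation as independent tasks; (2) it spells out evaluation design pitfalls, namely the three dimensions of interpretability (internal/external/time consistency) and the contamination zone phenomenon in in-context evaluation; (3) it takes the distinctive position of recommending descriptive models (pattern mining, FCA) as equal partners of predictive models; (4) it is the first compendium to consolidate 40+ public datasets across application areas in a single table; (5) it reconfirms on different data the research density skew reported by Notaro+ 2021 (prevention 10.6% / remediation 2.5%), which strengthens the case that this is a structural skew of the AIOps research space and not a literature selection bias. Because the survey predates the LLM era (as of 2024-04), agentic SRE work is out of scope.

## [2026-06-15] ingest-paper | Hussain+ FSE 2026 industry: Attention Enhanced Entity Recommendation for Intelligent Monitoring in Cloud Systems

- Source: `.raw/papers/arxiv-2510.20640.pdf` (md5 1faab80848cf8f4fda2dd23f46c4bda4)
- Summary: [[@2026__FSE__Attention Enhanced Entity Recommendation for Intelligent Monitoring in Cloud Systems]]
- Pages created:
  - 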
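[[@2026__FSE__Attention Enhanced Entity Recommendation for Intelligent Monitoring in Cloud Systems]]
  - [[Anson Bastos]]
- Pages updated:
  - [[Fiza Husain]] (added "Fiza Hussain" as an alias; added FSE 2026 to authored papers)
  - [[Chetan Bansal]], [[Anjaly Parayil]], [[Ayush Choure]], [[Saravan Rajmohan]], [[Rujia Wang]] (added FSE 2026 to authored papers)
  - [[クラウドモニタリング]] (added to cross-cutting findings: the "dimension subset recommendation" layer provided by DiRecGNN, the limits of general-purpose HGNNs on sparse monitor graphs, and the operational requirements of similar-monitor explanations and end-to-end automation; added the static-graph assumption, the connection to threshold recommendation, and the integration of LLM (MonitorAssistant) × HGNN (DiRecGNN) to open questions)
  - `wiki/index.md`, `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`
- Key insight: The third installment in Microsoft's Intelligent Monitoring line. The product-and-paper cycle is now explicit: Ganatra+ 2023 demonstrated that monitors themselves are missing → Srinivas+ 2024 automated metric selection → Hussain+ 2026 automates dimension subset selection as an HGNN ranking problem. Because 94% of production monitors do not use every dimension their metric emits, dimension selection is the operational bottleneck. On sparse graphs where general-purpose HGNNs (SAGEConv and others) stay at HR@1 0.29–0.40, random-walk path attention plus an attention-head alignment loss lifts HR@1 to 0.597 (+55.8%). The user study confirmed explanation by similar monitors and end-to-end automation as industrial requirements.

## [2026-06-15] ingest-paper | Xiong+ USENIX ATC 2024: SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation

- Source: `.raw/papers/atc24-xiong.pdf` (597eee652e9c9777312e42bc3a58e0ee)
- Summary: [[@2024__USENIX ATC__SuperBench - Improving Cloud AI Infrastructure Reliability with Proactive Validation]]
- Pages created:
  - [[@2024__USENIX ATC__SuperBench - Improving Cloud AI Infrastructure Reliability with Proactive Validation]]
  - [[Yuting Jiang]]
  - [[Ziyue Yang]]
  - [[Lei Qu]]
  - [[Yongqiang Xiong]]
  - [[Lidong Zhou]]
  - [[SuperBench]]
  - [[グレイ障害]]
  - [[プロアクティブ検証]]
- Pages updated:
  - [[Yifan Xiong]] (noted as co-first author of SuperBench; status promoted to developing)
  - [[Peng Cheng]] (noted as SuperBench co-author)
  - [[GPUクラスタ運用]] (added two points to cross-cutting findings: the gradual decline of MTBI caused by redundancy, and optimizing validation time as part of TCO; added the redundancy budget as an open question)
  - [[GPUレジリエンス]] (added to cross-cutting findings the view that redundancy mechanisms harbor gray failures)
  - [[プロアクティブ障害管理]] (added the hybrid question of prediction-guided selective validation vs prediction-free full validation)
  - [[障害予測]] (added Cox-Time + action = the SuperBench Selector as a representative example for the AI era)
  - [[sources/_index]] / [[entities/_index]] / [[concepts/_index]] / [[index]] / [[hot]] / `.raw/.manifest.json`
- Key insight: Hardware redundancy in AI infrastructure (HBM row remapping, redundant GPU CUDA cores, over-provisioned IB) produces "gray failures" that mask degradation until just before an incident; on Azure A100, MTBI declines gradually from 719.4 hours at the first incident to 151.7 hours at the 20th. The proactive validation system SuperBench combines Cox-Time prediction, greedy benchmark selection, and CDF-similarity clustering to achieve, relative to full-set validation, 1.11× MTBI and a 92.07% reduction in validation time, and over two years of Azure production use it excluded 10.36% of nodes as defective. It is the first systematic framework to apply the operational principles "validation time is also TCO" and "measure in GPU hours, not counts" to the validation side as well. ATC '24 Best Paper.

## [2026-06-15] ingest-paper | Yang+ FSE 2026: TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud

- Source: `.raw/papers/arxiv-2506.01481.pdf` (78bf39c29a081051f49a11714d1d3153)
- Summary: [[@2026__FSE__TSGuard - Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud]]
- Pages created:
  - [[@2026__FSE__TSGuard - Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud]]
  - [[Yitao Yang]]
  - [[Yifan Xiong]]
  - [[Baochun Li]]
  - [[Peng Cheng]]
  - [[Microsoft Research]]
  - [[University of Toronto]]
  - [[TSGuard]]
  - [[RCACopilot]]
- Pages updated:
  - [[Yangtao Deng]] (added as TSGuard co-author; CUHK affiliation also noted)
  - [[Hong Xu]] (added as TSGuard co-author)
  - [[The Chinese University of Hong Kong]] (added TSGuard / Yitao Yang to related entities)
  - [[Microsoft Azure]] (added as the production data source for AI workloads; recorded the GPU skew of 52.47% and recurrence of 8.78)
  - [[インシデント管理]] (added to cross-cutting findings: the shift to a user-centric paradigm, the many-to-many relation between symptoms and causes with active verification, and semi-automatic taxonomy construction)
  - [[耐障害LLM訓練]] (added to cross-cutting findings: user-centric pre-ticket interception and its complement on the TTM axis)
  - [[AIOps]] (added the provider-centric/user-centric actor axis as an independent design dimension; added AI workload infrastructure as an independent sub-area)
  - [[Fault Localization]] (added to cross-cutting findings a third lineage based on taxonomy-guided DFS + active verification)
  - [[RCA評価設計]] (added to cross-cutting findings the difference in evaluation philosophy between production-derived oracles and synthetic-injection oracles)
  - [[sources/_index]] / [[entities/_index]] / [[index]] / [[hot]] / `.raw/.manifest.json`
- Key insight: For incident management of AI workloads (GPU training), existing research confined to the provider-centric paradigm leaves an inefficiency of 52.5 hours median TTM. A user-side pre-ticket interception layer, hierarchical-taxonomy-guided DFS, execution of active verification scripts, and five cooperating agents together achieve Micro F1=0.854 / Macro F1=0.816 (+19.8/43.6% over RCACopilot). A recurrence rate of 8.78, a repetitiveness specific to AI workloads, is what makes the quick path dominant at 51.4%.

## [2026-06-15] ingest-paper | Pham+ WWW Companion 2025: RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data

- Source: `.raw/papers/2026_Unknown_RCAEval_Benchmark_Root_Cause_Microservice.pdf` (fe102b182a1c1c1a1cbf98d7a1bb15ff)
- Summary: [[@2025__WWW Companion__RCAEval - A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data]]
- Pages created:
  - [[@2025__WWW Companion__RCAEval - A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data]]
  - [[Flora Salim]]
  - [[Xiuzhen Zhang]]
  - [[University of New South Wales]]
- Pages updated:
  - [[Luan Pham]] (added RCAEval 2025 co-authorship and the UNSW affiliation)
  - [[Hongyu Zhang]] (recorded the return to a University of Newcastle affiliation in RCAEval 2025)
  - [[Huong Ha]] (added RCAEval 2025 co-authorship)
  - [[RMIT University]] (updated as a co-authoring institution of RCAEval 2025; added Xiuzhen Zhang)
  - [[University of Newcastle]] (added Hongyu Zhang, re-affiliated in RCAEval 2025)
  - [[RCAEval]] (extended in the 2025 WWW Companion version to 735 cases, 11 fault types, and 15 baselines)
  - [[Sock Shop]] / [[Online-Boutique]] / [[Train-Ticket]] (noted as systems evaluated in RCAEval)
  - [[RCA評価設計]] (added to cross-cutting findings: the coverage-oriented direction, fine-grained evaluation axes, production-derived vs synthetic oracles, and more)
  - [[障害注入]] (added to cross-cutting findings: the first inclusion of code-level faults F1 to F5, and three-layer injection with stress-ng + tc + code modification)
  - [[因果推論ベースRCA]] (added to cross-cutting findings: multi-source input backfires for CIRCA/RCD, while the trace-based TraceRCA keeps pace)
  - [[Fault Localization]] (added to cross-cutting findings the per-modality performance differences in RCAEval Table 6)
  - [[マルチモーダル障害診断]] (added to cross-cutting findings: standardization of a benchmark that includes all three modalities, and the introduction of a code-level ground truth format)
  - [[sources/_index]] / [[entities/_index]] / [[index]] / [[hot]] / `.raw/.manifest.json`
- Key insight: In the microservice RCA benchmark area, the 2024 ASE version exposed the possible overestimation of prior methods by introducing a Dummy baseline. The 2025 WWW Companion version publishes a unified benchmark with three tracks (metrics / traces / multi-source), 11 fault types (resource, network, and code-level, the last included in an RCA dataset for the first time), three system scales (12/15/64 services), and 735 cases. It quantifies how naive multi-source input degrades some causal-inference methods (CIRCA: AC@1 0.32→0.06; RCD: 0.09→0.10), refuting on this benchmark the naive assumption that adding modalities always helps.

## [2026-06-15] ingest-paper | Hu+ NSDI 2024: Characterization of Large Language Model Development in the Datacenter

- Source: `.raw/papers/nsdi24-hu.pdf` 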
(3dc1f7275f9b19f938ddca0a690572f3) - Summary: [[@2024__NSDI__Characterization of Large Language Model Development in the Datacenter]] - Pages created: - [[@2024__NSDI__Characterization of Large Language Model Development in the Datacenter]] - [[Dahua Lin]] - [[Yonggang Wen]] - [[Nanyang Technological University]] - [[SenseTime Research]] - [[InternLM]] - [[AcmeTrace]] - Pages updated: - [[Acme]] (NSDI'24 を sources/related に追加・AcmeTrace 等の関連追加) - [[耐障害LLM訓練]] (LLM 診断主役化・2 段階 NCCL allgather・async checkpointing×CPU 余剰直結の 3 件を横断的知見に追加、出典に NSDI'24 を追加) - [[GPUクラスタ運用]] (LLM 専用クラスタの GPU 二極化と補助資源余剰の同時成立・Evaluation 逆転の 2 件を横断的知見に追加) - [[LLM学習モニタリング]] (LLM 診断器の最初期本番事例の系譜を横断的知見に追加) - [[sources/_index]] / [[entities/_index]] / [[index]] / [[hot]] / `.raw/.manifest.json` - Key insight: 2024 年の Acme は LLM クラスタの「Pretraining が件数 0.9〜3.2% で GPU 時間 69.5〜94.0%」「GPU 利用率 0/100% 二極化」「Infrastructure 障害が件数 11% で GPU 時間 82%超」を最初に並べた本番証跡で、LLM ベース診断・async checkpointing・2 段階 NCCL allgather など現代の Fault-tolerant Pretraining の主要モチーフを初実装で揃えた上流。 ## [2026-06-15] ingest-paper batch | 時系列推論 × RLVR 論文 5 本同時取り込み - Sources: - `.raw/papers/arxiv-2505.24511.pdf` (94cb0deef00bf9bf25593f1ff96a1ee3) - `.raw/papers/7801b29c93b599b8d0c44138596bdeed-Paper-Conference.pdf` (65bbfff718db4f5e2ccc447bc564b357) - `.raw/papers/arxiv-2509.24803.pdf` (537d09ad0bbfaf39d29b621b6ebbeeda) - `.raw/papers/arxiv-2409.11376.pdf` (32894ec4bcb4d50df7f2bac8a08208b6) - `.raw/papers/arxiv-2511.08947.pdf` (5ffba8001b85b61e8bdd87dfcb87c5c3) - Summaries: [[@2025__KDD__Can Slow-thinking LLMs Reason Over Time - Empirical Studies in Time Series Forecasting]] / [[@2025__NeurIPS__Time-R1 - Post-Training Large Vision Language Model for Temporal Video Grounding]] / [[@2026__ICLR2026__TimeOmni-1 - Incentivizing Complex Reasoning with Time Series in Large Language Models]] / [[@2024__arXiv__Towards Time-Series Reasoning with LLMs]] / [[@2025__arXiv__AlphaCast - A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series 
Forecasting]] - Pages created (concept 3 + source 5 + entity 16 主要): - concept: [[時系列推論]] / [[検証可能報酬による強化学習]] / [[時間的映像グラウンディング]] - entity: [[Jiahao Wang]] / [[Daoyu Wang]] / [[Tong Guan]] / [[Qin Jin]] / [[MiLM Plus]] / [[Xiaomi]] / [[Xiaohan Zhang]] / [[Tian Gao]] / [[Winnie Chow]] / [[Lauren Gardiner]] / [[Haraldur T. Hallgrimsson]] / [[Maxwell A. Xu]] / [[Shirley You Ren]] / [[Apple]] / [[Ming Jin]] / [[Shirui Pan]] ほか - Pages updated: [[Mingyue Cheng]] / [[Xiaoyu Tao]] / [[Qi Liu]] / [[Enhong Chen]] / [[University of Science and Technology of China]] / [[DeepSeek-R1]] / [[Renmin University of China]] / [[NVIDIA]] / [[OpenAI]] / [[Stanford University]](lint-stub 解消) / [[エージェント型時系列予測]] / [[文脈内学習]] / [[強化ファインチューニング]] / [[ビジョン言語モデル]] / [[時系列基盤モデル]] / [[時系列質問応答]] / [[エージェント型強化学習]] / [[sources/_index]] / [[entities/_index]] / [[concepts/_index]] / [[index]] / [[hot]] / [[log]] / manifest - Key insights: - TimeReasoner は **訓練不要 [[DeepSeek-R1]] で深層学習ベースラインと競合** する性能を達成し、タイムスタンプ削除で MSE 5.4→25.3 の劣化・CoT 過長で精度低下・温度 τ=0.6 がスイートスポットという反直感的知見を提示 - Time-R1 は **2.5K サンプル RL が 339K サンプル SFT-LoRA を超える** 圧倒的データ効率で RLVR の TVG ドメインへの展開を実証し、SFT の「偽陰性過剰ペナルティ」問題を RL が解消する構造を明示 - TimeOmni-1 は SFT(コールドスタート CoT)+ GRPO の二段階訓練で因果発見 **GPT-4.1 を ID 40.6%・OOD 28.1% 上回り**、ジョイント訓練の能力補完(意思決定 40.9%→47.9%)を実証 - Chow+ は時系列推論を「知覚→文脈化→演繹」に分解し、軽量パッチエンコーダ + LoRA で **7B が GPT-4o を超える** 知覚ボトルネックの定式化と回避策を提示 - AlphaCast は Investigator-Generator-Reflector の三段階で訓練不要 LLM を駆動し、**反省モジュール除去で非推論ベースラインより悪化** することから「推論は両刃、反省が物理整合性に不可欠」を実証 - USTC([[Mingyue Cheng]] グループ)は TimeReasoner(推論時)→ AlphaCast(Workflow)→ Cast-R1(AgenticRL)の三世代を同一グループ内で揃え、ATSF の 3 パラダイムを系列的に積み上げた ## [2026-06-15] ingest-paper | Position: The Inevitable End of One-Architecture-Fits-All-Domains in Time Series Forecasting - Source: `.raw/papers/arxiv-2602.01736.pdf` - Summary: [[@2026__arXiv__Position - The Inevitable End of One-Architecture-Fits-All-Domains in Time Series Forecasting]] - Pages created: [[@2026__arXiv__Position - The Inevitable 
End of One-Architecture-Fits-All-Domains in Time Series Forecasting]] / [[Qinwei Ma]] / [[Jingzhe Shi]] / [[Jiahao Qiu]] / [[Zaiwen Yang]] - Pages updated: [[Tsinghua University]] / [[Princeton University]] / [[時系列基盤モデル]] / [[エージェント型時系列予測]] / [[sources/_index]] / [[entities/_index]] / [[index]] / [[log]] - Key insight: 時系列予測における「汎ドメインアーキテクチャ vs ドメイン特化 SOTA」の矛盾は和解不能であると論証。近似誤差下界 O(1/√T) により NLP・CV のスケーリング則が TSF に適用できないことを理論的根拠とし、Kaggle コンペ上位手法・各ドメインのトップ論文が汎ドメイン TSF NN を使わないことを実証。解決策として LLM Scientist 型メタラーニングを提案し、これは ATSF の Workflow パラダイムと独立に収束した設計思想。 ## [2026-06-15] ingest | Google Cloud Blog: Where and how Google is deploying agentic AI to improve operations - Source: `.raw/articles/google-sre-agentic-ai-improve-operations-2026-06-15.md` - Summary: [[@2026__Google Cloud Blog__AI in SRE - Where Google is Deploying Agentic AI to Improve Operations]] - Pages created: [[@2026__Google Cloud Blog__AI in SRE - Where Google is Deploying Agentic AI to Improve Operations]] / [[AI Insights]] / [[Agent Development Kit]] / [[Gemini Enterprise Agent Platform]] - Pages updated: [[Google]] / [[TimesFM]] / [[agentic SRE]] / [[SRE AI Autonomy Levels]] / [[インシデント管理]] / [[アラート管理]] / [[異常検知]] / [[index]] / [[hot]] / `.raw/.manifest.json` - Key insight: Google SRE AI のスコープが SDLC 全体(reliability design / anomaly detection & alerting / incident management / incident investigation / insights & risk management)に広がる地図と、本番スタックの**外向き表記**(Gemini + Gemini Enterprise Agent Platform[旧 Vertex AI のリブランドが一次確認] + ADK + MCP + BigQuery + vector DB)、[[TimesFM]] による異常検知の組み込み、[[AI Insights]](Gemini embedding + vector DB で過去インシデントを連続知識化 + risk category 注釈)、IMAG への 4 種の agentic orchestration layer(コミュニケーション監視/SRE 間ハンドオフ/ポストモーテム下書き/内外通信)、エージェント設計の 9 原則(transparency over black-box automation を含む)が、whitepaper の社内コードネーム表記と対の公開製品表記として固定された。 ## [2026-06-15] ingest-paper | How Incidental are the Incidents? 
Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- Source: `.raw/papers/2026_Unknown_How_incidental_incidents.pdf`
- Summary: [[@2020__ASE__How Incidental are the Incidents - Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems]]
- Pages created: [[@2020__ASE__How Incidental are the Incidents - Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems]] / [[Junjie Chen]] / [[Shu Zhang]] / [[Xiaoting He]] / [[Dan Hao]] / [[Feng Gao]] / [[Zhangwei Xu]] / [[Yingnong Dang]] / [[University of Newcastle]] / [[DeepIP]] / [[インシデント優先順位付け]]
- Pages updated: [[Qingwei Lin]] / [[Hongyu Zhang]] / [[Dongmei Zhang]] / [[Yu Kang]] / [[Tianjin University]] / [[Microsoft]] / [[Microsoft Azure]] / [[Peking University]] / [[インシデント管理]] / [[アラート管理]] / [[インシデントTTM予測]] / [[sources/_index]] / [[entities/_index]] / [[concepts/_index]] / [[index]] / [[hot]] / `.raw/.manifest.json`
- Key insight: Using six months of production incidents from 18 Microsoft online services, the paper is the first to quantify that more than half of incidents (50.32%) are incidental incidents that can safely be ignored, yet they consume 55.05% of TTR. Table 1 shows the inversion that even at severity 0, incidental incidents account for 57.96%. [[DeepIP]] (AUC 0.808), an attention-based CNN that also takes in the 10 most recent related incidents, beats baselines repurposed from bug severity prediction on all 18 systems, demonstrating the advantage of a design that captures incident-specific temporal correlation.

## [2026-06-15] refactor | wiki/concepts index synchronization
- Pages updated: [[wiki/concepts/_index]] / [[index]]
- Summary: Against the 158 actual concept files, added the missing [[MRC]], [[Retroactive Sampling]], [[SRv6]], and [[マルチプレーンClosトポロジ]] to [[wiki/concepts/_index]], and added [[AI Greenferencing]], [[アラート管理]], [[インシデントTTM予測]], [[クラウドモニタリング]], [[サーバーレスワークフロー]], [[分散メッセージブローカ]], and [[変更起因インシデント]] to the Concepts section of [[index]].
- Verification: 158 actual concept files, 158 in the concept index, 158 under Concepts in the wiki index, index diff 0, duplicates 0. Known issues left as candidates for next time: 11 pages missing required headings, 14 missing frontmatter `sources`, and 28 dead wikilinks inside concept pages.

## [2026-06-14] ingest-paper | Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
- Source:
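`.raw/papers/Li-et-al.-2022---Going-through-the-Life-Cycle-of-Faults-in-Clouds---Guidelines-on-Fault-Handling.pdf`
- Summary: [[@2022__ISSRE__Going through the Life Cycle of Faults in Clouds - Guidelines on Fault Handling]]
- Pages created: [[@2022__ISSRE__Going through the Life Cycle of Faults in Clouds - Guidelines on Fault Handling]] / [[Xiaoyun Li]] / [[Hongyang Chen]] / [[Zhekang Chen]] / [[クラウド障害ライフサイクル]]
- Pages updated: [[Guangba Yu]] / [[Pengfei Chen]] / [[Sun Yat-sen University]] / [[Bizseer]] (added source/related) + [[運用障害分析]] (added to cross-cutting findings: quantified measured TTX values; the internal/external cause dichotomy) / [[インシデント管理]] (added to cross-cutting findings: the mitigation bottleneck, with MTTM at 53% of TTR) / [[根本原因分析]] (added to cross-cutting findings: misconfiguration is the most common cause at 31.6%; strong correlation between root cause and mitigation action) / [[障害緩和]] (added to cross-cutting findings: distribution of the 9 mitigation actions, TTM, and correlation with root cause) / [[障害注入]] (added to cross-cutting findings: 4 not-yet-injected categories derived from postmortem analysis) / [[オブザーバビリティ]] (added to cross-cutting findings: layered observability and on-demand switching of granularity) + sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Measured data from three major clouds quantifies for the first time that TTM accounts for more than half of TTR, showing that shortening mitigation is the bottleneck for improving availability. The strong correlation between root cause and mitigation action (misconfiguration → rollback, 51 cases) provides a design basis for per-category automatic recommendation, whereas hardware → replacement is spread evenly with little pattern and is set apart as an area where automation is hard to apply.

## [2026-06-14] ingest-paper | Automated Analysis of Distributed Tracing: Challenges and Research Directions
- Source: `.raw/papers/Bento-et-al.-2021---Automated-analysis-of-distributed-tracing---Challenges-and-research-directions.pdf`
- Summary: [[@2021__J Grid Computing__Automated Analysis of Distributed Tracing - Challenges and Research Directions]]
- Pages created: [[@2021__J Grid Computing__Automated Analysis of Distributed Tracing - Challenges and Research Directions]] / [[Andre Bento]] / [[Jaime Correia]] / [[Ricardo Filipe]] / [[Filipe Araujo]] / [[OpenTracing]] / [[OpenTracing Processor]] / [[トレース品質]]
- Pages updated: [[分散トレーシング]] (added 3 items to cross-cutting findings: diagnosing the quality ceiling before controlling volume, the OpenTracing specification hindering automated analysis, and the three-stage evolution of pre-LLM trace anomaly detection; added 2 open questions: whether improving trace quality lifts downstream analysis, and making OpenTelemetry a testability driver; added source) / [[異常検知]] (added to cross-cutting findings: the lineage of classical outlier detection on derived metric time series; added source) / [[オブザーバビリティ]] (added to cross-cutting findings: the early call-out of within-signal quality; added to open questions: standardizing OpenTelemetry quality metrics; added source) / [[Jorge Cardoso]] / [[University of Coimbra]] / [[Huawei Munich Research Center]] (noted participation in Bento+ 2021) + sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: As of 2021 (before LLMs and TSFMs became widespread), automated analysis of production OpenStack OpenTracing data reached the diagnosis that Isolation Forest could locate the anomalous time window and service, but digging into the why stops at the ceiling set by trace quality. That diagnosis is the starting point of the motivation for later AIOps work to pursue, in parallel with the discussion of controlling volume through sampling and compression (Hindsight/TraStrainer/Astraea/Mint/Tracezip), a discussion of securing the semantic quality of the data (instrumentation coverage, testability, annotation conventions). Five years after Bento+ criticized OpenTelemetry (2021) as mostly a merge effort with little redesign around a testability driver, the non-intrusive instrumentation research of DeepFlow/ChainScope recorded in this wiki can be read as an implementation-side response to the same data sufficiency problem.

## [2026-06-14] ingest-paper | CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data
- Source: `.raw/papers/acm-www2023-cmdiagstor.pdf`
- Summary: [[@2023__WWW__CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data]]
- Pages created: [[@2023__WWW__CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data]] / [[Bowen Hao]] / [[Mingjie Li]] / [[Xianglin Lu]]
- Pages updated: [[Zeyan Li]] (disambiguation) / [[Qingyang Yu]] / [[Changhua Pei]] / [[Shenglin Zhang]] / [[Nankai University]] / [[Dan Pei]] / [[Tencent]] / [[Fault Localization]] (added the cross-cutting finding that AmSit is the accuracy bottleneck, and an open question on RCNC failure modes) / [[根本原因分析]] (added the cross-cutting finding that the accuracy of the input representation is rate-limiting for RCA accuracy) / sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: The ambiguity (AmSit) that arises when call metric data (CMD) is aggregated, meaning a node that has two or more upstream calls and one or more downstream calls, is the main bottleneck for call-graph accuracy and can be resolved by non-negative linear regression on the upstream traffic series. The insight that the accuracy of the input graph caps RCA performance applies across fields.

## [2026-06-14] ingest-paper | A Conceptual Framework for System Fault Tolerance
- Source: `.raw/papers/1992_005_001_16112.pdf`
- Summary: [[@1992__CMU SEI__A Conceptual Framework for System Fault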
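Tolerance]]
- Pages created: [[@1992__CMU SEI__A Conceptual Framework for System Fault Tolerance]] / [[Walter Heimerdinger]] / [[Charles Weinstock]] / [[Honeywell]] / [[Software Engineering Institute]] / [[フォールトトレランス]]
- Pages updated: [[ソフトウェア耐障害性]] (added cross-cutting findings on the three-level FT scheme and the evolution of terminology) / [[ディペンダビリティ]] (mapping between fault evasion and the four means of Avizienis; added Heimerdinger+Weinstock 1992 as a source) / sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Fault evasion, defined in 1992, has been re-implemented as the proactive category of 2020s AIOps, so there are 30 years of conceptual continuity.

## [2026-06-14] ingest-paper | A Survey of Online Failure Prediction Methods
- Source: `.raw/papers/Salfner-et-al.-2010---A-survey-of-online-failure-prediction-methods.pdf`
- Summary: [[@2010__ACM CSUR__A Survey of Online Failure Prediction Methods]]
- Pages created: [[@2010__ACM CSUR__A Survey of Online Failure Prediction Methods]], [[Felix Salfner]], [[Maren Lenk]], [[Miroslaw Malek]], [[Humboldt University of Berlin]], [[プロアクティブ障害管理]], [[ソフトウェアエイジング]]
- Pages updated: [[障害予測]] (added Salfner+ 2010 to cross-cutting findings and open questions as 4 items, including "the original taxonomy of the pre-AIOps period", "the origin of the four time-axis parameters", and "prediction accuracy alone does not raise availability; automating the downstream countermeasures is the core") / [[ディペンダビリティ]] (added to cross-cutting findings: the extension of Avižienis 2004 with symptoms and the undetected/detected distinction, and the five-stage chain with its mapping to visualization techniques) / [[AIOps]] (the taxonomy was already established in the pre-AIOps period, and the proactive/reactive axis of Notaro+ 2021 is a repackaging of it) + sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: The online failure prediction taxonomy dates back to 2010: before the term AIOps became common, the structure of four main branches by input data type (failure tracking / symptom monitoring / detected error reporting / undetected error auditing) was already complete. The time-axis formulation `(t_d, t_l, t_p, t_w)` set out by Salfner+ 2010 and its evaluation metrics for rare events (precision/recall, F-measure, ROC/AUC) remain the common currency of AIOps evaluation, and the proactive/reactive axis of Notaro+ 2021 amounts to a repackaging of this survey's four branches. In addition, Salfner+ 2010's own diagnosis that, of the four stages of proactive fault management, only the first (prediction) has an accumulated body of research is consistent with the skew in research density that Notaro+ 2021 quantified 11 years later (remediation at 2.5%), indicating that the missing integrated PFM loop is still the core problem for AIOps.

## [2026-06-14] ingest | Practical Reliability Engineering
- Source: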
`.raw/books/practical-reliability-engineering-2012.pdf`
- Summary: [[@2012__Wiley__Practical Reliability Engineering]]
- Pages created: [[@2012__Wiley__Practical Reliability Engineering]], [[Patrick D. T. O'Connor]], [[Andre Kleyner]], [[Wiley]], [[Design for Reliability]], [[FRACAS]]
- Pages updated: [[ディペンダビリティ]], [[SRE]], [[ソフトウェア耐障害性]], sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Classical reliability engineering treats reliability as a time-dependent quality and prioritizes engineering judgment, understanding of failure modes, and corrective feedback over mathematical prediction. [[Design for Reliability]] moves reliability forward into early design, and [[FRACAS]] closes the loop from failure reporting to verifying that corrective action was effective. This serves as the product-reliability prehistory of [[SRE]]'s SLOs, incident management, and postmortems.

## [2026-06-14] ingest-paper | How Long Will it Take to Mitigate this Incident for Online Service Systems?
- Source: `.raw/papers/2021ISSRE_TTMPrediction_cameraReady1.pdf`
- Summary: [[@2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems]]
- Pages created: [[@2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems]], [[Weijing Wang]], [[Tianjin University]], [[インシデントTTM予測]]
- Pages updated: [[Hongyu Zhang]] (added Newcastle affiliation) / [[Qingwei Lin]] / [[Yu Kang]] / [[Saravan Rajmohan]] / [[Dongmei Zhang]] / [[インシデント管理]] (added 2 cross-cutting findings, 1 open question, and the source) + sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: The mitigation phase T3, which follows identification of the final responsible team, accounts for 70.20% of incident TTM on average. Using four years of data from 20 systems, the paper is the first to quantify that improving triage alone can shorten TTM only so far.

## [2026-06-14] ingest-paper | A Survey of Distributed Message Broker Queues
- Source: `.raw/papers/arxiv-1704.00411.pdf`
- Summary: [[@2017__arXiv__A Survey of Distributed Message Broker Queues]]
- Pages created: [[@2017__arXiv__A Survey of Distributed Message Broker Queues]], [[Vineet John]], [[Xia Liu]], [[University of Waterloo]], [[Apache Kafka]], [[AMQP]], [[RabbitMQ]], [[分散メッセージブローカ]]
- Pages updated: sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: An empirical survey that compares Kafka and AMQP directly on the same testbed (5-node Flotilla). Kafka's throughput advantage comes down to the SendFile API + sequential writes + the OS page cache + built-in batching; AMQP's latency advantage comes down to its push model + non-persistence by default. Their origins (log processing at LinkedIn vs. financial transaction processing) shaped their current design philosophies, and the application domain (loss-tolerant vs. loss-intolerant) is the selection criterion. Resource contention at multi-P/C scale is a weakness common to both.

## [2026-06-14] ingest-paper | An up-to-date survey in web load balancing
- Source: `.raw/papers/An_up-to-date_survey_in_web_load_balanci.pdf`
- Summary: [[@2011__World Wide Web__An up-to-date survey in web load balancing]]
- Pages created: [[@2011__World Wide Web__An up-to-date survey in web load balancing]], [[Katja Gilly]], [[Carlos Juiz]], [[Ramon Puigjaner]], [[Miguel Hernández University]], [[University of Balearic Islands]], [[Webロードバランシング]]
- Pages updated: sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Gilly et al. (World Wide Web 2011) systematize web load balancing as of 2010 along three axes (OSI layer, response return path, content awareness) and organize, across the literature, five TCP connection migration schemes and a three-way classification of distribution policies. The survey is positioned as the prehistory of the load-balancing functions that Kubernetes Service/Ingress and service meshes perform in today's microservices, and provides a point of contact with [[コンテナ配置最適化]] and [[マイクロサービスアーキテクチャ]].

## [2026-06-14] ingest | CNCF Serverless Overview Whitepaper v1.0
- Source: `.raw/articles/cncf-serverless-overview-whitepaper-2026-06-14.md`
- Summary: [[@2018__CNCF WG Serverless__Serverless Overview Whitepaper v1.0]]
- Pages created: [[@2018__CNCF WG Serverless__Serverless Overview Whitepaper v1.0]] / [[サーバーレスワークフロー]]
- Pages updated: [[CNCF]] (added a section on its serverless work) / [[サーバーレスアーキテクチャ]] (turned the official CNCF definition into a citation; added 3 cross-cutting findings; added 1 open question; added a third source; status seed→developing) / sources/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: The CNCF whitepaper (2018) defines serverless from an external, operational viewpoint (serverless = no server management), which stands alongside and complements the internal, architectural viewpoint of Yuuki Tsubouchi (2019), namely hiding two kinds of servers. The two do not contradict each other. It also confirms a two-year gap between the origin of the term serverless (IronWorker, 2012) and the practical arrival of FaaS (AWS Lambda, 2014), a typical pattern of concept first, technology later. The Function Workflow definitions (n:m event-to-Function mapping, 5 patterns, 6 states) connect to an open question about how they correspond to the tool-call chains of later LLM agents.

## [2026-06-14] ingest-paper | Cloud Container Technologies: A State-of-the-Art Review
- Source:
`.raw/papers/cloud-containers-tcc.pdf`
- Summary: [[@2019__TCC__Cloud Container Technologies - A State-of-the-Art Review]]
- Pages created: [[@2019__TCC__Cloud Container Technologies - A State-of-the-Art Review]], [[Claus Pahl]], [[Pooyan Jamshidi]], [[Free University of Bozen-Bolzano]], [[University of Pisa]], [[Docker]], [[LXC]], [[コンテナオーケストレーション]], [[体系的マッピング研究]]
- Pages updated: [[Antonio Brogi]], [[Jacopo Soldani]], [[Carnegie Mellon University]], [[マイクロサービスアーキテクチャ]] (added 1 cross-cutting finding), [[コンテナ配置最適化]] (added 1 cross-cutting finding), [[サーバーレスアーキテクチャ]] (added 1 cross-cutting finding), sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Pahl et al. (IEEE TCC 2019) systematize cloud container and orchestration research from 2007-2016 through a systematic mapping study (SMS) of 46 papers, and capture the structure of the time, with Docker and LXC dominant and Kubernetes emerging, in a four-axis classification framework (Technology Stack / Management Services / Architecture Setting / Tools/Platforms/Technology). The paper's historical value lies in (1) quantifying the prehistory of container research as a two-step acceleration, 2007 LXC → 2013 Docker → the cluster layer in mid-2015; (2) stating explicitly that failure management was unexplored as of 2017, which lets one work back the historical distance to [[@2021__TIST__A Survey of AIOps Methods for Failure Management]] and [[@2021__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey]], written in the following 8-9 years; and (3) formulating early on, as the structure of quality concerns, the polarization between SLA Parameters (consumer view) and Infrastructure Parameters (provider view). It forms the prehistory of the eBPF approach in [[@2020__SAC__Black-box inter-application traffic monitoring for adaptive container placement]], which digs into the infrastructure side, and of the edge observability requirements in [[@2022__IEEE ACCESS__A Survey on Observability of Distributed Edge & Container-Based Microservices]].

## [2026-06-14] ingest-paper | A survey on intelligent management of alerts and incidents in IT services
- Source: `.raw/papers/A-survey-on-intelligent-management-of-alerts-and-incidents-in-IT-services.pdf`
- Summary: [[@2024__JNCA__A survey on intelligent management of alerts and incidents in IT services]]
- Pages created: [[@2024__JNCA__A survey on intelligent management of alerts and incidents in IT services]], [[Qingyang Yu]], [[アラート管理]] -
Pages updated: [[インシデント管理]] (added 2 cross-cutting findings), [[Dan Pei]], [[Nengwen Zhao]], [[BizSeer]], sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Yu+ JNCA2024 presents a unified AIM architecture (Fig. 5) that separates alerts and incidents into distinct life cycles, and points to a future direction (Fig. 7) of integrating the three kinds of alert determination (distinguishing / severe ranking / alert-based incident identification) in series. This taxonomy marks the boundary of the pre-LLM era; LLM-era work such as AlertGuardian, FLASH, LLexus, and FlowXpert is evolving by joining the eight processes across one another.

## [2026-06-14] ingest-paper | Lindorm TSDB: A Cloud-native Time-series Database for Large-scale Monitoring Systems
- Source: `.raw/papers/p3715-zheng.pdf`
- Summary: [[@2023__PVLDB__Lindorm TSDB - A Cloud-native Time-series Database for Large-scale Monitoring Systems]]
- Pages created: [[@2023__PVLDB__Lindorm TSDB - A Cloud-native Time-series Database for Large-scale Monitoring Systems]], [[Lindorm TSDB]], [[Feifei Li]], [[Zhejiang University]]
- Pages updated: [[Dan Pei]], [[Alibaba Group]], [[時系列データベース]] (added 3 cross-cutting findings and 3 open questions), sources/_index / entities/_index / index.md / hot.md / log.md / manifest
- Key insight: The shared-nothing + shared-storage hybrid eliminates data movement when nodes are added by switching shard groups along the time axis, and achieves better-than-linear scaling of write throughput. Seriescache (a dedicated cache for the MD5-encoded forward index) is a practical answer to the high-dimensionality problem, and pre-downsampling is a concrete example of a design that pays the cost up front at write time so that queries cost nothing extra.

## [2026-06-14] ingest-paper | Mining Causality of Network Events in Log Data
- Source: `.raw/papers/2018TNSM-Mining-Causality-of-Network-Events-in-Log-Data.pdf`
- Summary: [[@2018__TNSM__Mining Causality of Network Events in Log Data]]
- Pages created: [[@2018__TNSM__Mining Causality of Network Events in Log Data]], [[Satoru Kobayashi]], [[Kazuki Otomo]], [[Kensuke Fukuda]], [[Hiroshi Esaki]], [[University of Tokyo]], [[National Institute of Informatics]], [[SINET4]], [[LogCausalAnalysis]]
- Pages updated: [[ログ解析]] (added a section on causal inference over network syslog, plus open questions), [[因果推論ベースRCA]] (added 2 cross-cutting findings and 1 open question), sources/_index / entities/_index / index.md / hot.md / log.md / manifest
- Key insight: The PC algorithm with the G-square test suits sparse, binary network syslog. After a periodicity filter based on Fourier analysis and linear regression filters out 93%, it narrows the edges that correspond to 74% of trouble tickets down to 5.3 edges per day. As early as 2018, the design principle that data sparsity determines the choice of conditional independence test had already been quantified.

## [2026-06-14] ingest-paper | A Survey of AIOps Methods for Failure Management
- Source: `.raw/papers/notaro-2021-aiops-survey.pdf` (MD5: `ad5ecae3c47f2c948b83191e31d3bd36`, ACM TIST Vol.12 No.6 Art.81, 2021-11, 45p, DOI:10.1145/3483424)
- Summary: [[@2021__TIST__A Survey of AIOps Methods for Failure Management]]
- Pages created: [[@2021__TIST__A Survey of AIOps Methods for Failure Management]] / [[Paolo Notaro]] / [[Jorge Cardoso]] / [[Michael Gerndt]] / [[University of Coimbra]] / [[Huawei Munich Research Center]]
- Pages updated: [[AIOps]] / [[障害予測]] / [[Fault Localization]] / sources/_index / entities/_index / index.md / hot.md / log.md / .raw/.manifest.json
- Key insight: Organizes 100 works on failure management in pre-LLM AIOps along a proactive/reactive axis with 5 categories and 14 subcategories. Research concentrates on detection, RCA, and prediction (86.8% in total), while prevention at 10.6% and remediation at 2.5% reflect a structural skew that has carried over into the LLM era. Its evaluation framing for online failure prediction (lead/prediction/warning time) and its three-way organization of fault localization (SFL/network/general-purpose) are directly useful for comparison with the current lines of work in this wiki.

## [2026-06-14] ingest-paper | Identifying Faults in Large-Scale Distributed Systems by Filtering Noisy Error Logs
- Source: `.raw/papers/rao-et-al-2011-identifying-faults-noisy-error-logs.pdf` (MD5: `034549e3e90bac2c628fff1816db4915`, IEEE 2011, IEEE Xplore document number 5958800)
- Summary: [[@2011__SRDS__Identifying Faults in Large-Scale Distributed Systems by Filtering Noisy Error Logs]]
- Pages created: [[@2011__SRDS__Identifying Faults in Large-Scale Distributed Systems by Filtering Noisy Error Logs]] / [[Xiang Rao]] / [[Huaimin Wang]] / [[National University of Defense Technology]]
- Pages updated: [[ログ解析]] / [[障害注入]] / sources/_index / entities/_index / concepts/_index (no update) / index.md / hot.md / log.md / .raw/.manifest.json
- Key insight: In a 2011 empirical study on Alibaba Cloud, fault injection tests inevitably contain noise faults other than the injected ones, and with temporal/spatial compression alone the recall of fault feature extraction falls to 30%. This design idea of removing noisy logs at the source is a forerunner of LogCleaner and LogReducer in 2024.

## [2026-06-14] ingest-paper | ByteSeries: An In-Memory Time Series Database for Large-Scale Monitoring Systems
- Source: `.raw/papers/socc20-byteseries.pdf` (MD5: `a3de4331b7e378944057a7cd09e3d850`, SoCC 2020)
- Summary: [[@2020__SoCC__ByteSeries - An In-Memory Time Series Database for Large-Scale Monitoring Systems]]
- Pages created: [[@2020__SoCC__ByteSeries - An In-Memory Time Series Database for Large-Scale Monitoring Systems]] / [[Xuanhua Shi]] / [[Yongluan Zhou]] / [[University of Copenhagen]] / [[ByteSeries]] / [[tsdc]]
- Pages updated: [[Bingsheng He]] / [[ByteDance]] / [[Huazhong University of Science and Technology]] / [[National University of Singapore]] / [[時系列データベース]] / sources/_index / entities/_index / index.md / hot.md / log.md / .raw/.manifest.json
- Key insight: In ultra-high-dimensional monitoring TSDBs the bottleneck is compression of metadata (series keys and tags) rather than of data points. The Compressed Inverted Index (trie + p4nzenc64) supports both compression and group aggregation in a single data structure, speeding up multi-dimensional queries by 1.8x to 10.7x. This problem space is orthogonal to Gorilla-style optimization of data point compression.

## [2026-06-14] ingest-paper | Black-box inter-application traffic monitoring for adaptive container placement
- Source: `.raw/papers/acm-3341105-3374007.pdf` (MD5: `2f10edb15dc5f6a11d43d8f3229601e9`, 8 pages)
- Summary: [[@2020__SAC__Black-box inter-application traffic monitoring for adaptive container placement]]
- Pages created: [[@2020__SAC__Black-box inter-application traffic monitoring for adaptive container placement]] / [[Francisco Neves]] / [[Ricardo Vilaça]] / [[José Pereira]] / [[HASLab]] / [[University of Minho]] / [[コンテナ配置最適化]]
- Pages updated: [[eBPF]] / [[分散トレーシング]] / sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md
- Key insight: In-kernel aggregation with eBPF (KernelAgg) keeps per-connection byte counters in a map keyed by `<pid,sock>` and forwards them periodically, building the inter-container communication graph at 9% overhead. The paper maps out the design space against UserAgg (68%) and the Scope approach (1%, but it cannot measure traffic volume).

## [2026-06-14] ingest-paper | Chronix: Long Term Storage and Retrieval Technology for Anomaly
Detection in Operational Data
- Source: `.raw/papers/fast17-lautenschlager.pdf`
- Summary: [[@2017__FAST__Chronix - Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data]]
- Pages created: [[@2017__FAST__Chronix - Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data]], [[Florian Lautenschlager]], [[Michael Philippsen]], [[Andreas Kumlehn]], [[Josef Adersberger]], [[QAware GmbH]], [[Friedrich-Alexander-Universität Erlangen-Nürnberg]], [[Chronix]]
- Pages updated: [[時系列データベース]], [[異常検知]], [[専用データベースシステム]] + sources/_index, entities/_index, index.md, hot.md, log.md, manifest
- Key insight: General-purpose TSDBs fundamentally restrict exploratory analysis for anomaly detection through their data model constraint (numeric scalar types only), and the performance gap to a domain-specific TSDB (Chronix, with DDC, a generic data model, and built-in analytics) reaches 20-97%. This is an early demonstration, from 2017, that what can be stored and correlated, more than how to detect, determines what is detectable.

## [2026-06-14] ingest-paper | Gorilla: A Fast, Scalable, In-Memory Time Series Database
- Source: `.raw/papers/Pelkonen-et-al.-2015---Gorilla---a-fast-scalable-in-memory-time-series-database-1.pdf` (MD5: `afd40fee058eb419b6e6961df91d2e78`, 12 pages)
- Summary: [[@2015__VLDB__Gorilla - A Fast, Scalable, In-Memory Time Series Database]]
- Pages created: source 1 ([[@2015__VLDB__Gorilla - A Fast, Scalable, In-Memory Time Series Database]]) + entity 2 ([[Gorilla]] / [[Tuomas Pelkonen]])
- Pages updated: [[時系列データベース]] (added 2 cross-cutting findings and 1 open question) / [[メインメモリデータベース]] (added 2 cross-cutting findings and 1 open question) / [[Facebook]] / `wiki/sources/_index.md` / `wiki/entities/_index.md` / [[index]] / [[hot]] / [[log]] / manifest
- Key insight: Following the design philosophy that losing individual data points is acceptable and availability of the most recent data comes first, the monitoring TSDB gave up ACID. This single decision is reflected consistently in its non-WAL logging, its return of partial results, and its recovery that prioritizes the newest blocks, and it made possible both a 73x speedup over HBase and 12x compression. Gorilla compression became the de facto standard for later TSDBs.

## [2026-06-14] ingest-paper | B-Trees Are Back: Engineering Fast and Pageable Node Layouts
- Source: `.raw/papers/1-3709664.pdf` (MD5: `368c6aca1c84afb5d57fd9708e14486a`)
- Summary: [[@2025__SIGMOD__B-Trees Are Back - Engineering Fast and Pageable Node Layouts]]
- Pages created: source 1（[[@2025__SIGMOD__B-Trees Are
Back - Engineering Fast and Pageable Node Layouts]]）+ concept 2 ([[B-Tree]] / [[B-Treeノードレイアウト最適化]]) + entity 6 ([[Marcus Müller]] / [[Lawrence Benson]] / [[Viktor Leis]] / [[btree-cpp]] / [[btree24]] / [[vmcache]])
- Pages updated: [[TU Munich]] / [[LSMツリー]] / [[メインメモリデータベース]] / `wiki/sources/_index.md` / `wiki/entities/_index.md` / `wiki/concepts/_index.md` / [[index]] / [[hot]] / [[log]] / manifest
- Key insight: The B-Tree is not an outdated disk-oriented structure. If the node layout is re-engineered while keeping variable-length records and 4 KiB pages, the lookup performance gap to pure in-memory indexes can be narrowed substantially while retaining the advantages of range scans, paging, and DBMS integration.

## [2026-06-14] ingest-paper | LLM-Oriented Information Retrieval: A Denoising-First Perspective
- Source: `.raw/papers/arxiv-2605.00505.pdf` (MD5: `fbddd621383f474e177f5a2ea8c7333e`)
- Summary: [[@2026__SIGIR__LLM-Oriented Information Retrieval - A Denoising-First Perspective]]
- Pages created: source 1 ([[@2026__SIGIR__LLM-Oriented Information Retrieval - A Denoising-First Perspective]]) + concept 2 ([[LLM向け情報検索]] / [[RAGノイズ除去]]) + entity 7 ([[Lu Dai]] / [[Liang Sun]] / [[Fanpu Cao]] / [[Ziyang Rao]] / [[Cehao Yang]] / [[Hao Liu]] / [[Hui Xiong]])
- Pages updated: [[エージェント型コーディング]] / [[エージェント型強化学習]] / [[Hong Kong University of Science and Technology, Guangzhou]] / [[Hong Kong University of Science and Technology]] / `wiki/sources/_index.md` / `wiki/entities/_index.md` / `wiki/concepts/_index.md` / [[index]] / [[hot]] / [[log]] / manifest
- Key insight: In LLM-oriented IR the consumer of search results shifts from humans to LLMs, so the retriever must not only maximize raw recall but also act as a noise gate that controls evidence density, verifiability, and safety within the context window.

## [2026-06-14] ingest-paper | Rethinking The Compaction Policies in LSM-trees
- Source: `.raw/papers/acm-3725344.pdf` (MD5: `72a51ef41063576d73cdafd70c303894`)
- Summary: [[@2025__SIGMOD__Rethinking The Compaction Policies in LSM-trees]]
- Pages created: source 1 ([[@2025__SIGMOD__Rethinking The Compaction Policies in LSM-trees]]) + concept 1 ([[LSMツリーコンパクション]]) + entity 7 ([[Hengrui Wang]] / [[Jiansheng Qiu]] / [[Fangzhou Yuan]] / [[Huanchen Zhang]] / [[EcoTune]] / [[RocksDB]] / [[Shanghai Qi Zhi Institute]])
- Pages updated:
[[LSMツリー]] / [[データベースノブチューニング]] / [[データベース O&M]] / [[Tsinghua University]] / `wiki/sources/_index.md` / `wiki/entities/_index.md` / `wiki/concepts/_index.md` / [[index]] / [[hot]] / [[log]] / manifest
- Key insight: LSM-tree compaction is not only a static trade-off between WA and RA but also a timing problem: when to invest CPU and I/O in future average query throughput. Using a three-level model and dynamic programming, EcoTune achieves on RocksDB 1.5x to 3x the average query throughput of Leveling and up to 1.8x that of Lazy Leveling.

## [2026-06-14] ingest-paper | Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis (OSDI 2004)
- Source: `.raw/papers/cohen.pdf` (MD5: `e43cc257bd2a94e36fdddf507ad20271`)
- Summary: [[@2004__OSDI__Correlating Instrumentation Data to System States - A Building Block for Automated Diagnosis]]
- Pages created: source 1 ([[@2004__OSDI__Correlating Instrumentation Data to System States - A Building Block for Automated Diagnosis]]) + entity 7 ([[Ira Cohen]] / [[Moises Goldszmidt]] / [[Terence Kelly]] / [[Julie Symons]] / [[Jeffrey S. Chase]] / [[HP Labs]] / [[Duke University]])
- Pages updated: [[根本原因分析]] (added cross-cutting finding: correlation ≠ causation was recognized as early as 2004) / [[異常検知]] (2004 demonstration of binary SLO classification) / [[Fault Localization]] (pioneering formulation of metric attribution) + sources/_index / entities/_index / index.md / hot.md / log.md / manifest
- Key insight: A single-metric rule (CPU utilization) drops to 56% balanced accuracy on the STEP workload, whereas combining 3-8 metrics with TAN (tree-augmented naive Bayes) reaches 87-94%. The paper made metric attribution and the constraint that correlation ≠ causation explicit in 2004, and is a direct precursor of current metric selection research such as MetricSifter.

## [2026-06-14] ingest-paper | Humanity's Last Exam (arXiv 2025)
- Source: `.raw/papers/arxiv-2501.14249.pdf`
- Summary: [[@2025__arXiv__Humanity's Last Exam]]
- Pages created: source 1 ([[@2025__arXiv__Humanity's Last Exam]]) + entity 4 ([[Dan Hendrycks]] / [[Long Phan]] / [[Center for AI Safety]] / [[Scale AI]])
- Pages updated: [[LLM評価]] (benchmark saturation and frontier benchmark design; added 4 cross-cutting findings) + sources/_index / entities/_index / index.md / hot.md / log.md / manifest
- Key insight: Even frontier models that have saturated existing benchmarks such as MMLU reach at most 13.4% accuracy on HLE, and every model has an RMS calibration error of 73-89%, showing high confidence even when wrong. The paper quantifies at frontier scale the structural problem that benchmark saturation is inevitable and evaluation design is in a race with model progress.

## [2026-06-14]
ingest-paper | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (arXiv 2024)
- Source: `.raw/papers/arxiv-2403.04132.pdf`
- Summary: [[@2024__arXiv__Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference]]
- Pages created: source 1 ([[@2024__arXiv__Chatbot Arena - An Open Platform for Evaluating LLMs by Human Preference]]) + entity 4 ([[Wei-Lin Chiang]] / [[Lianmin Zheng]] / [[LMSYS]] / [[Chatbot Arena]]) + concept 1 ([[LLM評価]])
- Pages updated: [[Ion Stoica]] (added related links and description) + sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Chatbot Arena evaluates LLMs through crowdsourced pairwise comparisons, and the Bradley-Terry model yields a more robust statistical ranking than Elo. Active sampling reduces the number of votes needed to estimate the win matrix by up to 54% relative to random sampling. Crowd votes and expert ratings agree 72-83% of the time, with the differences mainly due to overlooked factual errors.

## [2026-06-14] ingest-paper | Root Cause Analysis for Microservices based on Causal Inference - How Far Are We (ASE 2024)
- Source: `.raw/papers/1-3691620.3695065.pdf`
- Summary: [[@2024__ASE__Root Cause Analysis for Microservices based on Causal Inference - How Far Are We]]
- Pages created: source 1 ([[@2024__ASE__Root Cause Analysis for Microservices based on Causal Inference - How Far Are We]]) + entity 6 ([[Luan Pham]] / [[Huong Ha]] / [[Hongyu Zhang]] / [[RMIT University]] / [[Chongqing University]] / [[RCAEval]]) + concept 1 ([[因果推論ベースRCA]])
- Pages updated: [[根本原因分析]] (added 2 cross-cutting findings) / [[Fault Localization]] (added 1 cross-cutting finding) + sources/_index / entities/_index / concepts/_index / index.md / hot.md / log.md / manifest
- Key insight: Introduces a Dummy baseline (random selection) into the evaluation of causal-inference-based RCA for the first time and shows that many PC/FCI/Granger-family methods do no better than Dummy. Causal graph F1 ranges from 0.10 to 0.54, and edge-direction estimation is the bottleneck common to all methods. Hypothesis-testing methods (BARO, NSigma), which skip graph construction, perform best.

## [2026-06-14] ingest | Intelligent Monitoring: Towards AI-Assisted Monitoring for Cloud Services (Microsoft Research Blog)
- Source: `.raw/articles/intelligent-monitoring-towards-ai-assisted-monitoring-for-cloud-services-2026-06-14.md`
- Summary: [[@2024__Microsoft Research Blog__Intelligent Monitoring - Towards AI-Assisted Monitoring for
Cloud Services]]
- Pages created: source 1 ([[@2024__Microsoft Research Blog__Intelligent Monitoring - Towards AI-Assisted Monitoring for Cloud Services]]) + entity 2 ([[Avi Nayak]] / [[Piyali Jana]])
- Pages updated: [[Rujia Wang]] / [[クラウドモニタリング]] (added Monitor Scorecards to open questions; added source) + sources/_index / entities/_index / index.md / hot.md / log.md / manifest
- Key insight: As information not included in the ICSE-SEIP 2024 paper, the post previews Monitor Scorecards (systematic evaluation of monitor effectiveness through incident analysis and impact assessment, using Bayesian statistics and time-series modeling), indicating a planned cycle that closes the loop with an evaluation phase after the recommendation phase.

## [2026-06-14] ingest-paper | Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey
- Source: `.raw/papers/Soldani-and-Brogi-2021---Anomaly-Detection-and-Failure-Root-Cause-Analysis-in-MicroService-Based-Cloud-Applications---A-Survey.pdf` (MD5: `3eccc760cfbf57aac969d98aa11edcf7`)
- Summary: [[@2021__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey]]
- Pages created: [[@2021__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey]], [[Jacopo Soldani]], [[Antonio Brogi]]
- Pages updated: [[異常検知]], [[根本原因分析]], `wiki/sources/_index.md`, `wiki/entities/_index.md`, [[index]], [[hot]], [[log]]
- Key insight: A foundational 2021 survey that classifies 25 anomaly detection techniques and 26 RCA techniques for microservices along two axes, data source (logs / distributed traces / monitoring metrics) × method. The PC algorithm plus random walk became established as the standard RCA pipeline of the pre-LLM era, and the survey had already defined in 2021 the open problems of correlation ≠ causation and of explainability, countermeasure recommendation, and coping with continuous change.

## [2026-06-14] ingest-paper | Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach
- Source: `.raw/papers/arxiv-2403.07927.pdf` (MD5: `5e350e9b8d00107d53a2061cad451bfc`)
- Summary: [[@2024__ICSE-SEIP__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- Pages created: [[@2024__ICSE-SEIP__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]], [[Pooja Srinivas]], [[Fiza Husain]], [[Ayush Choure]]
- Pages updated: [[Anjaly Parayil]], [[Chetan Bansal]],
[[Saravan Rajmohan]], [[クラウドモニタリング]]
- Key insight: An empirical analysis of 791 production services derives a monitor ontology (13 resource classes, 9 SLO types) and shows that monitor recommendation can be automated from a service's dependency graph and components alone. It is the first data-driven approach to the question of what to monitor.

## [2026-06-14] ingest-paper | Detection Is Better Than Cure - A Cloud Incidents Perspective
- Source: `.raw/papers/acm-3611643-3613898.pdf` (MD5: `703abef803dac1bf1aa991544103215a`)
- Summary: [[@2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]]
- Pages created: [[@2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]], [[Vaibhav Ganatra]], [[Yu Kang]], [[Anjaly Parayil]], [[クラウドモニタリング]]
- Pages updated: [[Chetan Bansal]], [[Supriyo Ghosh]], [[Suman Nath]], [[Jonathan Mace]], [[Minghua Ma]], [[インシデント管理]], [[異常検知]]
- Key insight: An empirical analysis of Microsoft production shows that more than 40% of missed detections stem from missing monitors, providing quantitative support for the view that AIOps research needs to solve the higher-level question of what to monitor before the question of how to detect.

## [2026-06-14] ingest-paper | CNCF TAG Observability Whitepaper v1.0
- Source: `.raw/articles/cncf-observability-whitepaper.md` (MD5: `137eec87ac483314adba22a0d5099720`)
- Summary: [[@2023__CNCF TAG Observability__Observability Whitepaper]]
- Pages created: [[@2023__CNCF TAG Observability__Observability Whitepaper]], [[CNCF]], [[TAG Observability]], [[Liz Fong-Jones]], [[継続的プロファイリング]]
- Pages updated: [[オブザーバビリティ]], [[テレメトリ]], [[エラーバジェット]], [[OpenTelemetry]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: A CNCF industry-consensus document that extends the conventional three pillars (logs / metrics / traces) to five signals and provides exemplar-based navigation from metrics to traces along with a quantitative framework for SLO burn-rate alerting.

## [2026-06-14] ingest-paper | Performance Anomaly Detection and Bottleneck Identification (Ibidunmoye+ CSUR2015)
- Source: `.raw/papers/Ibidunmoye-et-al.-2015---Performance-anomaly-detection-and-bottleneck-identification.pdf`
- Summary: [[@2015__CSUR__Performance Anomaly Detection and Bottleneck Identification]]
- Pages created: [[@2015__CSUR__Performance Anomaly Detection and Bottleneck Identification]],
[[Olumuyiwa Ibidunmoye]], [[Francisco Hernández-Rodriguez]], [[Erik Elmroth]], [[Umeå University]]
- Pages updated: [[異常検知]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: In this 2015 PADBI survey, 53% of the surveyed papers deal with PAD only and just 18% integrate PADBI. This is the first quantitative evidence of a problem that persists in the same form a decade later, when the criticism that work stops at detection without identifying root causes is still being made.

## [2026-06-14] ingest-paper | How to Manage Change-Induced Incidents (Zhao+ ISSRE2023)
- Source: `.raw/papers/How_to_Manage_Change-Induced_Incidents_Lessons_from_the_Study_of_Incident_Life_Cycle.pdf`
- Summary: [[@2023__ISSRE__How to Manage Change-Induced Incidents - Lessons from the Study of Incident Life Cycle]]
- Pages created: [[@2023__ISSRE__How to Manage Change-Induced Incidents - Lessons from the Study of Incident Life Cycle]], [[Yujin Zhao]], [[Ling Jiang]], [[Ye Tao]], [[Songlin Zhang]], [[Changlong Wu]], [[Yifan Wu]], [[Zhonghai Wu]], [[変更起因インシデント]]
- Pages updated: [[インシデント管理]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: If RbIC (recovery before the immediate cause is removed) can be chosen, TTM can be shortened by 40.6%. This is the first study to formulate a new axis of intervention: the choice of mitigation process itself is rate-limiting for TTM.

## [2026-06-14] ingest-paper | An Empirical Study on Change-induced Incidents (Wu+ ICSE-SEIP2023)
- Source: `.raw/papers/An_Empirical_Study_on_Change-induced_Incidents_of_Online_Service_Systems.pdf`
- Summary: [[@2023__ICSE-SEIP__An Empirical Study on Change-induced Incidents of Online Service Systems]]
- Pages created: [[@2023__ICSE-SEIP__An Empirical Study on Change-induced Incidents of Online Service Systems]], [[Bingxu Chai]], [[Bingchang Liu]], [[Jianguo Li]], [[Yong Yang]], [[Wei Jiang]]
- Pages updated: [[インシデント管理]], [[変更起因インシデント]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: In an empirical study at Ant Group, the 75th-percentile TTD of change-induced incidents is 26.8 times the usual value, quantifying flaws in monitoring design right after a change as detection delay.

## [2026-06-14] ingest-paper | Towards Observability Data Management at Scale -
Source: `.raw/papers/1-3456859.3456863.pdf` - Summary: [[@2021__SIGMOD Record__Towards Observability Data Management at Scale]] - Pages created: [[@2021__SIGMOD Record__Towards Observability Data Management at Scale]], [[Suman Karumuri]], [[Franco Solleza]], [[Stan Zdonik]], [[Nesime Tatbul]], [[Slack Technologies]] - Pages updated: [[オブザーバビリティデータモデル]], [[テレメトリ]], [[時系列データベース]], [[Brown University]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]] - Key insight: Slack の実測データでクエリの 97% 超が <24h データを対象とすることを定量化し、「リアルタイム/履歴の分離」をアーキテクチャ原則として裏付けた最初の産業論文。MELT 4 型分類の初出。 ## [2026-06-14] ingest-paper | A Survey on Observability of Distributed Edge & Container-Based Microservices - Source: `.raw/papers/Usman-et-al.-2022---A-survey-on-observability-of-distributed-edge--container-based-microservices.pdf` - Summary: [[@2022__IEEE ACCESS__A Survey on Observability of Distributed Edge & Container-Based Microservices]] - Pages created: [[@2022__IEEE ACCESS__A Survey on Observability of Distributed Edge & Container-Based Microservices]], [[Muhammad Usman]], [[Simone Ferlin]], [[Anna Brunstrom]], [[Javid Taheri]], [[Karlstad University]], [[オブザーバビリティ]] - Pages updated: [[テレメトリ]], [[分散トレーシング]], [[マイクロサービスアーキテクチャ]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/concepts/_index]], [[index]], [[hot]], [[log]] - Key insight: 「モニタリング vs オブザーバビリティ」は代替でなく補完関係であり、三本柱（ログ/メトリクス/トレース）とゴールデンシグナルという2022年時点の標準的枠組みが本論文で体系化された。統合オブザーバビリティプラットフォームの不在は当時から未解決の最大課題。 ## [2026-06-14] ingest-paper | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - Source: `.raw/papers/arxiv-2201.11903.pdf` - Summary: [[@2022__NeurIPS__Chain-of-Thought Prompting Elicits Reasoning in Large Language Models]] - Pages created: [[@2022__NeurIPS__Chain-of-Thought Prompting Elicits Reasoning in Large Language Models]], [[Chain-of-Thought Prompting]], [[Jason Wei]], [[Denny Zhou]] - Pages updated: [[Google Brain]] + index/log/hot/manifest - 
Key insight: 連鎖思考推論は約 100B パラメータ以上の LLM にのみ現れる創発的能力であり、少数の例示を追加するだけで微調整なしに算術・常識・記号推論の SOTA を更新できる。 ## [2026-06-14] ingest-paper | Scaling Laws for Autoregressive Generative Modeling - Source: `.raw/papers/arxiv-2010.14701.txt` - Summary: [[@2020__arXiv__Scaling Laws for Autoregressive Generative Modeling]] - Pages created: [[@2020__arXiv__Scaling Laws for Autoregressive Generative Modeling]] / [[Tom Henighan]] - Pages updated: [[OpenAI]] / [[LLMスケーリング則]] / [[Jared Kaplan]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[wiki/concepts/_index.md]] / [[index]] / [[hot]] - Key insight: スケーリング則は言語以外の全モダリティに普遍的に成立し、最適モデルサイズの指数 $\beta \approx 0.7$ が画像・動画・マルチモーダル・数学問題求解を横断して一定。損失の不可逆成分へのアプローチは下流タスク性能頭打ちを意味せず、「最後の数ビット」に意味論的情報が残る。 ## [2026-06-14] ingest-paper | Scaling Laws for Neural Language Models - Source: `.raw/papers/arxiv-2001.08361.pdf` - Summary: [[@2020__arXiv__Scaling Laws for Neural Language Models]] - Pages created: [[スケーリング則]], [[@2020__arXiv__Scaling Laws for Neural Language Models]] - Pages updated: [[Jared Kaplan]], [[OpenAI]] (+ index/log/hot) - Key insight: 損失はモデルサイズ N・データ D・計算量 C すべてに対してべき乗則 L∝X^{-α} でスケールし、アーキテクチャ詳細への依存は小さい。 ## [2026-06-14] ingest-paper | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models - Source: `.raw/papers/arxiv-2402.03300.pdf`（PDF 未取得のため WebFetch で論文情報を取得） - Summary: [[@2024__arXiv__DeepSeekMath - Pushing the Limits of Mathematical Reasoning in Open Language Models]] - Pages created: [[@2024__arXiv__DeepSeekMath - Pushing the Limits of Mathematical Reasoning in Open Language Models]] - Pages updated: [[DeepSeek-AI]] / [[GRPO]] / [[強化ファインチューニング]] / [[強化学習スケーリング]] / [[wiki/sources/_index.md]] / [[wiki/index.md]] / [[hot]] / [[log]] / [[.raw/.manifest.json]] - Key insight: DeepSeekMath は [[GRPO]] の初出論文であり、「ドメイン特化コーパス構築→継続事前学習→GRPO による RL」という 3 段パイプラインが後続の DeepSeek-R1・DeepSWE・DeepSeek-V3.2 の設計思想の起点となる。価値モデルを廃してグループ内報酬正規化でアドバンテージを推定する設計がメモリ効率とスケーリング安定性を両立させた。 ## [2026-06-13] ingest-paper | 
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs - Source: `.raw/papers/arxiv-2605.09370.pdf` - Summary: [[@2026__arXiv__From Detection to Recovery - Operational Analysis on LLM Pre-training with 504 GPUs]] - Pages created: [[@2026__arXiv__From Detection to Recovery - Operational Analysis on LLM Pre-training with 504 GPUs]] / [[Lablup Inc]] / [[Backend.AI]] / [[Sokovan]] / [[Daemyung Kang]] - Pages updated: [[NVIDIA]] / [[VAST Data]] / [[耐障害LLM訓練]] / [[GPUクラスタ運用]] / [[チェックポイント]] / [[LLM学習モニタリング]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[index]] / [[hot]] - Key insight: LLM 事前学習の復旧は障害検知だけで決まらず、checkpoint load、NFS/RPC キュー形成、60 ノードのギャングスケジューリング、予備ノード占有、自動リトライ停止条件が一体で律速する。 ## [2026-06-13] ingest-paper | Empowering Azure Storage with RDMA - Source: `.raw/papers/nsdi23-bai.pdf` - Summary: [[@2023__NSDI__Empowering Azure Storage with RDMA]] - Pages created: [[@2023__NSDI__Empowering Azure Storage with RDMA]] / [[Wei Bai]] / [[Azure Storage]] / [[RDMA Estats]] - Pages updated: [[Microsoft]] / [[SONiC]] / [[RDMA]] / [[RDMAネットワーク監視]] / [[分散ストレージ]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[wiki/concepts/_index.md]] / [[index]] / [[hot]] - Key insight: RDMA は LLM/HPC 向けの高速通信だけでなく、ディスアグリゲートされたクラウドストレージで CPU 予約と I/O レイテンシを下げる基盤でもある。Azure Storage のリージョン内展開は、異世代 NIC・異種スイッチ・PFC/DCQCN・ホスト内輻輳・フェイルオーバー容量計画が RDMA の本番価値を左右することを示す。 ## [2026-06-12] ingest-paper | Aurora PostgreSQL Limitless Database: Building a Highly Scalable OLTP Database - Source: `.raw/papers/1-3788853.3803089.pdf` - Summary: [[@2026__SIGMOD Companion__Aurora PostgreSQL Limitless Database - Building a Highly Scalable OLTP Database]] - Pages created: [[@2026__SIGMOD Companion__Aurora PostgreSQL Limitless Database - Building a Highly Scalable OLTP Database]] / [[Aurora Limitless Database]] / [[Dmitry Arkhangelskiy]] / [[分散 PostgreSQL]] - Pages updated: [[Amazon Web Services]] / [[OLTPシステムアーキテクチャ]] / [[wiki/sources/_index.md]] / 
[[wiki/entities/_index.md]] / [[wiki/concepts/_index.md]] / [[index]] / [[hot]]
- Key insight: Aurora Limitless presents horizontal OLTP scaling not as a purpose-built redesign that gives up PostgreSQL compatibility, but as an extension that keeps existing compatibility through router/shard separation, time-based MVCC, 2PC with a lead shard, Serverless V2, and shard splitting. The criteria for evaluating OLTP architectures therefore include ease of migration, DDL and backup consistency, and the operating model, not only raw performance.

## [2026-06-12] ingest-paper | Anomaly detection and root-cause identification in microservices: a survey
- Source: `.raw/papers/1-s10586-026-06095-9.pdf`
- Summary: [[@2026__Cluster Computing__Anomaly detection and root-cause identification in microservices - a survey]]
- Pages created: [[@2026__Cluster Computing__Anomaly detection and root-cause identification in microservices - a survey]] / [[Luís M. Barata]] / [[Sérgio Sequeira]] / [[Eurico Lopes]] / [[Pedro R. M. Inácio]] / [[Mário M. Freire]] / [[Instituto de Telecomunicações]] / [[Universidade da Beira Interior]] / [[Instituto Politécnico de Castelo Branco]] / [[NOVA LINCS]] / [[Cluster Computing]]
- Pages updated: [[異常検知]] / [[根本原因分析]] / [[マイクロサービスアーキテクチャ]] / [[Fault Localization]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[index]] / [[hot]]
- Key insight: Microservice anomaly detection and RCA is not a matter of adding more logs, traces, and monitoring metrics. It requires designing together the selection of signal sources that fit each fault type, the dependency graph, the evaluation benchmark, and explainability. The survey's aggregated performance figures are a useful map, but because datasets, fault types, and metrics are not standardized, they should be read cautiously as a ranking of methods.

## [2026-06-11] ingest-paper | RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models
- Source: `.raw/papers/1-3627673.3680016.pdf`
- Summary: [[@2024__CIKM__RCAgent - Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models]]
- Pages created: [[RCAgent]] / [[Zefan Wang]] / [[Zichuan Liu]] / [[Yingying Zhang]] / [[Aoxiao Zhong]] / [[Jihong Wang]] / [[Fengbin Yin]] / [[Lunting Fan]] / [[Lingfei Wu]] / [[Qingsong Wen]] / [[Xi’an Jiaotong University]] / [[Anytime AI]] / [[Squirrel Ai Learning]]
- Pages updated: [[@2024__CIKM__RCAgent - Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models]]
/ [[AIOps]] / [[根本原因分析]] / [[RCA入力選別]] / [[agentic SRE]] / [[ログ解析]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[index]] / [[hot]]
- Key insight: An RCA agent's performance depends not only on the LLM's reasoning ability but also, to a large degree, on how observations are stashed and re-fetched, how far the tools' argument space is narrowed semantically, and whether log and code analysis are delegated to specialist agents. When RCAgent's tools were replaced with direct SQL/SLS tools, the Invalid Rate degraded to 70.94%, which shows that input selection in agentic SRE is a problem of action-space design as well as data volume.

## [2026-06-10] ingest-paper | A System-Level Taxonomy of Failure Modes in Large Language Model Applications
- Source: `.raw/papers/1-A_System-Level_Taxonomy_of_Failure_Modes_in_Large_Language_Model_Applications.pdf`
- Summary: [[@2026__IEEE CAI__A System-Level Taxonomy of Failure Modes in Large Language Model Applications]]
- Pages created: [[@2026__IEEE CAI__A System-Level Taxonomy of Failure Modes in Large Language Model Applications]] / [[Vaishali Vinay]] / [[LLMアプリケーション信頼性]]
- Pages updated: [[Microsoft]] / [[エージェントシステム運用]] / [[運用障害分析]] / [[LLM推論]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[wiki/concepts/_index.md]] / [[index]] / [[hot]]
- Key insight: Reliability problems in LLM applications are not confined to hallucination and reasoning errors. They need to be classified as system failures that span input/context boundaries, tools and APIs, multi-agent communication, version updates, and cost constraints. Because static benchmarks cannot measure stability, reproducibility, or drift, semantic observability and a verification layer become central to operational design.

## [2026-06-10] ingest-paper | Twenty Years of Bigtable (SIGMOD Companion 2026)
- Source: `.raw/papers/1-3788853.3803095.pdf`
- Summary: [[@2026__SIGMOD Companion__Twenty Years of Bigtable]]
- Pages created: [[@2026__SIGMOD Companion__Twenty Years of Bigtable]] / [[Fabio Baltieri]]
- Pages updated: [[Bigtable]] / [[Google]] / [[分散ストレージ]] / [[LSMツリー]] / [[データベース O&M]] / [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[wiki/concepts/_index.md]] / [[index]] / [[hot]]
- Key insight: Bigtable's 20-year history shows that the longevity of a distributed storage system rests on keeping the core model while adding around it: replication, SQL, CDC, CRDTs, views, external compaction, autosizing, and SRE-run service operations. Once the system has scaled, the bottleneck moves beyond data partitioning to metadata management, location lookup, resource tiering, and operational standardization.

## [2026-06-08] ingest | VictoriaMetrics KubeCon EU 2026: Retroactive Sampling for OpenTelemetry
- Source:
`.raw/articles/kubecon-eu-2026-sampling-2026-06-08.md`
- Summary: [[VictoriaMetrics-KubeCon-EU-2026-Sampling|@2026__VictoriaMetrics Blog__KubeCon EU 2026 Retroactive Sampling]]
- Pages created: [[VictoriaMetrics-KubeCon-EU-2026-Sampling|@2026__VictoriaMetrics Blog__KubeCon EU 2026 Retroactive Sampling]] / [[Retroactive Sampling]] / [[VictoriaTraces]] / [[Zhu Jiekun]]
- Pages updated: [[VictoriaMetrics]] / [[OpenTelemetry]] / [[トレースサンプリング]] / [[Scaling Telemetry Workloads]] / [[index]] / [[hot]]
- Key insight: In retroactive sampling, edge agents send only minimal attributes (33 bytes) to the central collector and buffer data in an on-disk FIFO. Compared with tail sampling, this cuts network usage by 70% and CPU and memory by 60–70%. Whereas Pebble-based on-disk tail sampling increases CPU by 649%, the FIFO's sequential I/O avoids random I/O and keeps cost in check.

## [2026-06-08] ingest | AI doesn't need giant supercomputers after all (Glenn K. Lockwood Blog, 2026-05-08)
- Source: `.raw/articles/ai-doesnt-need-giant-supercomputers-2026-05-08.md`
- Summary: [[@2026__Glenn K. Lockwood Blog__AI doesnt need giant supercomputers after all]]
- Pages created: [[@2026__Glenn K. Lockwood Blog__AI doesnt need giant supercomputers after all]] / [[Glenn K. Lockwood]] / [[VAST Data]] / [[Microsoft Fairwater]] / [[AWS Rainier]]
- Pages updated: [[LLMスケーリング則]] / [[LLM分散学習]] / [[Microsoft]] / [[OpenAI]] / [[index]] / [[hot]]
- Key insight: According to the post, in 2025 an OpenAI model trained on a hyperscale cluster proved uneconomical, with a per-token price 15 times that of GPT-4o and 120 GPUs needed for inference, and was deprecated. Competing reasoning models reached comparable results on "small, older clusters", which settled the paradigm shift from scale to smartness. The value of hyperscale clusters has moved from parameter count to training speed, risk reduction, and lower operational burden.

## [2026-06-08] ingest-paper | Optimization Techniques for GPU Programming (ACM CSUR 2023, Hijma et al.)
- Source: `.raw/papers/acm-csur-2023-gpu-optimization-hijma.pdf`
- Summary: [[@2023__CSUR__Optimization Techniques for GPU Programming]]
- Pages created: [[@2023__CSUR__Optimization Techniques for GPU Programming]] / [[Pieter Hijma]] / [[Stijn Heldens]] / [[Ben van Werkhoven]] / [[Henri E. Bal]] / [[Alessio Sclocco]] / [[Vrije Universiteit Amsterdam]] / [[Netherlands eScience Center]] / [[GPU最適化]] / [[コアレスドメモリアクセス]] / [[カーネルフュージョン]] / [[分岐発散]] / [[Auto-tuning]]
- Pages updated: [[wiki/sources/_index.md]] / [[wiki/entities/_index.md]] / [[wiki/concepts/_index.md]] / [[index]]
- Key insight: The survey quantifies, across 450 papers, how often each GPU optimization technique is used and how the techniques depend on one another. It provides primary data showing that LLM inference optimizations such as Flash Attention are direct applications of classic GPU optimization techniques.

## [2026-06-08] ingest | Anthropic Engineering Blog: A Postmortem of Three Recent Issues (2025-09-17)
- Source: `.raw/articles/a-postmortem-of-three-recent-issues-2025-09-17.md`
- Summary: [[@2025__Anthropic Engineering Blog__A Postmortem of Three Recent Issues]]
- Pages created: [[@2025__Anthropic Engineering Blog__A Postmortem of Three Recent Issues]] / [[Anthropic]]
- Pages updated: [[LLM推論]] / [[運用障害分析]] / [[index]]
- Key insight: Production failures in GenAI can arise independently at three layers: routing, inference configuration, and the compiler. The post is recorded as a primary source on diagnostic difficulties specific to LLMs, namely gaps in evaluation coverage and the trade-off between privacy and observability.

## [2026-06-08] ingest-paper | UModel: An Agent-Ready Observability Data Modeling Method at Scale (arXiv:2606.04799)
- Source: `.raw/papers/arxiv-2606.04799.pdf`
- Summary: [[@2026__arXiv__UModel - An Agent-Ready Observability Data Modeling Method at Scale]]
- Pages created: [[wiki/sources/@2026__arXiv__UModel - An Agent-Ready Observability Data Modeling Method at Scale]], [[Gaogang Xie]], [[UModel]], [[オブザーバビリティデータモデル]]
- Pages updated: [[Changhua Pei]], [[Dan Pei]], [[Alibaba Cloud]], [[根本原因分析]], [[AIOps]], [[テレメトリ]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]], [[index]]
- Key insight: More than a year of production use at Alibaba Cloud shows that a design change at the data-model layer alone (object-centric modeling) improves RCA accuracy by 8%. Agent performance is limited first by what the agent is shown, and only then by model capability.

## [2026-06-08] ingest-paper | 機械学習の原点：統計的機械学習の世界 (応用物理 2026-05)
- Source: `.raw/papers/oubutsu-2026-95-5-274.pdf`
- Summary: [[@2026__応用物理__機械学習の原点 - 統計的機械学習の世界]]
- Pages created: [[赤穂昭太郎]], [[産業技術総合研究所]], [[統計的機械学習]], [[ベイズ最適化]], [[アンサンブル学習]]
- Pages updated: [[wiki/index.md]], [[wiki/hot.md]],
[[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[wiki/concepts/_index.md]]
- Key insight: Statistical machine learning is effective for small-data problems in applied physics and materials science. The equivalence between MAP estimation and ridge/LASSO regression, and the search for material parameters by Bayesian optimization, enter the wiki as cross-cutting knowledge from outside the AIOps domain.

## [2026-06-08] ingest-paper | XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms
- Source: `.raw/papers/arxiv-2605.23348.pdf`
- Summary: [[@2026__arXiv__XWind - A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms]]
- Pages created: [[@2026__arXiv__XWind - A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms]], [[XWind]], [[Debopam Bhattacherjee]], [[AI Greenferencing]]
- Pages updated: [[LLM推論]] (added the KV cache as a leading indicator under variable power, and the design axis of integration with power control, to the cross-cutting findings and open questions), [[Microsoft]] (added XWind/Greenferencing notes)
- Key insight: The feasibility argument from [[AI Greenferencing]] (more than 890 GW of wind capacity lies within 50 ms of Azure data centers) and the design demonstrated by [[XWind]] (using KV-cache utilization as a power-control signal) open a new research axis: joint control of power and performance in LLM inference.

## [2026-06-08] ingest | Resilient AI Supercomputer Networking: How MRC and SRv6 Keep 100,000+ GPUs Training
- Source: `.raw/articles/resilient-ai-supercomputer-networking-mrc-srv6-2026-05-28.md`
- Summary: [[@2026__LinkedIn__Resilient AI Supercomputer Networking - How MRC and SRv6 Keep 100,000+ GPUs Training]]
- Pages created: [[@2026__LinkedIn__Resilient AI Supercomputer Networking - How MRC and SRv6 Keep 100,000+ GPUs Training]], [[Ravi Sharma]], [[MRC]], [[SRv6]], [[マルチプレーンClosトポロジ]]
- Pages updated: [[RDMA]], [[OpenAI]]
- Key insight: The detect, avoid, recover architecture adopted by [[OpenAI]]'s cluster of more than 100,000 GPUs is a three-part design: [[MRC]] packet spraying addresses RDMA's fixed-path constraint, [[SRv6]] source routing addresses route-convergence delay, and NIC partitioning in a [[マルチプレーンClosトポロジ]] addresses the scaling limit. It comes with a shift in philosophy that redefines failure from an exception to a normal event.

## [2026-06-07] ingest-paper | Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations
- Source: `.raw/papers/arxiv-2604.26805.pdf`
- Summary: [[@2026__arXiv__Bian Que - An Agentic Framework with Flexible Skill Arrangement for Online System Operations]]
- Pages created: [[@2026__arXiv__Bian Que - An Agentic Framework with Flexible Skill Arrangement for Online System Operations]], [[Kuaishou Technology]], [[Bian Que]], [[Bochao Liu]], [[Ben Chen]], [[Flexible Skill Arrangement]]
- Pages updated: [[AIOps]], [[エージェントシステム運用]], [[根本原因分析]], [[インシデント管理]]
- Key insight: The LLM bottleneck in O&M lies in orchestration (selecting data and knowledge), not in reasoning. A six-month production deployment shows that controlling context up front through Skills, together with a unified self-evolution mechanism, works at industrial scale.

## [2026-06-07] ingest-paper | Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?
- Source: `.raw/papers/arxiv-2604.26670.pdf`
- Summary: [[@2026__arXiv__Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems]]
- Pages created: [[@2026__arXiv__Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems]], [[Runzhou Wang]], [[NexusRCL]]
- Pages updated: [[Dan Pei]], [[Shenglin Zhang]], [[Nankai University]], [[Fault Localization]]
- Key insight: In microservice RCL, asymmetric cross-layer propagation arising from entity-level heterogeneity (services vs. hosts) limits accuracy. Separating the entity types explicitly in a heterogeneous graph raises A@1 by up to 45 points over a homogeneous graph.

## [2026-06-07] ingest-paper | See More, Forecast Better and Faster (SPRINT, ICML 2026)
- Source: `.raw/papers/d5a1c41e-ICML26_SPRINT_20260522.pdf`
- Summary: [[@2026__ICML__See More, Forecast Better and Faster - Enhancing Time Series Foundation Models via Inference-Time Plug-and-Play Downsampling]]
- Pages created: [[@2026__ICML__See More, Forecast Better and Faster - Enhancing Time Series Foundation Models via Inference-Time Plug-and-Play Downsampling]], [[Longlong Xu]], [[Zeyan Li]], [[SPRINT]]
- Pages updated: [[時系列基盤モデル]], [[Dan Pei]], [[Changhua Pei]]
- Key insight: A training-free, inference-time downsampling wrapper improves both accuracy and efficiency without changing the TSFM architecture, as shown by a theoretical guarantee based on the Nyquist-Shannon theorem and by experiments on 7 TSFMs × 9 datasets.

## [2026-06-07] ingest-paper | Agent System Operations: Categorization, Challenges, and Future Directions
- Source: `.raw/papers/arxiv-2606.01581.pdf`
- Summary: [[@2026__arXiv__Agent System Operations - Categorization, Challenges, and Future Directions]]
- Pages created: [[@2026__arXiv__Agent System Operations - Categorization, Challenges, and Future Directions]], [[エージェントシステム運用]], [[Zexin Wang]], [[David Lo]], [[Yintong Huo]]
- Pages updated: [[AIOps]] (added the AgentOps subarea), [[異常検知]] (added an agent-specific taxonomy), [[根本原因分析]] (added three failure-attribution categories)
- Key insight: RCA for agent systems shifts from locating the faulty infrastructure component to identifying the decision point on the execution trajectory (failure attribution). The paper confirms a complementarity: LLM-based methods lose accuracy as context length grows, while non-LLM methods stay stable.

## [2026-06-07] ingest-paper | ChainScope: Balancing Accuracy and Overhead in Non-intrusive Distributed Tracing of Microservices
- Source: `.raw/papers/2026__CoNEXT__ChainScope.pdf`
- Summary: [[@2026__CoNEXT__ChainScope - Balancing Accuracy and Overhead in Non-intrusive Distributed Tracing of Microservices]]
- Pages created: [[wiki/sources/@2026__CoNEXT__ChainScope - Balancing Accuracy and Overhead in Non-intrusive Distributed Tracing of Microservices.md]], [[Ruipeng Hong]], [[Gabriele Castellano]], [[Massimo Gallo]]
- Pages updated: [[Pengfei Chen]], [[分散トレーシング]], [[eBPF]], [[wiki/sources/_index.md]], [[wiki/entities/_index.md]], [[index]], [[hot]], [[log]]
- Key insight: Experiments show that in-kernel, IP-level tagging with eBPF combined with head sampling meets four goals at once: non-intrusiveness, high coverage, low overhead, and high accuracy. Instead of choosing between DeepFlow (implicit propagation) and Beyla (explicit propagation), it fills a gap in the design space: explicit propagation at the IP layer plus in-kernel sampling.

## [2026-06-07] ingest | Batch: SRE Workbook selected chapters
- Source: `.raw/articles/foreword-I-2026-06-07.md`, `.raw/articles/foreword-II-2026-06-07.md`, `.raw/articles/how-sre-relates-2026-06-07.md`, `.raw/articles/implementing-slos-2026-06-07.md`, `.raw/articles/slo-engineering-case-studies-2026-06-07.md`, `.raw/articles/monitoring-2026-06-07.md`, `.raw/articles/alerting-on-slos-2026-06-07.md`, `.raw/articles/eliminating-toil-2026-06-07.md`, `.raw/articles/simplicity-2026-06-07.md`, `.raw/articles/part-II-practices-2026-06-07.md`, `.raw/articles/on-call-2026-06-07.md`, `.raw/articles/incident-response-2026-06-07.md`,
`.raw/articles/sre-workbook-postmortem-culture-2026-06-07.md`, `.raw/articles/sre-workbook-conclusion-2026-06-07.md`, `.raw/articles/sre-workbook-slo-document-2026-06-07.md`, `.raw/articles/sre-workbook-error-budget-policy-2026-06-07.md`, `.raw/articles/sre-workbook-postmortem-analysis-2026-06-07.md`
- Summary: [[@2018__Google SRE Workbook__Foreword I]], [[@2018__Google SRE Workbook__Foreword II]], [[@2018__Google SRE Workbook__Chapter 1 How SRE Relates to DevOps]], [[@2018__Google SRE Workbook__Chapter 2 Implementing SLOs]], [[@2018__Google SRE Workbook__SLO Engineering Case Studies]], [[@2018__Google SRE Workbook__Monitoring]], [[@2018__Google SRE Workbook__Alerting on SLOs]], [[@2018__Google SRE Workbook__Eliminating Toil]], [[@2018__Google SRE Workbook__Simplicity]], [[@2018__Google SRE Workbook__Part II Practices]], [[@2018__Google SRE Workbook__On-Call]], [[@2018__Google SRE Workbook__Incident Response]], [[@2018__Google SRE Workbook__Chapter 10 Postmortem Culture - Learning from Failure]], [[@2018__Google SRE Workbook__Conclusion]], [[@2018__Google SRE Workbook__Appendix A Example SLO Document]], [[@2018__Google SRE Workbook__Appendix B Example Error Budget Policy]], [[@2018__Google SRE Workbook__Appendix C Results of Postmortem Analysis]]
- Pages created: 17 source pages, [[SRE Workbook]]
- Pages updated: [[SRE Book]], [[SRE]], [[サービスレベル目標]], [[エラーバジェット]], [[トイル]], [[テレメトリ]], [[インシデント管理]], [[index]], [[hot]]
- Key insight: The SRE Workbook turns the principles of the SRE Book into concrete artifacts: SLO documents, error budget policies, multi-window multi-burn-rate alerts, on-call design, incident response drills, and postmortem templates. It reaffirms the core of SRE as an operational system that sets a commitment, measures it, and changes behavior according to how much budget has been consumed.

## [2026-06-07] ingest-paper | Batch: Transformer + GPT-1/2/3 foundational papers
- Source: `.raw/papers/arxiv-1706.03762.pdf`, `.raw/papers/language_understanding_paper.pdf`, `.raw/papers/language_models_are_unsupervised_multitask_learners.pdf`, `.raw/papers/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf`
- Summary: [[@2017__NeurIPS__Attention Is All You Need]], [[@2018__OpenAI__Improving Language Understanding by Generative Pre-Training]], [[@2019__OpenAI__Language Models are Unsupervised Multitask Learners]], [[@2020__NeurIPS__Language Models are Few-Shot Learners]]
- Pages created: [[@2017__NeurIPS__Attention Is All You Need]], [[@2018__OpenAI__Improving Language Understanding by Generative Pre-Training]], [[@2019__OpenAI__Language Models are Unsupervised Multitask Learners]], [[@2020__NeurIPS__Language Models are Few-Shot Learners]], [[Transformer]], [[言語モデル事前学習]], [[文脈内学習]], [[Ashish Vaswani]], [[Noam Shazeer]], [[Aidan Gomez]], [[Illia Polosukhin]], [[Google Brain]], [[Łukasz Kaiser]], [[Jakob Uszkoreit]], [[Niki Parmar]], [[Llion Jones]], [[OpenAI]], [[Alec Radford]], [[Ilya Sutskever]], [[Karthik Narasimhan]], [[Tim Salimans]], [[Dario Amodei]], [[Jeffrey Wu]], [[Rewon Child]], [[GPT-2]], [[GPT-3]], [[WebText]], [[Tom Brown]], [[Jared Kaplan]]
- Pages updated: [[LLMスケーリング則]] (added GPT-2/3 scaling observations), [[OpenAI]] (added GPT-2/3 information), [[Alec Radford]] (added GPT-2/3 role), [[Ilya Sutskever]] (added GPT-2/3 role), [[Dario Amodei]] (added GPT-3 role)
- Key insight: From GPT-1 to GPT-2 to GPT-3 the paradigm moved from pre-training plus fine-tuning, to zero-shot transfer, to in-context learning, and the decoder part of the Transformer alone absorbed a three-orders-of-magnitude increase in parameters without architectural change. This is the experimental basis for later work on LLM scaling laws.

## [2026-06-06] ingest-paper | OLTP through the looking glass, and what we found there
- Source: `.raw/papers/Harizopoulos-et-al.-2008---OLTP-through-the-looking-glass-and-what-we-found-there.pdf`
- Summary: [[@2008__SIGMOD__OLTP through the looking glass, and what we found there]]
- Pages created: [[@2008__SIGMOD__OLTP through the looking glass, and what we found there]], [[OLTPシステムアーキテクチャ]], [[メインメモリデータベース]]
- Pages updated: [[Stavros Harizopoulos]] (updated affiliation to HP Labs; noted first authorship as a main contribution at SIGMOD 2008), [[Daniel J. Abadi]] (added sources), [[Michael Stonebraker]] (added sources), [[Samuel Madden]] (added sources), `wiki/sources/_index.md`, `wiki/concepts/_index.md`, `wiki/index.md`, `wiki/hot.md`
- Key insight: A breakdown of instruction counts in the Shore RDBMS shows by measurement that there is no single "high pole in the tent". The buffer manager is the largest component (34.6% for New Order), but locking, logging, and latching each account for 11 to 16%, and the 20x throughput improvement appears only when all four components are removed. Making the database memory-resident alone yields only 2.7x, which demonstrates quantitatively that the whole architecture must be redesigned.

## [2026-06-06] refactor | concepts layer cleanup
- **Merged**: Merged RPC規模特性 / RPCレイテンシ特性 into [[クラウドスケールRPC特性]], keeping the old names as aliases. Old wikilinks within the wiki were updated to point to the new concept.
- **Split**: Condensed the overgrown [[根本原因分析]] into a parent page and created [[RCA入力選別]] / [[RCA評価設計]] / [[仮説駆動RCA]] / [[ドメイン別RCA]].
- **Parent/child restructuring**: Condensed [[AIOps]] / [[agentic SRE]] / [[LLM分散学習]] / [[データベース O&M]] into map pages and moved the detailed discussion into existing child concepts.
- **Filled in**: Updated [[Fat-Tree]] / [[RDMA]] / [[クリティカルパス分析]] / [[ワークフロー自動化]], which had been lint stubs, into minimal concepts with citations. Synced the required headings, the `sources` frontmatter, and the Concepts lists in [[concepts/_index]] / [[index]].

## [2026-06-06] ingest | SRE Book Ch10-18, 28-33 integration
- **Sources**: [[@2016__OReilly__SRE Book - Chapter 10 Practical Alerting from Time-Series Data]] / [[@2016__OReilly__SRE Book - Chapter 11 Being On-Call]] / [[@2016__OReilly__SRE Book - Chapter 12 Effective Troubleshooting]] / [[@2016__OReilly__SRE Book - Chapter 13 Emergency Response]] / [[@2016__OReilly__SRE Book - Chapter 14 Managing Incidents]] / [[@2016__OReilly__SRE Book - Chapter 15 Postmortem Culture - Learning from Failure]] / [[@2016__OReilly__SRE Book - Chapter 16 Tracking Outages]] / [[@2016__OReilly__SRE Book - Chapter 17 Testing for Reliability]] / [[@2016__OReilly__SRE Book - Chapter 18 Software Engineering in SRE]] / [[@2016__OReilly__SRE Book - Chapter 28 Accelerating SRE On-Call]] / [[@2016__OReilly__SRE Book - Chapter 29 Dealing with Interrupts]] / [[@2016__OReilly__SRE Book - Chapter 30 Embedding an SRE to Recover from Operational Overload]] / [[@2016__OReilly__SRE Book - Chapter 31 Communication and Collaboration in SRE]] / [[@2016__OReilly__SRE Book - Chapter 32 The Evolving SRE Engagement Model]] /
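[[@2016__OReilly__SRE Book - Chapter 33 Lessons Learned from Other Industries]]
- **Pages updated**: [[SRE]] (added 11 cross-cutting findings and 6 open questions; added the 15 chapters to sources/related) / [[SRE Book]] (added Ch10-18 to Part III and Ch28-33 to Part IV) / [[インシデント管理]] (added cross-cutting findings on the four ICS roles, blameless culture, and Outalator) / [[テレメトリ]] (added a cross-cutting finding on the Borgmon → Prometheus lineage) / [[障害緩和]] (added cross-cutting findings on emergency response and testing strategy) / [[根本原因分析]] (added a cross-cutting finding on the prehistory of the hypothetico-deductive method) / [[異常検知]] (added a cross-cutting finding on Borgmon's declarative rule evaluation) / [[障害注入]] (added cross-cutting findings on testing strategy and DiRT)
- **Key insight**: Integrates 9 Practices chapters and 6 Management chapters of the SRE Book. The following are connected to existing AIOps and agentic SRE findings: the Borgmon → Prometheus lineage, ICS-based incident management, troubleshooting by the hypothetico-deductive method, blameless postmortem culture, intent-based capacity planning (Auxon), the three-phase model of embedded SRE, the evolution of the engagement model (PRR → frameworks), and cross-industry lessons from aviation, healthcare, and manufacturing. Three lineages in particular become clear: the hypothetico-deductive method → hypothesis-driven RCA, the four ICS roles → role design for multi-agent SRE, and testing for reliability → SRE Benchmark design. (source +15, pages updated 8)

## [2026-06-06] ingest | SRE NEXT 2024 talk report
- Source: `.raw/articles/srenext2024-2024-08-08.md`
- Summary: [[@2024__yuuk.io__SRE-NEXT-2024]]
- Pages created: [[@2024__yuuk.io__SRE-NEXT-2024]], [[SRE NEXT]], [[JAXA]], [[プラットフォームエンジニアリング]]
- Pages updated: [[Yuuki Tsubouchi]] (added the SRE NEXT 2024 talk and the Best Speaker award), [[SRE]] (added source), `wiki/index.md`
- Key insight: As the culmination of four and a half years of doctoral work, [[Yuuki Tsubouchi]] redefines SRE as "the engineering of controlling reliability as a specifiable parameter" and presents six open challenges, including cry-wolf alerts, unused trace data, and incident response that fails to improve. The talk also makes the important observation that the spread of [[プラットフォームエンジニアリング]] has helped clarify the boundaries of the SRE role.

## [2026-06-06] ingest-paper | Basic Concepts and Taxonomy of Dependable and Secure Computing
- Source: `.raw/papers/2004-aviz-laprie-randell.pdf`
- Summary: [[@2004__TDSC__Basic Concepts and Taxonomy of Dependable and Secure Computing]]
- Pages created: [[@2004__TDSC__Basic Concepts and Taxonomy of Dependable and Secure Computing]], [[Algirdas Avizienis]], [[Jean-Claude Laprie]], [[Brian Randell]], [[Carl Landwehr]], [[LAAS-CNRS]], [[IFIP WG 10.4]], [[ディペンダビリティ]]
- Pages updated: [[ソフトウェア耐障害性]] (added Gray 2004 cross-cutting finding), `wiki/sources/_index.md`, `wiki/entities/_index.md`, `wiki/concepts/_index.md`, `wiki/index.md`, `wiki/hot.md`
- Key insight: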
Formalizes dependability as "the ability to deliver service that can justifiably be trusted", and establishes the attribute system of availability, reliability, safety, integrity, maintainability, and confidentiality, along with the fundamental chain of fault → error → failure. Recorded in this wiki for the first time as a conceptual foundation for SRE and AIOps.

## [2026-06-06] ingest-paper | Batch: bulk ingest of 7 DeepSeek family papers
- Source: `.raw/papers/` (7 PDFs: arxiv-2401.02954, arxiv-2401.14196, arxiv-2412.19437, arxiv-2501.12948, arxiv-2512.02556, arxiv-2412.10302, DeepSeek_V4)
- Summary: Ingested all 7 papers of the DeepSeek-AI model family into the wiki using 7 parallel subagents
- Papers ingested:
  1. [[@2024__arXiv__DeepSeek LLM - Scaling Open-Source Language Models with Longtermism]]: first-generation foundation model (7B/67B dense), scaling-law study
  2. [[@2024__arXiv__DeepSeek-Coder - When the Large Language Model Meets Programming]]: code-specialized LLM, FIM optimization
  3. [[@2024__arXiv__DeepSeek-V3 Technical Report]]: 671B MoE, MLA/MTP/FP8/DualPipe
  4. [[@2025__arXiv__DeepSeek-R1 - Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]: reasoning emerging from pure RL, GRPO, the "aha moment"
  5. [[@2025__arXiv__DeepSeek-V3.2 - Pushing the Frontier of Open Large Language Models]]: DSA, GRPO stabilization, synthetic agent environments
  6. [[@2024__arXiv__DeepSeek-VL2 - Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding]]: MoE-based VLM
  7. [[@2025__DeepSeek__DeepSeek-V4 - Towards Highly Efficient Million-Token Context Intelligence]]: one-million-token context with MegaMoE and CSA+HCA
- Pages created: 7 sources, 13 entities (DeepSeek-AI, DeepSeek LLM, DeepSeek-Coder, DeepSeek-V3, DeepSeek-R1, DeepSeek-R1-Zero, DeepSeek-V3.2, DeepSeek-VL2, DeepSeek-V4, Multi-head Latent Attention, DualPipe, HAI-LLM, MegaMoE, Daya Guo), 4 concepts (LLMスケーリング則, コードLLM, マルチトークン予測, ビジョン言語モデル)
- Pages updated: Mixture-of-Experts, LLM分散学習, 並列化戦略, 強化ファインチューニング, 強化学習スケーリング, テスト時計算スケーリング, LLM推論, エージェント型コーディング, オープンLLM開発, GRPO
- Key insight: Read together, the 7 DeepSeek papers show consistent technical evolution along three axes: MoE architecture (DeepSeekMoE → auxiliary-loss-free load balancing → sigmoid gating → MegaMoE), RL methods (GRPO → reasoning emerging from pure RL in R1-Zero → the four stabilization techniques of V3.2), and efficiency (KV compression with MLA → FP8 mixed precision → DualPipe → CSA+HCA hybrid compression). In particular, V4's hybrid compressed attention, which reduces the KV cache to about 2% of the BF16 GQA8 baseline, extends the design space of LLM inference as a structural way to make one-million-token contexts practical.

## [2026-06-06] ingest | Rethinking Serverless Architecture (blog.yuuk.io)
- Source: `.raw/articles/rethinking-serverless-architecture-2019-09-11.md`
- Summary: [[@2019__yuuk.io__Rethinking-Serverless-Architecture]]
- Pages created: [[@2019__yuuk.io__Rethinking-Serverless-Architecture]], [[サーバーレスアーキテクチャ]]
- Pages updated: [[Yuuki Tsubouchi]], [[index]], [[hot]], [[log]], manifest
- Key insight: Redefines the essence of serverless as "not having to be aware of the server as a unit", and clarifies the structure in which FaaS hides the network server and BaaS hides the machine server. The pattern of chaining BaaS components with FaaS as glue (the "Pythagora Switch" configuration) connects directly to the design principle of the author's HeteroTSDB.

## [2026-06-06] ingest | Thoughts on SRE in 2019 (blog.yuuk.io)
- Source: `.raw/articles/thinking-sre-2019-01-16.md`
- Summary: [[@2019__yuuk.io__2019-SRE-Thinking]]
- Pages created: [[@2019__yuuk.io__2019-SRE-Thinking]]
- Pages updated: [[Yuuki Tsubouchi]], [[SRE]], [[index]], [[hot]], [[log]], manifest
- Key insight: By redefining SRE teleologically as "a technique of control", the post puts the core of the error budget into plain words. Its 2019 thesis of moving from craft to engineering forms the through-line of the author's thinking, leading to the 2024 LLM4SRE survey.

## [2026-06-06] ingest-paper | Batch: 9 microservice reliability papers
- Source: `.raw/papers/` (9 PDFs)
- Summary: Bulk ingest of 9 papers on microservice reliability, observability, incident management, and distributed tracing
- Papers
ingested:
  1. [[@2021__SoCC__Characterizing Microservice Dependency and Performance]]: Alibaba trace analysis, quantifying microservice dependencies
  2. [[@2022__SoCC__How to Fight Production Incidents]]: empirical study of 152 incidents in Microsoft's large-scale cloud
  3. [[@2024__PACMCAS__The Tale of Errors in Microservices]]: large-scale analysis of non-fatal RPC errors at Uber
  4. [[@2023__USENIX ATC__Lifting the veil on Meta's microservice architecture]]: first public analysis of Meta's microservice topology
  5. [[@2024__KDD__Microservice Root Cause Analysis with Limited Observability]]: LatentScope, an RCA method under limited observability
  6. [[@2023__SIGCOMM__Network-Centric Distributed Tracing with DeepFlow]]: DeepFlow, eBPF-based zero-code distributed tracing
  7. [[@2021__ESEC-FSE__Identifying Bad Software Changes via Multimodal Anomaly Detection]]: SCWarn, detection of bad changes with a multimodal LSTM
  8. [[@2022__USENIX ATC__CRISP - Critical Path Analysis of Large-Scale Microservice Architectures]]: CRISP, Uber's critical path analysis
  9. [[@2023__SOSP__A Cloud-Scale Characterization of Remote Procedure Calls]]: RPC characterization at Google scale
- Pages created: 9 sources, 35 entities, 8 concepts (マイクロサービスコールグラフ, マイクロサービスアーキテクチャ, 非致命的RPCエラー, 暗黙のコンテキスト伝搬, 限定観測可能性, RPC規模特性, RPCレイテンシ特性, + system entities)
- Pages updated: 分散トレーシング, 根本原因分析, 異常検知, Fault Localization, マルチモーダル障害診断, ソフトウェア変更管理, 運用障害分析, インシデント管理, eBPF, 動的インストルメンテーション
- Key insight: Microservice failure characteristics can now be compared across production data from four companies (Alibaba, Microsoft, Uber, Google), and a quantitative picture of industry practice emerges: the large majority of errors are non-fatal, more than 90% of incidents are mitigated without a code change, and RPC latency is on the millisecond scale.

## [2026-06-06] concept-create | Created the SRE concept page
- Summary: Created the umbrella concept page for [[SRE]] (Site Reliability Engineering)
- Pages created: [[SRE]]
- Pages updated: [[index]], [[hot]]
- Key insight: Gathers the system of principles in the SRE Book (2016) under an umbrella concept and records, as cross-cutting findings, the links from the automation hierarchy to SRE AI Autonomy Levels, from the aviation analogy to the ironies of automation, and from playbooks to the advantage of procedural demonstrations in agentic SRE.

## [2026-06-06] batch-ingest | Bulk ingest of 10 Google SRE Book chapters
- Source: "Site Reliability Engineering: How Google Runs Production Systems" (O'Reilly, 2016), edited by [[Betsy Beyer]], Chris Jones, Jennifer Petoff, [[Niall Murphy]]
- Summary: SRE Book Foreword, Preface, Chapter
1 (Introduction), Chapter 3 (Embracing Risk), Chapter 4 (SLO), Chapter 5 (Eliminating Toil), Chapter 6 (Monitoring), Chapter 7 (Automation), Part III (Practices), and Chapter 34 (Conclusion), 10 chapters in total, were each turned into a source page, and the main entities and concepts were added to the wiki
- Pages created:
  - **Sources (10)**: [[@2016__OReilly__SRE Book - Foreword]], [[@2016__OReilly__SRE Book - Preface]], [[@2016__OReilly__SRE Book - Chapter 1 Introduction]], [[@2016__OReilly__SRE Book - Chapter 3 Embracing Risk]], [[@2016__OReilly__SRE Book - Chapter 4 Service Level Objectives]], [[@2016__OReilly__SRE Book - Chapter 5 Eliminating Toil]], [[@2016__OReilly__SRE Book - Chapter 6 Monitoring Distributed Systems]], [[@2016__OReilly__SRE Book - Chapter 7 Automation at Google]], [[@2016__OReilly__SRE Book - Part III Practices]], [[@2016__OReilly__SRE Book - Chapter 34 Conclusion]]
  - **Entities (5)**: [[SRE Book]], [[Ben Treynor Sloss]], [[Betsy Beyer]], [[Niall Murphy]], [[Margaret Hamilton]]
  - **Concepts (2)**: [[エラーバジェット]], [[トイル]]
- Pages updated: [[サービスレベル目標]] (cross-cutting findings and source added), [[自動化のアイロニー]] (cross-cutting findings and source added), [[agentic SRE]] (cross-cutting findings and source added), [[index]], [[concepts/_index]], [[entities/_index]], [[sources/_index]]
- Key insight: The principles systematized by the SRE Book (2016), namely error budgets, the 50% rule, playbooks, change management, the service reliability hierarchy, and the five levels of automation, are exactly the task structure that today's agentic SRE is trying to automate, and they correspond directly to the five layers of Bainbridge's (1983) ironies of automation. The SRE Book's automation hierarchy (manual → fully autonomous) is the direct precursor of [[SRE AI Autonomy Levels]] (L0–L4).

## [2026-06-06] ingest-paper | Classic distributed database and storage papers: batch ingest of 5 papers
- Source: `.raw/papers/Stonebraker-and-Cetintemel-2005---One-Size-Fits-All---An-Idea-Whose-Time-Has-Come-and-Gone.pdf`, `.raw/papers/bigtable-osdi06.pdf`, `.raw/papers/amazon-dynamo-sosp2007.pdf`, `.raw/papers/Stonebraker-et-al.-2007---The-End-of-an-Architectural-Era-Its-Time-for-a-Complete-Rewrite.pdf`, `.raw/papers/lakshman-ladis2009.pdf`
- Summary: Ingested five classic papers on distributed databases and storage into the wiki using parallel subagents. Starting from Stonebraker's "One Size Fits All" (ICDE 2005), the batch brings together Bigtable (OSDI 2006), Dynamo (SOSP 2007), H-Store (VLDB 2007), and Cassandra (SIGOPS OSR 2010)
- Pages
created:
  - **Sources (5)**: [[@2005__ICDE__One Size Fits All - An Idea Whose Time Has Come and Gone]], [[@2006__OSDI__Bigtable - A Distributed Storage System for Structured Data]], [[@2007__SOSP__Dynamo - Amazon's Highly Available Key-value Store]], [[@2007__VLDB__The End of an Architectural Era (It's Time for a Complete Rewrite)]], [[@2010__SIGOPS_OSR__Cassandra - A Decentralized Structured Storage System]]
  - **Entities (22)**: [[Michael Stonebraker]], [[Ugur Cetintemel]], [[MIT]], [[Brown University]], [[Jeffrey Dean]], [[Sanjay Ghemawat]], [[Bigtable]], [[Google File System]], [[Chubby]], [[Werner Vogels]], [[Giuseppe DeCandia]], [[Amazon]], [[Dynamo]], [[Samuel Madden]], [[Daniel J. Abadi]], [[Stavros Harizopoulos]], [[Pat Helland]], [[H-Store]], [[Avinash Lakshman]], [[Prashant Malik]], [[Apache Cassandra]], [[Facebook]]
  - **Concepts (6)**: [[専用データベースシステム]], [[結果整合性]], [[一貫性ハッシュ法]], [[分散ストレージ]], [[ゴシッププロトコル]], [[LSMツリー]]
- Pages updated: [[Google]], [[インターネットスケールサービス設計]]
- Key insight: Four systems concretely test Stonebraker's "one size fits all" critique (2005) along different workload axes: Bigtable (distributed KV / multidimensional map), Dynamo (eventually consistent KV), H-Store (main-memory OLTP, 82×), and Cassandra (Dynamo+Bigtable hybrid). Avinash Lakshman is an author of both Dynamo (Amazon) and Cassandra (Facebook), making him a node that embodies the transfer of design knowledge.

## [2026-06-06] ingest-paper | Classic system reliability and automation papers: batch ingest of 4 papers
- Source: `.raw/papers/gray-why-do-computers-stop-85.pdf`, `.raw/papers/usits03.pdf`, `.raw/papers/hamilton.pdf`, `.raw/papers/IroniesofAutomation_Bainbridge_1983.pdf`
- Summary: Ingested four classic papers into the wiki using parallel subagents
- Pages created:
  - **Sources (4)**: [[@1983__Automatica__Ironies of Automation]], [[@1985__Tandem__Why Do Computers Stop and What Can Be Done About It]], [[@2003__USITS__Why Do Internet Services Fail and What Can Be Done About It]], [[@2007__LISA__On Designing and Deploying Internet-Scale Services]]
  - **Entities (10)**: [[Jim Gray]], [[Tandem Computers]], [[NonStop]], [[David Oppenheimer]], [[Archana Ganapathi]], [[David A. 
Patterson]], [[UC Berkeley ROC Project]], [[James Hamilton]], [[Lisanne Bainbridge]], [[University College London]]
  - **Concepts (6)**: [[自動化のアイロニー]], [[ソフトウェア耐障害性]], [[Heisenbug]], [[プロセスペア]], [[運用障害分析]], [[インターネットスケールサービス設計]]
- Pages updated: [[耐障害LLM訓練]] (2 cross-cutting findings), [[チェックポイント]] (1 cross-cutting finding), [[GPUレジリエンス]] (related added), [[インシデント管理]] (cross-cutting finding added), [[障害注入]] (cross-cutting finding added), [[根本原因分析]] (cross-cutting finding added), [[Microsoft]] (Hamilton added), [[サービスレベル目標]] (cross-cutting finding added)
- Key insight: Batch ingest of four classic papers on system reliability and automation from 1983 to 2007. Two threads form the theoretical foundation of the wiki. One is the continuity of findings over two decades: Gray (1985), "administration 42%, software 25%, hardware 18%" → Oppenheimer (2003), "operator error is the largest cause" → Hamilton (2007), "80% of operational problems originate in design". The other is Bainbridge's (1983) ironies of automation, a structural paradox that still underlies modern agentic SRE. (source +4, entity +10, concept +6)

## [2026-06-06] wiki-ingest-paper | OLMo 3 (arXiv:2512.13961)
- Source: `.raw/papers/arxiv-2512.13961.pdf`
- Summary: Ingested via wiki-ingest-paper the technical report (118 pages) for OLMo 3, the fully open LLM family from the Allen Institute for AI (AI2)
- Pages created:
  - **Sources (1)**: [[@2025__arXiv__OLMo 3]]
  - **Entities (9)**: [[Allen Institute for AI]], [[OLMo 3]], [[Dolma 3]], [[OlmoRL]], [[OlmoBaseEval]], [[olmOCR]], [[Duplodocus]], [[Dolci]]
  - **Concepts (1)**: [[オープンLLM開発]]
- Pages updated: [[強化ファインチューニング]] (2 cross-cutting findings added: OlmoRL's KL removal + asynchronous pipeline, and Delta Learning's capability-delta framework; related and source added), [[University of Washington]] (relation to OLMo 3 added)
- Key insight: OLMo 3 champions "releasing the entire model flow" and is the first SOTA-class LLM to publish checkpoints from every stage, data mixes (including the source pools), code, and training logs. It offers four variants (Base, Think, Instruct, RL-Zero) of 7B/32B decoder-only Transformers (SWA in 3/4 of the layers). OLMo 3.1 Think 32B is the strongest fully open model at MATH 96.2 and AIME 2024 80.6, approaching Qwen 3 32B. OlmoRL reaches 4× throughput with seven GRPO-based improvements (no KL, token-level loss, and others) and a fully asynchronous pipeline (DeepSpeed + vLLM). Delta Learning extends the reasoning frontier even after SFT saturates, using DPO on contrastive pairs with a large capability delta (Qwen 3 32B/0.6B). RL-Zero is the first clean benchmarking environment that makes the influence of pretraining data on RL traceable. 1024 H100 GPUs, 56 days, $2.75M. (source +1, entity +9, concept +1)

## [2026-06-06] wiki-ingest-paper | Composer 2 Technical Report (arXiv:2603.24477)
- Source: `.raw/papers/arxiv-2603.24477.pdf`
- 
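Summary: Ingested via wiki-ingest-paper the technical report for Composer 2, Cursor Research's agentic coding model
- Pages created:
  - **Sources (1)**: [[@2026__arXiv__Composer 2 Technical Report]]
  - **Entities (7)**: [[Cursor Research]], [[Composer 2]], [[CursorBench]], [[Anyrun]], [[Fireworks AI]], [[ThunderKittens]], [[DeepEP]]
  - **Concepts (1)**: [[エージェント型コーディング]]
- Pages updated: [[エージェント型強化学習]] (Composer 2 added to related), [[強化ファインチューニング]] (1 cross-cutting finding added: Pareto optimality of the pipeline from domain-specialized pretraining to large-scale asynchronous RL; related added)
- Key insight: Composer 2 takes a Kimi K2.5-based 1.04T/32B MoE through code-specialized continued pretraining (confirming a log-linear correlation between perplexity and downstream RL reward), then trains it with large-scale asynchronous RL using a Dr. GRPO variant (four separated services: training/environment/inference/evaluation). It reaches cost-accuracy Pareto optimality with CursorBench 61.3, SWE-bench Multi 73.7, and Terminal-Bench 61.7. It presents evidence that RL improves average and best-of-K performance at the same time, refuting the concern that "RL merely redistributes probability over already-known paths". Its RL innovations include a self-summarization mechanism for long horizons, a nonlinear length penalty, MoE routing replay, and NVFP4 per-token scaling. The infrastructure uses Anyrun (Firecracker VM), Fireworks AI's geographically distributed inference, DeepEP expert parallelism, and ThunderKittens GPU kernels. (source +1, entity +7, concept +1)

## [2026-06-06] wiki-ingest-paper | Kimi-Researcher (moonshotai.github.io)
- Source: `https://moonshotai.github.io/Kimi-Researcher/` (no PDF, project page only)
- Summary: Ingested via wiki-ingest-paper the project page for [[Kimi-Researcher]], [[Moonshot]]'s autonomous research agent
- Pages created:
  - **Sources (1)**: [[@2025__Moonshot AI__Kimi-Researcher - End-to-End RL Training for Emerging Agentic Capabilities]]
  - **Entities (1)**: [[Kimi-Researcher]]
- Pages updated: [[Moonshot]] (Kimi-Researcher paragraph added to the overview; related and source added), [[エージェント型強化学習]] (5 cross-cutting findings and 2 open questions added), [[強化ファインチューニング]] (source added), [[sources/_index]], [[entities/_index]], [[index]], [[hot]], [[log]]
- Key insight: Kimi-Researcher trains a research agent with end-to-end REINFORCE alone, using no SFT at all, and reaches HLE Pass@1 26.9% (up from an initial 8.6% through RL alone) and xbench-DeepSearch 69%. Three technical innovations stand out: (1) a gamma-decayed reward r × γ^(T-i) that approximates step-level credit assignment, (2) a context management mechanism that extends a single rollout from 10 to 50+ iterations, and (3) turn-level partial rollouts (using a replay buffer) for a speedup of 1.5× or more. Self-correction of conflicting information and additional verification behavior emerge from RL alone. The work complements Agent-R1's modular framework, DeepSWE's finding that SFT is unnecessary, and the IsoCompute Playbook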
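replay buffer design, and confirms across domains that RL without SFT is also effective in the search-agent domain. (source +1, entity +1)

## [2026-06-06] wiki-ingest-paper | MiniMax-M1 (arXiv:2506.13585)
- Source: `.raw/papers/arxiv-2506.13585.txt`
- Summary: Ingested via wiki-ingest-paper the technical report for MiniMax-M1 (456B parameters, one-million-token context), MiniMax's hybrid-attention reasoning model
- Pages created:
  - **Sources (1)**: [[@2025__arXiv__MiniMax-M1 - Scaling Test-Time Compute Efficiently with Lightning Attention]]
  - **Entities (5)**: [[MiniMax-M1]], [[MiniMax-Text-01]], [[CISPO]], [[Lightning Attention]], [[SynLogic]]
  - **Concepts (1)**: [[テスト時計算スケーリング]]
- Pages updated: [[MiniMax]] (M1 information added), [[強化学習スケーリング]] (1 cross-cutting finding added; open questions updated), [[強化ファインチューニング]] (1 cross-cutting finding and source added), [[sources/_index]] (M1 entry added)
- Key insight: MiniMax-M1 is the first open-weight large-scale hybrid-attention reasoning model, and it improves the efficiency of test-time compute scaling at the root through architectural design. Lightning Attention (7:1 hybrid) cuts FLOPs when generating 100K tokens to 25% of DeepSeek R1's, and the in-house RL algorithm CISPO achieves twice the step efficiency of DAPO. The multiplicative effect of architectural and algorithmic efficiency kept the whole RL run to 512 GPUs, three weeks, and $534K. The finding that test-time compute scaling and RL training compute scaling are coupled is systematized in a new concept page. (source +1, entity +5, concept +1)

## [2026-06-06] wiki-ingest-paper | Kimi K1.5 (arXiv:2501.12599)
- Source: `.raw/papers/arxiv-2501.12599.pdf`
- Summary: Ingested via wiki-ingest-paper the technical report for Kimi K1.5, the RL-trained multimodal LLM from Moonshot (月之暗面)
- Pages created:
  - **Sources (1)**: [[@2025__arXiv__Kimi K1.5 - Scaling Reinforcement Learning with LLMs]]
  - **Entities (2)**: [[Kimi K1.5]], [[Mooncake]]
- Pages updated: [[Moonshot]] (Kimi K1.5 information added), [[vLLM]] (hybrid deployment added), [[強化学習スケーリング]] (2 cross-cutting findings and 1 open question added: context-length scaling axis, partial rollouts), [[強化ファインチューニング]] (2 cross-cutting findings added: three ways to eliminate the value function, systematization of long2short)
- Key insight: Kimi K1.5 positions context length as a third scaling dimension for RL alongside model size and data volume, and substantially improves reasoning performance by extending to 128k. Partial rollouts (splitting long trajectories and reusing them across iterations) are an implementation technique that curbs the compute explosion of long-context RL, and they complement the rollout-count optimization of the IsoCompute Playbook. The design of optimizing the policy with only an online mirror descent variant and no value function combines with ScaleRL's CISPO and DeepSWE's GRPO++ to form three orthogonal approaches to "value-function-free RL". The four-route systematization of long2short methods is the first systematic treatment of the trade-off between test-time compute and token efficiency. (source +1, entity +2)

## 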
[2026-06-06] wiki-ingest-paper | Nemotron 3 (arXiv:2512.20856)
- Source: `.raw/papers/arxiv-2512.20856.pdf`
- Summary: Ingested via wiki-ingest-paper the technical report for Nemotron 3, NVIDIA's open LLM family
- Pages created:
  - **Sources (1)**: [[@2025__arXiv__Nemotron 3 - Efficient and Open Intelligence]]
  - **Entities (3)**: [[Nemotron 3]], [[LatentMoE]], [[NeMo-RL]]
- Pages updated: [[NVIDIA]] (Nemotron 3 related links and body text added), [[Mixture-of-Experts]] (1 cross-cutting finding added: LatentMoE vs FAST in handling the All-to-All bottleneck), [[強化ファインチューニング]] (1 cross-cutting finding added: simultaneous multi-environment RL vs sequential training)
- Key insight: Nemotron 3 integrates four innovations: (1) a hybrid Mamba-2–Transformer MoE that avoids the linear growth of the self-attention KV cache for 3.3× inference throughput; (2) LatentMoE, which projects to a latent dimension ℓ < d to cut expert communication volume by a factor of d/ℓ and spends the freed budget on raising the number of experts from 128 to 512 (MMLU-Pro +4.57pp); (3) NVFP4 (E2M1 + 16-element micro-block scaling), which stabilizes pretraining on 25T tokens while keeping the loss gap to BF16 under 1%; (4) simultaneous multi-environment RL (GRPO + masked importance sampling + asynchronous RL architecture) that jointly optimizes math, code, tool use, and long context (up to one million tokens). Against the sequential five-domain training of Scaling Up RL, Nemotron 3 offers a design counterpoint that suppresses interference through joint optimization. (source +1, entity +3)

## [2026-06-06] wiki-ingest-paper | MiniMax-M2 (arXiv:2605.26494)
- Source: `.raw/papers/arxiv-2605.26494.pdf`
- Summary: Ingested via wiki-ingest-paper the technical report for the MiniMax-M2 series (229.9B/9.8B), MiniMax's MoE language model family
- Pages created:
  - **Sources (1)**: [[@2026__arXiv__The MiniMax-M2 Series - Mini Activations Unleashing Max Real-World Intelligence]]
  - **Entities (3)**: [[MiniMax]], [[MiniMax-M2]], [[Forge]]
  - **Concepts (1)**: [[エージェントネイティブ RL]]
- Pages updated: [[Mixture-of-Experts]] (1 cross-cutting finding added: sigmoid gating + expert bias), [[エージェント型強化学習]] (3 cross-cutting findings added: industrial-scale agentic RL, mixed-domain RL, self-evolution)
- Key insight: MiniMax-M2 integrates three innovations to realize "mini activations → max intelligence": (1) an MoE architecture with 256 fine-grained experts + sigmoid gating that removes the dependence on auxiliary losses, (2) Forge's agent-native RL (unified white-box/black-box, Windowed FIFO, prefix-tree merging 40×), and (3) M2.7's self-evolution (autonomous debugging of training runs, scaffold fixes, 100 rounds of autonomous iteration). With about 10B active parameters it performs on par with Opus 4.6/GPT 5.4/Gemini 3.1 Pro, and on Multi-SWE-bench (52.7) in particular it is the best among the compared models. Mixed-domain RL (reasoning, coding, agentic, and general: 4
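domains optimized jointly) and interleaved thinking (Plan-Act-Reflect) set out design principles for industrial implementations of agentic RL. (source +1, entity +3, concept +1)

## [2026-06-06] wiki-ingest-paper | Kimi K2: Open Agentic Intelligence (arXiv:2507.20534)
- Source: `.raw/papers/arxiv-2507.20534.txt`
- Summary: Ingested via wiki-ingest-paper the technical report for [[Kimi K2]], [[Moonshot AI]]'s 1.04-trillion-parameter MoE LLM
- Pages created:
  - **Sources (1)**: [[@2025__arXiv__Kimi K2 - Open Agentic Intelligence]]
  - **Entities (3)**: [[Moonshot AI]], [[Kimi K2]], [[MuonClip]]
- Pages updated: [[Mixture-of-Experts]] (cross-cutting finding on the sparsity scaling law added), [[エージェント型強化学習]] (2 cross-cutting findings and 1 open question added), [[強化ファインチューニング]] (cross-cutting finding on self-critique rubric rewards added), [[LLM分散学習]] (cross-cutting finding on MuonClip + the distributed checkpoint engine added), [[並列化戦略]] (cross-cutting finding on the PP+EP+ZeRO-1 DP configuration and the decision not to adopt DualPipe added)
- Key insight: Kimi K2 pretrains on 15.5 trillion tokens without loss spikes using MuonClip (Muon + QK-Clip), and reaches 65.8% on SWE-bench Verified through agentic data synthesis with over 3,000 MCP tools plus over 20,000 synthetic tools and a unified RL stage that combines RLVR with self-critique rubric rewards. It demonstrates a sparsity scaling law (loss falls as the number of experts grows at fixed active parameters). It parallelizes with 16-way PP + 16-way EP + ZeRO-1 DP and does not adopt DualPipe because of operational complexity. (source +1, entity +3)

## [2026-06-06] wiki-ingest | Cursor Composer 2.5 blog post
- Source: `.raw/articles/composer-2-5-cursor-2026-06-06.md`
- Summary: Ingested via wiki-ingest the announcement blog post for Composer 2.5, Cursor's coding agent model
- Pages created:
  - **Sources (1)**: [[@2026__Cursor__Introducing Composer 2.5]]
  - **Entities (6)**: [[Cursor]], [[Kimi K2.5]], [[Moonshot]], [[SpaceXAI]], [[Colossus 2]], [[Sharded Muon]]
- Pages updated: [[強化ファインチューニング]] (2 cross-cutting findings and 1 open question added), [[エージェント型強化学習]] (1 cross-cutting finding added)
- Key insight: Cursor trained Composer 2.5 with targeted RL (on-policy distillation that inserts text hints at specific points in a trajectory) and a 25× expansion of synthetic tasks (feature-deletion based). In contrast to DeepSWE's binary-reward-only approach, it shows that dense local feedback is effective for industrial coding agents. The concrete reward-hacking examples (reverse-engineering a Python type cache, decompiling Java bytecode) are the first public industry cases showing that the phenomenon RFT-FM classifies as a training failure is also "the flip side of advanced problem-solving ability". The base model is Moonshot's Kimi K2.5; the next generation is under development on SpaceXAI's Colossus 2 (equivalent to one million H100s). (source +1, entity +6)

## [2026-06-05] batch-ingest-paper | RL Scaling & Agentic RL: 10 papers
- Source: 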
`.raw/papers/arxiv-{2510.13786,2509.25300,2603.12151,2507.12507,2512.22857,2511.14460,2510.04206,2509.02547,2508.03501}.pdf` + `together.ai/blog/deepswe`
- Summary: Batch ingest of 4 papers on RL scaling for LLMs + 6 papers on agentic RL
- Pages created:
  - **Sources (10)**: [[@2025__arXiv__The Art of Scaling Reinforcement Learning Compute for LLMs]], [[@2025__arXiv__Scaling Behaviors of LLM Reinforcement Learning Post-Training]], [[@2026__arXiv__IsoCompute Playbook - Optimally Scaling Sampling Compute for LLM RL]], [[@2025__arXiv__Scaling Up RL - Unlocking Diverse Reasoning in LLMs via Prolonged Training]], [[@2025__Together AI__DeepSWE - Training a Fully Open-sourced State-of-the-Art Coding Agent by Scaling RL]], [[@2025__arXiv__AutoForge - Environment Synthesis for Agentic RL]], [[@2025__arXiv__Agent-R1 - Training Agents with End-to-End RL]], [[@2025__arXiv__AgentRL - Training Language Model Agents with Reinforcement Learning]], [[@2025__arXiv__The Landscape of Agentic Reinforcement Learning]], [[@2025__arXiv__Training Long-Context Multi-Turn SWE Agents with Reinforcement Learning]]
  - **Concepts (2)**: [[強化学習スケーリング]], [[エージェント型強化学習]]
  - **Entities (~50)**: [[Devvrit Khatri]], [[Rishabh Agarwal]], [[ScaleRL]], [[PipelineRL]], [[UT Austin]], [[Zelin Tan]], [[Chen Zhang (Shanghai AI Lab)]], [[Zhenfei Yin]], [[VeRL]], [[GRPO]], [[Aviral Kumar]], [[Zhiting Hu]], [[MBZUAI]], [[Mingjie Liu]], [[Yejin Choi]], [[NVIDIA]], [[Nemotron-Research-Reasoning-Qwen-1.5B]], [[DeepSWE]], [[Together AI]], [[Agentica]], [[Ion Stoica]], [[Raluca Ada Popa]], [[rLLM]], [[R2E-Gym]], [[SWE-Bench-Verified]], [[Michael Luo]], [[Naman Jain]], [[AutoForge]], [[Tongyi Lab]], [[Fuli Feng]], [[Agent-R1]], [[AgentRL]], [[Hanchen Zhang]], [[Xiao Liu]], [[Yuxiao Dong]], [[Z.AI]], [[AgentBench]], [[Guibin Zhang]], [[Heng Ji]], [[Alexander Golubev]], [[Nebius AI]], [[Boris Yangel]], [[Humanoid]], and others
- Pages updated: [[強化ファインチューニング]], [[強化学習スケーリング]], [[エージェント型強化学習]]
- Key insight: In LLM RL scaling, the power law (Scaling
Behaviors) and sigmoid saturation (ScaleRL/IsoCompute) are complementary, and the effects of design choices split into "those that raise the asymptotic performance A" and "those that only modulate the compute efficiency B". In agentic RL, a formal boundary has been established between PBRFT (degenerate MDP, T=1) and Agentic RL (POMDP, T>1), and multi-turn, multi-task cross-policy sampling together with task advantage normalization are the keys to training general-purpose agents.

## [2026-06-06] ingest-paper | Characterizing Modern GPU Resilience and Impact in HPC Systems: A Case Study of A100 GPUs
- Source: `.raw/papers/Characterizing_Modern_GPU_Resilience_and_Impact_in_HPC_Systems_A_Case_Study_of_A100_GPUs.pdf`
- Summary: [[@2025__DSN-W__Characterizing Modern GPU Resilience and Impact in HPC Systems - A Case Study of A100 GPUs]]
- Pages created: [[@2025__DSN-W__Characterizing Modern GPU Resilience and Impact in HPC Systems - A Case Study of A100 GPUs]], [[Archit Patke]], [[Ziheng Chen]], [[Aditya Ranjan]], [[Hung Nguyen]], [[Phuong Cao]], [[Brett Bode]], [[Gregory Bauer]], [[Chandra Narayanaswami]], [[Daby Sow]], [[Catello Di Martino]]
- Pages updated: [[GPUレジリエンス]], [[GPUクラスタ運用]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]], [[log]]
- Key insight: On the A100 taken alone, memory recovery mechanisms (row remapping/error containment) absorb uncorrectable memory errors during operation, and the weak points lie in non-memory hardware such as GSP, PMU SPI, MMU, and NVLink. This is the baseline for the weak-point reversal between generations that shows up in the later H100/A100 comparison: "hardware is the A100's weak point, memory is the H100's".

## [2026-06-06] ingest | Understanding Workload Characteristics in Large Language Model Development
- Source: `.raw/articles/understanding-workload-characteristics-large-language-model-development-2026-06-06.md`
- Summary: [[@2024__USENIX login Online__Understanding Workload Characteristics in Large Language Model Development]]
- Pages created: [[@2024__USENIX login Online__Understanding Workload Characteristics in Large Language Model Development]], [[Qinghao Hu]], [[Tianwei Zhang]], [[Acme]], [[InternEvo]]
- Pages updated: [[Peng Sun]], [[Shanghai AI Laboratory]], [[GPUクラスタ運用]], [[LLM分散学習]], [[並列化戦略]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: In LLM-dedicated clusters, short related jobs such as evaluation dominate the job count, a small number of pretraining jobs dominate GPU
time, and infrastructure failures dominate the cost of failure. [[Acme]] is an operational reference point that links Philly's pre-LLM DNN cluster with the LLM training measurements of SAKURAONE/MegaScale.

## [2026-06-06] ingest-paper | Revisiting Reliability in Large-Scale Machine Learning Research Clusters
- Source: `.raw/papers/ieee-10946752-revisiting-reliability-ml-research-clusters.pdf`
- Summary: [[@2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- Pages created: [[@2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]], [[Apostolos Kokolis]], [[Michael Kuchnik]], [[Carole-Jean Wu]], [[Meta AI Research SuperCluster]]
- Pages updated: [[Meta]], [[GPUクラスタ運用]], [[LLM分散学習]], [[耐障害LLM訓練]], [[チェックポイント]], [[集合通信]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]], [[log]]
- Key insight: Even in LLM-era multi-tenant research clusters, small jobs dominate the job count while large jobs dominate GPU time, failure impact, and secondary preemption. MTTF is roughly inversely proportional to the number of GPUs, and at the 100,000-GPU scale, checkpoint/restart on the order of minutes becomes a necessary condition for ETTR.

## [2026-06-06] ingest-paper | Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
- Source: `.raw/papers/atc19-jeon.pdf`
- Summary: [[@2019__USENIX ATC__Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads]]
- Pages created: [[@2019__USENIX ATC__Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads]], [[Myeongjae Jeon]], [[Shivaram Venkataraman]], [[Amar Phanishayee]], [[Junjie Qian]], [[Wencong Xiao]], [[Fan Yang]], [[UNIST]], [[University of Wisconsin]], [[Philly]], [[philly-traces]], [[GPUクラスタスケジューリング]]
- Pages updated: [[Microsoft]], [[Beihang University]], [[GPUクラスタ運用]], [[LLM分散学習]], [[並列化戦略]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: Even in pre-LLM DNN training clusters, gang scheduling, locality, colocation interference, and GPU time wasted by failed jobs were already the central problems. The [[Philly]] trace of 75 days and 96,260 jobs is the prehistory of the problems that modern [[LLM分散学習]] reorganizes as SER.

## [2026-06-06] ingest-paper | Pretraining LLMs at Scale: Tuning Strategies and Performance Portability
- Source: 
`.raw/papers/2026_Unknown_Pretraining_LLMs_Scale_Tuning_Strategies.pdf`
- Summary: [[@2025__PMBS__Pretraining LLMs at Scale - Tuning Strategies and Performance Portability]]
- Pages created: [[@2025__PMBS__Pretraining LLMs at Scale - Tuning Strategies and Performance Portability]], [[Adrián Pérez Diéguez]], [[Qualcomm]], [[性能可搬性]]
- Pages updated: [[LLM分散学習]], [[並列化戦略]], [[DeepSpeed]], [[NCCL]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: The Efficiency of LLM pretraining depends strongly not only on global cluster co-design but also on local tuning that measures the DeepSpeed/ZeRO/NCCL defaults and departs from them. For Model-2 (8B), ZeRO Stage 2, batch_size 128, and grad_acc 2 were best on all three platforms, giving up to a 1.6× speedup over the default configuration.

## [2026-06-05] ingest | The Landscape of Agentic Reinforcement Learning for LLMs (TMLR 2026, Zhang+ Oxford/Shanghai AI Lab)
- Source: `.raw/papers/arxiv-2509.02547.txt`
- Summary: [[@2025__arXiv__The Landscape of Agentic Reinforcement Learning]]
- Pages created: [[@2025__arXiv__The Landscape of Agentic Reinforcement Learning]], [[Heng Ji]]
- Pages updated: [[Guibin Zhang]] (source link corrected to the official title), [[Zhenfei Yin]] (added as corresponding author of the survey), [[Lei Bai]] (added as corresponding author of the survey), [[Philip Torr]] (added as senior author of the survey), [[エージェント型強化学習]] (POMDP formalization added to the definition; 3 cross-cutting findings added: dual taxonomy, amplifier vs new-knowledge debate, TIR evolution axis; 2 open questions added), [[強化ファインチューニング]] (formal boundary between PBRFT and Agentic RL added to cross-cutting findings), [[強化学習スケーリング]] (four-axis scaling summary for agentic RL added to cross-cutting findings), [[sources/_index]], [[entities/_index]], [[log]], [[hot]]
- Key insight: The first comprehensive survey to formally distinguish PBRFT (degenerate MDP, T=1) from Agentic RL (POMDP, T>1) in MDP/POMDP terms. It organizes more than 500 papers under a dual taxonomy of capabilities × tasks and quantifies the debate over RL's mechanism as roughly 2/3 amplifier vs roughly 1/3 new knowledge.

## [2026-06-05] ingest | Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (arXiv 2025, NVIDIA)
- Source: `.raw/papers/arxiv-2507.12507.txt`
- Summary: [[@2025__arXiv__Scaling Up RL - Unlocking Diverse Reasoning in LLMs via Prolonged Training]]
- Pages created: [[@2025__arXiv__Scaling Up RL - Unlocking Diverse Reasoning in LLMs via Prolonged Training]], [[Mingjie Liu]], [[Yejin Choi]], 
[[NVIDIA]], [[Nemotron-Research-Reasoning-Qwen-1.5B]]
- Pages updated: [[VeRL|verl]] (use in NVIDIA's prolonged RL training added), [[強化学習スケーリング]] (2 cross-cutting findings added: complementarity of scaling laws and practical recipes; generalization and limits of multi-domain training), [[強化ファインチューニング]] (long-term stabilizing effect of KL regularization added to cross-cutting findings), [[sources/_index]], [[entities/_index]], [[log]], [[hot]]
- Key insight: A systematic investigation in which NVIDIA applies prolonged RL to a 1.5B model on verifiable-reward tasks in five domains. It adds KL regularization and reference policy resets to GRPO + DAPO extensions and completes eight sequential training runs (hard resets) in about 16,000 GPU hours. It shows entropy collapse when KL is removed, and shows that multi-domain joint training is competitive with domain-specialized models.

## [2026-06-05] ingest | The Art of Scaling Reinforcement Learning Compute for LLMs (arXiv 2025, Meta/UT Austin)
- Source: `.raw/papers/arxiv-2510.13786.txt`
- Summary: [[@2025__arXiv__The Art of Scaling Reinforcement Learning Compute for LLMs]]
- Pages created: [[@2025__arXiv__The Art of Scaling Reinforcement Learning Compute for LLMs]], [[Devvrit Khatri]], [[Rishabh Agarwal]], [[ScaleRL]], [[PipelineRL]], [[UT Austin]]
- Pages updated: [[強化学習スケーリング]] (sigmoid model added to the definition; 3 cross-cutting findings and 2 open questions added), [[強化ファインチューニング]] (ScaleRL added to sources), [[sources/_index]], [[entities/_index]], [[log]], [[hot]]
- Key insight: The first large-scale systematic study of RL compute scaling for LLMs, by Khatri, Madaan, Agarwal, and colleagues at Meta/UT Austin/UC Berkeley/Harvard/Periodic Labs. Using ablations totaling more than 400,000 GPU hours, it proposes a predictive framework that separates the asymptotic performance A and the compute efficiency B of a sigmoidal saturation curve. From design choices along six axes it builds the unified recipe ScaleRL, reaching A=0.61 at 8B (surpassing GRPO at 0.45 and DAPO at 0.53) and A=0.71 on the Scout 17B×16 MoE.

## [2026-06-05] ingest | Scaling Behaviors of LLM Reinforcement Learning Post-Training (arXiv 2025)
- Source: `.raw/papers/arxiv-2509.25300.txt`
- Summary: [[@2025__arXiv__Scaling Behaviors of LLM Reinforcement Learning Post-Training]]
- Pages created: [[@2025__arXiv__Scaling Behaviors of LLM Reinforcement Learning Post-Training]], [[Zelin Tan]], [[Chen Zhang (Shanghai AI Lab)]], [[Zhenfei Yin]], [[University of Oxford]], [[VeRL]], [[GRPO]]
- Pages updated: [[強化ファインチューニング]] (formalization of scaling behavior added to cross-cutting findings; GRPO, VeRL, and the new concepts added to related), [[強化学習スケーリング]] (source added), [[エージェント型強化学習]] (source added), [[sources/_index]], [[entities/_index]], [[log]], [[hot]]
- Key insight: 
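Trains more than 63 models from Qwen2.5 (0.5B to 72B) with GRPO and is the first to demonstrate systematically that the test loss of RL post-training follows a log-linear power law (R² > 0.99) in compute and data volume. It shows that the learning efficiency k(N) saturates from 32B onward, that data reuse is effective for τ ≤ 25, and that the result reproduces on Llama 3 independent of architecture.

## [2026-06-05] ingest | IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL (arXiv 2026)
- Source: `https://arxiv.org/abs/2603.12151`
- Summary: [[@2026__arXiv__IsoCompute Playbook - Optimally Scaling Sampling Compute for LLM RL]]
- Pages created: [[@2026__arXiv__IsoCompute Playbook - Optimally Scaling Sampling Compute for LLM RL]], [[Aviral Kumar]], [[Zhiting Hu]], [[MBZUAI]], [[強化学習スケーリング]], [[エージェント型強化学習]]
- Pages updated: [[強化ファインチューニング]] (GRPO scaling findings added to cross-cutting findings; 強化学習スケーリング added to related), [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[log]], [[hot]]
- Key insight: Zhoujun Cheng, Aviral Kumar, Zhiting Hu, and colleagues at UCSD/CMU/MBZUAI derive a compute-optimal allocation rule for LLM RL from about 120,000 H200 hours of experiments. With compute budget C = Bp * n * M, the optimal number of parallel rollouts n*(C) saturates sigmoidally, and a dual mechanism operates by problem difficulty (sharpening, worst@k, on easy problems / coverage expansion, best@k, on hard problems). Combining the Healthy RL recipe (difficulty-specific regularization, sqrt learning-rate scaling) with compute-optimal allocation lifts Qwen2.5-7B to 72.5% on AIME 2025. It is the first scaling law for RL post-training that corresponds to the Chinchilla law for pretraining.

## [2026-06-05] ingest | AgentRL: Scaling RL for Multi-Turn Multi-Task Agents (arXiv 2025, Tsinghua/Z.AI)
- Source: `.raw/papers/arxiv-2510.04206.txt`
- Summary: [[@2025__arXiv__AgentRL - Scaling RL for Multi-Turn Multi-Task Agents]]
- Pages created: [[@2025__arXiv__AgentRL - Scaling RL for Multi-Turn Multi-Task Agents]], [[AgentRL]], [[Hanchen Zhang]], [[Xiao Liu]], [[Yuxiao Dong]], [[Z.AI]], [[AgentBench]]
- Pages updated: [[エージェント型強化学習]] (generalization of multi-task training and cross-policy sampling added to cross-cutting findings; heterogeneous-architecture cross-sampling and theoretical analysis of asynchronous policy lag added to open questions), [[sources/_index]], [[entities/_index]], [[log]], [[hot]]
- Key insight: A multi-turn, multi-task agentic RL training framework by Hanchen Zhang, Xiao Liu, and colleagues at Tsinghua/Z.AI. With four designs (cross-policy sampling that explores with current and past models, per-task advantage normalization, a fully asynchronous pipeline, and containerized deployment of heterogeneous environments), it reaches a 70.4% average success rate across the five AgentBench-FC environments and surpasses GPT-5/Claude-Sonnet-4/DeepSeek-R1. Notably, a single multi-task model matches the best single-task model for each task.

## 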
[2026-06-05] ingest | Agent-R1: A Unified and Modular Framework for Agentic RL (arXiv 2025, USTC)
- Source: `.raw/papers/arxiv-2511.14460.txt`
- Summary: [[@2025__arXiv__Agent-R1 - Training Agents with End-to-End RL]]
- Pages created: [[@2025__arXiv__Agent-R1 - Training Agents with End-to-End RL]], [[Agent-R1]]
- Pages updated: [[エージェント型強化学習]] (Agent-R1's step-level MDP formulation added to the definition; three-axis differentiation of the design space and credit assignment as the core added to cross-cutting findings), [[強化ファインチューニング]] (Agent-R1 added to sources), [[Mingyue Cheng]], [[Xiaoyu Tao]], [[Qi Liu]], [[Enhong Chen]], [[University of Science and Technology of China]], [[sources/_index]], [[entities/_index]], [[log]]
- Key insight: An agentic RL training framework from the Cheng group at USTC (the same group as ATSF/Cast-R1). Built around a step-level MDP + flexible context management, it compares PPO, GRPO, Reinforce++, and RLOO on a common foundation. It demonstrates on four benchmarks that the best algorithm differs by task and that context management affects training quality.

## [2026-06-05] ingest | DeepSWE: Training a Fully Open-sourced Coding Agent by Scaling RL (Together AI 2025)
- Source: `https://together.ai/blog/deepswe`
- Summary: [[@2025__Together AI__DeepSWE - Training a Fully Open-sourced State-of-the-Art Coding Agent by Scaling RL]]
- Pages created: [[@2025__Together AI__DeepSWE - Training a Fully Open-sourced State-of-the-Art Coding Agent by Scaling RL]], [[DeepSWE]], [[Together AI]], [[Agentica]], [[Ion Stoica]], [[Raluca Ada Popa]], [[rLLM]], [[R2E-Gym]], [[SWE-Bench-Verified]], [[Michael Luo]], [[Naman Jain]]
- Pages updated: [[強化ファインチューニング]] (Compact Filtering and the SFT cold-start comparison added to cross-cutting findings; domain dependence of SFT bias added to open questions), [[sources/_index]], [[entities/_index]], [[log]], [[hot]]
- Key insight: A fully open-source coding agent that reaches SOTA on SWE-Bench-Verified starting from Qwen3-32B with pure RL (GRPO++) only and no SFT. Its design, in which Compact Filtering removes the noise of incomplete trajectories inside the training algorithm, contrasts with RFT-FM's failure management framework (removal outside training).

## [2026-06-05] ingest | AutoForge: Environment Synthesis for Agentic RL (arXiv 2025)
- Source: `.raw/papers/arxiv-2512.22857.txt`
- Summary: [[@2025__arXiv__AutoForge - Environment Synthesis for Agentic RL]]
- Pages created: [[@2025__arXiv__AutoForge - Environment 
Synthesis for Agentic RL]], [[AutoForge]], [[Tongyi Lab]], [[Fuli Feng]], [[エージェント型強化学習]], [[強化学習スケーリング]]
- Pages updated: [[強化ファインチューニング]] (instability of GRPO advantage estimation added to cross-cutting findings), [[Alibaba Group]] (AutoForge and Tongyi Lab added), [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[log]]
- Key insight: An agentic RL framework from Tongyi Lab (Alibaba). It synthesizes simulated environments fully automatically from tool description documents alone, and secures training stability with ERPO, which extends GRPO to the environment level, plus masking of simulated-user errors (MEU). With 3B active parameters it is the best among open-source models under 200B and rivals closed-source models.

## [2026-06-05] ingest | Linux eBPF Tracing Technology (yuuk.io 2021)
- Source: `.raw/articles/ebpf-tracing-2021-12-28.md`
- Summary: [[@2021__yuuk.io__Linux eBPF Tracing Technology]]
- Pages created: [[@2021__yuuk.io__Linux eBPF Tracing Technology]], [[bpftrace]]
- Pages updated: [[eBPF]] (eBPF fundamentals section and toolchain added), [[BCC]] (expanded; related links tidied), [[libbpf]] (CO-RE explanation added), [[Yuuki Tsubouchi]] (this article added), [[index]], [[log]], [[hot]]
- Key insight: A 2021 technical explainer by the vault owner. The three-stage development workflow BCC → bpftrace → libbpf+CO-RE forms the implementation background of [[go-conntracer-bpf]] and is also the common foundation of [[eInfer]], [[ProfInfer]], and [[eACGM]] from 2024 onward.

## [2026-06-05] ingest | NVIDIA LLM Inference Benchmarking: Fundamental Concepts
- Source: `.raw/articles/llm-benchmarking-fundamental-concepts-2026-06-05.md`
- Summary: [[@2025__NVIDIA__LLM-Inference-Benchmarking-Fundamental-Concepts]]
- Pages created: [[@2025__NVIDIA__LLM-Inference-Benchmarking-Fundamental-Concepts]], [[GenAI-Perf]], [[TensorRT-LLM]], [[NVIDIA NIM]]
- Pages updated: [[LLM推論]] (2 cross-cutting findings added: per-use-case ISL/OSL profiles, and differences in metric calculation between tools), [[index]], [[hot]], [[log]]
- Key insight: LLM benchmarks have very different ISL/OSL profiles per use case (translation/generation/summarization/reasoning), so results change even on the same hardware and model. GenAI-Perf and LLMPerf differ in whether the ITL calculation includes TTFT, so direct comparison between tools requires normalization. This point, confirmed by NVIDIA's official material, was added to the wiki.

## [2026-06-05] batch-ingest | LLM distributed inference infrastructure × 4 (さくらのナレッジ vol.1-3 + Zenn)
- Source: `.raw/articles/sakura-distributed-inference-vol1-2025-11-11.md`, `sakura-distributed-inference-vol2-2025-12-23.md`, `sakura-distributed-inference-vol3-2026-03-25.md`, 
`zenn-llm-inference-benchmarking-2026-05-30.md`
- Summary: [[@2025__さくらのナレッジ__分散推論基盤やその前提の考え方]], [[@2025__さくらのナレッジ__分散推論基盤の基礎技術]], [[@2026__さくらのナレッジ__高火力PHYを利用した分散推論基盤の性能検証]], [[@2026__Zenn__MLエンジニアのための本質から理解するLLM推論]]
- Pages created: source 4 + entity 8 ([[道下幹也]], [[高火力 PHY]], [[vLLM]], [[NIXL]], [[UCX]], [[LMCache]], [[Kazuki Fujii]], [[東京科学大学]])
- Pages updated: [[LLM推論]] (2 cross-cutting findings and 1 open question added), [[SAKURA Internet]], [[index]], [[hot]], [[log]]
- Key insight: In PD Disaggregation measurements on SAKURA Internet's 高火力 PHY (H100 HGX), with 8k input length and 32 concurrent requests the ITL P99 of the Aggregated configuration degrades to over 100 ms, whereas PD disaggregation holds it within 30 ms. At the low load of 1k input length, however, Aggregated is equal or better, so the benefit depends on workload characteristics (input length, concurrency). The measurements also show that with NIXL+UCX the KV Cache transfer bottleneck converges to physical link bandwidth.

## [2026-06-05] ingest-paper | Efficient Large Language Models: A Survey (TMLR 2024)
- Source: `.raw/papers/arxiv-2312.03863.pdf`
- Summary: [[@2024__TMLR__Efficient Large Language Models - A Survey]]
- Pages created: [[@2024__TMLR__Efficient Large Language Models - A Survey]], [[Mi Zhang]], [[Mosharaf Chowdhury]], [[The Ohio State University]], [[モデル圧縮]]
- Pages updated: [[LLM推論]], [[Mixture-of-Experts]], [[index]], [[hot]], [[log]]
- Key insight: A comprehensive 67-page survey that systematizes LLM efficiency techniques along three axes: model-centric (compression, training, inference, architecture), data-centric (data selection, prompt engineering), and frameworks (17 compared). Whereas the existing Miao+ serving survey is specific to inference, this one gives the full picture, including serial compression → inference optimization and model architectures (MoE, SSM).

## [2026-06-05] ingest-paper | Towards Efficient Generative Large Language Model Serving (ACM Computing Surveys 2025)
- Source: `.raw/papers/2026_Unknown_Towards_Efficient_Generative_Large_Language.pdf`
- Summary: [[@2025__ACM Computing Surveys__Towards Efficient Generative Large Language Model Serving]]
- Pages created: [[@2025__ACM Computing Surveys__Towards Efficient Generative Large Language Model Serving]], [[Xupeng Miao]], [[Zhihao Jia]], [[Tianqi Chen]], [[Purdue University]]
- Pages updated: [[LLM推論]], [[index]], [[hot]], [[log]]
- Key insight: Systematizes techniques for efficient LLM serving along the algorithm/system 2
axes, as the first comprehensive survey to do so. It makes explicit that speculative decoding is the only algorithmic acceleration technique that preserves output quality, sets out the dual optimization objectives of low latency and high throughput, and provides a design perspective complementary to the existing observability findings (ProfInfer/eInfer).

## [2026-06-05] ingest | Exploring the World of "LLM for SRE" (blog.yuuk.io)
- Source: `.raw/articles/the-world-of-llm4sre-2024-03-21.md`
- Summary: [[The-World-of-LLM4SRE]]
- Pages created: [[The-World-of-LLM4SRE]]
- Pages updated: [[Yuuki Tsubouchi]], [[根本原因分析]], [[index]], [[hot]], [[log]]
- Key insight: A first-hand observation in which the vault owner [[Yuuki Tsubouchi]], as of March 2024, organized LLM4SRE into three categories (fine-tuning/RAG/agentic). It confirms that the papers later added to the wiki sit on the problem axes it anticipated.

## [2026-06-05] ingest-paper | Distributed tracing, log analysis, and telemetry optimization × 8 (PMF / LogReducer / LogCleaner / Hindsight / Tracezip / Astraea / Mint / TraStrainer)
- Source: `.raw/papers/Chakraborty-et-al.-2024---Enabling-programmable-metric-flows.pdf`, `LogReducer_Identify_and_Reduce_Log_Hotspots_in_Kernel_on_the_Fly.pdf`, `arxiv-2409.04834.pdf`, `nsdi23-zhang-lei.pdf`, `arxiv-2502.06318.pdf`, `Astraea_camera_ready-1.pdf`, `arxiv-2411.04605.pdf`, `2026_Unknown_TraStrainer_Adaptive_Sampling_Distributed_Traces.pdf`
- Summary: [[@2024__IEEE CLOUD__Enabling Programmable Metric Flows]], [[@2023__ICSE__LogReducer - Identify and Reduce Log Hotspots in Kernel on the Fly]], [[@2024__ESEM__Reducing Events to Augment Log-based Anomaly Detection Models - An Empirical Study]], [[@2023__NSDI__Hindsight - Tracing Edge-Cases in Distributed Systems]], [[@2025__ISSTA__Tracezip - Efficient Distributed Tracing via Trace Compression]], [[@2024__IEEE CLOUD__Astraea - Unleashing Performance Insights with Online Probabilistic Tracing]], [[@2025__ASPLOS__Mint - Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis]], [[@2024__FSE__TraStrainer - Adaptive Sampling for Distributed Traces with System Runtime State]]
- Pages created: source 8 + entity 18 ([[PMF]], [[LogReducer]], [[WeChat]], [[LogCleaner]], [[Hindsight]], [[OpenTelemetry]], [[Tracezip]], [[Astraea]], [[VAIF]], [[Mint]], [[TraStrainer]], [[Jonathan Mace]], [[Kangjin Wang]], 
[[Zibin Zheng]], [[Mehmet Toslali]], [[Ayse K. Coskun]], [[Haiyu Huang]], [[Max Planck Institute for Software Systems]]) + concept 1 ([[トレースサンプリング]])
- Pages updated: [[テレメトリ]], [[Scaling Telemetry Workloads]], [[異常検知]], [[特徴量削減]], [[eBPF]], [[ログ解析]], [[ログパース]], [[ログ生成]], [[分散トレーシング]], [[根本原因分析]], [[Prometheus]], [[IBM Research]], [[Guangba Yu]], [[Pengfei Chen]], [[Tencent]], [[Sun Yat-sen University]], [[Lingzhe Zhang]], [[Tong Jia]], [[Ying Li]], [[Zhuangbin Chen]], [[Train-Ticket]], [[Boston University]], [[DeathStarBench]], [[University of Maryland]], [[Alibaba Group]], [[Huawei Technologies]]
- Key insight: Four approaches to the trace-sampling problem (head-based probabilistic / tail-based adaptive / retroactive / all-request compression) attack the trade-off between collection volume and information loss at different phases. In log analysis, LogReducer (kernel-layer eBPF) and LogCleaner (application-layer event reduction) independently practice a "narrow the information first, then process" design. PMF recasts the same reduction principle in the metrics pipeline as an LP optimization.

## [2026-06-05] ingest-paper | NSDI '26 × 6 (EROICA / Wormhole / PrvTel / Matryoshka / FAST / HeteCCL)
- Source: `.raw/papers/nsdi26-guan-yu.pdf`, `nsdi26-long.pdf`, `nsdi26-zhou-yajie.pdf`, `nsdi26-cai.pdf`, `nsdi26-lei-yiran.pdf`, `nsdi26-hei.pdf`
- Summary: [[@2026__NSDI__EROICA - Online Performance Troubleshooting for Large-scale Model Training]], [[@2026__NSDI__Supercharging Packet-level Network Simulation of Large Model Training via Memoization and Fast-Forwarding]], [[@2026__NSDI__PrvTel - Lightweight Models for Private and Accurate Telemetry Data Retention]], [[@2026__NSDI__Matryoshka - Realizing Hyperscale Data Center Network Design for the AI Era]], [[@2026__NSDI__FAST - An Efficient Scheduler for All-to-All GPU Communication]], [[@2026__NSDI__HeteCCL - Synthesizing Near-Optimal Collective Communication Schedules for Heterogeneous GPU Clusters]]
- Pages created: source 6 + entity 10 ([[Yu Guan]], [[Zhejiang Lab]], [[Dan Li]], [[Zhongguancun Laboratory]], [[Fuheng Zhao]], [[Max Planck Institute for Informatics]], [[MangoBoost]], [[University of Pennsylvania]], [[Northeastern University]], [[Shenzhen Institutes of Advanced Technology]]) + concept 2
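([[ネットワークシミュレーション]], [[差分プライバシー]])
- Pages updated: [[集合通信]], [[Mixture-of-Experts]], [[LLM分散学習]], [[オープンネットワーキング]], [[テレメトリ]], [[近似クエリ処理]], [[LLM学習モニタリング]], [[GPUクラスタ運用]], [[ストラグラー]], [[Fault Localization]], [[Meta]], [[Ennan Zhai]]
- Key insight: All six papers share the premise of "regularity in LLM training infrastructure (the iterative structure of synchronous distributed training)" and attack different applications: performance diagnosis (EROICA), simulation speed-up (Wormhole), privacy-preserving telemetry (PrvTel), DCN design automation (Matryoshka), and collective-communication scheduling (FAST/HeteCCL). In particular, FAST (Birkhoff decomposition) and HeteCCL (CEGIS) independently arrive at a design that "sidesteps NP-hardness through problem-specific structural simplification", advancing the optimization frontier of collective communication for both homogeneous and heterogeneous clusters.

## [2026-06-05] ingest-paper | HeteCCL (NSDI '26)
- Source: `.raw/papers/nsdi26-hei.pdf`
- Summary: [[@2026__NSDI__HeteCCL - Synthesizing Near-Optimal Collective Communication Schedules for Heterogeneous GPU Clusters]]
- Pages created: [[@2026__NSDI__HeteCCL - Synthesizing Near-Optimal Collective Communication Schedules for Heterogeneous GPU Clusters]], [[Northeastern University]], [[Shenzhen Institutes of Advanced Technology]]
- Pages updated: [[集合通信]] (2 cross-cutting findings, 2 open questions, source added), [[Ennan Zhai]] (HeteCCL added)
- Key insight: Automatic synthesis of collective-communication schedules breaks down on heterogeneous GPU clusters not because of "synthesis accuracy" but because of "non-uniform primitive durations within the same step". By homogenizing durations through chunking and pruning the search with CEGIS, HeteCCL cuts a synthesis that took TACCL/TE-CCL more than 9 hours on 64 GPUs to under 9 minutes, and achieves up to 2.8× the bandwidth of NCCL and a 23–37% improvement in training efficiency.

## [2026-06-05] ingest-paper | FAST (NSDI '26)
- Source: `.raw/papers/nsdi26-lei-yiran.pdf`
- Summary: [[@2026__NSDI__FAST - An Efficient Scheduler for All-to-All GPU Communication]]
- Pages created: [[@2026__NSDI__FAST - An Efficient Scheduler for All-to-All GPU Communication]], [[MangoBoost]], [[University of Pennsylvania]]
- Pages updated: [[集合通信]], [[Mixture-of-Experts]]
- Key insight: The key to reducing MoE AllToAllv scheduling from an NP-hard problem to a polynomial-time one is the simplification "concentrate on scale-out, absorb skew within scale-up". The first application of Birkhoff decomposition to the GPU collective-communication layer guarantees optimality and incast avoidance at the same time. Its synthesis time of 221 µs on 64 GPUs is orders of magnitude below the 3.6 s on 16 GPUs of SyCCL, the fastest solver-based method, which makes online use on MoE workloads that change every few hundred milliseconds practical for the first time.

## [2026-06-05] ingest-paper | Matryoshka (NSDI '26)
- Source: `.raw/papers/nsdi26-cai.pdf`
- Summary: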
[[@2026__NSDI__Matryoshka - Realizing Hyperscale Data Center Network Design for the AI Era]]
- Pages created: [[@2026__NSDI__Matryoshka - Realizing Hyperscale Data Center Network Design for the AI Era]], [[Max Planck Institute for Informatics]]
- Pages updated: [[Meta]], [[オープンネットワーキング]], [[LLM分散学習]]
- Key insight: Makes explicit that the "configuration generation" phase, which automatically compiles high-level DCN design intent into switch configurations, was a gap in academic research, and shows from 6 years of production across roughly 900 DCNs that an intent-driven, deterministic, stateless design is effective for hyperscale DCN management in the AI-cluster era.

## [2026-06-05] concept | Scaling Telemetry Workloads
- Pages created: [[Scaling Telemetry Workloads]]
- Pages updated: [[index]], [[concepts/_index]]
- Key insight: Split out into a standalone concept page the three-layer framework proposed in the doctoral thesis (instrumentation, retention, analysis) and its design guideline of "reduce at the context-rich ends". The page ties together distributed tracing, time-series databases, and feature reduction within the wiki, and accumulates as cross-cutting findings the telemetry over-consumption problem of AIOps agents and the extension to GPU/LLM infrastructure.

## [2026-06-05] ingest-paper | OpsAgent (ASE '26)
- Source: `.raw/papers/arxiv-2510.24145.pdf`
- Summary: [[@2026__ASE__OpsAgent - An Evolving Multi-agent System for Incident Management in Microservices]]
- Pages created: [[@2026__ASE__OpsAgent - An Evolving Multi-agent System for Incident Management in Microservices]], [[Yu Luo]], [[Lenovo]]
- Pages updated: [[Yongqian Sun]], [[Shenglin Zhang]], [[Nankai University]], [[Dan Pei]], [[インシデント管理]], [[根本原因分析]], [[マルチモーダル障害診断]]
- Key insight: Unifying heterogeneous telemetry through training-free text conversion is the key to MAS-based incident management: removing the processor drops Correct from 16.54% to 2.26% in the ablation. With dual self-evolution (PPO plus reflection), it reaches 84.09% over 10,492 cases in 53 days of Lenovo production and cuts resolution time from 2.5 h to 126 s.

## [2026-06-05] ingest-paper | Lustre Unveiled + The Lustre Storage Architecture
- Source: `.raw/papers/2026_Unknown_Lustre_Unveiled_Evolution_Design_Advancements.pdf` + `.raw/papers/arxiv-1903.01955.pdf`
- Summary: [[@2025__TOS__Lustre Unveiled - Evolution, Design, Advancements, and Current Trends]] + [[@2019__arXiv__The Lustre Storage Architecture]]
- Pages created: [[@2025__TOS__Lustre Unveiled - Evolution, Design, Advancements, and Current Trends]], [[@2019__arXiv__The Lustre Storage Architecture]], [[Lustre]], [[Frontier]], [[Orion]], [[DDN]], [[Whamcloud]],
[[OpenSFS]], [[Anjus George]], [[Andreas Dilger]], [[Sarp Oral]], [[Peter J. Braam]], [[Cluster File Systems]], [[並列ファイルシステム]]
- Pages updated: [[Oak Ridge National Laboratory]]
- Key insight: The metadata write-back cache and distributed MDS envisioned in the initial design documents of 2001–2005 took more than 20 years to implement because of the complexity of guaranteeing consistency, and the 2025 survey still lists them as "future directions". While DAOS's lockless transaction model shows a structural scalability advantage at the cost of POSIX compatibility, Lustre retains a dominant position with over 60% of the Top500, and the gravity of backward compatibility is the rate-limiting factor in the design evolution of parallel file systems.

## [2026-06-05] ingest-paper | Meaningful Availability
- Source: `.raw/papers/nsdi20-paper-hauer.pdf`
- Summary: [[@2020__NSDI__Meaningful Availability]]
- Pages created: [[@2020__NSDI__Meaningful Availability]], [[Tamás Hauer]], [[Philipp Hoffmann]], [[John Lunney]], [[Dan Ardelean]], [[Amer Diwan]]
- Pages updated: [[Jeffrey C. Mogul]], [[Google]], [[サービスレベル目標]]
- Key insight: Windowed user-uptime is the first availability metric to satisfy the three requirements of meaningfulness, proportionality, and practicality at once, and its MCR curve quantitatively distinguishes short intermittent outages from long large-scale ones. Evaluated and deployed in G Suite production.

## [2026-06-05] ingest-paper | Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems
- Source: `.raw/papers/forensics_sc_2020.pdf`
- Summary: [[@2020__SC20__Live Forensics for HPC Systems - A Case Study on Distributed Storage Systems]]
- Pages created: [[@2020__SC20__Live Forensics for HPC Systems - A Case Study on Distributed Storage Systems]], [[Kaleidoscope]], [[Blue Waters]], [[Subho S. Banerjee]], [[Zbigniew T. Kalbarczyk]]
- Pages updated: [[Saurabh Jha]], [[Shengkun Cui]], [[Tianyin Xu]], [[Ravishankar K. Iyer]], [[NCSA]], [[Fault Localization]], wiki/sources/_index.md, wiki/entities/_index.md, wiki/index.md, wiki/hot.md, wiki/log.md
- Key insight: Fault localization in HPC distributed storage forms a third methodological lineage, "active I/O probes (Store Pings) + factor-graph PGM", distinct from those for cloud and GPU clusters, and also solves the two-mode discrimination between reliability failures and resource overload within the same framework.

## [2026-06-05] ingest-paper | Thinking about Availability in Large Service Infrastructures
- Source: `.raw/papers/46181.pdf`
- Summary: [[@2017__HotOS__Thinking about Availability in Large Service Infrastructures]]
- Pages created: [[@2017__HotOS__Thinking about Availability in Large Service Infrastructures]], [[Rebecca Isaacs]], [[Brent Welch]]
- Pages updated: (none)
- Key insight: Argues that defining availability for large infrastructures faces a threefold difficulty (multidimensionality, dimensionality reduction, and subsystem decomposition) and should be approached with the same "adversarial thinking" as threat modeling, defense in depth, and penetration testing in security. Fail-static design corresponds to the availability version of the default-deny principle.

## [2026-06-05] ingest-paper | A Microservice-Based Platform for Sustainable and Intelligent SLO Fulfilment and Service Management
- Source: `.raw/papers/arxiv-2602.12875.pdf`
- Summary: [[@2026__arXiv__A Microservice-Based Platform for Sustainable and Intelligent SLO Fulfilment and Service Management]]
- Pages created: [[@2026__arXiv__A Microservice-Based Platform for Sustainable and Intelligent SLO Fulfilment and Service Management]], [[Juan Luis Herrera]], [[Daniel Wang (TU Wien)]], [[CASCA]]
- Pages updated: (none)
- Key insight: CASCA is a platform that follows MSA principles and lets a CC provider fulfil SLOs without knowing the semantics of the service. It integrates carbon-aware SLOs through the EMMA microservice and achieves fast reconfiguration via declarative configuration management (53.7 s faster than the imperative approach). Evaluated on a physical testbed with three methods: GDS, RLDS, and RDS.

## [2026-06-05] ingest-paper | Nines are Not Enough: Meaningful Metrics for Clouds
- Source: `.raw/papers/Mogul-and-Wilkes-2019---Nines-are-Not-Enough---Meaningful-Metrics-for-Clouds.pdf`
- Summary: [[@2019__HotOS__Nines are Not Enough - Meaningful Metrics for Clouds]]
- Pages created: [[@2019__HotOS__Nines are Not Enough - Meaningful Metrics for Clouds]], [[Jeffrey C. Mogul]], [[John Wilkes]]
- Pages updated: (none)
- Key insight: Frames the difficulty of defining SLOs as isomorphic to statistical decision-making and advocates a shift from thinking like a lawyer to thinking like a statistician. Presents a design principle in which a bidirectional framework of SLE (expectations of normal behavior) and CBE (expectations of customer behavior) makes the sharing of risk between provider and customer explicit.

## [2026-06-05] ingest-paper | Diffusing High-level SLO in Microservice Pipelines
- Source: `.raw/papers/SOSE_2024_B_Sedlak.pdf`
- Summary: [[@2024__SOSE__Diffusing High-level SLO in Microservice Pipelines]]
- Pages created: [[@2024__SOSE__Diffusing High-level SLO in Microservice Pipelines]], [[Boris Sedlak]], [[Víctor Casamayor Pujol]], [[Praveen Kumar Donta]]
- Pages updated: [[Schahram Dustdar]] (needs update in Phase 2)
- Key insight: Uses the conditional dependencies of a Bayesian network to diffuse high-level SLOs across the whole pipeline, achieving pre-execution conflict detection and up to 100% SLO fulfilment. There is a trade-off: making the λ hyperparameter too strict causes the fulfilment rate to drop sharply to 0.

## [2026-06-05] ingest-paper | CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend
- Source: `.raw/papers/arxiv-2604.23455.pdf`
- Summary: [[@2026__arXiv__CUJBench - Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend]]
- Pages created: [[@2026__arXiv__CUJBench - Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend]], [[Haoming Meng]], [[CUJBench]], [[OpenTelemetry Demo]], [[Tractor Store]]
- Pages updated: [[SRE Benchmark]], [[マルチモーダル障害診断]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/index]], [[wiki/hot]], [[wiki/log]], [[.raw/.manifest.json]]
- Key insight: The first cross-modal diagnosis benchmark to integrate the browser-visible layer with backend telemetry. It quantifies the counterintuitive result that broader tool access lowers accuracy, and the integration bottleneck of "evidence can be gathered but not attributed".

## [2026-06-05] ingest-paper | FlowXpert: Expertizing Troubleshooting Workflow Orchestration with Knowledge Base and Multi-Agent Coevolution
- Source: `.raw/papers/2026_Unknown_FlowXpert_Expertizing_Troubleshooting_Workflow_Orchestration.pdf`
- Summary: [[@2025__KDD__FlowXpert - Expertizing Troubleshooting Workflow Orchestration with Knowledge Base and Multi-Agent Coevolution]]
- Pages created: [[@2025__KDD__FlowXpert - Expertizing Troubleshooting Workflow Orchestration with Knowledge Base and Multi-Agent Coevolution]], [[Binpeng Shi]], [[FlowXpert]], [[OpsFlowBench]]
- Pages updated: [[Shenglin Zhang]], [[Dan Pei]], [[Nankai University]], [[Huawei Cloud]], [[TSG自動化]], [[インシデント管理]], [[強化ファインチューニング]]
- Key insight: Adds the problem setting of workflow "generation" (FlowXpert) upstream of workflow "execution" (FLASH/LLexus/StepFly). Validated in industry with 22.1 s in production and an 80% approval rate.

## [2026-06-05] ingest-paper | AgentTune: An Agent-Based LLM Framework for Database Knob Tuning
- Source: `.raw/papers/acm-3769758.pdf`
- Summary: [[@2025__SIGMOD__AgentTune - An Agent-Based Large Language Model Framework for Database Knob Tuning]]
- Pages created: [[Yiyan Li]], [[Haoyang Li]], [[Jing Zhang]], [[Cuiping Li]], [[Hong Chen]], [[Renata Borovica-Gajic]], [[University of Melbourne]], [[データベースノブチューニング]]
- Pages updated: [[データベース自律診断]], [[Renmin University of China]], [[ByteDance]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: The combination of task decomposition across four specialized LLM agents, beam-search tree exploration, and centroid-distance ranking achieves Invalid Times = 0 in every DBMS knob-tuning experiment. It quantitatively demonstrates two design principles: "fusing rule-based validation with the LLM is the rate-limiting factor for reliability" and "majority voting in the configuration space stabilizes convergence".

## [2026-06-05] ingest-paper | SCELM: A Multimodal Intelligent Change Assessment Framework for Microservice Systems Based on Large Language Models
- Source: `.raw/papers/2026_Unknown_A_Multimodal_Intelligent_Change_Assessment.pdf`
- Summary: [[@2025__FSE Companion__A Multimodal Intelligent Change Assessment Framework for Microservice Systems Based on Large Language Models]]
- Pages created: [[SCELM]], [[Tinghua Zheng]], [[Xidao Wen]], [[Weihua Kuang]], [[Heng Liu]], [[Chao Shen]], [[Bo Wu]], [[BizSeer]], [[ソフトウェア変更管理]]
- Pages updated: [[Yongqian Sun]], [[Shenglin Zhang]], [[Dan Pei]], [[Nankai University]], [[マルチモーダル障害診断]], [[根本原因分析]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: The first framework to integrate the three tasks ECD, FT, and RCCA achieves a 90% time reduction over 11 months in production by verbalizing change tickets × logs × metrics into natural language and combining RAG with a 7B LLM. The ablation shows that the design of "treating the change ticket as a fourth modality and passing the meaning of anomaly shapes to the LLM" is the key, especially for RCCA.
## [2026-06-05] ingest-paper | OpDiag: Unveiling Database Performance Anomalies Through Query Operator Attribution
- Source: `.raw/papers/OpDiag_Unveiling_Database_Performance_Anomalies_Through_Query_Operator_Attribution.pdf`
- Summary: [[@2025__TKDE__OpDiag - Unveiling Database Performance Anomalies Through Query Operator Attribution]]
- Pages created: [[Shiyue Huang]], [[Bin Cui]], [[Yinjun Wu]], [[Ziwei Wang]], [[ZTE Corporation]], [[OpDiag]], [[DBPA]]
- Pages updated: [[データベース自律診断]], [[Peking University]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]], [[log]]
- Key insight: A design that traces back the operator-query-KPI-anomaly hierarchy with three-stage partitioned attribution pushes the "resolution spectrum" of DB diagnosis past KPI/query to the operator level for the first time. It demonstrates that ML plus attribution can deliver both operator-level precision and interactive speed without domain knowledge.

## [2026-06-05] ingest-paper | DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs
- Source: `.raw/papers/arxiv-2508.01136.pdf`
- Summary: [[@2025__PVLDB__DBAIOps - A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs]]
- Pages created: [[データベース O&M]], [[Wei Zhou]], [[DBAIOps]], [[Baisheng Technology]]
- Pages updated: [[根本原因分析]], [[Xuanhe Zhou]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: The knowledge graph (ExperienceGraph) addresses both "suppressing LLM hallucination" and "relation fragmentation in RAG" at once: structuring O&M experience as graph paths, together with the constraint to "reason only from the evidence provided", eliminated hallucinations (citations of fake metrics) in the DB-diagnosis case studies. Dynamic expansion through graph evolution contributes up to a 34% accuracy gain on unseen anomalies.

## [2026-06-05] ingest-paper | D-Bot: Database Diagnosis System using Large Language Models
- Source: `.raw/papers/arxiv-2312.01454.pdf`
- Summary: [[@2024__PVLDB__D-Bot - Database Diagnosis System using Large Language Models]]
- Pages created: [[データベース自律診断]], [[DB-GPT]]
- Pages updated: [[根本原因分析]], [[AIOps]], [[Xuanhe Zhou]], [[Guoliang Li]], [[Tsinghua University]], [[sources/_index]], [[entities/_index]], [[concepts/_index]], [[index]], [[hot]], [[log]]
- Key insight: DB-specific autonomous LLM diagnosis is methodologically isomorphic to AIOps RCA beyond the DB domain. "Externalizing domain knowledge is the rate-limiting factor for accuracy" (NoKnowledge −64.1%) and "UCT tree search structurally suppresses early stopping" (NoTreeSearch −35.85%) independently corroborate the same propositions as the SOP-knowledge removal in Flow-of-Action and the early-stopping problem in SREGym.

## [2026-06-05] ingest-paper | TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data
- Source: `.raw/papers/arxiv-2504.20462.pdf`
- Summary: [[@2025__TSC__TAMO - Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems]]
- Pages created: [[TAMO]], [[Xiao Zhang]], [[Dongxiao Yu]], [[Fuzhen Zhuang]], [[Shandong University]]
- Pages updated: [[根本原因分析]], [[マルチモーダル障害診断]], [[sources/_index]], [[entities/_index]], [[index]], [[hot]]
- Key insight: With a "tool-assisted LLM agent" design that decouples the LLM from raw-data processing and has it focus on integrating tool outputs, TAMO addresses three LLM-RCA challenges (context limits, the multimodal semantic gap, and dynamic dependency graphs) in one framework. The ablation identifies T1 (dual-branch diffusion alignment) as the rate-limiting component for multimodal RCA.

## [2026-06-05] ingest-paper | TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework
- Source: `.raw/papers/2026_Unknown_TVDiag_Task_oriented_View_invariant.pdf`
- Summary: [[@2026__TOSEM__TVDiag - A Task-oriented and View-invariant Failure Diagnosis Framework for Microservice-based Systems with Multimodal Data]]
- Pages created: [[マルチモーダル障害診断]], [[Shuaiyu Xie]], [[Jian Wang]], [[Bing Li]], [[Wuhan University]], [[TVDiag]]
- Pages updated: [[根本原因分析]], [[Fault Localization]]
- Key insight: Shows on four datasets that, in multimodal RCL, a design that amplifies "per-task modality preference" through supervised contrastive learning substantially outperforms equal-weight fusion.

## [2026-06-05] ingest-paper | Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
- Source: `.raw/papers/arxiv-2502.08224.pdf`
- Summary: [[@2025__WWW__Flow-of-Action - SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis]]
- Pages created: [[wiki/sources/@2025__WWW__Flow-of-Action - SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis|source]] / [[wiki/entities/Changhua Pei|Changhua Pei]]
- Pages updated: [[根本原因分析]] / [[TSG自動化]] / [[Dan Pei]] / [[ByteDance]] / [[Tieying Zhang]] / [[sources/_index]] / [[entities/_index]] / [[index]]
/ [[hot]]
- Key insight: SOP-to-code conversion (generate_sop_code) has three advantages: "atomic batch execution, removal of dependence on nearby text, and token savings". In the ablation, removing SOP knowledge collapses LA from 54.22 to 8.56, quantitatively showing that making domain knowledge explicit is the rate-limiting factor for RCA agents.

## [2026-06-05] ingest-paper | Incident automation × 4 (FLASH / StepFly / LLexus / agentic NetOps-AIOps survey)
- Sources: `.raw/papers/FLASH_Paper.pdf`, `.raw/papers/arxiv-2510.10074.pdf`, `.raw/papers/3689051.3689056.pdf`, `.raw/papers/arxiv-2605.12729.pdf`
- Summary: [[@2024__MSR__FLASH - A Workflow Automation Agent for Diagnosing Recurring Incidents]] / [[@2025__arXiv__StepFly - Agentic Troubleshooting Guide Automation for Incident Diagnosis]] / [[@2024__OSR__LLexus - an AI agent system for incident management]] / [[@2026__arXiv__Large Language Models for Agentic NetOps and AIOps - Architectures, Evaluation, and Safety]]
- Pages created: new concepts [[TSG自動化]] / [[NetOps]] / [[エージェント運用安全性]] + entity 32 (person 24, product 7 ([[FLASH]]/[[StepFly]]/[[TSG Mentor]]/[[LLexus]]/[[TaskWeaver]]/[[Semantic Kernel]]/[[Azure Durable Functions]]), org 1 ([[Renmin University of China]])) + source 4
- Pages updated: [[インシデント管理]] / [[障害緩和]] / [[AIOps]] / [[根本原因分析]] / [[agentic SRE]] / [[SRE AI Autonomy Levels]] / [[Transactional No-Regression]] / existing person 4 ([[Minghua Ma]]/[[Shilin He]]/[[Qingwei Lin]]/[[Chaoyun Zhang]]) + 3 indexes + [[index]] + [[hot]]
- Key insight: Microsoft's three TSG-automation papers diverge on "where to put the LLM to work: online (FLASH), up front in planning (LLexus), or both plus parallelism (StepFly)", yet all three independently converge on "TSG quality is the rate-limiting factor for automation". The assurance contract in the NetOps/AIOps survey generalizes TNR, Actus, and the autonomy levels (Google L0–L4) under a single vocabulary, and also connects LLexus's deterministic execution as certainty at the write boundary.
- Method: Phase 1 (4 subagents in parallel = source + exclusive person pages; shared items go into a report) → Phase 2 (the main session handles concept integration, central entities, and indexes/meta). The first 3 papers were re-run with Sonnet after socket disconnects. Shared files were serialized with locks against the LLM×DATA ingest running in a parallel session.

## [2026-06-05] ingest-paper | A Survey of LLM × DATA
- Source: `.raw/papers/arxiv-2505.18458.pdf`
- Summary: [[@2025__arXiv__A Survey of LLM × DATA]]
- Pages created: [[@2025__arXiv__A Survey of LLM × DATA]], [[Xuanhe Zhou]], [[Guoliang Li]]
- Pages updated: [[根本原因分析]],
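[[LLM分散学習]], [[wiki/sources/_index]], [[wiki/entities/_index]], [[wiki/index]]
- Key insight: Anomaly diagnosis in the DB field and RCA in cloud AIOps share the same pattern of "direct prompting / RAG-augmented / multi-agent", so the taxonomy of LLM usage coincides across domains.

Append-only. Add new entries at the **top**. Do not edit past entries.

Entry format: `## [YYYY-MM-DD] operation | Title`

Recent entries: `grep "^## \[" wiki/log.md | head -10`

---

## [2026-06-05] ingest-paper | LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
- Source: `.raw/papers/arxiv-2604.16359.pdf` (md5 4d45c37a…, 54 pages, arXiv:2604.16359v2). GitHub companion: github.com/zeyang919/LLM4Log.
- Summary: [[@2026__arXiv__LLM4Log - A Systematic Review of Large Language Model-based Log Analysis]]
- Pages created: [[Zeyang Ma]], [[Jinqiu Yang]], [[Tse-Hsun Chen]], [[Concordia University]], [[LLM4Log (repository)]], [[ログパース]], [[ログ生成]]
- Pages updated: [[ログ解析]] (major; mapped the whole pipeline), [[異常検知]], [[根本原因分析]], [[障害予測]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: LLM4Log is not a collection of separate tasks but the whole pipeline of log generation → parsing → representation → downstream diagnosis. The design principle shared by every stage is "a hierarchical design that narrows the information and then calls the LLM selectively, rather than unconstrained end-to-end generation", a higher-level generalization of observations this wiki accumulated from individual sources (LogPilot/OpenRCA/AlertGuardian). Only 5 of 162 records carry deployment evidence, which corroborates how scarce industrial primary sources are in this wiki.

## [2026-06-04] ingest-paper (followup) | PACE (ISAV2025) promoted to the full version after obtaining the PDF
- Source: `.raw/papers/2026_Unknown_From_Exploration_Explanation_ML_Driven.pdf` (user-provided, md5 37155494…, 6 pages). Rewrote [[@2025__ISAV__From Exploration to Explanation - ML-Driven Causal Discovery for Datacenter Reliability at Scale]] from the full text; it had been created provisionally from the abstract alone in the preceding batch of 8 because of the ACM paywall.
- Pages updated (source): Removed the source-limitation `[!warning]`, changed `sources:` to the `.raw/` PDF, and set `confidence: high`. Detailed the proposed method as a 6-stage pipeline (data preparation → pattern discovery → causal estimation [Granger with up to 12 lags, s=−log10(p)] → graph synthesis [top k=2, 98th percentile] → visualization → validation), and added to the experimental results Fig.1 entropy ranking, Fig.2 z-score, Fig.3 causal findings (thermal manipulated variables → capacity/flow, outside air → CHW, power follows load, capacity → valve feedback), and Fig.4 clustering. Stated explicitly that **the paper has no quantitative accuracy metric and evaluates qualitatively through physical consistency and sensitivity analysis**.
- Pages updated (entities): [[David Grant]] (affiliation confirmed as [[Oak Ridge National Laboratory]], [email protected]) /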
[[PACE]] (pipeline details, main causal findings, and the qualitative evaluation made explicit). **New**: [[DyTwin]] (HPE/ORNL digital-twin framework into which PACE is integrated).
- Pages updated (meta): Removed the paywall note from the ISAV descriptions in [[sources/_index]] and [[index]]; added [[DyTwin]] to [[entities/_index]]. Replaced the `isav-pace-NO-PDF` placeholder in `.raw/.manifest.json` with the real PDF entry.
- Correction: The PDF author block confirms that the author affiliations are a collaboration between HPE Labs (Prakash/Milpitas, Hong Enriquez/Oxford UK, Serebryakov/Milpitas, Milojicic/Milpitas) and ORNL (David Grant, Wesley Brewer).

## [2026-06-04] ingest-paper | Failure and performance diagnosis for production LLM training: 8 papers (parallel ingest)
- Sources: `.raw/papers/arxiv-2509.22832.pdf` (GPU Performance Modeling), `.raw/papers/sigcomm25-skeletonhunter.pdf` (SkeletonHunter), `.raw/papers/arxiv-2506.02007.pdf` (eACGM), `.raw/papers/arxiv-2502.05413.pdf` (XPUTimer→Flare), `.raw/papers/nsdi25-dong.pdf` (Aegis), `.raw/papers/arxiv-2505.00342.pdf` (LLMPrism), `.raw/papers/arxiv-2503.20263.pdf` (L4). **No PDF was obtained for ISAV/PACE because of the ACM paywall** (built from the abstract plus publicly available text fragments; no original in `.raw/papers/`; quantitative results not recorded).
- Pages created (sources): [[@2025__ISAV__From Exploration to Explanation - ML-Driven Causal Discovery for Datacenter Reliability at Scale]] / [[@2025__arXiv__Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM]] / [[@2025__SIGCOMM__SkeletonHunter - Diagnosing and Localizing Network Failures in Containerized Large Model Training]] / [[@2025__IWQoS__eACGM - Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems]] / [[@2025__arXiv__XPUTimer - Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale]] / [[@2025__NSDI__Evolution of Aegis - Fault Diagnosis for AI Model Training Service in Production]] / [[@2025__DSN__LLMPrism - Black-box Performance Diagnosis for Production LLM Training Platforms]] / [[@2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- Pages created (entities): 69 pages: organization 6 ([[Hewlett Packard Labs]]/[[Oak Ridge National Laboratory]]/[[Case Western Reserve University]]/[[Rutgers University]]/[[Ant Group]]/[[Huawei Cloud]]), system/product/dataset/repo 15 ([[PACE]]/[[SkeletonHunter]]/[[eACGM]]/[[XPUTimer]]/[[LLMPrism]]/[[L4]]/[[Platform-X]]/[[Summit]]/[[Perlmutter]]/[[Vista]]/[[GPT-NeoX]]/[[Alibaba HPN]]/[[DeepSpeed]]/[[DLRover]]/[[The Pile]]), person 48. Details in [[entities/_index]].
- Pages updated (concepts): [[LLM学習モニタリング]] / [[LLM分散学習]] / [[集合通信]] / [[Fault Localization]] / [[根本原因分析]] / [[異常検知]] / [[並列化戦略]] / [[RDMAネットワーク監視]] / [[ストラグラー]] / [[ログ解析]] / [[GPU観測性]] / [[テレメトリ]] / [[eBPF]] / [[分散トレーシング]] / [[変化点検知]] / [[GPUクラスタ運用]] / [[障害緩和]] (added cross-cutting findings and open questions). Removed the broken tag (`</content></invoke>`) at the end of [[GPU観測性]].
- Pages updated (entities): [[Aegis]] (updated so that this NSDI paper is the true primary source) / [[Tianyin Xu]] / [[Pengfei Chen]] / [[Zhihan Jiang]] / [[Michael R. Lyu]] / [[Guangba Yu]] / [[Sun Yat-sen University]] / [[The Chinese University of Hong Kong]] / [[Tsinghua University]] / [[University of Illinois Urbana-Champaign]] / [[Alibaba Group]] / [[Megatron-LM]] / [[NCCL]] / [[Perfetto]]
- Key insight: The regularity that synchronous distributed training is "symmetric across all machines" is exploited for different purposes by performance prediction, anomaly detection, reverse inference of parallelization, and log outlier detection. [[Fault Localization]] bridges the gap between the layer where symptoms appear (CCL timeout) and the device that is the true cause, using a different modality per instrumentation point: CCL counters (Aegis), network paths (SkeletonHunter), logs (L4), and metric similarity (Minder). All of them stop at localization; RCA remains manual.
- Ingest method: Phase 1 (8 subagents in parallel produce source + author person entities, with per-file locks; shared items go into a structured report) → Phase 2 (the main session + 3 subagents update org/system entities, concept integration, indexes, hot, log, and manifest over disjoint file sets, without conflicts).
- Contradiction: In arXiv v2, [[XPUTimer]] changed its author list and renamed the system to [[Flare]]. Added a contradiction callout to the source page and kept both names in the entity's aliases.

## [2026-06-04] ingest-paper | Failure management for large-scale GPU training clusters × 6
- Sources: Wiki pages built with 5 existing notes under papers/ (from SC/SIGCOMM/HPCA/APNET) + 1 under research/conferences/ (Guard) as primary material. Only FlashRecovery had its PDF fetched: `.raw/papers/arxiv-2509.03047.pdf`.
- Pages created: [[@2025__SC__Fine-grained Automated Failure Management for Extreme-Scale GPU Accelerated Systems]], [[@2026__MLSys2026__Guard - Scalable Straggler Detection and Node Health Management for Large-Scale Training]], [[@2025__arXiv__FlashRecovery - Fast and Low-Cost Recovery from Failures for Large-Scale Training of LLMs]], [[@2025__SIGCOMM__Hawkeye - Diagnosing RDMA Network Performance Anomalies with PFC Provenance]], [[@2025__HPCA__Enhancing Large-Scale AI Training Efficiency - The C4 Solution for Real-Time Anomaly Detection and Communication Optimization]], [[@2025__APNET__Forewarned is Forearmed - Joint Prediction and Classification of Optical Transceiver Failures in Large-Scale LLM Training Clusters]] + many entities
- Pages updated: [[耐障害LLM訓練]], [[ストラグラー]], [[集合通信]], [[RDMAネットワーク監視]], [[GPUレジリエンス]], [[GPUクラスタ運用]], [[障害緩和]], [[チェックポイント]], [[LLM分散学習]], [[並列化戦略]], [[LLM学習モニタリング]], [[根本原因分析]], [[Fault Localization]], [[障害予測]], [[Mixture-of-Experts]], [[オープンネットワーキング]]
- Key insight: Failure management for large-scale GPU training is now organized across papers along four axes: "primary detection signal (step time / collective-communication sync points / physical metrics)", "3+1 recovery lineages (fast checkpointing / spare machines / idempotent skipping / data-parallel replica redundancy)", "staged mitigation (multi-strike / 10–20% thresholds)", and "instrumentation point (switch / NIC / host / physical component)".

## [2026-06-04] ingest-paper | GPU/eBPF observability and collective-communication reliability: 7 papers (parallel ingest)
- Sources: `.raw/papers/arxiv-2510.20171.pdf` (Collective 100k+/NCCLX), `.raw/papers/arxiv-2410.23661.pdf` (PICKER), `.raw/papers/arxiv-2509.03018.pdf` (Mycroft), `.raw/papers/arxiv-2601.20755.pdf` (ProfInfer), `.raw/papers/bpftime_super.pdf` (eGPU). No PDFs were obtained for eInfer and TOPC (paywalled / not on arXiv); they rely on existing `papers/` notes plus public metadata/slides (`.raw/papers/sdarche_may2025_...pdf`), with confidence: medium.
- Pages created (sources): [[@2025__eBPF__eInfer - Unlocking Fine-Grained Tracing for Distributed LLM Inference with eBPF]] / [[@2026__arXiv__ProfInfer - An eBPF-based Fine-Grained LLM Inference Profiler]] / [[@2025__SOSP__Mycroft - Tracing Dependencies in Collective Communication Towards Reliable LLM Training]] / [[@2025__HCDS__eGPU - Extending eBPF Programmability and Observability to GPUs]] / [[@2025__arXiv__Collective Communication for 100k+ GPUs]] / [[@2024__TOPC__Low-Overhead Trace Collection and Profiling on GPU Compute Kernels]] / [[@2024__arXiv__Microsecond-scale Dynamic Validation of Idempotency for GPU Kernels]]
- Pages created (concepts): [[GPU観測性]] /
[[集合通信]] / [[動的計装|動的インストルメンテーション]] / [[LLM推論]] / [[ハードウェアカウンタ]] / [[べき等性]] / [[チェックポイント]] - Pages created (entities): 53 件(人物 16・組織 13・システム/製品/repo/dataset 24)。詳細は [[entities/_index]]。 - Pages updated (concepts): [[eBPF]] / [[分散トレーシング]] / [[根本原因分析]] / [[耐障害LLM訓練]] / [[ストラグラー]] / [[障害注入]] / [[Mixture-of-Experts]] / [[LLM分散学習]] / [[LLM学習モニタリング]] / [[GPUクラスタ運用]] / [[テレメトリ]] / [[並列化戦略]] - Pages updated (entities): [[Yusheng Zheng]] / [[Minlan Yu]] / [[Yangtao Deng]] / [[ByteDance]] / [[Harvard University]] / [[The Chinese University of Hong Kong]] / [[NCCL]] / [[bpftime]] / [[eunomia-bpf]] / [[Megatron-LM]] / [[MegaScale]] - Key insight: GPU/LLM の観測性は計装の挿入時点(コンパイル時 LLVM / 実行時 PTX 注入 / ホスト側 eBPF)でオーバーヘッドと観測対象が分かれ、いずれもベンダー専用ツール(CUPTI/NVBit/Nsight)の高オーバーヘッド・ベンダーロックイン回避を共通動機とする。集合通信は Mycroft(可観測化による信頼性)と NCCLX(通信スタック再設計による性能)が同じ CCL ブラックボックス課題の観測側・機構側を成す。 - Note: eGPU(#4)と bpftime-super(#6)は同一論文と判明し source は1枚に統合(asplos.dev 公開 PDF の本文タイトルが eGPU)。eInfer 既存ノートの url DOI(3672197.3673434)は誤りで正は 3748355.3748372。 ## [2026-06-04] ingest-paper | Approximation-First Timeseries Monitoring Query At Scale (PromSketch) - Source: `.raw/papers/arxiv-2505.10560.pdf`(PVLDB / VLDB 2025, DOI:10.14778/3742728.3742732, arXiv:2505.10560, 15p, md5 6d58cd29…) - Summary: [[@2025__VLDB__Approximation-First Timeseries Monitoring Query At Scale]] - Pages created (source 1 + entity 10 + concept 1 = 12): source 上記 1 / entities [[Zeying Zhu]] [[Jonathan Chamberlain]] [[Kenny Wu]] [[David Starobinski]] [[Zaoxing Liu]] [[University of Maryland]] [[Boston University]] [[PromSketch]] [[Prometheus]] [[VictoriaMetrics]] [[Froot-NetSys promsketch]] / concept [[近似クエリ処理]] - Pages updated: concept [[時系列データベース]] / [[index]] [[hot]] [[sources/_index]] [[entities/_index]] [[concepts/_index]] - Key insight: TSDB 効率化には「取り込み最適化」([[HeteroTSDB]] のインデックス・tiering)と「クエリ最適化」([[PromSketch]] の中間結果キャッシュ)の直交 2 軸がある。[[Prometheus]]/[[VictoriaMetrics]] のような TSDBMS でも周期ルールクエリが重複ウィンドウを繰り返しスキャン・再計算する冗長性が残り(VictoriaMetrics は CPU 80.2% が 
Data Scanning)、ストレージエンジン改善だけでは取りきれない。PromSketch は「生データでも最終結果でもなく中間結果(Exponential Histogram バケット)をキャッシュ」+「EH×スケッチ(KLL/Universal Sketching)で可証明な誤差境界」で、5% 誤差を許容する前提のもとレイテンシ最大 2 桁・運用コスト約 400× を削減する。[[近似クエリ処理]] を新規 concept として追加し、[[時系列データベース]] に近似という第 3 軸を追記。 ## [2026-06-04] ingest-paper (parallel ×5) | GPU 分散訓練インフラ/ネットワーク 5 本 - Sources: - `.raw/papers/Liu-et-al.-2024---R-pingmesh---A-service-aware-RoCE-network-monitoring-and-diagnostic-system.pdf`(SIGCOMM 2024, DOI:10.1145/3651890.3672264, 14p, md5 3a812726…、ユーザー提供 PDF) - `.raw/papers/arxiv-2503.11901.pdf`(SC 2025, arXiv:2503.11901, 13p, md5 8461f84e…) - `.raw/papers/osdi25-lin-jinkun.pdf`(OSDI 2025, usenix lin-jinkun, 17p, md5 6799d314…) - `.raw/papers/arxiv-2509.16293.pdf`(SOSP 2025, arXiv:2509.16293, 18p, md5 609ba597…) - `.raw/papers/sigcomm25-qingkai.pdf`(SIGCOMM 2025, DOI:10.1145/3718958.3750521, 17p, md5 3af2f955…) - Summary: [[@2024__SIGCOMM__R-Pingmesh - A Service-Aware RoCE Network Monitoring and Diagnostic System]], [[@2025__SC__Characterizing GPU Resilience and Impact on AI - HPC Systems]], [[@2025__OSDI__Understanding Stragglers in Large Model Training Using What-if Analysis]], [[@2025__SOSP__Robust LLM Training Infrastructure at ByteDance]], [[@2025__SIGCOMM__Astral - A Datacenter Infrastructure for Large Language Model Training at Scale]] - Pages created (source 5 + entity 27 + concept 4 = 36): sources 上記 5 / entities [[R-Pingmesh]] [[ByteRobust]] [[SMon]] [[NDTimeline]] [[Astral]] [[Seer]] [[Delta]] [[StragglerAnalysis]] [[Kefei Liu]] [[Jiao Zhang]] [[Shengkun Cui]] [[Ravishankar K. 
Iyer]] [[Jinkun Lin]] [[Aurojit Panda]] [[Jinyang Li]] [[Borui Wan]] [[Liang Xiang]] [[Chuan Wu]] [[Hao Zheng]] [[ChonLam Lao]] [[Gianni Antichi]] [[BUPT]] [[Douyin Vision]] [[NCSA]] [[Nokia Bell Labs]] [[New York University]] [[The University of Hong Kong]] / concepts [[耐障害LLM訓練]] [[ストラグラー]] [[GPUレジリエンス]] [[RDMAネットワーク監視]] - Pages updated: concepts [[LLM学習モニタリング]] [[GPUクラスタ運用]] [[LLM分散学習]] [[Fault Localization]] [[並列化戦略]] [[テレメトリ]] [[オープンネットワーキング]] / entities [[ByteDance]] [[MegaScale]] [[Xin Liu]] [[Ziheng Jiang]] [[Megatron-LM]] [[Tencent]] [[Nanjing University]] [[Harvard University]] [[Qingkai Meng]] [[Chen Tian]] [[Pulse]] [[University of Illinois Urbana-Champaign]] [[Saurabh Jha]] [[IBM Research]] [[Zhuo Jiang]] / [[index]] [[hot]] [[sources/_index]] [[entities/_index]] [[concepts/_index]] - Key insight: LLM 訓練の信頼性を「ハードウェアの床([[GPUレジリエンス]]:H100 はメモリ MTBE が A100 の 1/3.2)→ クラッシュしない劣化([[ストラグラー]]:全 GPU 時間の 10.4% を浪費・主因は計算側不均衡)→ 耐障害インフラ([[耐障害LLM訓練]]:[[ByteRobust]] が ETTR 97%・「迅速な隔離」を選ぶ)→ ネットワーク監視([[RDMAネットワーク監視]]:[[R-Pingmesh]] の能動プローブ・[[Astral]] の 4 層フルスタック)」の縦の系譜として束ねた。[[Fault Localization]] に「精密に当てる vs あえて粗く切る(過剰排除)」、[[GPUクラスタ運用]] に「件数 11% でも GPU 時間 82%」の運用コスト内訳、[[LLM学習モニタリング]] に「検知信号の層 + 起因への写し方(全層相関/反事実シミュレーション/スタックトレースクラスタリング)」の第二軸を追加。Stage A 5 並行 + Stage B メイン統合のハイブリッドで実施。 ## [2026-06-04] ingest-paper | Cloud Infrastructure Management in the Age of AI Agents - Source: `.raw/papers/2026_Unknown_Cloud_Infrastructure_Management_Age_AI.pdf`(ACM SIGOPS OSR 2025, DOI:10.1145/3759441.3759443, 8p, md5 4118a93d…) - Summary: [[@2025__OSR__Cloud Infrastructure Management in the Age of AI Agents]] - Pages created: [[@2025__OSR__Cloud Infrastructure Management in the Age of AI Agents]], [[Martin Casado]], [[Archit Bhatnagar]], [[Tongyuan Miao]], [[Yunming Xiao]], [[Yibo Huang]], [[University of California, Berkeley]], [[Andreessen Horowitz]], [[WorkArena]], [[Azure Copilot]], [[クラウド管理モダリティ]] - Pages updated: [[Zhenning Yang]], [[Ang Chen]], [[Yiming Qiu]], [[Patrick Tser 
Jern Kon]], [[University of Michigan]], [[Terraform]], [[Microsoft Azure]], [[AIOpsLab]], [[Infrastructure as Code]], [[SRE AI Autonomy Levels]], [[agentic SRE]], [[AIOps]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: A vision paper that relativizes the IaC trilogy of the [[Ang Chen]] group ([[Zodiac]]/[[NSync]]/[[Lilac]]) as "one of the four cloud-management [[クラウド管理モダリティ|modalities]] (SDK/CLI/IaC/ClickOps)". It demonstrates the phase × modality trade-offs on Azure VMs (CLI = most efficient for creation; IaC = strong at re-creation and updates, weak at monitoring; ClickOps = strong at monitoring, slow at creation). Its agent-cloud interface converges independently with the ACI of [[AIOpsLab]], and its staged autonomy levels with [[SRE AI Autonomy Levels]], which connects the IaC cluster with the agentic SRE/AIOps cluster.

## [2026-06-04] ingest-paper | TimeCopilot
- Source: `.raw/papers/arxiv-2509.00616.pdf` (arXiv:2509.00616v3, NeurIPS 2025 Workshop BERT2S, 9p, md5 abf2a043…)
- Summary: [[@2025__arXiv__TimeCopilot]]
- Pages created: [[@2025__arXiv__TimeCopilot]], [[Azul Garza]], [[Renée Rosillo]], [[TiRex]]
- Pages updated: [[TimeCopilot]] (seed→developing), [[GIFT-Eval]], [[Chronos-2]], [[TimesFM]], [[エージェント型時系列予測]], [[時系列基盤モデル]], the sources/entities _index files, [[index]], [[hot]]
- Key insight: The first open-source agentic forecasting framework to bring multiple TSFMs and LLMs under a single unified API. It represents the Workflow paradigm of [[エージェント型時系列予測]] and sits at the opposite pole from the foundation-model-free [[TimeSeriesScientist]] with respect to the "source of forecasting power" (TSFM ensemble hub vs. lightweight 21-model library): even with the same Workflow skeleton the action spaces are independent, which illustrates the ATSF claim that "process organization is orthogonal to whether a foundation model is used". On GIFT-Eval, MedianEnsemble ([[Chronos-2]]+[[TimesFM]]+[[TiRex]]+isotonic regression) achieves the best overall probabilistic-forecast CRPS for about $24, showing that combining multiple TSFMs beats each standalone SOTA. However, the substance of that SOTA is the ensemble, and the main paper has no ablation isolating the net contribution of LLM orchestration (in contrast to the component ablations of [[Cast-R1]]).

## [2026-06-04] ingest-paper | TimeSeriesScientist - A General-Purpose AI Agent for Time Series Analysis
- Source: `.raw/papers/arxiv-2510.01538.pdf` (arXiv:2510.01538, 34p, md5 7f6981a2…)
- Summary: [[@2025__arXiv__TimeSeriesScientist - A General-Purpose AI Agent for Time Series Analysis]]
- Pages created: [[@2025__arXiv__TimeSeriesScientist - A General-Purpose AI Agent for Time Series Analysis]], [[Haokun Zhao]], [[Xiang 
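Zhang]], [[Jiaqi Wei]], [[Chenyu You]], [[Stony Brook University]], [[TimeSeriesScientist]]
- Pages updated: [[エージェント型時系列予測]], the sources/entities/concepts _index files, [[index]], [[hot]]
- Key insight: The first LLM-driven agentic framework for general-purpose univariate time series forecasting. Its fixed SOP of Curator→Planner→Forecaster→Reporter is typical of the Workflow paradigm of [[エージェント型時系列予測]] and forms a pair with the AgenticRL approach of [[Cast-R1]]. Without any foundation model, using only 21 statistical, classical ML, and lightweight DL models, it beats LLM direct-forecasting baselines by 38.2% on average, illustrating the ATSF claim that "the source of forecasting ability is process organization, not model scale". The ablation that removes preprocessing (MAE +41.8%, the largest of the three modules) quantitatively backs the importance of perception, i.e. adaptive preprocessing.

## [2026-06-04] query | Essential difference between a standalone TSFM and VLM integration (Toto vs Toto-1.0-QA-Experimental)
- Question: "Toto only predicts the next time series data point, so what changes when it is integrated with a VLM?"
- Answer filed: [[TSFM単体とVLM統合の本質的差異]] (`wiki/questions/`)
- Core: The VLM-integrated version reuses Toto as a **time series encoder** rather than a forecaster, projecting the intermediate embeddings just before the forecasting head into the space of the VLM ([[Qwen3-VL]] 32B) through a variate embedding MLP + projection layer. The input/output type changes from "numbers → numbers" to "(time series embedding + natural-language question) → natural-language answer", a qualitative shift from a forecaster to "a reasoner that reads time series and explains them in language". In ARFBench Table 3 it reaches 63.9% accuracy (best of all models, 1.2pp above GPT-5) and beats the VLM-only and text-LM variants by 7pp or more, because embeddings that preserve numeric structure, scale, and multivariate relations avoid the destruction of numbers by BPE and token explosion. This echoes how [[エージェント型時系列予測]] demotes the TSFM to "one tool in the action space" (whereas this question remakes the TSFM into an "organ of perception").
- Pages: 1 new question page; the Questions section of index updated with its first entry. Sources unchanged, no new ingest.

## [2026-06-04] ingest-paper | OpenRCA / Cloud-OpsBench / AlertGuardian (three primary AIOps-RCA papers ingested in parallel by subagents)
- Sources: `.raw/papers/openreview-M4qNIzQYpd-openrca.pdf` (ICLR 2025, 29p, md5 25e33e76…) / `.raw/papers/arxiv-2603.00468.pdf` (arXiv:2603.00468, 22p, md5 a69979dd…) / `.raw/papers/AlertGuardian.pdf` (ASE 2025, 12p, md5 0ceb3aa6…)
- Summary: [[@2025__ICLR__OpenRCA - Can Large Language Models Locate the Root Cause of Software Failures|2025__ICLR__OpenRCA - Can Large Language Models Locate the Root Cause of Software Failures]] / [[@2026__arXiv__Cloud-OpsBench - A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems|2026__arXiv__Cloud-OpsBench - A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems]] / [[@2025__ASE__AlertGuardian - Intelligent Alert 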
Life-Cycle Management for Large-scale Cloud Systems|2025__ASE__AlertGuardian - Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems]]
- Pages created (source 3 + entity 12 = 15): 3 source pages; entities [[OpenRCA]] / [[Cloud-OpsBench]] / [[AlertGuardian]] / [[Kubernetes]] / [[Junjielong Xu]] / [[Shilin He]] / [[Qingwei Lin]] / [[Chaoyun Zhang]] / [[Guangba Yu]] / [[Pengfei Chen]] / [[Sun Yat-sen University]] / [[Tencent]]
- Pages updated: entities [[Pinjia He]] / [[Dan Pei]] / [[Microsoft]] / [[Tsinghua University]] / [[The Chinese University of Hong Kong, Shenzhen]] / [[The Chinese University of Hong Kong]] / [[Michael R. Lyu]] / [[Online-Boutique]] / [[CrewAI]]; concepts [[根本原因分析]] / [[SRE Benchmark]] / [[AIOps]] / [[agentic SRE]] / [[インシデント管理]] / [[異常検知]] / [[障害注入]] / [[テレメトリ]]; index / sources/_index, entities/_index / hot
- Key insight: OpenRCA (static telemetry QA) and Cloud-OpsBench (deterministic State Snapshot) each show, by a different approach, an RCA-specific "third type" that sits between live-environment benchmarks ([[AIOpsLab]]/[[SREGym]]/[[ITBench]]) and purely static datasets. Restricted to RCA, the capability ceiling is an order of magnitude lower (Claude 3.5 = 11.34%, Hard 0%). The diagnosis oracle extends to a fourth type through the process evaluation of Cloud-OpsBench (IAC/RAR/ZTDR). AlertGuardian provides a contrast axis against [[LogPilot]]: "one-shot diagnosis vs. whole-lifecycle optimization". Shared-entity collisions among the parallel subagents were resolved by per-file lock coordination; the affiliation of [[Guangba Yu]] (SYSU↔CUHK) is kept as a contradiction callout. The existing papers/ AlertGuardian note is preserved and referenced one-way. Total pages 224→239, actual sources 33→36.

## [2026-06-04] ingest-paper | Foundation Models for Time Series: A Survey (6-dimensional taxonomy of TSFMs)
- Source: `.raw/papers/arxiv-2504.04011.pdf` (arXiv:2504.04011v1 [cs.LG], 2025-04-05, 20p, md5 32d44618…)
- Summary: [[@2025__arXiv__Foundation Models for Time Series - A Survey|2025__arXiv__Foundation Models for Time Series - A Survey]]
- Pages created: [[@2025__arXiv__Foundation Models for Time Series - A Survey|2025__arXiv__Foundation Models for Time Series - A Survey]] (source 1) / [[Dell Technologies]], [[Siva Rama Krishna Kottapalli]] (entity 2) = 3 pages
- Pages updated: [[Toto]] (contradiction + survey classification), [[TimesFM]], [[Chronos-2]] (note that it is a different generation from the first-generation Chronos) (entity 3) / 
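[[時系列基盤モデル]], [[多変量時系列予測]], [[Mixture-of-Experts]] (concept 3; added cross-cutting findings and open questions) / index, hot, sources/_index, entities/_index, manifest
- Key insight: A 6-dimensional taxonomy (architecture / patching / objective function / univariate vs. multivariate / probabilistic vs. deterministic / scale) that gives a bird's-eye view of the TSFMs the vault had explored individually. **Classification by objective function is its distinctive axis.** Added four cross-cutting findings: (a) the difference in viewpoint between a general-ML map and observability-specific research (the survey does not cover the statistical properties of observability data and describes Toto only as "trained on Datadog internal data"); (b) the correspondence between evaluation metrics (MASE/CRPS) and training objectives (point/probabilistic); (c) a mismatch in the criteria for "multivariate" ([[Falcon-X]] excludes channel-independence as a degenerate form of cross-variate modeling); (d) the extension of MoE from LLM training to TSFMs (Time-MOE, Huber + auxiliary loss; the first concrete content for the cross-cutting findings of [[Mixture-of-Experts]]). The [[Toto]] specs (survey: 103M, 1 trillion points vs. vault: 151M, about 2.36 trillion points) are recorded as a contradiction attributed to model-version differences, and a note records that the "Chronos" in the survey is the first generation (T5), a different generation from [[Chronos-2]]. The survey itself has classification inconsistencies (MOMENT as univariate/multivariate, TimesFM as 100B/200B) and leftover ACM placeholders. [[時系列基盤モデル - MOC]] had already registered this survey. Updated totals to 224 pages and 33 actual sources.

## [2026-06-04] ingest-paper | Cast-R1 (RL implementation of tool-augmented, sequential-decision time series forecasting)
- Source: `.raw/papers/arxiv-2602.13802.pdf` (arXiv:2602.13802v1 [cs.LG], 2026, 16p, md5 263608dd…)
- Summary: [[@2026__arXiv__Cast-R1 - Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting|2026__arXiv__Cast-R1 - Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting]]
- Pages created: [[@2026__arXiv__Cast-R1 - Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting|2026__arXiv__Cast-R1 - Learning Tool-Augmented Sequential Decision Policies for Time Series Forecasting]] (source 1) = 1 page
- Pages updated: [[Cast-R1]] (the stub created during the ATSF ingest was fleshed out from the primary paper) / [[Xiaoyu Tao]], [[Mingyue Cheng]] (entity 2; source reference added) / [[エージェント型時系列予測]], [[強化ファインチューニング]], [[時系列基盤モデル]] (concept 3; added cross-cutting findings and open questions) / index, hot, sources/_index, manifest
- Key insight: [[Cast-R1]], which the already-ingested position paper ATSF ([[@2026__arXiv__Position Beyond Model-Centric Prediction - Agentic Time Series Forecasting|2026__arXiv__Position Beyond Model-Centric Prediction - Agentic Time Series Forecasting]]) cited as **the representative of AgenticRL** but which had remained a dead link, is now fleshed out from the primary paper by the same group. The Cast-R1 ablations individually back the claims ATSF put forward without experiments: forecasting models are called as one tool in the action space (removing [[Chronos-2]] alone takes MSE on volatile NP from 22.5 to 55.4; removing all forecasting-model tools takes ETTh1 from 6.062 to 15.993); reflection, memory, and planning produce performance (degradation when Refine/Memory/Planning are removed); and adaptive action selection is acquired through reward optimization (removing RL causes the largest degradation, NP 24.750→54.631). On the other hand, performance depends strongly on backbone scale (monotonic improvement from Qwen3 1.7B to 8B), and source inspection also detected flaws typical of an unfinished preprint: **the implementation settings contradict each other between the main text (8B / 4×A800) and the Appendix (1.7B / single RTX 4090D), the main-result numbers in Table 2 match the 4B row of the scaling table, and ACM template placeholders remain**. Updated totals to 221 pages and 32 actual sources.

## [2026-06-04] ingest-paper | ARFBench (time series question answering benchmark)
- Source: `.raw/papers/arxiv-2604.21199.pdf` (arXiv:2604.21199, 2026, 45p, md5 50a2e099…)
- Summary: [[@2026__arXiv__ARFBench - Benchmarking Time Series Question Answering Ability for Software Incident Response|2026__arXiv__ARFBench - Benchmarking Time Series Question Answering Ability for Software Incident Response]]
- Pages created: [[ARFBench]], [[Stephan Xie]], [[Ben Cohen]], [[Mononito Goswami]], [[Toto-1.0-QA-Experimental]], [[Qwen3-VL]] (entity 6) / [[時系列質問応答]] (concept 1) = 7 pages
- Pages updated: [[Datadog]], [[Carnegie Mellon University]], [[Ameet Talwalkar]], [[Toto]], [[Amazon Web Services]] (entity 5) / [[異常検知]], [[時系列基盤モデル]], [[インシデント管理]] (concept 3) / index, hot, each _index, manifest
- Key insight: Builds the TSQA benchmark ARFBench using Slack timelines of [[Datadog]] production incidents as the primary source of expert annotation. [[Toto-1.0-QA-Experimental]], which couples a pretrained TSFM ([[Toto]]) with a VLM, reaches 63.9% accuracy, on par with frontier models (GPT-5 62.7%), and a best-of-2 oracle with human experts marks a superhuman frontier at 87.2% accuracy and 82.8% F1. Because a parallel ingest (Cisco TSM) was updating [[Toto]]/[[時系列基盤モデル]] at the same time, the pages were reloaded and merged without conflict.

## [2026-06-04] ingest-paper | Cisco Time Series Model Technical Report
- Source: `.raw/papers/arxiv-2511.19841.pdf` (arXiv:2511.19841, 2025, 18p, md5 b6dc319b…)
- Summary: [[@2025__arXiv__Cisco Time Series Model Technical Report|2025__arXiv__Cisco Time Series Model Technical Report]]
- Pages created: [[Cisco]], [[Splunk]], [[TimesFM]], [[Splunk Observability Cloud]], [[Liang Gou]] (entity 5) = 5 pages
- Pages updated: [[Toto]], [[GIFT-Eval]], [[Chronos-2]] (entity 3; cross-linked as comparison targets) / [[時系列基盤モデル]] (concept 1) / index, hot, each 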
_index, manifest
- Key insight: Simply adding special tokens and resolution embeddings to [[TimesFM]] and continuing pretraining lets the model handle a "multi-resolution long context", which concatenates coarse 1-hour and fine 1-minute contexts, at 1/30 of the sequence length. It beats competing TSFMs ([[Toto]]/[[Chronos-2]]) in the observability domain while retaining capability on a general-purpose benchmark ([[GIFT-Eval]]). The authors state no affiliations and all use @cisco.com addresses; [[Splunk]] is filed as a Cisco subsidiary.

## [2026-06-04] ingest-paper | Towards Robust LLM Post-Training (RFT failure management)
- Source: `.raw/papers/arxiv-2605.04431.pdf` (arXiv:2605.04431, 2026, 16p, md5 99427179…)
- Summary: [[@2026__arXiv__Towards Robust LLM Post-Training - Automatic Failure Management for Reinforcement Fine-Tuning|2026__arXiv__Towards Robust LLM Post-Training - Automatic Failure Management for Reinforcement Fine-Tuning]]
- Pages created: [[Yunpeng Zhai]], [[Liancheng Fang]], [[Kening Zheng]], [[Hongyi Liu]], [[Xiaosong Huang]], [[RFT-FaultBench]], [[RFT-FM]], [[OpenRLHF]] (entity 8) / [[強化ファインチューニング]] (concept 1) = 9 pages
- Pages updated: [[Lingzhe Zhang]], [[Tong Jia]], [[Ying Li]], [[Philip S. Yu]] (entity 4) / [[異常検知]], [[障害緩和]], [[障害注入]], [[AIOps]] (concept 4) / index, hot, each _index, manifest
- Key insight: The same PKU group ([[Lingzhe Zhang]] et al., continuing from [[MicroRemed]] and the LLM4AIOps survey) ports the AIOps software failure-management lifecycle of detection→diagnosis→remediation from microservice operations to the training process of LLM [[強化ファインチューニング]] (RFT), and presents the first fine-grained fault benchmark, [[RFT-FaultBench]], and the closed-loop [[RFT-FM]]. The instability of auto remediation in RFT-FM (MSC -5.84%) is consistent with the point in [[障害緩和]] that "iterations that can be safely rolled back are the key". A parallel ingest (ARFBench) updated [[異常検知]] at the same time and caused a conflict, which was resolved by reloading and appending.

## [2026-06-04] ingest-paper | Unearthing Semantic Checks for Cloud IaC Programs (Zodiac)
- Source: `.raw/papers/sosp24-zodiac.pdf` (SOSP '24, DOI:10.1145/3694715.3695974. ACM returned 403 for both epdf and WebFetch, so the preprint `cs-pk.com/preprint-sosp24-zodiac.pdf` by author [[Patrick Tser Jern Kon]] was fetched with curl; 16p, md5 d3e3a903…)
- Summary: [[@2024__SOSP__Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs|2024__SOSP__Unearthing Semantic Checks for Cloud Infrastructure-as-Code Programs]]
- Pages created: [[Yiming Qiu]], [[Patrick Tser Jern Kon]], [[Ryan Beckett]], [[Ang Chen]], [[University of Michigan]], [[Zodiac]], [[Terraform]], 
[[Microsoft Azure]] (entity 8) / [[Infrastructure as Code]], [[設定マイニング]] (concept 2) = 11 pages
- Pages updated: [[Microsoft]], [[障害注入]] (added the cross-cutting finding "deploy-time fault injection and silent faults" plus a question to the concept) / index, hot, each _index. Note that [[Infrastructure as Code]], [[Ang Chen]], [[University of Michigan]], [[Yiming Qiu]], and [[Patrick Tser Jern Kon]] were expanded by the parallel Lilac/NSync ingest across the three sources Zodiac/Lilac/NSync (under per-file lock coordination; my Zodiac content was preserved)
- Key insight: Closes the **semantic gap**, where IaC that passes compilation still fails at deployment, by [[設定マイニング|mining]] semantic checks from public repositories (3 KB classes + graph DSL + 84 templates + confidence/lift + GPT-4 interpolation) and by deployment-validating negative test cases for which SMT (Z3) guarantees that "only a single check is violated". From 52 Azure types and 26,000 repositories it yields 510 validated checks, 200+ buggy repos, and 4 fixes to official documentation. Generating negative test cases amounts to "fault injection" into the configuration, and sifting out, through observation in a real environment, the false positives where an injection does not turn into a failure is isomorphic to the silent fault problem in [[障害注入]].
- Many parallel ingests (Zodiac/Lilac/NSync/eBPF/ATSF) caused the index to drift. Reconciled the index to **196 pages / 28 sources** based on the actual file count (per-file locks avoided torn writes).

## [2026-06-04] ingest-paper | Two Cloud IaC × LLM agent papers (NSync / Lilac)
- Source: `.raw/papers/arxiv-2510.20211.pdf` (NSync, arXiv:2510.20211) / `.raw/papers/lilac-aiops-2025.pdf` (Lilac, AIOps 2025, fetched from cs-pk.com)
- Summary: [[@2025__arXiv__Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents|2025__arXiv__Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents]] / [[@2025__AIOps__Automated Lifting for Cloud Infrastructure-as-Code Programs|2025__AIOps__Automated Lifting for Cloud Infrastructure-as-Code Programs]]
- Pages created: [[NSync]], [[Lilac]], [[Amazon Web Services]], [[University of California, San Diego]], [[Zhenning Yang]], [[Jingjia Peng]], [[AWS CloudTrail]], [[aztfexport]] (source 2 + entity 8)
- Pages updated: [[Infrastructure as Code]], [[AIOps]] (concept 2) / [[Terraform]], [[University of Michigan]], [[Ang Chen]], [[Yiming Qiu]], [[Patrick Tser Jern Kon]] (added one-way references to the Zodiac-related entities from the parallel ingest) / index, hot, each _index
- Key insight: The concept [[Infrastructure as Code]] and a set of entities collided with the ingest of [[Zodiac]] (SOSP'24; pre-deployment IaC validation from the same [[Ang Chen]]/[[University of Michigan]] group), which was running in parallel. They were **merged** rather than duplicated. The core cross-cutting finding: "the same lab attacks the IaC lifecycle from three directions, forward pre-deployment validation (Zodiac), reverse lifting (Lilac), and drift repair (NSync); all three converge on LLM + symbolic guardrails + an accumulating knowledge base, and inter-resource dependencies are the hardest part regardless of direction". It is also an example of AIOps extending its scope from reactive diagnosis to proactive configuration management.
- Caution: a parallel session is editing index/_index at the same time. Registration of shared entities (Terraform/UMich/Ang Chen, etc.) in _index and index.md is left to the Zodiac side; any duplicates are expected to be deduplicated by a later wiki-lint run.

## [2026-06-04] ingest | eBPF × AI/LLMs - The Convergence of System Observability and AI
- Source: `.raw/articles/gpttrace-ebpf-ai-2026-06-04.md` (`eunomia.dev/GPTtrace/` fetched on 2026-06-04; body extracted with curl + defuddle. An eBPF×AI overview + awesome list by [[Yusheng Zheng]] ([[eunomia-bpf]]). WebFetch returned 403; curl with a User-Agent returned 200)
- Summary: [[@2026__eunomia.dev__eBPF × AI-LLMs - The Convergence of System Observability and AI|2026__eunomia.dev__eBPF × AI-LLMs - The Convergence of System Observability and AI]]
- Pages created: [[@2026__eunomia.dev__eBPF × AI-LLMs - The Convergence of System Observability and AI|source]], [[Yusheng Zheng]], [[eunomia-bpf]], [[bpftime]], [[GPTtrace]], [[AgentSight]], [[Kgent]], [[eBPF]] (new concept)
- Pages updated: [[テレメトリ]], [[agentic SRE]], [[Model Context Protocol]], [[go-conntracer-bpf]], index/log/hot
- Key insight: This wiki has consistently covered AIOps/RCA/observability at the **application layer**; this source is the first to bring in the **kernel layer (eBPF)** angle. The core is a **bidirectional symbiotic loop** between eBPF and AI: (a) **eBPF for AI**: observing AI workloads and agents with zero-instrumentation telemetry at the kernel layer ([[AgentSight]] traces claude code/gemini-cli at <3% overhead); (b) **AI for eBPF**: LLMs generate and verify eBPF ([[Kgent]] reaches about 80% semantic correctness with Z3 symbolic checking; [[GPTtrace]] is an implementation). The symbiotic loop is consolidated in the new concept [[eBPF]], which cross-connects (1) the frontier of the instrumentation layer of [[テレメトリ]] (the eBPF lineage of [[go-conntracer-bpf]]), (2) a third axis for [[agentic SRE]], taking the position of observing the agents themselves, and (3) the fact that [[Model Context Protocol]] has begun to be used as the standard for exposing eBPF/kernel tools to agents. Because this source is an awesome list (secondary information), it is marked `confidence: medium`; primary ingestion of [[AgentSight]] (arXiv:2508.02736) and [[Kgent]] (eBPF'24) is left as an open question.
- Note: The URL is the GPTtrace page, but the content is an overview of eBPF×AI as a whole rather than GPTtrace alone. Many projects are listed; pages were limited to the 6 entities + 1 concept that are core to eunomia-bpf and strongly connected to the vault, and the rest are only listed in the body of the source page (editorial judgment). Because of contention with a parallel ingest (the ATSF paper), `agentic SRE.md`, `index.md`, and `log.md` were edited after acquiring the wiki-lock. Reconciled totals to 178→186 pages and 24→25 actual sources.

## [2026-06-04] ingest-paper | Position: Beyond Model-Centric Prediction—Agentic Time Series Forecasting
- Source: `.raw/papers/arxiv-2602.01776.pdf` (Cheng+, [[University of Science and Technology of China]], arXiv:2602.01776v4 [cs.LG] 11 Mar 2026, 11p). Fetched the original PDF from arXiv and read the full text via pdftotext. The abstract, bibliographic data, and author affiliations were verified against the PDF body (opening and footnote 1).
- Summary: [[@2026__arXiv__Position Beyond Model-Centric Prediction - Agentic Time Series Forecasting|2026__arXiv__Position Beyond Model-Centric Prediction - Agentic Time Series Forecasting]]
- Pages created: [[@2026__arXiv__Position Beyond Model-Centric Prediction - Agentic Time Series Forecasting|source]], [[Mingyue Cheng]], [[Xiaoyu Tao]], [[Qi Liu]], [[Enhong Chen]], [[University of Science and Technology of China]], [[Cast-R1]], [[TimeCopilot]], [[エージェント型時系列予測]] (new concept)
- Pages updated: [[時系列基盤モデル]], [[agentic SRE]], index/hot/each _index
- Key insight: A position paper that reformulates time series forecasting as iterative decision-making over perception, planning, action, reflection, and memory (ATSF). It treats invoking a forecasting model as one element of the action space and organizes implementations into three kinds, Workflow/AgenticRL/AgenticFlow (Table 1/2). This agentic loop is isomorphic to the "observe→act→verify→reflect" decomposition of [[agentic SRE]]; the point that agentic design principles are shared across domains is consolidated in the new concept [[エージェント型時系列予測]]. ATSF is orthogonal to model scale and to [[時系列基盤モデル]], taking the stance of calling a TSFM as one tool. This is the first agentic time series forecasting source in this wiki and its first position paper.
- Note: The input was an existing `papers/` note (a detailed single-source memo), but as a wiki-ingest-paper run the original PDF was fetched separately from arXiv and consolidated into the wiki layer. The existing `papers/` note is preserved and referenced one-way from the source page. The official code in PDF footnote 1, `github.com/Mingyue-Cheng/atsf`, is adopted as primary (the discrepancy with `Xiaoyu-Tao/Cast-R1-TS` in the existing papers note is annotated on the source page). Reconciled to 24 actual sources and 178 total pages.

## [2026-06-03] ingest-paper | LogPilot - Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems
- Source: `.raw/papers/arxiv-2509.25874.pdf` (Jiang+, [[The Chinese University of Hong Kong]]×[[ByteDance]], accepted at ASE 2025, arXiv:2509.25874v1 [cs.SE], 
13p). Fetched the original PDF from arXiv and read the full text via pdftotext. The abstract, bibliographic data, and ASE 2025 acceptance were verified on the arXiv abs page.
- Summary: [[@2025__ASE__LogPilot - Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems|2025__ASE__LogPilot - Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems]]
- Pages created: [[@2025__ASE__LogPilot - Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems|source]], [[LogPilot]], [[Zhihan Jiang]], [[Michael R. Lyu]], [[Tieying Zhang]], [[The Chinese University of Hong Kong]], [[Volcano Engine]], [[ログ解析]] (new concept)
- Pages updated: [[ByteDance]], [[根本原因分析]], [[異常検知]], [[Fault Localization]]
- Key insight: Solves the problem of log volume exceeding the context window with intent-aware scoping, which narrows logs by the semantic intent of the alert definition (PromQL), and a design that reconstructs requests into spatiotemporal log chains, clusters them, and passes only representatives to the LLM (98.71% fewer calls). This is the first primary RCA paper dedicated to logs in this wiki; the observation that the "narrow the information, then reason" skeleton runs through [[MetricSifter]]/[[Bits AI SRE]] across modalities is consolidated in [[ログ解析]]. The industry-side stance that "anomaly detection without context is insufficient" is isomorphic to that of [[MonitorAssistant]].
- Note: Another session ingested [[@2025__arXiv__Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark|2025__arXiv__Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark]] in parallel (numbering it "source 23"). Past numbering was off by +1, and the actual source count including LogPilot is 23. Reconciled the index to 169 pages / 23 sources (per-file locks avoided torn writes).

## [2026-06-03] ingest-paper | Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark
- Source: `.raw/papers/arxiv-2510.04711.pdf` (Fang+, [[The Chinese University of Hong Kong, Shenzhen]], arXiv:2510.04711v2 [cs.SE], revised 2025-12-23, 20p). Fetched the original PDF from arXiv and read the full text via pdftotext.
- Summary: [[@2025__arXiv__Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark|2025__arXiv__Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark]]
- Pages created: [[@2025__arXiv__Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware 
Benchmark|source]], [[Aoyang Fang]], [[Pinjia He]], [[The Chinese University of Hong Kong, Shenzhen]], [[障害注入]] (new concept)
- Pages updated: [[根本原因分析]], [[Train-Ticket]], [[ChaosMesh]]
- Key insight: The fact that the simple heuristic SimpleRCA matches SOTA on four public benchmarks exposes the "progress" of data-driven RCA as a product of overly simple benchmarks (86% of fault cases are Type I/II, localized or under-manifested; 99% have incomplete observational data). The quantification that 84.4% of 9,152 injections are silent faults pushes further the criticism that ChaosMesh "injects only symptoms" and becomes the core of the new concept [[障害注入]]. The preconditions for RCA are organized into three, "quantity (reduction), quality (scale), and completeness (coverage)", connecting to the discussions of [[MetricSifter]]/[[TelecomTS]].

## [2026-06-03] ingest-paper | MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models
- Source: `.raw/papers/fse2024-monitorassistant.pdf` (Yu+, [[Tsinghua University]]×[[Microsoft]], ESEC/FSE 2024 Industry Track, DOI:10.1145/3663529.3663826, 12p). Fetched the public PDF on the NetManAIOps site as the original and read the full text via pdftotext. The abstract and bibliographic data were verified on the ACM Digital Library and Microsoft Research pages.
- Summary: [[@2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models|2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models]]
- Pages created: [[@2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models|source]], [[MonitorAssistant]], [[Zhaoyang Yu]], [[Dan Pei]]
- Pages updated: [[Minghua Ma]], [[異常検知]]
- Key insight: Restricting the LLM to a meta layer (configuration recommendation, interpretation, feedback mediation) rather than using it as the detector itself is a practical answer to the constraint that "LLMs are too heavy to run continuously". The definition of a "practical anomaly" (statistical deviation + incident corroboration) makes the structure of the academia-industry gap explicit.

## [2026-06-03] ingest-paper | TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis
- Source: `.raw/papers/arxiv-2510.06063.pdf`
- Summary: [[@2026__ICML__TelecomTS - A Multi-Modal Observability Dataset for Time Series and Language Analysis|2026__ICML__TelecomTS - A Multi-Modal Observability Dataset for Time Series and Language Analysis]]
- Pages created: [[@2026__ICML__TelecomTS - A Multi-Modal Observability Dataset for Time Series and Language Analysis|2026__ICML__TelecomTS - A Multi-Modal Observability 
Dataset for Time Series and Language Analysis]], [[TelecomTS]], [[Yale University]], [[Ali Maatouk]], [[Rex Ying]]
- Pages updated: [[時系列基盤モデル]], [[異常検知]], [[根本原因分析]]
- Key insight: The first to quantify that removing absolute scale information degrades RCA by up to 30.4 points, i.e. that normalization destroys diagnostic information specific to observability. Toto stands out with RCA 0.848 thanks to pretraining on observability data, while Mantis, which encodes scale explicitly, surpasses Toto in anomaly-detection F1.

## [2026-06-03] ingest-paper | A Survey of AIOps in the Era of Large Language Models
- Source: `.raw/papers/arxiv-2507.12472.pdf` (Zhang+, [[Peking University]]/[[Tsinghua University]]/[[University of Illinois Chicago]]/[[The Hong Kong University of Science and Technology (Guangzhou)]], accepted by ACM Computing Surveys, arXiv:2507.12472, DOI:10.1145/3746635, 35p). Because ACM sits behind Cloudflare bot protection and returned 403 for both WebFetch and curl, the arXiv preprint was fetched as the original with fetch-paper-pdf.sh, and the body §1–§8 was read via pdftotext. The abstract and bibliographic data were verified on the arXiv abs page (journal-ref "Accepted By CSUR").
- Summary: [[@2025__CSUR__A Survey of AIOps in the Era of Large Language Models|2025__CSUR__A Survey of AIOps in the Era of Large Language Models]]
- Pages created: [[@2025__CSUR__A Survey of AIOps in the Era of Large Language Models|2025__CSUR__A Survey of AIOps in the Era of Large Language Models]], [[Ying Li]], [[Philip S. 
Yu]], [[University of Illinois Chicago]], [[The Hong Kong University of Science and Technology (Guangzhou)]], [[異常検知]]
- Pages updated: [[Lingzhe Zhang]], [[Tong Jia]], [[Peking University]], [[Tsinghua University]], [[AIOps]], [[根本原因分析]], [[障害緩和]], [[Fault Localization]], [[障害予測]], [[時系列基盤モデル]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The first comprehensive survey of LLM4AIOps, by the same PKU group as the previously covered [[MicroRemed]] ([[Lingzhe Zhang]], [[Tong Jia]], [[Ying Li]]). It establishes a third axis for slicing AIOps, the **process flow** (data→task→method→evaluation), alongside the capability axis of [[AIOpsLab]] and the autonomy axis of [[Google]]. Within its five levels of mitigation automation, [[MicroRemed]] (Lv4 script generation) and [[Stratus]] (Lv5 automatic execution) from the vault find their places, and a temporal connection is confirmed: the gap the survey flagged as "Lv5 effectiveness unverified" (cutoff 2024-12) is filled by 2025–2026 primary sources equipped with [[Transactional No-Regression]]. The survey itself cites [[AIOpsLab]] as the representative full-lifecycle benchmark, so the map (survey) and the locations (primary sources) reference each other. Created the missing concept [[異常検知]]. Source inspection detected and annotated a numerical inconsistency in arXiv v1 (abstract "183 papers" vs. body/Fig.4 "163 papers"). Because numbering collided with the parallel ingest of [[@2026__ICSE__An Empirical Study of Production Incidents in Generative AI Cloud Services|2026__ICSE__An Empirical Study of Production Incidents in Generative AI Cloud Services]] (source 18), this survey was reconciled as source 19.

## [2026-06-03] ingest-paper | An Empirical Study of Production Incidents in Generative AI Cloud Services
- Source: `.raw/papers/arxiv-2504.08865.pdf` (Yan+, [[Huazhong University of Science and Technology]]×[[University of Illinois Urbana-Champaign]]×[[Microsoft]], ICSE 2026, arXiv:2504.08865, 12p)
- Summary: [[@2026__ICSE__An Empirical Study of Production Incidents in Generative AI Cloud Services|2026__ICSE__An Empirical Study of Production Incidents in Generative AI Cloud Services]]
- Pages created: [[@2026__ICSE__An Empirical Study of Production Incidents in Generative AI Cloud Services|2026__ICSE__An Empirical Study of Production Incidents in Generative AI Cloud Services]], [[Haoran Yan]], [[Huazhong University of Science and Technology]], [[インシデント管理]]
- Pages updated: [[Yinfang Chen]], [[Minghua Ma]], [[Tianyin 
Xu]], [[Microsoft]], [[AIOps]], [[根本原因分析]], [[障害緩和]]
- Key insight: Incidents in GenAI cloud services have 1.83× the TTM of non-GenAI incidents, 38.3% are detected by humans (vs. 13.7% for non-GenAI), and symptoms and root causes are many-to-many. This quantifies for the first time the gap between the "one fault, one root cause" structure assumed by agent-evaluation benchmarks and the complexity of production incidents.

## [2026-06-03] ingest-paper | Pulse: Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement
- Source: `.raw/papers/3779212.3790163.pdf` (Xiao+, [[Nanjing University]] State Key Lab of Novel Software Technology and others, ASPLOS '26, DOI:10.1145/3779212.3790163, 19p, CC-BY 4.0). ACM sits behind a Cloudflare bot challenge and returned 403 for both WebFetch and curl, so a local PDF the user saved from the browser (`~/Downloads/3779212.3790163.pdf`) was copied into `.raw/papers/` with fetch-paper-pdf.sh, and all 19 pages (main body §1–§11 + Appendix A–E) were read via pdftotext -layout.
- Summary: [[@2026__ASPLOS__Pulse - Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement|2026__ASPLOS__Pulse - Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement]]
- Pages created: [[Pulse]], [[Nanjing University]], [[Chen Tian]], [[Qingkai Meng]], [[Yibo Xiao]], [[BlueField-3]], [[NCCL]], [[Aegis]], [[Holmes]], [[GreyHound]], [[LLM学習モニタリング]]
- Pages updated: [[Fault Localization]], [[GPUクラスタ運用]], [[LLM分散学習]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The constraint in [[Minder]] that "fine-grained monitoring creates overhead, so granularity cannot be raised" (§6.6 calls for ms-level monitoring but leaves it undeployed because of overhead) can be lifted by moving measurement from host on-path to on-NIC off-path ([[BlueField-3]] DPA, three-layer measurement): Pulse achieves microsecond granularity at near-zero overhead with no changes to training code or the CCL, and localizes to the machine level 10 of 12 stragglers that SOTA (OP-level) cannot reach. Detection mechanisms now span three layers: heartbeat→host-metric→traffic. Because another session ingested ITBench in parallel as source 16, Pulse was numbered source 17 (per-file locks avoided torn writes; index counts reconciled to 124 pages / 17 sources).

## [2026-06-03] ingest-paper | ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
- Source: `.raw/papers/jha25a.pdf` (Jha+, [[IBM Research]]×[[University of Illinois 
Urbana-Champaign]], ICML 2025 / PMLR v267, pp.27134–27197, 64p. Fetched the PDF via the PMLR page + raw.githubusercontent with fetch-paper-pdf.sh and read the main body §1–§6 + Impact Statement via pdftotext. The abstract and bibliographic data were verified against the PMLR page via WebFetch)
- Summary: [[@2025__ICML2025__ITBench - Evaluating AI Agents across Diverse Real-World IT Automation Tasks|2025__ICML2025__ITBench - Evaluating AI Agents across Diverse Real-World IT Automation Tasks]]
- Pages created: [[@2025__ICML2025__ITBench - Evaluating AI Agents across Diverse Real-World IT Automation Tasks|2025__ICML2025__ITBench - Evaluating AI Agents across Diverse Real-World IT Automation Tasks]], [[Rohan Arora]]
- Pages updated: [[ITBench]] (fully rewritten from secondary information to a primary-paper basis), [[Saurabh Jha]], [[IBM Research]], [[University of Illinois Urbana-Champaign]], [[AIOpsLab]], [[CrewAI]], [[SRE Benchmark]], [[agentic SRE]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]]
- Key insight: Ingested [[ITBench]], which until now could only be referenced through secondary information via [[SREGym]] and [[Stratus]], as a primary paper. Two orthogonal axes of benchmark design are established: (1) SRE depth ([[AIOpsLab]]/[[SREGym]]) vs. cross-persona breadth (102 scenarios across SRE/CISO/FinOps); (2) an order-of-magnitude difference in reported ceilings ([[AIOpsLab]] ~59%, [[SREGym]] ~60%, ITBench with GPT-4o: SRE mitigation 11.43%, Hard mitigation 0%). The trace ablation (diagnosis 13.81%→9.52%, mitigation 11.43%→2.86%) turns the importance of telemetry selection into a controlled variable, and the trajectory metrics (Detoured/Covered Services) make observable that "successful runs concentrate exploration on the fault propagation chain". NTAM, the middle stage in the evolution of diagnosis oracles, was confirmed from the primary source. The close relationship between ITBench and [[Stratus]] (same IBM team, same [[CrewAI]] foundation) was also confirmed.

## [2026-06-03] ingest-paper (re-ingest) | STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds
- Source: `.raw/papers/arxiv-2506.02009.pdf` (Chen+, [[University of Illinois Urbana-Champaign]]×[[IBM Research]] (+[[Tsinghua University]]), NeurIPS 2025, arXiv:2506.02009v2, 48p). Fetched the PDF with fetch-paper-pdf.sh, extracted the full text with pdftotext, and read §1–§7 and Appendices A–E. **Because the existing entry referenced only the poster/abstract, it was fully rewritten on the basis of the PDF body (re-ingest).**
- Summary: [[@2025__NeurIPS2025__STRATUS - A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds|2025__NeurIPS2025__STRATUS - A Multi-agent 
System for Autonomous Reliability Engineering of Modern Clouds]]
- Pages created: [[IBM Research]], [[CrewAI]]
- Pages updated: [[Stratus]], [[Transactional No-Regression]], [[agentic SRE]], [[障害緩和]], [[Saurabh Jha]], [[University of Illinois Urbana-Champaign]], [[Tsinghua University]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]]
- Key insight: Ingesting the body made it possible to pin down TNR as "an Alpern–Schneider safety property (Lemma 3.1) guaranteeing that severity µ = w1|A| + w2|V| + w3|L| is monotonically non-increasing (µ(s) ≤ b) under A-Lock / Faithful Undo / Bounded Risk Window (K=20)", resolving the open question "formal definition unconfirmed" in the concept [[Transactional No-Regression]]. The implementation is an Undo Agent with stack-based rollback. Evaluation: AIOpsLab 69.2% (9/13) and ITBench 50.0% (9/18), for 1.5X/5.4X; the ablation, with No retry at 15.4%, corroborates that the undo-and-retry of TNR is the key to mitigation. Important caveat: 8 of the 18 ITBench problems are solved by a pod restart that exploits the property that "the injected fault disappears when the pod restarts", and results do not change with or without the undo agent. The cross-cutting finding that the safety specification (no-regression) and the integrity of the evaluation (fixing the problem correctly) are orthogonal was added to [[Transactional No-Regression]] and [[障害緩和]]. Affiliations were also confirmed from the body as UIUC/IBM Research/Tsinghua (resolving the old "affiliation not stated" note).

## [2026-06-03] ingest-paper | Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- Source: `.raw/papers/nsdi25-deng.pdf` (Deng+, [[ByteDance]]/[[Tsinghua University]]/Northeastern/[[Harvard University]], NSDI '25, 978-1-939133-46-5, 18p). Because the USENIX presentation page returned 403 to WebFetch, the HTML was fetched with curl, the PDF URL (usenix.org/system/files/nsdi25-deng.pdf) was extracted and fetched with fetch-paper-pdf.sh, and the full text was extracted with pdftotext and read.
- Summary: [[@2025__NSDI__Minder - Faulty Machine Detection for Large-scale Distributed Model Training|2025__NSDI__Minder - Faulty Machine Detection for Large-scale Distributed Model Training]]
- Pages created: [[Minder]], [[Yangtao Deng]], [[Zhuo Jiang]], [[Minlan Yu]], [[Tsinghua University]], [[Harvard University]]
- Pages updated: [[ByteDance]], [[GPUクラスタ運用]], [[Fault Localization]], [[LLM分散学習]], [[変化点検知]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The second primary paper in the ML systems cluster and the first on the "faulty machine detection in training clusters" axis. Whereas [[MegaScale]], also from ByteDance, recovers dead nodes based on heartbeats, Minder identifies faults, including slow faults, at the machine level without supervision from abnormal metric patterns that precede a halt (machine-level similarity + continuity + per-metric LSTM-VAE + decision tree prioritization) (over 1 year in production; precision 0.904, F1 0.893, 3.6 seconds, 99% shorter than manual diagnosis). Cross-cutting findings recorded: detection mechanisms differentiate into heartbeat-based and metric-pattern-based families that complement each other; training-cluster diagnosis and production AIOps are isomorphic in requiring a distributed view, but their signal sources are opposite (training = deviation from homogeneity vs. microservices = dependency propagation through heterogeneity), and Minder explicitly states that cloud diagnosis methods cannot be transferred to training; and the hardware-dominant fault landscape (55.8% / ECC 38.9%) is continuous with SAKURAONE and LLaMA3.

## [2026-06-03] ingest-paper | SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
- Source: `.raw/papers/arxiv-2604.13600.pdf` (Konishi+, [[SAKURA Internet]] Research Center, accepted at MLSys 2026 / arXiv:2604.13600, v1 2026-04-15 / v2 2026-04-16, cs.DC/cs.NI, 15p). Fetched with fetch-paper-pdf.sh (429 on the first attempt; re-fetched with the sandbox disabled); bibliographic data verified against the arXiv abs page via WebFetch (confirmed MLSys 2026 acceptance and equal contribution by three authors).
- Summary: [[@2026__MLSys2026__SAKURAONE - An Open Ethernet-Based AI HPC System|2026__MLSys2026__SAKURAONE - An Open Ethernet-Based AI HPC System]]
- Pages created: [[Fumikazu Konishi]], [[SAKURAONE]], [[SONiC]], [[オープンネットワーキング]], [[GPUクラスタ運用]]
- Pages updated: [[Yuuki Tsubouchi]], [[Hirofumi Tsuruta]], [[SAKURA Internet]], [[LLM分散学習]], [[並列化戦略]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: Co-authored by vault owner [[Yuuki Tsubouchi]], and the first primary paper on HPC/open networking in this wiki. A fully open 800 GbE fabric with SONiC + RoCEv2 achieves a time-to-train of 1.02–1.26× relative to NVIDIA Eos (InfiniBand), and the paper quantifies from telemetry the workload dynamics of mid-scale (800 GPU) single-tenant LLM development (small jobs dominate by count while large jobs dominate GPU time; 73.5% cancellation; a CPT→fine-tuning phase transition; 42.9% of 21 faults are GPU-caused). The scale-independence of MFU at 38–41% and the continuity of hardware-caused dominance were cross-checked against the hyperscale sources in [[LLM分散学習]]/[[並列化戦略]].

## [2026-06-03] ingest-paper | MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- Source: 
`.raw/papers/nsdi24-jiang-ziheng.pdf` (Jiang+, [[ByteDance]]/[[Peking University]], NSDI '24, 2024-04-16, 16p), slides `.raw/papers/nsdi24-slides-jiang-ziheng.pdf` (17p). WebFetch got 403 from the USENIX presentation page (usenix.org/conference/nsdi24/presentation/jiang-ziheng), so the HTML was fetched with curl (with a User-Agent header) and the direct links to the paper and slide PDFs were extracted. The paper PDF was retrieved with fetch-paper-pdf.sh and the slide PDF with curl; text was extracted with pdftotext.
- Summary: [[@2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs|2024__NSDI__MegaScale - Scaling Large Language Model Training to More Than 10,000 GPUs]]
- Pages created: [[MegaScale]], [[ByteDance]], [[Megatron-LM]], [[Ziheng Jiang]], [[Xin Jin]], [[Xin Liu]]
- Pages updated: [[LLM分散学習]], [[並列化戦略]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]]
- Key insight: First primary paper in the ML systems cluster, corroborating with a production system the taxonomy of the LLM training survey ingested just before it ([[@2026__Vicinagearth__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey|2026__Vicinagearth__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey]]). Against the survey's "MFU 38–41% at tens of thousands of GPUs", MegaScale measures 55.2% on 12,288 GPUs (1.34× over Megatron-LM), establishing that the Efficiency axis is an algorithm-system co-design problem rather than a fixed limit. On the Reliability axis it gives a concrete form: more than 100 automatic recoveries during a multi-week production run. Recorded as a cross-cutting finding that the straggler diagnosis in §5 (invisible to single-GPU benchmarks; cause identified from distributed timeline traces) is the same kind of distributed-view problem as production-service AIOps ([[Fault Localization]]/[[分散トレーシング]]).
- Note: No new concept was created; the existing [[LLM分散学習]]/[[並列化戦略]] were filled in from seed to developing (convention §8). The existing [[Peking University]] page was left as-is and only referenced.

## [2026-06-03] ingest-paper | Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
- Source: `.raw/papers/arxiv-2407.20018.pdf` (Duan+, arXiv:2407.20018, submitted 2024-07-29, 42p; formally published in the journal Vicinagearth Vol.3 Issue 1 Article 38, Springer, 2026-06-01, DOI:10.1007/s44336-026-00038-z. Bibliographic data confirmed via the Crossref API; the Springer version itself was not retrieved because of an authentication redirect)
- Summary: [[@2026__Vicinagearth__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey|2026__Vicinagearth__Efficient Training of Large Language Models on Distributed Infrastructures - A Survey]]
- Pages created: [[LLM分散学習]], [[並列化戦略]], [[Mixture-of-Experts]], [[Shanghai AI Laboratory]], [[Jiangfei Duan]], [[Peng Sun]]
- Pages updated: [[index]], [[hot]], [[concepts/_index]], [[entities/_index]], [[sources/_index]]
- Key insight: This wiki's first source on LLM training infrastructure, a separate domain. Established a new ML systems cluster, independent of AIOps/SRE/time series, organized by the three SER axes (Scalability/Efficiency/Reliability) and four layers (infrastructure/parallelism/optimization/fault tolerance). Noted as a weak point of contact that the anomaly detection/failure analysis in §8 (fault tolerance) shares vocabulary with operational observability.
- Process note: Because the paper is long (42p), the body (lines 158–2165) was split into four segments, read closely and structured by parallel subagents, then merged. The paper cites hundreds of systems, but entities were limited to the core (primary affiliation and lead authors).

## [2026-06-03] ingest-paper | Scaling Telemetry Workloads in Cloud Applications: Techniques for Instrumentation, Storage, and Mining
- Source: `.raw/papers/kyoto-djohk00908.pdf` ([[Yuuki Tsubouchi]]'s Kyoto University doctoral dissertation, 2025-03, 112p; the PDF was fetched with curl from the DSpace REST API `/server/api/core/bitstreams/<uuid>/content` of the Kyoto University Research Information Repository and extracted with pdftotext. The front-end `/bitstreams/<uuid>/download` returns the SPA shell HTML, so the REST API content endpoint was used instead)
- Summary: [[@2025__Kyoto University__Scaling Telemetry Workloads in Cloud Applications - Techniques for Instrumentation, Storage, and Mining|2025__Kyoto University__Scaling Telemetry Workloads in Cloud Applications - Techniques for Instrumentation, Storage, and Mining]]
- Pages created: [[HeteroTSDB]], [[go-conntracer-bpf]], [[Mackerel]], [[Hatena]], [[Kyoto University]], [[Ryosuke Matsumoto]], [[テレメトリ]], [[時系列データベース]], [[分散トレーシング]]
- Pages updated: [[Yuuki Tsubouchi]], [[特徴量削減]], [[Fault Localization]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: Beneath the already-ingested [[MetricSifter]] (mining layer), this dissertation supplies the instrumentation layer ([[分散トレーシング]]: in-kernel flow bundling) and the storage layer ([[時系列データベース]]: [[HeteroTSDB]]), establishing [[テレメトリ]] in the wiki as a three-layer framework. The design guideline in §6.2, "reduce data at the two context-rich ends (instrumentation and mining), and keep storage context-independent", takes the "narrow the information" backbone that runs through MetricSifter's [[特徴量削減]] and the telemetry over-consumption pathology of LLM agents, and generalizes it to the most upstream point of collection. The LLM failure snapshot proposed as a future direction connects to [[Bits AI SRE]]/[[根本原因分析]].
- Note: The existing statement on [[Yuuki Tsubouchi]] that the doctorate was obtained in 2023 conflicts with "March, 2025" on the dissertation's cover. Following the primary source (this dissertation), the statement was revised so that it no longer gives a year, and the discrepancy is documented in a note callout.

## [2026-06-03] ingest-paper | MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications
- Source: `.raw/papers/ieee-10462133-metricsifter.pdf` (IEEE Access vol.12, pp.37398–37417, DOI:10.1109/ACCESS.2024.3374334; authors [[Yuuki Tsubouchi]] and [[Hirofumi Tsuruta]] ([[SAKURA Internet]] Research Center); the IEEE stamp PDF was fetched with curl and 20p extracted with pdftotext; bibliographic data and abstract verified against the IEEE metadata JSON)
- Summary: [[@2024__IEEE Access__MetricSifter - Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications|2024__IEEE Access__MetricSifter - Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications]]
- Pages created: [[Yuuki Tsubouchi]], [[Hirofumi Tsuruta]], [[SAKURA Internet]], [[MetricSifter]], [[Meltria]], [[Sock Shop]], [[PyRCA]], [[Fault Localization]], [[特徴量削減]], [[変化点検知]]
- Pages updated: [[根本原因分析]], [[AIOps]], [[Train-Ticket]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The problem that MetricSifter (pre-LLM, 2024) identifies, "irrelevant metrics $M_C$ act as noise and hinder localization", is isomorphic to the pathology later observed in LLM agents ([[Bits AI SRE]]/[[AIOpsLab]] §3.6), "performance drops because of telemetry over-consumption". The backbone of narrowing the information before diagnosing persists across generations of methods. This is the wiki's first paper by the vault owner.

## [2026-06-03] ingest-paper | Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
- Source: `.raw/papers/arxiv-2605.27286.pdf` (arXiv:2605.27286v1 [cs.LG], submitted 2026-05-26; authors Yiding Liu et al., 8 in total, [[Ant International]] (contact @ant-intl.com; no formal affiliation stated); the 31p PDF was retrieved with fetch-paper-pdf.sh; abstract and bibliographic data verified against the arXiv abs page)
- Summary: [[@2026__arXiv__Falcon-X - A Time Series Foundation Model for Heterogeneous Multivariate Modeling|2026__arXiv__Falcon-X - A Time Series Foundation Model for Heterogeneous Multivariate Modeling]]
- Pages created: [[Falcon-X]], [[Ant International]], [[Chronos-2]], [[GIFT-Eval]], [[fev-bench]], [[多変量時系列予測]]
- Pages updated: [[時系列基盤モデル]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: Second TSFM source. In contrast to Toto's specialization in observability data, Falcon-X focuses on cross-variate modeling of heterogeneous multivariate series: it decouples variates into a latent prototype space and represents positive and negative dependencies with differential attention. It criticizes the semantic collapse of raw-space group attention ([[Chronos-2]]). Best overall on GIFT-Eval, but SRE downstream tasks are not evaluated.

## [2026-06-03] ingest-paper | This Time is Different: An Observability Perspective on Time Series Foundation Models
- Source: `.raw/papers/arxiv-2505.14766.pdf` (arXiv:2505.14766 v2, NeurIPS 2025 poster; authors Ben Cohen, Emaad Khwaja et al., 19 in total, Datadog AI Research / Carnegie Mellon University; the 38p PDF was retrieved with fetch-paper-pdf.sh; abstract and bibliographic data verified against the arXiv abs page and the NeurIPS poster page)
- Summary: [[@2025__NeurIPS2025__This Time is Different - An Observability Perspective on Time Series Foundation Models|2025__NeurIPS2025__This Time is Different - An Observability Perspective on Time Series Foundation Models]]
- Pages created: [[Toto]], [[BOOM]], [[Ameet Talwalkar]], [[Carnegie Mellon University]], [[時系列基盤モデル]]
- Pages updated: [[Datadog]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The wiki's first pure ML (time series forecasting) source, on a different axis from the AIOps/SRE agent line so far. That said, its origin is [[Datadog]], making it the third instance; it connects as the **forecasting model layer for observability telemetry** that sits beneath the SRE agent ([[Bits AI SRE]]). The paper quantifies that observability metrics differ statistically from general time series (extreme values of KPSS, skew, spectral entropy, flat spots, etc., §4.3) and reaches zero-shot SOTA with a dedicated decoder-only architecture (patch-based causal scaling, proportional factorized attention 11:1, Student-T mixture head, composite robust loss). Pretraining uses 2.36 trillion points (43% are anonymized Datadog observability metrics). On [[BOOM]] it improves on the runner-up by 12.4% in CRPS and 13.1% in MASE, and it is also SOTA on GIFT-Eval (Rank 5.495) and LSF. Weights, code, and data are released under Apache 2.0.
- Decision: Of the 19 authors, only the senior author [[Ameet Talwalkar]] (CMU, high reference value) and the affiliation [[Carnegie Mellon University]] were made entities; the remaining authors are recorded in the source page (selectivity is a feature). Created a single concept at the matching granularity, [[時系列基盤モデル]], with a one-way link to `structures/時系列基盤モデル - MOC`. Because this is the first source, cross-cutting findings are thin, so the open questions were fleshed out instead (for example, isolating why general-purpose TSFMs struggle on observability data).

## [2026-06-03] ingest | Building Bits AI SRE: Autonomous Incident Investigation Agent
- Source: `.raw/articles/building-bits-ai-sre-2026-06-03.md` (datadoghq.com/blog/building-bits-ai-sre, fetched 2026-06-03; authors Daniel Shan, Tristan Ratchford; retrieved with WebFetch because defuddle was unavailable under sandbox network restrictions)
- Summary: [[@2026__Datadog__Building Bits AI SRE - Autonomous Incident Investigation Agent|2026__Datadog__Building Bits AI SRE - Autonomous Incident Investigation Agent]]
- Pages created: [[Datadog]], [[Bits AI SRE]], [[根本原因分析]]
- Pages updated: [[agentic SRE]], [[SRE Benchmark]], [[AIOps]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: Second primary source from industry. Whereas [[Google]] describes the full SRE lifecycle plus autonomous mitigation (L2/L3), [[Bits AI SRE]] **specializes in the investigation and RCA stage** (mitigation is left to future integration with specialist agents). **RCA (stage 3)**, the only level of the AIOps 4-level taxonomy without a concept page, was created as [[根本原因分析]]. The backbone is hypothesis-driven investigation (iterated hypothesis testing rather than summarizing all telemetry at once), causal relationship focus (avoiding the context overload caused by 12+ tool calls in the initial version), and recursive depth (drilling down by decomposing into sub-hypotheses). Here an industrial implementation explicitly avoids, as the starting point of product design, the "pulling too much information" pathology that academic benchmarks observed ([[AIOpsLab]] §3.6, [[SREGym]] greedy, [[MicroRemed]] excessive probing). Evaluation replays real incidents and uses an LLM judge, the same backbone as [[Google]]'s continuous eval, and the post claims up to a 95% reduction in TTR.
- Decision: The two authors (Daniel Shan, Tristan Ratchford) are blog authors who do not intersect with other sources and have low reference value, so no person entities were created and they are recorded in the source page. Entities are limited to the organization [[Datadog]] and the product [[Bits AI SRE]]. RCA was created as a cross-cutting concept that fills a gap in the taxonomy (alongside the existing [[障害緩和]]/[[障害予測]]).

## [2026-06-03] ingest | AI in SRE: How Google is Engineering the Future of Reliable Operations
- Source: `.raw/articles/ai-engineering-reliable-operations-2026-06-03.md` (sre.google, fetched 2026-06-03; authors Ioannis Papapanagiotou, Stevan Malesevic, Chris Heiser, Ruslan Meshenberg; defuddle was unavailable under sandbox network restrictions, so the article was retrieved with WebFetch)
- Summary: [[@2026__GoogleSRE__AI in SRE - Engineering the Future of Reliable Operations|2026__GoogleSRE__AI in SRE - Engineering the Future of Reliable Operations]]
- Pages created: [[SRE AI Autonomy Levels]], [[Google]], [[AI Operator]], [[Actus]], [[Detectr]], [[Model Context Protocol]]
- Pages updated: [[agentic SRE]], [[Transactional No-Regression]], [[SRE Benchmark]], [[AIOps]], [[障害予測]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: This wiki's first primary source from industry and production operations. Whereas academic benchmarks measure agents by **task success rate**, Google governs AI-Ops by stages of delegated authority, the **SRE AI Autonomy Levels (L0–L4)** (an orthogonal axis). The separation of reasoning ([[AI Operator]]) from actuation ([[Actus]]: dry-run, Red Button) corresponds to an industrial implementation of [[Transactional No-Regression]]. LLM-as-a-Judge and Bronze/Silver/Gold evaluation appear as industry's continuous eval. The claim that L2/L3 autonomous mitigation runs in production is in tension with the capability ceiling seen in academic benchmarks (around 60%, saturating at 5–20 steps), so a contradiction callout was placed on [[agentic SRE]].
- Decision: The four authors have low reference value as primary-source authors (industry authors who do not intersect with other sources), so no person entities were created and they are recorded in the source page. Among Google-internal systems, only the architecturally important [[AI Operator]] / [[Actus]] / [[Detectr]] and the standard [[Model Context Protocol]] were made entities; AI Alert/InvD/IRMA/Antigravity CLI/Production Agent are described in the body of the source page.

## [2026-06-03] ingest-paper | MicroRemed: Benchmarking LLMs in Microservices Remediation
- Source: `.raw/papers/arxiv-2511.01166.pdf` (arXiv:2511.01166v1 [cs.CL], 2025-11-03; PKU/Alibaba/Tsinghua; code: github.com/LLM4AIOps/MicroRemed)
- Summary: [[@2025__arXiv__MicroRemed - Benchmarking LLMs in Microservices Remediation|2025__arXiv__MicroRemed - Benchmarking LLMs in Microservices Remediation]]
- Pages created: [[MicroRemed]], [[ThinkRemed]], [[Ansible]], [[Train-Ticket]], [[Online-Boutique]], [[Lingzhe Zhang]], [[Tong Jia]], [[Peking University]], [[Alibaba Group]], [[障害緩和]]
- Pages updated: [[AIOps]], [[agentic SRE]], [[SRE Benchmark]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: The first dedicated benchmark to carve out Mitigation, the top level of the AIOps 4-level taxonomy, as "diagnostic report → generation of an executable Ansible playbook (E2E-MR)". ThinkRemed's ablation shows reflection > probe and the harm of excessive probing, independently agreeing with the finding from [[Stratus]] and [[SREGym]] that iteration and reflection are the key to mitigation. It diverges from SREGym in actively adopting chaos injection for evaluating mitigation.

## [2026-06-03] ingest-paper | PAGER: Proactive Monitoring Agent for Enterprise AI Assistant
- Source: `.raw/papers/aaai2026-pager.pdf` (AAAI-26 demo, pp. 41574–41576; CAIS 2026 demo; DOI:10.1609/aaai.v40i48.42344; OJS galley 46305)
- Summary: [[@2026__AAAI__PAGER - Proactive Monitoring Agent for Enterprise AI Assistant]]
- Pages created: [[PAGER]], [[Adobe Experience Platform]], [[Adobe]], [[Yunyao Li]], [[障害予測]]
- Pages updated: [[AIOps]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: Adds a proactive [[障害予測]] axis to a wiki that had been entirely reactive. PAGER is a hybrid design: prediction is done by a classical random forest, and the LLM is confined to the explanation and conversational interface layer.

## [2026-06-03] ingest | STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds
- Source: `.raw/articles/stratus-neurips2025-poster-116834-2026-06-03.md` (NeurIPS 2025 poster; arXiv:2506.02009, OpenReview fYW1PKawwJ)
- Summary: [[@2025__NeurIPS2025__STRATUS - A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds|2025__NeurIPS2025__STRATUS - A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds]]
- Pages created: [[@2025__NeurIPS2025__STRATUS - A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds|2025__NeurIPS2025__STRATUS - A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds]], [[Transactional No-Regression]], [[Saurabh Jha]]
- Pages updated: [[Stratus]], [[agentic SRE]], [[SRE Benchmark]], [[AIOpsLab]], [[ITBench]], [[Yinfang Chen]], [[Tianyin Xu]], [[index]], [[hot]]
- Key insight: Upgraded STRATUS, previously held only as secondary information via [[SREGym]] (the [[Stratus]] entity), to a primary source. SREGym's observation that undo-and-retry yields the strongest mitigation matches what the primary paper formalizes as the safety specification [[Transactional No-Regression]] (TNR). The paper claims to exceed SOTA by 1.5× on both the AIOpsLab and ITBench benchmarks, and evaluation across multiple benchmarks is becoming standard (the benchmark author [[Saurabh Jha]] is also a co-author of the agent).

## [2026-06-03] ingest-paper | AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
- Source: `.raw/papers/arxiv-2501.06706.pdf` (MLSys 2025; arXiv:2501.06706v1, 2025-01-12)
- Summary: [[@2025__MLSys2025__AIOpsLab - A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds|2025__MLSys2025__AIOpsLab - A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds]]
- Pages created: [[Yinfang Chen]], [[Minghua Ma]], [[Microsoft]], [[DeathStarBench]], [[ChaosMesh]], [[AIOps]]
- Pages updated: [[AIOpsLab]], [[University of Illinois Urbana-Champaign]], [[SRE Benchmark]], [[agentic SRE]], [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: AIOpsLab (2025) decomposes a failure into four sub-problems (detection/localization/RCA/mitigation), scores each separately, and centers on the application/virtualization layers. Its successor SREGym (2026) goes beyond this with end-to-end evaluation, cross-layer faults, and noise. The two refer to the same endeavor under different names (AgentOps and agentic SRE) and independently observe the failure in which agents cling to their first hypothesis and pull too much telemetry.
- Contradiction: The SREGym-derived claim that "AIOpsLab requires a ReAct loop; non-ReAct agents need porting" conflicts with the primary paper (only get_action is required; any framework is allowed). A callout was placed on [[AIOpsLab]].

## [2026-06-03] ingest-paper | SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
- Source: `.raw/papers/arxiv-2605.07161.pdf` (arXiv:2605.07161v2, 2026-05-13)
- Summary: [[@2026__arXiv__SREGym - A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios|2026__arXiv__SREGym - A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios]]
- Pages created: [[SREGym]], [[Stratus]], [[AIOpsLab]], [[ITBench]], [[Tianyin Xu]], [[University of Illinois Urbana-Champaign]], [[agentic SRE]], [[SRE Benchmark]], [[Metastable Failure]]
- Pages updated: [[index]], [[hot]], [[sources/_index]], [[entities/_index]], [[concepts/_index]]
- Key insight: A high-fidelity SRE benchmark takes noise, lower-layer (OS/hardware) faults, and metastable/concurrent/correlated failure modes as its differentiating axes. Frontier agents are strong at the application layer, but on these new failures E2E collapses from 60% to 18–28%, and with a greedy approach they fixate on the first anomaly.

## [2026-06-02] init | LLM wiki layer initialization
- Type: setup
- mode=generic, transport=filesystem (pinned with manual_override to avoid false detection of the GUI binary)
- Created: `.raw/`, `wiki/{sources,entities,concepts,questions,meta}/`, `.vault-meta/{mode,transport}.json`
- Helpers copied: `scripts/{wiki-mode.py,wiki-lock.sh,detect-transport.sh}`
- Scope: new sources only. The existing papers/, research/, and structures/ are not ingested.

## [2026-06-26] ingest-paper | 9 papers on datacenter reliability and cloud failures
- Source: `.raw/papers/{dsn2017-datacenter-hardware-failures,socc2016-cos,imc2018-facebook-network-errors,hotos2019-azure-software-failures,sosp2011-hardware-errors,nsdi2020-omegagen,datacenter-scale-temperature-impact,empirical-kubernetes-operator-bugs,taxdc}.pdf`
- Summary: [[データセンター信頼性]], [[クラウドインシデント]], [[データセンターネットワーク信頼性]], [[分散システム障害]], [[Kubernetesオペレータ]]
- Pages created: 9 source pages, 5 concept pages.
- Key insight: Treating failures merely as independent component failures is insufficient; correlation, recovery cascades, partial failures, and configuration/ordering dependencies must be tied to service impact.