Alert Storm - yuuk1's Digital Garden

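# Alert Storm

## Definition

An alert storm is a phenomenon in which hundreds to thousands of alerts fire within a short period during a service failure or large-scale event. Investigating every alert by hand is practically impossible: an analysis of more than 3 million alerts collected over 3 years at [[China EverBright Bank]] reported cases that consumed an average of 55 minutes and 6 engineers ([[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ ICSE-SEIP2020]] §2.2, Fig. 2). The term is IT-industry jargon, and the volumes involved are orders of magnitude larger than the industrial alarm flood criterion defined by ISA 18.2 / EEMUA 191 (10 alarms per 10 minutes per operator) (Source: [[@2024__JNCA__A survey on intelligent management of alerts and incidents in IT services|Yu+ JNCA2024]] §3.2).

The form differs by business domain. Cloud services see "intermittent bursts" (a failure triggers a sudden surge that then subsides). In the NUDT supercomputer environment, by contrast, the phenomenon is a "continuous flow of bursts" called alert overload, with a far higher density of 0.98 to 2.11 million alerts per 130 days ([[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers|Yuan+ ISSRE2024]] §III-C). The thresholds and detector designs that distinguish the two depend strongly on the business domain.

## Cross-cutting findings

- **A third category, "alert flooding from severe failure", has appeared**: [[@2025__SIGCOMM__SkyNet - Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures|SkyNet (Yang+ SIGCOMM2025)]] treats severe failures, which "occur only a few times a year but account for most of the loss", as a separate category, and gives a concrete Alibaba Cloud case of 10,000 alerts per minute (§2.2: half of the internet entry cables failed simultaneously, and every tool, Syslog/SNMP/Ping/Out-of-band, flooded). This is a third category alongside the "intermittent storm" of Zhao+ ICSE-SEIP2020 (triggered by typical failures) and the "continuous alert overload" of Yuan+ ISSRE2024 (sustained HPC bursts). Its distinguishing traits are: (a) occurrences are too rare to yield enough data for statistical learning, (b) it includes "unknown failures" that match no past case, and (c) no single monitoring tool can detect it, so integrating multiple tools is mandatory. This data scarcity is why SkyNet did not adopt the deep-learning-based DeepIP[9] for severe failures (§8: "for severe network failures it is impossible to get enough history data for model training"). (Source: [[@2025__SIGCOMM__SkyNet - Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures]] §1, §2.2)
- **"Alert classification" and "threshold design" for alert storms differ across papers**: Papers differ in how they subdivide the alerts within a single storm. SkyNet empirically introduces **three classes: Failure / Abnormal / Root cause alerts**, and uses failure alerts as the core trigger for incident detection because they "appear rarely but accompany almost every failure incident" (Figure 5d). Zha+ 2024 uses **four severity levels: critical / high / medium / low**, but treats severity as an attribute that is merely one feature of the classifier. Yuan+ 2024 (SuperAgg) uses **four patterns: stable / fake / wandering / jittering**, which classify the shape of an alert's behavior along the time axis. Notably, the three classification axes are orthogonal: severity (severity-driven, Zha+ 2024), anomaly type (state-driven, SkyNet), and time-series shape (behavior-driven, Yuan+ 2024). (Source: [[@2025__SIGCOMM__SkyNet - Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures]] §4.2, [[@2024__Electronics__Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs]] §2.2, [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers]] §IV)
- **"Intermittent storms" and "continuous alert overload" are different problems**: The cloud-service alert storm defined by [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ 2020]] fires intermittently in response to a failure and then subsides, a pattern for which change-point detection based on EVT (extreme value theory) is effective. [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers|Yuan+ ISSRE2024]], on the other hand, redefines supercomputer alert overload as a "continuous flow of sustained bursts", and SuperAgg combines Apriori-based mining of primary-secondary relationships at the system layer with SOM-based pattern extraction at the sensor layer. In sustained scenarios, where change-point detection loses its meaning, the aggregation strategy itself has to be redesigned. (Source: [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]] §3.2, [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers]] §I, §III-C)
- **A line of work now treats "propagation" as a first-class property of alert storms**: [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems|Chen+ ASE2023]] redefines an alert storm as "a phenomenon caused by the propagation of alerts" and casts it as a link prediction task. DyAlert designs the AMDG (Alert-Metric Dynamic Graph) as a discrete-time dynamic graph and learns spatio-temporal representations with a heterogeneous k-GNN + GRU. Whereas [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ 2020]] takes the stance of "summarizing the storm as a whole", Chen+ 2023 takes the stance of "predicting the pairwise links that make up the storm". The former aims to reduce the cognitive load on OCEs; the latter aims to reconstruct the structure of failure propagation. (Source: [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]] §3, [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]] §I, §III)
- **Division of labor among three approaches: detection vs. summarization vs. structuring**: Handling an alert storm splits into (1) **detection** (using EVT to decide "are we in a storm right now?": [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ 2020]] §3.2), (2) **summarization** (narrowing down to representative alerts: ibid. §3.3, effort reduction above 98%), and (3) **structuring** (recovering causal links: [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems|Chen+ 2023]], or extracting hierarchical patterns: [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers|Yuan+ 2024]] §IV). The three are separate layers that coexist: detection triggers summarization and structuring, and the choice between summarization and structuring is a choice of output form (set of representative alerts vs. causal graph vs. hierarchical patterns). The fact that Yang+ DSN2022 defined Cascading Alerts as a collective anti-pattern supports the need for (3) structuring. (Source: [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]], [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]], [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers]], [[@2022__DSN__Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems]])
- **Aggregation by text similarity alone fails in propagation scenarios**: The ablation in [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ 2020]] shows that text similarity (Jaccard) alone (α=1) and topology distance alone (α=0) both perform far worse than their weighted combination (α=0.6) (§4.4.2, Fig. 10). [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems|Chen+ 2023]] strengthens this observation, measuring accuracy differences of +10.1% over the text-semantics-only DyAlert-T, +11.1% over the metrics-only DyAlert-M, and +5.6% over the time-series-only DyAlert-G, and confirming that integrating all three elements is essential (§IV-B). The existence of alert pairs that are semantically distant yet causally chained (such as server overload and DB slowdown) fundamentally breaks text-only approaches. (Source: [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]] §4.4.2, [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]] §IV-B, [[@2024__ICSE-SEIP__Knowledge-aware Alert Aggregation in Large-scale Cloud Systems - a Hybrid Approach]] §1)
- **SPOT/DSPOT (Siffer+ KDD2017) is the statistical root of alert storm detectors**: [[@2017__KDD__Anomaly Detection in Streams with Extreme Value Theory|SPOT/DSPOT]] uses EVT (Peaks-Over-Threshold + GPD) to achieve streaming anomaly detection that requires neither a distributional assumption nor a manually set threshold, and runs with a single risk parameter q. Because the Alert Storm detector of [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ ICSE-SEIP2020]] adopted EVT, the SPOT/DSPOT paper provides the theoretical basis for alert storm detection in cloud services. Its univariate iid assumption and its requirement of an initial batch of n≈1000 samples leave a cold-start problem open. (Source: [[@2017__KDD__Anomaly Detection in Streams with Extreme Value Theory]] §2–3, [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]] §3.2)
- **The Fudan alert aggregation trilogy, OAS → DyAlert → ProAlert, chains together as post-processing for alert storms**: Across three papers (ICSE2022, ASE2023, FSE2025), the same Fudan group has incrementally strengthened the step of "folding a failure-triggered alert storm down by root cause": supervised semantic + behavior → dynamic graph → unsupervised topology semantics. Alert storm research thus divides its roles between two lines: "detection (SPOT/Zhao+)" and "aggregation (the Fudan trilogy)". (Source: [[@2022__ICSE__Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems]], [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]], [[@2025__FSE__Alert Summarization for Online Service Systems by Validating Propagation Paths of Faults]])

## Open questions

- **The relationship between severe failure (as assumed by SkyNet) and alert storm (Zhao+ 2020)**: A severe failure produces an alert storm "as a side effect", but severe failure itself is defined by the property "only a few occurrences per year but enormous loss", which is independent of the alert storm definition of "concentrated firing in a short period". The two overlap rather than one containing the other. No framework has yet been proposed that treats the "alert flooding" of both in a unified way.
- **Context window growth in next-generation LLMs shifts the boundary of "severe failure × LLM"**: The main reason SkyNet §2.3 gives for not adopting LLMs, "Syslog at 10M per 15 min exceeds a 20M-token context", may be relaxed as context windows grow from Claude 3 (200K) and Gemini 1.5 (2M tokens) toward potentially 100M-class next-generation LLMs. Because the LLM × severe failure boundary moves with technology trends, it needs periodic re-evaluation.
- No detector has been proposed that uniformly handles forms intermediate between the "intermittent storm" (assumed by Zhao+ 2020) and "continuous alert overload" (assumed by Yuan+ 2024), for example periodic bursts in a commercial cloud that become frequent only during the daytime. EVT presupposes intermittency and Apriori presupposes persistence, so an architecture that handles both at once is needed.
- How does the ratio of "regular alerts" to "failure-related alerts" during an alert storm vary by business domain? [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems|Zhao+ 2020]] found that regular alerts were a minority at China EverBright Bank (denoising alone yields only 5-8%), but it is unclear whether this has been re-measured at Alibaba, Microsoft Cloud, or elsewhere.
- Link prediction ([[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems|Chen+ 2023]]) and hierarchical pattern mining ([[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers|Yuan+ 2024]]) solve the same problem under different structural assumptions (symmetric pairs vs. primary-secondary rules). As of this survey, no empirical study has compared the two in an online environment.
- Handling during an alert storm is usually "summarization" or "aggregation", but how should it be balanced against upstream measures that **reduce** the storm itself (rule refinement, SLO design, simplification of the dependency topology)? The rule refinement of [[@2025__ASE__AlertGuardian - Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems|AlertGuardian]] and the summarization of Zhao+ 2020 are complementary, but there is not yet a production report that integrates the two.
- Does SPOT's iid assumption hold for cloud monitoring metrics, which are strongly autocorrelated? How effective a multivariate EVT extension would be on the correlated alert groups of an alert storm remains unverified.

## Related

- Parent concepts: [[アラート管理|Alert management]] (alert storm handling is a subprocess), [[アラート集約|Alert aggregation]] (the main countermeasure during a storm)
- Related anti-pattern: Cascading Alerts in [[アラートアンチパターン|Alert anti-patterns]] (A6, first defined by Zhao+ 2020)
- Related methods: AlertStorm (Zhao+ 2020), DyAlert (Chen+ 2023), SuperAgg (Yuan+ 2024), COLA (Kuang+ 2024), Zha+ Electronics2024, SkyNet
- Contrast: industrial alarm flood (ISA 18.2 / EEMUA 191, orders of magnitude smaller in scale than an IT alert storm)
- Sources: [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]], [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]], [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers]], [[@2024__Electronics__Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs]], [[@2025__SIGCOMM__SkyNet - Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures]]

## References

- [[@2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]] §2.2 (empirical evidence of the phenomenon), §3 (detection and summarization), §4.4 (ablation)
- [[@2023__ASE__Dynamic Graph Neural Networks-Based Alert Link Prediction for Online Service Systems]] §I (propagation definition), §III (DyAlert), §IV-B (ablation)
- [[@2024__ISSRE__Exploring Hierarchical Patterns for Alert Aggregation in Supercomputers]] §I, §III-C (alert overload definition), §IV (SuperAgg)
- [[@2024__Electronics__Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs]] §1 (production storm evidence in the power industry, 100K alerts), §3.1 (spatio-temporal locality + cascading observations)
- [[@2025__SIGCOMM__SkyNet - Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures]] §2.2 (severe failure case: 10K+ alerts per minute), §2.3 (rationale for not adopting LLMs), §4-§7 (SkyNet design and 1.5 years of production operation)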
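
Supplementary sketch for the finding that text similarity alone fails in propagation scenarios: the snippet below illustrates an α-weighted blend of Jaccard text similarity and a topology term, in the spirit of the α=0.6 combination reported by Zhao+ 2020. It is a minimal sketch, not the paper's implementation. The whitespace tokenizer, the `1 / (1 + hops)` topology similarity, the alert fields (`service`, `text`), and the three-node dependency graph are all assumptions made here for illustration.

```python
from collections import deque


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (assumed tokenizer)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def hop_distance(graph: dict, src: str, dst: str):
    """Breadth-first hop count between two services; None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def topology_similarity(graph: dict, a: str, b: str) -> float:
    """Map hop distance to (0, 1]; this 1/(1+hops) mapping is an assumption."""
    dist = hop_distance(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


def combined_similarity(alert_a: dict, alert_b: dict, graph: dict,
                        alpha: float = 0.6) -> float:
    """alpha weights the text term, (1 - alpha) weights the topology term."""
    text = jaccard(alert_a["text"], alert_b["text"])
    topo = topology_similarity(graph, alert_a["service"], alert_b["service"])
    return alpha * text + (1.0 - alpha) * topo


# Hypothetical dependency graph (adjacency lists) and two alerts that share
# no tokens but sit on directly connected services.
graph = {"web": ["app"], "app": ["web", "db"], "db": ["app"]}
overload = {"service": "app", "text": "server overload cpu high"}
slowdown = {"service": "db", "text": "db query slowdown"}

for alpha in (1.0, 0.6, 0.0):
    score = combined_similarity(overload, slowdown, graph, alpha)
    print(f"alpha={alpha}: {score:.2f}")
```

With `alpha=1.0` the pair scores zero because the two messages share no tokens, while any `alpha` below 1 lets the one-hop dependency contribute, which is the "semantically distant but causally chained" case described above.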