
# Detecting and Localizing Anomalies in Container Clusters Using Markov Models

Created: February 2, 2021 10:53 AM
URL: [[]]pi.com/2079-9292/9/1/64
Year: 2020

# Abstract

Detecting the location of performance anomalies in complex distributed systems is critical to ensuring the effective operation of a system, in particular, if short-lived container deployments are considered, adding challenges to anomaly detection and localization. In this paper, we present a framework for monitoring, detecting and localizing performance anomalies for container-based clusters using the hierarchical hidden Markov model (HHMM). The model aims at detecting and localizing the root cause of anomalies at runtime in order to maximize the system availability and performance. The model detects response time variations in containers and their hosting cluster nodes based on their resource utilization and tracks the root causes of variations. To evaluate the proposed framework, experiments were conducted for container orchestration, with different performance metrics being used. The results show that HHMMs are able to accurately detect and localize performance anomalies in a timely fashion.

# Summary

- The paper proposes a framework that uses a hierarchical hidden Markov model (HHMM) to monitor, detect, and localize performance anomalies in container-based clusters.
- An HHMM is a generalization of the hidden Markov model (HMM) designed to model domains with a hierarchical structure.
- A node in the assumed system consists of one or more containers, so the HHMM is adopted to reflect this hierarchy.
- (The paper focuses consistently on anomalies in system "performance".)
- Measurements of response time and resource utilization (such as CPU and memory) for containers and nodes are used to build the HHMM, which detects anomalous behavior in container and node workloads and identifies its root cause.
- HHMM design
    - The HHMM is trained on the response times of the hidden CPU and memory resources assigned to each component.
    - (I do not yet understand the details, such as what the transition probabilities on the edges represent.)

![[Detecting and Localizing Anomalies in Container Cl/Untitled.png]]

- The framework consists of two phases:
    1. Detection: detect the workload behavior of a component based on its observed response time and resource utilization.
    2. Identification: identify the cause of the detected anomalous behavior across multiple paths. A path is defined as a link between two components (such as a container and a node).

### Evaluation

- TPC-W ([http://www.tpc.org/tpcw/](http://www.tpc.org/tpcw/)) is used as the experimental environment.
- The baseline methods are a dynamic Bayesian network (DBN) and hierarchical temporal memory (HTM).
- In the experiments, the proposed HHMM-based method was shown to detect anomalous behavior and identify its root cause with an accuracy of 94.3% or higher.

![[Detecting and Localizing Anomalies in Container Cl/Untitled 1.png]]
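
To make the two phases (detection, then identification) more concrete, here is a minimal sketch of the general idea. It is not the paper's method: it uses a plain, flat HMM rather than an HHMM, and every parameter, threshold, component name, and trace below is a hypothetical value chosen for illustration. Each component's utilization trace is discretized into low/medium/high, scored with the forward algorithm under a model of "normal" behavior, and flagged when its mean log-likelihood falls below an assumed threshold. Identification is reduced to picking the flagged component with the lowest score.

```python
import math

# Hypothetical two-state HMM of "normal" behavior (values are NOT from the paper).
# Hidden states: 0 = idle, 1 = busy.
# Observation symbols: 0 = low, 1 = medium, 2 = high resource utilization.
START = [0.9, 0.1]
TRANS = [[0.9, 0.1],
         [0.2, 0.8]]
EMIT = [[0.80, 0.18, 0.02],
        [0.20, 0.70, 0.10]]
THRESHOLD = -1.5  # assumed cut-off on the mean log-likelihood per observation


def mean_log_likelihood(obs):
    """Scaled forward algorithm: log P(obs) under the HMM, divided by len(obs)."""
    n = len(START)
    alpha = [START[i] * EMIT[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by the emission.
            alpha = [
                sum(alpha[i] * TRANS[i][j] for i in range(n)) * EMIT[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)  # P(obs[t] | obs[0..t-1])
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
    return total / len(obs)


def is_anomalous(obs):
    """Phase 1 (detection): the trace is unlikely under the normal-behavior model."""
    return mean_log_likelihood(obs) < THRESHOLD


def localize(traces):
    """Phase 2 (simplified identification): return the flagged component with
    the lowest score, or None if no component is flagged."""
    scores = {name: mean_log_likelihood(obs) for name, obs in traces.items()}
    flagged = {name: s for name, s in scores.items() if s < THRESHOLD}
    if not flagged:
        return None
    return min(flagged, key=flagged.get)


# Hypothetical discretized traces for one node and its two containers.
TRACES = {
    "node-1": [1, 1, 2, 1, 1, 1],
    "container-a": [0, 0, 1, 0, 0, 1],
    "container-b": [2, 2, 2, 2, 2, 2],
}
```

With these assumed traces, `localize(TRACES)` returns `"container-b"`, because a sustained run of high utilization is the least likely sequence under the normal-behavior model. The paper's HHMM would additionally encode the node-to-container hierarchy and trace the anomaly along paths between components, which this flat sketch does not attempt.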