Self-Adaptive Root Cause Diagnosis for Large-Scale

# Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture

Bibliography: MA, Meng, et al. Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture. IEEE Transactions on Services Computing, 2020.
Created: November 5, 2020 12:56 PM
URL: https://ieeexplore.ieee.org/abstract/document/9090324/
Year: 2020

This is the paper in which MS-Rank was published as a journal article in 2020.

## Abstract

> The emergence of microservice architecture in Cloud systems poses new challenges for reliability operation and maintenance. Due to numerous services and diverse types of metrics, it is time-consuming and challenging to identify the root cause of anomaly in large-scale microservice architecture. To solve this issue, this paper presents a multi-metric and self-adaptive root cause diagnosis framework, named MS-Rank. MS-Rank decomposes the task into four phases: impact graph construction, random walk diagnosis, result precision evaluation, metrics weight update. Initially, we introduce the concept of implicit metrics and propose a composite impact graph construction algorithm, using multiple types of metrics to discover causal relationships between services. Afterwards, we propose a diagnostic algorithm in which forward, selfward and backward transitions are designed to heuristically identify the root cause services. In addition, we establish a self-adaptive mechanism to update the confidence of different metrics dynamically according to their diagnostic precision. Lastly, we develop a prototype system and integrate MS-Rank into real production system - IBM Cloud. Experimental results show that MS-Rank has a high diagnostic precision and its performance outperforms several selected benchmarks. Through multiple rounds of diagnosis, MS-Rank can optimize itself effectively. MS-Rank can be rapidly deployed in various microservice-based systems and applications, requiring no predefined knowledge. MS-Rank also allows us to introduce expert experiences into its framework to improve the diagnostic efficiency and precision.

[[Self-Adaptive Root Cause Diagnosis for Large-Scale__translations]]
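
To make the random walk and weight update phases from the abstract more concrete, here is a toy sketch in Python. It is not the paper's algorithm: the abstract names forward, selfward and backward transitions and a metric confidence update, but gives no formulas. The graph format, the `corr` anomaly scores, the damping factor `rho`, the blending `rate`, and both function names are therefore assumptions made purely for illustration.

```python
import random
from collections import Counter


def random_walk_rank(graph, corr, start, steps=10_000, rho=0.2, seed=0):
    """Rank services by how often a biased random walk visits them.

    graph: dict mapping each service to the services it depends on
           (hypothetical format; every service must appear as a key)
    corr:  dict mapping each service to a score in [0, 1], e.g. how strongly
           its metrics correlate with the observed anomaly (hypothetical)
    rho:   damping factor for backward moves (assumed value)
    """
    rng = random.Random(seed)
    # Reverse adjacency so the walker can step back toward callers.
    parents = {s: [] for s in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            parents[dst].append(src)

    visits = Counter()
    current = start
    for _ in range(steps):
        visits[current] += 1
        moves, weights = [], []
        # Forward: move to a dependency, biased by its anomaly score.
        for nxt in graph[current]:
            moves.append(nxt)
            weights.append(corr[nxt])
        # Backward: return to a caller with a damped weight, so the walker
        # can leave a weakly correlated branch.
        for prev in parents[current]:
            moves.append(prev)
            weights.append(rho * corr[prev])
        # Selfward: stay put when the current service looks more suspicious
        # than any of its dependencies.
        best_forward = max((corr[n] for n in graph[current]), default=0.0)
        moves.append(current)
        weights.append(max(0.0, corr[current] - best_forward))
        if sum(weights) == 0:
            current = start  # nowhere to go: restart the walk
            continue
        current = rng.choices(moves, weights=weights, k=1)[0]
    return [service for service, _ in visits.most_common()]


def update_metric_weights(weights, precision, rate=0.5):
    """Blend each metric's weight with its observed diagnostic precision,
    then renormalize so the weights sum to 1 (assumed update rule)."""
    blended = {m: (1 - rate) * w + rate * precision[m] for m, w in weights.items()}
    total = sum(blended.values())
    if total == 0:
        return dict(weights)
    return {m: v / total for m, v in blended.items()}


# Hypothetical four-service call graph with made-up anomaly scores.
graph = {"frontend": ["cart", "catalog"], "cart": ["db"], "catalog": ["db"], "db": []}
corr = {"frontend": 0.3, "cart": 0.4, "catalog": 0.2, "db": 0.9}
ranking = random_walk_rank(graph, corr, start="frontend")

new_weights = update_metric_weights(
    {"latency": 0.5, "cpu": 0.5}, {"latency": 0.9, "cpu": 0.3}
)
```

In this toy setup the walker spends most of its time on `db`, the service with the highest anomaly score and no dependencies to move on to, so `db` comes first in `ranking`. After one update round the `latency` metric, which was assumed to diagnose more precisely, carries more weight than `cpu`.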