Map of Contents for [[Root Cause Analysis]].

## Tools
- [[PyWhy]]

## Papers Map
- [[人間によるフィードバックを考慮するAIOps論文|AIOps papers that incorporate human feedback]]
- [[シミュレーション評価を含むAIOps Failure Management研究論文|AIOps Failure Management research papers that include simulation-based evaluation]]
- [[データベースシステムのAIOps論文|AIOps papers on database systems]]
- [[ネットワークシステムのAIOps|AIOps for network systems]]
- [GitHub - dreamhomes/RCAPapers: Papers about Root Cause Analysis in MicroService Systems. Reference to Paper Notes: https://dreamhomes.top/](https://github.com/dreamhomes/RCAPapers)
- [[コールグラフを使用するAIOps Fault Localization論文|AIOps Fault Localization papers that use call graphs]]
- [[クラスタリングを用いるAIOps Failure Managementの論文|AIOps Failure Management papers that use clustering]]
- [[多次元データに対する根本原因分析|Root cause analysis for multidimensional data]]
- [[オフライン解析を含むAIOps Fault Localization論文|AIOps Fault Localization papers that include offline analysis]]
- [[Fault Localization Overviewリスト|Fault Localization Overview list]]
- [[パブリッククラウドベンダー向けの論文|Papers for public cloud vendors]]
- [[End-to-end Anomaly Detectrion and Fault Localization AIOps論文|End-to-end Anomaly Detection and Fault Localization AIOps papers]]
- [[マルチモーダルAIOps論文|Multimodal AIOps papers]]

## Survey
- [[2025__CSUR__Trustworthy AI-based Performance Diagnosis Systems for Cloud Applications - A Review]]
- [[2024__misc__Root Cause Analysis for Distributed Systems]]
- [[2024__arXiv__A Comprehensive Survey on Root Cause Analysis in (Micro) Services - Methodologies, Challenges, and Trends]]
- [[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__Dissertation__Towards Effective Performance Diagnosis for Distributed Applications]]
- [[2023__ESEC-FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System - An Industrial Experience Report]]
- [[2023__arXiv__Case Studies of Causal Discovery from IT Monitoring Time Series]]
- [[2023__EIEDP__Causality between violations and failure factors in cloud service and its analysis methods - a survey]]
- [[2017__arXiv__Survey on Models and Techniques for Root-Cause Analysis]]

## Papers

### 2025
- IDI: [[2025__ICLR__Robust Root Cause Diagnosis using In-Distribution Interventions]]
- PA-Rank: [[2025__IoTJournal__PA-Rank - A GAN and Reinforcement Learning Powered Framework for Multi-Metric Anomaly Detection and Causal Diagnosis]]
- LogDB: [[2025__arXiv__LogDB - Multivariate Log-based Failure Diagnosis for Distributed Databases (Extended from MultiLog)]]
- TAMO: [[2025__arXiv__TAMO - Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data]]
- ThinkFL: [[2024__arXiv__ThinkFL - Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning]]
- eARCO: [[2025__arXiv__eARCO - Efficient Automated Root Cause Analysis with Prompt Optimization]]
- OpDiag: [[2025__TKDE__OpDiag - Unveiling Database Performance Anomalies Through Query Operator Attribution]]
- SDN: [[2025__ICDE__Anomaly Diagnosis with Siamese Discrepancy Networks in Distributed Cloud Databases]]
- AutoDebugger: [[2025__AIDB__AutoDebugger - Efficient Root Cause Analysis for Anomaly Jobs]]
- GALA: [[2025__arXiv__Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis]]
- LogInsight: [[2025__TSC__Accurate and Interpretable Log-Based Fault Diagnosis using Large Language Models]]
- DBAIOps: [[2025__arXiv__DBAIOps - A Reasoning LLM Enhanced Database Operation and Maintenance System using Knowledge Graphs]]
- RCRank: [[2024__VLDB__RCRank - Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems]]
- [[2025__arXiv__Causal AI-based Root Cause Identification - Research to Practice at Scale]]
- BSODiag: [[2025__ICSE-SEIP__BSODiag - A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems]]
- COCA: [[2025__ICSE__COCA - Generative Root Cause Analysis for Distributed Systems with Code Knowledge]]
- L4: [[2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- [[2025__arXiv__RADICE - Causal Graph Based Root Cause Analysis for System Performance Diagnostic]]
- [[2025__AAAI__Causal Discovery for Cloud Microservice Architectures]]
- DiagMLP: [[2025__arXiv__Are GNNs Actually Effective for Multimodal Fault Diagnosis in Microservice Systems?]]
- SinkFlow: [[2025__EAAI__SinkFlow - Fast and traceable root-cause localization for multidimensional anomaly events]]

### 2024
- [[2024__TSEM__Making Fault Localization in Online Service Systems More Actionable and Interpretable]]
- [[2024__arXiv__Breaking the Cycle of Recurring Failures - Applying Generative AI to Root Cause Analysis in Legacy Banking Systems]]
- [[2024__ICCSN__A Root Cause Localization Method Based on Event Call Chains for Microservices]]
- yRCA: [[2024__SPE__Explaining Microservices’ Cascading Failures From Their Logs]]
- SinkFlow: [[2024__EAAI__SinkFlow - Fast and traceable root-cause localization for multidimensional anomaly events]]
- Zoom-inRCL: [[Zoom-inRCL - Fine-grained root cause localization for B5G-6G network slicing]]
- UniDiag: [[2024__TSC__No More Data Silos - Unified Microservice Failure Diagnosis with Temporal Knowledge Graph]]
- SLIM: [[2024__ASE__SLIM - a Scalable and Interpretable Light-weight Fault Localization Algorithm for Imbalanced Data in Microservice]]
- MRCA: [[2024__ASE__MRCA - Metric-level Root Cause Analysis for Microservices via Multi-Modal Data]]
- LasRCA: [[2024__ASE__The Potential of One-Shot Failure Root Cause Analysis - Collaboration of the Large Language Model and Small Classifier]]
- FaaSRCA: [[2024__ISSRE__FaaSRCA - Full Lifecycle Root Cause Analysis for Serverless Applications]]
- KPIRoot: [[2024__ISSRE__KPIRoot - Efficient Monitoring Metric-based Root Cause Localization in Large-scale Cloud Systems]]
- [[2024__ICWS__Anomaly Detection and Root Cause Analysis of Microservices Energy Consumption]]
- iTCRL: [[2024__TSE__iTCRL - Causal-Intervention-Based Trace Contrastive Representation Learning for Microservice Systems1]]
- G-Cause: [[2024__ICWS__G-Cause - Parameter-free Global Diagnosis for Hyperscale Web Service Infrastructures]]
- [[2024__arXiv__Root Cause Analysis of Outliers with Missing Structural Knowledge1]]
- OCEAN: [[2024__arXiv__Online Multi-modal Root Cause Analysis]]
- HolisticRCA: [[2024__TSC__Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data]]
- Vista: [[2024__SoCC__Vista - Machine learning based database performance troubleshooting framework in Amazon RDS]]
- LoFI: [[2024__ISSRE__Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis]]
- ST-RF: [[2024__NaNA__Multi-source KPIs’ root cause localization in online service systems]]
- MicroHFRCL: [[2024__IJCNN__MicroHFRCL - A History Faults Based Root Cause Localization Framework in Microservice Systems]]
- DeepHunt: [[2024__TOSEM__Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder]]
- Medicine: [[2024__ASE__Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization]]
- [[2024__ISSRE__SparseRCA - Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces]]
- [[2024__OSR__LLexus - an AI agent system for incident management]]
- [[2024__DSN__Fault Localization Using Interventional Causal Learning for Cloud-Native Applications]]
- ResilienceGuardian: [[2024__IWQoS__Guardian of the Resiliency - Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient]]
- [[2024__ASE__Root Cause Analysis for Microservices based on Causal Inference - How Far Are We?]]
- FaultInsight: [[2024__KDD__FaultInsight - Interpreting Hyperscale Data Center Host Faults]]
- [[2024__ICC__Efficient Learning Framework for Failure Identification Model Based on Failure Injection]]
- [[2024__CSCWD__Advancing Root Cause Analysis in Cloud-native System with Knowledge Graph Path Embedding Translation]]
- TVDiag: [[2024__arXiv__TVDiag - A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data]]
- LGRCL: [[2024__ICIC__Variational Autoencoder and Graph Attention Root Cause Localization Model Based on Log Data and Graph Structure]]
- STRCA: [[2024__ICIC__STRCA - A Lightweight and Accurate Root Cause Analysis System Based on 5G Signalling Trace]]
- [[2024__Thesis__Anomaly Detection of Microservices Runtime Performance]]
- [[2024__arXiv__Industrial-Grade Time-Dependent Counterfactual Root Cause Analysis through the Unanticipated Point of Incipient Failure - a Proof of Concept]]
- [[2024__CSCWD__Multi-fusion algorithm root cause location model based on causal failure dependency graph]]
- Cloud Atlas: [[2024__arXiv__Cloud Atlas - Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight]]
- PORCA: [[2024__arXiv__PORCA - Root Cause Analysis with Partially Observed Data]]
- Chain-of-Event: [[2024__FSE__Chain-of-Event - Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph]]
- LatentSpace: [[2024__KDD__Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space]]
- CHASE: [[2024__arXiv__CHASE - A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems]]
- HeMiRCA: [[2024__TOSEM__HeMiRCA - Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources]]
- MicroIRC: [[2024__JSS__MicroIRC - Instance-level Root Cause Localization for Microservice Systems]]
- MicroCERCL: [[2024__arXiv__Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments]]
- [[2024__arXiv__Root Cause Analysis of Outliers with Missing Structural Knowledge]]
- [[2024__FGCS__Autonomous selection of the fault classification models for diagnosing microservice applications]]
- STRCA: [[2024__ICASSP__Semi-Supervised Metrics-Based Self-Training Root Cause Analysis for Cloud-Native Systems with Class-Imbalanced Data]]
- RCInvestigator: [[2024__arXiv__RCInvestigator - Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems]]
- LogRCA: [[2024__Euro-Par__LogRCA - Log-based Root Cause Analysis for Distributed Services]]
- GrayScope: [[2024__ESEC-FSE__Illuminating the Gray Zone - Non-intrusive Gray Failure Localization in Server Operating Systems]]
- SynthoDiag: [[2024__ESEC-FSE__Fault Diagnosis for Test Alarms in Microservices through Multi-source Data]]
- MicroDig: [[2024__TSC__Diagnosing Performance Issues for Large-Scale Microservice Systems with Heterogeneous Graph]]
- InstantOps: [[2024__ICPE__InstantOps - A Joint Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native Applications]]
- BARO: [[2024__ESEC-FSE__BARO - Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection]]
- [[2024__arXiv__Few-Shot Cross-System Anomaly Trace Classification for Microservice-based systems]]
- [[2024__ICPE__Context-aware Root Cause Localization in Distributed Traces Using Social Network Analysis]]
- KGroot: [[2024__ESWA__KGroot - Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks]]
- [[2024__ICSE-SEIP__Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations]]
- [[2024__ICSE__Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems]]
- [[2024__TCC__Root Cause Analysis for Cloud-Native Applications]]
- ExChain: [[2024__NSDI__ExChain - Exception Dependency Analysis for Root Cause Diagnosis]]
- NetAssistant: [[2024__NSDI__NetAssistant - Dialogue Based Network Diagnosis in Data Center Networks]]
- GAMMA: [[2024__WWW__GAMMA - Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications]]
- ChangeRCA: [[2024__FSE__ChangeRCA - Finding Root Causes from Software Changes in Large Online Systems]]
- [[2024__SPE__Detection of microservice‐based software anomalies based on OpenTracing in cloud]]
- [[2024__arXiv__Dependency Aware Incident Linking in Large Cloud Systems]]
- [[2024__ICCE__Exploration of Fault Identification and Automatic Recovery in Cloud-based FPGA Systems]]
- AlertRCA: [[2024__CCGrid__Causality Enhanced Graph Representation Learning for Alert-Based Root Cause Analysis]]
- FIRED: [[2024__FGCS__A fine-grained robust performance diagnosis framework for run-time cloud applications]]
- Panda: [[2024__CIDR__Panda - Performance Debugging for Databases using LLM Agents]]
- T-RCA: [[2024__arXiv__On the Fly Detection of Root Causes from Observed Data with Application to IT Systems]]
- [[2024__Transactions on Reliability__Multilayered Fault Detection and Localization With Transformer for Microservice Systems]]
- [[2024__AAAI__Root Cause Analysis In Microservice Using Neural Granger Causal Discovery]]
- RCACopilot: [[2024__EuroSys__Automatic Root Cause Analysis via Large Language Models for Cloud Incidents]]
- [[2024__JTPES__Intelligent Fault Analysis with AIOps Technology]]
- [[2024__arXiv__Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4]]
- ASFC: [[2024__Future Generation Computer Systems__Autonomous selection of the fault classification models for diagnosing microservice applications]]
- EffCause: [[2024__KDD__EffCause - Discover Dynamic Causal Relationships Efficiently from Time-Series]]

### 2023
- PatternRCA: [[2023__ICDM__PatternRCA - A Pattern-Aware Root Cause Analysis Framework for Multi-Dimensional Time Series]]
- BALANCE: [[2023__SIGMOD__BALANCE - Bayesian Linear Attribution for Root Cause Localization]]
- DARC: [[2023__BigData__DARC - High-dimensional Diffusing Anomaly Detection and Root Cause Location in Cloud Computing Systems]]
- EasyRCA: [[2023__AISTATS__Root Cause Identification for Collective Anomalies in Time Series given an Acyclic Summary Causal Graph with Loops]]
- HFDG: [[2023__TCE__Heterogeneous Data-Driven Failure Diagnosis for Microservice-Based Industrial Clouds Towards Consumer Digital Ecosystems]]
- [[2023__Applied Sciences__A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification]]
- GTrace: [[2023__ESEC-FSE__From Point-wise to Group-wise - A Fast and Accurate Microservice Trace Anomaly Detection Approach]]
- TraceDiag: [[2023__ESEC-FSE__TraceDiag - Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems]]
- DiagConfig: [[2023__ESEC-FSE__DiagConfig - Configuration Diagnosis of Performance Violations in Configurable Software Systems]]
- Raccoon: [[2023__ISSRE__Identifying Root-Cause Changes for User-Reported Incidents in Online Service Systems]]
- TraceStream: [[2023__ISSRE__TraceStream - Anomalous Service Localization based on Trace Stream Clustering with Online Feedback]]
- ServerRCA: [[2023__ISSRE__ServerRCA - Root Cause Analysis for Server Failure using Operating System Logs]]
- HRCA: [[2023__ISSRE__HRCA - A Heterogeneous Graph-based Adaptive Root Cause Analysis Framework]]
- EvLog: [[2023__ISSRE__EvLog - Identifying Anomalous Logs over Software Evolution]]
- HEAL: [[2023__POMACS__HEAL - Performance Troubleshooting Deep inside Data Center Hosts]]
- [[2023__ICDM__Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis]]
- RootCLAM: [[2023__CIKM__On Root Cause Localization and Anomaly Mitigation through Causal Inference]]
- DyAlert: [[2023__ASE__Dynamic Graph Neural Networks-based Alert Link Prediction for Online Service Systems]]
- [[2023__CASE__A Graph-Based Algorithm for Root Cause Analysis of Faults in Telecommunication Networks]]
- trACE: [[2023__Computación y Sistemas__trACE-Anomaly Correlation Engine for Tracing the Root Cause on a Cloud based Microservice Architecture]]
- Grace: [[2023__IWQoS__Grace - Interpretable Root Cause Analysis by Graph Convolutional Network for Microservices]]
- ESRO: [[2023__ASE__ESRO - Experience Assisted Service Reliability against Outages]]
- PACE: [[2023__arXiv__PACE-LM - Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis]]
- FTM-RCA: [[2023__IWQoS__FTM-RCA - A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies]]
- Murphy: [[2023__SIGCOMM__Murphy - Performance Diagnosis of Distributed Cloud Applications]]
- Nezha: [[2023__ESEC-FSE__Nezha - Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data]]
- [[2023__KDD__Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback]]
- ID-GCN: [[2023__ICCAI__Microservice Anomaly Diagnosis with Graph Convolution Network Based on Implicit Microservice Dependency]]
- TLS-WGAN-GP: [[2023__TCE__TLS-WGAN-GP - A Generative Adversarial Network Model for Data-Driven Fault Root Cause Location]]
- RTAnomaly: [[2023__arXiv__Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection]]
- MARS: [[2023__ICPP__MARS - Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry]]
- LogKG: [[2023__IEEE Transactions on Services Computing__LogKG - Log Failure Diagnosis through Knowledge Graph]]
- DTFL: [[2023__TCC__DTFL - A Digital Twin-assisted Graph Neural Network Approach for Service Function Chains Failure Localization]]
- CONAN: [[2023__ICSE-SEIP__CONAN - Diagnosing Batch Failures for Cloud Systems]]
- Aegis: [[2023__ICSE-SEIP__Aegis - Attribution of Control Plane Change Impact across Layers and Components for Cloud Systems]]
- yRCA: [[2023__Science of Computer Programming__yRCA - An explainable failure root cause analyser]]
- MetricMiner: [[2023__NOMS__Multi-stage Location for Root-Cause Metrics in Online Service Systems]]
- PyRCA: [[2023__arXiv__PyRCA - A Library for Metric-based Root Cause Analysis]]
- ImpactTracer: [[2023__DATE__ImpactTracer - Root Cause Localization in Microservices Based on Fault Propagation Modeling]]
- LogRule: [[2023__TNSM__LogRule - Efficient Structured Log Mining for Root Cause Analysis]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- CORAL: [[2023__arXiv__Incremental Causal Graph Learning for Online Unsupervised Root Cause Analysis]]
- [[2023__KDD__Incremental Causal Graph Learning for Online Root Cause Analysis]]
- TADL: [[2023__SANER__TADL - Fault Localization with Transformer-based Anomaly Detection for Dynamic Microservice Systems]]
- Oasis: [[2023__arXiv__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- ProphetKdeRCL: [[2023__TNSM__Root Cause Location Based on Prophet and Kernel Density Estimation]]
- MicroState: [[2023__IEICE TRANSACTIONS __MicroState - An Anomaly Localization Method in Heterogeneous Microservice Systems]]
- [[2023__arXiv__Causal fault localisation in dataflow systems]]
- CausIL: [[2023__WWW__CausIL - Causal Graph for Instance Level Microservice Data]]
- DiagFusion: [[2023__TSC__Robust Failure Diagnosis of Microservice System through Multimodal Data]]
- CMDiagnostor: [[2023__WWW__CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data]]
- Eadro: [[2023__ICSE__Eadro - An End-to-End Troubleshooting Framework for Microservices on Multi-source Data]]
- REASON: [[2023__arXiv__Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization]]
- [[2023__KDD__Interdependent Causal Networks for Root Cause Localization]]
- DyCause: [[2023__TDSC__DyCause - Crowdsourcing to Diagnose Microservice Kernel Failure]]

### 2022
- [[2022__arXiv__A Causal Approach to Detecting Multivariate Time-series Anomalies and Root Causes]]
- RCD: [[2022__NeurIPS__Root Cause Analysis of Failures in Microservices through Causal Discovery]]
- [[2022__ICML__Causal structure-based root cause analysis of outliers]]
- GIED: [[2022__ASE__Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems]]
- MicroSketch: [[2022__ICSOC__MicroSketch - Lightweight and Adaptive Sketch Based Performance Issue Detection and Localization in Microservice Systems]]
- AFETM: [[2022__arXiv__AFETM - Adaptive Function Execution Trace Monitoring for Fault Diagnosis]]
- [[2022__CLOUD__Localizing and Explaining Faults in Microservices Using Distributed Tracing]]
- FRL-MFPG: [[2022__Information and Software Technology__FRL-MFPG - Propagation-aware fault root cause location for microservice intelligent operation and maintenance]]
- TS-InvarNet: [[2022__ICWS__TS-InvarNet - Anomaly Detection and Localization based on Tempo-spatial KPI Invariants in Distributed Services]]
- MicroLens: [[2022__CLOUD__MicroLens - A Performance Analysis Framework for Microservices Using Hidden Metrics With BPF]]
- CausalRCA: [[2022__arXiv__CausalRCA - Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications]]
- FIRED: [[2022__arXiv__FIRED - a fine-grained robust performance diagnosis framework for cloud applications]]
    - Journal: [[2024__FGCS__A fine-grained robust performance diagnosis framework for run-time cloud applications]]
- MicroCBR: [[2022__ICCBR__MicroCBR - Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting]]
- PerfCE: [[2022__arXiv__PerfCE - Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis]]
- DéjàVu: [[2022__ESEC-FSE__Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems]]
- PSqueeze: [[2022__SSRN__Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems]]
- CIRCA: [[2022__KDD__Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition]]
- CMMD: [[2022__KDD__CMMD - Cross-Metric Multi-Dimensional Root Cause Analysis]]
- CauseRank: [[2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems]]
- [[2022__arXiv__Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps]]
- RobustSpot: [[2022__ITSC__Robust Anomaly Localization of Multi-dimensional Derived Measure for Online Video]]
- MicroHECL: [[2022__ICSE-SEIP__MicroHECL - High-Efficient Root Cause Localization in Large-Scale Microservice Systems]]

### 2021
- TraceModel: [[2021__MSN__TraceModel - An Automatic Anomaly Detection and Root Cause Localization Framework for Microservice Systems]]
- [[2021__ACSOS__Causal Inference Techniques for Microservice Performance Diagnosis - Evaluation and Guiding Recommendations]]
- DyCause: [[2021__SIGSOFT__Faster, deeper, easier - crowdsourcing diagnosis of microservice kernel failure from user space]]
- HALO: [[2021__KDD__HALO - Hierarchy-aware Fault Localization for Cloud Systems]]
- CloudRCA: [[2021__CIKM__CloudRCA - A Root Cause Analysis Framework for Cloud Computing Platforms]]
- Groot: [[2021__ASE__Groot - An event-graph-based approach for root cause analysis in industrial settings]]
- ModelCoder: [[2021__IWQOS__ModelCoder - A Fault Model based Automatic Root Cause Localization Framework for Microservice Systems]]
- [[2021__CLOUD__Detecting Causal Structure on Cloud Application Microservices Using Granger Causality Models]]
- Sage: [[2021__ASPLOS__Sage―Practical and Scalable ML-Driven Performance Debugging in Microservices]]
- Sage: [[2022__OSR__Enabling Practical Cloud Performance Debugging with Unsupervised Learning]]
- GRLIA: [[2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]]
- [[2021__ASSE__Network root fault location based on network topology and alarm]]
- PDiagnose: [[2021__ISPA__Diagnosing Performance Issues in Microserviceswith Heterogeneous Data Source]]
- PatternMatcher: [[2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]]
- TraceRank: [[2021__Journal-of-Software__TraceRank - Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems]]
- [[2021__ICSE__Fast Outage Analysis of Large-Scale Production Clouds with Service Correlation Mining]]
- MicroDiag: [[2021__ICSE__MicroDiag - Fine-grained Performance Diagnosis for Microservice Systems]]
- MicroRank: [[2021__WWW__MicroRank―End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments]]
- [[2021__CODS-COMAD__Evaluation of Causal Inference Techniques for AIOps]]

### 2020
- [[2020__TNSM__Workflow-Aware Automatic Fault Diagnosis for Microservice-Based Applications With Statistics]]
- [[2020__JSS__Graph-Based Root Cause Analysis for Service-Oriented and Microservice Architectures]]
- Apriori: [[2020__POMACS__Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment]]
- FluxInfer: [[2020__IPCCC__FluxInfer―Automatic Diagnosis of Performance Anomaly for Online Database System]]
- GMTA: [[2020__ESEC-FSE__Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis]]
- ISQUAD: [[2020__VLDB__Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases]]
- DeCaf: [[2020__ICSE-SEIP__DeCaf - Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services]]
- [[2020__ICSE-SEIP__Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment]]
- [[2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]]
- AutoMAP: [[2020__WWW__AutoMAP - Diagnose Your Microservice-based Web Application]]
- MicroRCA: [[2020__NOMS__MicroRCA - Root Cause Localization of Performance Issues in Microservices]]
- MicroCause: [[2020__IWQoS__Localizing Failure Root Causes in a Microservice through Causality Inference]]
- CSL: [[2020__ICSOC__Performance Diagnosis in Cloud Microservices Using Deep Learning]]
- [[2020__Applied Science__A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications]]

### 2019
- FluxRank: [[2019__ISSRE__FluxRank―A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation]]
- ε-Diagnosis: [[2019__WWW__ε-Diagnosis - Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms]]
- ExplainIt!: [[2019__SIGMOD__ExplainIt!– A Declarative Root-cause Analysis Engine for Time Series Data]]
- AirAlert: [[2019__WWW__Outage prediction and diagnosis for cloud service systems]]
- Grano: [[2019__VLDB__GRANO - Interactive Graph-based Root Cause Analysis for Cloud-Native Distributed Data Platfor]]
- Squeeze: [[2019__ISSRE__Generic and Robust Localization of Multi-Dimensional Root Causes]]

### 2018
- [Weng+, TON2018]: [[2018__TON__Root Cause Analysis of Anomalies of Multitier Services in Public Clouds]]
- LOUD: [[2018__ICST__Localizing Faults in Cloud Systems]]
- Microscope: [[2018__ICSOC__Microscope―Pinpoint Performance Issues with Causal Graphs in Micro-service Environments]]
- MS-Rank: [[2018__MS-Rank Multi-Metric and Self-Adaptive Root Cause]]
- CloudRanger: [[2018__CCGRID__CloudRanger―Root Cause Identification for Cloud Native Systems]]
- FacGraph: [[2018__IPCCC__FacGraph - Frequent Anomaly Correlation Graph Mining for Root Cause Diagnose in Micro-Service Architecture]]

### 2017 and before
- Roots: [[2017__WWW__Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications]]
- LogCluster: [[2016__ICSE-C__Log Clustering based Problem Identification for Online Service Systems]]
- DBSherlock: [[2016__SIGMOD__DBSherlock―A Performance Diagnostic Tool for Transactional Databases]]
- PerfCompass: [[2015__TPDS__PerfCompass - Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds]]
- [[2014__KDD__Correlating Events with Time Series for Incident Diagnosis]]
- CauseInfer: [[2014__INFOCOM__CauseInfer―Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems]]
- [[2016__TSC__CauseInfer―Automated End-to-End Performance Diagnosis with Hierarchical Causality Graph in Cloud Environment]]
- MonitorRank: [[2013__PER__Root Cause Detection in a Service-Oriented Architecture]]
- FChain: [[2013__ICDCS__FChain - Toward Black-box Online Fault Localization for Cloud Systems]]
- CloudDiag: [[2013__TPDS__Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems]]
- TBAC: [[2009__CSMR__Automatic Failure Diagnosis Support in Distributed Large-Scale Software Systems based on Timing Behavior Anomaly Correlation]]
- NetMedic: [[2009__SIGCOMM__Detailed Diagnosis in Enterprise Networks]]
- Pinpoint: [[2002__DSN__Pinpoint - Problem Determination in Large, Dynamic Internet Services]]
- TAN: [[2004__OSDI__Correlating Instrumentation Data to System States - A Building Block for Automated Diagnosis and Control]]

### Network
- [[2022__ICASSP__Accurate Inference of Unseen Combinations of Multiple Rootcauses with Classifier Ensemble]]
- NetRCA: [[2022__ ICASSP__NetRCA―An Effective Network Fault Cause Localization Algorithm]]