Map of Contents for [[Root Cause Analysis]].
## Tools
- [[PyWhy]]
## Papers Map
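- [[人間によるフィードバックを考慮するAIOps論文|AIOps papers that incorporate human feedback]]
- [[シミュレーション評価を含むAIOps Failure Management研究論文|AIOps Failure Management research papers that include simulation-based evaluation]]
- [[データベースシステムのAIOps論文|AIOps papers on database systems]]
- [[ネットワークシステムのAIOps|AIOps for network systems]]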
- [dreamhomes/RCAPapers](https://github.com/dreamhomes/RCAPapers): Papers about Root Cause Analysis in MicroService Systems. Paper notes: https://dreamhomes.top/
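- [[コールグラフを使用するAIOps Fault Localization論文|AIOps Fault Localization papers that use call graphs]]
- [[クラスタリングを用いるAIOps Failure Managementの論文|AIOps Failure Management papers that use clustering]]
- [[多次元データに対する根本原因分析|Root cause analysis for multidimensional data]]
- [[オフライン解析を含むAIOps Fault Localization論文|AIOps Fault Localization papers that include offline analysis]]
- [[Fault Localization Overviewリスト|Fault Localization Overview list]]
- [[パブリッククラウドベンダー向けの論文|Papers for public cloud vendors]]
- [[End-to-end Anomaly Detectrion and Fault Localization AIOps論文|End-to-end Anomaly Detection and Fault Localization AIOps papers]]
- [[マルチモーダルAIOps論文|Multimodal AIOps papers]]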
## Survey
- [[2025__CSUR__Trustworthy AI-based Performance Diagnosis Systems for Cloud Applications - A Review]]
- [[2024__misc__Root Cause Analysis for Distributed Systems]]
- [[2024__arXiv__A Comprehensive Survey on Root Cause Analysis in (Micro) Services - Methodologies, Challenges, and Trends]]
- [[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__Dissertation__Towards Effective Performance Diagnosis for Distributed Applications]]
- [[2023__ESEC-FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System - An Industrial Experience Report]]
- [[2023__arXiv__Case Studies of Causal Discovery from IT Monitoring Time Series]]
- [[2023__EIEDP__Causality between violations and failure factors in cloud service and its analysis methods - a survey]]
- [[2017__arXiv__Survey on Models and Techniques for Root-Cause Analysis]]
## Papers
### 2025
- IDI: [[2025__ICLR__Robust Root Cause Diagnosis using In-Distribution Interventions]]
- PA-Rank: [[2025__IoTJournal__PA-Rank - A GAN and Reinforcement Learning Powered Framework for Multi-Metric Anomaly Detection and Causal Diagnosis]]
- LogDB: [[2025__arXiv__LogDB - Multivariate Log-based Failure Diagnosis for Distributed Databases (Extended from MultiLog)]]
- TAMO: [[2025__arXiv__TAMO - Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data]]
- ThinkFL: [[2024__arXiv__ThinkFL - Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning]]
- eARCO: [[2025__arXiv__eARCO - Efficient Automated Root Cause Analysis with Prompt Optimization]]
- OpDiag: [[2025__TKDE__OpDiag - Unveiling Database Performance Anomalies Through Query Operator Attribution]]
- SDN: [[2025__ICDE__Anomaly Diagnosis with Siamese Discrepancy Networks in Distributed Cloud Databases]]
- AutoDebugger: [[2025__AIDB__AutoDebugger - Efficient Root Cause Analysis for Anomaly Jobs]]
- GALA: [[2025__arXiv__Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis]]
- LogInsight: [[2025__TSC__Accurate and Interpretable Log-Based Fault Diagnosis using Large Language Models]]
- DBAIOps: [[2025__arXiv__DBAIOps - A Reasoning LLM Enhanced Database Operation and Maintenance System using Knowledge Graphs]]
- RCRank: [[2024__VLDB__RCRank - Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems]]
- [[2025__arXiv__Causal AI-based Root Cause Identification - Research to Practice at Scale]]
- BSODiag: [[2025__ICSE-SEIP__BSODiag - A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems]]
- COCA: [[2025__ICSE__COCA - Generative Root Cause Analysis for Distributed Systems with Code Knowledge]]
- L4: [[2025__ESEC-FSE__L4 - Diagnosing Large-scale LLM Training Failures via Automated Log Analysis]]
- [[2025__arXiv__RADICE - Causal Graph Based Root Cause Analysis for System Performance Diagnostic]]
- [[2025__AAAI__Causal Discovery for Cloud Microservice Architectures]]
- DiagMLP: [[2025__arXiv__Are GNNs Actually Effective for Multimodal Fault Diagnosis in Microservice Systems?]]
- SinkFlow: [[2025__EAAI__SinkFlow - Fast and traceable root-cause localization for multidimensional anomaly events]]
### 2024
- [[2024__TSEM__Making Fault Localization in Online Service Systems More Actionable and Interpretable]]
- [[2024__arXiv__Breaking the Cycle of Recurring Failures - Applying Generative AI to Root Cause Analysis in Legacy Banking Systems]]
- [[2024__ICCSN__A Root Cause Localization Method Based on Event Call Chains for Microservices]]
- yRCA: [[2024__SPE__Explaining Microservices’ Cascading Failures From Their Logs]]
- SinkFlow: [[2024__EAAI__SinkFlow - Fast and traceable root-cause localization for multidimensional anomaly events]]
- Zoom-inRCL: [[Zoom-inRCL - Fine-grained root cause localization for B5G-6G network slicing]]
- UniDiag: [[2024__TSC__No More Data Silos - Unified Microservice Failure Diagnosis with Temporal Knowledge Graph]]
- SLIM: [[2024__ASE__SLIM - a Scalable and Interpretable Light-weight Fault Localization Algorithm for Imbalanced Data in Microservice]]
- MRCA: [[2024__ASE__MRCA - Metric-level Root Cause Analysis for Microservices via Multi-Modal Data]]
- LasRCA: [[2024__ASE__The Potential of One-Shot Failure Root Cause Analysis - Collaboration of the Large Language Model and Small Classifier]]
- FaaSRCA: [[2024__ISSRE__FaaSRCA - Full Lifecycle Root Cause Analysis for Serverless Applications]]
- KPIRoot: [[2024__ISSRE__KPIRoot - Efficient Monitoring Metric-based Root Cause Localization in Large-scale Cloud Systems]]
- [[2024__ICWS__Anomaly Detection and Root Cause Analysis of Microservices Energy Consumption]]
- iTCRL: [[2024__TSE__iTCRL - Causal-Intervention-Based Trace Contrastive Representation Learning for Microservice Systems1]]
- G-Cause: [[2024__ICWS__G-Cause - Parameter-free Global Diagnosis for Hyperscale Web Service Infrastructures]]
- [[2024__arXiv__Root Cause Analysis of Outliers with Missing Structural Knowledge1]]
- OCEAN: [[2024__arXiv__Online Multi-modal Root Cause Analysis]]
- HolisticRCA: [[2024__TSC__Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data]]
- Vista: [[2024__SoCC__Vista - Machine learning based database performance troubleshooting framework in Amazon RDS]]
- LoFI: [[2024__ISSRE__Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis]]
- ST-RF: [[2024__NaNA__Multi-source KPIs’ root cause localization in online service systems]]
- MicroHFRCL: [[2024__IJCNN__MicroHFRCL - A History Faults Based Root Cause Localization Framework in Microservice Systems]]
- DeepHunt: [[2024__TOSEM__Interpretable Failure Localization for Microservice Systems Based on Graph Autoencoder]]
- Medicine: [[2024__ASE__Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization]]
- [[2024__ISSRE__SparseRCA - Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces]]
- [[2024__OSR__LLexus - an AI agent system for incident management]]
- [[2024__DSN__Fault Localization Using Interventional Causal Learning for Cloud-Native Applications]]
- ResilienceGuardian: [[2024__IWQoS__Guardian of the Resiliency - Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient]]
- [[2024__ASE__Root Cause Analysis for Microservices based on Causal Inference - How Far Are We?]]
- FaultInsight: [[2024__KDD__FaultInsight - Interpreting Hyperscale Data Center Host Faults]]
- [[2024__ICC__Efficient Learning Framework for Failure Identification Model Based on Failure Injection]]
- [[2024__CSCWD__Advancing Root Cause Analysis in Cloud-native System with Knowledge Graph Path Embedding Translation]]
- TVDiag: [[2024__arXiv__TVDiag - A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data]]
- LGRCL: [[2024__ICIC__Variational Autoencoder and Graph Attention Root Cause Localization Model Based on Log Data and Graph Structure]]
- STRCA: [[2024__ICIC__STRCA - A Lightweight and Accurate Root Cause Analysis System Based on 5G Signalling Trace]]
- [[2024__Thesis__Anomaly Detection of Microservices Runtime Performance]]
- [[2024__arXiv__Industrial-Grade Time-Dependent Counterfactual Root Cause Analysis through the Unanticipated Point of Incipient Failure - a Proof of Concept]]
- [[2024__CSCWD__Multi-fusion algorithm root cause location model based on causal failure dependency graph]]
- Cloud Atlas: [[2024__arXiv__Cloud Atlas - Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight]]
- PORCA: [[2024__arXiv__PORCA - Root Cause Analysis with Partially Observed Data]]
- Chain-of-Event: [[2024__FSE__Chain-of-Event - Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph]]
- LatentSpace: [[2024__KDD__Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space]]
- CHASE: [[2024__arXiv__CHASE - A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems]]
- HeMiRCA: [[2024__TOSEM__HeMiRCA - Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources]]
- MicroIRC: [[2024__JSS__MicroIRC - Instance-level Root Cause Localization for Microservice Systems]]
- MicroCERCL: [[2024__arXiv__Root Cause Localization for Microservice Systems in Cloud-edge Collaborative Environments]]
- [[2024__arXiv__Root Cause Analysis of Outliers with Missing Structural Knowledge]]
- [[2024__FGCS__Autonomous selection of the fault classification models for diagnosing microservice applications]]
- STRCA: [[2024__ICASSP__Semi-Supervised Metrics-Based Self-Training Root Cause Analysis for Cloud-Native Systems with Class-Imbalanced Data]]
- RCInvestigator: [[2024__arXiv__RCInvestigator - Towards Better Investigation of Anomaly Root Causes in Cloud Computing Systems]]
- LogRCA: [[2024__Euro-Par__LogRCA - Log-based Root Cause Analysis for Distributed Services]]
- GrayScope: [[2024__ESEC-FSE__Illuminating the Gray Zone - Non-intrusive Gray Failure Localization in Server Operating Systems]]
- SynthoDiag: [[2024__ESEC-FSE__Fault Diagnosis for Test Alarms in Microservices through Multi-source Data]]
- MicroDig: [[2024__TSC__Diagnosing Performance Issues for Large-Scale Microservice Systems with Heterogeneous Graph]]
- InstantOps: [[2024__ICPE__InstantOps - A Joint Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native Applications]]
- BARO: [[2024__ESEC-FSE__BARO - Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection]]
- [[2024__arXiv__Few-Shot Cross-System Anomaly Trace Classification for Microservice-based systems]]
- [[2024__ICPE__Context-aware Root Cause Localization in Distributed Traces Using Social Network Analysis]]
- KGroot: [[2024__ESWA__KGroot - Enhancing Root Cause Analysis through Knowledge Graphs and Graph Convolutional Neural Networks]]
- [[2024__ICSE-SEIP__Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations]]
- [[2024__ICSE__Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems]]
- [[2024__TCC__Root Cause Analysis for Cloud-Native Applications]]
- ExChain: [[2024__NSDI__ExChain - Exception Dependency Analysis for Root Cause Diagnosis]]
- NetAssistant: [[2024__NSDI__NetAssistant - Dialogue Based Network Diagnosis in Data Center Networks]]
- GAMMA: [[2024__WWW__GAMMA - Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications]]
- ChangeRCA: [[2024__FSE__ChangeRCA - Finding Root Causes from Software Changes in Large Online Systems]]
- [[2024__SPE__Detection of microservice‐based software anomalies based on OpenTracing in cloud]]
- [[2024__arXiv__Dependency Aware Incident Linking in Large Cloud Systems]]
- [[2024__ICCE__Exploration of Fault Identification and Automatic Recovery in Cloud-based FPGA Systems]]
- AlertRCA: [[2024__CCGrid__Causality Enhanced Graph Representation Learning for Alert-Based Root Cause Analysis]]
- FIRED: [[2024__FGCS__A fine-grained robust performance diagnosis framework for run-time cloud applications]]
- Panda: [[2024__CIDR__Panda - Performance Debugging for Databases using LLM Agents]]
- T-RCA: [[2024__arXiv__On the Fly Detection of Root Causes from Observed Data with Application to IT Systems]]
- [[2024__Transactions on Reliability__Multilayered Fault Detection and Localization With Transformer for Microservice Systems]]
- [[2024__AAAI__Root Cause Analysis In Microservice Using Neural Granger Causal Discovery]]
- RCACopilot: [[2024__EuroSys__Automatic Root Cause Analysis via Large Language Models for Cloud Incidents]]
- [[2024__JTPES__Intelligent Fault Analysis with AIOps Technology]]
- [[2024__arXiv__Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4]]
- ASFC: [[2024__Future Generation Computer Systems__Autonomous selection of the fault classification models for diagnosing microservice applications]]
- EffCause: [[2024__KDD__EffCause - Discover Dynamic Causal Relationships Efficiently from Time-Series]]
### 2023
- PatternRCA: [[2023__ICDM__PatternRCA - A Pattern-Aware Root Cause Analysis Framework for Multi-Dimensional Time Series]]
- BALANCE: [[2023__SIGMOD__BALANCE - Bayesian Linear Attribution for Root Cause Localization]]
- DARC: [[2023__BigData__DARC - High-dimensional Diffusing Anomaly Detection and Root Cause Location in Cloud Computing Systems]]
- EasyRCA: [[2023__AISTATS__Root Cause Identification for Collective Anomalies in Time Series given an Acyclic Summary Causal Graph with Loops]]
- HFDG: [[2023__TCE__Heterogeneous Data-Driven Failure Diagnosis for Microservice-Based Industrial Clouds Towards Consumer Digital Ecosystems]]
- [[2023__Applied Sciences__A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification]]
- GTrace: [[2023__ESEC-FSE__From Point-wise to Group-wise - A Fast and Accurate Microservice Trace Anomaly Detection Approach]]
- TraceDiag: [[2023__ESEC-FSE__TraceDiag - Adaptive, Interpretable, and Efficient Root Cause Analysis on Large-Scale Microservice Systems]]
- DiagConfig: [[2023__ESEC-FSE__DiagConfig - Configuration Diagnosis of Performance Violations in Configurable Software Systems]]
- Raccoon: [[2023__ISSRE__Identifying Root-Cause Changes for User-Reported Incidents in Online Service Systems]]
- TraceStream: [[2023__ISSRE__TraceStream - Anomalous Service Localization based on Trace Stream Clustering with Online Feedback]]
- ServerRCA: [[2023__ISSRE__ServerRCA - Root Cause Analysis for Server Failure using Operating System Logs]]
- HRCA: [[2023__ISSRE__HRCA - A Heterogeneous Graph-based Adaptive Root Cause Analysis Framework]]
- EvLog: [[2023__ISSRE__EvLog - Identifying Anomalous Logs over Software Evolution]]
- HEAL: [[2023__POMACS__HEAL - Performance Troubleshooting Deep inside Data Center Hosts]]
- [[2023__ICDM__Progressing from Anomaly Detection to Automated Log Labeling and Pioneering Root Cause Analysis]]
- RootCLAM: [[2023__CIKM__On Root Cause Localization and Anomaly Mitigation through Causal Inference]]
- DyAlert: [[2023__ASE__Dynamic Graph Neural Networks-based Alert Link Prediction for Online Service Systems]]
- [[2023__CASE__A Graph-Based Algorithm for Root Cause Analysis of Faults in Telecommunication Networks]]
- trACE: [[2023__Computación y Sistemas__trACE-Anomaly Correlation Engine for Tracing the Root Cause on a Cloud based Microservice Architecture]]
- Grace: [[2023__IWQoS__Grace - Interpretable Root Cause Analysis by Graph Convolutional Network for Microservices]]
- ESRO: [[2023__ASE__ESRO - Experience Assisted Service Reliability against Outages]]
- PACE: [[2023__arXiv__PACE-LM - Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis]]
- FTM-RCA: [[2023__IWQoS__FTM-RCA - A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies]]
- Murphy: [[2023__SIGCOMM__Murphy - Performance Diagnosis of Distributed Cloud Applications]]
- Nezha: [[2023__ESEC-FSE__Nezha - Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data]]
- [[2023__KDD__Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback]]
- ID-GCN: [[2023__ICCAI__Microservice Anomaly Diagnosis with Graph Convolution Network Based on Implicit Microservice Dependency]]
- TLS-WGAN-GP: [[2023__TCE__TLS-WGAN-GP - A Generative Adversarial Network Model for Data-Driven Fault Root Cause Location]]
- RTAnomaly: [[2023__arXiv__Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection]]
- MARS: [[2023__ICPP__MARS - Fault Localization in Programmable Networking Systems with Low-cost In-Band Network Telemetry]]
- LogKG: [[2023__IEEE Transactions on Services Computing__LogKG - Log Failure Diagnosis through Knowledge Graph]]
- DTFL: [[2023__TCC__DTFL - A Digital Twin-assisted Graph Neural Network Approach for Service Function Chains Failure Localization]]
- CONAN: [[2023__ICSE-SEIP__CONAN - Diagnosing Batch Failures for Cloud Systems]]
- Aegis: [[2023__ICSE-SEIP__Aegis - Attribution of Control Plane Change Impact across Layers and Components for Cloud Systems]]
- yRCA: [[2023__Science of Computer Programming__yRCA - An explainable failure root cause analyser]]
- MetricMiner: [[2023__NOMS__Multi-stage Location for Root-Cause Metrics in Online Service Systems]]
- PyRCA: [[2023__arXiv__PyRCA - A Library for Metric-based Root Cause Analysis]]
- ImpactTracer: [[2023__DATE__ImpactTracer - Root Cause Localization in Microservices Based on Fault Propagation Modeling]]
- LogRule: [[2023__TNSM__LogRule - Efficient Structured Log Mining for Root Cause Analysis]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- CORAL: [[2023__arXiv__Incremental Causal Graph Learning for Online Unsupervised Root Cause Analysis]]
- [[2023__KDD__Incremental Causal Graph Learning for Online Root Cause Analysis]]
- TADL: [[2023__SANER__TADL - Fault Localization with Transformer-based Anomaly Detection for Dynamic Microservice Systems]]
- Oasis: [[2023__arXiv__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- ProphetKdeRCL: [[2023__TNSM__Root Cause Location Based on Prophet and Kernel Density Estimation]]
- MicroState: [[2023__IEICE TRANSACTIONS __MicroState - An Anomaly Localization Method in Heterogeneous Microservice Systems]]
- [[2023__arXiv__Causal fault localisation in dataflow systems]]
- CausIL: [[2023__WWW__CausIL - Causal Graph for Instance Level Microservice Data]]
- DiagFusion: [[2023__TSC__Robust Failure Diagnosis of Microservice System through Multimodal Data]]
- CMDiagnostor: [[2023__WWW__CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data]]
- Eadro: [[2023__ICSE__Eadro - An End-to-End Troubleshooting Framework for Microservices on Multi-source Data]]
- REASON: [[2023__arXiv__Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization]]
- [[2023__KDD__Interdependent Causal Networks for Root Cause Localization]]
- DyCause: [[2023__TDSC__DyCause - Crowdsourcing to Diagnose Microservice Kernel Failure]]
### 2022
- [[2022__arXiv__A Causal Approach to Detecting Multivariate Time-series Anomalies and Root Causes]]
- RCD: [[2022__NeurIPS__Root Cause Analysis of Failures in Microservices through Causal Discovery]]
- [[2022__ICML__Causal structure-based root cause analysis of outliers]]
- GIED: [[2022__ASE__Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems]]
- MicroSketch: [[2022__ICSOC__MicroSketch - Lightweight and Adaptive Sketch Based Performance Issue Detection and Localization in Microservice Systems]]
- AFETM: [[2022__arXiv__AFETM - Adaptive Function Execution Trace Monitoring for Fault Diagnosis]]
- [[2022__CLOUD__Localizing and Explaining Faults in Microservices Using Distributed Tracing]]
- FRL-MFPG: [[2022__Information and Software Technology__FRL-MFPG - Propagation-aware fault root cause location for microservice intelligent operation and maintenance]]
- TS-InvarNet: [[2022__ICWS__TS-InvarNet - Anomaly Detection and Localization based on Tempo-spatial KPI Invariants in Distributed Services]]
- MicroLens: [[2022__CLOUD__MicroLens - A Performance Analysis Framework for Microservices Using Hidden Metrics With BPF]]
- CausalRCA: [[2022__arXiv__CausalRCA - Causal Inference based Precise Fine-grained Root Cause Localization for Microservice Applications]]
- FIRED: [[2022__arXiv__FIRED - a fine-grained robust performance diagnosis framework for cloud applications]]
    - Journal version: [[2024__FGCS__A fine-grained robust performance diagnosis framework for run-time cloud applications]]
- MicroCBR: [[2022__ICCBR__MicroCBR - Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting]]
- PERFCE: [[2022__arXiv__PerfCE - Performance Debugging on Databases with Chaos Engineering-Enhanced Causality Analysis]]
- DéjàVu: [[2022__ESEC-FSE__Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems]]
- PSqueeze: [[2022__SSRN__Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems]]
- CIRCA: [[2022__KDD__Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition]]
- CMMD: [[2022__KDD__CMMD - Cross-Metric Multi-Dimensional Root Cause Analysis]]
- CauseRank: [[2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems]]
- [[2022__arXiv__Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps]]
- RobustSpot: [[2022__ITSC__Robust Anomaly Localization of Multi-dimensional Derived Measure for Online Video]]
- MicroHECL: [[2022__ICSE-SEIP__MicroHECL - High-Efficient Root Cause Localization in Large-Scale Microservice Systems]]
### 2021
- TraceModel: [[2021__MSN__TraceModel - An Automatic Anomaly Detection and Root Cause Localization Framework for Microservice Systems]]
- [[2021__ACSOS__Causal Inference Techniques for Microservice Performance Diagnosis - Evaluation and Guiding Recommendations]]
- DyCause: [[2021__SIGSOFT__Faster, deeper, easier - crowdsourcing diagnosis of microservice kernel failure from user space]]
- HALO: [[2021__KDD__HALO - Hierarchy-aware Fault Localization for Cloud Systems]]
- CloudRCA: [[2021__CIKM__CloudRCA - A Root Cause Analysis Framework for Cloud Computing Platforms]]
- Groot: [[2021__ASE__Groot - An event-graph-based approach for root cause analysis in industrial settings]]
- ModelCoder: [[2021__IWQOS__ModelCoder - A Fault Model based Automatic Root Cause Localization Framework for Microservice Systems]]
- [[2021__CLOUD__Detecting Causal Structure on Cloud Application Microservices Using Granger Causality Models]]
- Sage: [[2021__ASPLOS__Sage―Practical and Scalable ML-Driven Performance Debugging in Microservices]]
- Sage: [[2022__OSR__Enabling Practical Cloud Performance Debugging with Unsupervised Learning]]
- GRLIA: [[2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]]
- [[2021__ASSE__Network root fault location based on network topology and alarm]]
- PDiagnose: [[2021__ISPA__Diagnosing Performance Issues in Microserviceswith Heterogeneous Data Source]]
- PatternMatcher: [[2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]]
- TraceRank: [[2021__Journal-of-Software__TraceRank - Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems]]
- [[2021__ICSE__Fast Outage Analysis of Large-Scale Production Clouds with Service Correlation Mining]]
- MicroDiag: [[2021__ICSE__MicroDiag - Fine-grained Performance Diagnosis for Microservice Systems]]
- MicroRank: [[2021__WWW__MicroRank―End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments]]
- [[2021__CODS-COMAD__Evaluation of Causal Inference Techniques for AIOps]]
### 2020
- [[2020__TNSM__Workflow-Aware Automatic Fault Diagnosis for Microservice-Based Applications With Statistics]]
- [[2020__JSS__Graph-Based Root Cause Analysis for Service-Oriented and Microservice Architectures]]
- Apriori: [[2020__POMACS__Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment]]
- FluxInfer: [[2020__IPCCC__FluxInfer―Automatic Diagnosis of Performance Anomaly for Online Database System]]
- GMTA: [[2020__ESEC-FSE__Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis]]
- ISQUAD: [[2020__VLDB__Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases]]
- DeCaf: [[2020__ICSE-SEIP__DeCaf - Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services]]
- [[2020__ICSE-SEIP__Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment]]
- [[2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]]
- AutoMAP: [[2020__WWW__AutoMAP - Diagnose Your Microservice-based Web Application]]
- MicroRCA: [[2020__NOMS__MicroRCA - Root Cause Localization of Performance Issues in Microservices]]
- MicroCause: [[2020__IWQoS__Localizing Failure Root Causes in a Microservice through Causality Inference]]
- CSL: [[2020__ICSOC__Performance Diagnosis in Cloud Microservices Using Deep Learning]]
- [[2020__Applied Science__A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications]]
### 2019
- FluxRank: [[2019__ISSRE__FluxRank―A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation]]
- ε-Diagnosis: [[2019__WWW__ε-Diagnosis - Unsupervised and Real-time Diagnosis of Small-window Long-tail Latency in Large-scale Microservice Platforms]]
- ExplainIt!: [[2019__SIGMOD__ExplainIt!– A Declarative Root-cause Analysis Engine for Time Series Data]]
- AirAlert: [[2019__WWW__Outage prediction and diagnosis for cloud service systems]]
- Grano: [[2019__VLDB__GRANO - Interactive Graph-based Root Cause Analysis for Cloud-Native Distributed Data Platfor]]
- Squeeze: [[2019__ISSRE__Generic and Robust Localization of Multi-Dimensional Root Causes]]
### 2018
- [Weng+, TON2018]: [[2018__TON__Root Cause Analysis of Anomalies of Multitier Services in Public Clouds]]
- LOUD: [[2018__ICST__Localizing Faults in Cloud Systems]]
- Microscope: [[2018__ICSOC__Microscope―Pinpoint Performance Issues with Causal Graphs in Micro-service Environments]]
- MS-Rank: [[2018__MS-Rank Multi-Metric and Self-Adaptive Root Cause]]
- CloudRanger: [[2018__CCGRID__CloudRanger―Root Cause Identification for Cloud Native Systems]]
- FacGraph: [[2018__IPCCC__FacGraph - Frequent Anomaly Correlation Graph Mining for Root Cause Diagnose in Micro-Service Architecture]]
### 2017 and before
- Roots: [[2017__WWW__Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications]]
- LogCluster: [[2016__ICSE-C__Log Clustering based Problem Identification for Online Service Systems]]
- DBSherlock: [[2016__SIGMOD__DBSherlock―A Performance Diagnostic Tool for Transactional Databases]]
- PerfCompass: [[2015__TPDS__PerfCompass - Online Performance Anomaly Fault Localization and Inference in Infrastructure-as-a-Service Clouds]]
- [[2014__KDD__Correlating Events with Time Series for Incident Diagnosis]]
- CauseInfer: [[2014__INFOCOM__CauseInfer―Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems]]
- [[2016__TSC__CauseInfer―Automated End-to-End Performance Diagnosis with Hierarchical Causality Graph in Cloud Environment]]
- MonitorRank: [[2013__PER__Root Cause Detection in a Service-Oriented Architecture]]
- FChain: [[2013__ICDCS__FChain - Toward Black-box Online Fault Localization for Cloud Systems]]
- CloudDiag: [[2013__TPDS__Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems]]
- TBAC: [[2009__CSMR__Automatic Failure Diagnosis Support in Distributed Large-Scale Software Systems based on Timing Behavior Anomaly Correlation]]
- NetMedic: [[2009__SIGCOMM__Detailed Diagnosis in Enterprise Networks]]
- Pinpoint: [[2002__DSN__Pinpoint - Problem Determination in Large, Dynamic Internet Services]]
- TAN: [[2004__OSDI__Correlating Instrumentation Data to System States - A Building Block for Automated Diagnosis and Control]]
### Network
- [[2022__ICASSP__Accurate Inference of Unseen Combinations of Multiple Rootcauses with Classifier Ensemble]]
- NetRCA: [[2022__ ICASSP__NetRCA―An Effective Network Fault Cause Localization Algorithm]]