Map of Contents note for [[AIOps]].
## General
- [[AIOps]]
- [[AIOps Platforms - Garther Blog|AIOps Platforms - Gartner Blog]]
- [[Reliability-Driven AIOps for Cloud Resilience - ICSE21 Keynote]]
- [[What Does AIOps Mean for SREs? It’s Complicated.]]
- [[What If the Promise of AIOps Was True? - SREcon21]]
- [[aiops-handbook]]
- [GitHub - OpsPAI/awesome-AIOps: A curated list of awesome academic researches and industrial materials about Artificial Intelligence for IT Operations (AIOps).](https://github.com/OpsPAI/awesome-AIOps)
- [[IntelligentDDS - awesome-papers]]
- [[Cloud Intelligence-AIOps across academia and industry]]
- [[awesome-failure-diagnosis]]
## Concept & Survey Papers
- [[2025__Computing__AI Techniques in the Microservices Life-Cycle]]
- [[2025__arXiv__A Survey on AgentOps - Categorization, Challenges, and Future Directions]]
- [[2025__MLSys__AIOpsLab - A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud]]
- [[2024__Dissertation__AI-based Proactive Failure Management in Large-scale Cloud Environments]]
- [[2024__arXiv__Building AI Agents for Autonomous Clouds - Challenges and Design Principles]]
- [[2024__arXiv__A Comprehensive Survey on Root Cause Analysis in (Micro) Services - Methodologies, Challenges, and Trends]]
- [[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__CSUR__A Survey on Failure Analysis and Fault Injection in AI Systems]]
- [[2025__CSUR__A Survey of AIOps for Failure Management in the Era of Large Language Models]]
- [[2024__arXiv__AIOps Solutions for Incident Management - Technical Guidelines and A Comprehensive Literature Review]]
- [[2023__HotNets__A Holistic View of AI-driven Network Incident Management]]
- [[2023__ICSE__Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection]]
- [[2023__CSUR__A Joint Study of the Challenges, Opportunities, and Roadmap of MLOps and AIOps - A Systematic Survey]]
- [[2023__ICDM__A Roadmap towards Intelligent Operations for Reliable Cloud Computing Systems]]
- [[2023__Dissertation__Deep Learning techniques for system optimization in Cloud architectures]]
- [[2023__ESEC-FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System - An Industrial Experience Report]]
- [[2023__ICECAA__Importance of AIOps for Turn Metrics and Log Data - A Survey]]
- [[2023__arXiv__A Survey of Time Series Anomaly Detection Methods in the AIOps Domain]]
- [[2023__arXiv__AI Techniques in the Microservices Life-Cycle - A Survey]]
- [[2023__arXiv__AI for IT Operations (AIOps) on Cloud Platforms - Reviews, Opportunities and Challenges]]
- [[2022__arXiv__Studying the Characteristics of AIOps Projects on GitHub]]
- [[2022__SIGKDD__Robust Time Series Analysis and Applications - An Industrial Perspective]]
- [[2022__Mob__A Survey On Log Research Of AIOps - Methods and Trends]]
- [[2022__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey]]
- [[2021__HPCC-DSS-SmartCity-DependSys__A Survey on Failure Prediction in Large-scale Computing Systems]]
- [[2021__CSUR__A Survey on Automated Log Analysis for Reliability Engineering]]
- [[2021__ACCESS__IT infrastructure anomaly detection and failure handling - a systematic literature review focusing on datasets, log preprocessing, machine and deep learning approaches and automated tool]]
- [[2021__TOSEM__Towards a Consistent Interpretation of AIOps Models]]
- [[2020__ICSOC__A Systematic Mapping Study in AIOps]]
- [[2021__TIST__A Survey of AIOps Methods for Failure Management]]
- [[2019__ICSE-Companion__AIOps - Real-World Challenges and Research Innovations]]
- [[2017__CSUR__Data-Driven Techniques in Computing System Management]]
- [[2010__CSUR__A Survey of Online Failure Prediction Methods]]
## AIOps Products
- [[Zebrium]]
- [[BigPanda]]
- [[LLMベースのAIOpsプロダクト|LLM-based AIOps Products]]
## [[Failure Management]]
### General
- [[2024__arXiv__AI Assistants for Incident Lifecycle in a Microservice Environment - A Systematic Literature Review]]
- [[2024__ASE__ART - A Unified Unsupervised Framework for Incident Management in Microservice Systems]]
- [[2023__TSEM__On the Model Update Strategies for Supervised Learning in AIOps Solutions]]
- [[2023__ICSE__Knowledge-based Intelligent System for IT Incident DevOps]]
- [[2022__ESEC-FSE__Metadata-based Retrieval for Resolution Recommendation in AIOps]]
- [[2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]
### Pattern Recognition
- [[2022__arXiv__UniParser - A Unified Log Parser for Heterogeneous Log Data]]
- [[2014__OSDI__The Mystery Machine - End-to-end Performance Analysis of Large-scale Internet Services]]
### Failure Prevention
- [[2024__TNSM__Adaptive Feature Selection for Predicting Application Performance Degradation in Edge Cloud Environments]]
- [[2023__ICSE-Companion__Incident Prevention Through Reliable Changes Deployment]]
- [[2020__NSDI__Check before You Change - Preventing Correlated Failures in Service Updates]]
### Failure Prediction
- [[2024__CloudNet__A Multi-Stage Framework for Failure Prediction and Classification in Cloud Native Applications]]
- VMFT-LAD: [[2024__ACCESS__Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection]]
- SCALEDFP: [[2024__ICDM__Scaling Disk Failure Prediction via Multi-Source Stream Mining]]
- Uptake: [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- Early Bird: [[2024__ISSRE__Early Bird - Ensuring Reliability of Cloud Systems Through Early Failure Prediction]]
- TAAT: [[2024__KDD__Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction]]
- [[2024__ESEC-FSE__Predicting Failures of Autoscaling Distributed Applications]]
- [[2024__WWW__SOIL - Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction]]
- [[2024__ICPE__InstantOps - A Joint Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native Applications]]
- [[2024__arXiv__Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction]]
- [[2024__arXiv__McUDI - Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions]]
- [[2023__ESEC-FSE__Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365]]
- [[2023__arXiv__Outage-Watch - Early Prediction of Outages using Extreme Event Regularizer]]
- [[2023__DSN__HiMFP - Hierarchical Intelligent Memory Failure Prediction for Cloud Service Reliability]]
- [[2023__WWW__EDITS - An Easy-to-difficult Training Strategy for Cloud Failure Prediction]]
- [[2022__SIGKDD__Multi-task Hierarchical Classification for Disk Failure Prediction in Online Service Systems]]
- [[2010__CSUR__A Survey of Online Failure Prediction Methods]]
### Failure Detection
- [[AIOps - Failure Detection - MOC]]
### Fault Localization
- [[AIOps - Fault Localization - MOC]]
### Remediation
- DeployFix: [[2024__ASE__DeployFix - Dynamic Repair of Software Deployment Failures via Constraint Solving]]
- Deoxys: [[2024__SoCC__Deoxys - A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure]]
- [[2023__CASCON__ADARMA - Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models]]
- [[2023__CIKM__On Root Cause Localization and Anomaly Mitigation through Causal Inference]]
- [[2022__SIGKDD__NENYA - Cascade Reinforcement Learning for Cost-Aware Failure Mitigation at Microsoft 365]]
- [[2021__AAAI__Carbon to Diamond - An Incident Remediation Assistant System From Site Reliability Engineers' Conversations in Hybrid Cloud Operations]]
- [[2020__NOMS__MicroRCA - Root Cause Localization of Performance Issues in Microservices]]
### Log Analysis
- [[AIOps - Log Analysis - MOC]]
### Others
- [[2024__arXiv__Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models]]
- DashChef: [[2024__ICECCS__DashChef - A Metric Recommendation Service for Online Systems Using Graph Learning]]
- SALO: [[2024__CLOUD__Self Adjusting Log Observability for Cloud Native Applications]]
- ParaSeer: [[2024__arXiv__Predicting Parameter Change's Effect on Cellular Network Time Series]]
- ResilienceGuardian: [[2024__IWQoS__Guardian of the Resiliency - Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient]]
- LabelEase: [[2024__ISSRE__LabelEase - A Semi-Automatic Tool for Efficient and Accurate Trace Labeling in Microservices]]
- SORN: [[2024__KDD__Cluster-Wide Task Slowdown Detection in Cloud System]]
- Auto-PIP: [[2024__ISSRE__Auto-PIP - Real-time Identification of Critical Performance Inflection Points in Software Stress Testing]]
- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- FLARE: [[2023__Middleware__Fast, Light-weight, and Accurate Performance Evaluation using Representative Datacenter Behaviors]]
- Prism: [[2023__ASE__Prism - Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems]]
- PERT-GNN: [[2023__KDD__PERT-GNN - Latency Prediction for Microservice-based Cloud-Native Applications via Graph Neural Networks]]
- AMW: [[2023__arXiv__Real-time Workload Pattern Analysis for Large-scale Cloud Databases]]
- Oasis: [[2023__ESEC-FSE__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- [[2022__ESEC-FSE__An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction]]
- TraceCRL: [[2022__ESEC-FSE__TraceCRL - Contrastive Representation Learning for Microservice]]
- [[2021__ICWS__Sieve - Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems]]
## Resource Provisioning
### Resource Optimization / Scaling
- [[2022__TCC__Microscaler - Cost-Effective Scaling for Microservice Applications in the Cloud With an Online Learning Approach]]
- TUNA: [[2025__ECCS__TUNA - Tuning Unstable and Noisy Cloud Applications]]
- OPPerTune: [[2024__NSDI__OPPerTune - Post-Deployment Configuration Tuning of Services Made Easy]]
- [[2024__ICSSAS__Optimizing Cloud Infrastructure Management Using Large Language Models - A DevOps Perspective]]
- MLETune: [[2024__ICPADS__MLETune - Streamlining Database Knob Tuning via Multi-LLMs Experts Guided Deep Reinforcement Learning]]
- DeepCAT+: [[2024__TPDS__DeepCAT+ - A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks]]
- HARMONY: [[2024__arXiv__Adaptive Two-Stage Cloud Resource Scaling via Hierarchical Multi-Indicator Forecasting and Bayesian Decision-Making]]
- [[2024__arXiv__Automatic Configuration Tuning on Cloud Database - A Survey]]
- [[2024__CCGrid__SLO-Power - SLO and Power-aware Elastic Scaling for Web Services]]
- [[2024__EuroSys__Erlang - Application-Aware Autoscaling for Cloud Microservices]]
- [[2024__VLDB__DB-BERT - making database tuning tools “read” the manual]]
- [[2024__arXiv__Analytically-Driven Resource Management for Cloud-Native Microservices]]
- [[2023__NeurIPS__On the Promise and Challenges of Foundation Models for Learning-based Cloud Systems Management]]
- [[2023__SoCC__μConAdapter - Reinforcement Learning-based Fast Concurrency Adaptation for Microservices in Cloud]]
- [[2023__TPDS__Optimizing IO Performance through Effective vCPU Scheduling Interference Management]]
- [[2023__arXiv__Gecko - Automated Feature Degradation for Cloud Resilience]]
- [[2020__EuroSys__Autopilot―workload autoscaling at Google]]
### Workload Estimation
- Osprey: [[2024__ESEC-FSE__OS Pre-trained Transformer - Predicting Query Latencies across Changing System Contexts]]
- [[2024__arXiv__Risk-aware Adaptive Virtual CPU Oversubscription in Microsoft Cloud via Prototypical Human-in-the-loop Imitation Learning]]
### Resource Clustering
- EFection: [[2024__IJCIS__EFection - Effectiveness Detection Technique for Clustering Cloud Workload Traces]]
- Prism: [[2023__ASE__Prism - Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems]]
- CloudCluster: [[2022__NSDI__CloudCluster - Unearthing the Functional Structure of a Cloud Service]]
- OmniCluster: [[2022__WWW__Robust System Instance Clustering for Large-Scale Web Services]]
## Data Sampling
- [[2023__ESEC-FSE__STEAM - Observability-Preserving Trace Sampling]]
- [[2021__ICWS__Sieve - Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems]]
- [[2019__SoCC__Sifter - Scalable Sampling for Distributed Traces, without Feature Engineering]]
## Visualization
- [[2023__TVCG__QEVIS - Multi-grained Visualization of Distributed Query Execution]]
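## [[AIOpsのRCA研究の評価実験のデータセット作成手法調査 2021年|Survey of Dataset Creation Methods for Evaluation Experiments in AIOps RCA Research (2021)]]
## [[AIOps用データセット|Datasets for AIOps]]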
## [[LLM4SRE]]
## Case Studies
- [[Spike Detection in Alert Correlation at LinkedIn - SREcon21]]
- [[LINE Pay監視システムの構築とMLを用いた異常ログ検知|Building the LINE Pay Monitoring System and Detecting Anomalous Logs with ML]]
- [[Anomaly Detection on Golden Signals - SREcon19 Asia-Pasific|Anomaly Detection on Golden Signals - SREcon19 Asia-Pacific]]
- [[Automatic Metric Screening for Service Diagnosis - SREcon18 Americas]]
- [[Anomaly Detection in Infrequently Occurred Patterns - SREcon17 Americas]]
- [[Smart Monitor System For Automatic Anomaly Detection @Baidu - SREcon15]]
- [[Next Generation of DevOps - AIOps in Practice @Baidu - SREcon17 Asia]]
## Research Groups
- [[AIOps研究グループ|AIOps Research Groups]]
## Competitions
- [[International AIOps Challenge]]
- [[ICASSP-SPGC 2022 AIOps Challenge in Communication Networks]]
## Books
- [[Intelligent Network Management and Operation Systems]]
## Conferences
- [[Workshop on ML for Systems at NeurIPS]]
## Others
- [[2023__arXiv__Assessing the Maturity of Model Maintenance Techniques for AIOps Solutions]]
- [[2023__CASCON__Meta-learning Generalized AIOps Models for Multi-cloud Computer using Digital Twins]]
- [[2023__ECNCT__Design and Implement of AIOps System Based on Knowledge Graph]]
- [[2023__KDD__Contextual Self-attentive Temporal Point Process for Physical Decommissioning Prediction of Cloud Assets]]
- [[Observability and the Misleading Promise of AIOps]]
- [[LogPAI]]
## Other Domains
### Software Engineering
- [[2024__arXiv__A Systematic Literature Review on Explainability for Machine-Deep Learning-based Software Engineering Research]]
- [[2023__arXiv__Are They All Good? Studying Practitioners’ Expectations on the Readability of Log Messages]]
### Robotics
- [[2023__ICSME__An Empirical Study on Fault Diagnosis in Robotic Systems]]
### Road Traffic
- [[2012__ICDM__Inferring the Root Cause in Road Traffic Anomalies]]
## Related MOC
- [[AI for DB - MOC]]
- [[SRE - MOC]]