Map of Contents note for [[AIOps]].

## General
- [[AIOps]]
- [[AIOps Platforms - Garther Blog]]
- [[Reliability-Driven AIOps for Cloud Resilience - ICSE21 Keynote]]
- [[What Does AIOps Mean for SREs? It’s Complicated.]]
- [[What If the Promise of AIOps Was True? - SREcon21]]
- [[aiops-handbook]]
- [GitHub - OpsPAI/awesome-AIOps: A curated list of awesome academic researches and industrial materials about Artificial Intelligence for IT Operations (AIOps).](https://github.com/OpsPAI/awesome-AIOps)
- [[IntelligentDDS - awesome-papers]]
- [[Cloud Intelligence-AIOps across academia and industry]]
- [[awesome-failure-diagnosis]]

## Concept & Survey Papers
- [[2025__Computing__AI Techniques in the Microservices Life-Cycle]]
- [[2025__arXiv__A Survey on AgentOps - Categorization, Challenges, and Future Directions]]
- [[2025__MLSys__AIOpsLab - A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud]]
- [[2024__Dissertation__AI-based Proactive Failure Management in Large-scale Cloud Environments]]
- [[2024__arXiv__Building AI Agents for Autonomous Clouds - Challenges and Design Principles]]
- [[2024__arXiv__A Comprehensive Survey on Root Cause Analysis in (Micro) Services - Methodologies, Challenges, and Trends]]
- [[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__CSUR__A Survey on Failure Analysis and Fault Injection in AI Systems]]
- [[2025__CSUR__A Survey of AIOps for Failure Management in the Era of Large Language Models]]
- [[2024__arXiv__AIOps Solutions for Incident Management - Technical Guidelines and A Comprehensive Literature Review]]
- [[2023__HotNets__A Holistic View of AI-driven Network Incident Management]]
- [[2023__ICSE__Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection]]
- [[2023__CSUR__A Joint Study of the Challenges, Opportunities, and Roadmap of MLOps and AIOps - A Systematic Survey]]
- [[2023__ICDM__A Roadmap towards Intelligent Operations for Reliable Cloud Computing Systems]]
- [[2023__Dissertation__Deep Learning techniques for system optimization in Cloud architectures]]
- [[2023__ESEC-FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System - An Industrial Experience Report]]
- [[2023__ICECAA__Importance of AIOps for Turn Metrics and Log Data - A Survey]]
- [[2023__arXiv__A Survey of Time Series Anomaly Detection Methods in the AIOps Domain]]
- [[2023__arXiv__AI Techniques in the Microservices Life-Cycle - A Survey]]
- [[2023__arXiv__AI for IT Operations (AIOps) on Cloud Platforms - Reviews, Opportunities and Challenges]]
- [[2022__arXiv__Studying the Characteristics of AIOps Projects on GitHub]]
- [[2022__SIGKDD__Robust Time Series Analysis and Applications - An Industrial Perspective]]
- [[2022__Mob__A Survey On Log Research Of AIOps - Methods and Trends]]
- [[2022__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey]]
- [[2021__HPCC-DSS-SmartCity-DependSys__A Survey on Failure Prediction in Large-scale Computing Systems]]
- [[2021__CSUR__A Survey on Automated Log Analysis for Reliability Engineering]]
- [[2021__ACCESS__IT infrastructure anomaly detection and failure handling - a systematic literature review focusing on datasets, log preprocessing, machine and deep learning approaches and automated tool]]
- [[2021__TOSEM__Towards a Consistent Interpretation of AIOps Models]]
- [[2020__ICSOC__A Systematic Mapping Study in AIOps]]
- [[2021__TIST__A Survey of AIOps Methods for Failure Management]]
- [[2019__ICSE-Companion__AIOps - Real-World Challenges and Research Innovations]]
- [[2017__CSUR__Data-Driven Techniques in Computing System Management]]
- [[2010__CSUR__A Survey of Online Failure Prediction Methods]]

## AIOps Products
- [[Zebrium]]
- [[BigPanda]]
- [[LLMベースのAIOpsプロダクト]]

## [[Failure Management]]

### General
- [[2024__arXiv__AI Assistants for Incident Lifecycle in a Microservice Environment - A Systematic Literature Review]]
- [[2024__ASE__ART - A Unified Unsupervised Framework for Incident Management in Microservice Systems]]
- [[2023__TSEM__On the Model Update Strategies for Supervised Learning in AIOps Solutions]]
- [[2023__ICSE__Knowledge-based Intelligent System for IT Incident DevOps]]
- [[2022__ESEC-FSE__Metadata-based Retrieval for Resolution Recommendation in AIOps]]
- [[2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]

### Pattern Recognition
- [[2022__arXiv__UniParser - A Unified Log Parser for Heterogeneous Log Data]]
- [[2014__OSDI__The Mystery Machine - End-to-end Performance Analysis of Large-scale Internet Services]]

### Failure Prevention
- [[2024__TNSM__Adaptive Feature Selection for Predicting Application Performance Degradation in Edge Cloud Environments]]
- [[2023__ICSE-Companion__Incident Prevention Through Reliable Changes Deployment]]
- [[2020__NSDI__Check before You Change - Preventing Correlated Failures in Service Updates]]

### Failure Prediction
- [[2024__CloudNet__A Multi-Stage Framework for Failure Prediction and Classification in Cloud Native Applications]]
- VMFT-LAD: [[2024__ACCESS__Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection]]
- SCALEDFP: [[2024__ICDM__Scaling Disk Failure Prediction via Multi-Source Stream Mining]]
- Uptake: [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- Early Bird: [[2024__ISSRE__Early Bird - Ensuring Reliability of Cloud Systems Through Early Failure Prediction]]
- TAAT: [[2024__KDD__Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction]]
- [[2024__ESEC-FSE__Predicting Failures of Autoscaling Distributed Applications]]
- [[2024__WWW__SOIL - Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction]]
- [[2024__ICPE__InstantOps - A Joint Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native Applications]]
- [[2024__arXiv__Why does Prediction Accuracy Decrease over Time? Uncertain Positive Learning for Cloud Failure Prediction]]
- [[2024__arXiv__McUDI - Model-Centric Unsupervised Degradation Indicator for Failure Prediction AIOps Solutions]]
- [[2023__ESEC-FSE__Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365]]
- [[2023__arXiv__Outage-Watch - Early Prediction of Outages using Extreme Event Regularizer]]
- [[2023__DSN__HiMFP - Hierarchical Intelligent Memory Failure Prediction for Cloud Service Reliability]]
- [[2023__WWW__EDITS - An Easy-to-difficult Training Strategy for Cloud Failure Prediction]]
- [[2022__SIGKDD__Multi-task Hierarchical Classification for Disk Failure Prediction in Online Service Systems]]
- [[2010__CSUR__A Survey of Online Failure Prediction Methods]]

### Failure Detection
[[AIOps - Failure Detection - MOC]]

### Fault Localization
[[AIOps - Fault Localization - MOC]]

### Remediation
- DeployFix: [[2024__ASE__DeployFix - Dynamic Repair of Software Deployment Failures via Constraint Solving]]
- Deoxys: [[2024__SoCC__Deoxys - A Causal Inference Engine for Unhealthy Node Mitigation in Large-scale Cloud Infrastructure]]
- [[2023__CASCON__ADARMA - Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models]]
- [[2023__CIKM__On Root Cause Localization and Anomaly Mitigation through Causal Inference]]
- [[2022__SIGKDD__NENYA - Cascade Reinforcement Learning for Cost-Aware Failure Mitigation at Microsoft 365]]
- [[2021__AAAI__Carbon to Diamond - An Incident Remediation Assistant System From Site Reliability Engineers' Conversations in Hybrid Cloud Operations]]
- [[2020__NOMS__MicroRCA - Root Cause Localization of Performance Issues in Microservices]]

### Log Analysis
- [[AIOps - Log Analysis - MOC]]

### Others
- [[2024__arXiv__Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models]]
- DashChef: [[2024__ICECCS__DashChef - A Metric Recommendation Service for Online Systems Using Graph Learning]]
- SALO: [[2024__CLOUD__Self Adjusting Log Observability for Cloud Native Applications]]
- ParaSeer: [[2024__arXiv__Predicting Parameter Change's Effect on Cellular Network Time Series]]
- ResilienceGuardian: [[2024__IWQoS__Guardian of the Resiliency - Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient]]
- LabelEase: [[2024__ISSRE__LabelEase - A Semi-Automatic Tool for Efficient and Accurate Trace Labeling in Microservices]]
- SORN: [[2024__KDD__Cluster-Wide Task Slowdown Detection in Cloud System]]
- Auto-PIP: [[2024__ISSRE__Auto-PIP - Real-time Identification of Critical Performance Inflection Points in Software Stress Testing]]
- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- FLARE: [[2023__Middleware__Fast, Light-weight, and Accurate Performance Evaluation using Representative Datacenter Behaviors]]
- Prism: [[2023__ASE__Prism - Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems]]
- PERT-GNN: [[2023__KDD__PERT-GNN - Latency Prediction for Microservice-based Cloud-Native Applications via Graph Neural Networks]]
- AMW: [[2023__arXiv__Real-time Workload Pattern Analysis for Large-scale Cloud Databases]]
- Oasis: [[2023__ESEC-FSE__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- [[2022__ESEC-FSE__An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction]]
- TraceCRL: [[2022__ESEC-FSE__TraceCRL - Contrastive Representation Learning for Microservice]]
- [[2021__ICWS__Sieve - Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems]]

## Resource Provisioning

### Resource Optimization / Scaling
- [[2022__TCC__Microscaler - Cost-Effective Scaling for Microservice Applications in the Cloud With an Online Learning Approach]]
- TUNA: [[2025__ECCS__TUNA - Tuning Unstable and Noisy Cloud Applications]]
- OPPerTune: [[2024__NSDI__OPPerTune - Post-Deployment Configuration Tuning of Services Made Easy]]
- [[2024__ICSSAS__Optimizing Cloud Infrastructure Management Using Large Language Models - A DevOps Perspective]]
- MLETune: [[2024__ICPADS__MLETune - Streamlining Database Knob Tuning via Multi-LLMs Experts Guided Deep Reinforcement Learning]]
- DeepCAT+: [[2024__TPDS__DeepCAT+ - A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks]]
- HARMONY: [[2024__arXiv__Adaptive Two-Stage Cloud Resource Scaling via Hierarchical Multi-Indicator Forecasting and Bayesian Decision-Making]]
- [[2024__arXiv__Automatic Configuration Tuning on Cloud Database - A Survey]]
- [[2024__CCGrid__SLO-Power - SLO and Power-aware Elastic Scaling for Web Services]]
- [[2024__EuroSys__Erlang - Application-Aware Autoscaling for Cloud Microservices]]
- [[2024__VLDB__DB-BERT - making database tuning tools “read” the manual]]
- [[2024__arXiv__Analytically-Driven Resource Management for Cloud-Native Microservices]]
- [[2023__NeurIPS__On the Promise and Challenges of Foundation Models for Learning-based Cloud Systems Management]]
- [[2023__SoCC__μConAdapter - Reinforcement Learning-based Fast Concurrency Adaptation for Microservices in Cloud]]
- [[2023__TPDS__Optimizing IO Performance through Effective vCPU Scheduling Interference Management]]
- [[2023__arXiv__Gecko - Automated Feature Degradation for Cloud Resilience]]
- [[2020__EuroSys__Autopilot―workload autoscaling at Google]]

### Workload Estimation
- Osprey: [[2024__ESEC-FSE__OS Pre-trained Transformer - Predicting Query Latencies across Changing System Contexts]]
- [[2024__arXiv__Risk-aware Adaptive Virtual CPU Oversubscription in Microsoft Cloud via Prototypical Human-in-the-loop Imitation Learning]]

### Resource Clustering
- EFection: [[2024__IJCIS__EFection - Effectiveness Detection Technique for Clustering Cloud Workload Traces]]
- Prism: [[2023__ASE__Prism - Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems]]
- CloudCluster: [[2022__NSDI__CloudCluster - Unearthing the Functional Structure of a Cloud Service]]
- OmniCluster: [[2022__WWW__Robust System Instance Clustering for Large-Scale Web Services]]

## Data Sampling
- [[2023__ESEC-FSE__STEAM - Observability-Preserving Trace Sampling]]
- [[2021__ICWS__Sieve - Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems]]
- [[2019__SoCC__Sifter - Scalable Sampling for Distributed Traces, without Feature Engineering]]

## Visualization
- [[2023__TVCG__QEVIS - Multi-grained Visualization of Distributed Query Execution]]

## [[AIOpsのRCA研究の評価実験のデータセット作成手法調査 2021年]]

## [[AIOps用データセット]]

## [[LLM4SRE]]

## Case Studies
- [[Spike Detection in Alert Correlation at LinkedIn - SREcon21]]
- [[LINE Pay監視システムの構築とMLを用いた異常ログ検知]]
- [[Anomaly Detection on Golden Signals - SREcon19 Asia-Pasific]]
- [[Automatic Metric Screening for Service Diagnosis - SREcon18 Americas]]
- [[Anomaly Detection in Infrequently Occurred Patterns - SREcon17 Americas]]
- [[Smart Monitor System For Automatic Anomaly Detection @Baidu - SREcon15]]
- [[Next Generation of DevOps - AIOps in Practice @Baidu - SREcon17 Asia]]

## Research Groups
- [[AIOps研究グループ]]

## Competitions
- [[International AIOps Challenge]]
- [[ICASSP-SPGC 2022 AIOps Challenge in Communication Networks]]

## Books
- [[Intelligent Network Management and Operation Systems]]

## Conferences
- [[Workshop on ML for Systems at NeurIPS]]

## Others
- [[2023__arXiv__Assessing the Maturity of Model Maintenance Techniques for AIOps Solutions]]
- [[2023__CASCON__Meta-learning Generalized AIOps Models for Multi-cloud Computer using Digital Twins]]
- [[2023__ECNCT__Design and Implement of AIOps System Based on Knowledge Graph]]
- [[2023__KDD__Contextual Self-attentive Temporal Point Process for Physical Decommissioning Prediction of Cloud Assets]]
- [[Observability and the Misleading Promise of AIOps]]
- [[LogPAI]]

## Other Domains

Software Engineering
- [[2024__arXiv__A Systematic Literature Review on Explainability for Machine-Deep Learning-based Software Engineering Research]]
- [[2023__arXiv__Are They All Good? Studying Practitioners’ Expectations on the Readability of Log Messages]]

Robotics
- [[2023__ICSME__An Empirical Study on Fault Diagnosis in Robotic Systems]]

Road Traffic
- [[2012__ICDM__Inferring the Root Cause in Road Traffic Anomalies]]

## Related MOC
- [[AI for DB - MOC]]
- [[SRE - MOC]]