MOC note on failure detection and anomaly detection, among the Failure Management tasks of [[AIOps]].

## Papers

- PA-Rank: [[2025__IoTJournal__PA-Rank - A GAN and Reinforcement Learning Powered Framework for Multi-Metric Anomaly Detection and Causal Diagnosis]]
- FUSION: [[2025__KDD__Enhancing_Microservices_Anomaly_Detection_via_Multimodal_Data_Fusion_in_the_Wavelet_Domain_and_Spatiotemporal_Graph-based_Diffusion_Probabilistic_Model]]
- CloudAnoAgent: [[2025__arXiv__CloudAnoAgent - Anomaly Detection for Cloud Sites via LLM Agent with Neuro-Symbolic Mechanism]]
- [[2025__CLOSER__Anomaly Detection for Partially Observable Container Systems Based on Architecture Profiling]]
- [[2025__CMC__LogDA - Dual Attention-Based Log Anomaly Detection Addressing Data Imbalance]]
- ADAMAS: [[2025__ICSE__ADAMAS - Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems]]
- Loggraph: [[2025__TSE__Anomaly Detection on Interleaved Log Data With Semantic Association Mining on Log-Entity Graph]]
- VersaGuardian: [[2025__TON__Real-Time Anomaly Detection for Large-Scale Network Devices]]
- ShareAD: [[2024__TSEM__On the Practicability of Deep Learning based Anomaly Detection for Modern Online Software Systems - A Pre-Train-and-Align Framework]]
- RBAD: [[2024__TSC__Towards Close-To-Zero Runtime Collection Overhead - Raft-Based Anomaly Diagnosis on System Faults for Distributed Storage System]]
- SpikeLog: [[2024__TKDE__SpikeLog - Log-Based Anomaly Detection via Potential-Assisted Spiking Neuron Network]]
- GCAD: [[2025__Complex & Intelligent Systems__Towards accurate anomaly detection for cloud system via graph-enhanced contrastive learning]]
- LogLLM: [[2024__arXiv__LogLLM - Log-based Anomaly Detection Using Large Language Models]]
- KAN-AD: [[2024__arXiv__KAN-AD - Time Series Anomaly Detection with Kolmogorov-Arnold Networks]]
- GAD: [[2024__CIKM__GAD - A Generalized Framework for Anomaly Detection at Different Risk Levels]]
- [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- [[2024__ISSRE__DRLFailureMonitor - A Dynamic Failure Monitoring Approach for Deep Reinforcement Learning System]]
- [[2024__ICWS__Anomaly Detection and Root Cause Analysis of Microservices Energy Consumption]]
- [[2024__arXiv__What Information Contributes to Log-based Anomaly Detection? Insights from a Configurable Transformer-Based Approach]]
- WebNorm: [[2024__ASE__Detecting and Explaining Anomalies Caused by Web Tamper Attacks via Building Consistency-based Normality]]
- LogCraft: [[2024__ASE__End-to-End AutoML for Unsupervised Log Anomaly Detection]]
- LogCleaner: [[2024__ESEM__Reducing Events to Augment Log-based Anomaly Detection Models - An Empirical Study]]
- [[2024__TSE__iTCRL - Causal-intervention-based Trace Contrastive Representation Learning for Microservice Systems]]
- [[2024__ISSRE__Multivariate Time Series Anomaly Detection based on Pre-trained Models with Dual-Attention Mechanism]]
- [[2024__EAAI__An effective failure detection method for microservice-based systems using distributed tracing data]]
- MAAD: [[2024__IPDPS__MAAD - A Distributed Anomaly Detection Architecture for Microservices Systems]]
- KAD-Disformer: [[2024__KDD__Pre-trained KPI Anomaly Detection Model Based on Disentangled Transformer]]
- [[2024__ICPE__Disambiguating Performance Anomalies from Workload Changes in Cloud-Native Applications]]
- [[2024__KDD__Multivariate Log-based Anomaly Detection for Distributed Database]]
- [[2024__TSC__UAC-AD - Unsupervised Adversarial Contrastive Learning for Anomaly Detection on Multi-modal Data in Microservice Systems]]
- [[2024__ICSE-SEIP__Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations]]
- MonitorAssistant: [[2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models]]
- [[2024__FGCS__Multi-task federated learning-based system anomaly detection and multi-classification for microservices architecture]]
- [[2024__TSE__On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection - Insights and Recommendations]]
- [[2024__WWW__Supervised Fine-Tuning for Unsupervised KPI Anomaly Detection for Mobile Web Systems]]
- [[2024__AAAI__Root Cause Explanation of Outliers under Noisy Mechanisms]]
- [[2024__arXiv__Few-Shot Cross-System Anomaly Trace Classification for Microservice-based systems]]
- FCVAE: [[2024__WWW__Revisiting VAE for Unsupervised Time Series Anomaly Detection - A Frequency Perspective]]
- ServiceAnomaly: [[2023__JSS__ServiceAnomaly - An anomaly detection approach in microservices using distributed traces and profiling metrics]]
- ASGNet: [[2023__ICONIP__ASGNet - Adaptive Semantic Gate Networks for Log-Based Anomaly Diagnosis]]
- BroadCAE: [[2024__TNSM__Detecting Cloud Anomaly via Broad Network-Based Contrastive Autoencoder]]
- OmniTransfer: [[2023__ICWS__Efficient Multivariate Time Series Anomaly Detection Through Transfer Learning for Large-Scale Software Systems]]
- [[2023__BdKCSE__AIOps-Driven Enhancement of Log Anomaly Detection in Unsupervised Scenarios]]
- [[2023__ICSE__Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection]]
- AutoKAD: [[2023__ISSRE__ AutoKAD - Empowering KPI Anomaly Detection with Label-Free Deployment]]
- TraceSieve: [[2023__ISSRE__Efficient and Robust Trace Anomaly Detection for Large-Scale Microservice Systems]]
- Triple: [[2023__IWQoS__Triple - The Interpretable Deep Learning Anomaly Detection Framework based on Trace-Metric-Log of Microservice]]
- [[2023__TKDE__Concept Drift-Based Runtime Reliability Anomaly Detection for Edge Services Adaptation]]
- AutoLog: [[2023__ASE__AutoLog - A Log Sequence Synthesis Framework for Anomaly Detection]]
- CMAnomaly: [[2023__ISSRE__Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services]]
- Maat: [[2023__ASE__Maat - Performance Metric Anomaly Anticipation for Cloud Services with Conditional Diffusion]]
- [[2023__AIAHPC__KPI anomaly detection method of AIOps based on GAN]]
- RTAnomaly: [[2023__arXiv__Performance Issue Identification in Cloud Systems with Relational-Temporal Anomaly Detection]]
- TraceArk: [[2023__ICSE-SEIP__TraceArk - Towards Actionable Performance Anomaly Alerting for Online Service Systems]]
- SeaLog: [[2023__arXiv__Scalable and Adaptive Log-based Anomaly Detection with Expert in the Loop]]
- AnoFusion: [[2023__KDD__Robust Multimodal Failure Detection for Microservice Systems]]
- LogBD: [[2023__preprints__LogBD - A Log Anomaly Detection Method Based on Pre-trained Models and Domain Adaptation]]
- OutSpot: [[2023__IEEE Transactions on Computers__Efficient and Robust KPI Outlier Detection for Large-Scale Datacenters]]
- [[2022__OSR__An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection1]]
- DAM: [[2022__ICWS__Deep Attentive Anomaly Detection for Microservice Systems with Multimodal Time-Series Data]]
- PULL: [[2023__HICSS__PULL - Reactive Log Anomaly Detection Based On Iterative PU Learning]]
- HigeNet: [[2022__arXiv__HigeNet - A Highly Efficient Modeling for Long Sequence Time Series Prediction in AIOps]]
- Kontrast: [[2022__ISSRE__Identifying Erroneous Software Changes through Self-Supervised Contrastive Learning on Time Series Data]]
- Uni-AD: [[2022__ISSRE__Share or Not Share? Towards the Practicability of Deep Models for Unsupervised Anomaly Detection in Modern Online Systems]]
- Micro2vec: [[2022__Journal of Network and Computer Applications__Micro2vec - Anomaly detection in microservices systems by mining numeric representations of computer logs]]
- GGIAnomaly: [[2022__SCC__A General KPI Anomaly Detection Using Attention Models]]
- AnoTransfer: [[2022__JSAC__Efficient KPI Anomaly Detection Through Transfer Learning for Large-Scale Web Services]]
- ACVAE: [[2022__J-SAC__Situation-Aware Multivariate Time Series Anomaly Detection through Active Learning and Contrast VAE-based Models in Large Distributed Systems]]
- HADES: [[2022__arXiv__Heterogeneous Anomaly Detection for Software Systems via Attentive Multi-modal Learning]]
- [[2022__VLDB__TranAD - Deep Transformer Networks for Anomaly Detection in Multivariate Time Series Data]]
- [[2022__OSR__An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection]]
- [[2022__ICSE__Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching]]
- [[2022__WWW__Robust System Instance Clustering for Large-Scale Web Services]]
- [[2022__ICSE__DeepTraLog - Trace-Log Combined Microservice Anomaly Detection through Graph-based Deep Learning]]
- [[2022__ICDSM__Unstructured Log Analysis for System Anomaly Detection—A Study]]
- [[2022__CSUR__Anomaly Detection and Failure Root Cause Analysis in (Micro)Service-Based Cloud Applications - A Survey]]
- [[2021__ATC__Fighting the Fog of War - Automated Incident Detection for Cloud Systems]]
- [[2021__KDD__Practical Approach to Asynchronous Multivariate Time Series Anomaly Detection and Localization]]
- [[2021__INFOCOM__CTF - Anomaly Detection in High-Dimensional Time Series with Coarse-to-Fine Model Transfer]]
- [[2021__ICSOC__Little Help Makes a Big Difference - Leveraging Active Learning to Improve Unsupervised Time Series Anomaly Detection]]
- [[2021__Journal-of-Software__TraceRank - Abnormal service localization with dis-aggregated end-to-end tracing data in cloud native systems]]
- [[2021__ISSRE__Robust KPI Anomaly Detection for Large-Scale Software Services with Partial Labels]]
- MID: [[2020__ESEC-FSE__Efficient incident identification from multi-dimensional issue reports via meta-heuristic search]]
- [[2020__ TNNLS__A Spatiotemporal Deep Learning Approach for Unsupervised Anomaly Detection in Cloud Systems]]
- Period: [[2019__TNSM__Automatic and Generic Periodicity Adaptation for KPI Anomaly Detection]]
- StepWise: [[2018__ISSRE__Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection]]
- Donut: [[2018__WWW__Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications]]
- ROCKA: [[2018__IWQoS__Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection]]
- TcpRT: [[2018__SIGMOD__TcpRT - Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time]]
- [[2018__CLOUD__Detecting Anomalous Behavior of Black-Box Services Modeled with Distance-Based Online Clustering]]
- [[2020__CLOUD__Anomaly Detection from System Tracing Data using Multimodal Deep Learning]]
- [[Automated Anomaly Detection and Localization Syste]]
- [[1993__USENIX STC__Computer System Performance Problem Detection Using Time Series Models]]

## Case Studies

- [[Mackerelの異常検知|Anomaly detection in Mackerel]]
- [[Announcing Smarter Real-Time Alerts With the KPSS Statistic for Signalflow]]
- [[AIを活用した障害検知システムの運用を開始 - KDDI|Operation of an AI-based failure detection system begins - KDDI]]