- Lab website
    - https://netman.aiops.org/
- Code & dataset
    - [NetManAIOps · GitHub](https://github.com/NetManAIOps)
- Professor
    - [[Dan Pei]]

## Panels

- [[Cloud Intelligence-AIOps across academia and industry]]

## Papers

### Survey

- [[2025__arXiv__A Survey on AgentOps - Categorization, Challenges, and Future Directions]]
- [[2025__TSEM__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__JNCA__A survey on intelligent management of alerts and incidents in IT services]]
- [[2023__arXiv__A Survey of Time Series Anomaly Detection Methods in the AIOps Domain]]

### Dataset

- MicroServo: [[2024__arXiv__A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management]]
- [[2024__ISSRE__TimeSeriesBench - An Industrial-Grade Benchmark for Time Series Anomaly Detection Models]]
- [[2022__arXiv__Constructing Large-Scale Real-World Benchmark Datasets for AIOps]]
- [[2019__INFOCOM__Label-Less - A Semi-Automatic Labelling Tool for KPI Anomalies]]

### Prediction

- eWarn: [[2020__ESEC-FSE__Real-Time Incident Prediction for Online Service Systems]]
    - The data source is alerts
    - Uses [[LIME]], an [[XAI]] technique

### Failure detection

- VersaGuardian: [[2025__TON__Real-Time Anomaly Detection for Large-Scale Network Devices]]
- KAN-AD: [[2024__arXiv__KAN-AD - Time Series Anomaly Detection with Kolmogorov-Arnold Networks]]
- LogCraft: [[2024__ASE__End-to-End AutoML for Unsupervised Log Anomaly Detection]]
- KAD-Disformer: [[2024__KDD__Pre-trained KPI Anomaly Detection Model Based on Disentangled Transformer]]
- [[2024__WWW__Supervised Fine-Tuning for Unsupervised KPI Anomaly Detection for Mobile Web Systems]]
- [[2023__WWW__Unsupervised Anomaly Detection on Microservice Traces through Graph VAE]]
- InterFusion: [[2021__KDD__Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding]]
- CTF: 
[[2021__INFOCOM__CTF - Anomaly Detection in High-Dimensional Time Series with Coarse-to-Fine Model Transfer]]
- PUAD: [[2021__ISSRE__Robust KPI Anomaly Detection for Large-Scale Software Services with Partial Labels]]
- JumpStarter: [[2021__ATC__Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems]]
    - Shortens the training phase of anomaly detection to cope with the retraining problem caused by software changes
- LogAD: [[2021__ESEC-FSE__An Empirical Investigation of Practical Log Anomaly Detection for Online Service Systems]]
    - Allows domain knowledge about metrics to be injected
    - Improves interpretability
      > It is difficult to intuitively understand what an anomaly is, why the current time point is anomalous, and how the expected normal pattern should behave
- ScWarn: [[2021__ESEC_FSE__Identifying Bad Software Changes via Multimodal Anomaly Detection for Online Service Systems]]
    - Uses [[マルチモーダル学習|multimodal learning]]
    - Focuses on bad software changes
- LogTransfer: [[2020__ISSRE__LogTransfer - Cross-System Log Anomaly Detection for Software Systems with Transfer Learning]]
    - Cross-system anomaly detection: for a software system with too few anomaly labels (the target system), learns from a software system with sufficient anomaly labels (the source system)
    - Uses [[転移学習|transfer learning]]
    - Uses network switch logs from various vendors, collected from a cloud service provider
- TraceAnomaly: [[2020__ISSRE__Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks]]
- Period: [[2019__TNSM__Automatic and Generic Periodicity Adaptation for KPI Anomaly Detection]]
- ROCKA: [[2018__IWQoS__Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection]]
- StepWise: [[2018__ISSRE__Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection]]
- Donut: [[2018__WWW__Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications]]

### Root cause analysis

- [[2024__TSC__No More Data Silos - Unified Microservice Failure Diagnosis with Temporal Knowledge Graph]]
- SparseRCA: [[2024__ISSRE__SparseRCA - Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces]]
- LatentSpace: [[2024__KDD__Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space]]
- Chain-of-Event: 
[[2024__FSE__Chain-of-Event - Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph]]
- SynthoDiag: [[2024__ESEC-FSE__Fault Diagnosis for Test Alarms in Microservices through Multi-source Data]]
- [[2024__ESEC-FSE__Illuminating the Gray Zone - Non-intrusive Gray Failure Localization in Server Operating Systems]]
- MonitorAssistant: [[2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models]]
- MicroDig: [[2024__TSC__Diagnosing Performance Issues for Large-Scale Microservice Systems with Heterogeneous Graph]]
- GTrace: [[2023__ESEC-FSE__From Point-wise to Group-wise - A Fast and Accurate Microservice Trace Anomaly Detection Approach]]
- [[2023__WWW__CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data]]
- RC-LIR: [[2022__ISSRE__Effective Attribute Selection for Multi-dimensional Root Cause Analysis]]
- DéjàVu: [[2022__ESEC-FSE__Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems]]
- CIRCA: [[2022__KDD__Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition]]
- CauseRank: [[2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems]]
    - Time-series [[因果探索|causal discovery]] over metric groups
- RobustSpot: [[2022__ITSC__Robust Anomaly Localization of Multi-dimensional Derived Measure for Online Video]]
- OmniCluster: [[2022__WWW__Robust System Instance Clustering for Large-Scale Web Services]]
- PatternMatcher: [[2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]]
    - Catalogs typical anomaly patterns based on interviews with engineers
    - Takes interpretability into account
- TraceRCA: [[2021__IWQOS__Practical Root Cause Localization for Microservice Systems via Trace Analysis]]
    - Builds on the insight that a microservice with many abnormal traces and few normal traces is likely to be the root-cause microservice
- FluxInfer: [[2020__IPCCC__FluxInfer―Automatic Diagnosis of Performance Anomaly for Online Database System]]
    - Uses a WUDG (weighted undirected dependency graph) instead of the [[PCアルゴリズム|PC algorithm]], because edge orientation based on [[条件付き独立性|conditional independence]] is error-prone
    - Uses the [[PageRank]] algorithm instead of [[深さ優先探索|depth-first search]] or a [[ランダムウォーク|random walk]]
- MicroCause: [[2020__IWQoS__Localizing Failure Root Causes in a Microservice through Causality Inference]]
    - Takes temporal information into account in the conditional independence tests of the [[PCアルゴリズム|PC algorithm]]
- CoFlux: [[2019__ IWQoS__CoFlux - Robustly Correlating KPIs by Fluctuations for Service Troubleshooting]]
    - A flux-correlation-based troubleshooting method that focuses on temporal order
- FluxRank: [[2019__ISSRE__FluxRank―A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation]]
- Squeeze: [[2019__ISSRE__Generic and Robust Localization of Multi-Dimensional Root Causes]]

## Others

- LabelEase: [[2024__ISSRE__LabelEase - A Semi-Automatic Tool for Efficient and Accurate Trace Labeling in Microservices]]
- ChatTS: [[2024__arXiv__ChatTS - Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning]]
- SelfLog: [[2024__ISSRE__Self-Evolutionary Group-wise Log Parsing Based on Large Language Mode]]
- ParaSeer: [[2024__arXiv__Predicting Parameter Change's Effect on Cellular Network Time Series]]
- LogEval: [[2024__arXiv__LogEval - A Comprehensive Benchmark Suite for Large Language Models In Log Analysis]]
- YADING: [[2015__VLDB__YADING - Fast Clustering of Large-Scale Time Series Data]]
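
The FluxInfer note under Root cause analysis (ranking the nodes of a weighted undirected dependency graph with PageRank) can be illustrated with a minimal Python sketch. This is a generic weighted PageRank by power iteration, not FluxInfer's actual implementation; the function name `weighted_pagerank`, the metric names, and the edge weights are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=100, tol=1e-9):
    """Rank nodes of a weighted undirected graph by power iteration.

    edges: iterable of (node_a, node_b, weight) tuples with weight > 0.
    Returns a dict mapping each node to a score; scores sum to 1.
    """
    # Store every edge in both directions because the graph is undirected.
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w

    nodes = sorted(neighbors)
    n = len(nodes)
    # Total edge weight attached to each node, used to normalize transitions.
    strength = {v: sum(neighbors[v].values()) for v in nodes}
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            # Each neighbor u passes on its score in proportion to edge weight.
            incoming = sum(
                rank[u] * w / strength[u] for u, w in neighbors[v].items()
            )
            new_rank[v] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical dependency graph: weights stand in for correlation strength.
edges = [
    ("cpu_usage", "query_latency", 0.9),
    ("query_latency", "lock_waits", 0.8),
    ("query_latency", "disk_io", 0.7),
    ("disk_io", "cpu_usage", 0.2),
]
ranks = weighted_pagerank(edges)
for metric, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{metric}: {score:.3f}")
```

Because the graph is undirected, no edge orientation has to be decided before ranking, which is the point the note makes about avoiding orientation errors from conditional independence tests.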