- Lab website
- https://netman.aiops.org/
- code & dataset
- [NetManAIOps · GitHub](https://github.com/NetManAIOps)
- Professor
- [[Dan Pei]]
## Panels
- [[Cloud Intelligence-AIOps across academia and industry]]
## Papers
### Survey
- [[2025__arXiv__A Survey on AgentOps - Categorization, Challenges, and Future Directions]]
- [[2025__TSEM__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
- [[2024__JNCA__A survey on intelligent management of alerts and incidents in IT services]]
- [[2023__arXiv__A Survey of Time Series Anomaly Detection Methods in the AIOps Domain]]
### Dataset
- MicroServo: [[2024__arXiv__A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management]]
- [[2024__ISSRE__TimeSeriesBench - An Industrial-Grade Benchmark for Time Series Anomaly Detection Models]]
- [[2022__arXiv__Constructing Large-Scale Real-World Benchmark Datasets for AIOps]]
- [[2019__INFOCOM__Label-Less - A Semi-Automatic Labelling Tool for KPI Anomalies]]
### Prediction
- eWarn: [[2020__ESEC-FSE__Real-Time Incident Prediction for Online Service Systems]]
- Uses alerts as the data source
- Uses [[LIME]], an [[XAI]] technique
### Failure detection
- VersaGuardian: [[2025__TON__Real-Time Anomaly Detection for Large-Scale Network Devices]]
- KAN-AD: [[2024__arXiv__KAN-AD - Time Series Anomaly Detection with Kolmogorov-Arnold Networks]]
- LogCraft: [[2024__ASE__End-to-End AutoML for Unsupervised Log Anomaly Detection]]
- KAD-Disformer: [[2024__KDD__Pre-trained KPI Anomaly Detection Model Based on Disentangled Transformer]]
- [[2024__WWW__Supervised Fine-Tuning for Unsupervised KPI Anomaly Detection for Mobile Web Systems]]
- [[2023__WWW__Unsupervised Anomaly Detection on Microservice Traces through Graph VAE]]
- InterFusion: [[2021__KDD__Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding]]
- CTF: [[2021__INFOCOM__CTF - Anomaly Detection in High-Dimensional Time Series with Coarse-to-Fine Model Transfer]]
- PUAD: [[2021__ISSRE__Robust KPI Anomaly Detection for Large-Scale Software Services with Partial Labels]]
- JumpStarter: [[2021__ATC__Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems]]
- Shortens the training phase of anomaly detection to cope with the retraining problem caused by software changes
- LogAD: [[2021__ESEC-FSE__An Empirical Investigation of Practical Log Anomaly Detection for Online Service Systems]]
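- Allows domain knowledge about metrics to be injected
- Improves interpretability > it is hard to intuitively understand what an anomaly is, why the current time point is anomalous, and how the expected normal pattern should behave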
- ScWarn: [[2021__ESEC_FSE__Identifying Bad Software Changes via Multimodal Anomaly Detection for Online Service Systems]]
- Uses [[マルチモーダル学習]] (multimodal learning)
- Focuses on bad software changes
- LogTransfer: [[2020__ISSRE__LogTransfer - Cross-System Log Anomaly Detection for Software Systems with Transfer Learning]]
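- Cross-system anomaly detection: for a software system with insufficient anomaly labels (the target system), learn from a software system with sufficient anomaly labels (the source system)
- Uses [[転移学習]] (transfer learning)
- Uses network switch logs from various vendors, collected from a cloud service provider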
- TraceAnomaly: [[2020__ISSRE__Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks]]
- Period: [[2019__TNSM__Automatic and Generic Periodicity Adaptation for KPI Anomaly Detection]]
- ROCKA: [[2018__IWQoS__Robust and Rapid Clustering of KPIs for Large-Scale Anomaly Detection]]
- StepWise: [[2018__ISSRE__Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection]]
- Donut: [[2018__WWW__Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications]]
### Root cause analysis
- [[2024__TSC__No More Data Silos - Unified Microservice Failure Diagnosis with Temporal Knowledge Graph]]
- SparseRCA: [[2024__ISSRE__SparseRCA - Unsupervised Root Cause Analysis in Sparse Microservice Testing Traces]]
- LatentSpace: [[2024__KDD__Microservice Root Cause Analysis With Limited Observability Through Intervention Recognition in the Latent Space]]
- Chain-of-Event: [[2024__FSE__Chain-of-Event - Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph]]
- SynthoDiag: [[2024__ESEC-FSE__Fault Diagnosis for Test Alarms in Microservices through Multi-source Data]]
- [[2024__ESEC-FSE__Illuminating the Gray Zone - Non-intrusive Gray Failure Localization in Server Operating Systems]]
- MonitorAssistant: [[2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models]]
- MicroDig: [[2024__TSC__Diagnosing Performance Issues for Large-Scale Microservice Systems with Heterogeneous Graph]]
- GTrace: [[2023__ESEC-FSE__From Point-wise to Group-wise - A Fast and Accurate Microservice Trace Anomaly Detection Approach]]
- [[2023__WWW__CMDiagnostor - An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data]]
- RC-LIR: [[2022__ISSRE__Effective Attribute Selection for Multi-dimensional Root Cause Analysis]]
- DéjàVu: [[2022__ESEC-FSE__Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems]]
- CIRCA: [[2022__KDD__Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition]]
- CauseRank: [[2022__CCGrid__Generic and Robust Performance Diagnosis via Causal Inference for OLTP Database Systems]]
- Time-series [[因果探索]] (causal discovery) over metric groups
- RobustSpot: [[2022__ITSC__Robust Anomaly Localization of Multi-dimensional Derived Measure for Online Video]]
- OmniCluster: [[2022__WWW__Robust System Instance Clustering for Large-Scale Web Services]]
- PatternMatcher: [[2021__ISSRE__Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems]]
- Organizes typical anomaly patterns based on interviews with engineers
- Takes interpretability into account
- TraceRCA: [[2021__IWQOS__Practical Root Cause Localization for Microservice Systems via Trace Analysis]]
- Builds on the insight that a microservice with many abnormal traces and few normal traces is more likely to be the root-cause microservice
- FluxInfer: [[2020__IPCCC__FluxInfer―Automatic Diagnosis of Performance Anomaly for Online Database System]]
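- Because edge orientation based on [[条件付き独立性]] (conditional independence) is error-prone, uses a WUDG (weighted undirected dependency graph) instead of the [[PCアルゴリズム]] (PC algorithm)
- Uses the [[PageRank]] algorithm rather than [[深さ優先探索]] (depth-first search) or [[ランダムウォーク]] (random walk)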
- MicroCause: [[2020__IWQoS__Localizing Failure Root Causes in a Microservice through Causality Inference]]
- Incorporates temporal considerations into the conditional independence tests of the [[PCアルゴリズム]] (PC algorithm)
- CoFlux: [[2019__ IWQoS__CoFlux - Robustly Correlating KPIs by Fluctuations for Service Troubleshooting]]
- A flux-correlation-based troubleshooting method that focuses on temporal order
- FluxRank: [[2019__ISSRE__FluxRank―A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation]]
- Squeeze: [[2019__ISSRE__Generic and Robust Localization of Multi-Dimensional Root Causes]]
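
Memo: the FluxInfer note in the list above mentions ranking candidates with PageRank over a weighted undirected dependency graph (WUDG). A minimal sketch of that general idea, not FluxInfer's actual implementation: the KPI names, edge weights, damping factor, and the `weighted_pagerank` helper are all hypothetical, and edge weights are assumed to be positive.

```python
from collections import defaultdict


def weighted_pagerank(edges, damping=0.85, iterations=100, tol=1e-9):
    """Rank nodes of a weighted undirected graph by power iteration.

    edges: iterable of (node_a, node_b, weight) with weight > 0 (assumption).
    Returns a dict mapping node -> score; scores sum to 1.
    """
    # Undirected graph: store every edge in both directions.
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        neighbors[u][v] = neighbors[u].get(v, 0.0) + w
        neighbors[v][u] = neighbors[v].get(u, 0.0) + w

    nodes = sorted(neighbors)
    n = len(nodes)
    # Total edge weight attached to each node, used to normalize outgoing mass.
    strength = {u: sum(neighbors[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(iterations):
        new_rank = {}
        for u in nodes:
            # Each neighbor v passes on its score in proportion to the edge weight.
            incoming = sum(rank[v] * w / strength[v] for v, w in neighbors[u].items())
            new_rank[u] = (1.0 - damping) / n + damping * incoming
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical KPI dependency graph; weights stand in for correlation strength.
example_edges = [
    ("cpu_util", "slow_queries", 0.9),
    ("slow_queries", "lock_waits", 0.8),
    ("slow_queries", "disk_io", 0.7),
    ("disk_io", "net_in", 0.2),
]
ranks = weighted_pagerank(example_edges)
top_candidate = max(ranks, key=ranks.get)
```

Here `top_candidate` is the KPI most strongly connected to the rest of the graph, which is the kind of node a PageRank-style ranking surfaces as a root-cause candidate without needing edge directions.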
## Others
- LabelEase: [[2024__ISSRE__LabelEase - A Semi-Automatic Tool for Efficient and Accurate Trace Labeling in Microservices]]
- ChatTS: [[2024__arXiv__ChatTS - Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning]]
- SelfLog: [[2024__ISSRE__Self-Evolutionary Group-wise Log Parsing Based on Large Language Mode]]
- ParaSeer: [[2024__arXiv__Predicting Parameter Change's Effect on Cellular Network Time Series]]
- LogEval: [[2024__arXiv__LogEval - A Comprehensive Benchmark Suite for Large Language Models In Log Analysis]]
- YADING: [[2015__VLDB__YADING - Fast Clustering of Large-Scale Time Series Data]]