[[AIOps]] using [[LLM]]s
## Products
- [[LLMベースのAIOpsプロダクト|LLM-based AIOps products]]
## Blog / Slides
- [[Classifying Error Logs with AI Can DeepSeek R1 Outperform GPT-4o and Llama 3?]]
- [[EventOrOutage - Rootly-AI-Labs]]
## Papers
- <https://github.com/IntelligentDDS/awesome-papers/blob/main/LLM4Ops/README.md>
- [[Awesome LLM AIOps]]
- [[LLM4DB - code4DB]]
### General Remarks
- [[2025__arXiv__A Survey on AgentOps - Categorization, Challenges, and Future Directions]]
- [[2025__arXiv__Intent-based System Design and Operation]]
- ChatTS: [[2024__arXiv__ChatTS - Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning]]
- [[2023__ICDM__Exploring Large Language Models for Low-Resource IT Information Extraction]]
- [[2023__CASCON__Proactive Continuous Operations using Large Language Models (LLMs) and AIOps]]
- Owl: [[2024__ICLR__Owl - A Large Language Model for IT Operations]]
### Benchmark
- OpenRCA: [[2025__ICLR__OpenRCA - Can Large Language Models Locate the Root Cause of Software Failures]]
- AIOpsLab: [[2025__MLSys__AIOpsLab - A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud]]
- ITBench: [[2025__arXiv__ITBench - Evaluating AI Agents across Diverse Real-World IT Automation Tasks]]
- OpsEval: [[2023__arXiv__OpsEval - A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models]]
### Failure Management
- ServiceOdyssey: [[2025__arXiv__Enabling Autonomic Microservice Management through Self-Learning Agents]]
- FLASH: [[2024__MSResearch__FLASH - A Workflow Automation Agent for Diagnosing Recurring Incidents]]
- LLexus: [[2024__OSR__LLexus - an AI agent system for incident management]]
- [[2025__CSUR__A Survey of AIOps for Failure Management in the Era of Large Language Models]]
#### Applications
- VOCE: [[2025__FASE__VOCE A Virtual On-Call Engineer for Automated Alert Incident Analysis Using a Large Language Model]]
- TAMO: [[2025__arXiv__TAMO - Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data]]
- ThinkFL: [[2024__arXiv__ThinkFL - Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning]]
- eARCO: [[2025__arXiv__eARCO - Efficient Automated Root Cause Analysis with Prompt Optimization]]
- GALA: [[2025__arXiv__Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis]]
- LogInsight: [[2025__TSC__Accurate and Interpretable Log-Based Fault Diagnosis using Large Language Models]]
- [[2025__ICCS__AIOps for Reliability - Evaluating Large Language Models for Automated Root Cause Analysis in Chaos Engineering]]
- InsightAI: [[2025__CAIN__InsightAI - Root Cause Analysis in Large Log Files with Private Data Using Large Language Model]]
- DECO: [[2024__arXiv__DECO - Life-Cycle Management of Production-Scale Copilots]]
- MonitorAssistant: [[2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models]]
- [[2024__Electronics__Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs]]
- LasRCA: [[2024__ASE__The Potential of One-Shot Failure Root Cause Analysis - Collaboration of the Large Language Model and Small Classifier]]
- DualLMAD: [[2024__ISSRE__Multivariate Time Series Anomaly Detection based on Pre-trained Models with Dual-Attention Mechanism]]
- Cloud Atlas: [[2024__arXiv__Cloud Atlas - Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight]]
- COMET: [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- [[2024__ISSTA__Face It Yourselves - An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs]]
- [[2024__ICSE-SEIP__Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations]]
- [[2024__arXiv__mABC - multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture]]
- [[2024__none__Automated processing of monitoring data for proactive root cause analysis in service-based systems]]
- [[2024__arXiv__X-lifecycle Learning for Cloud Incident Management using LLMs]]
- [[2024__ICSE__Knowledge-aware Alert Aggregation in Large-scale Cloud Systems - a Hybrid Approach]]
- [[2024__arXiv__Exploring LLM-based Agents for Root Cause Analysis]]
- RCACopilot: [[2024__EuroSys__Automatic Root Cause Analysis via Large Language Models for Cloud Incidents]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- [[2024__arXiv__Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4]]
- Xpert: [[2024__ICSE__Xpert - Empowering Incident Management with Query Recommendations via Large Language Models]]
- [[2023__CLOUD__InsightsSumm - Summarization of ITOps Incidents Through In-Context Prompt Engineering]]
- [[2023__CASCON__ADARMA - Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models]]
- RCAgent: [[2024__CIKM__RCAgent - Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models]]
- PACE-LM: [[2023__arXiv__PACE-LM - Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis]]
- Oasis: [[2023__arXiv__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- [[2023__ICSE__Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models]]
- [[Large-language models for automatic cloud incident management]]
#### Databases
- AgentFM: [[2025__FSE__AgentFM - Role-Aware Failure Management for Distributed Databases with LLM-Driven Multi-Agents]]
- LLMIdxAdvis: [[2025__PVLDB__LLMIdxAdvis - Resource-Efficient Index Advisor Utilizing Large Language Model]]
- DBAIOps: [[2025__arXiv__DBAIOps - A Reasoning LLM Enhanced Database Operation and Maintenance System using Knowledge Graphs]]
- Andromeda: [[2025__SIGMOD__Automatic Database Configuration Debugging using Retrieval-Augmented Language Models]]
- [[2024__arXiv__Query Performance Explanation through Large Language Model for HTAP Systems]]
- MLETune: [[2024__ICPADS__MLETune - Streamlining Database Knob Tuning via Multi-LLMs Experts Guided Deep Reinforcement Learning]]
- Panda: [[2024__CIDR__Panda - Performance Debugging for Databases using LLM Agents]]
- [[2024__VLDB__DB-BERT - making database tuning tools “read” the manual]]
- [[Tsinghua Database Group]]
  - D-Bot: [[2023__arXiv__D-Bot - Database Diagnosis System using Large Language Models]]
  - [[2023__arXiv__LLM As DBA]]
#### Network
- [[2025__ACCESS__Small Language Model Agent for the Operations of Continuously Updating ICT Systems]]
- NetSemantic: [[2025__arXiv__Adapting Network Information into Semantics for Generalizable and Plug-and-Play Multi-Scenario Network Diagnosis]]
- [[2024__OPAC__Large language model-based optical network log analysis using LLaMA2 with instruction tuning]]
- [[2023__HotNets__A Holistic View of AI-driven Network Incident Management]]
#### General
- LLMAD: [[2024__arXiv__Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection]]
### Operational Log Analysis / Logging
- [[2025__ICPE__LogAn - An LLM-Based Log Analytics Tool with Causal Inferencing]]
- [[2025__PAKDD__Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection]]
- [[2025__I2CACIS__Extensive Log Analysis Using Small Language Models in Minimal Resource Environments]]
- LogBabylon: A Unified Framework for Cross-Log File Integration and Analysis
- Chatting with Logs: [[2024__arXiv__Chatting with Logs - An exploratory study on Finetuning LLMs for LogQL]]
- SuperLog: [[2024__arXiv__Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge]]
- LogLLM: [[2024__arXiv__LogLLM - Log-based Anomaly Detection Using Large Language Models]]
- LoFI: [[2024__ISSRE__Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis]]
- LogRAG: [[2024__ISSRE__Leveraging RAG-Enhanced Large Language Model for Semi-Supervised Log Anomaly Detection]]
- LLMeLog: [[2024__ISSRE__LLMeLog - An Approach for Anomaly Detection based on LLM-enriched Log Events]]
- SelfLog: [[2024__ISSRE__Self-Evolutionary Group-wise Log Parsing Based on Large Language Mode]]
- [[2024__arXiv__Studying and Benchmarking Large Language Models For Log Level Suggestion]]
- LogGenius: [[2024__ICWS__LogGenius - An Unsupervised Log Parsing Framework with Zero-shot Prompt Engineering]]
- LogLM: [[2024__arXiv__LogLM - From Task-based to Instruction-based Automated Log Analysis]]
- [[2024__arXiv__LUK - Empowering Log Understanding with Expert Knowledge from Large Language Models]]
- [[2024__KDD__LogParser-LLM - Advancing Efficient Log Parsing with Large Language Models]]
- [The Effectiveness of Compact Fine-Tuned LLMs in Log Parsing](https://www.semanticscholar.org/paper/The-Effectiveness-of-Compact-Fine-Tuned-LLMs-in-Log-Mehrabi-Hamou-Lhadj/86c21bffd675216befc20afd3cda69c056f84dab)
- LogEval: [[2024__arXiv__LogEval - A Comprehensive Benchmark Suite for Large Language Models In Log Analysis]]
- ULog: [[2024__arXiv__ULog - Unsupervised Log Parsing with Large Language Models through Log Contrastive Units]]
- DivLog: [[2024__ICSE__DivLog - Log Parsing with Prompt Enhanced In-Context Learning]]
- LLMParser: [[2023__arXiv__LLMParser - A LLM-based Log Parsing Framework]]
- LogPrompt: [[2023__arXiv__LogPrompt - Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis]]
- LogDiv: [[2023__arXiv__Prompting for Automatic Log Template Extraction]]
- [[2023__CLOUD__Learning Representations on Logs for AIOps]]
### Logging
- [[2024__TSE__Exploring the Effectiveness of LLMs in Automated Logging Statement Generation - An Empirical Study]]
- UniLog: [[2024__ICSE__UniLog - Automatic Logging via LLM and In-Context Learning]]
- [[2023__arXiv__Exploring the Effectiveness of LLMs in Automated Logging Generation - An Empirical Study]]
### Configuration Management
- Ciri: [[2025__ICSE__Large Language Models as Configuration Validators]]
- [[2024__arXiv__Large Language Models for Zero Touch Network Configuration Management]]
- PerfSense: [[2024__arXiv__Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents]]
- LogConfigLocalizer: [[2024__ISSTA__Face It Yourselves - An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs]]
- Ciri: [[2023__arXiv__Configuration Validation with Large Language Models]]
### [[Infrastructure as Code|IaC]]
- [[2024__ICOIN__A Survey of using Large Language Models for Generating Infrastructure as Code]]
- [[2023__HotNets__Simplifying Cloud Management with Cloudless Computing]]
### Data Management
- [[2024__arXiv__LLM-Enhanced Data Management]]
### Optimization
- [[2024__ICSSAS__Optimizing Cloud Infrastructure Management Using Large Language Models - A DevOps Perspective]]
### [[Chaos Engineering]]
- ChaosEater: [[2025__arXiv__ChaosEater - Fully Automating Chaos Engineering with Large Language Models]]
### Security
- [[2025__arXiv__When AIOps Become AI Oops - Subverting LLM - driven IT Operations via Telemetry Manipulation]]