LLM4SRE - yuuk1's Digital Garden

[[AIOps]] using [[LLM]]

## Products

- [[LLMベースのAIOpsプロダクト|LLM-based AIOps products]]

## Blog / Slides

- [[Classifying Error Logs with AI Can DeepSeek R1 Outperform GPT-4o and Llama 3?]]
- [[EventOrOutage - Rootly-AI-Labs]]

## Papers

- <https://github.com/IntelligentDDS/awesome-papers/blob/main/LLM4Ops/README.md>
- [[Awesome LLM AIOps]]
- [[LLM4DB - code4DB]]

### General Remarks

- [[2025__CSUR__A Survey of AIOps for Failure Management in the Era of Large Language Models]]
- [[2025__arXiv__A Survey on AgentOps - Categorization, Challenges, and Future Directions]]
- [[2025__arXiv__Intent-based System Design and Operation]]
- ChatTS: [[2024__arXiv__ChatTS - Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning]]
- [[2023__ICDM__Exploring Large Language Models for Low-Resource IT Information Extraction]]
- [[2023__CASCON__Proactive Continuous Operations using Large Language Models (LLMs) and AIOps]]
- Owl: [[2024__ICLR__Owl - A Large Language Model for IT Operations]]

### Benchmark

- OpenRCA: [[2025__ICLR__OpenRCA - Can Large Language Models Locate the Root Cause of Software Failures]]
- AIOpsLab: [[2025__MLSys__AIOpsLab - A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud]]
- [[2025__FSE__AIOpsLab in Action - An Open Platform for AIOps Research]]
- ITBench: [[2025__arXiv__ITBench - Evaluating AI Agents across Diverse Real-World IT Automation Tasks]]
- OpsEval: [[2023__arXiv__OpsEval - A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models]]

### Failure Management

- ServiceOdyssey: [[2025__arXiv__Enabling Autonomic Microservice Management through Self-Learning Agents]]
- FLASH: [[2024__MSResearch__FLASH - A Workflow Automation Agent for Diagnosing Recurring Incidents]]
- LLexus: [[2024__OSR__LLexus - an AI agent system for incident management]]
- [[2025__CSUR__A Survey of AIOps for Failure Management in the Era of Large Language Models]]

#### Applications

- SynergyRCA: [[2025__arXiv__Simplifying Root Cause Analysis in Kubernetes with StateGraph and LLM]]
- StepFly: [[2025__arXiv__Agentic Troubleshooting Guide Automation for Incident Management]]
- AlertGuardian: [[2025__ASE__AlertGuardian - Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems]]
- STRATUS: [[2025__NeurIPS__STRATUS - A Multi agent System for Autonomous Reliability Engineering of Modern Clouds]]
- RCLAgent: [[2025__arXiv__Adaptive Root Cause Localization for Microservice Systems with Multi-Agent Recursion-of-Thought]]
- MicroRCA-Agent: [[2025__arXiv__MicroRCA-Agent - Microservice Root Cause Analysis Method Based on Large Language Model Agents]]
- Flow of Action: [[2025__WWW__Flow of Action - SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis]]
- TAMO: [[2025__arXiv__TAMO - Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data]]
- ThinkFL: [[2024__arXiv__ThinkFL - Self-Refining Failure Localization for Microservice Systems via Reinforcement Fine-Tuning]]
- eARCO: [[2025__arXiv__eARCO - Efficient Automated Root Cause Analysis with Prompt Optimization]]
- GALA: [[2025__arXiv__Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis]]
- LogInsight: [[2025__TSC__Accurate and Interpretable Log-Based Fault Diagnosis using Large Language Models]]
- [[2025__ICCS__AIOps for Reliability - Evaluating Large Language Models for Automated Root Cause Analysis in Chaos Engineering]]
- InsightAI: [[2025__CAIN__InsightAI - Root Cause Analysis in Large Log Files with Private Data Using Large Language Model]]
- DECO: [[2024__arXiv__DECO - Life-Cycle Management of Production-Scale Copilots]]
- MonitorAssistant: [[2024__ESEC-FSE__MonitorAssistant - Simplifying Cloud Service Monitoring via Large Language Models]]
- [[2024__Electronics__Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs]]
- RasRCA: [[2024__ASE__The Potential of One-Shot Failure Root Cause Analysis - Collaboration of the Large Language Model and Small Classifier]]
- DualLMAD: [[2024__ISSRE__Multivariate Time Series Anomaly Detection based on Pre-trained Models with Dual-Attention Mechanism]]
- Cloud Atlas: [[2024__arXiv__Cloud Atlas - Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight]]
- COMET: [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- [[2024__ISSTA__Face It Yourselves - An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs]]
- [[2024__ICSE-SEIP__Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations]]
- [[2024__arXiv__mABC - multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture]]
- [[2024__none__Automated processing of monitoring data for proactive root cause analysis in service-based systems]]
- [[2024__arXiv__X-lifecycle Learning for Cloud Incident Management using LLMs]]
- [[2024__ICSE__Knowledge-aware Alert Aggregation in Large-scale Cloud Systems - a Hybrid Approach]]
- [[2024__arXiv__Exploring LLM-based Agents for Root Cause Analysis]]
- RCACopilot: [[2024__EuroSys__Automatic Root Cause Analysis via Large Language Models for Cloud Incidents]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- [[2024__arXiv__Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4]]
- Xpert: [[2024__ICSE__Xpert - Empowering Incident Management with Query Recommendations via Large Language Models]]
- [[2023__CLOUD__InsightsSumm - Summarization of ITOps Incidents Through In-Context Prompt Engineering]]
- [[2023__CASCON__ADARMA - Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models]]
- RCAgent: [[2024__CIKM__RCAgent - Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models]]
- PACE-LM: [[2023__arXiv__PACE-LM - Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis]]
- Oasis: [[2023__arXiv__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- [[2023__ICSE__Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models]]
- [[Large-language models for automatic cloud incident management]]

#### Databases

- AgentFM: [[2025__FSE__AgentFM - Role-Aware Failure Management for Distributed Databases with LLM-Driven Multi-Agents]]
- LLMIdxAdvis: [[2025__PVLDB__LLMIdxAdvis - Resource-Efficient Index Advisor Utilizing Large Language Model]]
- DBAIOps: [[2025__arXiv__DBAIOps - A Reasoning LLM Enhanced Database Operation and Maintenance System using Knowledge Graphs]]
- Andromeda: [[2025__SIGMOD__Automatic Database Configuration Debugging using Retrieval-Augmented Language Models]]
- [[2024__arXiv__Query Performance Explanation through Large Language Model for HTAP Systems]]
- MLETune: [[2024__ICPADS__MLETune - Streamlining Database Knob Tuning via Multi-LLMs Experts Guided Deep Reinforcement Learning]]
- Panda: [[2024__CIDR__Panda - Performance Debugging for Databases using LLM Agents]]
- [[2024__VLDB__DB-BERT - making database tuning tools “read” the manual]]

[[Tsinghua Database Group]]

- D-Bot: [[2023__arXiv__D-Bot - Database Diagnosis System using Large Language Models]]
- [[2023__arXiv__LLM As DBA]]

#### Network

- BiAn: [[2025__SIGCOMM__Towards LLM-Based Failure Localization in Production-Scale Networks]]
- [[2025__ACCESS__Small Language Model Agent for the Operations of Continuously Updating ICT Systems]]
- NetSemantic: [[2025__arXiv__Adapting Network Information into Semantics for Generalizable and Plug-and-Play Multi-Scenario Network Diagnosis]]
- [[2024__OPAC__Large language model-based optical network log analysis using LLaMA2 with instruction tuning]]
- [[2023__HotNets__A Holistic View of AI-driven Network Incident Management]]

#### General

- LLMAD: [[2024__arXiv__Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection]]

### Operational Log Analysis / Logging

- [[2025__ICPE__LogAn - An LLM-Based Log Analytics Tool with Causal Inferencing]]
- [[2025__PAKDD__Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection]]
- [[2025__I2CACIS__Extensive Log Analysis Using Small Language Models in Minimal Resource Environments]]
- LogBabylon: A Unified Framework for Cross-Log File Integration and Analysis
- Chatting with Logs: [[2024__arXiv__Chatting with Logs - An exploratory study on Finetuning LLMs for LogQL]]
- SuperLog: [[2024__arXiv__Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge]]
- LogLLM: [[2024__arXiv__LogLLM - Log-based Anomaly Detection Using Large Language Models]]
- LoFI: [[2024__ISSRE__Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis]]
- LogRAG: [[2024__ISSRE__Leveraging RAG-Enhanced Large Language Model for Semi-Supervised Log Anomaly Detection]]
- LLMeLog: [[2024__ISSRE__LLMeLog - An Approach for Anomaly Detection based on LLM-enriched Log Events]]
- SelfLog: [[2024__ISSRE__Self-Evolutionary Group-wise Log Parsing Based on Large Language Mode]]
- [[2024__arXiv__Studying and Benchmarking Large Language Models For Log Level Suggestion]]
- LogGenius: [[2024__ICWS__LogGenius - An Unsupervised Log Parsing Framework with Zero-shot Prompt Engineering]]
- LogLM: [[2024__arXiv__LogLM - From Task-based to Instruction-based Automated Log Analysis]]
- [[2024__arXiv__LUK - Empowering Log Understanding with Expert Knowledge from Large Language Models]]
- [[2024__KDD__LogParser-LLM - Advancing Efficient Log Parsing with Large Language Models]]
- [The Effectiveness of Compact Fine-Tuned LLMs in Log Parsing](https://www.semanticscholar.org/paper/The-Effectiveness-of-Compact-Fine-Tuned-LLMs-in-Log-Mehrabi-Hamou-Lhadj/86c21bffd675216befc20afd3cda69c056f84dab?utm_source=alert_email&utm_content=LibraryFolder&utm_campaign=AlertEmails_WEEKLY&utm_term=PaperCitation+LibraryFolder+AuthorPaper&email_index=61-0-122&utm_medium=38776941)
- LogEval: [[2024__arXiv__LogEval - A Comprehensive Benchmark Suite for Large Language Models In Log Analysis]]
- ULog: [[2024__arXiv__ULog - Unsupervised Log Parsing with Large Language Models through Log Contrastive Units]]
- DivLog: [[2024__ICSE__DivLog - Log Parsing with Prompt Enhanced In-Context Learning]]
- LLMParser: [[2023__arXiv__LLMParser - A LLM-based Log Parsing Framework]]
- LogPrompt: [[2023__arXiv__LogPrompt - Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis]]
- LogDiv: [[2023__arXiv__Prompting for Automatic Log Template Extraction]]
- [[2023__CLOUD__Learning Representations on Logs for AIOps]]

### Logging

- [[2024__TSE__Exploring the Effectiveness of LLMs in Automated Logging Statement Generation - An Empirical Study]]
- UniLog: [[2024__ICSE__UniLog - Automatic Logging via LLM and In-Context Learning]]
- [[2023__arXiv__Exploring the Effectiveness of LLMs in Automated Logging Generation - An Empirical Study]]

### Configuration Management

- Ciri: [[2025__ICSE__Large Language Models as Configuration Validators]]
- [[2024__arXiv__Large Language Models for Zero Touch Network Configuration Management]]
- PerfSense: [[2024__arXiv__Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents]]
- LogConfigLocalizer: [[2024__ISSTA__Face It Yourselves - An LLM-Based Two-Stage Strategy to Localize Configuration Errors via Logs]]
- Ciri: [[2023__arXiv__Configuration Validation with Large Language Models]]

### [[Infrastructure as Code|IaC]]

- [[2024__ICOIN__A Survey of using Large Language Models for Generating Infrastructure as Code]]
- [[2023__HotNets__Simplifying Cloud Management with Cloudless Computing]]

### Data management

- [[2024__arXiv__LLM-Enhanced Data Management]]

### Optimization

- [[2024__ICSSAS__Optimizing Cloud Infrastructure Management Using Large Language Models - A DevOps Perspective]]

### [[Chaos Engineering]]

- ChaosEater: [[2025__arXiv__ChaosEater - Fully Automating Chaos Engineering with Large Language Models]]

### Incident Management

- FlowXpert: [[2025__KDD__FlowXpert - Expertizing Troubleshooting Workflow Orchestration with Knowledge Base and Multi-Agent Coevolution]]
- VOCE: [[2025__FASE__VOCE - A Virtual On-Call Engineer for Automated Alert Incident Analysis Using a Large Language Model]]

### Security

- [[2025__arXiv__When AIOps Become AI Oops - Subverting LLM - driven IT Operations via Telemetry Manipulation]]