## [[LLM4SRE]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- [[2023__arXiv__Assess and Summarize - Improve Outage Understanding with Large Language Models]]
- [[2023__ICSE__Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models]]
- [[Large-language models for automatic cloud incident management]]

## Others
- [[2025__arXiv__An Empirical Study of Production Incidents in Generative AI Cloud Services]]
- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- [[2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]]
- [[2022__SoCC__How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service]]
- [[2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems?]]
- [[2020__ESEC-FSE__How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems]]
- [[2020__ESEC-FSE__Towards Intelligent Incident Management - Why We Need It and How We Make It]]
- [[2019__ASE__Continuous Incident Triage for Large-Scale Online Service Systems]]
- [[2019__HotOS__What bugs cause production cloud incidents?]]