Map of Contents note for incident management in [[SRE]].

## Term & General remarks
- [[Incident]]
- [[PagerDutyのインシデントの定義]]
- [[Incident Response]]
- [[根本原因 - SRE]]
- [[PagerDuty Incident Response Process]]
- [[Incidental Incident]]
- [[Root Cause Analysis]]
- [[オンコール]]
- [[FaultとFailureの差異]]
- [[awesome-incident-management]]

## Incident Management
- [[インシデントのライフサイクル]]
- [[インシデントの分類項目]]
- [[インシデントのトリガー・原因]]
- [[インシデントの代表メトリクス]]
- [[インシデントの代理メトリクス]]
- [[修復負債]]
- [[Incident legalism]]
- [[Measuring the Success of Incident Management at Atlassian - SREcon17Asia]]
- [[Incident Analysis - SREcon15]]
- [[Improving operations using data analytics]]
- [[Incident Command for IT - What We’ve Learned from the Fire Department - SREcon18 NA]]
- [[on-call by default]]
- [[Gitlab Incident Management]]

## Case Studies
- [[Datadogのポストモーテムのベストプラクティス]]
- [[メルペイのPlaybook(Runbook)]]
- [[Mixiのインシデントレスポンスのフロー]]
- [[Freeeのインシデントレスポンスのフロー]]
- [[Cybozuの障害対応演習]]
- [[ペパボのインシデントレスポンス]]
- [[ペパボのオンコール体制アップデート]]
- [[LINE Platformのサーバー障害処理プロセスと文化]]
- [[Backlogのインシデント対応チェックリスト]]
- [[KADOKAWAの障害対応演習]]
- [[Automated Incident Management Through Slack - Airbnb Tech Blog]]
- [[Ubieのインシデントレスポンス事例 - incident.io]]
- [[メルペイにおけるインシデントマネジメントとナレッジシェア]]
- [[How We Manage Incident Response at Honeycomb]]
- [[10xのインシデントレスポンス]]

### [[ポストモーテム|Postmortem]]
- [[A collection of postmortems]]
- [[はてなのポストモーテム]]
- [[Mixiのポストモーテム]]
- [[Rettyのポストモーテム]]

## Related MOC
- [[System Failures - MOC]]
- [[AIOps - Fault Localization - MOC]]
- [[Alert Handling Papers]]

## Related Papers
- [[2025__arXiv__Cloud Uptime Archive - Open-Access Availability Data of Web, Cloud, and Gaming Services]]
- [[2025__NSDI__One-Size-Fits-None - Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems]]
- [[2025__ICPE__An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models]]
- FAILS: [[2025__ICPE__FAILS - A Framework for Automated Collection and Analysis of LLM Service Incidents]]
- [[2025__arXiv__An Empirical Study of Production Incidents in Generative AI Cloud Services]]
- DECO: [[2024__arXiv__DECO - Life-Cycle Management of Production-Scale Copilots]]
- COMET: [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- [[2024__ISSREW__Failing and Learning - A Study of What is Learned About Reliability From Software Incidents]]
- ART: [[2024__ASE__ART - A Unified Unsupervised Framework for Incident Management in Microservice Systems]]
- [[2024__ISSRE__Large Language Models Can Provide Accurate and Interpretable Incident Triage]]
- [[2024__arXiv__AI Assistants for Incident Lifecycle in a Microservice Environment - A Systematic Literature Review]]
- [[2024__arXiv__AIOps Solutions for Incident Management - Technical Guidelines and A Comprehensive Literature Review]]
- [[2024__ICSE__Dynamic Alert Suppression Policy for Noise Reduction in AIOps]]
- [[2024__JNCA__A survey on intelligent management of alerts and incidents in IT services]]
- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- [[2024__arXiv__X-lifecycle Learning for Cloud Incident Management using LLMs]]
- [[2024__arXiv__Dependency Aware Incident Linking in Large Cloud Systems]]
- [[2024__ACDSA__A Comparative Review and Recommendations on Database Recovery Techniques]]
- [[2024__ICSE__FaultProfIT - Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems]]
- [[2024__arXiv__Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4]]
- [[2023__CLOUD__InsightsSumm - Summarization of ITOps Incidents Through In-Context Prompt Engineering]]
- [[2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]]
- [[2024__ICSE__Xpert - Empowering Incident Management with Query Recommendations via Large Language Models]]
- [[2023__ISSRE__How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle]]
- [[2023__HotNets__A Holistic View of AI-driven Network Incident Management]]
- [[2023__ICSE__An Empirical Study on Change-induced Incidents of Online Service Systems]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- [[2023__ICSE__Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models]]
- [[2023__ICSE__Knowledge-based Intelligent System for IT Incident DevOps]]
- [[2023__ICSE__Incident-aware Duplicate Ticket Aggregation for Cloud Systems]]
- [[2022__SoCC__How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service]]
- [[2022__ISSRE__Going through the Life Cycle of Faults in Clouds - Guidelines on Fault Handling]]
- [[2022__ESEC-FSE__Metadata-based Retrieval for Resolution Recommendation in AIOps]]
- [[2022__arXiv__Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps]]
- [[2022__Information and Software Technology__Understanding and predicting incident mitigation time]]
- [[2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems?]]
- [[2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]]
- [[2021__ACSOS__Empirical Characterization of User Reports about Cloud Failures]]
- [[2020__ESEC-FSE__Efficient customer incident triage via linking with system incidents]]
- [[2020__ASE__How Incidental are the Incidents? Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems]]
    - [[Essential Incident]]
    - [[Incidental Incident]]
- [[2020__ESEC-FSE__Towards Intelligent Incident Management - Why We Need It and How We Make It]]
- [[2020__ESEC-FSE__Efficient incident identification from multi-dimensional issue reports via meta-heuristic search]]
- [[2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]]
- [[2019__HotOS__What bugs cause production cloud incidents?]]
- [[2017__HotOS__Gray Failure - The Achilles Heel of Cloud Scale Systems]]
- [[2016__ISSTA__Practitioners' expectations on automated fault localization]]
- [[2014__OSDI__Simple Testing Can Prevent Most Critical Failures - An Analysis of Production Failures in Distributed Data-Intensive Systems]]
- [[2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]
- [[2003__USENIX Symposium__Why do Internet services fail, and what can be done about it?]]
    - Analyzes failure reports from the cloud computing services of that time

## SaaS
- [[PagerDuty]]
- [[incident.io]]
- [[Rootly]]
- [[Blameless]]
- [[Grafana OnCall]]
- [[Grafana Incident]]
- [[jeli.io]]
- [[Jeli インシデントレスポンスbot]]
- [[SquadCast]]
- [[BigPanda]]
- [[Metrist]]

## Public Incident Reports
- [[VOID]]
- [[The VOID Report 2021]]
- [[A decade of major cache incidents at Twitter]]
- [[Average cost per hour of enterprise server downtime worldwide in 2019]]
- [[AI Incident Database]]

## Datasets
- [[Outage and Downtime Reports Global Service Data - Kaggle]]
- [[EventOrOutage - Rootly-AI-Labs]]

## Books
- [[Anatomy of an Incident]]

## Field Studies
- [[SRE Workbook 障害トリガーと根本原因]]
- [[メトリクスの変化開始時刻と障害発生時刻との間の遅延]]
- [[The Real Failure Rate of EBS — PlanetScale]]

## Related Domains
- [[インシデント・コマンド・システム|Incident Command System]]
- [[医療のインシデント]]
- [[ハインリッヒの法則]]

## Persons
- [[Brent Chapman]]

## Others
- [[Why do config changes keep coming up in major incidents?]]
- [[Alarm Management A Comprehensive Guide]]
- [[AWS Incident Response Playbook Samples]]
- [[Gitlab On-call Run Books]]
- [[Move past incident response to reliability]]
- [[Oncall Compensation for Software Engineers]]

## Thinking
- [[良いインシデント対応]]