Map of Contents note for incident management in [[SRE]].
## Terms & General Remarks
- [[Incident]]
- [[PagerDutyのインシデントの定義|PagerDuty's Definition of an Incident]]
- [[Incident Response]]
- [[根本原因 - SRE|Root Cause - SRE]]
- [[PagerDuty Incident Response Process]]
- [[Incidental Incident]]
- [[Root Cause Analysis]]
- [[オンコール|On-call]]
- [[FaultとFailureの差異|Difference Between Fault and Failure]]
- [[awesome-incident-management]]
## Incident Management
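- [[インシデントのライフサイクル|Incident Lifecycle]]
- [[インシデントの分類項目|Incident Classification Categories]]
- [[インシデントのトリガー・原因|Incident Triggers and Causes]]
- [[インシデントの代表メトリクス|Representative Incident Metrics]]
- [[インシデントの代理メトリクス|Proxy Incident Metrics]]
- [[修復負債|Repair Debt]]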
- [[Incident legalism]]
- [[Measuring the Success of Incident Management at Atlassian - SREcon17Asia]]
- [[Incident Analysis - SREcon15]]
- [[Improving operations using data analytics]]
- [[Incident Command for IT - What We’ve Learned from the Fire Department - SREcon18 NA]]
- [[on-call by default]]
- [[Gitlab Incident Management]]
## Case Studies
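- [[Datadogのポストモーテムのベストプラクティス|Datadog's Postmortem Best Practices]]
- [[メルペイのPlaybook(Runbook)|Merpay's Playbook (Runbook)]]
- [[Mixiのインシデントレスポンスのフロー|Mixi's Incident Response Flow]]
- [[Freeeのインシデントレスポンスのフロー|Freee's Incident Response Flow]]
- [[Cybozuの障害対応演習|Cybozu's Incident Response Drills]]
- [[ペパボのインシデントレスポンス|Pepabo's Incident Response]]
- [[ペパボのオンコール体制アップデート|Pepabo's On-call Structure Update]]
- [[LINE Platformのサーバー障害処理プロセスと文化|LINE Platform's Server Failure Handling Process and Culture]]
- [[Backlogのインシデント対応チェックリスト|Backlog's Incident Response Checklist]]
- [[KADOKAWAの障害対応演習|KADOKAWA's Incident Response Drills]]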
- [[Automated Incident Management Through Slack - Airbnb Tech Blog]]
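- [[Ubieのインシデントレスポンス事例 - incident.io|Ubie's Incident Response Case Study - incident.io]]
- [[メルペイにおけるインシデントマネジメントとナレッジシェア|Incident Management and Knowledge Sharing at Merpay]]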
- [[How We Manage Incident Response at Honeycomb]]
- [[10xのインシデントレスポンス|10x's Incident Response]]
### [[ポストモーテム|Postmortem]]
- [[A collection of postmortems]]
- [[はてなのポストモーテム|Hatena's Postmortems]]
- [[Mixiのポストモーテム|Mixi's Postmortems]]
- [[Rettyのポストモーテム|Retty's Postmortems]]
## Related MOC
- [[System Failures - MOC]]
- [[AIOps - Fault Localization - MOC]]
- [[Alert Handling Papers]]
## Related Papers
- [[2025__arXiv__Cloud Uptime Archive - Open-Access Availability Data of Web, Cloud, and Gaming Services]]
- [[2025__NSDI__One-Size-Fits-None - Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems]]
- [[2025__ICPE__An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models]]
- FAILS: [[2025__ICPE__FAILS - A Framework for Automated Collection and Analysis of LLM Service Incidents]]
- [[2025__arXiv__An Empirical Study of Production Incidents in Generative AI Cloud Services]]
- DECO: [[2024__arXiv__DECO - Life-Cycle Management of Production-Scale Copilots]]
- COMET: [[2024__ISSRE__Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning]]
- [[2024__ISSREW__Failing and Learning - A Study of What is Learned About Reliability From Software Incidents]]
- ART: [[2024__ASE__ART - A Unified Unsupervised Framework for Incident Management in Microservice Systems]]
- [[2024__ISSRE__Large Language Models Can Provide Accurate and Interpretable Incident Triage]]
- [[2024__arXiv__AI Assistants for Incident Lifecycle in a Microservice Environment - A Systematic Literature Review]]
- [[2024__arXiv__AIOps Solutions for Incident Management - Technical Guidelines and A Comprehensive Literature Review]]
- [[2024__ICSE__Dynamic Alert Suppression Policy for Noise Reduction in AIOps]]
- [[2024__JNCA__A survey on intelligent management of alerts and incidents in IT services]]
- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- [[2024__arXiv__X-lifecycle Learning for Cloud Incident Management using LLMs]]
- [[2024__arXiv__Dependency Aware Incident Linking in Large Cloud Systems]]
- [[2024__ACDSA__A Comparative Review and Recommendations on Database Recovery Techniques]]
- [[2024__ICSE__FaultProfIT - Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems]]
- [[2024__arXiv__Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4]]
- [[2023__CLOUD__InsightsSumm - Summarization of ITOps Incidents Through In-Context Prompt Engineering]]
- [[2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]]
- [[2024__ICSE__Xpert - Empowering Incident Management with Query Recommendations via Large Language Models]]
- [[2023__ISSRE__How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle]]
- [[2023__HotNets__A Holistic View of AI-driven Network Incident Management]]
- [[2023__ICSE__An Empirical Study on Change-induced Incidents of Online Service Systems]]
- [[2023__arXiv__Empowering Practical Root Cause Analysis by Large Language Models for Cloud Incidents]]
- [[2023__ICSE__Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models]]
- [[2023__ICSE__Knowledge-based Intelligent System for IT Incident DevOps]]
- [[2023__ICSE__Incident-aware Duplicate Ticket Aggregation for Cloud Systems]]
- [[2022__SoCC__How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service]]
- [[2022__ISSRE__Going through the Life Cycle of Faults in Clouds - Guidelines on Fault Handling]]
- [[2022__ESEC-FSE__Metadata-based Retrieval for Resolution Recommendation in AIOps]]
- [[2022__arXiv__Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps]]
- [[2022__Information and Software Technology__Understanding and predicting incident mitigation time]]
- [[2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems?]]
- [[2021__ASE__Graph-based Incident Aggregation for Large-Scale Online Service Systems]]
- [[2021__ACSOS__Empirical Characterization of User Reports about Cloud Failures]]
- [[2020__ESEC-FSE__Efficient customer incident triage via linking with system incidents]]
- [[2020__ASE__How Incidental are the Incidents? Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems]]
    - [[Essential Incident]]
    - [[Incidental Incident]]
- [[2020__ESEC-FSE__Towards Intelligent Incident Management - Why We Need It and How We Make It]]
- [[2020__ESEC-FSE__Efficient incident identification from multi-dimensional issue reports via meta-heuristic search]]
- [[2020__ICSE-SEIP__Understanding and Handling Alert Storm for Online Service Systems]]
- [[2019__HotOS__What bugs cause production cloud incidents?]]
- [[2017__HotOS__Gray Failure - The Achilles Heel of Cloud Scale Systems]]
- [[2016__ISSTA__Practitioners' expectations on automated fault localization]]
- [[2014__OSDI__Simple Testing Can Prevent Most Critical Failures - An Analysis of Production Failures in Distributed Data-Intensive Systems]]
- [[2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]
- [[2003__USENIX Symposium__Why do Internet services fail, and what can be done about it?]]
    - Analyzes failure reports from the cloud computing services of that era
## SaaS
- [[PagerDuty]]
- [[incident.io]]
- [[Rootly]]
- [[Blameless]]
- [[Grafana OnCall]]
- [[Grafana Incident]]
- [[jeli.io]]
- [[Jeli インシデントレスポンスbot|Jeli Incident Response Bot]]
- [[SquadCast]]
- [[BigPanda]]
- [[Metrist]]
## Public Incident Reports
- [[VOID]]
- [[The VOID Report 2021]]
- [[A decade of major cache incidents at Twitter]]
- [[Average cost per hour of enterprise server downtime worldwide in 2019]]
- [[AI Incident Database]]
## Datasets
- [[Outage and Downtime Reports Global Service Data - Kaggle]]
- [[EventOrOutage - Rootly-AI-Labs]]
## Books
- [[Anatomy of an Incident]]
## Field Studies
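- [[SRE Workbook 障害トリガーと根本原因|SRE Workbook: Incident Triggers and Root Causes]]
- [[メトリクスの変化開始時刻と障害発生時刻との間の遅延|Delay Between the Onset of Metric Changes and the Time of Failure]]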
- [[The Real Failure Rate of EBS — PlanetScale]]
## Related Domains
- [[インシデント・コマンド・システム|Incident Command System]]
- [[医療のインシデント|Incidents in Healthcare]]
- [[ハインリッヒの法則|Heinrich's Law]]
## Persons
- [[Brent Chapman]]
## Others
- [[Why do config changes keep coming up in major incidents?]]
- [[Alarm Management A Comprehensive Guide]]
- [[AWS Incident Response Playbook Samples]]
- [[Gitlab On-call Run Books]]
- [[Move past incident response to reliability]]
- [[Oncall Compensation for Software Engineers]]
## Thinking
- [[良いインシデント対応|Good Incident Response]]