Incident Analysis - SREcon15 - yuuk1's Digital Garden

From the references in section 4.8 of [[Seeking SRE]].

[Incident Analysis | USENIX](https://www.usenix.org/conference/srecon15/program/presentation/lueder)

## Abstract

> Outages and incidents happen. Sh*t breaks, fibers get cut, bugs get pushed to production, teams fail to communicate, and all hell breaks loose. But those who don't learn from mistakes are doomed to repeat them...over and over and over again, with increasing frustration for those on the frontlines fixing the problems and from the users who suffer the impacts.
>
> In an effort to better learn from what happened across all products and services, Google launched an initiative in 2014 to gather data from all outages and incidents that occurred on production systems for trend analysis into system and user impacts, incident timelines, and root causes. The data is then used to drive improvements across systems, processes, and tools to improve the balance between system stability and development velocity. This talk aims to share Google's approach to setting up and running such an analysis program, some preliminary results, and lessons learned.
>
> Sue Lueder joined Google as a Site Reliability Program Manager in 2014 and is on the team responsible for disaster testing and readiness, incident management processes and tools, and incident analysis. Previous to Google, Sue was a technical program manager and a systems, software, and quality engineer in wireless and smart energy industries (OnRamp Wireless, Texas Instruments, Qualcomm). She has a M.S. in Organization Development from Pepperdine University and a B.S in Physics from UCSD.
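Outages and incidents happen. Things break, fibers get cut, bugs get pushed to production, teams fail to communicate, and chaos follows. But those who do not learn from failure are doomed to repeat the same mistakes over and over, and both the people on the front lines fixing the problems and the users who suffer the impact grow increasingly frustrated.

To learn better from what happened across all of its products and services, Google launched an initiative in 2014 to collect data on every outage and incident that occurred on production systems and to analyze trends in system and user impact, incident timelines, and root causes. The data is used to improve systems, processes, and tools, with the aim of improving the balance between system stability and development velocity. The talk aims to share Google's approach to setting up and running such an analysis program, some preliminary results, and the lessons learned.

Sue Lueder joined Google in 2014 as a Site Reliability Program Manager and is on the team responsible for disaster testing and readiness, incident management processes and tools, and incident analysis. Before joining Google, she worked as a technical program manager and as a systems, software, and quality engineer in the wireless and smart energy industries (OnRamp Wireless, Texas Instruments, Qualcomm). She holds an M.S. in Organization Development from Pepperdine University and a B.S. in Physics from UCSD.

![[Pasted image 20210915203054.png]]

[https://www.usenix.org/sites/default/files/conference/protected-files/srecon15_slides_lueder.pdf](https://www.usenix.org/sites/default/files/conference/protected-files/srecon15_slides_lueder.pdf) p.26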