Measuring the Success of Incident Management at Atlassian - SREcon17Asia

From the references for section 4.8 of [[Seeking SRE]].

[Measuring the Success of Incident Management at Atlassian | USENIX](https://www.usenix.org/conference/srecon17asia/program/presentation/millar)

> When an incident happens it's the worst possible time to be bogged down with confusing systems and processes. A well defined Incident Management Process that's light-weight and supported by good automation offers a way to get fast, easy and predictable results during an incident, but if you don't implement the right things in the right way you risk bad results, such as high time-to-recovery, at critical times.
>
> Find out how Atlassian drives value out of the Incident Management process and what metrics we use to track it. We'll also cover how we created automation to remove the overhead in managing incidents and deep dive into a case study to explain how it all ties together.
>
> The target audience for this conceptual session is people who are involved in the management of incidents, such as SREs and delivery team members.

![[Pasted image 20210915202517.png]]

- Control Systems
    - Ticket System
    - War Room
    - Incident Doc
    - Group Chat