- [[2020__ESEC-FSE__Towards Intelligent Incident Management - Why We Need It and How We Make It]]
- [[2019__ICSE-SEIP__An Empirical Investigation of Incident Triage for Online Service Systems]]
- [Cloud Computing Market Size, Share, Growth Drivers, Opportunities & Statistics](https://www.marketsandmarkets.com/Market-Reports/cloud-computing-market-234.html)
- [[DevOps and the Cost of Downtime - Fortune 1000Best Practice Metrics Quantified]]
- G. Linden. Akamai online retail performance report: Milliseconds are critical. http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html, 2006.
---
[[2024__arXiv__Failure Diagnosis in Microservice Systems - A Comprehensive Survey and Analysis]]
> According to a report [23], an outage lasting 24 hours of mission-critical services from AWS us-east-1 can lead to a direct revenue loss of $3.4 billion, while an outage lasting 48 hours can exacerbate the financial impact to reaching $7.8 billion. In 2023 alone, notable service providers such as Microsoft [29], Google [20], and Alibaba Cloud [17] encountered noteworthy failures and downtime incidents.
[23]\: 2023. Parametrixinsurance Cloud Outage and the Fortune 500 Analysis. https://www.parametrixinsurance.com/cloud-outage-and-the-fortune-500-analysis.
[29]\: 2023. Where are we now – Microsoft 363? Cloud suite suffers another outage. https://www.theregister.com/2023/04/24/microsoft_365_search_outage/.
[20]\: 2023. Google Cloud Services Hit by Outage in Paris. https://thenewstack.io/google-cloud-services-hit-by-outage-in-paris/.
[17]\: 2023. Alibaba Cloud Health Dashboard. https://status.aliyun.com/#/historyEvent.
---
[[2018__ICSOC__Microscope―Pinpoint Performance Issues with Causal Graphs in Micro-service Environments]]
> According to [14], Amazon experiences 1% decrease in sales for additional 100 ms delay in response time per request while Google reports a 20% drop in traffic due to 500 ms delay in response time.
- [14]\: Ibidunmoye, O., Hernández-Rodriguez, F., Elmroth, E.: Performance anomaly detection and bottleneck identification. ACM Comput. Surv. (CSUR) 48(1), 4 (2015)
[[2015__CSUR__Performance Anomaly Detection and Bottleneck Identification]]
> Studies [Kissmetrics 2014] have shown that there exist correlations between the end-user performance and sales or number of visitors in popular web applications and how consistently high page latency increases the page abandonment rate. It was also shown that for a small-scale e-commerce application with a daily sales of $100,000, a 1-second page delay could lead to about 7% loss in sales annually. Also according to Huang [2011], Amazon experiences a 1% decrease in sales for an additional 100ms delay in response time while Google reports a 20% drop in traffic due to 500ms delay in response time. These implications show not only the importance but also the potential economical value of robust and automated solutions for detecting performance problems in real time.
- [Kissmetrics 2014]: Kissmetrics. 2014. How Loading Time Affects Your Bottom Line. Retrieved April 15, 2014 from http://blog.kissmetrics.com/loading-time/.
- [Huang 2011]: Cheng Huang. 2011. Public DNS System and Global Traffic Management. Retrieved April 15, 2014 from http://research.microsoft.com/en-us/um/people/chengh/slides/pubdns11.pptx.pdf.
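
The Kissmetrics figure quoted above (about 7% of annual sales lost for a shop with $100,000 in daily sales) can be turned into a dollar amount with a one-line calculation. The sketch below is only a back-of-the-envelope check: the function name `annual_sales_loss` is hypothetical, and it assumes 365 selling days per year and that the 7% loss applies uniformly to every day.

```python
def annual_sales_loss(daily_sales: float, loss_fraction: float,
                      days_per_year: int = 365) -> float:
    """Estimated yearly revenue lost, assuming a uniform loss on every day.

    days_per_year=365 is an assumption; the quoted passage only says "annually".
    """
    return daily_sales * days_per_year * loss_fraction


# Figures from the quoted passage: $100,000 daily sales, about 7% loss.
loss = annual_sales_loss(daily_sales=100_000, loss_fraction=0.07)
print(f"${loss:,.0f} per year")  # $2,555,000 per year
```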
---
[[2023__ICSE__An Empirical Study on Change-induced Incidents of Online Service Systems]]
> For example, a configuration change on backbone routers caused an incident that lasted for at least 6 hours, which made Facebook lose $60 million in revenue [10].
- [10]\: (2021) 2021 Facebook outage. [Online]. Available: https://en.wikipedia.org/wiki/2021_Facebook_outage
---
[[2024__arXiv__Early Detection of Performance Regressions by Bridging Local Performance Data and Architectural Models]]
> Such performance regressions can result in higher resource consumption (e.g., excessive memory or CPU usage), increased response time, or even field failures, thereby causing significant financial and reputation losses [1], [62]. For instance, according to a recent report [1], even a mere two-second difference in the website response time can drastically decrease user satisfaction, causing the bounce rate to surge from 9% to 38%.
[1]\: [I’ll Show You 23 Website Load Time Stats: Why Speed Matters](https://www.websitebuilderexpert.com/building-websites/website-load-time-statistics/)
---
[[2024__Dissertation__Enhancing Latency Reduction and Reliability for Internet Services with QUIC and WebRTC]]
> Achieving low latency and high reliability is a critical goal for many Internet services, largely due to their significant influence on user experience and business revenue. For instance, a report from Akamai shows that users on the low-latency site are 15% more likely to complete a purchase and 9% less likely to abandon the site after viewing just one page [1]. Amazon discovered that a 100ms reduction in page load time (PLT) can lead to a 1% increase in revenue [2], while Google observed that a 2-second delay may result in a 4.3% revenue loss per visit [3]. The performance of a service’s latency is characterized by several metrics. The total delay experienced by a service, known as the request latency [4] or the end-to-end (E2E) delay [5], spans from the moment the client sends the initial request packet to the receipt of the final response packet from the server. The E2E delay primarily consists of three components: the connection setup delay, which accounts for the time taken to establish a transport layer connection; the transmission delay, encompassing the aggregate time taken to transmit each packet within the connection; and the processing delay, which pertains to the server’s computation time needed to formulate the result at the application layer. Collectively, these delays are influenced by a myriad of factors, including the geographical proximity between the client and the server, the employed protocol, and the control parameters designated for data processing and transmission.
>
> Meanwhile, the importance of reliability is underscored by incidents such as the 4-hour outage of Amazon Web Services (AWS) that was estimated to cause $310 million in losses [6]. Reliability is primarily gauged by service level indicators ([[SLI]]s) such as the stability of the request latency and the availability, which is the fraction of time a service is usable [4].
- [1]\: E. Nygren, R. K. Sitaraman, and J. Sun, “The Akamai Network: A Platform for High-performance Internet Applications,” in Proc. ACM Special Interest Group Oper. Syst. (SIGOPS), 2010, pp. 2–19.
- [2]\: “Latency is Everywhere and it Costs You Sales - How to Crush it,” Jul. 2009. [Online]. Available: http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it
- [3]\: B. Briscoe, A. Brunstrom, A. Petlund, D. Hayes, D. Ros, I.-J. Tsang, S. Gjessing, G. Fairhurst, C. Griwodz, and M. Welzl, “Reducing Internet Latency: A Survey of Techniques and Their Merits,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 2149–2196, 2016.
- [4]\: B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems. O’Reilly, 2016. [[Site Reliability Engineering - Google|srebook]]
- [5]\: M. Iorio, F. Risso, and C. Casetti, “When latency matters: measurements and lessons learned,” ACM SIGCOMM Computer Communication Review, vol. 51, no. 4, pp. 2–13, Dec. 2021.
- [6]\: S. Vavra, “Amazon outage cost S&P 500 companies $150M,” Mar. 2017. [Online]. Available: https://www.axios.com/2017/12/15/amazon-outage-cost-sp-500-companies-150m-1513300728
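
The two definitions in the excerpt above (E2E delay as the sum of connection setup, transmission, and processing delay; availability as the fraction of time a service is usable) can be sketched in a few lines. This is an illustrative sketch only: the class and function names are hypothetical, the millisecond values are made up, and the 30-day measurement window is an assumption. Only the 4-hour outage duration comes from the quoted text. Since the excerpt says the E2E delay "primarily" consists of these three parts, the plain sum should be read as an approximation.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Per-request delay components named in the excerpt (values in ms)."""
    connection_setup_ms: float  # establishing the transport-layer connection
    transmission_ms: float      # aggregate time to transmit the packets
    processing_ms: float        # server-side computation at the application layer

    def e2e_delay_ms(self) -> float:
        # Approximation: treats the three components as the whole E2E delay.
        return self.connection_setup_ms + self.transmission_ms + self.processing_ms


def availability(usable_seconds: float, total_seconds: float) -> float:
    """Fraction of the measurement window during which the service was usable."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return usable_seconds / total_seconds


# Made-up component values, for illustration only.
timing = RequestTiming(connection_setup_ms=30.0, transmission_ms=45.0, processing_ms=25.0)
print(timing.e2e_delay_ms())  # 100.0

# A 4-hour outage (duration from the excerpt) within an assumed 30-day window.
window_seconds = 30 * 24 * 3600
outage_seconds = 4 * 3600
avail = availability(window_seconds - outage_seconds, window_seconds)
print(f"{avail:.4%}")  # 99.4444%
```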