A list of empirical research papers on systems.
## Telemetry
- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- [[2021__SIGMOD__Towards Observability Data Management at Scale]]
    - Slack
- [[2022__Empirical Software Engineering__Enjoy your observability - An Industrial Survey of Microservice Tracing and Analysis]]
    - 10 companies
- [[2020__EuroSys__Borg - the next generation]]
    - Google
- [[2018__NSDI__Performance Analysis of Cloud Applications]]
    - Google
## Incident Management
- [[2025__arXiv__An Empirical Study of Production Incidents in Generative AI Cloud Services]]
- [[2024__ISSREW__Failing and Learning - A Study of What is Learned About Reliability From Software Incidents]]
- [[2024__SoCC__Demystifying the Fight Against Complexity - A Comprehensive Study of Live Debugging Activities in Production Cloud Systems]]
- [[2024__arXiv__An Empirical Study on Challenges of Event Management in Microservice Architectures]]
- [[2023__ICSE__An Empirical Study on Change-induced Incidents of Online Service Systems]]
- [[2023__ISSRE__How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle]]
- [[2023__ESEC-FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System - An Industrial Experience Report]]
- [[2023__ICSME__An Empirical Study on Fault Diagnosis in Robotic Systems]]
- [[2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]]
    - Microsoft
- [[2023__ATC__Lifting the veil on Meta's microservice architecture - Analyses of topology and request workflows]]
    - Meta
- [[2022__ESEC-FSE__An Empirical Study of Log Analysis at Microsoft]]
    - Microsoft
- [[2022__Information and Software Technology__Understanding and predicting incident mitigation time]]
- [[2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems?]]
- [[2021__TSE__Locating Performance Regression Root Causes in the Field Operations of Web-based Systems - An Experience Report]]
- [[2021__SoCC__Characterizing Microservice Dependency and Performance - Alibaba Trace Analysis]]
    - Alibaba
- [[2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]
## Infrastructure
- [[2025__arXiv__Complexity at Scale - A Quantitative Analysis of an Alibaba Microservice Deployment]]
- [[2025__ICDE__Stability is Not Downtime - Comprehensive Stability Evaluation for Large-Scale Cloud Servers in Alibaba Cloud]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[2024__JISA__Dependable Microservices in the Kubernetes era - A Practitioners Survey]]
- [[2024__arXiv__Understanding Web Application Workloads and Their Applications - Systematic Literature Review and Characterization]]
- [[2024__NSDI__Characterization of Large Language Model Development in the Datacenter]]
- [[2024__ICSE__An Empirical Study on Low GPU Utilization of Deep Learning Jobs]]
- [[2023__ICSE__An Empirical Study on Quality Issues of Deep Learning Platform]]
- [[2018__ICIIS__Datacenter Workload Classification and Characterization - An Empirical Approach]]