A list of empirical research papers on systems.

## Telemetry

- [[2024__ICSE__Intelligent Monitoring Framework for Cloud Services - A Data-Driven Approach]]
- [[2021__SIGMOD__Towards Observability Data Management at Scale]]
    - Slack
- [[2022__Empirical Software Engineering__Enjoy your observability - An Industrial Survey of Microservice Tracing and Analysis]]
    - 10 companies
- [[2020__EuroSys__Borg - the next generation]]
    - Google
- [[2018__NSDI__Performance Analysis of Cloud Applications]]
    - Google

## Incident Management

- [[2025__arXiv__An Empirical Study of Production Incidents in Generative AI Cloud Services]]
- [[2024__ISSREW__Failing and Learning - A Study of What is Learned About Reliability From Software Incidents]]
- [[2024__SoCC__Demystifying the Fight Against Complexity - A Comprehensive Study of Live Debugging Activities in Production Cloud Systems]]
- [[2024__arXiv__An Empirical Study on Challenges of Event Management in Microservice Architectures]]
- [[2023__ICSE__An Empirical Study on Change-induced Incidents of Online Service Systems]]
- [[2023__ISSRE__How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle]]
- [[2023__ESEC-FSE__Adapting Performance Analytic Techniques in a Real-World Database-Centric System - An Industrial Experience Report]]
- [[2023__ICSME__An Empirical Study on Fault Diagnosis in Robotic Systems]]
- [[2023__ESEC-FSE__Detection Is Better Than Cure - A Cloud Incidents Perspective]]
    - Microsoft
- [[2023__ATC__Lifting the veil on Meta's microservice architecture - Analyses of topology and request workflows]]
    - Meta
- [[2022__ESEC-FSE__An Empirical Study of Log Analysis at Microsoft]]
    - Microsoft
- [[2022__Information and Software Technology__Understanding and predicting incident mitigation time]]
- [[2021__ISSRE__How Long Will it Take to Mitigate this Incident for Online Service Systems?]]
- [[2021__TSE__Locating Performance Regression Root Causes in the Field Operations of Web-based Systems - An Experience Report]]
- [[2021__SoCC__Characterizing Microservice Dependency and Performance - Alibaba Trace Analysis]]
    - Alibaba
- [[2013__ASE__Software Analytics for Incident Management of Online Services - An Experience Report]]

## Infrastructure

- [[2025__arXiv__Complexity at Scale - A Quantitative Analysis of an Alibaba Microservice Deployment]]
- [[2025__ICDE__Stability is Not Downtime - Comprehensive Stability Evaluation for Large-Scale Cloud Servers in Alibaba Cloud]]
- [[2025__HPCA__Revisiting Reliability in Large-Scale Machine Learning Research Clusters]]
- [[2024__JISA__Dependable Microservices in the Kubernetes era - A Practitioners Survey]]
- [[2024__arXiv__Understanding Web Application Workloads and Their Applications - Systematic Literature Review and Characterization]]
- [[2024__NSDI__Characterization of Large Language Model Development in the Datacenter]]
- [[2024__ICSE__An Empirical Study on Low GPU Utilization of Deep Learning Jobs]]
- [[2023__ICSE__An Empirical Study on Quality Issues of Deep Learning Platform]]
- [[2018__ICIIS__Datacenter Workload Classification and Characterization - An Empirical Approach]]