https://www.usenix.org/conference/srecon23americas/program

## General
- The Endgame of SRE
- SRE's Critical Role in the COVID-19 Pandemic Response in Government
- SRE in Transition: From Startup to Established Business (at Datadog)
- Why This Stuff Is Hard
- The Best SREs Seem to Be the Ones without an SRE Title—And What We Can Do about It?
- How SRE Makes Electric Vehicles
- Implementing SRE in a Regulated Environment

## [[Infrastructure as Code|IaC]]
- Scaling Terraform at ThousandEyes
- The Revolution Will Not Be Terraformed: SRE and the Anarchist Style

## SLI/SLO
- Not All Minutes Are Equal: The Secret behind SLO Adoption Failure

## Incident Response
- We're Still Down: A Metastable Failure Tale
- Watering the Roots of Resilience: Learning from Failure with Decision Trees
- Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
- Incident Commanders to Incident Analysts: How We Got Here (at jeli.io)
- Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
- [[Turning an Incident Report into a Design Issue with TLA+ - SREcon23 Americas ]]
- Incident Archaeology: Extracting Value from Paperwork and Narratives
- An Organizational Response to Incidents: Designing for Smooth Coordination in High Tempo, Large Scale Software Incident Response
- If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident Response Using ICS
- [[Human Observability of Incident Response - SREcon23 Americas]]
- Far from the Shallows: The Value of Deeper Incident Analysis

## Observability
- Scaling Telemetry Systems with Streaming (at Honeycomb)
- Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS (at Datadog)
- OpenTelemetry Metrics 101
- Building an APM with OpenTelemetry and OpenSource
- Founder/CTO Perspectives: The Future of Distributed Tracing
- Sto: A Better Way to Store and Query Profiler Data
- How To Take Prometheus Planet Scale: Massively Large Scale Metrics Installations
- Seeing the Invisible: Two Years at Wikipedia with W3C's Network Error Logging

## Chaos Engineering
- Chaos-Driven Development: TDD for Distributed Systems
- Tired Reacting to Certificate Outages? Build Certificate Resilient Distributed Systems Using Chaos Engineering Practices

## Distributed systems & Networking
- Beacon: Intelligent Latency-Aware and Load Shedding Service Routing
- What Does "High Priority" Mean? The Secret to Happy Queues
- Resiliency Practices in Managing CDN (Content Delivery Network)
- The Making of an Ultra Low Latency Trading System with Go and Java
- Adaptive Concurrency Control for Mixed Analytical Workloads

## Networking
- Measuring Real-Life Latency of the Internet: A Netflix Story
- Warding against the Dark Arts: Crafting a Defense Strategy against Botnet DDoS Attacks

## Microservices
- Avoiding Cachepocalypse in the Land of the Monolith

## Platform Engineering
- Lessons Learned from 7 Years of Running Developer Platforms
- On the Wings of SREs; J.P. Morgan's Journey into the Cloud
- Hacking the Pachyderm: Scaling Servers and People (at Twitter)
- Your Infrastructure Needs to D.I.E.
- Hell Is Other Platforms

## Costs
- Financial Resiliency Engineering: Taming Cloud Costs

## Management
- Confessions of an SRE Manager
- Exploring Disconnects between Reliability Practitioners and Management/Executives