https://www.usenix.org/conference/srecon23americas/program
## General
- The Endgame of SRE
- SRE's Critical Role in the COVID-19 Pandemic Response in Government
- SRE in Transition: From Startup to Established Business (at Datadog)
- Why This Stuff Is Hard
- The Best SREs Seem to Be the Ones without an SRE Title—And What We Can Do about It?
- How SRE Makes Electric Vehicles
- Implementing SRE in a Regulated Environment
## [[Infrastructure as Code|IaC]]
- Scaling Terraform at ThousandEyes
- The Revolution Will Not Be Terraformed: SRE and the Anarchist Style
## SLI/SLO
- Not All Minutes Are Equal: The Secret behind SLO Adoption Failure
## Incident Response
- We're Still Down: A Metastable Failure Tale
- Watering the Roots of Resilience: Learning from Failure with Decision Trees
- Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
- Incident Commanders to Incident Analysts: How We Got Here (at jeli.io)
- Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
- [[Turning an Incident Report into a Design Issue with TLA+ - SREcon23 Americas ]]
- Incident Archaeology: Extracting Value from Paperwork and Narratives
- An Organizational Response to Incidents: Designing for Smooth Coordination in High Tempo, Large Scale Software Incident Response
- If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident Response Using ICS
- [[Human Observability of Incident Response - SREcon23 Americas]]
- Far from the Shallows: The Value of Deeper Incident Analysis
## Observability
- Scaling Telemetry Systems with Streaming (at Honeycomb)
- Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS (at Datadog)
- OpenTelemetry Metrics 101
- Building an APM with OpenTelemetry and OpenSource
- Founder/CTO Perspectives: The Future of Distributed Tracing
- Sto: A Better Way to Store and Query Profiler Data
- How To Take Prometheus Planet Scale: Massively Large Scale Metrics Installations
- Seeing the Invisible: Two Years at Wikipedia with W3C's Network Error Logging
## Chaos Engineering
- Chaos-Driven Development: TDD for Distributed Systems
- Tired Reacting to Certificate Outages? Build Certificate Resilient Distributed Systems Using Chaos Engineering Practices
## Distributed Systems & Networking
- Beacon: Intelligent Latency-Aware and Load Shedding Service Routing
- What Does "High Priority" Mean? The Secret to Happy Queues
- Resiliency Practices in Managing CDN (Content Delivery Network)
- The Making of an Ultra Low Latency Trading System with Go and Java
- Adaptive Concurrency Control for Mixed Analytical Workloads
## Networking
- Measuring Real-Life Latency of the Internet: A Netflix Story
- Warding against the Dark Arts: Crafting a Defense Strategy against Botnet DDoS Attacks
## Microservices
- Avoiding Cachepocalypse in the Land of the Monolith
## Platform Engineering
- Lessons Learned from 7 Years of Running Developer Platforms
- On the Wings of SREs; J.P. Morgan's Journey into the Cloud
- Hacking the Pachyderm: Scaling Servers and People (at Twitter)
- Your Infrastructure Needs to D.I.E.
- Hell Is Other Platforms
## Costs
- Financial Resiliency Engineering: Taming Cloud Costs
## Management
- Confessions of an SRE Manager
- Exploring Disconnects between Reliability Practitioners and Management/Executives