[SREcon24 Americas Conference Program | USENIX](https://www.usenix.org/conference/srecon24americas/program)

## General remarks (Opinions)

- 20 Years of SRE: Highs and Lows
- Build vs. Buy in the Midst of Armageddon
- When Your Open Source Turns To The Dark Side
- Real Talk: What We Think We Know — That Just Ain’t So
- Sustainable Reliability Engineering
- Cloudy with a Chance of Operational Excellence
- Frontend Design in SRE
- What Can You See from Here?

## Case Studies

- Product Reliability for Google Maps

## [[Observability]]

- Using Generative AI Patterns for Better Observability
- The Ticking Time Bomb of Observability Expectations
- Synthesizing Sanity with, and in Spite of, Synthetic Monitoring
- [[99.99% of Your Traces are (Probably) Trash - SREcon24 Americas]]
- Kube, Where’s My Metrics? The Challenges of Scaling Multi-Cluster Prometheus
- Workshop: Cloud-Native Observability with OpenTelemetry
- The Invisible Door: Reliability Gaps in the Front End

## [[Incident Response]]

- Thawing the Great Code Slush
- Autopsy of a Cascading Outage from a MySQL Crashing Bug
- "Logs Told Us It Was Kernel – It Wasn't"
- What Is Incident Severity, but a Lie Agreed Upon?
- Hard Choices, Tight Timelines: A Closer Look at Skip-level Tradeoff Decisions during Incidents
- [[Storytelling as an Incident Management Skill]]

## FinOps

- Scam or Savings? A Cloud vs. On-Prem Economic Slapfight

## Distributed systems

- Capacity Constraints Unveiled: Navigating Cloud Scaling Realities
- Kubernetes: The Most Graceful Termination™
- System Performance and Queuing Theory - Concepts and Application
- It Is OK to be Metastable
- Cross-System Interaction Failures: Don't Fail through the Cracks
- Gray Failure: The Achilles’ Heel of Cloud-Scale Systems
- From Chaos to Clarity: Deciphering Cache Inconsistencies in a Distributed Environment

## Database

- Sharding: Growing Systems from Node-scale to Planet-scale
- Migrating a Large Scale Search Dataset in Production in a Highly Available Manner
- The Sins of High Cardinality
- Strengthening Apache Pinot's Query Processing Engine with Adaptive Server Selection and Runtime Query Killing

## CI/CD

- OIDC and CICD: Why Your CI Pipeline Is Your Greatest Security Threat

## Security

- What We Want Is 90% the Same: Using Your Relationship with Security for Fun and Profit

## Migration

- Optimizing Resilience and Availability by Migrating from JupyterHub to the Kubeflow Notebook Controller
- Handling the Largest Domains Migration, Ever!
- Navigating the Kubernetes Odyssey: Lessons from Early Adoption and Sustained Modernization
- Taming the Linux Distribution Sprawl: A Journey to Standardization and Efficiency

## Data management

- Quash: Patterns for Data Lifecycle Management

## Technical debt

- Patching Your Way to Compliance with a Small Team and a Pile of Technical Debt

## Government

- Demystifying FedRAMP

## Management

- Meeting the Challenge of Burnout
- Triage with Mental Models
- Defence at the Boundary of Acceptable Performance
- Teaching SRE
- [[Measuring Reliability Culture to Optimize Tradeoffs - Perspectives from an Anthropologist - SREcon24 Americas]]