KEYNOTE: Iterative Enterprise SRE Transformation
It’s easy to get discouraged reading books about industry best practices that say things like “always test in prod!” and “10 deploys a day!!” At times, they can make the goal of being a high-functioning DevOps organization feel out of reach for large enterprises, where changes to the way we operate take time to roll out. A few years ago, Vanguard started its journey to adopt Site Reliability Engineering across the IT division, and that transformation effort is still underway today. In this talk, we will share where we started, how far we’ve come since then, and the steps we’ve taken along the way as we’ve worked to evangelize changes to the way we measure availability, enable experimentation, leverage highly available architecture patterns, and learn from failure.
We’ll focus primarily on how we monitor and measure availability: developing SLIs and SLOs, managing an error budget, and leveraging Honeycomb for observability. We’ll also touch on other aspects of our SRE transformation at a higher level, including homegrown tools for self-service chaos engineering, self-service load testing, emerging cloud platforms (including serverless), proactive failure modes and effects analysis, and blameless post-incident reviews.
More about Steve Prazenica
Steve Prazenica manages the Operational Intelligence team at Vanguard, supporting the full spectrum of monitoring and observability tooling, from custom scripts to popular vendor products to OpenTelemetry. What began as a semi-random placement on a Systems Management tools team as a new hire coming out of Drexel University has turned into a career specialization and passion for effective monitoring, observability, and sustainable production support practices. He lives in southeastern Pennsylvania with his wife and two young boys. Steve is an avid reader, gamer, and vinyl record collector.