Risk and Error Budgets


In this video, Seth Vargo and Liz Fong-Jones discuss how the SRE discipline reduces tension over velocity/stability between product teams and system operators by quantifying risk and employing error budgets. Striving for 100% availability in a service isn’t just impossible, it’s unnecessary. Maximizing stability limits how quickly new features can be delivered to users. Extreme availability produces diminishing returns as user experience becomes dominated by less reliable components like cellular networks or WiFi. While we want to reduce the risk of system failure, we also have to accept risk in order to deliver new products and features.
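The diminishing-returns point can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that a request depends on the service and on the user's network in series, that the two fail independently, and that the network is 99% available. That network figure and the function name are hypothetical and do not come from the video.

```python
def end_to_end_availability(*components: float) -> float:
    """Availability seen by the user when every component must work.

    Assumes components are serially dependent and fail independently,
    so the individual availabilities multiply.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Hypothetical availability of the user's cellular or WiFi connection.
NETWORK_AVAILABILITY = 0.99

if __name__ == "__main__":
    for service in (0.999, 0.9999, 0.99999):
        perceived = end_to_end_availability(service, NETWORK_AVAILABILITY)
        print(f"service {service:.5f} -> user sees {perceived:.6f}")
```

Under these assumptions, each additional "nine" on the service side moves the user-perceived figure by less than the one before it, while the network term caps the result at 99%.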

In the SRE discipline, an error budget is the prescriptive, quantitative measure of how much risk a service is willing to tolerate. It is the byproduct of the SLOs (Service Level Objectives) agreed upon by product owners and systems engineers: a service with a 99.9% availability SLO, for example, has an error budget of 0.1%. Risk and error budgets relate directly to several DevOps principles. Error budgets make explicit that “accidents are normal” by quantifying accidents and risk. They also enforce that “change should be gradual”, because a non-gradual change could quickly exhaust the budget, break the SLO, and halt further feature development for the quarter. This is why we say class SRE implements DevOps.
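As a rough sketch of how an error budget follows from an SLO, the snippet below treats the budget as the fraction of requests allowed to fail in a measurement window. The class name, field names, request counts, and the simple "stop releasing once the budget is spent" rule are assumptions made for illustration, not a description of any particular team's tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one measurement window (hypothetical model)."""

    slo: float            # target success ratio, e.g. 0.999 for 99.9%
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - SLO) of all requests.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        # Negative once the window's budget has been overspent.
        return self.allowed_failures - self.failed_requests

    @property
    def can_release(self) -> bool:
        # Simplified policy: risky changes continue only while budget remains.
        return self.remaining > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=400)
    spent = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=1_200)
    print(f"healthy: can_release={healthy.can_release}")
    print(f"spent:   can_release={spent.can_release}")
```

In this model, one large non-gradual change that fails many requests can consume the whole window's budget at once, which is the incentive for gradual rollouts described above.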

“Embracing Risk” in the O’Reilly SRE Book: https://goo.gl/kNpXrt
“How to Prioritize and Communicate Risks” on the GCP Blog: https://goo.gl/ffmU69

Watch more episodes from the playlist here: https://goo.gl/CKv3tV
Subscribe to the Google Cloud Platform channel for more Cloud content: https://goo.gl/S0AS51


Duration: 6:18
Publisher: Google Cloud