AWS re:Invent 2018: How AWS Minimizes the Blast Radius of Failures (ARC338)

At AWS, we obsess over operational excellence. We have a deep understanding of system availability, informed by over a decade of experience operating the cloud and by our roots in running large-scale systems for nearly a quarter-century. One thing we’ve learned is that failures come in many forms, some expected and some unexpected. It’s vital to build systems from the ground up to expect and embrace failure. A core consideration is how to minimize the “blast radius” of any failure. In this talk, we discuss a range of blast radius reduction design techniques that we employ, including cell-based architecture, shuffle sharding, Availability Zone independence, and Region isolation. We also discuss how blast radius reduction shapes our operational practices.

Duration: 00:55:44
Publisher: Amazon Web Services
