Here at Server Density we monitor 100,000+ servers processing 2B metrics a day. Our service has to monitor our customers' infrastructure continuously, so downtime is critical for us and we keep training to react to incidents. We organize internal War Games where all engineers practice the processes involved in incident handling. We have seen how this improves the associated human factors, our processes and our tools.
We will go through these points:
The cost of uptime
Expect downtime: prepare, respond, postmortem
Human factors and how to improve them
Train: War Games! Realistic simulation
The incident handling process
Results:
– reveal deficiencies
– increase confidence / reduce panic
– improve coordination / time to resolution
Train:
– your people
– your processes
– your tools
Review and repeat!
You can also watch this video at the source.