Put a Stake In Your Steering Wheel

Published: 2016 Feb 16 by Avi Deitcher

While at the Container Summit, I heard a great (if somewhat perverse) line from Jacob Groundwater of New Relic. I liked it so much that I tweeted it out immediately:

If you want people to drive slower, don't give them an airbag; put a spike in their steering wheel!

While a rather morbid image, Jacob hit on a core truth: if you make dangerous activities safer, people will do more dangerous things. Or, as Milton Friedman put it (about the welfare program AFDC), if you pay people for something, you will get more of that something.

On the other hand, if you want people not to do a dangerous activity, make it more dangerous.

I am not actually advocating that we put stakes in steering wheels! When people will drive 120 km/h anyway, airbags are a valuable way to reduce a risk that already exists.

However, when it comes to systems and software development, Jacob has a key insight:

Software itself inevitably fails
Systems to which the software connects inevitably fail
Systems underlying the software inevitably fail
Don't make system failures hurt the software less; make them hurt more, and more often, to force the software to handle them better

If you think your car can safely handle a crash at 200 km/h without you getting hurt, you are likely to drive 200 km/h. If you believe anything over 100 km/h will maim you, you are unlikely to drive faster than 100 km/h.

If people expect the systems on which their software depends to fail, they will make their own software handle failure better.

Netflix built this into their core years ago with their Chaos Monkey, which forces random failures of parts of their system. Netflix engineers cannot build software hoping it won't fail, or expecting the underlying infrastructure to handle recovery; they know their systems will suffer failure, because Chaos Monkey will make it happen.
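To make the idea concrete, here is a minimal sketch of deliberate failure injection. It is not Netflix's implementation: Chaos Monkey terminates running instances, while this toy version injects failures at the level of a single function call. Every name in it (chaotic, call_with_retry, fetch_price, InjectedFailure) is hypothetical and chosen for illustration only.

```python
import random


class InjectedFailure(Exception):
    """Raised on purpose by the fault injector, not by the real dependency."""


def chaotic(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts, *args, **kwargs):
    """Call func up to `attempts` times; re-raise the last injected failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFailure as err:
            last_error = err
    raise last_error


def fetch_price(symbol):
    # Hypothetical stand-in for a call to a downstream system.
    return {"symbol": symbol, "price": 100.0}
```

Wrapping a dependency as chaotic(fetch_price, 0.3, random.Random()) makes about three calls in ten fail, so a caller that skips something like call_with_retry finds out during development rather than in production.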

At the Container Summit, Jake Loveless, CEO of Lucera, which provides "AWS for adults" as he put it, gave a similar example. Lucera provides container and VM hosting for high-performance financial services organizations: the flexibility of a public cloud; the security, performance and ultra-low latency that financial services require; and network connectivity to just about every major financial institution.

Jake said that one of his financial services customers has an interesting setup: every day, at the end of trading, all of their virtual server instances and containers are blown away ("put a bullet in it") and recreated.

Because the instances are destroyed and recreated every day, anyone who might change a live instance knows that the change will be gone by the end of the day. The only way to make a change persist is to go through the proper build and deployment process.

One could argue (I would) that no one should ever have access to production instances, but that may be a little idealistic. I suspect the customer under discussion had serious problems with live changes that had to be solved once and for all.

This customer has the "Grim Reaper": Chaos Monkey for processes instead of architecture. It doesn't blow away live instances randomly during trading hours, but it does blow away and recreate them after hours, every day, like clockwork. Instead of forcing developers to build for failure, they are forcing administrators to operate for reproducibility.

Summary

Ask yourself if either of these keeps you up at night:

Some of my systems might fail
Some of my configuration and setup might be lost

If either of these has you worried - if a Chaos Monkey or Grim Reaper in your infrastructure would give you a cold sweat - your processes and systems are ripe for more reliable and, yes, much faster operations. Ask us to help.