Don't Defer the Problem, Resolve It!
I have been pondering this article for quite some time, and then came across a great, similar quote from Bryan Cantrill: "Don't just reboot it, goddamn it! Debug it!" Bryan is always a great speaker, so watch it here.
Time and time and time again, I come across companies and people with systems that are misbehaving. Time and time and time again, people suggest "why don't we just restart/reboot it?" What these people really are suggesting is, "why don't we solve it for now, and not worry about when it happens again?"
The people making these suggestions are split into two categories:
- Lazy fools
- Sincere strugglers
"Lazy Fools" grew up with desktop PCs, usually in the "IS" department, supporting impatient people who "just want my PC to work again." Unfortunately, while this may be tolerable (if mistaken) in end-user support, it is intolerable in services. Rebooting/restarting never solves the problem; it just restores service for now. These are not the same thing. Sure, the customer support manager may thank you for getting it back up and running in one minute today, but you have not solved your issue, and you don't even know why it happened!
A CEO friend of mine recently told me his philosophy: if it can happen once, it will happen again. When your philosophy is "reboot it to solve it", I can promise 3 things will occur:
- This will happen again;
- It will have a much worse impact when it does;
- You will suffer far more then than you benefit now.
"Sincere strugglers," by contrast, are people who sincerely want not only to get it back and working right now, but really want it not to happen again. They care about the service and system, not just the accolades of today, and they want to fix it. However, they struggle because they are stuck under a combination of 2 terrible forces:
- The pressure to fix today's issue as soon as possible, without enough tools to capture enough information to figure out what really happened
- Lack of experience with true "solution-oriented" cultures
In essence, the difference between sincere strugglers and lazy fools is not one of education, but one of attitude. With enough tools and support (which is far more important) in hand, sincere strugglers become real solvers.
How do Sincere Strugglers get support to become Real Solvers? As always, it depends on a mix of investment and culture.
It Ain't Over 'Til It's Over
First, the culture, from the top down, must never, ever, consider a problem solved until you understand three things:
- Why it happened;
- Under what circumstances it happened;
- How you can prevent it from happening again or, at least, mitigate the impact.
To use ITIL terminology, every incident is caused by a problem. Solve the problem, not just the incident.
"Fair enough," executives say, "but time is of the essence! If I take an hour (or two, or ten) to solve it now, I lose that much business. Restart it, and I get my service up and running again in a fraction of the time. Do you know how much each lost hour costs us?"
Free advice (from a consultant, no less): Never dismiss this argument out of hand.
How much downtime to suffer when there is a choice is a business decision, not a technical decision. If you can have a service up and running in 5 minutes without any understanding of the issue, or in 60 minutes while gaining full understanding, that is not a decision for an engineer, or even a CIO, to make. It is a decision for a CEO to make. Only she or he knows what 55 minutes today is worth to the business versus discovering how to prevent this issue from happening again.
However, the technologist does have two jobs related to this decision:
- Present the risk factors. It is up to the technologist to show the CEO the expectation (not the risk) that this will happen again, or the inability to predict it.
- Capture data to solve offline. Put in place the toolset to capture information in a lot less than 60 minutes.
First, you, the technologist, must make sure the CEO is weighing 55 minutes today vs. 3 hours in a month, not 55 minutes today vs. 5 minutes today. If the CEO doesn't have the whole picture, give it to her or him. Better yet, have it ready beforehand. Don't scramble when the CEO needs to hear right now, under pressure, how long to let you run.
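The comparison the CEO needs is simple expected-value arithmetic. As a minimal sketch in Python: the 5-minute, 60-minute, and 3-hour figures come from the text, while the recurrence probabilities and the cost per minute are invented placeholders that you would replace with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One way of handling the incident, with its downtime consequences."""
    name: str
    downtime_now_min: float          # minutes of outage today
    recurrence_probability: float    # assumed chance the failure returns
    recurrence_downtime_min: float   # minutes of outage if it does return

    def expected_downtime_min(self) -> float:
        # Today's outage plus the probability-weighted future outage.
        return (self.downtime_now_min
                + self.recurrence_probability * self.recurrence_downtime_min)


def expected_cost(option: Option, cost_per_minute: float) -> float:
    """Expected cost of an option at a flat (assumed) cost per minute."""
    return option.expected_downtime_min() * cost_per_minute


if __name__ == "__main__":
    # Hypothetical inputs: rebooting leaves the cause unknown, so the
    # failure is assumed very likely to recur; debugging is assumed to
    # make recurrence unlikely.
    reboot = Option("reboot now", 5, 0.9, 180)
    debug = Option("debug now", 60, 0.1, 180)
    for option in (reboot, debug):
        minutes = option.expected_downtime_min()
        cost = expected_cost(option, cost_per_minute=100)
        print(f"{option.name}: {minutes:.0f} expected minutes, cost {cost:.0f}")
```

The point is not the particular numbers but that the future outage appears in both columns. If you cannot estimate the recurrence probability at all, that inability to predict is itself what you report.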
Second, find and implement tools and methodologies to gather information with minimal disruption to your business. Do you have a way of restoring service to a customer while keeping a misbehaving system running and collecting information? Can you capture data while the event is happening, so that you can restore service immediately and then work "offline" to solve the issue? If your design doesn't support it, think long and hard about your architecture.
The goal, of course, is twofold:
- Minimize disruption time to customers for each category of failure.
- Maximize debug information for each category of failure.
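One way to serve both goals at once is to make "capture first, then restore" a single, time-boxed step. The following is only a sketch under stated assumptions: the function name `capture_then_restore`, the snapshot directory layout, and the `restore` callback are hypothetical, and which diagnostic commands to run depends entirely on your systems.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Dict, List


def capture_then_restore(
    commands: Dict[str, List[str]],
    out_root: Path,
    restore: Callable[[], None],
    per_command_timeout_s: float = 10.0,
) -> Path:
    """Save the output of each diagnostic command, then restore service.

    `commands` maps a label to an argv list. `restore` is whatever brings
    service back (a restart, a failover); it always runs, even if a
    capture step fails, so debugging never blocks recovery indefinitely.
    """
    snapshot_dir = out_root / time.strftime("incident-%Y%m%dT%H%M%S")
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    try:
        for label, argv in commands.items():
            target = snapshot_dir / f"{label}.txt"
            try:
                result = subprocess.run(
                    argv,
                    capture_output=True,
                    text=True,
                    errors="replace",
                    timeout=per_command_timeout_s,
                )
                target.write_text(result.stdout + result.stderr)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # Record the failure and move on to the next command.
                target.write_text(f"capture failed: {exc!r}\n")
    finally:
        restore()
    return snapshot_dir
```

On a Unix-like host you might call it with something like `{"processes": ["ps", "aux"], "disk": ["df", "-h"]}` and a `restore` function that restarts the service. The per-command timeout bounds how much extra downtime the capture can add, which is the number the CEO needs when deciding how long to let you run.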
Summary
However you go about doing it, your goal must always be "debug, don't reboot," because your goal really is to solve the problem, not the incident.
What happens when your CEO asks, "so you solved the problem; are you sure it won't recur?" You need to have a valid answer to the question.
And if your CEO doesn't care, even when presented with all of the risk factors? If he or she really doesn't think about tomorrow, only today? Well, then, your business has bigger issues than service failures. Find a better boss.
In the meantime, ask yourself if you have the systems and methodologies to provide risk-reward analyses to your executive at critical junctures, and if you are ready to solve the problem instead of the incident. Then ask us to help.