Change Control in the Cloud

Published: 2015 Feb 17 by Avi Deitcher

"We made a small change and it brought down our customers for 4 hours." - colleague

"Network issues caused outage" - GoDaddy

"A configuration error... caused days of downtime." - Amazon

"Facebook was down... for 2.5 hours." - Facebook

Every one of us has seen human errors cause significant, revenue-affecting downtime. Our stability instinct is always to tighten up change control to try to prevent a recurrence. In a cloud environment, though, our agility instinct is to be as nimble and loose as possible. How do we reconcile these two opposing pressures?

Speed: On the one hand, one of the prime reasons you are in the cloud, and indeed the reason you operate a true cloud service, is to get "cloud-speed", along with the agility, nimbleness and, of course, valuations that come with it. To get these benefits, you need fast, small releases, preferably going as far as continuous delivery.
Stability: On the other hand, because customers are depending on you not to deliver software and products but to operate them, there is a hyper-sensitivity to what a colleague of mine once termed, with irony, "service degradation." This will create great pressure upon you to have drawn-out, heavily-reviewed, and infrequent release cycles.

The key to managing change control in a cloud environment, indeed, in any environment, is understanding risk.

The Purpose of Change Control

Why do we do change control? It isn't because we like long review meetings, or wading through stacks of paper. It turns out, we don't (or shouldn't) do change control at all! What we really are trying to do is manage our risk.

Let's do a simple thought experiment. If we could magically manage all deployments so that they would have 100% effectiveness with 0% negative impact, would we hold any change control meetings? Of course not! Yet, in many (if not most) companies, change control has become a proxy for risk control, a self-perpetuating "speed bump" in the unstated belief that it reduces our risk.

But does it? If so, why do the companies with the tightest change control so often end up with the worst metrics? If it does not, how can we create great change-driven risk management?

While it is beyond the scope of a short article to lay out an entire strategy for risk management, we will explore some basic principles.

Change Control vs. Risk Management

Change control focuses on stability of the current system in isolation; it assumes that all change is risky, and our job is to mitigate that risk as much as possible, usually through extra reviews, discussion and slowing it down.

Risk management, by contrast, looks at the complete set of risks to the entire system. It assumes every activity carries risk, including inaction.

This approach, when applied to deployment or delivery, leads to asking two very important questions:

Almost all changes eventually will have to be deployed. Delaying increases the cost of each change. Is the reduced risk of not deploying now worth the increased risk of deploying later?
Not deploying changes creates short- and long-term risks to my business, whether because of aging of the environment or because of slower delivery of capabilities. Is the reduced risk of not deploying now worth the cost to the business?

The Pain of Rigid Change Control

Change control procedures with long lead times and long reviews tend to make a manager feel good, as though she or he has risk under control. Unfortunately, the law of unintended consequences is hard at work. Such procedures create the following negative effects, each feeding the next:

Larger releases: As each change requires longer review, releases tend to be less frequent and larger, with many small changes aggregated together. The risk of error and the cost of mitigation do not rise linearly with the number of changes; they rise far faster, more than exponentially. Five ten-minute releases together do not take 50 minutes; they take many hours. The increase in risk leads to:
Tighter controls: Along with the tighter controls come longer lead times, leading to even larger releases, with greater risk, and even tighter controls, and back again. A colleague of mine calls this the "spiral of death". The difficult environment leads to:
Frustrated employees: Great employees sign up to do great work. If they are bogged down in detailed reviews, they will get frustrated. The same environment also leads to:
Unhappy customers: As customers see longer and longer lead times to features they have requested, along with more "service degradation", they become frustrated with your service, and begin to leave, leading to:
Very unhappy executives: After all, you worked untold hours to get all of these customers on board, and suddenly (or so it feels), with growth, things spiral out of control.
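
One rough way to see why bundling changes hurts, as described under "Larger releases" above, is to count how many ways the changes in a single release can interact. The sketch below is only an illustrative model, not a measurement: it assumes any two changes might conflict, and that any group of two or more might need to be untangled together when something breaks. The function name interaction_counts is made up for this example.

```python
from math import comb


def interaction_counts(n_changes):
    """Count ways the changes in one release could interact (illustrative model).

    pairs:  two-change combinations that might conflict with each other.
    groups: all combinations of two or more changes that might have to be
            examined together when diagnosing a failure.
    """
    pairs = comb(n_changes, 2)
    groups = 2 ** n_changes - n_changes - 1
    return pairs, groups


for n in (1, 2, 5, 10):
    pairs, groups = interaction_counts(n)
    print(f"{n:>2} changes: {pairs:>2} pairs, {groups:>4} groups")
```

A release with a single change has nothing to interact with. Under these assumptions, five changes already give 10 pairs and 26 groups, and ten changes give 45 pairs and 1,013 groups. Real releases will not follow this model exactly, but it suggests why the effort grows so much faster than the change count.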

I have seen many companies get caught up in this spiral. They have great engineers, a great product, great customers... and no matter how much manpower or effort they put into it, stability is questionable, release cycles are long, employee satisfaction is low, and customers are unhappy.

So how do you do great change control in a risk-managed, as opposed to rigidly-managed, environment?

Trust and Policies

The right policies, explicitly communicated to your trusted teams, are the keys.

Instead of telling your staff the rules for how they can get permission to deploy - no fewer than 3 business days after a change control meeting, all requests submitted at least 5 business days before the meeting with the results of all tests and all code, etc. - communicate the risks you are managing and how they are balanced. Tell them:

We are all about risk management. We expect you to take well-balanced risks.
If you take a balanced risk, we will reward you, even if it turns out poorly.
Always have a back-forward plan; not a backout plan, but a back-forward plan. Know what new deployment you can make that can fix any errors, but (almost) never go backwards.
Get peer review; be a peer.
Know if your changes might impact someone else.

It is the last of these that is the purpose of change control. If your change control's job is to make sure your people don't screw up, you already are in trouble. Do you really think the people reviewing are any smarter than those who own the change? Are the people who own the change - and who will be held to task if it breaks because of something sloppy - really going to deploy without peer review? If so, you are hiring the wrong people. Especially in a services world, you have to trust your people!

The real purpose of change control is to ensure that changes made responsibly by the people you trust, already peer-reviewed and already weighed against your risk-management principles, will not collide with someone else's significant changes or requirements. You don't want to change the database schema at the same time as the database server is undergoing an upgrade; you don't want to change login procedures without first having compliance review.
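
Coordination of this kind can be partly automated. The following is a minimal sketch, assuming each team declares which systems its change touches and when it plans to deploy; the Change record, the resource names and the find_collisions helper are all hypothetical, invented for illustration. It flags only changes that touch the same system in overlapping windows, using the database example above. Requirement-style checks, such as compliance review before a login change, would need a separate rule.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Change:
    """A planned change, as declared by the team that owns it (hypothetical model)."""
    owner: str
    summary: str
    resources: frozenset  # systems the change touches, e.g. {"database"}
    start: datetime
    end: datetime


def windows_overlap(a, b):
    """True if the two deployment windows share any time."""
    return a.start < b.end and b.start < a.end


def find_collisions(changes):
    """Return (change, change, shared resources) for every pair of changes
    that touches a common resource during overlapping windows."""
    collisions = []
    for a, b in combinations(changes, 2):
        shared = a.resources & b.resources
        if shared and windows_overlap(a, b):
            collisions.append((a, b, sorted(shared)))
    return collisions


# Example data: team names and times are made up.
planned = [
    Change("app-team", "alter database schema", frozenset({"database"}),
           datetime(2015, 2, 17, 22, 0), datetime(2015, 2, 17, 23, 0)),
    Change("dba-team", "upgrade database server", frozenset({"database"}),
           datetime(2015, 2, 17, 22, 30), datetime(2015, 2, 18, 1, 0)),
    Change("web-team", "refresh static assets", frozenset({"cdn"}),
           datetime(2015, 2, 17, 22, 0), datetime(2015, 2, 17, 22, 15)),
]

for a, b, shared in find_collisions(planned):
    print(f"{a.owner} / {b.owner} collide on {', '.join(shared)}")
```

Note what the check does not do: it never judges whether a change is good enough to ship. That stays with the owning team and its peer reviewers; the check only tells two trusted teams that they need to talk to each other.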

Summary

It is possible to implement positive change control procedures and have a dynamic, even a continuous delivery, environment. The purpose of change control is to coordinate among independent trusted groups so that they do not get in each other's way; it is not to check the work of untrusted groups or "slow them down."

How can these procedures be implemented in your dynamic environment? We have implemented them for clients; ask us how we can do so for you.