Operational Red Flags in the Cloud

Early in my career, when I worked in technology at a very large financial firm, we started with dedicated servers for each business process. It was an easy way to track costs, manage risks, and let each business unit maintain control.

Unfortunately, it was also an exorbitantly expensive way to maintain control. As servers became more powerful and disk space cheaper, each process used less and less of its server's capacity. Even more than the cost of the infrastructure itself, the cost of the staff to deploy, maintain and support each piece of infrastructure could kill profitability.

We went through the precursor to virtualized environments: server consolidation. With 10 servers each running at 5% peak capacity, we could easily move all 10 processes onto 1 consolidated server, save 90%+ of the costs, and still leave room for the load to double!

The downside was risk. Whereas before one server's downtime meant one process out, in a consolidated world one server's downtime meant ten processes out, a very expensive proposition indeed.

Our solution to this risk was threefold:

  1. Utilize some of the savings to create extra redundancy. Instead of one server, have 2 or even 3 redundant servers, and still save 70-80% costs.
  2. Architect the software to operate better in a shared and ephemeral environment. Whereas the software used to assume its server was "always up" (which, of course, was pure fiction, if convenient), the new software should assume only that at least one of its instances would be up at any given time.
  3. Raise our service management game to several higher planes of existence.

If this sounds like an early-1990s version of the cloud... it was! Many of the principles underlying the cloud, and many of the lessons for operating it, were hatched in the intense IT environments of these mission-critical companies.

While any technology service needs to be managed well, and a mission-critical one even more so, the cloud takes it to the next level. It is amazing how focused you become when each minute of "service degradation" costs you upwards of $1MM!

Here are some "red flags" around how you manage your cloud operations. If you see these in your company, you need to address them.

Christmas Tree

Do you have a big pane of glass, or perhaps several, with blinking red and green lights, the "Christmas tree" effect? Most operations centres have them... and they are a warning sign. If it is red, it should be taken care of immediately; if it need not be taken care of immediately, it shouldn't be red.

The fundamental problem with the "Christmas tree" effect is that it assumes your team either knows what the lights really mean or has simply stopped caring. Your team is using an instinctive filter to deal with a bad signal-to-noise ratio.

This will have 3 very serious downstream effects:

  • True "severity 1" or "critical" incidents will be lost in the noise.
  • If any of your staff leaves, a lot of knowledge is lost with him or her.
  • You really have no idea of the true state of your system.
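
The fix is to make red mean "act now" and nothing else. Below is a minimal, hypothetical sketch of that routing rule in Python; the Alert fields and the page_oncall / append_to_daily_report handlers are illustrative placeholders, not the API of any particular monitoring product.

    # Minimal sketch: route alerts by actionability instead of colour.
    # All names here are illustrative placeholders, not a real tool's API.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        source: str        # e.g. "payments-db-01" (hypothetical host name)
        message: str       # human-readable description
        actionable: bool   # does this require immediate human action?

    def page_oncall(alert: Alert) -> None:
        # Placeholder: wake someone up. Only truly critical alerts get here.
        print(f"PAGE: {alert.source}: {alert.message}")

    def append_to_daily_report(alert: Alert) -> None:
        # Placeholder: record for the next ops review; no red light on the board.
        print(f"REPORT: {alert.source}: {alert.message}")

    def route(alert: Alert) -> None:
        # The rule above: if it is red, it must be handled immediately;
        # if it need not be handled immediately, it must not be red.
        if alert.actionable:
            page_oncall(alert)
        else:
            append_to_daily_report(alert)

    route(Alert("payments-db-01", "primary unreachable", actionable=True))
    route(Alert("batch-worker-07", "nightly job ran 4 minutes late", actionable=False))

The point is not the code; it is that "is this actionable right now?" becomes an explicit, reviewable decision instead of an instinct each operator carries in their head.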

Tribal Knowledge

If your head of operations / technology services / infrastructure / support and most of the team disappeared today, would you be dead in the water? Sure, your replacement staff would take time to get up to speed, but would it be days to weeks of slower service, or months of disaster?

Technology staff - indeed, all staff - should be paid for their abilities and the knowledge that gives them those abilities, and not for the specific information on which they have a stranglehold. Yes, context for work does matter, as does experience. But the day-to-day operation and engineering of your plant should be visible to and understood by all, at least inside your firm.

Ignored Alerts

Do your staff regularly suppress alerts? The Christmas tree effect means they are ignoring alerts because the alerts just don't matter; the "Squelch Effect" means they are actively suppressing them. It is the flip side of the same coin, but with different side effects. With the Christmas tree effect, you at least have some trail of what is still open. With the Squelch, someone went to the trouble of suppressing the alert, killing the trail entirely.
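
If alerts must be silenced at all, the silencing should leave a trail. The sketch below is a hypothetical Python illustration, not any real tool's schema: every suppression records who silenced the alert, why, and until when, so the trail the Squelch normally kills is preserved.

    # Minimal sketch: suppress an alert only through a function that records
    # who, why, and for how long. The record fields and file name are
    # illustrative assumptions, not a real monitoring tool's schema.
    import json
    import time

    SILENCE_LOG = "silences.jsonl"   # append-only audit trail (hypothetical path)

    def silence_alert(alert_id: str, who: str, reason: str, hours: float) -> dict:
        # Record a time-bounded silence instead of deleting the alert outright.
        entry = {
            "alert_id": alert_id,
            "silenced_by": who,
            "reason": reason,
            "silenced_at": time.time(),
            "expires_at": time.time() + hours * 3600,   # silences always expire
        }
        with open(SILENCE_LOG, "a") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    # The 2am squelch can still happen, but it is visible the next morning.
    silence_alert("disk-usage-batch-07", who="oncall-engineer",
                  reason="known noisy check; fix scheduled", hours=8)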

Whenever I see a company with the Squelch Effect, I ask what its staff turnover is. Squelch companies usually have turnover second only to pager-Squelch companies, where the staff squelch the alerts... at 2am!

Capacity Alerts

Do you get utilization alerts? Disk full, CPU over a threshold, memory usage? Here is a hard rule:

In the cloud, you should never get capacity alerts. Period.

Why? Because alerts are for issues that require immediate treatment. Capacity, on the other hand, should be proactively managed. True cloud - and nimble IT - require a process wherein capacity usage and growth requirements are monitored and forecasted, and deployed capacity managed up or down, long before the alerts hit.

For example, if your goal is never to exceed 75% disk utilization, and it takes 2 weeks to deploy additional capacity (assuming you are not working in an infrastructure cloud, where it takes 2 minutes), then the capacity management process should discover the need at least 3 weeks before utilization reaches 75%, and the deployment should be finished long before any alert would be triggered.
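
A rough sketch of what that process might compute, using the numbers from the example above (a 75% ceiling, 2 weeks to deploy, 1 week of buffer). The linear-growth extrapolation and all figures are illustrative assumptions; a real process would forecast from measured trends.

    # Minimal sketch of proactive capacity management: extrapolate utilization
    # and flag a deployment need well before any alert threshold is reached.
    # Thresholds, lead times, and the linear-growth assumption are illustrative.
    THRESHOLD = 0.75          # never exceed 75% disk utilization
    LEAD_TIME_WEEKS = 2       # time to deploy additional capacity
    BUFFER_WEEKS = 1          # discover the need at least 3 weeks out in total

    def weeks_until_threshold(current_util: float, growth_per_week: float) -> float:
        # Weeks until utilization crosses the threshold, assuming linear growth.
        if growth_per_week <= 0:
            return float("inf")
        return (THRESHOLD - current_util) / growth_per_week

    def needs_more_capacity(current_util: float, growth_per_week: float) -> bool:
        # True if the threshold will be hit within lead time plus buffer.
        return weeks_until_threshold(current_util, growth_per_week) <= LEAD_TIME_WEEKS + BUFFER_WEEKS

    # 60% used, growing 4 points/week: ~3.75 weeks to the ceiling -- keep watching.
    print(needs_more_capacity(0.60, 0.04))   # False
    # 66% used, growing 4 points/week: ~2.25 weeks to the ceiling -- deploy now, no alert needed.
    print(needs_more_capacity(0.66, 0.04))   # True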

How do you build that capacity management process? What tools do you use? Who staffs it? Ask us to help.

Summary

Many of the "classical operations" behaviours that bother you... should. There are better ways to run your business, ways that:

  • Let you and your staff sleep better at night
  • Encourage better staff retention
  • Allow your expensive teams to focus on the jobs they want to do
  • Reduce your risk of exposure to staff tribal knowledge
  • Reduce downtime while increasing customer satisfaction
  • Reduce costs
  • Provide better transparency for you and your customers
  • Connect your technologists with your business staff

If any of these red flags sounds familiar in your business, you are running at serious risk. Ask us how to run your services at cloud-speed.