Cloud Computing and the Glue-Sniffing Guarantee Fairy


The film Tommy Boy, starring Chris Farley and David Spade, was released in 1995. I was in high school at the time, and my friends and I watched it constantly. Nearly 30 years later, people still quote it, sometimes without even knowing it. The other day I overheard my 11-year-old say, “Brothers don’t shake hands. Brothers gotta hug.” I’m quite certain that he’s never seen the movie.

Now, there are plenty of leadership lessons from Tommy Boy, especially for the sales crowd. There are several fantastic articles out there analyzing the sales techniques of Tommy. I’m not a salesperson, but if you are, you should probably check them out. Google “Sales Lessons from Tommy Boy.”

This article isn’t about sales. It’s about cloud computing. That’s a concept people only dreamed about back in 1995.

Before I get into that, let’s watch one of the most famous scenes in the movie, The Guarantee:

The person in this video I most closely relate to is Ted. He’s buying from wholesale suppliers and serving his retail customers. This role, in many ways, is analogous to the modern enterprise technology shop.

On behalf of our customers (external or internal), we look across the landscape of various technology suppliers, choose the best ones, integrate them, and support them in production on behalf of our business.

So, it’s my pleasure to talk with folks like Tommy and Richard a whole lot. If you are wondering what that’s like, I’ve written about it here, here, and here.

The Guarantee

Auto parts have guarantees. Cloud services have Service Level Agreements (or SLAs). SLAs include many elements, but the key ingredient is an uptime guarantee. The provider promises the service won’t be down for more than a stated amount of time, and if it is, there will be some financial recourse. The promise is usually expressed as a percentage, such as 99.9% uptime. That sounds like a lot of availability, but it may not be as high as you think. If you do the math, 99.9% still allows about 8.76 hours of unplanned downtime per year.
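If you want to check that math yourself, here is a minimal sketch in Python. The function name and the list of percentages are my own choices for illustration, and it assumes a 365-day year. Real SLAs are often measured per month, so treat this as back-of-the-napkin arithmetic, not a reading of any provider's contract.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime an uptime percentage still permits."""
    return period_hours * (1 - uptime_percent / 100)


# Illustrative uptime levels; only 99.9% comes from the article itself.
for sla in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours "
          f"({hours * 60:.0f} minutes) of downtime per year")
```

Each extra "nine" cuts the allowance by a factor of ten, which is why the difference between 99.9% and 99.99% matters more than it looks.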

Those 8 hours will probably come at the worst possible time, just so you know. A 99.9% SLA might make you “feel all warm and toasty inside.” But another reality strikes sooner or later.

The glue-sniffing guarantee fairy

Here’s my story: I ran an important application in one of the major public cloud providers. That application ran just fine almost all of the time and had for years. Then one day, in the middle of the business day, the application went down for about 1.5 hours.

The cloud provider gave notice of the affected component. It was a managed database instance. The rest of the servers and services were up and running, but nothing worked without the database. Eventually, the cloud provider restored the service and the application came back online.

Given the duration of the outage, I worked with my account team to make sure I got a financial credit for falling below the SLA. They did the math. The way the SLA worked, they gave us a 10% rebate for the monthly consumption of the discrete component that failed. That credit was $32.96. I won’t disclose what we paid the provider to host the entire application, but it was enough to make this credit seem laughable. A $32.96 credit is something you expect to see on your personal phone bill, not something you expect to see on your enterprise application hosting.
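For the curious, the arithmetic behind that credit can be roughed out like this. It is a sketch under assumptions: I'm using a 30-day month, the function names are hypothetical, and the component's monthly spend is simply back-calculated from the two numbers above (a 10% rebate that came to $32.96). The provider's actual formula may differ in the details.

```python
def monthly_uptime_percent(outage_hours: float, days_in_month: int = 30) -> float:
    """Uptime for the month as a percentage, given total outage hours."""
    total_hours = days_in_month * 24
    return 100 * (total_hours - outage_hours) / total_hours


def sla_credit(component_monthly_spend: float, credit_rate: float) -> float:
    """Credit owed: a flat fraction of the failed component's monthly spend."""
    return round(component_monthly_spend * credit_rate, 2)


# A 1.5-hour outage in an assumed 30-day (720-hour) month.
uptime = monthly_uptime_percent(1.5)

# Back-calculated, not a disclosed figure: $32.96 is 10% of this amount.
implied_component_spend = round(32.96 / 0.10, 2)

print(f"Uptime for the month: {uptime:.2f}%")
print(f"Implied monthly spend on the failed component: ${implied_component_spend:.2f}")
print(f"Credit at a 10% rate: ${sla_credit(implied_component_spend, 0.10):.2f}")
```

The point of the exercise: the credit scales with what you pay for the one component that failed, not with what the outage cost your business.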

Now, the provider did everything they were obligated to do, and they did so with transparency and efficiency. Still, I didn’t feel “warm and toasty inside.” I felt like I had been paid a visit by the glue-sniffing guarantee fairy.

That’s when I started scratching my head. What’s the SLA for? The credit doesn’t come close to making up for the business disruption we experienced, and it barely registers with the cloud provider. These companies bring in billions of dollars every quarter. This service failure, and all of the credits doled out to all affected customers, totaled up to a rounding error on the P&L.

Shame on them, or shame on me?

After I was finished throwing my fit, I recognized my error. I acted like Ted at the beginning of the scene. Richard protested, “Our brake pads are made of a non-corrosive poly-plated…” and Ted cut him off, saying, “Son, if you’re not talking about a guarantee, skip it.”

Now, all of a sudden, I care a lot more about non-corrosive poly-plated whatever.

Is this the cloud provider’s fault or is it mine? I decided that it was mine.

I failed to design a fault-tolerant, disaster-resilient system for our application. That was on me. Going forward, I changed the way I interacted with cloud providers.

Three things made the most difference:

  1. Assume everything will fail, no matter what the SLA says, then design accordingly. We design our applications to be highly available across physical facilities and recoverable across geographic regions.
  2. Multi-cloud creates options and flexibility. It is a burden to be proficient and built out with multiple cloud providers, but it’s worth it and necessary in this day and age.
  3. Third-party cloud-native tools provide multi-cloud interoperability, security, manageability, and visibility. They also lessen, but don’t eliminate, the multi-cloud learning curve.
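To see why the first principle matters so much, here is a rough sketch of how availability combines. It assumes failures are independent, which is a generous assumption in real life (outages often take down several things at once), and the 99.9% figures and names are illustrative, not measurements from my environment.

```python
from math import prod


def serial_availability(parts: list[float]) -> float:
    """Everything must be up at once, so the availabilities multiply."""
    return prod(parts)


def redundant_availability(single: float, copies: int) -> float:
    """At least one copy must be up; assumes copies fail independently."""
    return 1 - (1 - single) ** copies


# Hypothetical two-tier application: an app tier that depends on a database.
app_tier = 0.999
database = 0.999

# One copy of each: the app is only up when both are up.
single_site = serial_availability([app_tier, database])

# Two independent copies of each tier, e.g. in separate facilities.
two_sites = serial_availability([
    redundant_availability(app_tier, 2),
    redundant_availability(database, 2),
])

print(f"Single copy of each tier: {single_site:.6f}")
print(f"Two copies of each tier:  {two_sites:.6f}")
```

Chaining dependencies drags availability below any single component's SLA, while redundancy pushes it back up. That is the gap I failed to design for when one managed database took the whole application down.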

With these three principles in place, I feel a whole lot better about SLAs. If something fails, fine. I’ll take the small credit, and I won’t be bent out of shape, because my application stayed up the whole time. That’s how it’s done.

Lastly, I don’t want to overstate our position. Just like you, we are on the journey. We’ve learned a lot. We’ve achieved a lot. But we also have plenty of work left to do.

When I hear technology salespeople talk about their SLAs and guarantees, I always think about this scene from Tommy Boy. Perhaps you will too and then crack a smile. If they wonder what’s up, tell them about my blog. I always appreciate new readers, even salespeople.
