When Disaster Recovery is a Disaster
I was recently looking over my LinkedIn profile, and I noticed that my most-endorsed skill is “Disaster Recovery.” It’s interesting because I’ve made no attempt in my career to position myself as the go-to Disaster Recovery guru, but alas, it looks like that’s exactly what I’m best known for. After thinking about it a little more, I realized that Disaster Recovery really has been a big part of my career, and I’ve hardly written anything about it. Thus, here we go.
Disaster Recovery means a lot of different things to a lot of different people. I take a purist approach to it. Disaster Recovery as a modern IT function came about as a result of 9/11/01: What am I going to do when my Data Center is a smoking hole in the ground? It’s not about fixing applications, rebooting servers, or failing over to another cluster node. It’s about running the enterprise in a different location, assuming everything in the original site is completely gone.
You can tie most activities in IT to some kind of tangible business value. You get the sense that the technology you are building is actually helping the company make money and achieve its mission. Disaster Recovery is a little different. You spend a whole lot of energy preparing for something that in all likelihood will never actually happen. That’s not very satisfying, so you have to consider the alternative. If we have no Disaster Recovery capability, then what happens if we lose our primary Data Center? Do we just close up shop and go quietly into the pages of history? Well, we certainly can’t let that happen, so we better have a solid plan, just in case.
Fortunately for me, I’ve had the pleasure of being integral to the Disaster Recovery strategy at three different Fortune 100 companies. At one of those companies, Disaster Recovery was sometimes a disaster in itself.
For the business unit I served, we didn’t have the luxury of a dual data center with replication technologies. We did things the old-fashioned way. We showed up at the cold-site recovery location with blank equipment, our tapes, and our wits. We gave ourselves 72 hours to rebuild the enterprise. Our business process required us to fully exercise the plan annually.
My role on the recovery team was to make sure everyone was doing exactly what they were supposed to be doing, when they were supposed to do it. We didn’t have time in the plan for any slack. We didn’t have time for two people to be looking at one monitor. This wasn’t just a plan; it was a tightly-choreographed acrobatic routine. We didn’t have tolerance for missed hand-offs. We had 72 hours. Each one of those hours was accounted for. We were building infrastructure or spinning tape for each of those 72 hours. Sleep was optional.
Looking back at those exercises, it’s almost surreal. After all, it was an exercise, not the real thing, but if you were in the room with us, you wouldn’t have known the difference. I remember one particular year when everything was going wrong. Murphy’s Law was in effect, and we couldn’t shake it. No matter what we did, we couldn’t get our virtual servers to restore from tape. Time was ticking, and no one was sleeping. We were on the phone with support vendors, getting flipped from one shift change to the next. Eventually, we got to the right resource who happened to know how to fix our obscure, undocumented oddity, and then the data flowed.
You would have thought we had just landed Apollo 13. Mission Control was rejoicing with sincere but exhausted jubilation. We had been awake for seventy-two stinking hours. After the business certified the data and the functionality, we burned it all down. Then we went to the hotel to pick up our luggage, which we hadn’t seen since we dropped it off three days earlier, and flew home. I think I slept for two days straight after that fiasco.
It wasn’t long after that we seriously started to change the way we did Disaster Recovery. We quickly transitioned to a dual data center replication model, which, by comparison, pretty much recovered itself.
The moral of the story is this: heroics make for great tales and memories, but it’s no way to live. Engineers will willingly go to extreme lengths to serve the business, but it all comes at a cost. As leaders and technology architects, we need to design systems that are resilient without heroics. No one is going to write a blog article about the system you built that never goes down and recovers itself, but that’s OK. At least you’ll get a good night’s sleep.