Keep Calm and Carry On: The Heroics, Fear, and Discipline of Technology Incident Management

In my years of working in enterprise technology, I’ve learned that I need to wear many hats. One of the most important is firefighter. It’s a baseline expectation that comes with the territory. Even though high availability and disaster resiliency technologies have improved our capabilities, we still deal with things that go bump in the night. I’ve responded to (and caused) complete Data Center failures and have lost entire environments of data only to find out that the backups weren’t working. I’ve dealt with countless debilitating virus outbreaks and have responded to numerous data breaches and cyber intrusions. I’ve dealt with complicated critical application failures that went on for days. I’ve learned a lot from these experiences, and I know that they are fairly commonplace in the world of enterprise technology. Any good post-mortem exercise produces technical and process lessons to “prevent something like this from happening again.” But the sum of my experience has yielded something more important: how to lead when the house is on fire.

Heroics

I’ll admit it. Even though incident management is horribly stressful, wreaks havoc on my work-life balance, and causes me to go without sleep for days on end, there is a part of me that likes it. I like being the hero. It is very satisfying to use my technical super-powers to rescue the business, as if it were the damsel in distress, and save the day. As I’ve written about previously, geeks don’t often get the opportunity to be a hero growing up in the mainstream world of organized sports, but we still have that innate desire. This drive is mostly good. You don’t want to have to light a fire under your team to respond to a critical incident. The best case is that they feel the urgency instinctively, and respond accordingly.

There is, however, a dark side to this drive. Sometimes the drive for heroics enables poor system design. I don’t think anyone does this intentionally, but if you find a pattern of behavior where someone on your team is constantly being put in a position of being the hero, and they are also responsible for system design, then you may have this issue. There are two ways to combat this. First, this person may just be starved for recognition. Make sure you recognize your team for proactive work, not just heroics. You will get more of what you recognize. Second, make sure you have sustainable processes, documented procedures, and bench strength in your organization, so you aren’t so dependent on one person to save the day. Succeed and fail as a team, not as individuals.

The Leadership Traits of the Incident Manager

Anyone can be an Incident Manager. I’ve been one, and today I consider myself to be the Incident Manager of last resort, even though I’ve empowered others with these responsibilities. Generally, this is someone who can facilitate the resolution of an incident, but is not the one with their hands on the keyboard. Once an incident has been identified and the correct teams have been engaged to work it, engineers need to be managed in a very specific way:

  1. Engineers need to focus. The Incident Manager’s job is to enable that focus. Get out of the way, and keep other people from getting in the way.
  2. Enable and facilitate communication. This is more of an art than a science. We cannot burden the very people who will fix the problem with constant bombardment or hovering by those who want an update on the problem. The Incident Manager needs to get the pertinent details in the least-distracting way possible, translate them into layman’s terms, and communicate them to IT management and business stakeholders at a pre-negotiated frequency.
  3. Enable collaboration. Engineers can get tunnel vision under stress. In the least-disruptive way possible, constantly open doors for collaboration with other needed resources both inside the organization, and outside.
  4. Get food. Do whatever is necessary to keep the experts focused on the issue at hand and take care of necessities.
  5. Watch for exhaustion. If your key resource is exhausted, he or she could make the problem worse by making a mistake, or may start to get grumpy or hostile with coworkers. If the person has a strong sense of accountability and drive for heroics, it may be hard to take him or her out of the game. Arrange for transportation or a place to sleep if necessary.

Keep Calm and Carry On

The world of Enterprise Technology is tough, but it’s not like war or a natural disaster. Generally speaking, the worst thing that can happen is the company loses money (potentially a whole lot of money) and people could lose jobs. In extreme circumstances, the business could fold as the result of a major unmitigated technology failure, but those are rare. Leadership means keeping perspective and maintaining poise through severe incidents. Even if you are panicking on the inside, find a way to keep a calm and cool demeanor with the business and with your team. As a leader, your emotional response is very powerful. If you freak out, everyone freaks out. If you stay calm, chances are good that your team will mostly stay calm. One way I do that is to mentally process the worst-case scenario, face my fear, accept it, then move on.

No One Wins in the Blame Game

My first blog article was “The Sparky Incident.” Early in my career, I caused a major Data Center failure. My leadership refused to play the blame game and instead took accountability for the failure at an organizational level. That instilled trust, respect, honesty, vulnerability, and mutual accountability within the team. From that point forward, I learned that this is the only way to lead.

The Ends Don’t Justify the Means

During a severe incident, resolution isn’t the only goal. If we restore service, but there is collateral damage all over the place, then that isn’t a win in my book. If we restore service, but have broken relationships as a result, that isn’t a win either. If we restore service but haven’t esteemed and supported our teams through the experience, then we may have won the battle, but we’ve lost the war. Leadership matters.
