The Load Balancer with the Candlestick in the Server Room
I used to hire a lot of technical talent directly. Now, that’s mostly done by my management team. When I did the hiring myself, I asked a behavioral interview question that went something like this: “Tell me about a time when you were called upon to solve a high-pressure and complex technical issue where the solution was not immediately obvious.”
The answers I got were usually pretty telling. Typically, this question elicits a bit of a pained smile, as if I just opened a wound that was simultaneously tragic and victorious.
Most of us who have been in technology support roles have our war stories. In an interview like this, I’m not trying to gauge how smart someone is, but how creative and resilient they can be in a problem-solving situation.
In essence: “Do you know what to do when you don’t know what to do?”
The problem with modern troubleshooting
The symptoms are often far removed from the root cause. Just because the website is throwing an error doesn’t mean the problem is with the web server. It could be an underlying dependency, such as the disk subsystem of a dependent database, or it could be a service that calls the mainframe, which isn’t responding because the job schedule is backed up.
Depth and breadth
As enterprise IT functions grow, we tend to specialize and compartmentalize. We have different groups that work on different technical domains. They go deep but are rarely broad.
We also have generalists that represent application functionality to the business, but they rarely have the depth to understand all of the technical underpinnings of the applications. They are great at being the face of IT but can’t really fix anything major when it’s badly broken.
In almost every IT shop, there is someone who has been around a long time and has both depth and breadth. This person is a tower of knowledge. Most things that are broken badly enough will stay that way until this person is engaged. Let’s call this person “the unicorn.” If you’ve ever read the popular novel The Phoenix Project, this role is portrayed by the character Brent.
The robots
In modern IT, there is a potential commercial solution to this problem. There is a myriad of technology vendors that claim that they can pinpoint and resolve the root cause of any complex application issue in mere milliseconds because of their advanced artificial intelligence and machine learning algorithms.
It sounds amazing, but in my experience, there’s no such silver bullet. These tools can help, but they do not solve everything.
The vendors
Unless an application system is entirely written and hosted in-house, there are often third parties to engage when troubleshooting. This is often the long pole in the tent, as it takes time to engage, get routed to the right expert, and analyze the logs. Get this activity going early in parallel with all other troubleshooting activities.
The management
Management is a wild card. I’ve seen it hurt and help a troubleshooting process. Sometimes it creates focus and brings relevant resources to the table. Other times, it makes everyone defensive and quiet, predictably prolonging the process. Frankly, it depends entirely on the leadership style of the management. Pressure has a way of revealing an organization’s true culture.
The predicament
Most of us are doing our absolute best to engage the generalists, the specialists, the unicorns, the robots, the vendors, and the management to fix our issues, all while trying to maintain a cool head. This is hard and exhausting work.
Leadership lessons from Clue
Like many of you, I grew up playing the board game Clue. There are two ways to win at Clue. You can be lucky or savvy. Being savvy has really nothing to do with how much you know about candlesticks, ballrooms, or the psychological profile of Colonel Mustard. These skills are not needed. What you do need to do is ask the right questions, listen carefully, and move around efficiently.
Ask the right questions
When it’s your turn in Clue, you get to ask the other players questions about who you suspect, and they each reveal relevant insight if they can.
When working a high-pressure incident, you must navigate a minefield of missing information, misinformation, and exaggeration. It’s important to surround yourself with knowledgeable people. These may or may not be experts, but they are close to the issue and can provide clarity.
Asking questions such as “what do we know for certain?” amongst the right group of people can render helpful insight. Once you know what you know, then you can go about seeking answers to things you don’t know, in a targeted, parallel fashion.
Listen carefully
Many Clue players take their turn, then use their time between turns to plan their next move. The best Clue players use their time between turns to listen and observe what the other players are doing.
In a high-pressure incident, we need to keep our peripheral vision wide open. Sure, we need to focus and march down a course of action to make progress, but sometimes key insights come out of left field and if we aren’t paying attention, we will miss our opportunity to course correct.
Move around efficiently
In Clue, you can ask about suspects and weapons anywhere, but you need to be in a room to inquire about it. Also, there are more rooms than suspects or weapons, so the room is almost always the hardest part to figure out. You can move around by rolling the dice, or you can move by using secret passageways. If you spend too much time moving around the board and not in a room, you will lose opportunities to gain insight.
During a high-pressure incident, I recommend spinning up parallel efforts to identify and investigate relevant technical dependencies, then moving between them rapidly to see whether you can eliminate any of them. If the Linux cluster has a rock-solid alibi, then note it and move on.
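For those who like to see the idea in code, here is a minimal sketch of that elimination process. It assumes each dependency has some quick health check you can run; the `triage` function, the dependency names, and the lambda stand-ins are all hypothetical and only illustrate the pattern of checking in parallel and recording alibis.

```python
from concurrent.futures import ThreadPoolExecutor


def triage(checks):
    """Run every dependency check in parallel and sort the results.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when the dependency looks healthy (it has an alibi).
    """
    cleared, suspects = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        # Start all investigations at once instead of one after another.
        futures = {name: pool.submit(check) for name, check in checks.items()}
        for name, future in futures.items():
            try:
                healthy = bool(future.result())
            except Exception:
                # A check that blows up is not an alibi; keep it as a suspect.
                healthy = False
            (cleared if healthy else suspects).append(name)
    return sorted(cleared), sorted(suspects)


# Hypothetical stand-ins for real probes (ping, query, API call, and so on).
checks = {
    "linux_cluster": lambda: True,
    "database": lambda: True,
    "load_balancer": lambda: False,
}

cleared, suspects = triage(checks)
print("cleared:", cleared)
print("suspects:", suspects)
```

A check that raises an error is deliberately treated as a suspect rather than cleared, since "we could not tell" is not the same as an alibi. In a real incident the checks would be people and teams as much as scripts, but the bookkeeping is the same: note who is cleared and keep your attention on who is left.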
This is a leadership issue
Depending on your role in the organization, this may or may not be your official job. You may or may not be on call this time around. However, leaders rise to the occasion in a time of crisis. You don’t need decades of experience, extreme technical depth, or formal authority to resolve an incident.
It’s unnerving to take responsibility for a problem that you don’t know how to fix. But you can confidently proceed if you know how to leverage the vast resources around you and work a methodical process. Remember to ask questions, listen carefully, and move efficiently. Before you know it, you will correctly identify that it was in fact the load balancer with the candlestick in the server room. Case closed.