For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure. After the dust settles and the system is restored to working condition, business and product owners outside of IT are waiting to have two primary questions answered:
- Why did the system go down in the first place?
- What is IT going to do to make sure this doesn’t happen again?
The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique”. In this article I’ll cover how to interact with your staff during the crucial fact-gathering activities.

Always support your team
We have all had interactions with managers when a service or system we are responsible for in some capacity is not doing what it is supposed to be doing. Rands has a recent post on his perspectives of a past manager, “The Leaper”, who abhorred excuses as an abdication of responsibility. So what are the characteristics of managers who have effectively enabled staff to get through the restore-to-service and root cause analysis effort successfully? What are the characteristics of managers who, by their approach, style, involvement, or lack of involvement, have actually impeded the process which, in theory, should enable all involved to learn from the events and be better positioned for the future?
Do: Trust Your Staff
By and large, you have talented staff. In general, they come in to work wanting to do a good job. The ones that don’t or can’t do a good job you have either moved into a position where they can do the least harm or moved out altogether. Maybe it is that architect who doesn’t have his head in the IT clouds and digs into the technical details. Maybe it is that engineer who can’t stop at knowing how only his piece of the system works and has assembled an exhaustive knowledge of the system as a whole. Whoever it is, trust that, once pointed in the troubleshooting direction and reminded of the need to bring knowledge back to you and the team in order to strategize on how to communicate and act on it, they are doing their job. Resist the urge to ping them every 5 minutes with “did you fix it yet?” or “did you find why it crashed yet?” Nothing is more annoying in this situation than a boss hovering over your shoulder while you are trying to work.
This isn’t to say you completely ignore them. Rather, meter your check-ins for status and make sure you ask if they need anything. If the team is huddled in a cube for five hours without a break, offer to run and get some beverages.
Do: Run Interference so Your Team can Work
One thing you can definitely do while your team is feverishly trying to restore an ailing system, or trawling through log data to see why it might have taken a turn for the worse, is run interference for them. Offer to be the external communicator. While they are working and feeding you bits of data, you can mull it over and craft carefully constructed emails that give outside stakeholders the impression your team is on the job, is giving this issue priority, and has a handle on what went wrong. If there is a conference call where multiple parties are working together (or not, depending on your corporate culture), volunteer to be the voice for your team. When stakeholders on the call are demanding updates or answers, have a volley of responses that keep those stakeholders informed yet buffer your team from wasting precious debug and analysis time updating the “root cause coordinator” so he/she can send out some high-level update on some arbitrary schedule.
Don’t: Let your Team Members get Burned Out
So you are trusting your team members and running interference for them, but make sure you also keep a watchful eye out for the signs of burnout. Are team members starting to verbally accost one another? Are team members pounding desks, increasingly using profanity, or just plain staring at a screen full of data, glassy-eyed and frozen, for an extended period of time? Then it is time to step in and try to break the tension. Humor is a good technique to provide a few moments of distraction and levity in an otherwise stressful activity. So is forcing a break: “Hey guys, put the conference call on mute, I’ll let them know we need a bio break … let’s assemble outside the restroom and get something from the snack bar on me.” In extreme cases where the root cause exercise is extending for days, look to swap different team resources in and out. If there is a test run of a possible break scenario that looks to be focusing on something less relevant to your team, find a junior team member to represent the team in the testing while you distract your senior resources with a break.
Do: Remind Your Team Members Their Efforts are Valued
While you are strategizing your next move and arguing with peer managers over who bears more blame for the outage, don’t forget to remind the team members involved that their efforts are valued. Remember, as much as you hate the outage and post-outage activities from a management perspective, engineers want to be engineering new stuff, not educating others on why the old stuff they built broke.
Do: Support the Team’s Collective Decisions
When you meet with the team to review the data and collectively agree on a result to communicate externally, stick by the agreed-upon result. Once it is communicated, make sure you show support for your team. Don’t suddenly suffer from an attack of surprise and confusion when peers challenge your position (exaggerated bad example):
Peer Manager: “That can’t be right! There is no way my team making those system changes in module X would have caused the whole system to grind to a halt. It had to be your team updating the settings in module Y!”
You: “My team made changes to module Y? I’m surprised! Obviously my team made these changes without involving me. Of course, if I had been informed of the changes I would have made sure they were fully tested first. <insert additional backpedaling and sidestepping of accountability here>”
Rather:
You: “Yes, those changes were made to module Y as part of a formal change process that was approved by the change team because the appropriate testing steps were signed off by the QA team. I think we may collectively have a weakness in the overall system testing. Maybe we should invest some time in determining if the testing we’ve been doing for some time now truly accounts for all the system changes over the last N months. <target a more holistic problem rather than getting into a blame battle or, worse, throwing your team in front of the bus>”
In the next article, I’ll shift the focus off your team and on to techniques to interact with your management.
Anyone have an example of a do or a don’t when it comes to how you are supported in these situations? Anything a manager did that was helpful or hurtful during these events you can share?