For IT managers and engineers alike, the least desired activity following a system failure is coming up with the root cause. After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:
- Why did the system go down in the first place?
- What is IT going to do to make sure this doesn’t happen again?
The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the second focusing on interacting with your team. In this article, I’ll cover considerations on how to interact with your management during the outage, along with the crucial fact-gathering activities that follow it.
Assuming you have had a hands-on engineering role in the past but have now transitioned fully into management, you probably remember the first major systems outage you participated in as a manager. If you were managing a system you had very recent hands-on experience with, you probably felt more comfortable digging into error logs and debugging lines of code than communicating with outside stakeholders. One of the most important stakeholders is your management structure. If you have drifted from being the hands-on person who knows all about the system to being its manager, you have probably come to grips with not being able to immediately diagnose every problem, and thus have to put trust in your team members (as mentioned in the previous article). Most challenging of all, if you find yourself managing a service you had no technical hand in designing and building, you cannot rely solely on your own brain power to dig into the problem and fix it without serious technical help. Yet in all of these situations, your role as manager requires a solid understanding of what the problem is at any given moment, what impact it is causing, and what steps are planned to make life grand again.
Keeping your management informed in a manner that gives them the timely information they need to act at their level is crucial. Keeping them in the dark while you and your team feverishly work the problem and restore services does not bode well for you being seen as a leader. You may also need assistance from your management chain when other groups are impacted by your service outage and increasingly higher levels of their management start asking tough questions. On the other side, sending your management details of newly found cryptic error log data every two minutes will produce a similar perception … a distinct lack of leadership.
I wish I could produce a single checklist of activities that would work for every organization, every culture, and every one of your managers. But as I look back at past companies, managers, and their associated styles and cultures, there is no one-size-fits-all. So instead of a checklist, I thought the best approach would be to look at organizational attributes through a series of questions.
- What is the priority of the organization when it comes to systems outages?
Is the priority to restore services as fast as humanly possible, regardless of the steps taken? If so, the information flowing up the chain should build confidence in your team’s focus and urgency in getting things working again. At the same time, coach your team to look for every option to get the system running and figure out the why later.
Experienced engineers know the best time to capture useful data is while the system is hemorrhaging error information during the failure. In high-volume systems, restarting processes or rebooting servers clears a good portion of this invaluable real-time problem data. This raises an obvious contradiction: if you are rushing to get things running at all costs, isn’t one of those costs the loss of critical data that might point squarely at the root cause? The answer is yes. So, in your communications upward, strategically work in the notion that as the team rushes, the ability to slow down and interpret data for root cause is being sacrificed. That way, once everyone temporarily relaxes after the system is restored and the conversation switches to why it crashed in the first place, you have a proverbial leg to stand on when there is a lack of critical data to support a real root cause determination. Sure, the “I told you so” conversation is never pleasant. What is worse is the “why didn’t you tell me” conversation. Choosing the lesser of two evils, I would rather have quietly and politely mentioned the cost of rapid restore versus methodical data gathering up front than say, “Oh, um, yeah, I forgot to mention that when we rebooted the box, we lost all the error logs in memory, so we have no clue why the service was taking up all the CPU.”
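One way some teams soften this tradeoff is a scripted “snapshot before restart” step that takes seconds, not hours. The sketch below is a minimal, hypothetical example of that idea; the service name, paths, and choice of diagnostic commands are my own assumptions for illustration, not anything prescribed in this article, and you would tailor them to whatever volatile data your systems actually lose on restart.

```python
#!/usr/bin/env python3
"""Hypothetical pre-restart snapshot: preserve volatile diagnostics before a
service restart or reboot wipes them. Service name, paths, and commands are
illustrative assumptions only."""

import shutil
import subprocess
import time
from pathlib import Path

SERVICE = "payment-api"  # hypothetical service name
SNAPSHOT_DIR = Path(f"/var/tmp/outage-{int(time.time())}")


def run(cmd: list[str], outfile: str) -> None:
    """Run a read-only diagnostic command and save its output to the snapshot dir."""
    with open(SNAPSHOT_DIR / outfile, "w") as fh:
        subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT, check=False)


def main() -> None:
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)

    # Volatile state that a restart or reboot would erase.
    run(["ps", "aux"], "processes.txt")                # what is eating CPU/memory right now
    run(["netstat", "-tan"], "sockets.txt")            # open and half-closed connections
    run(["dmesg"], "kernel_ring_buffer.txt")           # OOM kills, disk errors, etc.
    run(["journalctl", "-u", SERVICE, "-n", "5000"], "service_journal.txt")

    # Copy application logs that a cleanup or rotation during recovery might remove.
    for log in Path("/var/log").glob(f"{SERVICE}*"):
        shutil.copy2(log, SNAPSHOT_DIR)

    print(f"Diagnostics preserved in {SNAPSHOT_DIR}; safe to restart {SERVICE}.")


if __name__ == "__main__":
    main()
```

Even a rough script like this gives you something concrete to point at in the post-outage review: the team rushed the restore, but the most perishable evidence was captured first.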
In addition, don’t neglect keeping your upward communication stream about the urgent service restore in sync with your downward stream. Know your team members’ approaches to the system restore and, ultimately, the root cause exercise. You may need to help them refocus on the priority of system restore at all costs. Engineers tend to want to figure out the “why”, which can eat up precious time against the goal of service restoration. Plus, they know what follows the restore, so they naturally want to continue being viewed as the knowledge expert and to get their hands on as much data as possible to maintain that image.
In the next article, I’ll build on this theme of organization and culturally aligned approaches to management communication.
Anyone have an example of a do or a don’t when it comes to how you handle these situations? Anything you did that was helpful or hurtful during these events you can share?