Anyone who has had to participate in a meeting to determine why an IT system went down is echoing a collective groan as they read this title. For IT managers and engineers alike, it is the least desired activity following a system failure of any kind. After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have two primary questions answered:
- Why did the system go down in the first place?
- What is IT going to do to make sure this doesn’t happen again?
In the first article, I outlined the business context of the root cause analysis exercise and the complexity of clearly and logically arriving at a true root cause for a system outage, given the interconnected players involved. In the previous article, I outlined a particular approach for IT engineering resources, entitled “Openly Be the Hero,” to participating in the root cause analysis process. This article introduces “Play it Safe”.
IT Engineering Participatory Approach C = Play it Safe
Having seen the potential pros and cons of approaches A and B, I assume you are wondering whether there is any way to play the root cause situation safely. There is, but you are going to have to put your engineering brain and ego on hold a bit.
Step 1 = Resist the urge to be either “surprised and confused” or the Hero
At the outset, avoid meetings, emails, hallway conversations, and basically any situation that might put you in a position to start down the road of approach A or B. Reply with vague answers of the “I’m not sure. I think we are still looking into that. I am waiting on <whatever>; let me get back to you” variety.
Step 2 = Get with your management ASAP and give them a full rundown of the situation and the players involved
As succinctly as humanly possible, state the problem: “We may very well be part of the root cause for this outage.” That should get management’s attention very quickly. Follow up with “Here is what I know; stop me if I am going too fast or you already know this,” and then quickly and briefly step through the problem, clearly indicating where you believe each “fact” ranks in authority. In other words, don’t claim something is a fact unless you hold a log file printout in your hand that date- and time-stamps what you are saying. “The commonly held view is that the temporary storage volume filled up before anyone could purge files as the system needs” is opinion; “I am 50% confident, based on this log entry, that disk space was an issue” is a qualified fact. Be prepared to be stopped and asked all kinds of questions about how you know this, from whom, who else knows, and so on. What is happening is that management is starting to build the story of what took place factually, the black-and-white versus gray areas of those facts, and how all the players are positioned to take the blame.
Step 3 = If management doesn’t define your role and, thematically, what to say and not to say, suggest your role and seek confirmation
Just as important as step 2, confirming how management wants you to proceed is critical. If you complete step 2 but then go off and “Be the Hero,” you will be susceptible to all of the cons associated with that approach. If, instead, you go forward and execute your role under the clear direction of management, and you indeed execute and seek clarity whenever an unclear situation presents itself, it will be exceedingly difficult to fall victim to the cons associated with “Being the Hero.”
Step 4 = Execute your role and keep management informed of major milestones
Go forth and help the post-outage root cause investigation effort, always being mindful of your role as indicated by your management. As you become aware of “major” milestones, update management as soon as possible. Timely updates directly assist in reshaping the story and may be accompanied by some tweak in direction to your role. “Major” means any event or new information that changes the shape of events. “Bob in infrastructure just shared that the daily disk utilization report was indeed showing a reduction in free space for the last two weeks” = share ASAP. “Bob just shared that he forgot his lunch at home” = ignore. Yes, these are rather obvious examples of what to share and what not to share, but the goal is to develop your own system for listening strategically to all the information being shared, parsing out the noise, and directing significant facts back to management.
In summary, the approaches and recommendations here may seem a bit extreme to many. If you are lucky enough to belong to an organization that is culturally rational and fact-based, you may never be forced into these scenarios. Yet all it takes is one situation getting out of control for the scenarios above to become reality. The rational, fact-based, logical analysis of an outage is replaced by the panic, irrationality, and emotion of engineers and managers faced with the prospect of job loss for failing to prevent an outage disaster with major reputational and/or financial impacts.