Anyone who has had to participate in a meeting to determine why some IT system went down is probably groaning as they read this title. For IT managers and engineers alike, it is the least desired activity following a system failure of any kind. Once the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:
- Why did the system go down in the first place?
- What is IT going to do to make sure this doesn’t happen again?
In the previous article, I outlined the business context of the root cause analysis exercise and the complexity of arriving clearly and logically at a true root cause for a system outage, given the interconnected players involved. In this article, I outline one particular approach that IT engineers take when participating in the root cause analysis process.
IT Engineering Participatory Approach A = Surprised and confused
“There is a temporary storage volume? Really? What is it used for? You mean it can fill up? Wow … I’m surprised! If someone had told me about the temporary storage volume I would have obviously blah, blah, blah”

What technology? I'm supposed to know about this "technology"?
This approach amounts to playing dumb about the whole situation, as well as about the technology associated with the outage. If you haven’t experienced someone using this technique, you might be thinking to yourself: oh come on, can this really work? Doesn’t this just make one come across as inept? The answer to both questions is: absolutely. So why does an approach based in ineptitude actually work? I will put forth two arguments based on human thought processes and the pressures of time.
I’ve witnessed time and time again that the immediate reaction to the surprised and confused IT engineer is to get caught up in following that line of reasoning. People do not by and large enjoy being surprised at work with questions about their performance, work quality, responsiveness, etc. I am not a sociologist, but in my experience, when people hear someone else say they were surprised and caught off guard by a work event, they immediately identify with the dread of being surprised themselves and implicitly support this individual’s claim of surprise and confusion.
Rarely have I witnessed the alternative reaction win out: “wait a minute … isn’t it your job to be responsible for the health of this IT service, and if so, why are you surprised at how the service you are responsible for functions?” If someone does start to go down this logic road, the confused aspect starts to take precedence over the surprise with “well, I obviously would have known and fixed the temporary storage problem if someone had told me about it … but I’m confused, who should have notified me ….” The perception of the problem has again been shifted away from the individual to some more nebulous area, suggesting there was nothing the individual could have done because a nebulous third party stood in the way of the individual doing their job.
The confusion can keep growing: “wasn’t there a project that was supposed to fix this alerting problem? Wasn’t <insert random but related engineering group name here> working on this? Isn’t there a ticket open with the vendor on this issue? Wasn’t <insert random IT resource name here> working on a fix for this?” The larger the organization, the more effort it takes to follow up on each accusation and check its validity. And establishing that validity gets more and more difficult as the word spreads that the proverbial witch hunt has begun:
<insert random IT resource name here>: “Working on a fix for that? Yeah, but that got assigned to what’s-his-name in the quality team. You might want to follow up with what’s-his-face in testing services because I think they are now responsible for the quality team.”
So this option sure sounds great: the perception is always that some external force beyond one’s control is preventing one from doing their job. The implied converse is that if the restriction didn’t exist, I would have done my job and there wouldn’t have been a problem in the first place.
Finally, for this option to be successful, you have to stay away from the problem resolution itself as much as possible. If you are visibly involved in finding and fixing the problem, or if you are poking around systems related to the problem and thus appearing in audit and history logs, you can lose plausible deniability. Someone could surface at the least opportune time and reveal your involvement in the resolution process, undermining your claim to surprise and confusion.
My suggestion: don’t pursue this option unless explicitly directed by management, with some explanation of what role you are playing in the greater root cause exercise. Why not, since it seems such an easy out? In one word: reputation. You will quickly be branded inept by your peers, and management will see you as weak in the sense that you are junior, not worthy of being trusted with an important assignment and, finally, not promotion material; the last being the most difficult label to shake if you have your heart set on a different job within the organization. OK, so you might be thinking that, right now, just staying where you are is fine. That might be true right now … but what if some new project comes along, or the company purchases some cool new technology that you might want to work with in the near future? The likelihood that you will get such an opportunity, given your propensity to be surprised and confused about your job assignments, is exceedingly low. Finally, I will venture to say that when the company is under financial stress and workforce reduction seems imminent, guess who falls into the X percent of people the organization can survive without: those who are perceived as surprised and confused by work.
In the next article I’ll outline the pros and cons associated with “Openly Be the Hero”.



