Anyone who has had to participate in a meeting to determine why some IT system went down is probably letting out a groan while reading this title. For IT managers and engineers alike, it is the least desired activity following a system failure of any kind. Once the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:
- Why did the system go down in the first place?
- What is IT going to do to make sure this doesn’t happen again?

Everything is broken!
The business folks have just had their otherwise perfectly aligned piles of non-IT work completely interrupted by IT, so they start the root cause analysis process by contacting the person in IT who represents their “relationship” with IT. That person is usually high enough in the management hierarchy that their need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. As each camp forms, one underlying theme prevails: no engineer wants to be the one who broke the system, and no manager wants to be responsible for that engineer and thus for the breakage itself.
Engineering Perspective:
The last thing an IT engineer wants to hear from the boss, after being up all night fixing the problem, is that he or she has to come into the office for a root cause analysis meeting to talk about what happened.
Bob the Engineer: “But we all know what really happened … Storage Support was supposed to be monitoring the temp storage volume and, when it starts to fill up, get someone on the DBA team to start dumping transaction history data … but since Storage Support let the volume fill up, of course the DB transactions are going to halt, which backs up everything in the system, and then it crashes!”
Sure, that could very well be an accurate root cause of the problem. But take a deeper dive into the group accused of dropping the proverbial ball on this one: Storage Support could be waiting on Storage Engineering to build, provide, or implement a monitoring and alerting solution, because someone recognized in the past that there was no reliable way to alert the appropriate folks when a storage volume reaches a certain threshold. And Storage Engineering may in fact be actively working on a solution as part of a formal IT project already in flight, a “ruggedization” project kicked off by some past disaster.
Once “projectified”, the delivery of an alerting and monitoring solution for the temporary storage volume arguably belongs to neither Storage Support nor Storage Engineering, but rather to the more nebulous ruggedization project itself. If that project has been providing frequent and accurate updates to its sponsors and stakeholders on its status and ultimate delivery dates, one could argue that the project sponsors failed to assign appropriate priority to the ruggedization effort. And since the project sponsors are most likely IT managers, the situation becomes increasingly complex: the logic (or illogic, if you will) proposes that IT management, the same people who could very well be getting the phone calls from the business people, are actually the root cause of this hypothetical outage.
Whew … now if you are still reading and haven’t passed out yet, congratulations! As an engineer caught in the middle of this interconnected web of priorities competing for limited resources, you have a couple of different ways to participate. Each participatory approach has its positives and negatives, so choose the one you are most comfortable with to achieve that approach’s associated outcome.
The next article will outline some participatory approaches and their associated pros and cons.