Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the first article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In the previous article, I outlined a particular IT engineering resource approach entitled “Openly Be the Hero” to participating in the root cause analysis process.  This article introduces “Play it Safe”.

IT Engineering Participatory Approach C = Play it Safe

Play it Safe!

Having seen the potential pros and cons of approaches A and B, you are probably wondering whether there is any way to play the root cause situation safely.  There is, but you are going to have to put your engineering brain and ego on hold a bit.

Step 1 = Resist the urge to be either “surprised and confused” or the Hero.

At the outset, avoid meetings, emails, hallway conversations and basically any situation that might put you in a position to start down the road of approach A or B.  Reply with vague “I’m not sure.  I think we are still looking into that.  I am waiting on <whatever>, let me get back to you” type answers.

Step 2 = Get with your management ASAP and give them a full run down of what is going on, the situation and the players involved.

As succinctly as humanly possible, state the problem: “we may very well be part of the root cause for this outage”.  That should get management’s attention very quickly.  Follow up with “here is what I know, stop me if I am going too fast or you know all this already” and then quickly and briefly step through the problem, clearly indicating where you believe/feel/think each “fact” ranks in authority.  In other words, don’t claim something is a fact unless you hold a log file printout in your hand that date and time stamps what you are saying.  “The commonly held view is the temporary storage volume filled up before anyone could purge files as the system needs, etc., etc., etc.”  “I am 50% confident based on this log entry that disk space was an issue.”  Be prepared to be stopped and asked all kinds of questions pertaining to how you know this, from whom, who else knows, etc.  What is happening is that management is starting to build the story of what is taking place factually, how black and white versus gray those facts are, and how all the players are positioned to take the blame.
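One way to prepare for that walk-through is to write each claim down with its source, your confidence and whatever evidence backs it.  The sketch below is only an illustration: the `Claim` fields, the three labels and the 90% cut-off are my own assumptions, not an established scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    statement: str
    confidence: float               # 0.0 to 1.0, your own honest estimate
    evidence: Optional[str] = None  # e.g. a reference to a timestamped log entry
    source: str = "unknown"         # who told you, or where you read it


def label(claim: Claim) -> str:
    """Rank a claim's authority; the labels and 0.9 cut-off are assumptions."""
    if claim.evidence is None:
        return "HEARSAY"            # no log entry or printout to point at
    if claim.confidence >= 0.9:
        return "FACT"               # evidence in hand and high confidence
    return "INFERENCE"              # evidence exists, conclusion still uncertain


def briefing(claims: List[Claim]) -> List[str]:
    """List claims from most to least certain for a quick run-through."""
    ranked = sorted(claims, key=lambda c: c.confidence, reverse=True)
    return [
        f"[{label(c)} {c.confidence:.0%}] {c.statement} (source: {c.source})"
        for c in ranked
    ]
```

Feeding it the two example statements above would rank the log-backed claim about disk space first as an inference and the “commonly held view” second as hearsay, which is the order you would want to present them in.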

Step 3 = If management doesn’t define your role and thematically what to say and not to say, suggest your role and seek confirmation

Confirming how management wants you to proceed is just as important as step 2.  If you complete step 2 but then go off and “Be the Hero”, you will be susceptible to all of the cons associated with being the hero.  If, instead, you go forward and execute your role under the clear direction of management, then as long as you indeed execute and seek clarity when an unclear situation presents itself, it will be exceedingly difficult to fall victim to the cons associated with “Being the Hero”.

Step 4 = Execute your role and keep management informed of major milestones

Go forth and help the post-outage root cause investigation effort, always being mindful of your role as indicated by your management.  As you are made aware of “major” milestones, make sure you go back and update management as soon as possible.  The timely updates directly assist in reshaping the story and may be accompanied by some tweak in direction to your role.  “Major” represents any event or new information that changes the shape of events.  “Bob in infrastructure just shared that the daily disk utilization report was indeed showing a reduction in free space for the last two weeks” = share ASAP.  “Bob just shared he forgot his lunch at home” = ignore.  Yes, these are rather obvious examples of what to share and not to share, but the goal here is to develop your own system for listening strategically to all the information that is being shared in order to filter out the noise and direct significant facts back to management.

In summary, the approaches and recommendations here may seem a bit extreme to many.  If you are lucky enough to belong to an organization that is culturally rational and fact-based, you may not be forced into these scenarios.  Yet all it takes is one situation to get out of control and the scenarios above become reality.  The rational, fact-based, logical analysis of an outage is replaced by the panic, irrationality and emotion of engineers and managers faced with the notion of job loss due to failure to prevent an outage disaster that had major reputational and/or financial impacts.

Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the first article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In the previous article, I outlined a particular IT engineering resource approach entitled “Surprised and Confused” to participating in the root cause analysis process.  This article introduces “Openly Be the Hero”:

IT Engineering Participatory Approach B = Openly Be the Hero

“I know what happened, the temporary storage volume …..”

I will save the day with facts no one can refute!

This approach, which is diametrically opposed to the surprised and confused approach, comes with some different risks.  Standing up and sharing every technical fact you can get your hands on to point out what is really going on can backfire in exactly the opposite way from the surprised and confused option.  People will tend to latch on to the one spouting off all the undeniable facts, and suddenly the masses will see the one with all of the answers as the one who was in a position to have avoided the problem altogether.  As far as your management goes, if they aren’t on board, you’ve placed them in a difficult spot to be supportive if the tide turns towards the root cause being the hero’s perceived lack of involvement.  Your peers, fearing their jobs might be in some jeopardy, will most likely slink down in their chairs to remain quiet and allow you to stand tall to take the proverbial daggers of blame.

Now, if you are one who has put in the extra energy to understand how the system or systems were constructed, and the “why” behind the seemingly architecturally backwards ways certain business processes are completed, you may struggle with avoiding the hero trap.  You may be thinking: “The facts that I possess clearly indicate without compromise that what I know to be the root cause is the root cause.  Why can’t everyone just go with the facts and be done with it?”  Not everyone is comfortable accepting the facts even if they are the facts.  What if the facts suggest a particular individual or group of individuals has been linked to the last five system outages?  Maybe these five outages are legit and the individual or group is trying desperately to improve their system management activities.  The last thing they need is another problem piled on top of their previous problems to put further pressure on management to take some action.  In an effort to save their jobs and buy more time to get out from underneath their pile of problems, they can redirect the masses to focus on the hero’s involvement and thus take the heat off themselves.

“Let me understand, the Hero knew that this problem was going to happen but didn’t do anything to stop it?  Why is the Hero hiding knowledge that would help the company?  This is yet another example of the Hero not sharing and not partnering.  How can the Hero just sit idly by and allow this to happen?  Something needs to be done about the Hero …”

And this “something that needs to be done” … and get ready, this is going to make any logical thinking IT engineer’s head spin … could be as severe as disciplinary action against the Hero.  Why would the individual who amassed enough valuable knowledge to assemble all the puzzle pieces of the problem become the victim of disciplinary action?  The answer has more to do with the organizational hierarchy than with conventional logic.  If the individual uttering those statements about the Hero is significantly high on the organizational chart, then the layers below, who have been focusing on all sorts of other fires, are caught without a good story as to why this situation occurred and why the Hero is not the root of all evil.  Not being armed with a story that shields the Hero, the management layers in between are somewhat constrained, and thus the blame lands on the Hero.  For more of the management side of the Hero’s plight, see the articles that cover this in the management section.

Sure, your peers might find you after meetings and give you kudos for standing up for the facts, but is being technically “right” worth the cost of being put through this ancillary pain?

The next article introduces the hybrid approach, which I’ve entitled “Play it Safe”.

Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the previous article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In this article, I outline a particular IT engineering resource approach to participating in the root cause analysis process.

IT Engineering Participatory Approach A = Surprised and confused

“There is a temporary storage volume?  Really?  What is it used for?  You mean it can fill up?  Wow … I’m surprised!  If someone had told me about the temporary storage volume I would have obviously blah, blah, blah”

What technology?  I'm supposed to know about this "technology"?

This approach is synonymous with playing dumb about the whole situation as well as the technology involved in the outage.  If you haven’t experienced someone using this technique, you might be thinking to yourself: oh come on, can this really work, doesn’t this just make one come across as inept?  The answer is: absolutely.  So why does this approach based in ineptitude actually work?  I will put forth two arguments based on human thought processes and the pressures of time.

I’ve witnessed time and time again that the immediate reaction to the surprised and confused IT engineer is to get caught up in following that line of reasoning.  People do not by and large enjoy being surprised at work with questions about their performance or work quality or responsiveness, etc.  I am not a sociologist, but in my experience, when people hear someone else say they were surprised and caught off guard by a work event, they immediately identify with the dread of being surprised themselves and implicitly support this individual’s claim of surprise and confusion.  Rarely have I witnessed the alternative reaction win out: “wait a minute … isn’t it your job to be responsible for the health of this IT service, and thus why are you surprised at how the service you are responsible for functions?”  If someone indeed starts to go down this logic road, the confused aspect starts to take precedence over the surprise with “well, I obviously would have known and fixed the temporary storage problem if someone had told me about it … but I’m confused, who should have notified me ….”  The perception of the problem has again been shifted away from the individual to some more nebulous area, suggesting there was nothing the individual could have done because a nebulous third party stood in the way of the individual doing their job.  The confusion can keep growing: “wasn’t there a project that was supposed to fix this alerting problem?  Wasn’t <insert random but related engineering group name here> working on this?  Isn’t there a ticket open with the vendor on this issue?  Wasn’t <insert random IT resource name here> working on a fix for this?”  The larger the organization, the more effort it will take to follow up on each accusation and check its validity.  And that validity gets more and more difficult to establish as the word spreads that the proverbial witch hunt has begun:

<insert random IT resource name here>: “Working on a fix for that?  Yah, but that got assigned to what’s-his-name in the quality team.  You might want to follow up with whose-his-face in testing services because I think they now are responsible for the quality team.”

So this option sure sounds great, since the perception is always that some external force that can’t be controlled is restricting one from doing their job.  The converse is implied: if the restriction didn’t exist, I would have done my job and there wouldn’t have been a problem in the first place.

Finally, for this option to be successful, you have to stay completely away from the problem resolution itself as much as possible.  If you are visibly involved in finding and fixing the problem or if you are poking around systems related to the problem and thus are appearing in audit and history logs, you can lose plausible deniability.  Someone could surface at the least opportune time and reveal your involvement in the resolution process, hence reducing your legitimate claim to surprise and confusion.

My suggestion: don’t pursue this option unless explicitly directed by management with some explanation of what role you are playing in the greater root cause exercise.  Why, since it seems such an easy out?  In one word: reputation.  You will quickly be branded inept by your peers, and management will see you as weak in the sense that you are junior, not worthy of being trusted with an important assignment and, finally, not promotion material; the latter is the most difficult to undo if you have your heart set on a different job within the organization.  Ok, so you might be thinking, right now, just staying where I am is just fine.  That might be true right now … but what if some new project comes along or the company purchases some cool new technology that you might want to work with in the near future?  The likelihood that you will get such an opportunity given your propensity to be surprised and confused about your job assignments is exceedingly low.  Finally, I will venture to say that when the company is facing financial stress and a work force reduction seems imminent, guess who falls into the X percent category of people the organization can survive without: those that are perceived as surprised and confused by work.

In the next article I’ll outline the pros and cons associated with “Openly Be the Hero”.

Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
Everything is broken!

Since the business folks have had their otherwise perfectly aligned piles of non-IT work completely interrupted by IT, they start the root cause analysis process by contacting that person in IT that represents their “relationship” with IT.  That IT person is usually high enough in the management hierarchy that their need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  As each camp is forming, one underlying theme prevails: no one wants to be the engineer that broke the system and no manager wants to be responsible for that engineer and thus the breakage itself.

Engineering Perspective:

The last thing an IT engineer wants to hear from the boss, after being up all night fixing the problem, is that he or she has to come into the office to participate in a root cause analysis meeting to talk about what happened.

Bob the Engineer: “But we all know what really happened … Storage Support was supposed to be monitoring the temp storage volume and when it starts to fill up, get someone on the DBA team to start dumping transaction history data … but since Storage Support let the volume fill up, of course the DB transactions are going to halt, which backs up everything in the system and then it crashes!”

Sure, that could very well be an accurate root cause of the problem.  Take a deeper dive into the group suggested to be dropping the proverbial ball on this one: Storage Support could be waiting on Storage Engineering to build/provide/implement a monitoring and alerting solution, because of some past recognition that no reliable way exists to alert the appropriate folks when a storage volume reaches a certain threshold.  And yet, Storage Engineering is actively working on a solution as part of a formal IT project already in flight, based on some past disaster that kicked off said “ruggedization” project.  Thus, once “projectified”, the ownership of the delivery of an alerting and monitoring solution for the temporary storage volume arguably belongs not to Storage Support, nor to Storage Engineering, but rather to the more nebulous ruggedization project itself.  Thus, if the ruggedization project has been providing frequent and accurate updates to project sponsors and stakeholders as to the status and ultimate delivery dates of the project itself, one could argue that the project sponsors failed to assign appropriate priority to the ruggedization effort itself.  And since the project sponsors are most likely IT managers, the situation becomes increasingly complex in that the logic (or illogic, if you will) proposes that IT management, the same people that very well could be getting the phone calls from the business people, are actually the root cause of this hypothetical outage.
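For a sense of scale, the kind of threshold check everyone in this hypothetical is waiting on can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the function names and the 85% threshold are made up, and a real alerting solution would also need scheduling, notification and escalation.

```python
import shutil


def percent_used(used_bytes: int, total_bytes: int) -> float:
    """Return how full a volume is, as a percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def needs_alert(used_bytes: int, total_bytes: int, threshold_pct: float = 85.0) -> bool:
    """True once usage reaches the (assumed) alerting threshold."""
    return percent_used(used_bytes, total_bytes) >= threshold_pct


def check_volume(path: str, threshold_pct: float = 85.0) -> str:
    """Check the volume holding `path` and return a one-line status."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.used, usage.total)
    status = "ALERT" if pct >= threshold_pct else "ok"
    return f"{status}: {path} is {pct:.1f}% full (threshold {threshold_pct:.0f}%)"


if __name__ == "__main__":
    # "." stands in for the temporary storage volume's mount point.
    print(check_volume("."))
```

The check itself is trivial, which is rather the point: what is really in dispute in the root cause meeting is who owned running it and who was supposed to act on the alert.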

Whew … now if you are still reading and haven’t passed out yet, congratulations!  As an engineer caught in the middle of this interconnected web of competing priorities and limited resources, you have a couple of different ways to participate.  Each participatory approach has its positives and negatives, so choose the approach you are most comfortable with to achieve that approach’s associated outcome.

The next article will outline some participatory approaches and their associated pros and cons.
