For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure.  After the dust settles and the system is restored to working condition, business and product owners outside of IT are waiting to have two primary questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

What if you approached all facts as only 80% accurate?

Since the business folks had their otherwise perfectly aligned piles of non-IT work completely interrupted by IT, they start the root cause analysis process by contacting the person in IT who represents their “relationship” with IT.  That person is usually high enough in the management hierarchy that the need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  As each camp forms, one underlying theme prevails: no one wants to be the individual who broke the system, and no manager wants to be responsible for that person and thus for the breakage itself.  This series of articles looks at this challenging exercise from an engineering management perspective.

Now, if you have been in an IT management position for some time, you have probably developed a system for this: your staff gathers the facts and suppositions offered by the various players, you have built relationships that let you contact other managers “offline” to get their take on what is going on, and you have a way to keep your own management informed as events unfold.  Hopefully you haven’t proverbially been set on fire too many times before developing such a system of story crafting and information sharing.  As for me, having moved from engineering into management without a mentor or coach … well … I thought about getting a pair of asbestos underwear many times.  The next few sections offer different perspectives on developing such a system, as well as on interacting with peer managers who adhere to a particular style in outage situations.

Why do I need my staff to continually provide me with facts or data?  I have all the information I need to go off and establish my position!

My 80% Accurate Technique

I fell into the trap of taking seemingly factual information as absolute fact many times.  Armed with the initial information provided by a trusted engineer plus my own double checking, off I would zoom to defend my analysis.  I would articulate that analysis with conviction at a root cause meeting, only to find I was completely unaware of a series of parallel events, leaving me sitting with only half the story and half my original credibility.  I found that a more credibility-strengthening approach is to assume that all of the information you have gathered is at most 80% accurate.  No matter how concrete the data is, such as a standard OS-level error message indicating a volume is out of disk space, always assume that it is at most 80% accurate.  As an example:

“The temporary storage volume was out of disk space according to the OS error message, and that is why the service crashed.”

It sounds rather concrete that a service that requires a location to store temporary data would fail when it cannot store data in that location.  But note how the following comment from someone else in the meeting can seriously erode the credibility you had when you made that statement:

“But Infrastructure had that project to move all non-mission critical storage over to the UltraCheapO disk volumes.  The service’s configuration was changed last week to use the UltraCheapO volumes, where there is plenty of storage.”

At this point you have to backpedal quickly because you weren’t aware, and seemingly whoever on your staff you pulled information from wasn’t aware, of this project that magically changed the environment.  You may be tempted to fall back on being surprised and confused by this information but, as I’ll explain a bit later, this is an absolute last resort.
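For the engineers gathering facts ahead of the meeting, one way to surface this kind of environment change early is to check where the service's temporary path really lives and how full that volume is right now.  The following is only a minimal sketch in Python, not a description of any particular tool: the `describe_volume` helper and the `service_temp_dir` variable are hypothetical names, and the system temp directory merely stands in for whatever path the service's own configuration points to.

```python
import os
import shutil
import tempfile


def describe_volume(path):
    """Report where `path` really lives and how full that volume is."""
    # Follow symlinks, which a storage migration may have left behind.
    real_path = os.path.realpath(path)
    # disk_usage returns total, used, and free space in bytes.
    usage = shutil.disk_usage(real_path)
    percent_used = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "configured_path": path,
        "resolved_path": real_path,
        "percent_used": percent_used,
        "free_bytes": usage.free,
    }


# Hypothetical stand-in: substitute the temp directory from the service's configuration.
service_temp_dir = tempfile.gettempdir()
report = describe_volume(service_temp_dir)
print(report)
```

If the resolved path differs from the path everyone assumes, or the volume turns out to have plenty of free space, that is a hint the error message is telling only part of the story.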

Now, using the OS-level error message example and assuming the information is only 80% accurate, the same message can be rephrased, as in the example below, to allow for a more graceful handling of unplanned follow-ups:

“My team’s understanding is the service uses the temporary storage volume to write data and if it can’t write, it crashes.  Does that OS error message indicate that the storage the service is using was full?”

“My team’s understanding” allows for some ambiguity, both in the accuracy of the information and in giving you, as a manager, a plausible reason not to be 100% in the know.  Thus, if you have to eat your words later, you have some face-saving opportunities to shift blame to communication challenges between yourself and your team members in the proverbial rush to analyze the outage after quickly restoring service.  “Does that OS error …”, asked in the form of a question, gives others who might be responsible for supporting components, such as temporary storage in this example, the ability to respond in a non-defensive, non-threatened manner.  A direct statement of the probable root cause immediately puts others on the defensive.  Rarely, if ever, once put on the defensive, does someone raise their hand and admit “why yes, it was completely my team’s fault, we completely dropped the ball on this one.”  Posing a question instead gives the guilty party room to respond less defensively.  Don’t be surprised if the response in the meeting is along the lines of “surprised and confused”, followed by a post-meeting revelation, communicated in an even more muted manner, that this was indeed the root cause.

In summary, the 80% accurate technique allows for the very likely possibility that you don’t have 100% of the facts pertaining to the matter, and it gives peers an opportunity to save face in the very likely event they are indeed responsible for the outage.  By applying the 80% accurate technique as a mindset that permeates all of your fact gathering and your interactions with peers in meetings, you engage in a more collaborative manner that gives both you and your peers ample opportunity to save face when new facts upend the current flow of the root cause analysis.

In the next article, I’ll look at how to interact with your staff during the crucial fact-gathering activities.


4 comments until now

  1. Being 80% accurate is also a technique which your spouse will hate you for. Yes darling, I hope I will be able to pick you up from work today. Yes honey, I should remember our wedding anniversary.

    OK, enough joking. You say that

    Rarely, if ever, once put on the defensive, does someone raise their hand and admit “why yes, it was completely my team’s fault, we completely dropped the ball on this one.”

    and you are pretty much right. But to be honest I don’t understand that. I believe that the ability to admit you screwed something up is one of the big things in terms of building relationships, especially relationships with clients.

    I know it works the other way in corporate environments, and it can be used to make you a regular candidate in blame games, but that’s one of the reasons why I hate typical corporate environments. And I know people who learned how corporate politics works and don’t give a damn anymore when they’re blamed. The funny thing is that this is more a problem of the organization than of the individuals themselves.

  2. Enjoyed your humor and agree with you 110% regarding (large) corporate environments. At the same time, and maybe I wasn’t clear in my post, in large IT environments one rarely has complete knowledge of, or access to, all the authoritative knowledge sources needed to stand completely firm on a particular technical position. From an IT management perspective, as much as you want to trust your team, or even the analysis of your rock star who has proven completely accurate in the last 10 out of 10 situations, you need to be cognizant that you may not have all the facts. Thus, couching your communications to allow for some lack of clarity in your understanding can go a long way toward maintaining your credibility.

  3. [...] is a threshold of total requests at which the system can no longer service all the requests and disasters occur.  As the client request count rapidly approaches this threshold, the application servers continue [...]

  4. [...] of articles look at this challenging exercise from an engineering management perspective with the first article introducing the “80% accurate technique”.  In this article I’ll cover how to interact with [...]
