For IT managers and engineers alike, the least desired activity following a system failure is coming up with the root cause. Once the dust settles and the system is restored to working condition, business and product owners outside of IT are waiting to have two primary questions answered:

  1. Why and how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
The urgency can return without warning!

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on avoiding the spinning wheel of blame. This article considers how to approach the inevitable post-outage “ruggedization” efforts.

So, you have survived a service outage with minor bumps and bruises. The service is now restored. But does everyone just go about their regular work and forget this grueling event? Nope … here come the “ruggedization” efforts. I’ve covered one angle of the project involvement perspective in a previous post. To summarize that post and set the tone for this extension: “ruggedization” projects tend to have strong support immediately following the outage, but as time marches on and new problems and priorities pop up, the “ruggedization” effort loses momentum. Strong resources move on to the problems of the moment and the challenges of the future, leaving weaker resources behind to struggle to move the “ruggedization” effort of the past forward. What ultimately puts the nail in the coffin of the “ruggedization” effort is the moment real capital dollars need approval to buy new equipment or additional software licenses, by which time many have forgotten the event ever occurred.

Thus you, as a manager, are faced with the strong potential for your team’s resources and your own energy to get pulled into an effort that is likely to stall out eventually. Walking away from these “ruggedization” efforts at the outset will brand you and your team as ones that don’t partner with the rest of the organization. Assigning your top engineers and keeping tabs on all the throes of the ensuing project process could put you at even greater risk of not paying enough attention to new and in-flight projects. Thus you need a strategy to maintain a partnering perception while not losing your strategic focus.

Approach? In a word: balance

Balance in the sense that you need to match your and your team’s involvement in the effort to the priority the rest of the organization is applying to it. In the beginning, everyone will be running around with a sense of urgency about the effort, and you need to apply an equal amount of urgency. Everyone external needs to feel that their urgency is matched by the urgency you and your team exhibit. But as you sense that the organization is beginning to lose momentum and participants are dropping off to focus on more urgent matters, begin to echo that same decreasing level of involvement.

Depending on your risk tolerance, you can start pulling back at the first sign that others are doing the same. I prefer a slightly more risk-averse approach and keep following up. I have found you get an even more concrete sense of the dissipating urgency as you directly interact with the people who were demanding licensing costs, hardware estimates and testing schedules yesterday but who, when you follow up with more questions today, noticeably balk at returning your calls, emails and IMs.

Plus, with this approach, if something goes wrong and everyone is brought back up to the original level of urgency (example: the system shows signs of potential doom and gloom again), you have a steady flow of “pre-tasks” (introduced in this article) you can reference that have “you” waiting on “them”. These “pre-tasks” help with your interactions with both external parties and your management. As external parties rush to get back into the hyper-urgent state and begin to thrash you and your team with requests, you have immediate responses that redirect them back to their own world, allowing you to ramp back up more calmly. The same applies to your management: as they hear things are picking back up, they want and need a sense that you and your team are on top of the situation. Nothing conveys that message like having this at the ready when, as you are fighting the current fires, this old fire flares back up:

“The customer service quality team is asking where the “ruggedization” project stands? Well, we are waiting on a quote from IT procurement for the two different server config options the platform team recommended to add capacity. We have questions out to the enterprise support team, asking them to confirm what data they are looking to pull from the system logs that they said don’t have the info they need. And finally, we are waiting on Testing Services to provide a performance testing window to test whether the vendor-recommended performance tuning settings will have any effect. So, we are ready to re-engage, but right now, we are waiting on these items in order to proceed.”

Want to be even more proactive? Then, as soon as you get wind the fire has re-ignited, email each of the contacts in the above example and check in on how they are progressing with your request. Then you can add the following to your response to your management:

“In addition, I’ve pinged each of those groups to see if they need anything from us at this point.”

Being able to respond with this vote of confidence further solidifies that you are on top of the situation.
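If it helps to keep those “pre-tasks” somewhere more reliable than memory, here is a minimal sketch of one way to track them, assuming a simple Python script is acceptable tooling. The `PreTask` and `status_summary` names, the fields and the dates are all hypothetical illustrations, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PreTask:
    """One request where "you" are waiting on "them" (hypothetical structure)."""
    waiting_on: str      # external group that owes the next step
    item: str            # what was requested from that group
    requested_on: date   # when the request went out
    done: bool = False   # set to True once the group has delivered


def status_summary(tasks):
    """Build an at-the-ready status response from the open pre-tasks."""
    open_tasks = [t for t in tasks if not t.done]
    if not open_tasks:
        return "No open requests; we are ready to proceed."
    lines = [
        f"- Waiting on {t.waiting_on}: {t.item} "
        f"(requested {t.requested_on.isoformat()})"
        for t in open_tasks
    ]
    return "We are ready to re-engage, but are waiting on:\n" + "\n".join(lines)


# Illustrative entries only; the dates are made up for the example.
pre_tasks = [
    PreTask("IT procurement", "quote for two server config options", date(2009, 10, 5)),
    PreTask("Testing Services", "performance testing window", date(2009, 10, 6)),
]
print(status_summary(pre_tasks))
```

The point of the sketch is only that each open item names who owes the next step and since when, which is what lets you redirect requests back to their owners.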

Thus, in summary, by keeping a pulse on the level of involvement and urgency of external stakeholders and metering your and your team’s involvement to a similar level, you can maintain a sense of partnership with the rest of the organization. In addition, if you are positioned with “pre-tasks” and an at-the-ready response to your management when the “ruggedization” effort goes from cool to cold to instantly hot again, you will satisfy their desire to have confidence and trust that you are on top of the situation. Doing both achieves the balance required to maintain the appropriate level of involvement in the “ruggedization” effort without neglecting new and emerging requests for attention.


For IT managers and engineers alike, the least desired activity following a system failure is coming up with the root cause. Once the dust settles and the system is restored to working condition, business and product owners outside of IT are waiting to have two primary questions answered:

  1. Why and how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
Can you avoid the spinning wheel of blame?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on communication strategies with your management when the priority of the organization is to restore service at all costs without neglecting data critical to finding the root cause. This article considers an even more challenging element … avoiding the spinning wheel of blame.

  1. What is the priority of the organization when it comes to systems outages?
  2. Does someone/group need to be blamed for the outage?

Is the priority to restore services as fast as humanly possible, but with an ever-present fear of the inevitable “spinning wheel of blame” along the way? If so, then you have your work cut out for you. Hopefully this article provides some helpful tips for this most unpleasant IT cultural scenario.

Those who work or have worked in an IT culture that embraces what I call the spinning wheel of blame immediately know what I am referring to. It is the sense that as the duration of the outage increases, the need to cast blame on a particular entity for the outage increases proportionally. This proportional increase puts significant downward pressure on everyone involved not to be remotely close to the impending blame assignment. In the opposite culture, though the organization does not enjoy a systems outage, it takes a healthier approach like the one in a previous article: restore service quickly, but learn why the outage occurred in the first place so rational steps and associated investments can be made to reduce the likelihood of future outages. In the blame culture, by contrast, the priority shifts to restoring service as quickly as possible while, at the same time, building a case to point the finger of blame as far from one’s team and one’s department as possible. This is where throwing the technology vendor under the proverbial bus comes in very handy. Look for more on this challenging dynamic for the engineering team dependent on a vendor in a future article.

So, you have made it this far and you are either still groaning at the thought of your most recent experience avoiding the wheel of blame in your organization, or curious how such an unhealthy culture can actually manifest itself in IT, which is known for embracing constant change and the bumps and bruises along the way. Below is a modification of the two-pronged approach I mentioned previously:

Prong One – Keep Your Team Focused

Similar to the approach in this article, identify your team members’ competencies in juggling the multiple priorities involved in restoring service and gathering data, and manage accordingly. But considering the wheel of blame element, coach the senior members to keep you abreast of the current buzz on who the wheel is pointing at before and after each major milestone in the restoration effort. Instruct them to give you a heads up as soon as their confidence nears 80% that your team’s service is the likely root cause, preferably before external parties catch on to this likelihood. For junior members without leadership provided by a senior member, though it may come across as a little bit of micromanaging, step in frequently to get a pulse on their discoveries and remind them to inform you of incriminating facts prior to sharing them with a larger audience. You’ll need to absorb enough of the pulse and data of what is going on to predict where the spinning wheel is pointing at any given moment and whether it is going to point at you as the next log entry is revealed.

Lastly, and very importantly, make sure you instill trust in your team that you have their back. If they know the wheel exists and they get a sense you will throw them under the bus at any opportunity, they will quickly adopt unsupportive and counterproductive behaviors, making your job significantly harder. Be prepared to go to bat for them when others might like to take the easy route and blame one of your team members in the blame assignment phase.

Prong Two – Keep Your Management Team Informed

While you cringe at the energy you have to expend to keep up with all the activities in flight plus the spinning wheel, your management is crossing their fingers with equal fervor that the wheel will land on someone else. In addition to providing the information in the thematic format I proposed in this previous article, consider including a “likelihood it is us at XX%” indicator in each communication, preferably at the top. Strive not to have XX go from 5% to 95% within a five-minute string of communications. It is best to start with some assumed outage responsibility, since your team is being called into the restoration and root cause effort for a reason. If the data even smells like you might have some culpability, start showing it in the XX% indicator right away. Nothing will grab attention like an XX% going from 50% to 60% to 70%, as this is a clear indication the wheel of blame is spinning in your department’s direction. This gives your management the opportunity to get involved, if only to be prepared to erect the blame shields. Another positive of having your management get involved as the percentage increases is that they can give direction if they see fit. You have most probably been heads down, focusing on the tactical. Your management has had the opportunity to look more broadly at the situation and can provide valuable feedback from this more external perspective.
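For those who like to keep themselves honest about the XX% indicator, here is a small sketch, assuming you record each estimate you send. The function names and the 20-point jump threshold are my own illustrative assumptions; pick whatever threshold fits your organization.

```python
def likelihood_header(percent):
    """Format the indicator line that leads each status communication."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return f"Likelihood it is us: {percent}%"


def flag_jumps(history, max_jump=20):
    """Return (previous, current) pairs where the estimate moved too abruptly.

    max_jump is an arbitrary illustrative threshold, in percentage points.
    """
    return [
        (prev, curr)
        for prev, curr in zip(history, history[1:])
        if abs(curr - prev) > max_jump
    ]


print(likelihood_header(60))
print(flag_jumps([50, 60, 70]))  # a steady climb, nothing flagged
print(flag_jumps([5, 95]))       # the jump to avoid is flagged
```

A steady 50, 60, 70 progression passes cleanly, while a leap from 5 to 95 is exactly the kind of surprise the indicator is meant to prevent.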

In extreme cases, not keeping your management informed opens the door for the wheel of blame to land on you directly from them. If you haven’t brought your management in early, then when something goes wrong, procedurally or otherwise, you are going to have to explain retroactively. Unless you have a great story, and more than likely you don’t, you leave your management little choice but to leave you out in the proverbial cold. Whereas, if you have been in regular communication with them and they are interacting in some manner, then they are implicitly part of the situation, not abstracted from it. Taking a harsher angle: you have removed their plausible deniability and significantly reduced the “surprise and confusion” opportunity as their out.

In the next article in this series … now that service is restored and a brief sense of calm has returned, how to approach the spirited post-disaster “ruggedization” efforts.

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?


For IT managers and engineers alike, the least desired activity following a system failure is coming up with the root cause. Once the dust settles and the system is restored to working condition, business and product owners outside of IT are waiting to have two primary questions answered:

  1. Why and how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
Fix ASAP, but don't miss the how and why?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on communication strategies with your management when the priority of the organization is to restore service at all costs, then try to find the root cause. This article considers a less “at all costs” culture.

  1. What is the priority of the organization when it comes to systems outages?

Is the priority to restore services as fast as humanly possible, but with attention paid to what changes are being made, when they are made, and what is being learned about the system along the way? If so, you need a two-pronged approach:

Prong One – Keep Your Team Focused

With a quick assessment of your team members, you can probably determine who can successfully balance competing priorities and who is overwhelmed when multiple goals are up in the air at the same time. Those who have proven the ability to balance these competing priorities need only a minor reminder to weigh the urgency to get things fixed against the need to capture the high-level steps taken, both those that lead to ultimate success and those that yield knowledge that ultimately leads to success. Reminding these individuals too often will be perceived as micromanaging. Thus, politely remind them, then proceed to monitor without nagging.

For those who have proven to struggle with the “troubleshooting 101” concept that value is derived both from fixing the problem quickly and from gathering knowledge during the fixing process, you will need to get more involved. One approach is to link these individuals to more senior team members who can take the lead and leverage these less skilled resources as a personal support arm in their resolution efforts. If you are unable to link these individuals to ones who do possess these skills, you will need to provide further instruction. Here, what skilled team members would view as micromanaging is what these team members would view as helpful, clear and focused direction. Consider a quick electronic template for individuals to use with columns such as:

Date/Time, Activity Performed, Knowledge/Result of Activity, By You or Other?

Example entries:

10/01/2009 8:00am, Joined troubleshooting conference call, nothing yet, <Team Member Name Here>

10/01/2009 8:10am, Restarted BLAH service, no change; the system still crashed under load, <Team Member Name Here>

10/01/2009 8:15am, Increased Available Threads in Thread Pool and Restarted BLAH service, the system still crashed under load but supported 10k more users than before this change, <Team Member Name Here>

Also, consider pre-populating the template with other relevant data, such as a drop-down list of business units impacted, applications/services involved or other support groups involved. Include as many pre-populated attributes as needed to assist you in strategizing your communications.
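If a spreadsheet is not handy, the same template can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `LogEntry` field names mirror the four columns above, `business_unit` stands in for one hypothetical pre-populated attribute, and the output is plain CSV that any spreadsheet tool can open.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import datetime


@dataclass
class LogEntry:
    """One row of the troubleshooting template."""
    timestamp: datetime      # Date/Time
    activity: str            # Activity Performed
    result: str              # Knowledge/Result of Activity
    performed_by: str        # who performed it (you or someone else)
    business_unit: str = ""  # hypothetical pre-populated attribute


def to_csv(entries):
    """Render entries, oldest first, as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(LogEntry)])
    writer.writeheader()
    for entry in sorted(entries, key=lambda e: e.timestamp):
        row = asdict(entry)
        row["timestamp"] = entry.timestamp.strftime("%m/%d/%Y %H:%M")
        writer.writerow(row)
    return buffer.getvalue()


# Entries may be recorded out of order; to_csv sorts them by time.
log = [
    LogEntry(datetime(2009, 10, 1, 8, 10), "Restarted BLAH service",
             "No change; system still crashed under load", "Team member"),
    LogEntry(datetime(2009, 10, 1, 8, 0), "Joined troubleshooting conference call",
             "Nothing yet", "Team member"),
]
print(to_csv(log))
```

Sorting by timestamp on output means team members can jot entries down whenever they get a breather and you still get a clean timeline to attach to your updates.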

Prong Two – Keep Your Management Team Informed

If you feel somewhat tactically helpless, imagine your management’s level of helplessness in these situations. If you are new to your management role, this is a great opportunity to demonstrate your leadership capabilities and build your management’s confidence and trust in you. Structuring a communication frequency that provides timely, but not thrashing, updates on major milestones in the troubleshooting and root cause effort will go a long way toward building that confidence and trust. A theme sequence to consider is:

  • Reported problem, your team’s initial engagement (don’t forget to mention the urgency of your team’s involvement), other teams engaged, more details forthcoming
  • Quickly following, initial assessment of the systems and end users impacted, what they are experiencing, what the initial take is on what the culprit is, temperature check of the players involved, more details forthcoming
  • Major milestones of knowledge discovery or change in the reported problem from the last report, confidence assessment of next steps equaling resolution and root cause
    • Consider attaching your most recent template as an appendix/supporting material
  • Final resolution, root cause with confidence assessment, degree of involvement in the cause of the problem, next steps now that service is restored
    • Consider attaching your most recent template as an appendix/supporting material

With a two-pronged approach that balances the directional needs of your team as they juggle competing priorities, factoring in their individual skill sets, with an organized thematic approach to communicating with your management, you add considerable value to the root cause analysis process even though your hands are not directly solving the technical issues.

In the next article in this series … what if the priority to restore services is as fast as humanly possible but under the overwhelming fear that the spinning wheel of blame has to land on someone for this disastrous event?

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?
