For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind.  After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

Get the system restored at all costs?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the second focusing on interacting with your team.  In this article I’ll cover how to interact with your management during the outage and during the crucial post-outage fact-gathering activities.

If you have had a hands-on engineering role in the past but have now transitioned fully into management, you probably remember the first major systems outage you participated in as a manager.  If you were managing a system that you had very recent hands-on experience with, you probably felt more comfortable digging into error logs and debugging lines of code than communicating with outside stakeholders.  One of the most important stakeholders is your management structure.  If you have drifted from being the hands-on guy who knows all about the system to being the manager, you have probably come to grips with not being able to immediately diagnose every problem and thus having to put trust in your team members (as mentioned in the previous article).  And most challenging, if you find yourself managing a service which you did not have a technical hand in designing and building, you cannot rely solely on your own brain power to dig into the problem and fix it without serious technical help.  Yet, in all the above situations, your role as manager requires having a solid understanding of what the problem is at any given moment, what impact the problem is causing and what steps are planned to make life grand again.

Keeping your management informed of what is going on, in a manner which gives them the timely information they need to act at their level, is crucial.  Keeping them in the dark while you and your team feverishly work the problem and restore services does not bode well for you being seen as a leader.  Also, you may need some assistance from your management chain when other groups are being impacted by your service outage and increasingly higher levels of their management start asking tough questions.  On the other side, sending your management details of newly found cryptic error log data every two minutes is going to create a similar perception … your distinct lack of leadership.

I wish I could produce a single checklist of activities that would work for every organization, every culture and every one of your managers.  But as I look back at past companies, managers and their associated styles and cultures, there is no one-size-fits-all approach.  Thus, instead of a checklist, I thought the best method would be to look at organizational attributes via a series of questions.

  1. What is the priority of the organization when it comes to systems outages?

Is the priority to restore services as fast as humanly possible, regardless of the steps taken?  If so, then the information flowing up the chain should cater towards creating confidence in your team’s focus on the urgency of getting things working.  At the same time, coach your team to look for every option to get the system running and figure out the why later.

Experienced engineers know the best time to capture useful data is when the system is hemorrhaging error information during the failure.  In high volume systems, restarting processes or rebooting systems clears a good portion of this invaluable real-time problem data from the crash.  This raises an obvious contradiction: if you are rushing to get things running at all costs, isn’t one of those costs the loss of critical data that might point squarely at the root cause?  The answer is “Yes”.  Thus, in your communications upwards, strategically work in the notion that as the team is rushing, the ability to slow down and interpret data for root cause is being sacrificed.  That way, once everyone temporarily relaxes after the system is restored and then switches to asking why it crashed in the first place, you have a proverbial leg to stand on when there is a lack of critical data to support a real root cause determination.  Sure, the “I told you so” conversation is never pleasant.  What is worse is the “why didn’t you tell me” conversation.  Choosing the lesser of two evils, I would rather quietly and politely refer back to having mentioned the cost of rapid restore versus methodical data gathering first and then restore, than say “Oh, um, yah, I forgot to mention that when we rebooted the box, we lost all the error logs in memory thus we have no clue why the service was taking up all the CPU.”
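One way to soften this trade-off is to have a quick "grab what we can" step ready before anyone restarts anything.  The sketch below is only an illustration under stated assumptions: the function name `snapshot_logs`, the `label` parameter and the directory paths are hypothetical, and it only copies log files already on disk.  Data held only in memory would need whatever dump tooling your platform provides.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dir, archive_root, label="outage"):
    """Copy everything under log_dir into a new timestamped folder.

    Returns the path of the new folder so it can be quoted in the
    status update sent up the management chain.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(archive_root) / f"{label}-{stamp}"
    # copytree creates dest (and any missing parents) and fails if dest
    # already exists, so an earlier snapshot is never overwritten.
    shutil.copytree(log_dir, dest)
    return dest


# Hypothetical usage, run just before the restart:
#   saved = snapshot_logs("/var/log/myservice", "/srv/outage-archive")
#   print(f"Logs preserved in {saved}; proceeding with restart")
```

Even a crude copy like this gives you something concrete to point to later when the root cause questions start.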

In addition, don’t neglect keeping your upward communication stream of urgent service restore in sync with your downward stream.  Know the approaches your team members are taking in the system restore and ultimately the root cause exercise.  You may need to help them refocus on the priority of system restore at all costs.  Engineers tend to want to figure out the “why”, which could eat up precious time against the goal of service restoration.  Plus, they know what follows the restore, thus they naturally want to continue to be viewed as knowledge experts.  They want to get their hands on as much data to process as possible to maintain that image.

In the next article, I’ll build on this theme of organization and culturally aligned approaches to management communication.

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?

For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind.  After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique”.  In this article I’ll cover how to interact with your staff during the crucial fact-gathering activities.

Always support your team

We all have had interactions with managers when a service or system we are responsible for in some capacity is not doing what it is supposed to be doing.  Rands has a recent post on his perspectives of a past manager, “The Leaper”, who abhorred excuses as an abdication of responsibility.  So what are the characteristics of managers who have effectively enabled staff to get through the service restore and root cause analysis effort successfully?  What are the characteristics of managers who, by their approach, style, involvement, or lack of involvement, have actually impeded a process which, in theory, should enable all involved to learn from the events and be better positioned for the future?

Do: Trust Your Staff

By and large, you have talented staff.  In general, they come in to work wanting to do a good job.  The ones that don’t or can’t do a good job you have either moved into a position where they can do the least harm or moved out altogether.  Maybe it is that architect that doesn’t have his head in the IT clouds and digs into the technical details.  Maybe it is that engineer that just can’t stop at knowing how only his piece of the system works but has assembled an exhaustive knowledge of the entire system as a whole.  Whoever it is, trust that, once pointed in the troubleshooting direction and reminded of the need to bring knowledge back to you and the team in order to strategize on how to communicate and act on it, they are doing their job.  Resist the urge to ping them every 5 minutes with “did you fix it yet?” or “did you find why it crashed yet?”  Nothing is more annoying in this situation than a boss hovering over your shoulder while you are trying to work.

This isn’t to say you completely ignore them.  Rather, meter your check-ins for status and make sure you ask if they need anything.  If the team is huddled in a cube for five hours without a break, offer to run and get some beverages.

Do: Run Interference so Your Team can Work

One thing you can definitely do while your team is feverishly trying to restore an ailing system, or trawling through log data to see why it might have taken a turn for the worse, is run interference for them.  Offer to be the external communicator.  So while they are working and feeding you bits of data, you can mull it over and craft carefully constructed emails that give outside stakeholders the impression your team is on the job, is giving this issue priority and has a handle on what went wrong, etc.  If there is a conference call where multiple parties are working together (or not, depending on your corporate culture), volunteer to be the voice for your team.  When stakeholders on the call are demanding updates or answers, have a volley of responses that keep those stakeholders informed yet buffer your team from wasting precious debug and analysis time updating the “root cause coordinator” so he/she can send out some high level update on some arbitrary schedule.

Don’t: Let your Team Members get Burned Out

So you are trusting your team members while running interference for them, yet make sure you keep a watchful eye out for the signs of burnout.  Are team members starting to verbally accost one another?  Are team members pounding desks, increasingly using profanity or just plain staring at a screen full of data, glassy eyed and frozen, for an extended period of time?  It is time to step in and try to break the tension.  Humor is a good technique to provide a few moments of distraction and levity in an otherwise stressful activity.  So is forcing a break: “Hey guys, put the conference call on mute, I’ll let them know we need a bio break … let’s assemble outside the restroom and get something from the snack bar on me.”  In extreme cases where the root cause exercise is extending for days, look to swap different team resources in and out.  If there is a test run of a possible break scenario that looks to be focusing on something less relevant to your team, find a junior team member to represent the team in the testing while you distract your senior resources with a break.

Do: Remind Your Team Members Their Efforts are Valued

While you are strategizing your next move and arguing with peer managers on who bears more blame for the outage, don’t forget to remind your team members involved that their efforts are valued.  Remember, as much as you hate the outage and post-outage activities from a management perspective, engineers want to be engineering new stuff, not educating others on why the old stuff they built broke.

Do: Support the Team’s Collective Decisions

When you meet with the team to review the data and collectively agree on a result to communicate externally, stick by the agreed upon result.  Once communicated, make sure you show support for your team.  Don’t suddenly suffer from an attack of surprise and confusion when peers challenge your position (exaggerated bad example):

Peer Manager: “That can’t be right!  There is no way my team making those system changes in module X would have caused the whole system to grind to a halt.  It had to be your team updating the settings in module Y!”

You: “My team made changes to module Y?  I’m surprised!  Obviously my team made these changes without involving me.  Of course, if I had been informed of the changes I would have made sure they were fully tested first. <insert additional back pedaling and side stepping accountability here>”

Rather:

You: “Yes, those changes were made to module Y as part of a formal change process that was approved by the change team because the appropriate testing steps were signed off by the QA team.  I think we may collectively have a weakness in the overall system testing.  Maybe we should invest some time in determining if the testing we’ve been doing for some time now truly accounts for all the system changes over the last N months.  <target a more holistic problem rather than getting into a blame battle or worse, throwing your team in front of the bus>”

In the next article, I’ll shift the focus off your team and on to techniques to interact with your management.

Anyone have an example of a do or a don’t when it comes to how you are supported in these situations?  Anything a manager did that was helpful or hurtful during these events you can share?

For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind.  After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

What if you approached all facts as only 80% accurate?

Since the business folks have had their otherwise perfectly aligned piles of non-IT work completely interrupted by IT, they start the root cause analysis process by contacting the person in IT who represents their “relationship” with IT.  That IT person is usually high enough in the management hierarchy that the need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  As each camp is forming, one underlying theme prevails: no one wants to be the individual that broke the system and no manager wants to be responsible for that person and thus the breakage itself.  This series of articles looks at this challenging exercise from an engineering management perspective.

Now, if you have been in an IT management position for some time, you have probably developed a system to have your staff gather for you the various facts and suppositions by the various players, build relationships where you can contact other managers “offline” and get their take on what is going on and have developed a system to inform your management of status as the events unfold.  Hopefully you haven’t proverbially been set on fire many times prior to having developed such a system of story crafting and information sharing.  I, myself, having moved from engineering into management without a mentor or coach … well … I thought about getting a pair of asbestos underwear many times.  The next few sections will offer different perspectives for developing such a system as well as interacting with peer managers that adhere to a particular style in outage situations.

Why do I need my staff to continually provide me with facts or data?  I have all the information I need to go off and establish my position!

My 80% Accurate Technique

I fell into the trap of taking seemingly factual information as absolute fact many times.  I would take the initial information provided by a trusted engineer, add my own double checking, and off I would zoom to defend my analysis.  I would clearly articulate my analysis with conviction at a root cause meeting, only to find I was completely unaware of a series of parallel events that had taken place, leaving me sitting with only half the story and half my original credibility.  I found that a more credibility-strengthening approach is to assume all of the information you have gathered is at most 80% accurate.  No matter how concrete the data is, such as a standard OS level error message indicating a volume is out of disk space, always assume that it is 80% accurate.  As an example:

“The temporary storage volume was out of disk space according to the OS error message thus that is why the service crashed.”

It sounds rather concrete that a service that requires a location to store temporary data would fail due to not being able to store data in that location.  But, note how the below comment from another in the meeting can seriously erode the credibility you had when you made that statement:

“But Infrastructure had that project to move all non-mission critical storage over to the UltraCheapO disk volumes.  The service’s configuration to use the UltraCheapO volumes was made last week where there is plenty of storage.”

At this point you have to backpedal quickly because you weren’t aware, and seemingly whoever on your staff you pulled information from wasn’t aware, of this project that magically changed the environment.  You may be tempted to fall back on being surprised and confused by this information, but as I’ll explain a bit later, this is an absolute last resort.

Now, using the OS level error message example and assuming the information is only 80% accurate, rephrasing the same message in the example below allows for a more graceful handling of unplanned follow-ups:

“My team’s understanding is the service uses the temporary storage volume to write data and if it can’t write, it crashes.  Does that OS error message indicate that the storage the service is using was full?”

“My team’s understanding” allows for some ambiguity in the accuracy of the information, and it gives you, as a manager, a plausible reason not to be 100% in the know.  Thus, if you have to eat your words later, you have some face-saving opportunities to shift blame to communication challenges between yourself and your team members in the proverbial rush to analyze the outage after quickly restoring service.  “Does that OS error …” asked in the form of a question allows others that might be responsible for supporting components, such as temporary storage in this example, to respond in a non-defensive, non-threatened manner.  A direct statement of the probable root cause immediately puts others on the defensive.  Rarely, if ever, once put on the defensive, does someone raise their hand and admit “why yes, it was completely my team’s fault, we completely dropped the ball on this one.”  Whereas forming the question rather than the direct statement allows the guilty party to respond in a less defensive manner.  Don’t be surprised if the response is along the lines of “surprised and confused”, with a post-meeting revelation, communicated in an even more muted manner, that this was indeed the root cause.

In summary, the 80% accurate technique allows for the very likely possibility that you don’t have 100% of the facts pertaining to the matter, and it gives peers an opportunity to save face in the very likely event they are indeed responsible for the outage.  By applying the 80% accurate technique as a mindset that permeates all of your fact gathering and meeting/peer interactions, you engage in a more collaborative manner that gives both yourself and your peers ample opportunity to save face when new facts upend the current flow of root cause analysis.

In the next article, I’ll look at how to interact with your staff during the crucial fact-gathering activities.

In the throes of performance reviews and feedback activities, I’ve started to form in my mind the concept of opinion boundaries as it pertains to work team interactions.  As you know from this blog, the significant bulk of my professional work experience has been in IT.  Thus, this concept of an opinion boundary was born in IT but could very well apply in other professional disciplines where disparate teams with specific service outputs come together to provide a combined product or service.  I would be curious to hear from folks outside of IT who experience this concept in their workplaces, and under what circumstances it manifests itself.

Keep your opinions within your realm

First, let me paint a picture of what is creating this opinion boundary concept in my mind:

Hypothetical meeting between a Software Development Team tasked with building a new application and a Database Team that is tasked with providing Database services to applications for the whole company.

Software Development Team: “Ok, we have been asked to build a new application called X that will take in manually entered information from paper request forms and allow each form to be assigned a unique identifier so the five steps to complete the request can be tracked by the new application.  We have a database design we can provide.  What else do we need to do in order to get a database available to us in order for us to start developing?”

Database Team A: “Hey, are you writing this application in Ruby on Signposts?  If you are, you should consider using at least version 2.5 because we’ve heard of performance problems in earlier versions.  Also, have you considered how your data layer should talk back to the UI?  We’ve heard of a group that had problems with the standard PC build, you should talk to the platform guys on that.  Hey, we know that in general, development shouldn’t start until all the requirements are finalized … do you know if the business signed off on the spec yet?  Oh, and we think …”

Compared to a response such as:

Database Team B: “Ok, we just need to know some initial pieces of information from you in order to ensure we can configure a database that will work for your needs and that we can support.  What development language are you using?  Do you have any disaster recovery requirements beyond the default ones for the company?  Do you have any specific needs beyond the core database services of X, Y, Z…”

Note: This is by no means a reflection of database teams in general.  The example attempts to portray one group that needs services from another group in order to deliver their work product.  I chose the example of a software application team needing a database from a database team as a representation of a common use of centralized IT services, in this case databases, being provided out to distributed groups that all need to leverage that common service in order to ultimately provide their service.

The different responses are striking in that the first exhibits seemingly no boundaries to the opinions this hypothetical Database Team A has outside of their core service offering compared to the second from Database Team B.  Team B recognizes the boundaries of their service compared to the requesting team’s needs and has derived an interaction strategy that focuses on determining the most logical touch points and associated technical considerations between their service and the requestors.  Team B does not cross that touch point boundary and encroach on the subject matter aligned with the software development team.

Over the years, I have found the most successful cross team partnerships consistently involve a clear opinion boundary.  An opinion boundary means respecting the needs of the other team and structuring a service interaction model that defines the service touch points in terms of what the requestor needs from the provider, as well as clearly explaining the constraints the provider has to impose on the requestor in order to ensure the provider’s service can meet its service expectations across all requestors.  And finally, it means refraining from offering any opinions that break the touch point barrier.

What I have struggled to work out is what drives an individual or a team to have such a strong need to offer opinions on subject matter outside of their role or service within the organization.  In the most extreme case, the only logical conclusion I could draw was that the team was extremely weak, skill set wise, in what they were tasked to do and used such a tactic to misdirect requestors.  In the process of misdirecting, they seemed to be hoping one of two things would occur:

  1. The requestor would find another service provider and thus never return (seemingly odd, but a very large IT organization is likely to have multiple teams providing similar enough services that one can meet their business objectives via multiple options)
  2. The misdirection would require more people to get involved to ultimately map out exactly what the provider needed to do at a detailed level, which ultimately filled in for the provider’s skill set gap.  “Oh, you actually wanted us to provide that?  Oh, that … sure, we can do that.  It wasn’t clear to us initially that you really wanted that.”  The provider uses such backtracking to agree to perform the granular tasks outlined to them by a more senior resource outside their team.

Another motivation is sheer ego, where the provider wants to talk to hear themselves talk and feed their sense of importance for as long as the requestor needs to confirm the use of their services.  In other words, the requestor has to suffer through the chest beating and technical pontifications in order to ultimately get the service required in the end.

Anyone have a different definition of opinion boundary than the one I have described, or know of alternative motivations for someone or for a team to interact in these manners?  Anyone find this applies outside of IT?

As a manager of a team of IT engineers, one of the toughest challenges is getting a handle on not only what everyone is working on, but what are all the seemingly random sources of work coming at your team.  Thus whether you find yourself managing a new team or have been managing a team for some time but you are constantly being surprised with new requests out of left field, you may want to consider constructing a logical approach similar to what is being outlined in this series of articles to stop the surprises.

In the previous article, we have identified the work request attributes of your team and built a list of sources of those requests.  In this next article, we will start to combine these two lists into a single model of how your team does work.

Step 3: Separate request attributes you can influence from those that you cannot

Just a note right off the bat … I chose the word “influence” rather than control or direct.  I rarely find one has complete and total control over every aspect of a particular work request attribute.  Thus, the work model needs to be flexible to handle changes even to work requests you yourself are sponsoring.

From our example engineering team in the previous article:

Influence:

  • Vendor product upgrade = If you are taking the proactive step of upgrading a vendor product, then you have an increased level of influence over the scheduling details.  The most notable exception is if you are opting to make a vendor product upgrade a dependency on an external project’s requirements as I’ve outlined in the following article on this technique.
  • Vendor product end-of-life = although you don’t have influence over when a vendor decides to establish an end-of-life date, you usually have plenty of advance notice.  Knowing your product has six, nine or twelve months of life left, you can proactively plan how to upgrade to the appropriate version.
  • Service Strategy = probably the request attribute you have the most influence over.  Also, when the work is piling up, this is probably the attribute that gets the least attention.

No Influence:

  • Production issue support = the classic “drop everything, all hands on deck” scenario that pops up when you least expect this type of work request.
  • Projects requesting engineering support = If you want to continue to enjoy your paycheck, when an external project requests resource assistance, you probably aren’t typically responding with “we are too busy, come back next year”.  Rather, you need to provide each and every project that is integrating with your team’s services with some form of assistance.  Since project scopes rarely stay fixed, the likelihood that a project in which you have little to no involvement will turn 180 degrees and need your services puts this attribute in the no influence category.

Step 4: Identify sources of predictive data per request attribute

Time to start identifying some metrics

Now that you have categorized your request attributes into those that pop up at random and those that you have some influence over, it is time to identify sources of data to use as metrics to drive the model.  Why are metrics so important?  Because as an IT Manager you need a library of complex colored charts and graphs to swamp senior management with to justify your role in the organization; more seriously, you need them in order to make credible claims.  Proceed to identify sources of metric data per request attribute:

Production issue support

Most if not all IT shops have a production trouble ticket tracking system of some kind.  Get yourself access to whatever reporting capabilities exist and try to get as much data extracted as you can.  The data should specifically note in some way your team members’ involvement.  Can you extract patterns?  Do you see a spike of involvement at the beginning or end of the month?  Is there a customer billing cycle that has your team helping the billing department five times every second week of the month?  Can you pull any start/end times for your team’s involvement, so that you can say, for example, that the average time spent per ticket is 1.5 hours?  This is the kind of trend data you are looking for to start turning “we keep getting surprised with all these production tickets” into “we can expect to get hit with tickets here, here and here for approximately X hours per ticket”.
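As a rough illustration of the kind of trend data described above, here is a minimal Python sketch.  Everything in it is an assumption made for the example: the ticket ids and timestamps are invented, and the field layout (ticket id, opened, closed) stands in for whatever your ticketing system's export actually provides.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M"

# Hypothetical export: (ticket id, opened, closed) for tickets the team worked.
TICKETS = [
    ("T-101", "2024-03-01 09:00", "2024-03-01 10:30"),
    ("T-102", "2024-03-12 14:00", "2024-03-12 15:00"),
    ("T-103", "2024-03-13 08:00", "2024-03-13 10:00"),
    ("T-104", "2024-03-29 16:00", "2024-03-29 17:30"),
]


def hours_spent(opened, closed):
    """Elapsed hours between the opened and closed timestamps."""
    delta = datetime.strptime(closed, FMT) - datetime.strptime(opened, FMT)
    return delta.total_seconds() / 3600


def week_of_month(opened):
    """Days 1-7 are week 1, days 8-14 are week 2, and so on."""
    return (datetime.strptime(opened, FMT).day - 1) // 7 + 1


def summarize(tickets):
    """Return (average hours per ticket, ticket count per week of month)."""
    durations = [hours_spent(o, c) for _, o, c in tickets]
    per_week = Counter(week_of_month(o) for _, o, _ in tickets)
    return mean(durations), dict(per_week)


if __name__ == "__main__":
    avg, per_week = summarize(TICKETS)
    print(f"Average hours per ticket: {avg:.1f}")
    for week in sorted(per_week):
        print(f"Week {week} of the month: {per_week[week]} ticket(s)")
```

With the sample data this reports an average of 1.5 hours per ticket and a cluster in the second week of the month, which is exactly the sort of pattern worth chasing down with the billing department.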

Projects requesting engineering support

Unlike production support tickets, requests for assistance from project teams should be less frequent in nature.  Pinning down the sources of new projects and their technical scope is a bit more challenging.  I’ve found you need a multi-pronged approach to collecting this data, and it is a recurring activity rather than, say, a once a year one shot deal.  First, you need to do some digging into how your organization kicks off projects.  Is there a central PMO or Project Management Office?  Are there lines of business that have their own project management teams?  Are there product managers or customer service owners that have a project management arm that services their products and services?  Is it a combination of all the above?

For a new manager, top priority in my mind is to start building this project kickoff source of information.  It is time to break out your networking skills and track down sources of project tracking data and people who are the typical working sponsors in your organization.  Get plugged into the project leads in the PM office.  Identify product managers that typically sponsor projects that impact your team and take them out to lunch to have them spill their product roadmap or strategy to you.  Determine the budget cycle of each product your team’s technology interacts with and get talking to project leads that regularly manage product enhancement projects since they are usually tapped to offer estimates and expertise during the pre-budget cycle.

Clearly, it is possible to invest a large portion of your time tracking down all these people in these various roles throughout the company … especially if your company is large.  Thus, I recommend throttling your time investment in this activity.  Step back and note whether you are getting useful data or just polite conversations.  If the data isn’t flowing, it is time to change direction.  On the flip side, don’t ignore this continual time investment even if investigating the feature set of the upcoming release of one of your vendor software products seems more appealing to your engineering mind.  Your mastery of the technical feature set of your vendor’s products will be overshadowed by your management’s perception that you are struggling to manage the work demands on your team.

The next article will dive into team capacity metrics.

From both an IT engineering and management perspective, I find it very easy to fall victim to groupthink.  Groupthink is defined by Irving Janis as “A mode of thinking that people engage in when they are deeply involved in a cohesive in-group, when the members’ strivings for unanimity override their motivation to realistically appraise alternative courses of action.” [1]  To illustrate groupthink in the IT world, does a team meeting scenario like the one below sound familiar?

Impromptu Team Meeting in the Conference Room

Is venting starting to turn everyone too negative?

Bob the Engineer <very agitated>: “Geez, that project ‘critical path’ regroup meeting was a huge pain!”

Sally the Engineer <visibly frustrated>: “Yah, if only the project managers were listening to us all last month when we were telling them their schedule was ridiculous!  There was no way back then the work was going to get accomplished when their fancy project plans said it would and we told them!”

Joe the Engineer <rolling his eyes>: “How many times are we going to have to attend these silly ‘we knew we were going off track a month ago but we did nothing so now we need to have a meeting to chat about why we are suddenly off course …’”

Bob the Engineer <speaking over Joe>: “We are going to have these meetings as long as the PMs keep ignoring our work estimates.”

Sally the Engineer: “Yah, when will they ever learn?”

Groupthink is an easy trap to fall into.  Yet, in my opinion, fostering a team climate in which there is an opportunity for such venting benefits the entire team.  With a team sense of shared success and shared pain, the climate evolves to allow team members to be more candid with their teammates and their manager about problems, issues, challenges and successes.  The openness breeds more open communication and cooperation, and ultimately leads to higher quality output in less time, with problems surfacing and being resolved earlier in the work process before they become disasters.  Yet, left unchecked, this venting can take the team as a whole deeper into the short term comfort of the shared pain and away from the need to look for opportunities to avoid the situation that causes the pain in the first place.

It is time for someone to be a leader and jump in and pull the team back from the precipice of groupthink.

Boss/Engineer: “OK, OK, ok … we all know the PMs get themselves into this situation more often than seems warranted.  How about we brainstorm on some creative ways we can do things differently going forward so we don’t have to sit in these useless regroup meetings?”

As an Engineer, this is a great opportunity to demonstrate leadership skills in front of your boss with real credibility.  As a Manager, if no one on your team is stepping up and the level of negativity is rising, you may have to step in with similar comments to redirect your team to focus on positives before the negatives further spiral the team down a bad path.

Anyone have any perspectives to share on groupthink?  Anyone have a technique or example where groupthink was avoided or not avoided resulting in a bad situation getting worse?

  1. Janis, Irving L. Victims of Groupthink. Boston: Houghton Mifflin Company, 1972, p. 9.

Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the first article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In the previous article, I outlined a particular IT engineering resource approach to participating in the root cause analysis process, entitled “Openly Be the Hero”.  This article introduces “Play it Safe”.

IT Engineering Participatory Approach C = Play it Safe

Play it Safe!

Having seen the potential pros and cons of approaches A and B, I assume you are wondering whether there is any way to play the root cause situation safely.  There is, but you are going to have to put your engineering brain and ego on hold a bit.

Step 1 = Resist the urge to be either “surprised and confused” or the Hero.

At the outset, avoid meetings, emails, hallway conversations and basically any situation that might put you in a position to start down the road of approach A or B.  Reply with vague “I’m not sure.  I think we are still looking into that.  I am waiting on <whatever>, let me get back to you” type answers.

Step 2 = Get with your management ASAP and give them a full run down of what is going on, the situation and the players involved.

As succinctly as humanly possible, state the problem: “we may very well be part of the root cause for this outage”.  That should get management’s attention very quickly.  Follow up with “here is what I know, stop me if I am going too fast or you already know all this” and then quickly and briefly step through the problem, clearly indicating where you believe/feel/think each “fact” ranks in authority.  In other words, don’t claim something is a fact unless you hold a log file printout in your hand that date and time stamps what you are saying.  “The commonly held view is the temporary storage volume filled up before anyone could purge files as the system needs, etc., etc., etc.”  “I am 50% confident based on this log entry that disk space was an issue.”  Be prepared to be stopped and asked all kinds of questions pertaining to how you know this, from whom, who else knows, etc.  What is happening is that management is starting to build the story of what is taking place factually, how black and white versus gray those facts are, and how all the players are positioned to take the blame.

Step 3 = If management doesn’t define your role and thematically what to say and not to say, suggest your role and seek confirmation

Just as important as step 2 is confirming how management wants you to proceed.  If you complete step 2 but then go off and start “Being the Hero”, you will be susceptible to all of the cons associated with being the hero.  If instead you go forward and execute your role under the clear direction of management, then as long as you do indeed execute, and seek clarity when an unclear situation presents itself, it will be exceedingly difficult to fall victim to the cons associated with “Being the Hero”.

Step 4 = Execute your role and keep management informed of major milestones

Go forth and help the post outage root cause investigation effort, always being mindful of your role as defined by your management.  As you are made aware of “major” milestones, make sure you go back and update management as soon as possible.  The timely updates directly assist in reshaping the story and may be accompanied by some tweak in direction to your role.  “Major” represents any event or new information that changes the shape of events.  “Bob in infrastructure just shared that the daily disk utilization report was indeed showing a reduction in free space for the last two weeks” = share ASAP.  “Bob just shared he forgot his lunch at home” = ignore.  Yes, these are rather obvious examples of what to share and not to share, but the goal here is to develop your own system for listening strategically to all the information that is being shared in order to filter out the noise and direct significant facts back to management.

In summary, the approaches and recommendations here may seem a bit extreme to many.  If you are lucky enough to belong to an organization that is culturally rational and fact based, you may not be forced into these scenarios.  Yet, all it takes is one situation to get out of control and the scenarios above become reality.  The rational, fact based logical analysis of an outage is replaced by the panicking, irrationality and emotion of engineers and managers faced with the notion of job loss due to failure to prevent an outage disaster that had major reputational and/or financial impacts.

Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the first article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In the previous article, I outlined a particular IT engineering resource approach to participating in the root cause analysis process, entitled “Surprised and Confused”.  This article introduces “Openly Be the Hero”:

IT Engineering Participatory Approach B = Openly Be the Hero

“I know what happened, the temporary storage volume …..”

I will save the day with facts no one can refute!

This approach, which is diametrically opposed to the surprised and confused approach, comes with some different risks.  Standing up and sharing every technical fact you can get your hands on to point out what really is going on can backfire in exactly the opposite way from the surprised and confused option.  People will tend to latch on to the one spouting off all the undeniable facts, and suddenly the masses will see the one with all of the answers as the one who was in a position to have avoided the problem altogether.  As far as your management goes, if they aren’t on board, you’ve placed them in a difficult spot to be supportive if the tide turns towards the root cause being the hero’s perceived lack of involvement.  Your peers, fearing their jobs might be in some jeopardy, will most likely slink down in their chairs to remain quiet and allow you to stand tall to take the proverbial daggers of blame.

Now if you are one who has put in the extra energy to understand how the system or systems were constructed, and the “why” behind the seemingly architecturally backwards ways certain business processes are completed, you may struggle to avoid the hero trap.  You may be thinking: “The facts that I possess clearly indicate without compromise that what I know to be the root cause is the root cause.  Why can’t everyone just go with the facts and be done with it?”  Not everyone is comfortable accepting the facts even if they are the facts.  What if the facts suggest a particular individual or group of individuals has been linked to the last five system outages?  Maybe these five outages are legit and the individual or group is trying desperately to improve their system management activities.  The last thing they need is another problem piled on top of their previous problems to put further pressure on management to take some action.  In an effort to save their jobs and buy more time to get out from underneath their pile of problems, they can redirect the masses to focus on the hero’s involvement and thus take the heat off themselves.

“Let me understand, the Hero knew that this problem was going to happen but didn’t do anything to stop it?  Why is the Hero hiding knowledge that would help the company?  This is yet another example of the Hero not sharing and not partnering.  How can the Hero just sit idly by and allow this to happen?  Something needs to be done about the Hero …”

And this “something that needs to be done” … and get ready, this is going to make any logical thinking IT engineer’s head spin … could be as severe as disciplinary action against the Hero.  Why would the individual who amassed enough valuable knowledge to assemble all the puzzle pieces of the problem become the target of disciplinary action?  The answer rests more on the organizational hierarchy than on conventional logic.  If the individual uttering those statements about the Hero is significantly high on the organizational chart, then the layers below, who have been focusing on all sorts of other fires, are caught without a good story as to why this situation occurred and why the Hero is not the root of all evil.  Not being armed with a story that shields the Hero, the management layers in between are somewhat constrained and thus the blame lands on the Hero.  For more of the management side of the Hero’s plight, see the articles that cover this in the management section.

Sure, your peers might find you after meetings and give you kudos for standing up for the facts, but is being technically “right” worth the cost of being put through this ancillary pain?

The next article introduces the hybrid approach, which I’ve entitled “Play it Safe”.

Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the previous article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In this article, I outline a particular IT engineering resource approach to participating in the root cause analysis process.

IT Engineering Participatory Approach A = Surprised and confused

“There is a temporary storage volume?  Really?  What is it used for?  You mean it can fill up?  Wow … I’m surprised!  If someone had told me about the temporary storage volume I would have obviously blah, blah, blah”

What technology?  I'm supposed to know about this "technology"?

This approach is synonymous with playing dumb about the whole situation as well as about the technology involved in the outage.  If you haven’t experienced someone using this technique, you might be thinking to yourself: oh come on, can this really work, doesn’t this just make one come across as inept?  The answer to both questions is: absolutely.  So why does an approach based in ineptitude actually work?  I will put forth two arguments based on human thought processes and the pressures of time.

I’ve witnessed time and time again that the immediate reaction to the surprised and confused IT engineer is to get caught up in following that line of reasoning.  People by and large do not enjoy being surprised at work with questions about their performance, work quality or responsiveness.  I am not a sociologist, but in my experience, when people hear someone else claim to have been surprised and caught off guard by a work event, they immediately identify with the dread of being surprised themselves and implicitly support this individual’s claim of surprise and confusion.  Rarely have I witnessed the alternative reaction win out: “wait a minute … isn’t it your job to be responsible for the health of this IT service, so why are you surprised at how the service you are responsible for functions?”  If someone does start down this logic road, the confused aspect starts to take precedence over the surprise with “well, I obviously would have known and fixed the temporary storage problem if someone had told me about it … but I’m confused, who should have notified me ….”  The perception of the problem has again been shifted from the individual to some more nebulous area, suggesting there was nothing the individual could have done because this nebulous third party stood as a barrier to the individual doing their job.  The confusion can keep growing: “wasn’t there a project that was supposed to fix this alerting problem?  Wasn’t <insert random but related engineering group name here> working on this?  Isn’t there a ticket open with the vendor on this issue?  Wasn’t <insert random IT resource name here> working on a fix for this?”  The larger the organization, the more effort it will take to follow up on each accusation and check its validity.  And that validation gets more and more difficult as the word spreads that the proverbial witch hunt has begun:

<insert random IT resource name here>: “Working on a fix for that?  Yah, but that got assigned to what’s-his-name in the quality team.  You might want to follow up with what’s-his-face in testing services because I think they now are responsible for the quality team.”

So this option sure sounds great, since the perception is always that some external force that can’t be controlled is restricting one from doing one’s job.  The converse is implied: if the restriction didn’t exist, I would have done my job and there wouldn’t have been a problem in the first place.

Finally, for this option to be successful, you have to stay completely away from the problem resolution itself as much as possible.  If you are visibly involved in finding and fixing the problem or if you are poking around systems related to the problem and thus are appearing in audit and history logs, you can lose plausible deniability.  Someone could surface at the least opportune time and reveal your involvement in the resolution process, hence reducing your legitimate claim to surprise and confusion.

My suggestion: don’t pursue this option unless explicitly directed by management with some explanation of what role you are playing in the greater root cause exercise.  Why, since it seems such an easy out?  In one word: reputation.  You will quickly be branded inept by your peers, and management will see you as weak in the sense that you are junior, not worthy of being trusted with an important assignment and, finally, not promotion material; the last being the most difficult to undo if you have your heart set on a different job within the organization.  Ok, so you might be thinking that, right now, just staying where you are is fine.  That might be true right now … but what if some new project comes along or the company purchases some cool new technology that you might want to participate in in the near future?  The likelihood that you will get such an opportunity given your propensity to be surprised and confused about your job assignments is exceedingly low.  I will finally venture to say that when the company is facing difficult financial stress and the option of work force reduction seems imminent, guess who falls into the X percent category of people the organization can survive without: those that are perceived as surprised and confused by work.

In the next article I’ll outline the pros and cons associated with “Openly Be the Hero”.
