Project sponsor turnover can be handled smoothly

Hallway conversations and whispers in meetings have the grapevine quickly communicating the departure of a highly visible person in the corporation. “Did you hear Bob gave his two weeks’ notice?” “Yeah, any idea where he is going?” “No, I don’t think he shared that.” “Who is going to lead the big FlimFlam upgrade project now?” “Don’t know that either. It hasn’t been announced. Bob has been it for as long as anyone can remember.” “This could get very messy.”

I was reflecting on my participation in a large, multi-track, multi-phase, multi-year project some time ago. Safe to say, this was a big project involving substantial change across a variety of technology groups, products and business units. About a third of the way through the project, the day-to-day business sponsor left the organization for an outside opportunity. With the project well under way, a new sponsor was needed to step in quickly and keep providing direction to all the concurrent work streams.

Executive Leadership Steps In

The executive sponsor immediately started attending the regular program level status meetings. This provided much needed leadership. Thus, two big thumbs up for her participation. Instead of everyone looking around the table at each other wondering who was in charge, there was continuity in project leadership.

New Sponsor Arrives

The executive sponsor didn’t waste much time sourcing a new business sponsor for the project. After only a few weeks of drift, a new day-to-day sponsor was at the table. The executive sponsor gave a brief introduction and the new sponsor took charge. Following the introduction, it was clear to everyone that the new sponsor had wasted no time getting up to speed, even though he had no prior knowledge of the project and no subject matter expertise in its goals and objectives. He had already met individually with key stakeholders.

New Sponsor Sets the Tone

The new sponsor also gave a brief summary of the current state of the project and the major open issues, and summarized the strategic next steps. In summarizing the next steps, the new sponsor established immediate credibility, as the prior sponsor had seemed to struggle a bit with how to prioritize the cross-functional team’s focus for the in-flight work streams. All in all, the new sponsor, in the first formal meeting, established a strong confidence that had everyone leaving that meeting with a sense of enthusiasm that we were all in good hands for the remaining work ahead. The new sponsor clearly set the tone for project success.

So what turned this potentially negative situation into something that re-energized the project team?

  • Executive leadership presence immediately upon word the current sponsor was leaving the organization.
  • Executive leadership remaining visible and actively engaged through naming the new sponsor.
  • The new sponsor’s strong initial engagement and clear understanding of:
    • Project’s current state
    • Clarity surrounding open issues
    • Ability to articulate next steps.

Has anyone else experienced a positive project sponsor change? What contributed to the success of the leadership switch?

Earlier this week I was given the opportunity to share my “aha moment” when it comes to project management.  Below is a brief quote from my article submission:

“Like it or not, technical people need project managers. Yet, ask the average software developer, systems integration engineer or infrastructure engineer about what their company needs more of role-wise and they will almost without a doubt say: “We need more X’s. I’m buried in work!” Feel free to replace “X” with whatever job title applies to whom you are questioning. Software developers indicate they need more developers because the product owners keep asking for more features faster than can be developed.”

Feel free to read the article in its entirety here.


Can I get the impossible delivered next week?

How many times have you participated in this typical IT conversation?

Business: “How long is it going to take to enable that new FlimFlam engage-the-client-better module with those extra customizations we talked about?”

IT: “Last time we looked at that we said it would take a little over three months including changing the custom code in order to be able to use that new module.”

Business: “Well, we need it turned on by the end of next month when the marketing campaign starts.”

IT: “Um, err, change all the custom code, install the new module and enable it with those additional features in two not three months?”

Business: “Yes.”

The theme of this classic IT delivery challenge is the same:

In order to meet a communicated business need within a business defined time-frame, a perceived number of technical tasks need to be accomplished that don’t initially appear feasible in that given time-frame.

The initial reaction by most technically minded people is this is completely impossible.  And yes, Agile or other project delivery methodologies have built in capabilities to handle fixed dates and variable scopes.  But if you are faced with this common theme of questioning, I am going out on a limb and guessing you are not working in a truly Agile shop or else this question isn’t likely to be asked in this manner in the first place.

So, you rush out and grab your FlimFlam experts to communicate the need and the desired end date.  You tell them this is the most important thing to work on and then you go back to your desk to get ready for the next crisis of systems delivery.

Nope.

Consider taking the time to put together a work delivery estimate as your first priority.  Why do this seemingly futile exercise when the business has already stated what they need and when they need it by?

You need to have some estimate data in hand to have a conversation with the business on what is truly feasible within a pre-communicated time-frame.

This conversation serves a few purposes that are critical to you and your team’s ability to deliver a quality solution:

  • Establishes what in-flight work will be put on hold/delayed until this higher priority request is completed.
  • Durations of time on sub-tasks enable the business to prioritize what features they truly need versus those that are just “nice to have”.
  • Establishes a baseline so when other high priority requests come in or new feature requests are added, you can dust off the previous estimate, revise with the new needs, and re-engage in the conversation.

I can’t count the number of times I have chatted with a business sponsor that swore up and down they just had to have every item in their request delivered by an irrational date.  Yet when presented with some work estimate that indicated everything wasn’t realistically possible within the time-frame [based on the cumulative hours/weeks associated with their request broken down into granular tasks], that same business sponsor started cutting “low priority” features left and right to meet their date.

Business: “You mean it is going to take a week just to get that data on the screen within the application?  We can just use that old report that shows the same data on paper.  Scratch that feature.”

Technical/Engineering Challenge

The typical technical/engineer mindset applied to the theme of delivering unfamiliar technology within an aggressive (arbitrary?) time frame is to think it impossible given the number of unknowns.  Get ready for panic and shortness of breath from your less seasoned technical staff.  Giving them coaching and a framework to break down the seemingly huge and complicated requests into a logical sequence of executable tasks is the other side of the estimation challenge.

To help the technical resources break down “the work” into meaningful, estimable and negotiable chunks, I’ve attached a simple work estimation spreadsheet:

Sample IT Work Estimation Template 10-05-10 [xls]

Sample IT Work Estimation Template 10-05-10 [xlsx]

Below is a brief explanation of how the template works:

There are three tabs:

1.      Template = where one fills in the data to create an IT work estimate.

2.      Calculations = where certain values are used in calculations on the Template sheet.  Changing numbers here causes the whole estimate to change.

3.      Assumptions = certain assumptions, copyright and reference back to this article for explaining how the template works.

Template

The top portion of the template (red arrow below) is designed to capture the basics about the estimate to tell it apart from other estimates: Name, project, dates, etc.  Think of all the fields that would help you and your organization know what you are estimating.

The main section of the template is where the work break down and granular task estimating occurs.  Gray fields throughout are auto-calculated but can be typed over if needed.  The first task heading “1. Architecture Tasks” shown below is a section to capture the non-negotiable tasks that need to be executed in order for any business functionality to be delivered.  This could be getting servers installed or setting up a new project in MS Visual Studio 2039 or creating a new code branch for the required new features which involves an act of Congress in your organization.

The “description” column is for the estimator to describe, in low tech language, what granular task is needed.  Since one may need to sit down and discuss this with a non-technical business person, some effort to use language that is more descriptive and less heads down techie would be beneficial.

The “Low” and “High” are for the estimator to enter the approximate minimum and maximum hours needed to complete the task.  The “Average” column in gray is calculated as simply the average to be used for the roll-up calculations at the end of the template.  If there are a number of unknowns or the tasks could be quick but depending on X, Y or Z, might run long, use a wide range for “low” and “high”.  This also presents the business conversational element of “Well, if this step goes well, it could be as short as 3 hours.  But, if turns out we do need to engage the storage team and get more storage allocated, then this could take as long as 20 hours.”  This is useful in setting the business’s expectations around variability in the estimate so they don’t get too fixed on a single, implicit number when things don’t follow the “happy path”.
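As a rough illustration of the “Low”, “High” and “Average” columns, here is a minimal Python sketch. The class and field names are hypothetical and are not taken from the spreadsheet; only the 3-hour and 20-hour figures come from the storage example above.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    """One row of the estimate (hypothetical names, not the spreadsheet's)."""
    description: str
    low_hours: float
    high_hours: float

    @property
    def average_hours(self) -> float:
        # The gray "Average" column: the simple mean of low and high.
        return (self.low_hours + self.high_hours) / 2


# The storage example: 3 hours if all goes well, 20 if more storage is needed.
storage_task = TaskEstimate("Allocate storage for the new module", 3, 20)
print(storage_task.average_hours)  # 11.5
```

The wide gap between 3 and 20 is the point: the average feeds the roll-up, but the range is what sets the business’s expectations.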

The “Actual” column is useful for the engineer to record the actual time it took them to complete the tasks or the number of hours consumed up to the date the template is revised for some status reporting reason.  This also paves the way for estimators to get better at more evidence based estimating/scheduling.

The “Complete” column is for the task executor to record if they indeed completed the task or if it  still needs work to finish.  Entering a “Y” means the task is complete.  Blank or anything else means the task still needs work to complete.

The “Estimate Remaining” is gray and thus calculated as the remaining hours of the “Estimate Average” minus the “Actual” if the “Complete” column doesn’t have a “Y”.  If the “Actual” turned out to be more than the “Estimate Average” then the column is left blank.

The “Notes” column (not shown here) is for any task notes that help the estimator, or anyone reading the estimate, understand what about the task is driving the estimated hours.

The next section, called “2. Development of Functional Unit 1”, represents a block of work that rolls up to some business identifiable chunk of work.  Feel free to change the section name to reflect something all project stakeholders would understand.  These sections are designed to be the negotiable features that drive the business conversation about exactly which features are needed and the corresponding work durations.  Feel free to cut/paste as many new sections as needed to represent the requested work breakdown.

In the above example from the template, Sample Task 2.1 represents a task that was estimated to be 7.5 hours but actually took 4.5 and is complete.

Sample Task 2.2 represents a task with 18.00 hours completed so far, thus calculated is the remaining 5 hours from the 23 hour estimate.
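The “Estimate Remaining” rule and the two sample tasks can be sketched in Python as follows. This is only an approximation of the spreadsheet formula: the function name is hypothetical, and returning `None` for a completed task is my assumption about what a blank cell means.

```python
from typing import Optional


def estimate_remaining(estimate_average: float, actual: float,
                       complete: str) -> Optional[float]:
    """Return remaining hours, or None where the sheet would show a blank."""
    if complete.strip() == "Y":
        return None  # task is done; assumed to display as blank
    if actual > estimate_average:
        return None  # over the estimate; the column is left blank
    return estimate_average - actual


# Sample Task 2.1: estimated 7.5 hours, took 4.5, marked complete.
print(estimate_remaining(7.5, 4.5, "Y"))   # None
# Sample Task 2.2: 23-hour estimate, 18 hours logged, not yet complete.
print(estimate_remaining(23.0, 18.0, ""))  # 5.0
```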

The “Sub Total” represents the min/max of the total work effort (142.25 and 181.25 in this example, which is a full ~40 hour delta) for an average of 161.75 hours of which 65 have been completed and 101 hours remain against the average.  Thus, the business expectations can be set relative to the ~40 hour swing to help the planning for best and worst case delivery.
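The roll-up itself is simple arithmetic. The sketch below uses hypothetical task ranges and a hypothetical `sub_total` function, then checks the swing and average against the 142.25 and 181.25 figures quoted above.

```python
from typing import Dict, List, Tuple


def sub_total(tasks: List[Tuple[float, float]]) -> Dict[str, float]:
    """Roll up (low, high) hour ranges into totals, average and swing."""
    low = sum(task_low for task_low, _ in tasks)
    high = sum(task_high for _, task_high in tasks)
    return {
        "low": low,
        "high": high,
        "average": (low + high) / 2,  # same as summing each task's average
        "swing": high - low,          # best-case versus worst-case spread
    }


# Hypothetical tasks, purely for illustration.
totals = sub_total([(3, 20), (5, 10), (16, 30)])
print(totals)

# The sub-total figures quoted in the article.
article = sub_total([(142.25, 181.25)])
print(article["average"], article["swing"])  # 161.75 39.0
```

The swing is the number to bring to the planning conversation, since it frames the best and worst case rather than a single date.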

Below the “Sub Total” section are tasks that are relative to the overall IT work:

In this section, any standard deliverables or work associated with doing the actual development work can be captured.  In this example section, I added “unit testing”, which I calculated as a percentage of the sub total hours of development work above.  The percentage is pulled from the “Calculations” tab.  In this case, I am calculating that unit testing takes 80% of the hours estimated towards development.  You can add/remove entries or adjust calculations on the “Calculations” tab to capture the hours needed to deliver a quality solution, hours that so many developers forget to include when estimating their hard core development work.

The two documentation entries represent either a fixed amount of time, in this case 10 hours, or hours that are a percentage of the total development work, in this case, 20% of the total hours.  Feel free to add and subtract items that come up regularly to make the overall estimate more complete.  Need production turn over documentation?  Add an entry here.  Need some change control document to push a solution into the next environment?  Add an entry here.  Over time, this section will settle into capturing the work that is regularly needed in every project but is easy to forget to estimate for each time.
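To make the “Calculations” tab idea concrete, here is a small Python sketch of the overhead entries. The dictionary keys and function name are hypothetical, the 100-hour sub total is made up for illustration, and I am assuming both percentages are applied to the development sub total.

```python
import math

# Stand-in for the "Calculations" tab; the values are the ones quoted above.
CALCULATIONS = {
    "unit_testing_pct": 0.80,           # unit testing as a share of development hours
    "documentation_fixed_hours": 10.0,  # a flat documentation entry
    "documentation_pct": 0.20,          # documentation as a share of development hours
}


def overhead_hours(dev_subtotal: float, calc: dict = CALCULATIONS) -> dict:
    """Derive the standard overhead entries from the development sub total."""
    return {
        "unit_testing": dev_subtotal * calc["unit_testing_pct"],
        "documentation_fixed": calc["documentation_fixed_hours"],
        "documentation_pct": dev_subtotal * calc["documentation_pct"],
    }


overhead = overhead_hours(100.0)  # hypothetical 100-hour development sub total
total_hours = 100.0 + sum(overhead.values())
print(overhead)
print(total_hours)
```

Keeping the percentages in one place mirrors the template: change a value once and every estimate built on it moves with it.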

Lastly, the final section includes the total hours for all the work, which is especially useful in answering that initial question: “How long is it going to take to enable that new FlimFlam engage-the-client-better module?”  One additional element that helps beyond the “how many hours” is the “Total Work Days” calculation, which is based on a more realistic number of productive hours per day plus any reduction in time to cover other assignments that aren’t specific to project work, such as researching a new technology or working on a special assignment of some kind.  The calculations are in the “Calculations” tab.  In this example, the productive work day is 6 hours (not 8 as some might consider) and 80% of those 6 hours should go to projects such as this one.  Hence the “Total Work Days” is greater than simply 348.50 divided by 8.  Again, feel free to adjust these calculations to aid in matching what your resources truly can dedicate to a particular project.  Want to show the “cost” of assigning two concurrent projects to a single resource?  As an example, drop the hours to 3, add another 20% to cover the cost of “context switching”, and see how your estimates come to reflect reality a bit more.
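The “Total Work Days” idea can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the formula is my reading of the description above (total hours divided by the productive hours per day that actually go to the project), not a copy of the spreadsheet’s formula.

```python
def total_work_days(total_hours: float,
                    productive_hours_per_day: float = 6.0,
                    project_allocation: float = 0.80) -> float:
    """Convert estimated hours into work days at a realistic daily capacity."""
    effective_hours_per_day = productive_hours_per_day * project_allocation
    return total_hours / effective_hours_per_day


realistic = total_work_days(348.50)  # 6 productive hours, 80% on this project
naive = 348.50 / 8                   # assuming a full 8-hour project day
print(round(realistic, 1))  # 72.6
print(round(naive, 1))      # 43.6
```

Lowering `productive_hours_per_day` to 3 is one way to model a resource split across two concurrent projects.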

Additional Value

Once established as the “baseline” estimate, as new requests/changes come in, add them to the previous estimation sheet, change the date and quickly be able to predict when the overall solution will now be delivered.  Get ready for another “what is pushing out the date?” discussion armed with your estimating data.

Once established on multiple projects in the “portfolio”, now you can hold these up against competing resources to show “if project X goes first, when would I get project Y next?”  This is extremely handy if your resources effectively cost “zero dollars” in a non-charge back type corporate IT model.

Please give this template a try and let me know feedback on how effective it is with technical resources as well as a conversation tool with project sponsors.

For additional practical estimation articles, consider these I’ve written in the past:

Also, for an excellent article on using a similar technique on prioritizing multiple projects, consider this great post by Peter Kretzman, “The Practical CIO: Difficulties in project prioritization & selection, part 2“.

For additional caution in getting too carried away with the accuracy of your estimate, consider Todd Williams’s article “Good Estimates Only Have a 50% Chance of Being Made”.


For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why and how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

The urgency can return without warning!

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on avoiding the spinning wheel of blame.  This article considers how to approach the inevitable post outage “ruggedization” efforts.

So, you have survived with minor bumps and bruises from a service outage.  The service is now restored.  But does everyone just go about their regular work and forget this grueling event?  Nope … here come the “ruggedization” efforts.  I’ve covered one angle to the project involvement perspective in a previous post.  In summary from that post to set the tone for this extension: “ruggedization” projects tend to have strong support immediately following the outage but as time marches on and new problems and priorities pop up, the “ruggedization” effort loses momentum.  Strong resources move on to the problems of the moment and the challenges of the future leaving weaker resources behind to struggle to move forward on the “ruggedization” effort of the past.  What ultimately puts a nail in the coffin of the “ruggedization” effort is when real capital dollars need approval in order to buy new equipment and/or additional software licenses when many have forgotten the event ever occurred.

Thus you, as a manager, are faced with the strong potential for your team’s resources and your energy to get pulled into an effort that is likely to stall out eventually.  Walking away from these “ruggedization” efforts at the start will brand you and your team as ones that don’t partner with the rest of the organization.  Assigning your top engineers and keeping tabs on all the throes of the ensuing project process could put you at even greater risk of not paying enough attention to new and in-flight projects.  Thus you need a strategy to maintain a partnering perception while not losing your strategic focus.

Approach? In a word: balance

Balance in the sense that you need to balance your and your team’s involvement in the effort with the priority the rest of the organization is applying to the effort.  In the beginning, everyone will be running around with a sense of urgency about the effort and you need to be applying an equal amount of urgency.  Everyone external needs to sense that the urgency they feel is matched by the urgency you and your team exhibit.  But as you get a sense that the organization is beginning to lose momentum and participants are dropping off to focus on more urgent matters, begin to echo that same level of decreasing involvement.  Depending on your risk tolerance level, you can start pulling back at the first sign the others are doing the same.  I prefer a slightly more risk-averse approach.  I have found you get an even more concrete sense of the dissipating urgency as you directly interact with people who were demanding licensing costs, hardware estimates and testing schedules yesterday, but who noticeably balk at returning your calls, emails and IMs when you follow up with more questions today.

Plus, with this approach, if something goes wrong and everyone is brought back up to the level of urgency (example: the system shows signs of potential doom and gloom again), you have a steady flow of “pre-tasks” (introduced in this article) you can reference that have “you” waiting on “them”.  These “pre-tasks” help both with your interactions with external parties and with your management.  As external parties rush to get back into the hyper-urgent state and begin to thrash you and your team with requests, you have immediate responses that redirect them back to their world, allowing you to ramp back up more calmly.  The same applies to your management: as they hear things are picking back up, they want/need a sense that you and your team are on top of the situation.  Nothing conveys that message like having this at the ready when you are fighting the current fires and this old fire flares back up:

“The customer service quality team is asking where the “ruggedization” project is at?  Well, we are waiting on a quote for the two different server config options the platform team recommended to add capacity from IT procurement.  We have questions out to the enterprise support team for them to confirm what data they are looking to pull from the system logs that they said don’t have the info they need.  And finally, we are waiting on Testing Services to provide a performance testing window to test if the vendor recommended performance tuning settings will have any effect.  So, we are ready to re-engage, but right now, we are waiting on these items in order to proceed.”

Want to be even more proactive?  Then, as soon as you get wind the fire has re-ignited, email each of the contacts in the above example and check in on how they are progressing with your request.  Then you can add the following to your response to your management:

“In addition, I’ve pinged each of those groups to see if they need anything from us at this point.”

Being able to respond with this level of confidence further solidifies that you are on top of the situation.

Thus, in summary, by keeping a pulse on the level of involvement and urgency of external stakeholders and metering your and your team’s involvement to a similar level, you can maintain a sense of partnership with the rest of the organization.  In addition, if you are positioned with “pre-tasks” and an at-the-ready response to your management when the “ruggedization” effort goes from cool to cold to instantly hot again, you will satisfy their desire for confidence and trust that you are on top of the situation.  Maintaining both achieves the balance required to stay appropriately involved in the “ruggedization” effort without neglecting new and emerging requests for attention.


For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why and how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

Can you avoid the spinning wheel of blame?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on communication strategies with your management when the priority of the organization is to restore service at all costs, but not to neglect data critical to finding the root cause.  This article considers an even more challenging element … avoiding the spinning wheel of blame.

  1. What is the priority of the organization when it comes to systems outages?
  2. Does someone/group need to be blamed for the outage?

Is the priority to restore services as fast as humanly possible, but with an ever present fear of the inevitable “spinning wheel of blame” along the way?  If so, then you have your work cut out for you.  Hopefully this article provides some helpful tips for this most unpleasant IT cultural scenario.

Those working or having worked in an IT culture that embraces what I call the spinning wheel of blame immediately know to what I am referring.  It is that sense that as the duration of the outage increases, the need to cast blame on a particular entity for the cause of the outage increases proportionally.  This proportional increase results in significant downward pressure on everyone involved not to be remotely close to the impending blame assignment.  In the opposite culture, though an organization does not enjoy a systems outage, it takes a healthier approach like the one in a previous article: restore service quickly, but learn why the outage occurred in the first place so rational steps and associated investments can be made to reduce the likelihood of future outages.  In the blame culture, by contrast, the priority shifts to restoring service as quickly as possible while, at the same time, building a case to point the finger of blame as far from one’s team and one’s department as possible.  This is where throwing the technology vendor under the proverbial bus comes in very handy.  Look for more on this challenging dynamic for the engineering team dependent on a vendor in a future article.

So, you have made it this far and you are either still groaning at the thought of your most recent experience avoiding the wheel of blame in your organization or curious how such an unhealthy culture can actually manifest itself in IT which is known for embracing constant change and the bumps and bruises along the way.  Below is a modification to the two pronged approach I mentioned previously:

Prong One – Keep Your Team Focused

Similar to the approach in this article, identify team member competencies in juggling the multiple priorities involved in restoring service and gathering data, and manage accordingly.  But considering the wheel of blame element, coach the senior members to keep you abreast of the current buzz on who the wheel is pointing at before and after each major milestone in the restoration effort.  Instruct them to give you a heads up as soon as their confidence nears 80% that your team’s service is the likely root cause candidate, preferably before external parties catch on to this likelihood.  For junior members without leadership provided by a senior member, though it may come across as a little bit of micromanaging, step in frequently to get a pulse on their discoveries and remind them to inform you of incriminating facts prior to sharing them with a larger audience.  You’ll need to absorb enough of the pulse and data of what is going on to predict where the spinning wheel is pointing at any given moment and whether it could point at you as the next log entry is revealed.

Lastly, and very importantly, make sure you instill trust in your team that you have their back.  If they know the wheel exists and they get a sense you will throw them under the bus at any opportunity, they will quickly adopt unsupportive and counterproductive behaviors, making your job significantly harder.  Be prepared to go to bat for them when others might like to take the easy route and blame one of your team members in the blame assignment phase.

Prong Two – Keep Your Management Team Informed

While you cringe at the energy you have to expend to keep up with all the activities in flight plus the spinning wheel, your management is crossing their fingers with equal fervor that the wheel will land on someone else.  In addition to providing the information in the thematic format I proposed in this previous article, with each communication, consider including a “likelihood it is us at XX%” indicator, preferably at the top of each communication.  Strive not to have XX go from 5% to 95% within a five-minute communication string.  It is best to start with some assumed outage responsibility, since your team is being called into the restoration and root cause effort for a reason.  If data even smells like you might have some culpability, start showing it in the XX% indicator right away.  Nothing will grab attention like an XX% going from 50% to 60% to 70%, as this is a clear indication the wheel of blame is definitely spinning in your department’s direction.  This gives your management the opportunity to get involved, if only to be prepared to erect the blame shields.  Another positive to having your management get involved as the percentage increases is that they can give direction if they see fit.  You have most probably been heads down, focusing on the tactical.  Your management has had the opportunity to look more broadly at the situation and can provide some valuable feedback from this more external perspective.

In extreme cases, not keeping your management informed opens the door for the wheel of blame to land on you directly from them.  If you haven’t brought your management in early, then when something goes wrong procedurally or otherwise, you are going to have to explain retroactively.  Unless you have a great story, and more than likely you don’t, you leave your management little choice but to leave you out in the proverbial cold.  Whereas, if you have been in regular communication with them and they are interacting in some manner, then they are implicitly part of the situation, not abstracted from it.  Taking a harsher angle: you have removed their plausible deniability and significantly reduced the “surprise and confusion” opportunity as their out.

In the next article in this series … now that service is restored and a brief sense of calm has returned, how to approach the spirited post disaster “ruggedization” efforts.

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?


For both IT managers and engineers alike, it is the least desired activity following a system failure of some kind, coming up with the root cause.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
Fix ASAP, but don't miss the how and why?


The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on communication strategies with your management when the priority of the organization is to restore service at all costs, then try to find the root cause.  This article considers a less “at all costs” culture.

  1. What is the priority of the organization when it comes to systems outages?

Is the priority to restore services as fast as humanly possible, but with attention paid to what changes are being made, when they are made and what we are learning about the system along the way?  If so, you need a two-pronged approach:

Prong One – Keep Your Team Focused

Taking a quick assessment of your team members, you can probably quickly determine who can successfully balance competing priorities and who is overwhelmed when multiple goals are up in the air at the same time.  Those who have proven the ability to balance these competing priorities need only a minor reminder to weigh the urgency of getting things fixed against the need to capture the high-level steps taken, both those that lead to ultimate success and those that produce knowledge along the way.  Overly reminding these individuals of these goals will be perceived as micromanaging.  Thus, politely remind them, then proceed to monitor without nagging.  For those that have proven to struggle with the “troubleshooting 101” concept that value comes from both fixing the problem quickly and gathering knowledge during the fixing process, you will need to get more involved.  One approach is to link these individuals to more senior team members who can take the lead and leverage these less skilled resources as a personal support arm to their resolution efforts.  If you are unable to link these individuals to ones that do possess these skills, you will need to provide further instruction.  Here, what skilled team members would view as micromanaging is what these team members would view as helpful, clear and focused direction.  Consider a quick electronic template for individuals to use with columns such as:

Date/Time, Activity Performed, Knowledge/Result of Activity, By You or Other?

Example entries:

10/01/2009 8:00am, Joined troubleshooting conference call, nothing yet, <Team Member Name Here>

10/01/2009 8:10am, Restarted BLAH service, no change for the system still crashed under load, <Team Member Name Here>

10/01/2009 8:15am, Increased Available Threads in Thread Pool and Restarted BLAH service, no change for the system still crashed under load but the system supported 10k more users than before this change, <Team Member Name Here>

Also, consider pre-populating the template with other relevant data such as a drop down list of business units impacted or applications/services involved or other support groups involved.  Include as many pre-populate-able attributes as needed to assist you in strategizing your communications.
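If a spreadsheet is not handy, the same template can be sketched in a few lines of Python.  This is only an illustration under my own assumptions: the column names follow the template above, while the `add_entry` and `to_csv` helpers and the ISO-style timestamp are hypothetical choices, not a prescribed tool.

```python
import csv
import io
from datetime import datetime

# Column names taken from the template described above.
COLUMNS = ["Date/Time", "Activity Performed", "Knowledge/Result of Activity", "By"]


def add_entry(log, when, activity, result, by):
    """Append one troubleshooting step to the in-memory log (a list of dicts)."""
    log.append({
        "Date/Time": when.isoformat(timespec="minutes"),
        "Activity Performed": activity,
        "Knowledge/Result of Activity": result,
        "By": by,
    })


def to_csv(log):
    """Render the log as CSV text, ready to attach to a status update."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()


# Example usage with one of the entries from above.
log = []
add_entry(log, datetime(2009, 10, 1, 8, 10), "Restarted BLAH service",
          "no change, system still crashed under load", "<Team Member Name Here>")
print(to_csv(log))
```

Pre-populated attributes such as business units impacted could be added as extra columns in `COLUMNS`; I left them out to keep the sketch short.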

Prong Two – Keep Your Management Team Informed

If you feel somewhat tactically helpless, imagine your management’s level of helplessness in these situations.  If you are new to your management role, this is a great opportunity to demonstrate your leadership capabilities and build your management’s confidence and trust in you.  Structuring a communication frequency that provides timely, but not thrashing, updates of major milestones in the troubleshooting and root cause effort will go a long way to build that confidence and trust.  A theme sequence to consider is:

  • Reported problem, your team’s initial engagement (don’t forget to mention the urgency of your team’s involvement), other teams engaged, more details forthcoming
  • Quickly following, initial assessment of the systems and end users impacted, what they are experiencing, what the initial take is on what the culprit is, temperature check of the players involved, more details forthcoming
  • Major milestones of knowledge discovery or change in the reported problem from the last report, confidence assessment of next steps equaling resolution and root cause
    • Consider attaching your most recent template as an appendix/supporting material
  • Final resolution, root cause with confidence assessment, degree of involvement in the cause of the problem, next steps now that service is restored
    • Consider attaching your most recent template as an appendix/supporting material

With a two-pronged approach, balancing the directional needs of your team as they juggle competing priorities (factoring in their individual skill sets) plus an organized thematic approach to communicating with your management, you add considerable value to the root cause analysis process even though your hands are not directly solving the technical issues.

In the next article in this series … what if the priority is to restore services as fast as humanly possible, but under the overwhelming fear that the spinning wheel of blame has to land on someone for this disastrous event?

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?



For both IT managers and engineers alike, it is the least desired activity following a system failure of some kind, coming up with the root cause.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
Get the system restored at all costs?


The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the second focusing on interacting with your team.  In this article I’ll cover considerations on how to interact with your management during the outage and the crucial post-outage fact-gathering activities.

Considering you have had a hands-on engineering role in the past but have now transitioned fully into management, you probably remember the first major systems outage you participated in as a manager.  Now if you were managing a system that you had very recent hands-on experience working on, you probably felt more comfortable digging into error logs and debugging lines of code than communicating to outside stakeholders.  One of the most important stakeholders is your management structure.  If you have drifted from being the hands-on guy who knows all about the system to the manager, you probably have come to grips with not being able to immediately diagnose every problem and thus have to put trust in your team members (as mentioned in the previous article).  And most challenging, if you find yourself managing a service which you did not have a technical hand in designing and building, you cannot rely solely on your own brain power to dig into the problem and fix it without serious technical help.  Yet, in all the above situations, your role as manager requires having a solid understanding of what the problem is at any given moment, what impact the problem is causing and what steps are planned to make life grand again.

Keeping your management informed of what is going on in a manner which gives them the timely information they need to act at their level is crucial.  Keeping them in the dark while you and your team feverishly work the problem and restore services does not bode well for you being seen as a leader.  Also, you may need some assistance from your management chain when other groups are being impacted by your service outage and increasingly higher levels of their management start asking tough questions.  On the other side, sending your management details of newfound cryptic error log data every two minutes is going to produce a similar perception … your distinct lack of leadership.

I wish I could produce a single checklist of activities that would work for every organization, every culture and every one of your managers.  Rather, as I look back at past companies, managers and their associated styles and cultures, there is no one-size-fits-all.  Thus, instead of a checklist, I thought the best method would be to look at organizational attributes via a series of questions.

  1. What is the priority of the organization when it comes to systems outages?

Is the priority to restore services as fast as humanly possible, regardless of the steps taken?  If so, then the information flow up the chain should be geared towards creating confidence in your team’s focus on the urgency of getting things working.  At the same time, coach your team to look for every option to get it running and figure out the why later.

Experienced engineers know the best time to capture useful data is when the system is hemorrhaging error information during the failure.  In high-volume systems, restarting processes or rebooting systems clears a good portion of this invaluable real-time problem data from the crash.  This raises an obvious contradiction: if you are rushing to get things running at all costs, isn’t one of those costs the loss of critical data that might point squarely at the root cause?  The answer is “Yes”.  Thus, in your communications upwards, strategically work into the communication stream the notion that as the team is rushing, the ability to slow down and interpret data for root cause is being sacrificed.  That way, once everyone temporarily relaxes when the system is restored, then switches to why it crashed in the first place, you have a proverbial leg to stand on when there is a lack of critical data to support a real root cause determination.  Sure, the “I told you so” conversation is never pleasant.  What is worse is the “why didn’t you tell me” conversation.  Choosing the lesser of two evils, I would rather quietly and politely refer back to having mentioned the cost of rapid restore versus methodical data gathering first, and then restore, than say “Oh, um, yah, I forgot to mention that when we rebooted the box, we lost all the error logs in memory thus we have no clue why the service was taking up all the CPU.”

In addition, don’t neglect keeping your upward communication stream of urgent service restore in sync with your downward stream.  Know your team members’ approaches involved in the system restore and ultimately the root cause exercise.  You may need to help them refocus themselves on the priority of system restore at all costs.  Engineers tend to want to figure out the “why”, which could eat up precious time against the goal of service restoration.  Plus, they know what follows the restore, thus they naturally want to continue to be viewed as a knowledge expert.  They want to get their hands on as much data to process as possible to maintain that image.

In the next article, I’ll build on this theme of organization and culturally aligned approaches to management communication.

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?



For both IT managers and engineers alike, it is the least desired activity following a system failure of some kind, coming up with the root cause.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique”.  In this article I’ll cover how to interact with your staff during the crucial fact-gathering activities.

Always support your team


We all have had interactions with managers when a service or system we are responsible for in some capacity is not doing what it is supposed to be doing.  Rands has a recent post on his perspectives of a past manager, “The Leaper”, who abhorred excuses as an abdication of responsibility.  So what are the characteristics of managers that have effectively enabled staff to get through the service restoration and root cause analysis effort successfully?  What are the characteristics of managers that, by their approach, style, involvement or lack of involvement, have actually impeded the process which, in theory, should enable all involved to learn from the events and be better positioned for the future?

Do: Trust Your Staff

By and large, you have talented staff.  In general, they come in to work wanting to do a good job.  The ones that don’t or can’t do a good job you have either moved into a position where they can do the least harm or moved out altogether.  Maybe it is that architect that doesn’t have his head in the IT clouds and digs into the technical details.  Maybe it is that engineer that just can’t stop at knowing how only his piece of the system works but has assembled an exhaustive knowledge of the entire system as a whole.  Whoever it is, trust that, once pointed in the troubleshooting direction and reminded of the need to bring knowledge back to you and the team in order to strategize on how to communicate and act on it, they are doing their job.  Resist the urge to ping them every 5 minutes with “did you fix it yet?” or “did you find why it crashed yet?”  Nothing is more annoying in this situation than a boss hovering over your shoulder while you are trying to work.

This isn’t to say you completely ignore them.  Rather, meter your check-ins for status and make sure you ask if they need anything.  If the team is huddled in a cube for five hours without a break, offer to run and get some beverages.

Do: Run Interference so Your Team can Work

One thing you can definitely do while your team is feverishly trying to restore an ailing system or trawl through log data to see why it might have taken a turn for the worse is run interference for them.  Offer to be the external communicator.  So while they are working and feeding you bits of data, you can mull it over and craft carefully constructed emails that give outside stakeholders the impression your team is on the job, is giving this issue priority and has a handle on what went wrong, etc.  If there is a conference call where multiple parties are working together (or not, depending on your corporate culture), volunteer to be the voice for your team.  When stakeholders on the call are demanding updates or answers, have a volley of responses that keep those stakeholders informed yet buffer your team from wasting precious debug and analysis time updating the “root cause coordinator” so he/she can send out some high-level update on some arbitrary schedule.

Don’t: Let your Team Members get Burned Out

So you are trusting your team members involved while running interference for them, yet make sure you keep a watchful eye out for the signs of burnout.  Are team members starting to verbally accost one another?  Are team members pounding desks, increasingly using profanity or just plain staring glassy-eyed and frozen at a screen full of data for an extended period of time?  It is time to step in and try to break the tension.  Humor is a good technique to provide a few moments of distraction and levity to an otherwise stressful activity.  Or force a break: “Hey guys, put the conference call on mute, I’ll let them know we need a bio break … let’s assemble outside the restroom and get something from the snack bar on me.”  In extreme cases where this root cause exercise is extending for days, look to swap different team resources in and out.  If there is a test run of a possible break scenario that looks to be focusing on something less relevant to your team, find a junior team member to represent the team in the testing while you distract your senior resources with a break.

Do: Remind Your Team Members Their Efforts are Valued

While you are strategizing your next move and arguing with peer managers on who bears more blame for the outage, don’t forget to remind your team members involved that their efforts are valued.  Remember, as much as you hate the outage and post-outage activities from a management perspective, engineers want to be engineering new stuff, not involved in educating others on why the old stuff they built broke.

Do: Support the Team’s Collective Decisions

When you meet with the team to review the data and collectively agree on a result to communicate externally, stick by the agreed upon result.  Once communicated, make sure you show support for your team.  Don’t suddenly suffer from an attack of surprise and confusion when peers challenge your position (exaggerated bad example):

Peer Manager: “That can’t be right!  There is no way my team making those system changes in module X would have caused the whole system to grind to a halt.  It had to be your team updating the settings in module Y!”

You: “My team made changes to module Y?  I’m surprised!  Obviously my team made these changes without involving me.  Of course, if I had been informed of the changes I would have made sure they were fully tested first. <insert additional back pedaling and side stepping accountability here>”

Rather:

You: “Yes, those changes were made to module Y as part of a formal change process that was approved by the change team because the appropriate testing steps were signed off by the QA team.  I think we may collectively have a weakness in the overall system testing.  Maybe we should invest some time in determining if the testing we’ve been doing for some time now truly accounts for all the system changes over the last N months.  <target a more holistic problem rather than getting into a blame battle or worse, throwing your team in front of the bus>”

In the next article, I’ll shift the focus off your team and on to techniques to interact with your management.

Anyone have an example of a do or a don’t when it comes to how you are supported in these situations?  Anything a manager did that was helpful or hurtful during these events you can share?



For both IT managers and engineers alike, it is the least desired activity following a system failure of some kind, coming up with the root cause.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?
What if you approached all facts as only 80% accurate?


Since the business folks had their otherwise perfectly aligned piles of non-IT work completely interrupted by IT, they start the root cause analysis process by contacting that person in IT that represents their “relationship” with IT.  That IT person is usually high enough in the management hierarchy that the need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  As each camp is forming, one underlying theme prevails: no one wants to be the individual that broke the system and no manager wants to be responsible for that person and thus the breakage itself.  This series of articles looks at this challenging exercise from an engineering management perspective.

Now, if you have been in an IT management position for some time, you have probably developed a system to have your staff gather for you the various facts and suppositions by the various players, build relationships where you can contact other managers “offline” and get their take on what is going on and have developed a system to inform your management of status as the events unfold.  Hopefully you haven’t proverbially been set on fire many times prior to having developed such a system of story crafting and information sharing.  I, myself, having moved from engineering into management without a mentor or coach … well … I thought about getting a pair of asbestos underwear many times.  The next few sections will offer different perspectives for developing such a system as well as interacting with peer managers that adhere to a particular style in outage situations.

Why do I need my staff to continually provide me with facts or data?  I have all the information I need to go off and establish my position!

My 80% Accurate Technique

I fell into the trap of taking seemingly factual information as absolute fact many times.  Armed with the initial information provided by a trusted engineer plus my own double checking, off I would zoom to defend my analysis.  I would clearly articulate my analysis with conviction at a root cause meeting only to find I was completely unaware of a series of parallel events that had taken place, leaving me sitting with only half the story and half my original credibility.  I found that a more credibility-strengthening approach is to assume all of the information you have gathered is at most 80% accurate.  No matter how concrete the data is, such as a standard OS-level error message indicating a volume is out of disk space, always assume that it is 80% accurate.  As an example:

“The temporary storage volume was out of disk space according to the OS error message thus that is why the service crashed.”

It sounds rather concrete that a service that requires a location to store temporary data would fail due to not being able to store data in that location.  But, note how the below comment from another in the meeting can seriously erode the credibility you had when you made that statement:

“But Infrastructure had that project to move all non-mission critical storage over to the UltraCheapO disk volumes.  The service’s configuration to use the UltraCheapO volumes was made last week where there is plenty of storage.”

At this point you have to backpedal quickly because you weren’t aware, and seemingly whoever on your staff you pulled information from wasn’t aware, of this project that magically changed the environment.  You may be tempted to fall back on being surprised and confused by this information, but as I’ll explain a bit later, this is an absolute last resort.

Now, using the OS level error message example and assuming the information is only 80% accurate, rephrasing the same message in the example below allows for a more graceful handling of unplanned follow-ups:

“My team’s understanding is the service uses the temporary storage volume to write data and if it can’t write, it crashes.  Does that OS error message indicate that the storage the service is using was full?”

“My team’s understanding” allows for some ambiguity in both the accuracy of the information as well as personal ambiguity connected to you, as a manager, in having a plausible reason not to be 100% in the know.  Thus, if you have to eat your words later, you have some face-saving opportunities to shift blame to communication challenges between yourself and your team members in the proverbial rush to analyze the outage after quickly restoring service.  “Does that OS error …” asked in the form of a question allows others that might be responsible for support components, such as temporary storage in this example, the ability to respond in a non-defensive, non-threatened manner.  By using a direct statement of the probable root cause, others are immediately put on the defensive.  Rarely, if ever, once put on the defensive, does someone raise their hand and admit “why yes, it was completely my team’s fault, we completely dropped the ball on this one.”  Whereas forming the question rather than the direct statement allows the guilty party to respond in a less defensive manner.  Don’t be surprised if the response is along the lines of “surprised and confused”, with a post-meeting revelation that indeed that was the root cause, communicated in an even more muted manner.

In summary, the 80% accurate technique allows for the very likely possibility that you don’t have 100% of the facts pertaining to the matter and gives peers an opportunity to save face in the very likely event they are indeed responsible for the outage.  By applying the 80% accurate technique as a mindset that permeates all of your fact gathering and meeting/peer interactions, you engage in a more collaborative manner that allows both yourself and your peers ample opportunity to save face when new facts upend the current flow of root cause analysis.

In the next article, I’ll look at how to interact with your staff during the crucial fact-gathering activities.



Anyone that has had to participate in a meeting to determine why some IT system went down is echoing a collective groan as they read this title.  For both IT managers and engineers alike, it is the least desired activity following a system failure of any kind.  Business and/or product owners outside of IT are waiting, after the dust settles and the system is restored to working condition, to have primarily two questions answered:

  1. Why did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

In the first article, I outlined the business context of the root cause analysis exercise in general and the complexities in clearly and logically arriving at a true root cause for a system outage due to the interconnected players involved.  In the previous article, I outlined a particular IT engineering resource approach entitled “Openly Be the Hero” to participating in the root cause analysis process.  This article introduces “Play it Safe”.

IT Engineering Participatory Approach C = Play it Safe

Play it Safe!


Having seen the potential pros and cons of approaches A and B, I assume you are wondering whether there is any way to play the root cause situation safely.  There is, but you are going to have to put your engineering brain and ego on hold a bit.

Step 1 = Resist the urge to be either “surprised and confused” or the Hero.

At the onset, avoid meetings, emails, hallway conversations and basically any situation that might put you in a position to start down the road of approach A or B.  Reply with vague “I’m not sure.  I think we are still looking into that.  I am waiting on <whatever>, let me get back to you” type answers.

Step 2 = Get with your management ASAP and give them a full run down of what is going on, the situation and the players involved.

As succinctly as humanly possible, state the problem: “we may very well be part of the root cause for this outage”.  That should get management’s attention very quickly.  Follow up with “here is what I know, stop me if I am going too fast or you already know all this” and then quickly and briefly step through the problem, clearly indicating where you believe/feel/think each “fact” ranks in authority.  In other words, don’t claim something is a fact unless you hold a log file printout in your hand that date and time stamps what you are saying.  “The commonly held view is the temporary storage volume filled up before anyone could purge files as the system needs, etc., etc., etc.”  “I am 50% confident based on this log entry that disk space was an issue.”  Be prepared to be stopped and asked all kinds of questions pertaining to how you know this, from whom, who else knows, etc.  What is happening is management is starting to build the story of what is taking place factually, the black and white versus gray-ness of those facts and how all the players are positioned to take the blame.
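One way to prepare for this run down is to jot each claim down with its confidence and its evidence before you walk into your manager’s office.  The Python sketch below is only an illustration under my own assumptions: the `Claim` record, its field names and the two labels are hypothetical, not part of any tool.  It calls a claim evidence-backed only when something like a timestamped log excerpt is attached, and lists the strongest claims first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    statement: str
    confidence_pct: int              # your own estimate, 0-100
    evidence: Optional[str] = None   # e.g. a timestamped log excerpt; None if hearsay
    source: str = "unknown"          # who or what the claim came from


def label(claim):
    """Mark a claim as evidence-backed only when something concrete supports it."""
    return "EVIDENCE-BACKED" if claim.evidence else "UNVERIFIED"


def briefing(claims):
    """List the claims for management, highest confidence first."""
    ordered = sorted(claims, key=lambda c: c.confidence_pct, reverse=True)
    return "\n".join(
        f"[{label(c)} {c.confidence_pct}%] {c.statement} (source: {c.source})"
        for c in ordered
    )
```

A claim such as “disk space was an issue” at 50% with a log entry attached would come out as evidence-backed at 50%, while the commonly held view with nothing behind it stays unverified.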

Step 3 = If management doesn’t define your role and thematically what to say and not to say, suggest your role and seek confirmation

Equally as important as step 2, confirming how management wants you to proceed is critical.  If you complete step 2 but then go off and start “Being the Hero”, you will be susceptible to all of the cons associated with being the hero.  Rather, if you are going forward and executing your role under the clear direction of management, as long as you indeed execute and seek clarity when an unclear situation presents itself, it will be exceedingly difficult to fall victim to the cons associated with “Being the Hero”.

Step 4 = Execute your role and keep management informed of major milestones

Go forth and help the post outage root cause investigation effort always being mindful of your role as indicated by your management.  As you are made aware of “major” milestones, make sure you go back and update management as soon as possible.  The timely updates directly assist in reshaping the story and may be accompanied by some tweak in direction to your role.  “Major” represents any event or new information that changes the shape of events.  “Bob in infrastructure just shared that the daily disk utilization report was indeed showing a reduction in free space for the last two weeks” = share ASAP.  “Bob just shared he forgot his lunch at home” = ignore.  Yes, these are rather obvious examples of what to share and not to share, but the goal here is to develop your own system for listening strategically to all the information that is being shared in order to parse out the noise and direct significant facts back to management.

In summary, the approaches and recommendations here may seem a bit extreme to many.  If you are lucky enough to belong to an organization that is culturally rational and fact based, you may not be forced into these scenarios.  Yet, all it takes is one situation to get out of control and the scenarios above become reality.  The rational, fact based logical analysis of an outage is replaced by the panicking, irrationality and emotion of engineers and managers faced with the notion of job loss due to failure to prevent an outage disaster that had major reputational and/or financial impacts.

