As a manager of a team of IT engineers, one of the toughest challenges is getting a handle not only on what everyone is working on, but also on all the seemingly random sources of work coming at your team. Whether you find yourself managing a new team, or have been managing a team for some time but are constantly surprised by new requests out of left field, you may want to consider constructing a logical approach similar to the one outlined in this series of articles to stop the surprises.


You need to be out in front of the Business and IT

In the first article, we identified the work request attributes of your team and built a list of sources of those requests. In the previous article, we used a great post by Peter Kretzman to work through a more complicated example: predicting how much project work your team can reasonably accomplish in the future.

In this article, I’ll apply the same approach to the “Projects requesting engineering support” request attribute, which represents work scenarios where you and your team don’t represent the entire technical solution to the business need. Rather, you and your team contribute a portion of the technical solution to another group that owns the final IT business solution. One example would be developing a software module that is used within a large software application you and your team don’t own. Another would be where you and your team own a shared IT infrastructure or application asset/service that has to be integrated into a larger, more holistic IT solution.

Getting your team engaged is challenging from a predictability perspective, to say the least, when a business problem has been identified and another IT group has been asked to map out a solution that may or may not involve your team’s asset/service. In these situations, the IT “solutions” group rarely engages you and your team proactively when they get the slightest inclination that a business problem they are aware of might need integration services from your team. More than likely, since they are under pressure to provide initial feedback and high-level estimates, the required due diligence is cut short. The solutions team falls back on past work your team completed for a somewhat similar project (which never turns out to be that similar) to guesstimate your team’s involvement in their half-baked solution.

If you are reading this far and identifying with being more aligned with the solutions group than one providing integration services, then you probably want to focus more on the previous article on capacity planning. If you are identifying with being an integration provider, then keep reading.

If you rely solely on the solutions group as your complete end customer, you will always be playing catch up. Arguments such as “If you had engaged my team earlier in the project life cycle/design phase, we could have partnered to map out a more viable solution” fail to work for some of the following reasons:

  • Solutions teams will prefer to bypass the “lack of engagement” discussion rather than participate, knowing they indeed didn’t engage your team. What will they do? They will start designing solutions that don’t involve your team’s service. Since they control the requirements, the knowledge and the engagement of peer groups, they can orchestrate a solution that doesn’t involve your team’s service. If you start losing “customers”, at what point will the organization no longer see value in your services and your team in general?

  • If a solutions team does participate in a “lack of engagement” discussion and promises to engage you and your team earlier in the process, what is the likelihood that the same team or, more importantly, the leadership of that team will be intact come the next project? You have achieved some level of “I told you so” agreement, but if the players change on the next project, what does that do for you?

  • From a management perspective, how many times can you tell your management that you gave the solutions group the “I told you so” speech, to no avail, before your management begins to wonder why you are stuck being reactive and haven’t put forth a plan to be more proactive?

So, your options with the solutions group seem limited. You need to provide them with engagement criteria, process and procedural info geared to get them to engage your team ASAP for new solutions. You need to have simple contact lists so they know who to call. Also, as you run into lack of engagement scenarios, you need to participate in some level of “lack of engagement” discussion, but with a partnering focus, not an “I told you so” focus. You need to provide all this so you can feel confident you are doing everything you can to empower the solutions group to successfully engage your team. In addition, since the solutions group is your choke point for information and knowledge, you need a strategy to get out in front of the solutions group to level the knowledge playing field.

Get out in front? But doesn’t the solutions group have the organizational charge to successfully interface between the business and IT? Don’t they have the project management heavy resource pool and skill set? Aren’t they the first people the business contacts when they think they need an IT solution? Yes, but as mentioned above, if you don’t make inroads into the business camp, you will forever be playing catch up.

How can you get out in front? Make a strong effort to introduce yourself to all the business folks that commonly engage the IT solutions groups. Bring a very simple and brief one page list of the services your team provides, in business language not IT language, to leave with them. Get them to describe their role in the organization (product manager, application owner, etc.). Try to get them to share their product or service road map. Ask them about their strategy for upgrades, enhancements, changes in the industry that they are reacting to, etc. Your goal is not to replace the solutions group; rather, it is to ask the questions you wish the solutions group would ask on your behalf but never seems to. Make sure you stress you are not there to say the solutions group is failing, but rather that you are trying to help the organization, as a whole, better align IT services with business needs. I think you will find that setting the correct tone will open a floodgate of information to help you better understand how your services will most likely be needed in the future.

Now that you have some techniques to get out in front of the requests coming to your team, time to put together a way to capture that data in a meaningful way that will better predict when you will be engaged in work. The next article will show a modified spreadsheet for collecting this information in a metrics and data focused format for more effective resource planning.


Project Portfolio Management Lite



In the first article, we identified the work request attributes of your team and built a list of sources of those requests.  In the previous article, we covered an example of using the work capacity metrics to support a frequent management question around the lack of project work progress for a team with additional production issue support responsibilities.

So, we have seen how work capacity metrics data can be used to capture the impact of competing priorities from difficult to predict production issue support. Let’s build on that with a more complicated example: predicting how much project work your team can reasonably accomplish in the future.

A common question to an engineering manager that will be repeated over and over is: “We need to accomplish X and Y and Z … oh, and we need to complete the first Q milestones of project L … can we do all this by the end of the week/quarter/year?”

A completely valid question, but one that is incredibly challenging to answer. Clearly, the requestor is asking for some degree of confidence that all of this work can be accomplished or, if it can’t, what can reasonably be expected and to what level of completion. Based on the answer, the requestor may consider re-prioritizing the work, reducing “nice to have” features/capabilities or determining a completely different, potentially non-IT solution to a given problem. Given an answer of “I don’t know” or “I can’t tell you” or, worse, “yeah, sure we can” without any shred of data to support that claim, the organization as a whole is already heading for a disaster of some magnitude.

Now, any IT project manager reading this is going: “hey, I do this for a living, let me take over …” I agree. What we are embarking on here is a slimmed down form of project portfolio management. From the engineering manager’s point of view, you have a fixed set of resources. You have work requests you have little to no influence over that you have to support. You also have work over which you have more influence, but you probably have more of it than you can reasonably staff with a simple: “Hey, Bob, go dedicate 100% of your time to Project X and come back when Project X is done for your next assignment”.

Feel free to type “project portfolio management” into Google and be prepared for an onslaught of material to sift through. Try the same in Amazon or even the shelves of your local bookstore and you will find mountains of material on theories and approaches to this topic. But of all the material available, for someone who is looking for a quick and simple way to put together a mechanism that uses some loose metric data to predict the future, I highly recommend Peter Kretzman’s post on his blog “The Practical CIO: Difficulties in project prioritization & selection, part 2”.

In essence, you need a tool that combines the metric data you have with the work estimates you extract from the requestor to produce a map of the likelihood that, given your fixed resources and the work requests you cannot influence, you can complete some portion of the work requests by some arbitrary date in the future. Stated another way, what you are trying to predict, with data you can explain, is the intersection between the number of resources you have, minus the percentage of time they spend on non-project work, and the number of projects and their associated work duration estimates. If you take on too many projects at once, your resources will be spread so thin they will predictably miss deadlines on all of the projects. If you take on only a few projects but the work is lengthy and sequential in nature (waterfall application development comes to mind) for individual resources, then you will have idle resources but hours and hours of work ahead of the few engaged resources. As I mentioned above, Peter has put together a straightforward explanation with a simple MS Excel template you can start using immediately, which I highly recommend you explore.
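The "resources minus non-project time versus project estimates" comparison can be roughed out in a few lines of Python. This is only a minimal sketch and is not Peter Kretzman's template: the function names, the team size, the 30% non-project share and the project estimates are all hypothetical values chosen for illustration.

```python
def project_capacity_hours(engineers, working_days, hours_per_day, non_project_share):
    """Hours left for project work after non-project work is taken out."""
    return engineers * working_days * hours_per_day * (1 - non_project_share)


def fit_projects(capacity_hours, projects):
    """Walk the projects in priority order and fund each one that still fits.

    projects is a list of (name, estimated_hours) pairs, highest priority first.
    Returns (funded_names, deferred_names).
    """
    funded, deferred = [], []
    remaining = capacity_hours
    for name, estimate in projects:
        if estimate <= remaining:
            funded.append(name)
            remaining -= estimate
        else:
            deferred.append(name)
    return funded, deferred


# Hypothetical quarter: 5 engineers, 60 working days, 6 effective hours a day,
# and 30% of the team's time going to work that is not project work.
capacity = project_capacity_hours(5, 60, 6, 0.30)

# Hypothetical estimates gathered from the requestor, in priority order.
requests = [("X", 500), ("Y", 400), ("Z", 300), ("Project L milestones", 250)]
funded, deferred = fit_projects(capacity, requests)

print(f"capacity: {capacity:.0f} hours")
print("likely to complete:", funded)
print("needs negotiation:", deferred)
```

A single pool of hours ignores who has which skills and which work has to happen in sequence, so treat the result as a conversation starter for the prioritization discussion rather than a schedule.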

After exploring, you should be in better shape to handle the negotiations that ensue when project work requests exceed your team’s delivery capacity.

In the next article, I’ll apply the same approach to the “Projects requesting engineering support” request attribute.



Do you want to go in front of your boss with just your gut?


In the first article, we identified the work request attributes of your team and built a list of sources of those requests.  In the previous article, we dove into team capacity metrics to put some data to the hunches around who is working on what for how long.

So, you have your work request attributes and you have the beginnings of a single model of how your team does work with some numbers to show your team’s real work capacity.  Now let us tackle some way to use those numbers to put some data against those attributes.

Remember from our first article the sample attributes we compiled and determined the level of control or influence you have over the request flow?

Influence:

  • Vendor product end-of-life = vendor products that you have purchased and currently use that will ultimately reach a vendor dictated end of life date by which you can no longer get support from the vendor thus you should/have to upgrade
  • Vendor product upgrades = instead of the vendor forcing an upgrade due to a product’s end of life, you may need to upgrade a product earlier in order to take advantage of new features and functions, etc. that you or someone else desires
  • Service Strategy = the IT service that you provide to the organization needs to keep up with new demands on the features you provide externally as well as the time/energy spent keeping the service functioning internally.  This is greater than just “upgrade to the latest version” but could entail workflow changes, process or procedural changes, etc. that mean some level of work impact to your team.

No Influence:

  • Production issue support = ad hoc requests from outside your team for subject matter expertise to assist with resolving system production issues or ad hoc “consulting” such as assisting peers with solving a problem based on your team’s expertise
  • Projects requesting engineering support = business sponsored projects that require some resource assistance from your team

Now is where your capacity metrics come in handy to try and be more predictive around the work you have no real influence over.

Production issue support

Looking back at the numbers you have collected about your production issue support, what can you say? Well, for a given time period, say an average 5 day work week, you should be able to make a data based claim as to the percentage of time your team as a whole spends on this activity. Is it 5% of your team’s total capacity or more like 30%? Taking the time captured per the previous article’s approach, you should be able to determine a percentage of total time that is supported by data and not just your gut. Relying on data rather than gut alone doesn’t detract from the value you place on your gut instincts; rather, the data should reinforce your notions concerning the impact of this service that your team has to provide. Consider this management exchange:

Gut Only:

Big Boss: “Why can’t we make consistent traction on the FlimFlam Upgrade project? Each monthly status meeting seems to show some progress, but never the progress predicted the previous month.”

You: “Well, we can make some traction but our resources get distracted with production support issues that take precedence.”

Gut Reinforced by Capacity Data:

Big Boss: <same question>

You: “Well, we can make some traction but our resources get distracted with production support issues that take precedence.  In fact, over the last N months, in tracking the hours the team devotes to production support issues, the data says the team, on average, is spending 30% of their time focused on production support issues.  Thus, those hours are not available to the FlimFlam Upgrade project.”

I think everyone would agree that the second response has more credibility because it backs up the “gut” with data that puts real context around the impact of the higher priority work requests. Further, it suggests that only 70% of the work projected for the FlimFlam Upgrade project can actually be accomplished, which closes the gap in the Big Boss’s question around work not getting done. Said another way, a conservative maximum of 70% of the next month’s predicted work should be expected to be accomplished. So, now you can add a proactive element to your response to the Big Boss’s question. Consider this addition to the above response:

“I would like to propose, unless you disagree, that I contact the project manager on the FlimFlam Upgrade project and ask him/her to forecast the month forward work projection and then a second forecast showing only 70% of that projected work getting accomplished.  That way, we all collectively will have the best projection of how much work will most likely get accomplished given the competing priorities of production support work.”

The data backed gut with proactive forecasting response presents a much more comprehensive answer to the Big Boss’s question than just gut alone.  Plus, going forward, at the FlimFlam Upgrade project meetings, the progress (either greater than or less than 70% in this example) can be tied to the impact of higher priority work requests rather than some nebulous unknown reason that could be left to the interpretation of stakeholders.  More work than the 70% got accomplished?  It must be less production support work.  Less work than 70% got accomplished?  It is more likely to be a spike in production support and not ineffective resources or your weak management of those resources.
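The arithmetic behind the "30% on production support, so plan for 70%" claim is simple enough to sketch in Python. The function names and the hour totals below are hypothetical, chosen only so the result matches the 30% used in the exchange above; your own tracked hours would replace them.

```python
def production_support_share(support_hours, total_hours):
    """Fraction of the team's tracked hours that went to production support."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return support_hours / total_hours


def discounted_forecast(planned_hours, support_share):
    """Project hours likely to be delivered once support work is taken out."""
    return planned_hours * (1 - support_share)


# Hypothetical totals from the last N months of time tracking.
support_hours = 432
total_hours = 1440
share = production_support_share(support_hours, total_hours)

# Hypothetical month-forward projection from the project manager.
planned = 200
realistic = discounted_forecast(planned, share)

print(f"production support share: {share:.0%}")
print(f"forecast: {planned} hours planned, about {realistic:.0f} hours realistic")
```

Showing both the full projection and the discounted one side by side is what lets the monthly meeting compare actual progress against the 70% line.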

In the next article, I’ll apply the same approach to the “Projects requesting engineering support” request attribute.



Real work capacity in a work day = 5 or 6 hours?  Really?


In the first article, we identified the work request attributes of your team and built a list of sources of those requests. In the previous article, we combined these two lists into a single model of how your team does work. In this article, we will dive into team capacity metrics.

So, you have your work request attributes and you have the beginnings of a single model of how your team does work. Now let us start to put some numbers together to show your team’s real work capacity.

Easy, each engineering resource works 40 hours a week; done.

Or … how about getting a bit more sophisticated:

40 hours – 3 weeks of vacation – 8 holidays = 36.5 hours per week or 7.3 hours per day.

Great … we are done … this was easy!
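For anyone wondering where 36.5 and 7.3 come from, the calculation can be reproduced as below. This is a sketch that assumes a 52-week year, a 5-day week and 8-hour days, none of which the formula above states explicitly, though they do reproduce its numbers.

```python
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52   # assumption: the formula does not state the year length
DAYS_PER_WEEK = 5     # assumption
HOURS_PER_DAY = 8     # assumption: one holiday costs one 8-hour day


def average_weekly_hours(vacation_weeks, holidays):
    """Average hours at work per week once vacation and holidays are spread over the year."""
    yearly_hours = HOURS_PER_WEEK * WEEKS_PER_YEAR
    time_off = vacation_weeks * HOURS_PER_WEEK + holidays * HOURS_PER_DAY
    return (yearly_hours - time_off) / WEEKS_PER_YEAR


weekly = average_weekly_hours(vacation_weeks=3, holidays=8)
daily = weekly / DAYS_PER_WEEK

print(f"{weekly:.1f} hours per week, {daily:.1f} hours per day")
```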

Well … there are whole companies focused on delivering this number to an organization based on their proprietary way of taking into account inefficient processes, status reporting, change traffic and group/team/feedback meetings. This article contends that 6.5 hours a day is what resource capacity planning tools generally use as a guess for effective work hours in a day. The very same article then challenges 6.5 as being too optimistic and suggests 4 to 4.5 hours a day is more probable. Something tells me that if you were to approach your management with a claim that your team is really only able to do 4 to 4.5 hours of work in a given business day, you would be on a fast road to picking up your last paycheck.

But at the same time, to say your team is productive on engineering activities 8 hours a day or even 7.3 hours a day would be equally unrealistic.  So how does one arrive at a plausible number?  I will venture to say it involves one part science to one part “guesstimation”.

Science:

Do you have any sort of time reporting/tracking system? If you do, you are in great shape as far as useful data is concerned. Before you go digging for metric gold, you should consider pinging each team member to see how they actually approach time entry. Do they just put eight hours every day without any delineation for different projects or meetings or even lunch? Make sure you understand how your team is entering time before making any assumptions about the data you are mining. You may find it helpful, after chatting with each team member, to create a time entry guide so you can get everyone pointed in a similar direction. After handing out the guide, start re-checking the time entry data to see if everyone is consistent or if you need to revise the time entry guide to course correct.

Guesstimation:

So you aren’t lucky enough to have a time tracking system of some kind already in place. You could try to whip up something in, say, MS Excel and collect files from everyone each week. Or, you could go find an open source product such as WR Time Tracker* or My Time Card*. Or, instead of having to figure out a way to convince your team you really are trying to do a more scientific capacity planning exercise with real data, versus being “The Man” and wanting to track their every trip to the restroom and the coffee station, you could make an educated guess. Try this:

In a given month there are 20 working days on average; thus, at 8 hours a day, someone could be working 160 hours in a given month.

Catalog every team member’s allotted vacation time, holidays and sick days or whatever allows an employee not to be at work.  Average across your team members (since we are guessing) and compute to total hours.  Subtract that number from 160.

Do you have a team meeting every month?  Subtract one hour per month.

Do you have a department or large group meeting monthly or bi-monthly?  Subtract that as well.

You probably have other meetings for human resources functions or life safety or company training or design reviews or architecture reviews or committee meetings: determine those monthly and subtract as well.

Don’t forget any external training, conferences, seminars, etc., average and subtract those as well.

You may be surprised to find, when you sit down and figure out all of the “corporate distractions” from doing actual hands-on engineering work, that you end up with somewhere between 5 and 6 hours of concentrated engineering work time per day.
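The subtraction steps above can be strung together in a short Python sketch. The 20 working days and 8 hours a day come from the text; every deduction figure below is a made-up placeholder, so substitute the averages you catalog for your own team.

```python
WORKING_DAYS_PER_MONTH = 20   # from the text
HOURS_PER_DAY = 8             # from the text


def effective_hours_per_day(monthly_deductions):
    """Average daily hours left for hands-on engineering after monthly deductions."""
    gross = WORKING_DAYS_PER_MONTH * HOURS_PER_DAY   # 160 hours
    net = gross - sum(monthly_deductions.values())
    if net < 0:
        raise ValueError("deductions exceed the hours available in a month")
    return net / WORKING_DAYS_PER_MONTH


# Hypothetical per-person monthly averages, in hours (placeholders, not measurements).
deductions = {
    "vacation, holidays and sick days": 20.0,
    "team meeting": 1.0,
    "department or large group meeting": 2.0,
    "HR, life safety, training, reviews and committees": 17.0,
    "external training, conferences and seminars": 8.0,
}

per_day = effective_hours_per_day(deductions)
print(f"effective engineering hours per day: {per_day:.1f}")
```

With these placeholder numbers the result lands inside the 5 to 6 hour range; a different mix of meetings and time off will move it, which is the point of doing the exercise with your own figures.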

Now, are you wondering why your project estimates always seem to fall short?  Have you sat down and gone through this exercise?  Even if you haven’t gone through this “guesstimation” exercise, pick a recent project that ran longer than estimated and try using a 5 or 6 hour work day instead of 7 or 8+ and see if you come closer to the actual target.

Anyone else have any tips in this area or a handy MS Excel template they can share that makes this easier?

The next article will dive into a very simple project/request capacity exercise.

*I have no direct experience with these products.  Review their terms and conditions and product feature set prior to installing and/or purchasing.


For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind. After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:

  1. Why/how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

The urgency can return without warning!

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on avoiding the spinning wheel of blame. This article considers how to approach the inevitable post-outage “ruggedization” efforts.

So, you have survived with minor bumps and bruises from a service outage.  The service is now restored.  But does everyone just go about their regular work and forget this grueling event?  Nope … here come the “ruggedization” efforts.  I’ve covered one angle to the project involvement perspective in a previous post.  In summary from that post to set the tone for this extension: “ruggedization” projects tend to have strong support immediately following the outage but as time marches on and new problems and priorities pop up, the “ruggedization” effort loses momentum.  Strong resources move on to the problems of the moment and the challenges of the future leaving weaker resources behind to struggle to move forward on the “ruggedization” effort of the past.  What ultimately puts a nail in the coffin of the “ruggedization” effort is when real capital dollars need approval in order to buy new equipment and/or additional software licenses when many have forgotten the event ever occurred.

Thus you, as a manager, are faced with the strong potential for your team’s resources and your energy to get pulled into an effort that is likely to stall out eventually. Walking away from these “ruggedization” efforts at the outset will brand you and your team as ones that don’t partner with the rest of the organization. Assigning your top engineers and keeping tabs on all the throes of the ensuing project process could put you at even greater risk of not paying enough attention to new and in-flight projects. Thus you need a strategy to maintain a partnering perception while not losing your strategic focus.

Approach? In a word: balance

Balance in the sense that you need to balance your and your team’s involvement in the effort with the priority the rest of the organization is applying to it. In the beginning, everyone will be running around with a sense of urgency about the effort, and you need to apply an equal amount of urgency. Everyone external needs to sense that the urgency they feel is matched by the urgency you and your team exhibit. But as you sense that the organization is beginning to lose momentum and participants are dropping off to focus on more urgent matters, begin to echo that same decreasing level of involvement. Depending on your risk tolerance, you can start pulling back at the first sign that others are doing the same. I prefer a slightly more risk-averse approach. I have found you get an even more concrete sense of the dissipating urgency as you directly interact with people who were demanding licensing costs, hardware estimates and testing schedules yesterday, but who noticeably balk at returning your calls, emails and IMs when you follow up with more questions today. Plus, with this approach, if something goes wrong and everyone is brought back up to the original level of urgency (example: the system shows signs of potential doom and gloom again), you have a steady flow of “pre-tasks” (introduced in this article) you can reference that have “you” waiting on “them”. These “pre-tasks” help with your interactions with both external parties and your management. As external parties rush back into the hyper-urgent state and begin to thrash you and your team with requests, you have immediate responses that redirect them back to their world, allowing you to ramp back up more calmly. The same applies to your management: as they hear things are picking back up, they want/need a sense that you and your team are on top of the situation. Nothing conveys that message better than having this at the ready when, as you are fighting the current fires, this old fire flares back up:

“The customer service quality team is asking where the “ruggedization” project is at?  Well, we are waiting on a quote for the two different server config options the platform team recommended to add capacity from IT procurement.  We have questions out to the enterprise support team for them to confirm what data they are looking to pull from the system logs that they said don’t have the info they need.  And finally, we are waiting on Testing Services to provide a performance testing window to test if the vendor recommended performance tuning settings will have any effect.  So, we are ready to re-engage, but right now, we are waiting on these items in order to proceed.”

Want to be even more proactive? Then, as soon as you get wind the fire has re-ignited, email each of the contacts in the above example and check in on how they are progressing with your request. Then you can add the following to your response to your management:

“In addition, I’ve ping-ed each of those groups to see if they need anything from us at this point.”

Being able to respond this way further solidifies the impression that you are on top of the situation.

Thus, in summary, by keeping a pulse on the level of involvement and urgency of external stakeholders and metering your and your team’s involvement to a similar level, you can maintain a sense of partnership with the rest of the organization. In addition, if you are positioned with “pre-tasks” and an at-the-ready response to your management when the “ruggedization” effort goes from cool to cold to instantly hot again, you will satisfy their desire to have confidence and trust that you are on top of the situation. Maintaining both achieves the balance required to stay appropriately involved in the “ruggedization” effort without neglecting new and emerging requests for attention.


For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind. After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:

  1. Why/how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

Can you avoid the spinning wheel of blame?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on communication strategies with your management when the priority of the organization is to restore service at all costs without neglecting data critical to finding the root cause. This article considers an even more challenging element … avoiding the spinning wheel of blame.

  1. What is the priority of the organization when it comes to systems outages?
  2. Does someone/group need to be blamed for the outage?

Is the priority to restore services as fast as humanly possible, but with an ever present fear of the inevitable “spinning wheel of blame” along the way? If so, then you have your work cut out for you. Hopefully this article provides some helpful tips for this most unpleasant IT cultural scenario.

Those working or having worked in an IT culture that embraces what I call the spinning wheel of blame immediately know what I am referring to. It is that sense that as the duration of the outage increases, the need to cast blame on a particular entity for the cause of the outage increases proportionally. This results in significant downward pressure on everyone involved not to be remotely close to the impending blame assignment. In the opposite culture, though an organization does not enjoy a systems outage, it takes a healthier approach akin to the one in a previous article: restore service quickly, but learn why the outage occurred in the first place so rational steps and associated investments can be made to reduce the likelihood of future outages. In the blame culture, by contrast, the priority shifts to restoring service as quickly as possible while, at the same time, building a case to point the finger of blame as far from one’s team and one’s department as possible. This is where throwing the technology vendor under the proverbial bus comes in very handy. Look for more on this challenging dynamic for the engineering team dependent on a vendor in a future article.

So, you have made it this far and you are either still groaning at the thought of your most recent experience avoiding the wheel of blame in your organization or curious how such an unhealthy culture can actually manifest itself in IT, which is known for embracing constant change and the bumps and bruises along the way.  Below is a modification to the two-pronged approach I mentioned previously:

Prong One – Keep Your Team Focused

Similar to the approach in the previous article, identify each team member’s competency at juggling the multiple priorities involved in restoring service and gathering data, and manage accordingly.  But considering the wheel of blame element, coach the senior members to keep you abreast of the current buzz on who the wheel is pointing at before and after each major milestone in the restoration effort.  Instruct them to give you a heads up as soon as their confidence nears 80% that your team’s service is the likely root cause candidate, preferably before external parties catch on to this likelihood.  For junior members without leadership provided by a senior member, though it may come across as a little bit of micromanaging, step in frequently to get a pulse on their discoveries and remind them to inform you of incriminating facts prior to sharing them with a larger audience.  You’ll need to absorb as much of the pulse and data of what is going on as you can in order to predict where the spinning wheel is pointing at any given moment and whether it is going to point at you as the next log entry is revealed.

Lastly, and very importantly, make sure you instill trust in your team that you have their back.  If they know the wheel exists and they get a sense you will throw them under the bus at any opportunity, they will quickly adopt unsupportive and counterproductive behaviors, making your job significantly harder.  Be prepared to go to bat for them when others might like to take the easy route and blame one of your team members in the blame assignment phase.

Prong Two – Keep Your Management Team Informed

While you cringe at the energy you have to expend to keep up with all the activities in flight plus the spinning wheel, your management is crossing their fingers with equal fervor that the wheel will land on someone else.  In addition to providing the information in the thematic format I proposed in the previous article, consider including a “likelihood it is us at XX%” indicator with each communication, preferably at the top.  Strive not to have XX go from 5% to 95% within a five-minute string of communications.  It is best to start with some assumed outage responsibility, since your team is being called into the restoration and root cause effort for a reason.  If data even smells like you might have some culpability, start showing it in the XX% indicator right away.  Nothing will grab attention like an XX% going from 50% to 60% to 70%, as this is a clear indication the wheel of blame is spinning in your department’s direction.  This gives your management the opportunity to get involved, if only to be prepared to erect the blame shields.  Another positive of having your management get involved as the percentage increases is that they can give direction if they see fit.  You have most probably been heads down, focusing on the tactical.  Your management has had the opportunity to look more broadly at the situation and can provide some valuable feedback from this more external perspective.
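As a rough illustration of the “likelihood it is us at XX%” indicator, here is a minimal Python sketch that builds the first line of a status update and flags a large swing from the prior update.  The function name `status_header` and the 20-point `max_jump` threshold are assumptions made up for this example, not part of any tool; pick whatever threshold suits your organization.

```python
def status_header(likelihood_pct, previous_pct=None, max_jump=20):
    """Build the top line of a status update.

    max_jump is a hypothetical threshold, in percentage points, above
    which the change from the previous update is flagged.
    """
    if not 0 <= likelihood_pct <= 100:
        raise ValueError("likelihood must be between 0 and 100")

    header = f"Likelihood it is us: {likelihood_pct}%"
    if previous_pct is not None:
        header += f" (previously {previous_pct}%)"
        # A big swing between updates is exactly what management
        # should not be surprised by, so call it out explicitly.
        if abs(likelihood_pct - previous_pct) > max_jump:
            header += " [large change since last update, explain why below]"
    return header


print(status_header(60, previous_pct=50))
print(status_header(95, previous_pct=5))
```

Putting the previous value next to the current one lets the reader see the trend at a glance without digging through earlier messages.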

In extreme cases, not keeping your management informed opens the door for the wheel of blame to land on you directly from them.  If you haven’t brought your management in early, then when something goes wrong, procedurally or otherwise, you are going to have to explain retroactively.  Unless you have a great story, and more than likely you don’t, you leave your management with little choice but to leave you out in the proverbial cold.  Whereas if you have been in regular communication with them and they are interacting in some manner, then they are implicitly part of the situation, not abstracted from it.  Taking a harsher angle: you have removed their plausible deniability and significantly reduced the “surprise and confusion” opportunity as their out.

In the next article in this series … now that service is restored and a brief sense of calm has returned, how to approach the spirited post-disaster “ruggedization” efforts.

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?


For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure of some kind.  After the dust settles and the system is restored to working condition, business and/or product owners outside of IT are waiting to have primarily two questions answered:

  1. Why and how did the system go down in the first place?
  2. What is IT going to do to make sure this doesn’t happen again?

Fix ASAP, but don't miss the how and why?

The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps.  This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on communication strategies with your management when the priority of the organization is to restore service at all costs, then try to find the root cause.  This article considers a less “at all costs” culture.

  1. What is the priority of the organization when it comes to systems outages?

Is the priority to restore services as fast as humanly possible, but with attention paid to what changes are being made, when they are made, and what we are learning about the system along the way?  If so, you need a two-pronged approach:

Prong One – Keep Your Team Focused

Taking a quick assessment of your team members, you can probably determine quickly who can successfully balance competing priorities and who is overwhelmed when multiple goals are up in the air at the same time.  Those who have proven the ability to balance these competing priorities need only a minor reminder to weigh the urgency of getting things fixed against the need to capture the high-level steps taken, both the steps that lead directly to success and the steps that produce knowledge that ultimately leads to success.  Overly reminding these individuals of these goals will be perceived as micromanaging.  Thus, politely remind them, then proceed to monitor without nagging.  With those who have proven to struggle with the “troubleshooting 101” concept that value comes both from fixing the problem quickly and from gathering knowledge during the fixing process, you will need to get more involved.  One approach is to link these individuals to more senior team members who can take the lead and use these less skilled resources as a personal support arm for their resolution efforts.  If you are unable to link these individuals to ones who do possess these skills, you will need to provide further instruction.  Here, what skilled team members would view as micromanaging is what these team members would view as helpful, clear and focused direction.  Consider a quick electronic template for individuals to use with columns such as:

Date/Time, Activity Performed, Knowledge/Result of Activity, By You or Other?

Example entries:

10/01/2009 8:00am, Joined troubleshooting conference call, nothing yet, <Team Member Name Here>

10/01/2009 8:10am, Restarted BLAH service, no change as the system still crashed under load, <Team Member Name Here>

10/01/2009 8:15am, Increased Available Threads in Thread Pool and Restarted BLAH service, the system still crashed under load but supported 10k more users than before this change, <Team Member Name Here>

Also, consider pre-populating the template with other relevant data, such as a drop-down list of business units impacted, applications/services involved, or other support groups involved.  Include as many pre-populated attributes as needed to assist you in strategizing your communications.
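If a spreadsheet is not handy, the same template can be kept as a small script.  The following is only a sketch in Python: the `LogEntry` class, the `to_csv` helper, the extra “Business Unit Impacted” column, and the `BUSINESS_UNITS` values are all hypothetical names chosen for illustration, standing in for whatever pre-populated lists your organization actually uses.

```python
import csv
import io
from dataclasses import dataclass, astuple

# Header row: the template's four columns plus one pre-populated attribute.
COLUMNS = ["Date/Time", "Activity Performed", "Knowledge/Result of Activity",
           "By You or Other?", "Business Unit Impacted"]

# Hypothetical drop-down values; substitute your own organization's list.
BUSINESS_UNITS = ("Sales", "Finance", "Customer Support")


@dataclass
class LogEntry:
    when: str
    activity: str
    result: str
    performed_by: str
    business_unit: str

    def __post_init__(self):
        # Mimic a drop-down list: only pre-populated values are accepted.
        if self.business_unit not in BUSINESS_UNITS:
            raise ValueError(f"unknown business unit: {self.business_unit!r}")


def to_csv(entries):
    """Render the entries as CSV text with the template's header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for entry in entries:
        writer.writerow(astuple(entry))
    return buffer.getvalue()


entries = [
    LogEntry("10/01/2009 8:00am", "Joined troubleshooting conference call",
             "nothing yet", "<Team Member Name Here>", "Sales"),
    LogEntry("10/01/2009 8:10am", "Restarted BLAH service",
             "no change, system still crashed under load",
             "<Team Member Name Here>", "Sales"),
]
print(to_csv(entries))
```

The CSV output can be pasted or attached to a status update as supporting material, and rejecting values outside the pre-populated list keeps entries consistent when several people are logging at once.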

Prong Two – Keep Your Management Team Informed

If you feel somewhat tactically helpless, imagine your management’s level of helplessness in these situations.  If you are new to your management role, this is a great opportunity to demonstrate your leadership capabilities and build your management’s confidence and trust in you.  Structuring a communication frequency that provides timely, but not thrashing, updates on major milestones in the troubleshooting and root cause effort will go a long way toward building that confidence and trust.  A theme sequence to consider is:

  • Reported problem, your team’s initial engagement (don’t forget to mention the urgency of your team’s involvement), other teams engaged, more details forthcoming
  • Quickly following, initial assessment of the systems and end users impacted, what they are experiencing, what the initial take is on what the culprit is, temperature check of the players involved, more details forthcoming
  • Major milestones of knowledge discovery or change in the reported problem from the last report, confidence assessment of next steps equaling resolution and root cause
    • Consider attaching your most recent template as an appendix/supporting material
  • Final resolution, root cause with confidence assessment, degree of involvement in the cause of the problem, next steps now that service is restored
    • Consider attaching your most recent template as an appendix/supporting material

With a two-pronged approach that balances the directional needs of your team as they juggle competing priorities, factoring in their individual skill sets, with an organized thematic approach to communicating with your management, you add considerable value to the root cause analysis process even though your hands are not directly solving the technical issues.

In the next article in this series … what if the priority is to restore services as fast as humanly possible, but under the overwhelming fear that the spinning wheel of blame has to land on someone for this disastrous event?

Anyone have an example of a do or a don’t when it comes to how you handle these situations?  Anything you did that was helpful or hurtful during these events you can share?
