For IT managers and engineers alike, coming up with the root cause is the least desired activity following a system failure. Once the dust settles and the system is restored to working condition, business and product owners outside of IT are waiting to have primarily two questions answered:
- Why and how did the system go down in the first place?
- What is IT going to do to make sure this doesn’t happen again?

The urgency can return without warning!
The need for answers to these two seemingly straightforward questions generates an urgency that has all the IT stakeholders rallying together in camps. This series of articles looks at this challenging exercise from an engineering management perspective, with the first article introducing the “80% accurate technique” and the previous article focusing on avoiding the spinning wheel of blame. This article considers how to approach the inevitable post-outage “ruggedization” efforts.
So, you have survived a service outage with minor bumps and bruises. The service is now restored. But does everyone just go about their regular work and forget this grueling event? Nope … here come the “ruggedization” efforts. I’ve covered one angle of the project involvement perspective in a previous post. To summarize that post and set the tone for this extension: “ruggedization” projects tend to have strong support immediately following the outage, but as time marches on and new problems and priorities pop up, the “ruggedization” effort loses momentum. Strong resources move on to the problems of the moment and the challenges of the future, leaving weaker resources behind to struggle to move the “ruggedization” effort of the past forward. What ultimately puts the nail in the coffin of the “ruggedization” effort is the point at which real capital dollars need approval to buy new equipment and/or additional software licenses, and by then many have forgotten the event ever occurred.
Thus you, as a manager, are faced with the strong potential for your team’s resources and your energy to get pulled into an effort that is likely to stall out eventually. Walking away from these “ruggedization” efforts at the outset will brand you and your team as ones that don’t partner with the rest of the organization. Assigning your top engineers and keeping tabs on all the throes of the ensuing project process could put you at even greater risk of not paying enough attention to new and in-flight projects. Thus you need a strategy to maintain a partnering perception while not losing your strategic focus.
Approach? In a word: balance
Balance in the sense that you need to balance your and your team’s involvement in the effort with the priority the rest of the organization is applying to the effort. In the beginning, everyone will be running around with a sense of urgency about the effort, and you need to apply an equal amount of urgency. Everyone external needs to sense that the urgency they feel is matched by the urgency you and your team project. But as you get a sense that the organization is beginning to lose momentum and participants are dropping off to focus on more urgent matters, begin to echo that same decreasing level of involvement. Depending on your risk tolerance, you can start pulling back at the first sign that others are doing the same. I prefer a slightly more risk-averse approach. I have found you get an even more concrete sense of the dissipating urgency when you directly interact with people who were demanding licensing costs, hardware estimates and testing schedules yesterday, but who noticeably baulk at returning your calls, emails and IMs when you follow up with more questions today. Plus, with this approach, if something goes wrong and everyone is brought back up to the original level of urgency (example: the system shows signs of potential doom and gloom again), you have a steady flow of “pre-tasks” (introduced in this article) you can reference that have “you” waiting on “them”. These “pre-tasks” help with your interactions with both external parties and your management. As external parties rush to get back into the hyper-urgent state and begin to thrash you and your team with requests, you have immediate responses that redirect them back to their world, allowing you to ramp back up more calmly. The same applies to your management: as they hear things are picking back up, they want and need a sense that you and your team are on top of the situation.
Nothing conveys that message better than having this at the ready when, while you are fighting the current fires, this old fire flares back up:
“The customer service quality team is asking where the ‘ruggedization’ project is at? Well, we are waiting on a quote from IT procurement for the two different server config options the platform team recommended to add capacity. We have questions out to the enterprise support team for them to confirm what data they are looking to pull from the system logs, which they said don’t have the info they need. And finally, we are waiting on Testing Services to provide a performance testing window to test whether the vendor-recommended performance tuning settings will have any effect. So, we are ready to re-engage, but right now, we are waiting on these items in order to proceed.”
Want to be even more proactive? Then, as soon as you get wind that the fire has re-ignited, email each of the contacts in the above example and check in on how they are progressing with your request. Then you can add the following to your response to your management:
“In addition, I’ve pinged each of those groups to see if they need anything from us at this point.”
Being able to respond with this vote of confidence further solidifies the impression that you are on top of the situation.
Thus, in summary, by keeping a pulse on the level of involvement and urgency of external stakeholders and metering your and your team’s involvement to a similar level, you can maintain a sense of partnership with the rest of the organization. In addition, if you are positioned with “pre-tasks” and an at-the-ready response to your management when the “ruggedization” effort goes from cool to cold to instantly hot again, you will satisfy their desire for confidence and trust that you are on top of the situation. Maintaining both achieves the balance required to keep an appropriate level of involvement in the “ruggedization” effort without neglecting new and emerging requests for attention.




