Anyone that works in the Information Technology field knows that production technology systems, from time to time, will have problems. From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA to full blown hardware failures that take down entire suites of applications, no matter how much is invested in “highly available” and “redundant” technologies, failures are bound to occur. For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical. Sure, the priority is to restore full service availability as soon as possible. But, if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time. Succeeded and failed at the same time you wonder? Please read on and I will attempt to shed some light on this success with failure construct and considerations on how to avoid the failure “pitfalls”.
Pitfall = Challenges in an Extended Outage
So, you’ve bought into the need to be response based on a previous article touting the benefits to you (being viewed as a leader and raise and bonus positives) and your organization (calmly restore production IT services to normal working order). You’ve communicated in a personal style with incremental positive facts and indicated at what timing points you will be updating the stakeholders on your progress as indicated in the previous article. If the problem can be easily identified and corrected quickly including a rather direct way to explain why it happened, pat yourself on the back for a job well done. Now get ready for the after math of re-explaining what happened a hand full of times over and possibly participate in some post issue shoring up of the technology (see root cause analysis considerations posted here previously). But what happens when the status reporting is going on longer and longer and you can tell that the natives are getting restless as they are starting to grow concerned at the length of the outage and at the lack of a clear “it will be fixed in 5 minutes” status report? When an outage becomes an extended outage, time to ratchet up the communication plan and bring in some help.
Problem isn’t Obviously Fixable in Short Order? Get Help
Most likely, as time is going by, more people are aware of the outage and thus the list of stakeholders is growing larger. Also, the likelihood those stakeholders are senior technical people offering to give you a hand is slim and none … and slim left town as the saying goes. I would venture to say that the stakeholders are a growing list of non-technical people that are impacted in some way by the production situation continuing to be a problem. More and more managers on the operations and product side of the service are getting engaged as possible customer complaints are mounting or call center call volumes are reaching levels of concern. There maybe more people engaged to discuss what to do if the outage continues and an alternative, possibly more manual means is needed to meet customer SLAs. By the way, manual usually means more work done by people, hence more people getting engaged to see if they have to bring in even more people to ensure the alternative service delivery option has the right, skilled and trained staff. Company marketing resources could be engaged to offer advice on how best to let customers know the service is having a greater than normal duration outage and what the company plans to do to service their needs. I am not trying to paint a picture of doom and gloom for the primarily technical audience for this article. I know the technical mind wants to have all the people just stop talking so the real work of fixing the technology can take place. But on the business side of the technology in trouble, there are company stakeholders and customers of some form or another that are materially impacted in some way by having the usually highly reliable technology fail to function correctly.
Thus, as time goes by, your incrementally positive but not “it’s fixed” communications aren’t enough to appease the masses. You are either going to have to spend more and more time explaining to new people joining the situation what happened when, what has been ruled out, what is next to investigate, etc. or risk becoming non-communicative in order get some focused time to fix things, thus putting all your hard work at risk as outlined in this previous article. It is time to ask for some help.
Hopefully you have already engaged your management to keep them apprised of the situation as suggested in this previous series of articles. Thus, you may already be getting asked if you need help because you have informed your management and thus they are starting to ask the “hey, you are doing a good job, but can we help?” type of questions.
Ask for and accept help
I can’t stress it enough: avoid the notion that the fix is “just around the corner and if I only spend 10 more minutes researching …”. Ask for and accept help. To start, get someone engaged to be the status communicator so you have less distractions and more time to dig into the problem. The status communicator needs to have level of competence in the following skill areas:
Enough of a technical background to take technical status bits from you and quickly understand what you are saying without a 5 hour white-board deep dive session.
Ability to communicate in “business speak” not “techno-speak”.
Enough understanding of the players involved organizational chart-wise to know how and when to communicate with stakeholders and when to recognize the VP of Product is looking for status and it is time to get your VP peer manager involved.
Your manager is in the best position to act in this capacity if they aren’t already doing so. As managers, you stand to lose huge management credibility and leadership points of you just sit on the sidelines and hope the problem goes away or you are somehow hoping for plausible deny-ability to relieve you of your responsibility in this situation. Roll up your sleeves and get engaged. Start sharing what is going on in a polite but authoritative tone to build confidence and most importantly, buy more time for your engineers to dig in and figure out what is going wrong and fix it. This previous series of articles offers additional tips.
In summary, as the outage is dragging on, be mindful that not everyone involved has the priority of discovering the coveted technical root cause. For engineers, as an extended outage is building, don’t keep trying to take on the rolls of technical investigator and communications expert. Get help. Managers, get involved and start shielding your engineers from the constant barrage of status requests and allow them more focused attention on digging in and finding out what is really going on and get it fixed.
We’ve extended the need for responsiveness to reports of production support problems to include an initial take on the art of creating an effective status communication approach as well as when to admit your need help and get your manager and/or team lead involved directly. Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.