
Respond and forget, right?
Anyone who works in the Information Technology field knows that production technology systems will, from time to time, have problems. From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA to full-blown hardware failures that take down entire suites of applications, failures are bound to occur no matter how much is invested in “highly available” and “redundant” technologies. For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical. Sure, the priority is to restore full service availability as soon as possible. But if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time. Succeeded and failed at the same time, you wonder? Please read on and I will attempt to shed some light on this success-with-failure construct and on how to avoid the failure “pitfalls”.
Pitfall = Providing Status
So, you’ve bought into the need to be responsive based on the previous article touting the benefits to you (being viewed as a leader, plus raise and bonus positives, which are always good) and to your organization (calmly restoring production IT services to normal working order). So, all you have to do is “respond” by sending an email right away, jumping on a conference line quickly or promptly changing a status in a production trouble ticket tracking system, and you are done, right? You can now disappear into the depths of your log files, your performance counters and your packet traces, only to resurface when you have found the real cause of the problem, right? Never underestimate the extent to which people lacking timely information will panic.
To help illustrate, we can extend the example from the responsiveness article of needing a plumber to call you back quickly to address the hot water heater that is pouring water all over your basement floor and not delivering any hot water to any faucet in your house. Consider that a plumber does call you back promptly to indicate they are able to start looking into your leaking hot water tank right away. But after that responsive call back, time keeps ticking by without any indication if your tank can be fixed or needs to be replaced or is about to explode and flood your basement in the process.
Note: Yes, you can walk down into the basement to physically see the plumber’s progress or lack thereof, but pretend you can’t easily do that to allow this extended plumbing example to help frame the context for this article. Let’s say you left your home for work right after you confirmed the plumber was engaged to fix your problem.
So, without any further status from the plumber besides his or her initial “Yes, I’ll look into your hot water tank problem right away”, how do you know what is going on? The plumber could be minutes away from turning off the water main to stop the river forming in your basement, followed quickly by unloading a brand new hot water heater from a delivery truck approaching your house. Or the plumber could be sitting down on the couch to catch a baseball game on TV, completely ignoring your water dilemma. Thus, how do you know what is going on? You don’t, unless you are physically watching the plumber’s every move or the plumber is providing frequent status as to what is going on with your hot water crisis.
Frequently provide status
So, how does one keep the panic to a minimum once initially responding to the production issue? Reduce panic by frequently communicating status of what is going on in the troubleshooting process. This sounds simple enough, just keep everyone informed:
- “I just VPNed into the network.”
- “I am pulling up a terminal session with the server now.”
- “I am typing my user name.”
- “I am typing my password.”
- “Oops, wrong password, trying again.”
- “I am now at a command shell …”
Obviously, that is going too far into the over-communication side of the status equation. What you are trying to find is the artful balance in the level of detail and frequency to share status. As in all things technological, there is no silver bullet, no industry-established checklist and no “do this and it will work for every situation” written on a stone tablet somewhere to implement with guaranteed success. One has to put some energy into looking for clues as to what is going to work best in the given situation and then constantly monitor the results of your communication approach to tweak as necessary.
But doesn’t this seem like a lot of work that doesn’t get directly at fixing the true technical problem? Correct. As I mentioned previously, you can dedicate all your efforts to fixing the problem as quickly as possible, but be prepared for the consequences: various negative backlashes stemming from non-technical and peer management’s frustration at being left in the dark for who knows how long, starting from problem occurrence and ending at problem resolution. Plus, you can safely anticipate the root cause analysis aftermath being painful and extended due to the frustration your lack of communication has helped create. Thus, I am arguing the time invested up front in an effective communication approach will pay large dividends in avoiding post service restoration negativity and an elongated investment in root cause analysis malaise.
Art of an Effective Status Communication Approach
So how does one determine a successful status communication approach? First, suspend your technical or engineering brain that puts speedy problem resolution as the highest priority in any production outage situation. Recall that once you put aside the technology, people are involved in the production outage. Harking back to the plumbing crisis example above, if you are at work wondering how much your water bill is going to be as your basement floods, what would be your reaction to getting a call or a note from your plumber saying:
“Hey, this is Bob the plumber, just wanted to let you know I stopped the geyser erupting in your basement. A replacement water tank is on a delivery truck and should be arriving at your house within the hour. I’ll let you know when it gets here and what the next steps are in about an hour or so.”
Imagine the feeling of relief at getting such an update at work. Now, carry those feelings of relief over to the other people involved in the production outage situation. They are fretting over lost revenues or having to explain to their management what happened, why, and what is going to be put in place so it never happens again, with absolutely no clue at the moment as to the answers to any of those questions.
Can you make everyone relax and go about their day with a smile with a few simple sentences on what is going on? Not a chance, but you can help keep the people involved more calm and less likely to break out in irrationality by providing indications of where you are in the troubleshooting process.
Consider this revision to the step by step over communication example from above:
“Everyone, this is Bob from systems support. I was able to get online and successfully access the production server that is hosting the application that is involved in the production outage. This is a good sign in that we are able to start debugging immediately without any infrastructure barriers at this point. I will now start investigating the error logs that should give some further technical direction on what is going on. I will let everyone know what I discover in 15 minutes from now.”
Similar to the status update from your plumber, there are key elements in this status message that address the human side of the outage:
- Saying your name
Saying your name seems overly simplistic, but giving your name instead of hiding behind the anonymity of an artificial company group such as “systems support” makes a small but important personal connection to all of the people involved who are likely to panic at a moment’s notice. This is similar logic as to why people prefer talking to a human rather than interacting with an automated “push or say 1 and then enter your 45 digit account number” system when calling to resolve an incorrect cell phone, gas or electric bill.
- Providing legitimate positive news, even if it is somewhat insignificant to correcting the real problem
Again, this seems simplistic, but indicating you were able to get online and get into some level of technology to begin troubleshooting gives additional confidence to the non-technical individuals participating in the outage that some potential barriers to real problem resolution have been crossed. Look for opportunities to share facts that narrow the problem down, even if only ever so slightly. The feeling of progress that narrowing down the problem creates helps reinforce a sense of increasing control over a seemingly out-of-control situation for the non-technical people involved. Again, you are looking for balance. “I successfully typed my password” does not inspire that much confidence. Thus, look for real progress facts that focus on narrowing the problem scope rather than just facts for the sake of facts. Lastly, I chose the word “facts” specifically. Make sure you communicate facts and not speculation at this early problem engagement level. I’ll cover some suggestions on how to share speculation in another article.
- Indicate when the next status communication will occur
Giving people an indication of when they can anticipate an update on what is going on or what you are doing provides two significant benefits. The first is it allows everyone participating in the outage who is not directly involved in restoring service to relax just a bit and prepare for when they need to be engaged next. They know there is nothing tactically they can really do to solve the immediate problem. They know they are effectively 100% dependent on technical resources to do the real work of finding the problem and fixing it. They desperately want to hear: “the problem is X and I’ve fixed it.” But since neither you nor anyone else is at that point in the troubleshooting process, a time in the not too distant future where such a phrase might be uttered is the next best thing.
The second is it gives you much needed breathing room. Instead of hearing “Is it fixed yet? How about now? Now? Maybe now?” every couple of minutes, you’ve clearly set the expectation that you need some uninterrupted time to do some digging in order to provide anything valuable as far as investigative analysis. Thus, you now have some time to completely disengage from the noise associated with the problem and roll up your sleeves and immerse yourself in performance and log data to try and figure out what is going wrong with the technology.
Communicating Status – Approach in Summary
- Use your name and thus communicate in a more personal tone to increase confidence in non-technical participants … avoiding the opposite completely impersonal tone of “tech resource number 12”
- Provide positive news to further increase confidence and reduce the panic building in others with facts (not opinions), even if those facts are small troubleshooting milestones and not grandiose “ah ha!” findings. Balance this against sharing facts that are too small, of the “I pressed enter and …” type.
- Indicate you need time to dig deeper and set the timing expectations of when others can await the next element of status from you to buy uninterrupted investigation time and allow others to put off panicking for a period of time.
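For those who like to see things in code, the three elements above can be captured in a small message template. What follows is only a minimal sketch in Python, not an established tool or standard: the `StatusUpdate` class, its field names and the 15 minute default are all assumptions made up for illustration, loosely modeled on Bob’s message above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusUpdate:
    """Hypothetical template for one outage status message."""
    sender: str                   # a real person's name, not only a group alias
    team: str
    facts: list[str]              # verified progress only, no speculation
    next_step: str                # what you will dig into next
    minutes_until_next: int = 15  # assumed default; tune to the situation

    def render(self, now: datetime) -> str:
        # Refuse to send an update that carries no verified fact.
        if not self.facts:
            raise ValueError("share at least one verified fact")
        next_time = now + timedelta(minutes=self.minutes_until_next)
        lines = [f"Everyone, this is {self.sender} from {self.team}."]
        lines.extend(f"- {fact}" for fact in self.facts)
        lines.append(f"Next step: {self.next_step}")
        lines.append(
            f"Next update by {next_time:%H:%M} "
            f"(in {self.minutes_until_next} minutes)."
        )
        return "\n".join(lines)


if __name__ == "__main__":
    update = StatusUpdate(
        sender="Bob",
        team="systems support",
        facts=["I can reach the production server hosting the application."],
        next_step="Review the error logs.",
    )
    print(update.render(datetime.now()))
```

The template cannot judge whether a fact is meaningful or too small to be worth sharing. That balance is still up to you.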
We’ve extended the need for responsiveness to reports of production support problems to include an initial take on the art of creating an effective status communication approach. Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.