Know when to call in some help during an outage

Anyone who works in the Information Technology field knows that production technology systems, from time to time, will have problems. From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA, to full-blown hardware failures that take down entire suites of applications, failures are bound to occur no matter how much is invested in “highly available” and “redundant” technologies. For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical. Sure, the priority is to restore full service availability as soon as possible. But if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time. Succeeded and failed at the same time, you wonder? Please read on and I will attempt to shed some light on this success-with-failure construct and on how to avoid the failure “pitfalls”.

Pitfall = Challenges in an Extended Outage

So, you’ve bought into the need to be responsive, based on a previous article touting the benefits to you (being viewed as a leader, plus raise and bonus positives) and to your organization (calmly restoring production IT services to normal working order). You’ve communicated in a personal style with incremental positive facts and indicated at what timing points you will be updating the stakeholders on your progress, as described in the previous article. If the problem can be identified and corrected quickly, with a rather direct way to explain why it happened, pat yourself on the back for a job well done. Now get ready for the aftermath of re-explaining what happened a handful of times over, and possibly participating in some post-issue shoring up of the technology (see root cause analysis considerations posted here previously). But what happens when the status reporting is going on longer and longer, and you can tell that people are getting restless as they grow concerned at the length of the outage and at the lack of a clear “it will be fixed in 5 minutes” status report? When an outage becomes an extended outage, it is time to ratchet up the communication plan and bring in some help.

Problem isn’t Obviously Fixable in Short Order? Get Help

Most likely, as time goes by, more people become aware of the outage and thus the list of stakeholders grows larger. Also, the likelihood that those stakeholders are senior technical people offering to give you a hand is slim and none … and slim left town, as the saying goes. I would venture to say that the stakeholders are a growing list of non-technical people who are impacted in some way by the production situation continuing to be a problem. More and more managers on the operations and product side of the service are getting engaged as customer complaints mount or call center volumes reach levels of concern. There may be more people engaged to discuss what to do if the outage continues and an alternative, possibly more manual, means is needed to meet customer SLAs. By the way, manual usually means more work done by people, hence more people getting engaged to see if they have to bring in even more people to ensure the alternative service delivery option has the right skilled and trained staff. Company marketing resources could be engaged to offer advice on how best to let customers know the service is having a longer than normal outage and what the company plans to do to service their needs.

I am not trying to paint a picture of doom and gloom for the primarily technical audience of this article. I know the technical mind wants all the people to just stop talking so the real work of fixing the technology can take place. But on the business side of the technology in trouble, there are company stakeholders and customers of some form or another who are materially impacted when the usually highly reliable technology fails to function correctly.

Thus, as time goes by, your incrementally positive but not “it’s fixed” communications aren’t enough to appease the masses. You are either going to have to spend more and more time explaining to new people joining the situation what happened when, what has been ruled out, what is next to investigate, etc., or risk becoming non-communicative in order to get some focused time to fix things, thus putting at risk all your hard work as outlined in this previous article. It is time to ask for some help.

Hopefully you have already engaged your management to keep them apprised of the situation, as suggested in this previous series of articles. If so, they may already be asking the “hey, you are doing a good job, but can we help?” type of questions.

Ask for and accept help

I can’t stress it enough: avoid the notion that the fix is “just around the corner and if I only spend 10 more minutes researching …”. Ask for and accept help. To start, get someone engaged to be the status communicator so you have fewer distractions and more time to dig into the problem. The status communicator needs to have a level of competence in the following skill areas:

  1. Enough of a technical background to take technical status bits from you and quickly understand what you are saying without a 5-hour whiteboard deep-dive session.

  2. Ability to communicate in “business speak”, not “techno-speak”.

  3. Enough understanding of the players involved, organizational chart-wise, to know how and when to communicate with stakeholders, and to recognize when the VP of Product is looking for status and it is time to get your VP peer manager involved.

Your manager is in the best position to act in this capacity if they aren’t already doing so. Managers, you stand to lose huge management credibility and leadership points if you just sit on the sidelines and hope the problem goes away, or hope for plausible deniability to relieve you of your responsibility in this situation. Roll up your sleeves and get engaged. Start sharing what is going on in a polite but authoritative tone to build confidence and, most importantly, buy more time for your engineers to dig in, figure out what is going wrong and fix it. This previous series of articles offers additional tips.

In summary, as the outage drags on, be mindful that not everyone involved has the priority of discovering the coveted technical root cause. Engineers, as an extended outage is building, don’t keep trying to take on the roles of both technical investigator and communications expert. Get help. Managers, get involved and start shielding your engineers from the constant barrage of status requests so they can give more focused attention to digging in, finding out what is really going on and getting it fixed.

We’ve extended the need for responsiveness to reports of production support problems to include an initial take on the art of creating an effective status communication approach, as well as when to admit you need help and get your manager and/or team lead involved directly. Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.

Respond and forget, right?

Anyone who works in the Information Technology field knows that production technology systems, from time to time, will have problems. From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA, to full-blown hardware failures that take down entire suites of applications, failures are bound to occur no matter how much is invested in “highly available” and “redundant” technologies. For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical. Sure, the priority is to restore full service availability as soon as possible. But if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time. Succeeded and failed at the same time, you wonder? Please read on and I will attempt to shed some light on this success-with-failure construct and on how to avoid the failure “pitfalls”.

Pitfall = Providing Status

So, you’ve bought into the need to be responsive based on the previous article touting the benefits to you (being viewed as a leader, plus raise and bonus positives, which are always good) and to your organization (calmly restoring production IT services to normal working order). So, all you have to do is “respond” by sending an email right away, jumping on a conference line quickly or changing a status in a production trouble ticket tracking system promptly and you are done, right? You can now disappear into the depths of your log files and your performance counters and your packet traces, only to resurface when you have found the real cause of the problem, right? Never underestimate the extent to which people, lacking timely information, will panic.

To help illustrate, we can extend the example from the responsiveness article of needing a plumber to call you back quickly about the hot water heater that is pouring water all over your basement floor and not delivering any hot water to any faucet in your house. Suppose a plumber does call you back promptly to say they are able to start looking into your leaking hot water tank right away. But after that responsive call back, time keeps ticking by without any indication of whether your tank can be fixed, needs to be replaced or is about to explode and flood your basement in the process.

Note: Yes, you can walk down into the basement to physically see the plumber’s progress or lack thereof, but pretend you can’t easily do that so this extended plumbing example can help frame the context for this article. Let’s say you left your home for work right after you confirmed the plumber was engaged to fix your problem.

So, without any further status from the plumber besides his or her initial “Yes, I will look into your hot water tank problem right away”, how do you know what is going on? The plumber could be minutes away from turning off the water main to stop the river forming in your basement, followed quickly by unloading a brand new hot water heater from a delivery truck approaching your house. Or the plumber could be sitting down on the couch to catch a baseball game on TV, completely ignoring your water dilemma. You don’t know, unless you are physically watching the plumber’s every move or the plumber is providing frequent status as to what is going on with your hot water crisis.

Frequently provide status

So, how does one keep the panic to a minimum once initially responding to the production issue?  Reduce panic by frequently communicating status of what is going on in the troubleshooting process.  This sounds simple enough, just keep everyone informed:

  • “I just VPNed into the network”
  • “I am pulling up a terminal session with the server now.”
  • “I am typing my user name.”
  • “I am typing my password.”
  • “Oops, wrong password, trying again.”
  • “I am now at a command shell …”

Obviously, that is going too far into the over-communication side of the status equation. What you are trying to find is the artful balance in the level of detail and frequency to share status. As in all things technological, there is no silver bullet, no industry-established checklist and no “do this and it will work for every situation” written on a stone tablet somewhere to implement with guaranteed success. One has to put some energy into looking for clues as to what is going to work best in the given situation and then constantly monitor the results of your communication approach to tweak as necessary.

But this sure seems like a lot of work that doesn’t get directly at fixing the true technical problem? Correct. As I mentioned previously, you can dedicate all your efforts to fixing the problem as quickly as possible, but be prepared for the backlash from non-technical stakeholders and peer management who are frustrated at being left in the dark for who knows how long, from problem occurrence to problem resolution. Plus, you can safely anticipate the root cause analysis aftermath being painful and extended due to the frustration over lack of communication you have helped create. Thus, I am arguing that the time invested up front in an effective communication approach will pay large dividends in avoiding negativity after service restoration and a drawn-out root cause analysis malaise.

Art of an Effective Status Communication Approach

So how does one determine a successful status communication approach? First, suspend your technical or engineering brain that puts speedy problem resolution as the highest priority in any production outage situation. Recall that once you put aside the technology, people are involved in the production outage. Harking back to the plumbing crisis example above, if you are at work wondering how much your water bill is going to be as your basement floods, what would be your reaction to getting a call or a note from your plumber saying:

“Hey, this is Bob the plumber, just wanted to let you know I stopped the geyser erupting in your basement.  A replacement water tank is on a delivery truck and should be arriving at your house within the hour.  I’ll let you know when it gets here and what the next steps are in about an hour or so.”

Imagine the feeling of relief at getting such an update at work. Now, carry those feelings of relief over to the other people involved in the production outage situation. They are fretting over lost revenues or having to explain to their management what happened, why, and what is going to be put in place so it never happens again, with absolutely no clue at the moment about the answers to any of those questions.

Can you make everyone relax and go about their day with a smile with a few simple sentences on what is going on?  Not a chance, but you can help keep the people involved more calm and less likely to break out in irrationality by providing indications of where you are in the troubleshooting process.

Consider this revision to the step-by-step over-communication example from above:

“Everyone, this is Bob from systems support. I was able to get online and successfully access the production server that is hosting the application that is involved in the production outage. This is a good sign in that we are able to start debugging immediately without any infrastructure barriers at this point. I will now start investigating the error logs that should give some further technical direction on what is going on. I will let everyone know what I discover in 15 minutes from now.”

Similar to the status update from your plumber, there are key elements in this status message that address the human side of the outage:

  1. Saying your name

Saying your name seems overly simplistic, but giving your name instead of hiding behind the anonymity of an artificial company group such as “systems support” makes a small but important personal connection to all of the people involved who are likely to panic at a moment’s notice. This is similar to why people prefer talking to a human rather than interacting with an automated “push or say 1 and then enter your 45-digit account number” system when calling to resolve an incorrect cell phone, gas or electric bill.

  2. Providing legitimate positive news, even if it is somewhat insignificant to correcting the real problem

Again, this seems simplistic, but indicating you were able to get online and get into some level of technology to begin troubleshooting helps give additional confidence to the non-technical individuals participating in the outage that some potential barriers to real problem resolution have been crossed. Look for opportunities to share facts that narrow the problem down, even if they only narrow it down ever so slightly. The feeling of progress that narrowing down the problem creates helps reinforce, for the non-technical people involved, a sense of increasing control over a seemingly out of control situation. Again, you are looking for balance. “I successfully typed my password” does not inspire that much confidence. Thus, look for real progress facts that focus on narrowing the problem scope rather than just facts for the sake of facts. Lastly, I chose the word “facts” specifically. Make sure you communicate facts and not speculation at this early problem engagement level. I’ll cover some suggestions on how to share speculation in another article.

  3. Indicate when the next status communication will occur

Giving people an indication of when they can anticipate an update on what is going on or what you are doing provides two significant benefits. The first is that it allows everyone participating in the outage who is not directly involved in restoring service to relax just a bit and prepare for when they need to be engaged next. They know there is nothing tactically they can really do to solve the immediate problem. They know they are effectively 100% dependent on technical resources to do the real work of finding the problem and fixing it. They desperately want to hear: “the problem is X and I’ve fixed it.” But since neither you nor anyone else is at that point in the troubleshooting process, a time in the not too distant future when such a phrase might be uttered is the next best thing.

The second is it gives you much needed breathing room.  Instead of hearing “Is it fixed yet? How about now?  Now?  Maybe now?” every couple of minutes, you’ve clearly set the expectation that you need some uninterrupted time to do some digging in order to provide anything valuable as far as investigative analysis.  Thus, you now have some time to completely disengage from the noise associated with the problem and roll up your sleeves and immerse yourself in performance and log data to try and figure out what is going wrong with the technology.

Communicating Status – Approach in Summary

  1. Use your name and thus communicate in a more personal tone to increase confidence in non-technical participants … avoiding the opposite completely impersonal tone of “tech resource number 12”
  2. Provide positive news to further increase confidence and reduce the panic building in others with facts (not opinions), even if those facts are small troubleshooting milestones and not grandiose “ah ha!” findings. Make sure to avoid the too-small “I pressed enter and …” type facts.
  3. Indicate you need time to dig deeper and set the timing expectations of when others can await the next element of status from you to buy uninterrupted investigation time and allow others to put off panicking for a period of time.
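
For the engineers who would rather see the three points above as a template, here is a minimal Python sketch. It is only an illustration: the `StatusUpdate` class, its field names and the message wording are my own assumptions, not part of any tool or standard. The idea is simply that a status message is refused unless it carries a name, at least one confirmed fact and a commitment to the next update time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class StatusUpdate:
    """Hypothetical container for one outage status message."""
    sender_name: str             # a real person's name, not only a team alias
    team: str
    facts: List[str]             # confirmed progress facts, no speculation
    next_step: str               # what you will dig into next
    minutes_to_next_update: int  # when stakeholders will hear from you again

    def render(self, now: datetime) -> str:
        # Refuse to build a message that skips one of the three elements.
        if not self.sender_name.strip():
            raise ValueError("say your name")
        if not self.facts:
            raise ValueError("include at least one confirmed fact")
        if self.minutes_to_next_update <= 0:
            raise ValueError("commit to a time for the next update")

        next_at = now + timedelta(minutes=self.minutes_to_next_update)
        lines = [f"Everyone, this is {self.sender_name} from {self.team}."]
        lines += [f"Confirmed: {fact}" for fact in self.facts]
        lines.append(f"Next: {self.next_step}")
        lines.append(
            f"Next update in {self.minutes_to_next_update} minutes "
            f"(by {next_at:%H:%M})."
        )
        return "\n".join(lines)


if __name__ == "__main__":
    update = StatusUpdate(
        sender_name="Bob",
        team="systems support",
        facts=["I can reach the production server hosting the application."],
        next_step="investigating the error logs",
        minutes_to_next_update=15,
    )
    print(update.render(datetime.now()))
```

A check like this cannot judge whether a fact is too small to be worth sharing; that balance is still yours to find.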

We’ve extended the need for responsiveness to reports of production support problems to include an initial take on the art of creating an effective status communication approach.  Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.
