Know when to call in some help during an outage

Know when to call in some help during an outage

Anyone that works in the Information Technology field knows that production technology systems, from time to time, will have problems. From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA to full blown hardware failures that take down entire suites of applications, no matter how much is invested in “highly available” and “redundant” technologies, failures are bound to occur. For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical. Sure, the priority is to restore full service availability as soon as possible. But, if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time. Succeeded and failed at the same time you wonder? Please read on and I will attempt to shed some light on this success with failure construct and considerations on how to avoid the failure “pitfalls”.

Pitfall = Challenges in an Extended Outage

So, you’ve bought into the need to be response based on a previous article touting the benefits to you (being viewed as a leader and raise and bonus positives) and your organization (calmly restore production IT services to normal working order). You’ve communicated in a personal style with incremental positive facts and indicated at what timing points you will be updating the stakeholders on your progress as indicated in the previous article. If the problem can be easily identified and corrected quickly including a rather direct way to explain why it happened, pat yourself on the back for a job well done. Now get ready for the after math of re-explaining what happened a hand full of times over and possibly participate in some post issue shoring up of the technology (see root cause analysis considerations posted here previously). But what happens when the status reporting is going on longer and longer and you can tell that the natives are getting restless as they are starting to grow concerned at the length of the outage and at the lack of a clear “it will be fixed in 5 minutes” status report? When an outage becomes an extended outage, time to ratchet up the communication plan and bring in some help.

Problem isn’t Obviously Fixable in Short Order? Get Help

Most likely, as time is going by, more people are aware of the outage and thus the list of stakeholders is growing larger. Also, the likelihood those stakeholders are senior technical people offering to give you a hand is slim and none … and slim left town as the saying goes. I would venture to say that the stakeholders are a growing list of non-technical people that are impacted in some way by the production situation continuing to be a problem. More and more managers on the operations and product side of the service are getting engaged as possible customer complaints are mounting or call center call volumes are reaching levels of concern. There maybe more people engaged to discuss what to do if the outage continues and an alternative, possibly more manual means is needed to meet customer SLAs. By the way, manual usually means more work done by people, hence more people getting engaged to see if they have to bring in even more people to ensure the alternative service delivery option has the right, skilled and trained staff. Company marketing resources could be engaged to offer advice on how best to let customers know the service is having a greater than normal duration outage and what the company plans to do to service their needs. I am not trying to paint a picture of doom and gloom for the primarily technical audience for this article. I know the technical mind wants to have all the people just stop talking so the real work of fixing the technology can take place. But on the business side of the technology in trouble, there are company stakeholders and customers of some form or another that are materially impacted in some way by having the usually highly reliable technology fail to function correctly.

Thus, as time goes by, your incrementally positive but not “it’s fixed” communications aren’t enough to appease the masses. You are either going to have to spend more and more time explaining to new people joining the situation what happened when, what has been ruled out, what is next to investigate, etc. or risk becoming non-communicative in order get some focused time to fix things, thus putting all your hard work at risk as outlined in this previous article. It is time to ask for some help.

Hopefully you have already engaged your management to keep them apprised of the situation as suggested in this previous series of articles. Thus, you may already be getting asked if you need help because you have informed your management and thus they are starting to ask the “hey, you are doing a good job, but can we help?” type of questions.

Ask for and accept help

I can’t stress it enough: avoid the notion that the fix is “just around the corner and if I only spend 10 more minutes researching …”. Ask for and accept help. To start, get someone engaged to be the status communicator so you have less distractions and more time to dig into the problem. The status communicator needs to have level of competence in the following skill areas:

  1. Enough of a technical background to take technical status bits from you and quickly understand what you are saying without a 5 hour white-board deep dive session.

  2. Ability to communicate in “business speak” not “techno-speak”.

  3. Enough understanding of the players involved organizational chart-wise to know how and when to communicate with stakeholders and when to recognize the VP of Product is looking for status and it is time to get your VP peer manager involved.

Your manager is in the best position to act in this capacity if they aren’t already doing so. As managers, you stand to lose huge management credibility and leadership points of you just sit on the sidelines and hope the problem goes away or you are somehow hoping for plausible deny-ability to relieve you of your responsibility in this situation. Roll up your sleeves and get engaged. Start sharing what is going on in a polite but authoritative tone to build confidence and most importantly, buy more time for your engineers to dig in and figure out what is going wrong and fix it.  This previous series of articles offers additional tips.

In summary, as the outage is dragging on, be mindful that not everyone involved has the priority of discovering the coveted technical root cause. For engineers, as an extended outage is building, don’t keep trying to take on the rolls of technical investigator and communications expert. Get help. Managers, get involved and start shielding your engineers from the constant barrage of status requests and allow them more focused attention on digging in and finding out what is really going on and get it fixed.

We’ve extended the need for responsiveness to reports of production support problems to include an initial take on the art of creating an effective status communication approach as well as when to admit your need help and get your manager and/or team lead involved directly. Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.

, , , , , , , ,

Related posts:

  1. Pitfalls of IT Technical Support and How to Avoid Them – Providing Status
  2. Pitfalls of IT Technical Support and How to Avoid Them – Responsiveness

Respond and forget, right?

Respond and forget, right?

Anyone that works in the Information Technology field knows that production technology systems, from time to time, will have problems.  From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA to full blown hardware failures that take down entire suites of applications, no matter how much is invested in “highly available” and “redundant” technologies, failures are bound to occur.  For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical.  Sure, the priority is to restore full service availability as soon as possible.  But, if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time.  Succeeded and failed at the same time you wonder?  Please read on and I will attempt to shed some light on this success with failure construct and considerations on how to avoid the failure “pitfalls”.

Pitfall = Providing Status

So, you’ve bought into the need to be responsive based on the previous article touting the benefits to you (being viewed as a leader and raise and bonus positives which are always good) and to your organization (calmly restoring production IT services to normal working order).  So, all you have to do is “respond” by sending an email right away, jumping on a conference line quickly or changing a status in a production trouble ticket tracking system promptly and you are done, right?  You can now disappear into the depths of your logs files and your performance counters and your packet traces only to resurface when you have found the real cause of the problem, right?  Never under estimate the extent to which people, lacking timely information people, will panic.

To help illustrate, we can extend the example from the responsiveness article of needing a plumber to call you back quickly to address the hot water heater that is pouring water all over your basement floor and not delivering any hot water to any faucet in your house.  Consider that a plumber does call you back promptly to indicate they are able to start looking into your leaking hot water tank right away.  But after that responsive call back, time keeps ticking by without any indication if your tank can be fixed or needs to be replaced or is about to explode and flood your basement in the process.

Note: Yes, you can walk down into the basement to physically see the plumber’s progress or lack there of, but pretend you can’t easily do that to allow this extended plumbing example to help frame the context for this article.  Let’s say you left your home for work right after you confirmed the plumber was engaged to fix your problem.

So, without any further status from the plumber besides his or her initial: “Yes, I look into your hot water tank problem right away”, how do you know what is going on?  The plumber could be minutes away from turning off the water main to stop the river forming in your basement followed quickly by unloading a delivery truck approaching your house with a brand new hot water heater or sitting down on the couch to catch a baseball game on TV completely ignoring your water dilemma.  Thus, how do you know what is going on?  You don’t, unless you are physically watching the plumber’s every move or the plumber is providing frequent status as to what is going on with your hot water crisis.

Frequently provide status

So, how does one keep the panic to a minimum once initially responding to the production issue?  Reduce panic by frequently communicating status of what is going on in the troubleshooting process.  This sounds simple enough, just keep everyone informed:

  • “I just VPNed into the network”
  • “I am pulling up a terminal session with the server now.”
  • “I am typing my user name.”
  • “I am typing my password.”
  • “Ooops, wrong password, trying again.”
  • “I am now at a command shell …”

Obviously, that is going too far into the over communication side of the status equation.  What you are trying to find is the artful balance in the level of detail and frequency to share status.  As in all things technological, there is no silver bullet, no industry established check list and no “do this and it will work for every situation written on a stone tablet somewhere to implement with guaranteed success.  One has to put some energy into looking for clues as to what is going to work best in the given situation and then constantly monitor the results of the your communication approach to tweak as necessary.

But this sure seems like a lot of work that doesn’t get directly at fixing the true technical problem?  Correct.  As I mentioned previously, you can dedicate all your efforts to fixing the problem as quickly as possible, but be prepared for the consequences of various negative backlashes surrounding non-technical and peer management’s frustration of being left in the dark for who knows how long starting from problem occurrence and ending at problem resolution.  Plus, you can safely anticipate the root cause analysis aftermath being painful and extended due to this lack of communication frustration you have helped create.  Thus, I am arguing the time invested up front in an effective communication approach will pay large dividends in avoiding post service restoration negativity and an elongated investment in root cause analysis malaise.

Art of an Effective Status Communication Approach

So how does one determine a successful status communication approach?  First, suspend your technical or engineering brain that puts speedy problem resolution as the highest priority in any production outage situation.  Recall that once you put aside the technology, people are involved in the production outage.  Harkin back to the plumbing crisis example above, if you are at work wondering how much your water bill is going to be as your basement floods, what would be your reaction to getting call or a note from your plumbing saying:

“Hey, this is Bob the plumber, just wanted to let you know I stopped the geyser erupting in your basement.  A replacement water tank is on a delivery truck and should be arriving at your house within the hour.  I’ll let you know when it gets here and what the next steps are in about an hour or so.”

Imagine the feeling of relief at getting such an update at work.  Now, carry those feelings of relief over to the other people involved in the production outage situation.  They are fretting over lost revenues or having to explain to their management what happened, why and what is going to be put in place so it never happens again with absolutely no clue at this moment on answers to any of those questions at the moment.

Can you make everyone relax and go about their day with a smile with a few simple sentences on what is going on?  Not a chance, but you can help keep the people involved more calm and less likely to break out in irrationality by providing indications of where you are in the troubleshooting process.

Consider this revision to the step by step over communication example from above:

“Everyone, this is Bob from systems support.  I was able to get online and successfully access the production server that is hosting the application that is involved in the production outage.  This is a good sign in that we able to start debugging immediately without any infrastructure barriers at this point.  I will now start investigating the error logs that should give some further technical direction on what is going on.  I will let everyone know what I discover in 15 minutes from now.”

Similar to the status update from your plumber, there are key elements in this status message that address the human side of the outage:

  1. Saying your name

Saying your name seems over simplistic, but giving your name instead of hiding behind the anonymity of an artificial company group such as “systems support” makes a small but important personal connection to all of the people involved that possess likelihood to panic at a moments notice.  This is similar logic as to why people prefer talking to a human rather than interacting with an automated “push or say 1 and then entering your 45 digit account number” system when calling to resolve an incorrect cell phone, gas or electric bill.

  1. Providing legitimate positive news, even if it is somewhat insignificant to correcting the real problem

Again, seems simplistic, but by indicating you were able to get online and get into some level of technology to begin troubleshooting, it helps to give additional confidence to the non-technical individuals participating in the outage that some potential barriers to real problem resolution have been crossed.  Look for opportunities to share facts that narrow the problem down, even if they only narrow the problem down ever so slightly.  The increased feeling of progress that the elements of narrowing down the problem create help to continue to enforce feelings of increasing control over a seemingly out of control situation to the non-technical people involved.  Again, you are looking for balance.  “I successfully typed my password” does no invoke that much confidence.  Thus look for real progress facts that can be shared that focus on narrowing the problem scope rather than just facts for the sake of facts.  Lastly, I chose the word “facts” specifically.  Make sure you communicate facts and not speculation at this early problem engagement level.  I’ll cover some suggestions on how to share speculation in another article.

  1. Indicate when the next status communication will occur

Giving people an indication of when they can anticipate an update on what is going on or what you are doing provides two significant benefits.  The first is it allows everyone participating in the outage who is not directly involved in restoring service the ability to relax just a bit and prepare for the when they need to be engaged next.  They know there is nothing tactically they can really do to solve the immediate problem.  They know they are effectively 100% dependant on technical resources to do the real work of finding the problem and fixing it.  They desperately want to hear: “the problem is X and I’ve fixed it.”  But since you nor anyone else is at that point in the troubleshooting process, a time in the not too distant future where such a phrase might be uttered is the next best thing.

The second is it gives you much needed breathing room.  Instead of hearing “Is it fixed yet? How about now?  Now?  Maybe now?” every couple of minutes, you’ve clearly set the expectation that you need some uninterrupted time to do some digging in order to provide anything valuable as far as investigative analysis.  Thus, you now have some time to completely disengage from the noise associated with the problem and roll up your sleeves and immerse yourself in performance and log data to try and figure out what is going wrong with the technology.

Communicating Status – Approach in Summary

  1. Use your name and thus communicate in a more personal tone to increase confidence in non-technical participants … avoiding the opposite completely impersonal tone of “tech resource number 12”
  2. Provide positive news to further increase confidence and reduce the panic building in others with facts (not opinions), even if those facts are small troubleshooting milestones and not grandiose “ah ha!” findings.  Make sure to balance the too small “I pressed enter and …” type facts.
  3. Indicate you need time to dig deeper and set the timing expectations of when others can await the next element of status from you to buy uninterrupted investigation time and allow others to put off panicking for a period of time.

We’ve extended the need for responsiveness to reports of production support problems to include an initial take on the art of creating an effective status communication approach.  Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.

, , , , , , ,

Related posts:

  1. Pitfalls of IT Technical Support and How to Avoid Them – Responsiveness
  2. Vendor Management – Part 14 – Tech Support – Part 2 of 2

Respond ASAP!

Respond ASAP!

Anyone that works in the Information Technology field knows that production technology systems, from time to time, will have problems.  From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA* to full blown hardware failures that take down entire suites of applications, no matter how much is invested in “highly available” and “redundant” technologies, failures are bound to occur.  For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical.  Sure, the priority is to restore full service availability as soon as possible.  But, if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time.  Succeeded and failed at the same time you wonder?  Please read on and I will attempt to shed some light on this success with failure construct and considerations on how to avoid the failure “pitfalls”.

The First Pitfall = Responsiveness

The very first pitfall in providing any level of technical support to a production system is to fail to be responsive.  Imagine for a minute that you are a home owner and you don’t know the first thing about plumbing (and maybe you don’t know much about plumbing).  You turn on the facet expecting to feel some soothing hot water spray over your hands and yet, the water remains freezing cold.  You wait the typical number of seconds by when you would expect to at least sense the water turning from freezing cold to mildly tolerable; yet it isn’t happening.  You turn off the facet in disgust and trudge down to the basement to confront the source of pleasant, warm water: the hot water heater.  To your surprise, the hot water heater is where you expected it to be physically, but what you didn’t expect to discover is a steady stream of water pouring from underneath the unit and running a short length to your basement floor drain.  Since you don’t know the first thing about plumbing, your immediate and only thought is to call a plumber to come over as soon as possible.

Assume you have contact information for three plumbers in the area that you either have had satisfactory work completed prior or had reliable information from friends and neighbors that indicated they were prompt and reliable.  You hastily dial each plumber only to be greeted by pleasant yet unhelpful answering machines:

“… your call is very important to us.  Please leave a message and someone will be with you shortly <beep>”

You leave a hasty yet detailed message about the unnatural spring that has sprung forth from your hot water heater in the basement.  You provide a home phone, cell phone and your spouse/roommate/significant other’s contact information to ensure there are multiple ways your urgently needed plumber can contact you.

Minutes go by.

Hours go by.

Hours turn into days …

Okay, you get the picture.  What you and your lack of hot water disaster need is some initial response from a plumber.  Without any response of any kind, one can only plunge deeper and deeper into panic that you will forever by taking cold showers and personally draining the local fresh water lake via your leaking hot water heater.

I believe this analogy translates well to the notion of being responsive to production support issues.  Replace the panicked home owner with the leaking hot water heater and no response from any plumbers with an IT manager that is responsible for a technology service that is aware the technology is broke, but has no clue if/when it will be attended too by an engineer and you can image the panic the IT manager is feeling.

Thus, to not respond to a production support call or page promptly when the technology you are responsible for could be broken is liken to a previously reliable plumber never even returning your call about your broken hot water heater.

To Avoid the First Pitfall = Responsiveness = Respond in a Timely Fashion

Yes, this may seem ever so simple, but by responding to a call or page related to a production support issue in the expected manner will go an exceedingly long way to avoid the first pitfall and put the IT manager’s mind at ease.  Respond in the “expected manner”?  Yes, if the expectation is that you verbally answer a call or dial into a conference line or bridge to announce your availability or log into an automated support management system and perform some simple acknowledgement that you are aware your services are needed, and then do just that.  Nothing will be gained by sending an email to your manager when the clear expectation is you join a conference bridge line to be informed of the situation.  It may seem painful, irritating, not worth your energy, etc.  Allowing your immediate distaste for the production support situation that you are about to be drug into block you from “doing the right thing” well create the perception that you are not reliable and thus not a leader.  You don’t want those negative perceptions to be linked to your professional image, especially around raise and bonus time.  Plus, once you have been linked to those perceptions, it is going to take above and beyond effort for some time to reverse them.

Know the response expectations

As mentioned above, make sure you know what the response expectations are.  Make sure you have a clear understanding what the SLAs** are for the services for which you will be contacted.  If you have 30 minutes to respond, then make sure you make every effort to respond within that 30 minutes; sooner the better.  Are you supposed to respond via phone, email or join a conference line?  Make sure you are clear on what you are supposed to do.  Are you supposed to login to a problem management system and update the status on a support ticket?  Make sure you have confirmed you can do this remotely as well as in the office (assuming you are providing off hours support).  Know the customer SLAs for the service you are supporting.  If the service is available to customers 24/7 but the real customer service agreement is from 7am to 7pm, know that so if a call comes in before 7pm, be of the mindset that the system needs urgent attention until someone of authority indicates the problem can be handled the next day, not the other way around.  Are there priority levels assigned to the problems that get communicated out?  If so, make sure you know so that you are confident you can ignore a priority twelve problem till the next day, and so on.

Additionally, even though there are SLAs with different response times by priority, make every effort to understand what really constitutes a priority for the service rather than just arbitrary numbers.  Is there a particular “high value” customer or customer group that requires high touch service?  Is there a particular business function that is mission critical or if not completed successfully in a timely fashion, will create a rippling effect of additional problems within the support organization?  Develop a firm grasp of these unique support situations.  Even through they technically might not match the “priority 1 – entire system is down” criteria, they still are viewed by senior stakeholders as important to the business.  Hence, treating them as such will go along way to create the perception you care about your role, the company and have leadership potential.  Alternatively, think positive perception for raise and bonus time can’t be a bad thing.

Lastly, consider that response is just that: response.  Compare these two examples of responsiveness to the same problem:

Example A

Bob on his cell phone calling into the production situation conference line: “Hey, this Bob from FlimFlam support, I just got a text that there is a problem and to join this line.  How can I help?”

Voice on conference line: <Briefly explains that the production system is throwing error codes left and right and the system is essentially unusable>

Bob: “Hmm, I don’t know what could be causing that situation off the top of my head.  I am in the car and about 30 minutes away from being online to begin troubleshooting.  Is that problem?”

Example B

Bob, checking his cell phone for a text message to join a production situation conference line, thinks to himself: “I bet FlimFlam is throwing errors again.  I’ll get home, get online, see what might be going on and then join the conference line.”

In both examples, the time to resolve the production situation is probably the same.  I would actually argue that the time to restore service is probably quicker in option B than in A due to less communication and interaction time compared to hands on technical troubleshooting time.  But if it takes 60 minutes to restore service in example A compared to only 45 minutes in example B, the perception of the quality of technical support provided in example A is much higher than example B due to the higher level of communication and responsiveness involved in example A.  Back up to the leaking hot water heater example from the beginning of this article, that 30+ minute driving commute from receiving the text message to join the call to get engaged is similar to leaving a message for a plumber and not hearing back.  The perceived lack of responsiveness will work against any heroic technical feats of system restoration because those that don’t fully appreciate that you pulled off a systems miracle behind the scenes are only aware of the stress you caused them by not responding promptly to their communication needs first, technical second.

Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.

* QA, Quality Assurance, the process or set of processes used to measure and assure the quality of a product.

*SLA, Service Level Agreement is a part of a service contract where the level of service is formally defined. In practice, the term SLA is sometimes used to refer to the contracted delivery time (of the service) or performance.

, , , , , ,

No related posts.