Anyone who works in the Information Technology field knows that production technology systems will, from time to time, have problems. From a functional defect that has everyone scratching their heads as to how it wasn’t discovered by seemingly endless rounds of QA* to full-blown hardware failures that take down entire suites of applications, failures are bound to occur no matter how much is invested in “highly available” and “redundant” technologies. For IT Managers and IT Engineers, how one handles these failures from inception through service restoration and finally root cause analysis is critical. Sure, the priority is to restore full service availability as soon as possible. But if you neglect some key technical support quality attributes in the process, which I’ll highlight in this series of articles, you may find you both succeeded and failed in restoring service at the same time. Succeeded and failed at the same time, you wonder? Please read on and I will attempt to shed some light on this success-with-failure construct and on how to avoid the failure “pitfalls”.
The First Pitfall = Responsiveness
The very first pitfall in providing any level of technical support to a production system is failing to be responsive. Imagine for a minute that you are a homeowner and you don’t know the first thing about plumbing (and maybe you really don’t). You turn on the faucet expecting to feel some soothing hot water spray over your hands, and yet the water remains freezing cold. You wait the usual number of seconds it takes for the water to turn from freezing cold to mildly tolerable, yet it isn’t happening. You turn off the faucet in disgust and trudge down to the basement to confront the source of pleasant, warm water: the hot water heater. The hot water heater is right where you expected it to be, but what you didn’t expect to discover is a steady stream of water pouring from underneath the unit and running a short length to your basement floor drain. Since you don’t know the first thing about plumbing, your immediate and only thought is to call a plumber to come over as soon as possible.
Assume you have contact information for three plumbers in the area who either have completed satisfactory work for you before or come recommended by friends and neighbors as prompt and reliable. You hastily dial each plumber only to be greeted by pleasant yet unhelpful answering machines:
“… your call is very important to us. Please leave a message and someone will be with you shortly <beep>”
You leave a hasty yet detailed message about the unnatural spring that has sprung forth from your hot water heater in the basement. You provide a home phone, cell phone and your spouse/roommate/significant other’s contact information to ensure there are multiple ways your urgently needed plumber can contact you.
Minutes go by.
Hours go by.
Hours turn into days …
Okay, you get the picture. What you and your lack-of-hot-water disaster need is some initial response from a plumber. Without any response of any kind, you can only plunge deeper and deeper into panic that you will forever be taking cold showers and personally draining the local freshwater lake via your leaking hot water heater.
I believe this analogy translates well to the notion of being responsive to production support issues. Replace the panicked homeowner with an IT manager who is responsible for a technology service, replace the leaking hot water heater with the broken technology, and replace the silent plumbers with an engineer who has not responded. The IT manager knows the technology is broken but has no clue if or when an engineer will attend to it, and you can imagine the panic the IT manager is feeling.
Thus, failing to respond promptly to a production support call or page when the technology you are responsible for could be broken is akin to a previously reliable plumber never even returning your call about your broken hot water heater.
To Avoid the First Pitfall = Responsiveness = Respond in a Timely Fashion
Yes, this may seem ever so simple, but responding to a call or page related to a production support issue in the expected manner will go an exceedingly long way toward avoiding the first pitfall and putting the IT manager’s mind at ease. Respond in the “expected manner”? Yes. If the expectation is that you verbally answer a call, or dial into a conference line or bridge to announce your availability, or log into an automated support management system and perform some simple acknowledgement that you are aware your services are needed, then do just that. Nothing will be gained by sending an email to your manager when the clear expectation is that you join a conference bridge line to be informed of the situation. It may seem painful, irritating, not worth your energy, etc. But allowing your immediate distaste for the production support situation you are about to be dragged into to block you from “doing the right thing” will create the perception that you are not reliable and thus not a leader. You don’t want those negative perceptions to be linked to your professional image, especially around raise and bonus time. Plus, once you have been linked to those perceptions, it is going to take above-and-beyond effort for some time to reverse them.
Know the response expectations
As mentioned above, make sure you know what the response expectations are. Make sure you have a clear understanding of what the SLAs** are for the services for which you will be contacted. If you have 30 minutes to respond, then make every effort to respond within those 30 minutes; the sooner the better. Are you supposed to respond via phone, email or join a conference line? Make sure you are clear on what you are supposed to do. Are you supposed to log in to a problem management system and update the status on a support ticket? Make sure you have confirmed you can do this remotely as well as in the office (assuming you are providing off-hours support). Know the customer SLAs for the service you are supporting. If the service is available to customers 24/7 but the real customer service agreement is from 7am to 7pm, know that, so that if a call comes in before 7pm you treat the system as needing urgent attention until someone with authority indicates the problem can be handled the next day, not the other way around. Are there priority levels assigned to the problems that get communicated out? If so, make sure you know them so that you are confident you can ignore a priority twelve problem till the next day, and so on.
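To make these expectations concrete, here is a minimal sketch of how one might write down response windows and service hours and check an acknowledgement against them. Everything in it is an assumption for illustration: the priority levels, the 2-hour and 8-hour windows, and the function names are hypothetical; only the 30-minute window and the 7am to 7pm service hours echo the examples above.

```python
from datetime import datetime, time, timedelta

# Hypothetical response windows per priority level (illustrative values only).
RESPONSE_WINDOWS = {
    1: timedelta(minutes=30),
    2: timedelta(hours=2),
    3: timedelta(hours=8),
}

# Assumed customer service hours, matching the 7am to 7pm example.
SERVICE_START = time(7, 0)
SERVICE_END = time(19, 0)


def in_service_hours(moment: datetime) -> bool:
    """Return True if the moment falls inside the agreed service hours."""
    return SERVICE_START <= moment.time() < SERVICE_END


def response_deadline(paged_at: datetime, priority: int) -> datetime:
    """Return the latest time an acknowledgement still meets the window."""
    return paged_at + RESPONSE_WINDOWS[priority]


def responded_in_time(paged_at: datetime, acknowledged_at: datetime, priority: int) -> bool:
    """Return True if the acknowledgement arrived on or before the deadline."""
    return acknowledged_at <= response_deadline(paged_at, priority)
```

For example, a priority 1 page at 6:45pm falls inside the assumed service hours, so under these assumptions it would need an acknowledgement by 7:15pm even though the service window itself closes at 7pm.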
Additionally, even though there are SLAs with different response times by priority, make every effort to understand what really constitutes a priority for the service rather than relying on arbitrary numbers. Is there a particular “high value” customer or customer group that requires high-touch service? Is there a particular business function that is mission critical or that, if not completed successfully in a timely fashion, will create a rippling effect of additional problems within the support organization? Develop a firm grasp of these unique support situations. Even though they technically might not match the “priority 1 – entire system is down” criteria, they are still viewed by senior stakeholders as important to the business. Hence, treating them as such will go a long way toward creating the perception that you care about your role and the company and that you have leadership potential. Put another way, a positive perception at raise and bonus time can’t be a bad thing.
Lastly, consider that response is just that: response. Compare these two examples of responsiveness to the same problem:
Example A:
Bob on his cell phone calling into the production situation conference line: “Hey, this is Bob from FlimFlam support. I just got a text that there is a problem and to join this line. How can I help?”
Voice on conference line: <Briefly explains that the production system is throwing error codes left and right and the system is essentially unusable>
Bob: “Hmm, I don’t know what could be causing that situation off the top of my head. I am in the car and about 30 minutes away from being online to begin troubleshooting. Is that a problem?”
Example B:
Bob, checking his cell phone and seeing a text message to join a production situation conference line, thinks to himself: “I bet FlimFlam is throwing errors again. I’ll get home, get online, see what might be going on and then join the conference line.”
In both examples, the time to resolve the production situation is probably about the same. I would actually argue that the time to restore service is probably shorter in example B than in example A, because less time goes to communication and interaction and more to hands-on technical troubleshooting. But even if it takes 60 minutes to restore service in example A compared to only 45 minutes in example B, the perceived quality of the technical support provided in example A is much higher, due to the higher level of communication and responsiveness involved. Going back to the leaking hot water heater example from the beginning of this article, the 30+ minute commute in example B between receiving the text message and joining the call is similar to leaving a message for a plumber and not hearing back. The perceived lack of responsiveness will work against any heroic technical feats of system restoration, because those who don’t fully appreciate that you pulled off a systems miracle behind the scenes are only aware of the stress you caused them by not responding promptly to their communication needs first, technical second.
Look for additional articles to identify more technical support pitfalls and steps to take to avoid them.
* QA, Quality Assurance, the process or set of processes used to measure and assure the quality of a product.
** SLA, Service Level Agreement, is a part of a service contract where the level of service is formally defined. In practice, the term SLA is sometimes used to refer to the contracted delivery time (of the service) or performance.