I do a lot of troubleshooting of problems in corporate infrastructure. By that, I mean investigating difficult technical problems in a structured way to find either a solution or a workaround. This post is a few thoughts on the process.
Here is a caricature of the process that often happens. It is imaginary, but I think most people will easily identify real cases that are similar.
- A user (or a group of users, for example at one site or using one application) experiences a problem. The help desk tries a few things, not very systematically. Sometimes different people from the help desk try the same thing again. Various updates and changes are made to see if they fix the problem. Sometimes the problem is resolved. Sometimes it goes away. And sometimes it is ignored as "just the way it is until we upgrade xyz".
- A user experiences a problem. The user is senior, or the problem affects something that senior people care about. A problem manager takes over the case. The problem manager is someone who co-ordinates actions but does not, themselves, possess the technical skills to resolve the problem. The problem is assigned to whichever team seems most likely e.g. networks, server, storage, application. The team works hard to demonstrate that the fault cannot be with them. The problem continues, with increasing levels of polite acrimony. Eventually a significant change is made which may, or may not, be the cause e.g. a platform component is changed. The problem goes away, but no-one knows what actually caused it.
What is the perfect way? A user experiences a problem. A technical wizard understands every aspect of the problem without needing to be told anything about it, or to do any investigation. The technical wizard knows the solution, whether it is in the network, or server, or storage, or application layer. After some discussion, the fix is implemented and the problem is solved.
This perfect way is absurd, obviously. No-one can know what the problem is until they have seen it, thought about it, asked a few questions, gathered some data. No-one can be expert in enough technologies to know exactly what the fix is, without running some tests, trying a few things.
So we need a process that is not like 1) and 2), but also does not assume there is a perfect way.
First, some context. We already have an awareness of what is normal behaviour and what is not. If a user says that logon is slow, we assess it against a typical logon time. If the behaviour is not normal, then we can assume there is a fault. For the most part (not always!) vendor products do not show obvious faults in isolation. So, if the vendors do not know about this fault, then there must be a fault in our configuration, or an unknown fault in the product. There must be something we can change so that the problem does not occur. Our job is to find what to change.
The way I go about it is divided into five parts. These are not all sequential, but they involve different types of activity:
- Incident Handling
- Problem Definition
I am not going to describe each part in detail: just give an idea.
Being practical, I am going to assume that the great majority of day-to-day help desk calls do not require much rigour. The most important thing is to avoid doing things twice. As soon as we (on the help desk) realise that a problem is not simple, we need to follow a systematic process to gather the incident data in a way that avoids having to do it again.
As soon as we realise that a fault may be more difficult to resolve than by providing a simple change, we need to prepare a Problem Definition.
Very often I find that, when starting out, it is more difficult to write down an accurate statement of the problem than you might expect. The report might be: "The application crashes". But is that a Not Responding window or an application error? If there is an error, what is the exact message, and what happens when you click on OK? Is there an error shown in the Application Event Log? Does the application log show an error? How often does this occur? How many people are affected? What is the user doing in the moments leading up to the problem? Writing down what is and is not known goes a long way to defining what problem we are trying to solve.
It is not always easy. I have worked on a fault causing Windows to crash with a bug check, where it was difficult to know even whether there was a single problem, or more than one.
This is not a distinct step. But I identify it separately because we attempt to gather data systematically to understand the problem better. The more we understand, the more specific the tools we might use to investigate further.
Initially we might configure application debug level logging. Or we might run a network capture. As we narrow it down we might get very specific.
For example, I recently had a problem where Outlook seemed to hang for a few seconds when typing. I first ran a network capture of all traffic to see if, perhaps, a peak was causing a delay in response. There was no significant level of traffic. There was no significant latency. No retransmission. No problems with the network connection in general.
Then I ran a capture only between the Outlook client and the Exchange server (using the relevant ports). I noticed a spike in traffic every 15 minutes, coinciding with the problem. But what was the spike? Was it normal, or a fault?
So then, on the server, I configured Remote Operation (ROP) logging to capture the exact MAPI operations being performed by the client. The problem was caused by a MAPI action to re-read the mail database and refresh the view in Outlook. This occurred exactly when the user experienced the problem.
I will mention just two aspects of the analysis.
One is that we must have a mental model of the part of the system we are examining, and all its components. We then need to rule in or out the different components we think could be involved. Often I am told the problem is in Citrix. When I investigate, it is not in Citrix at all. It is in, for example, the configuration of the application on Citrix. Or, in one very difficult case, it was port exhaustion in an unpatched Windows server. Or it might be a configuration error in Citrix after all.
The second is conjecture. I use the term to mean that we need to develop an idea of a single cause of all the different symptoms we have found in the investigation. I don’t mean a guess. For example, sometimes people will say "I think it must be the anti-virus". This is a guess. But a conjecture would be: "A windows driver fault is occurring in the I/O stack. It is a type of fault that does not appear in testing using Driver Verifier". This is exactly what the problem was, but it was extremely difficult to find.
An interesting aspect of troubleshooting is that the final part is often really easy. Often, I don’t need to be involved any further, once the exact cause is identified.
You might think that, after investigation and analysis, we may have a good idea but we cannot be certain. Of course, nothing is ever entirely certain. But going back to what I said about the context, the problem must be caused by a specific fault; otherwise the behaviour would be normal. When you find the cause, you just know this is it.