Diagnostics Anonymous
When working with large systems, especially software, something will inevitably go wrong and you’ll be stuck trying to diagnose it. My years of developing software have given me a lot of practice diagnosing errors and unintended behaviour in software systems (usually in my own code), and over time I’ve built up a systematic process that I run through to speed up the diagnosis. In writing this post, I came to realise that my process has quite a lot in common with 12-step programs, so I’ve tweaked the format to make the content a little more interesting.
1. We admitted that there was a problem, that the system had become unmanageable
The first step is to admit that you have a problem. Sometimes you are working on something and struggling to get it to behave how you want, and you are just reaching the point where you begin to question whether you are taking the right approach. Or perhaps you saw what you thought was an isolated incident that you corrected, but you’re starting to suspect there might be more to it. It’s important to stop and admit that there is a problem if you want to have any hope of understanding it.
Hopefully when you started doing whatever went wrong, you had an idea in your head of what you were hoping to achieve. If you didn’t, hey, maybe you don’t need to solve the problem after all. Just leave everything broken and move on. More likely, you were trying to solve a particular problem or achieve a particular outcome. Keep this in the back of your mind, because it’s important to remind yourself of it every so often to make sure that you’re on track to actually solve it and that you haven’t gone so far down a rabbit hole that you’ve forgotten what you were trying to do in the first place.
Quite often, when attempting to solve one problem you stumble upon a second problem, and the best solution is to not solve the second problem at all, but to find a better solution to the first problem. This scenario is known as an XY problem.
2. Came to believe there must be a way to resolve it
Now that you have admitted that you have a problem, the next step is to understand how things could be better. Ask yourself the following questions. What do I think should happen? What are the outputs I’m hoping to see? How will I know when I’ve solved the problem? If I had a magic lamp, what wish would I ask of the genie? Genies can be really pedantic, so you need to be as precise and unambiguous as possible.
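One way to make the wish pedantic enough is to write it down as a check that can only pass or fail. The sketch below is purely illustrative: `parse_price` is a hypothetical stand-in for whatever is misbehaving in your system, and the inputs and expected values are made up.

```python
from decimal import Decimal


def parse_price(text: str) -> Decimal:
    """Hypothetical function under diagnosis: turn '$1,234.50' into a Decimal."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


# The "wish", stated precisely: each input paired with the exact output we want.
EXPECTED = {
    "$1,234.50": Decimal("1234.50"),
    " $0.99 ": Decimal("0.99"),
    "12": Decimal("12"),
}


def problem_is_solved() -> bool:
    """Return True only when every expected output matches exactly."""
    return all(parse_price(raw) == want for raw, want in EXPECTED.items())


if __name__ == "__main__":
    print("solved" if problem_is_solved() else "not solved yet")
```

Having a check like this means you can re-run it after every change instead of relying on a feeling that things look better.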
3. Made a decision to turn our will over to resolving it
Unfortunately you probably don’t have a magic lamp, so you’re going to have to resolve the problem yourself. It’s important to commit yourself to solving the problem, adjust your mindset, and make sure you have the best environment for tackling it. Try to clear any clutter off your desk and your screen, pull up any documents that might be useful, and get a pen and paper ready.
4. Made a searching and fearless inventory of the system and environment
Gather up any relevant information you can find. Current log files, old log files, error messages, network traces, known issue lists, documentation, requirements, test cases, configuration, patch information, everything. Information is key here and you want as much of it as you can. If you can find historical data, get it because it can help you track changes over time.
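Some of this inventory can be captured automatically. Here is a minimal sketch, assuming a Python environment, that records a timestamped snapshot of a few basic facts. The function name `take_inventory` and the choice of which environment variables to keep are my own assumptions; what counts as relevant will depend on your system.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone


def take_inventory(env_prefixes=("PATH", "LANG")):
    """Collect a small, timestamped snapshot of the current environment.

    env_prefixes is an illustrative filter; widen it for your own system,
    but avoid dumping secrets into a file you might share.
    """
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "cwd": os.getcwd(),
        "env": {
            key: value
            for key, value in sorted(os.environ.items())
            if key.startswith(tuple(env_prefixes))
        },
    }


if __name__ == "__main__":
    # Save the output next to your other evidence so you can compare it later.
    print(json.dumps(take_inventory(), indent=2))
```

Taking a snapshot before you touch anything gives you something to compare against once you start making changes.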
5. Stated to ourselves the exact nature of the problem
It’s important to take note of what behaviour you’re actually seeing and how it compares with what you want to happen. As you start trying to solve the problem, you need to know that you’re affecting the system and, if possible, whether you’re making things better or worse. Ask yourself the following questions.
What is actually happening? How is this different to what I expected? When does this behaviour occur? Has it always happened, or did it break? If it broke, did anything change around the time that it broke? Did I change something right before it broke? Often the cause of a problem is the last thing you did before you noticed it. Is there a pattern to when it works and when it doesn’t, like a certain number of repetitions, only during certain times of the day, only after some other event, only in certain environments? If it’s environmental, what’s different between the environments?
These questions help us to consider external factors that might be causing the problem. It’s good to get these out of the way first, because there’s no point in wasting your time on trial-and-error if the cause ends up being unrelated to the changes you try making.
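The question of what differs between environments is one that a few lines of code can help answer. The sketch below assumes you can get each environment's settings into a plain dictionary; `diff_settings` and the example values are hypothetical and only meant to show the idea.

```python
def diff_settings(working: dict, broken: dict) -> dict:
    """Return only the keys whose values differ between two environments.

    Each result maps a key to a (working_value, broken_value) pair; a key
    missing from one side is reported with the MISSING placeholder.
    """
    MISSING = "<missing>"
    differences = {}
    for key in sorted(working.keys() | broken.keys()):
        left = working.get(key, MISSING)
        right = broken.get(key, MISSING)
        if left != right:
            differences[key] = (left, right)
    return differences


# Made-up settings for illustration only.
staging = {"db_pool_size": 10, "timeout_s": 30, "feature_x": True}
production = {"db_pool_size": 10, "timeout_s": 5, "cache": "redis"}

for key, (ok_value, bad_value) in diff_settings(staging, production).items():
    print(f"{key}: working={ok_value!r} broken={bad_value!r}")
```

Everything that is identical drops out, which leaves a much shorter list of suspects to look at.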
6. Were entirely ready to resolve the problem
At this point you should have all the information that might be
useful for resolving the problem, and a clear understanding of the
precise nature of the problem and what it means for the problem to be
resolved. All that is left is to get into it. You may have found that
in running through all the previous steps, you realised what the
problem was and you have already figured
out how to solve it. Actually this happens so frequently, people gave
a name to this phenomenon—
7. Humbly removed shortcomings
Start by putting out any current fires—
8. Made a list of all components that are affected, and how we can amend them
Now it’s time to start narrowing down the search space. What
components in the system or features in the software are you using?
Think about all the behind-the-scenes processes that might also be
running, and the requests that might be coming in from other systems.
Which of them can you reconfigure to actually make changes to the
system? Which of them can you disable to reduce the number of
interactions? Which of them can you step through slowly,
verifying after each step that things are looking good? Which of them
can you orchestrate manually? What can you reset between each attempt
to eliminate the possibility of interaction between attempts? Make an
educated guess as to what will happen for each change you could
make—
9. Made direct amends to such components wherever possible
Based on your educated guesses, start making some changes. Where
possible, try to make only a single change at a time, and monitor to
see what effect the change has. If you don’t notice any change, or if
the change doesn’t appear to resolve the issue, revert the change and
try to get as close to the original state as possible. You want to
make sure that when you finally do resolve the issue, you actually
know which changes were required. Compare new behaviour, new results
and new log entries with those collected during step 4. Compare
observed behaviour with what you guessed would happen. Try to be as
thorough and systematic as possible, recording your actions and the
results. There isn’t really a great deal more general advice I can
give on the actions required to resolve arbitrary issues—
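To make the one-change-at-a-time idea concrete, here is a minimal sketch. It assumes the system can be modelled as a configuration dictionary and that a single function can tell you whether the problem is gone; real systems rarely revert this cleanly. All names here (`try_changes`, `is_fixed`, the settings) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    name: str
    prediction: str
    fixed: bool


def try_changes(baseline: dict, changes: dict, is_fixed: Callable[[dict], bool],
                predictions: dict) -> list:
    """Apply each candidate change alone to a fresh copy of the baseline.

    Working on a copy is the 'revert' step: no attempt can leak into the next.
    """
    log = []
    for name, override in changes.items():
        trial = {**baseline, **override}  # exactly one change applied
        guess = predictions.get(name, "no guess")
        log.append(Attempt(name, guess, is_fixed(trial)))
    return log


# Hypothetical system: it only works with retries on AND a timeout of 10s or more.
def is_fixed(config: dict) -> bool:
    return config["retries"] > 0 and config["timeout_s"] >= 10


baseline = {"retries": 3, "timeout_s": 5, "verbose": False}
changes = {
    "raise timeout": {"timeout_s": 30},
    "enable verbose": {"verbose": True},
}
predictions = {"raise timeout": "should fix it", "enable verbose": "no effect"}

for attempt in try_changes(baseline, changes, is_fixed, predictions):
    print(f"{attempt.name}: predicted {attempt.prediction!r}, fixed={attempt.fixed}")
```

Keeping the prediction next to the result makes it obvious when the system surprises you, which is usually where the interesting information is.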
10. Continued to take inventory, and when things were off, promptly fixed them
During and after resolving the problem, it’s important to keep gathering as much information as you can to continue monitoring the system. As I said before, often the cause of a problem is the last action you took. You’re now in the action-taking phase, and you want to make sure that if you cause any problems, you see them as quickly as possible. You also want to make sure that you don’t accidentally revert the changes that you made to resolve the problem, causing it to occur all over again. Not to mention that the changes that you made to resolve your problem might interact badly with other parts of the system, causing more problems. Keep an eye on the system until you’re satisfied that everything is stable, and run through any tests or validation steps required to convince yourself that the system is now working correctly.
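One cheap way to keep an eye on things is to compare current logs against the ones you collected earlier. This is a rough sketch that assumes each log line starts with its level followed by a space, which may not match your log format; the function names and the sample lines are made up.

```python
from collections import Counter


def count_levels(log_lines):
    """Count lines per log level, assuming lines look like 'LEVEL message'."""
    levels = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if parts:  # skip blank lines
            levels[parts[0]] += 1
    return levels


def new_problems(baseline_lines, current_lines, watch=("ERROR", "WARNING")):
    """Return watched levels that occur more often now than in the baseline."""
    before = count_levels(baseline_lines)
    after = count_levels(current_lines)
    return {lvl: after[lvl] - before[lvl] for lvl in watch if after[lvl] > before[lvl]}


# Made-up log excerpts for illustration.
baseline_log = ["INFO started", "ERROR db timeout", "INFO retrying"]
current_log = ["INFO started", "WARNING slow query", "WARNING slow query"]

print(new_problems(baseline_log, current_log))
```

An empty result does not prove the system is healthy, but a non-empty one is a prompt to look closer before you walk away.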
11. Sought to improve our contact with the system, knowledge and the ability to act on it
Now that you’ve resolved the problem and started to collect information to monitor the system and ensure there are no regressions, it’s time to understand what really went wrong. Sure, there were visible symptoms that you identified in the earlier steps and those have already been resolved, but beneath the surface there’s always a deeper problem and it might still be there, waiting to surface as a new set of symptoms. This could be a lack of understanding of how the components in the system work or interact with each other, a breakdown in a human process that led to the issue, or a gap in monitoring that meant nobody was alerted to the issue early enough. Have a think about other improvements that can be made to prevent this and similar problems from happening again.
12. Having an awakening, we tried to document this knowledge, and practice it in the future
The last step is to document everything you’ve learnt so far. The goal should be to ensure that this problem never happens again, for this environment, for you, and for everybody else who has access to your documentation. If you found any more general improvements to overarching processes or methodologies that could help prevent similar problems too, make sure relevant documentation is updated with the improvements and key learnings. Try to spread the knowledge to others in your organisation or even the wider community (as I’m trying to do now by writing this post). With a small up-front investment of your time in spreading the knowledge, bit by bit we can all work towards eliminating hours of wasted effort diagnosing and resolving known problems with known solutions.
Now that you have (hopefully) enjoyed reading my 12 steps, what are your strategies for resolving the issues you encounter?