'So, what is it you exactly do?' - Part five, troubleshooting

In the last article of this sysadmin series, I talked about the importance of monitoring as an insight into infrastructure and application behaviour - something that is hard to overstate. But what good is monitoring if you don't understand what it's telling you? That's where troubleshooting comes in.

I had a hard time working out how to write this piece, precisely because troubleshooting is only part science, and part intuition. The quality of your troubleshooting depends on how your brain works. I (of course..) pride myself on my troubleshooting abilities, mostly because I think I'm good at pattern recognition and have a good understanding of how different things behave under certain conditions. This goes beyond, say, counting the number of times an error has appeared in a log, and extends to recognising the 'meta' of those entries: what time they appeared in the logs, and what else was happening at the time. Whether it coincides with wider problems (e.g upstream issues). What happened after the error, and whether that was a side-effect of the problem or unrelated. Likewise for what came before the error, and whether it was the cause.
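To sketch what I mean - the log file and its contents below are made up for illustration, and on a real box you'd be grepping something under /var/log instead:

```shell
# A tiny sample log to illustrate (real logs live under /var/log):
cat > app.log <<'EOF'
2024-01-05 02:00:01 ERROR Connection refused
2024-01-05 02:00:02 INFO  retrying
2024-01-05 02:14:05 ERROR Connection refused
EOF

# Counting the error is the easy part...
grep -c 'Connection refused' app.log    # → 2

# ...the 'meta' is the context: what happened just before and after
# each occurrence, timestamps intact.
grep -B 1 -A 1 'Connection refused' app.log

# And whether occurrences cluster at particular times - which can
# point at cron jobs, backups, or upstream outages.
grep 'Connection refused' app.log | cut -c1-16 | uniq -c
```

The last pipeline is the one I reach for most: errors that always land at 02:00 tell a very different story from errors scattered across the day.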

Ghosts in the machine

Just today I was troubleshooting a Drupal problem which was reported to me as a file permissions issue, but which initially behaved like a JavaScript issue. Based on the nature of the error message, it quickly became clear to me that

  • a) it was a server-side error being thrown, but
  • b) it was not permissions, and
  • c) it was most likely a change in the hosting provider's mod_security rules (a false positive on CKEditor actions when publishing a node).

I didn't even know the provider employed mod_security, but it was the only thing I could think of that would explain a 403 on that sort of POST request under those conditions. This was all before I'd even had the chance to SSH into the server to see the logs that proved it (my client joked that it took a couple of hours to get the hosting provider to reset an expired password so I could log in, but only about 5 minutes for me to determine the cause of the problem).
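For the curious, the log evidence for this sort of thing is easy to find once you can get at it. The log path, rule ID and URI below are all made up for illustration, but the shape of the line is what mod_security writes to the Apache error log:

```shell
# Sample line in the shape mod_security logs to the Apache error
# log (path, client, rule ID and URI here are all illustrative):
cat > error.log <<'EOF'
[Fri Jan 05 10:12:01 2024] [error] [client 198.51.100.7] ModSecurity: Access denied with code 403 (phase 2). Pattern match ... [id "950907"] [uri "/node/42/edit"]
EOF

# On the real server you'd grep the live log, e.g.
#   grep 'ModSecurity.*403' /var/log/apache2/error.log
# and pull out the rule that fired:
grep -o 'id "[0-9]*"' error.log    # → id "950907"
```

Once you have the rule ID, the host can whitelist it for the affected path (ModSecurity's SecRuleRemoveById directive) rather than disabling mod_security wholesale.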

Some of this is evidence based. Some of it is 'trial and error' or 'process of elimination' technique. But some of it is a vague 'feeling' that I can't quite describe, but that I think simply comes down to a combination of experience and 'thinking right' about an issue.

Another case from last week: a server was reported by Nagios as flapping all over the place - lots of packet loss every 10 minutes or so. I was baffled by the Munin monitoring, which seemed to show wild spikes in CPU and RAM for the machine. Then my ghost whispered a 'hunch': I was not seeing spikes at all, but the data of a different server with the same hostname or IP - and a different specification - responding to the Munin request and superimposing its own data onto the same graph. Some ARPing and tcpdump action proved a case of 'martian packets': I did indeed have two servers on the same network with the same IP assignment (but, of course, different MAC addresses).
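If you ever suspect the same thing, the check is quick. The interface and address below are illustrative, and both arping and tcpdump need root on a live network, so sample captured output stands in for a real capture here:

```shell
# Duplicate-address detection with iputils arping needs root and a
# live interface, so shown as a comment (address illustrative):
#   arping -D -I eth0 -c 2 192.0.2.10

# Or capture ARP replies and count how many distinct source MACs
# claim the address - two MACs for one IP means a duplicate.
# Live capture would be:
#   tcpdump -eni eth0 -c 8 'arp and host 192.0.2.10'
cat > arp.txt <<'EOF'
12:00:00.1 aa:aa:aa:aa:aa:aa > ff:ff:ff:ff:ff:ff, ethertype ARP: Reply 192.0.2.10 is-at aa:aa:aa:aa:aa:aa
12:00:00.2 bb:bb:bb:bb:bb:bb > ff:ff:ff:ff:ff:ff, ethertype ARP: Reply 192.0.2.10 is-at bb:bb:bb:bb:bb:bb
EOF

awk '/is-at/ && !seen[$2]++ { n++ } END { print n " MAC(s) claim the IP" }' arp.txt
# → 2 MAC(s) claim the IP
```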

What not to do

Over the years I've seen sysadmins simply fail to think properly when faced with a problem. Common mistakes include:

  • "Maybe it's..." - endless speculation about theories, without actually looking at what the error says
  • Changing lots of things in an effort to solve the problem - often without testing after each change, or altering the landscape so far that the problem can no longer be reproduced
  • Misreading a problem as a different problem because a similar thing has happened before (e.g a network timeout causes a backup to fail, but the sysadmin assumes corruption occurred and forces a full backup, because that's what was behind the last failure)
  • Solving the problem but misunderstanding how or why - an unusual case that involves not understanding the solution let alone the problem (i.e copy-paste from StackExchange and then going to lunch)
  • Simply not testing their own theory of the problem - e.g believing a firewall is blocking a request, but not testing the request another way to prove it (no tcpdumping on either end to see whether packets arrive and ACKs are sent back, etc), instead just giving up and escalating the issue
  • Not keeping or updating a 'Trouble Log' - a searchable diary of problems that occurred and how they were fixed (in case they happen again, as well as to reinforce one's own memory and understanding of the issue by writing it down)
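On that last point, a Trouble Log needs nothing fancier than an append-only text file with a grep-friendly layout. The file name and format below are just one suggestion, not a standard:

```shell
# An append-only, grep-friendly trouble log: one dated line per
# incident, plus free-form notes if needed. File name and layout
# are just a suggestion.
cat >> troublelog.txt <<'EOF'
2024-01-05 | web01 | 403 on node save | mod_security false positive, rule whitelisted by host
EOF

# Next time a similar symptom appears, search past incidents first:
grep -i '403' troublelog.txt
```

The act of writing the entry is half the value; the other half is that a one-line grep beats an hour of re-diagnosis when the same symptom shows up a year later.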

Sometimes it's more esoteric and probably down to experience - a junior sysadmin might attempt to run a command and give up waiting if it takes too long, reporting it as a symptom of high load (naturally without backing this up with evidence) or a firewall issue, whereas I recognise the characteristic delay of a DNS issue. This sort of thing is really hard to teach, and that troubles me, because I am trying to do so all the time with junior sysadmins at the agencies I consult for as the 'ad-hoc senior'!
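That 'characteristic delay' isn't magic, by the way - it can be estimated. The glibc resolver defaults to a 5 second timeout and 2 attempts per nameserver (see resolv.conf(5)), so a broken first nameserver stalls commands for a predictable multiple of that. A rough sketch, using a made-up resolv.conf sample (point it at the real /etc/resolv.conf in anger):

```shell
# Worst-case resolver stall ~ timeout x attempts x nameservers
# (glibc defaults: timeout 5s, attempts 2). Sample file is
# illustrative; use /etc/resolv.conf on a real machine.
cat > resolv.sample <<'EOF'
nameserver 192.0.2.53
nameserver 192.0.2.54
options timeout:3 attempts:2
EOF

awk '
  /^nameserver/ { ns++ }
  /^options/ {
    for (i = 2; i <= NF; i++) {
      split($i, a, ":")
      if (a[1] == "timeout")  t = a[2]
      if (a[1] == "attempts") r = a[2]
    }
  }
  END {
    if (!t)  t = 5
    if (!r)  r = 2
    if (!ns) ns = 1
    print "worst-case DNS stall ~ " t * r * ns " seconds"
  }
' resolv.sample
# → worst-case DNS stall ~ 12 seconds
```

When a command 'hangs for about that long and then works', DNS goes to the top of my suspect list.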

I have a strong memory of being panicked by a massive network issue at an organisation I worked at when I was a junior. I was all over the place with wild theories as to what was going on, misinterpreting traceroutes, and making dangerous changes to firewalls that only led to other auxiliary problems. The head of the sysadmin team sat with me and forced me to 'think of myself as sitting on the packet' (or something along those lines). By getting me to imagine travelling alongside the packet on its journey to its destination, he led me to find precisely which tier in the stack of routers was not forwarding the packet as it should have been. That experience was highly influential on my overall approach to troubleshooting, even outside of complex networking (routing and BGP etc), which probably remains my weakest area.

Summing up

I don't have much more to say on this area of sysadmin, but the point of the series is to explain what makes up my day. Effective troubleshooting is an enormous part, and I feel that it, along with communication (which I'll cover in the concluding piece), forms the strongest pair of tools in my sysadmin arsenal, despite neither being inherently technical. I appreciate the irony of this in what is a very technical profession.

My advice to other sysadmins when it comes to troubleshooting is:

  • Collect evidence: think carefully about what the error is actually telling you, rather than piling up your own theories. This also reduces the risk of applying solutions found via Googling that are irrelevant, and therefore dangerous in their own right
  • Make small changes in your efforts to resolve the issue. If a change doesn't fix the issue, undo it so as not to introduce new issues
  • Learn to Google effectively. When I started in sysadmin, I didn't have a senior mentor at my day job - he hired me and then went on holiday for a couple of months. Google was literally all I had
  • Record everything in a 'Trouble Log' to make next time easier. This can also double as an incident report or addition to your risk register if you have compliance matters to deal with (e.g ISO27001)

Could your Drupal agency do with a consultant sysadmin's expert troubleshooting? I work on an ongoing retainer basis, in different capacities (4 to 40 hours a month depending on organisation size), for Drupal agencies all round the world who can't justify a full-time sysadmin. Get in touch.

Coming up

We are nearly done. Two more areas I'd like to talk about:

Part Six: high availability and disasters (sometimes the same thing :) ),
Part Seven: communication.