Limit your guessing

I want to share a humbling experience I’ve recently had at work. And I think you might find it interesting.

This is a repost from my old blog that is now lost to time and carelessness. I found some of my drafts and reedited a handful that might be interesting to read.

So you might have read it before but it's not like I have a massive following so you probably haven't ;)

So for about a year I’ve been a go-to guy for Oracle DB performance issues. It’s not that I’m particularly knowledgeable and apt at it, it just so happened that I’ve dealt with a couple of issues. Us being quite a small company I ended up in the role of the emergency DBA, so to speak. You know how it goes.

Having dealt with around ten-to-twenty slightly different things I got into a sort of a workflow:

  1. trace,
  2. tkprof,
  3. check the query,
  4. check the stats,
  5. fix an access predicate,
  6. propose adding an index if needed.

That was also my plan when I got a slack message calling me to take a look at a yet another database instance having performance issues.

What I’ve done

Now to give you some context this was one of our new clients, and a very special at that. They had the biggest number of customers we’ve seen so far and thus the biggest DB of all, so that was already a challenge. They’ve also cheaped out on hardware. That made our support guys’ jobs a bit, well, challenging.

Jumping in my expert “shoes” I’ve asked what exactly was slowing down, got an answer, made a guess and went off to take a trace. There was something funny about that trace. Everything was just baseline: not great, not terrible. I found a couple of minor issues but it’s not like they were causing any real damage.

Still I made a note of that, created a couple of JIRA tickets and a hotfix which we promptly run. The sun has set long ago, the workday was over and I went home.

Next day our support lead told me that nothing has really changed after my fixes so we had to go back to the drawing board. That was disappointing but not entirely surprising.

What I should’ve done

I spent some more time on this, gave up and asked a colleague for a second opinion.

We were done in fifteen minutes. As you might have guessed by now, this issue had nothing to do with SQL queries being slow. Yes, this client was special and their database was bigger. So I expected some queries which work fine everywhere else to be slow here on these massive amounts of data. What I didn’t expect is to see a completely different kind of a problem.

It was also extremely simple. During one of the scheduled jobs the instance ran out of filesystem descriptors. And I could find that out just by carefully looking at V$SESSION_WAIT. Everything else was waiting for a free descriptor.

Of course the moral of this story isn’t “Don’t forget to take a look at wait events”. Not at all! My problem was the tunnel vision. I kept applying this one thing that I knew that always worked and I got stuck. As soon as I engaged a second person I was forced to come up with a hypothesis, to explain what I want to look at and what I would expect.

The oldest trick in the book, the rubber duck debugging, would be just as helpful, but my co-worker brought some of his expertise as well, so I can’t take away his credit =)

Guessing is good

Now I’d never tell you to disregard your guesses and your intuition. You’re probably an experienced engineer, you’ve been doing it for a while. Your intuition is valuable. Your brain recognizes patterns and comes up with solutions.

And it’s magical when the solution works. You’ve dealt with a difficult problem extremely quick. That’s what you’re paid for. But our brains are not that well equipped for engineering tasks, so when your first guess (okay, first two guesses) doesn’t work, go back to engineering. Formulate a hypothesis, validate it. Formulate another.

That’s how we get closer.