[WIP] Read The Logs

Sunday, June 1, 2025 • 3 min read

When you work on a ~~micro~~ reasonably sized service, you inevitably talk to neighboring teams, and your service talks to other services. Of course, every single developer is a bearer of unique knowledge about the system and the interactions between systems. No one can possibly know all the details and nuances of their own service, let alone the whole service fleet.

Imagine a scenario: you're on call over a weekend, it's close to midnight, and, of course, you get paged. The situation is dire: it's not a monitoring alert. It's a customer-reported issue that's been open for weeks now, and the customer wants it fixed! The issue's been forwarded between teams without significant progress, and no one understands what exactly is wrong. You've got the incident only because they think it might be slightly related to your area of expertise. Just a bit. A teeny-tiny bit. At least they know that the affected functionality sends requests to, or receives responses from, your service. Since the incident hasn't had any input from your team yet, they send it to you so that they can show the angry customer some traction. You, on the other hand, don't believe that the problem originates in your service. So, it's not your problem, right?

FALSE!

It is your problem now. What if I told you that you can be the engineer who solves this puzzle and finds the root cause? Get recognition from the managers, the senior leadership, and the customers, and finally get that long-deserved promotion. Sounds cool, huh? But you have to have a very particular set of skills.

Very Particular Set of Skills

The set is small: {read code, read logs}. But each one of those goes deep down the rabbit hole. It's relatively easy to read the logs & code of your own service: you work on it every day. But there's a lot to learn about other services. Why even bother learning it? Easy: to get shit done. I'm not a big fan of working on Saturday nights, but on-call is on-call. Can you redirect this incident to some other team? Most probably. Do you want to do it? Hell yeah! But will you? Not really. What's gonna happen if you send it away immediately? It'll wander around for a bit and then come right back to you (true story). Let's look into the issue, maybe it'll go away.

With lots of swearing you dive deep into the incident comments to find the description and some repro data. You search your logs for the given timestamp and user ids. All looks fine. You process the request, you do some magic, you send the request downstream, you return a response to the user.
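If your log viewer doesn't do this for you, the "search by timestamp and user id" step can be sketched in a few lines. This is a minimal sketch, not anyone's real tooling: it assumes each log line starts with an ISO 8601 timestamp and carries a `user_id=...` token, and both the `find_lines` helper and the sample lines are made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical log lines; a real service's format will differ.
SAMPLE = [
    "2025-05-31T23:40:58 INFO request_id=abc123 user_id=42 received GET /orders",
    "2025-05-31T23:40:59 INFO request_id=abc123 user_id=42 calling downstream",
    "2025-05-31T23:41:00 INFO request_id=zzz999 user_id=7 received GET /orders",
    "2025-05-31T23:41:01 INFO request_id=abc123 user_id=42 responded 200",
    "2025-05-30T10:00:00 INFO request_id=old111 user_id=42 received GET /orders",
]


def find_lines(lines, user_id, around, window=timedelta(minutes=5)):
    """Return lines for `user_id` logged within `window` of `around`."""
    wanted = f"user_id={user_id}"
    hits = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not a timestamped line (e.g. part of a stack trace)
        # Compare whole tokens so user_id=42 does not match user_id=421.
        if abs(when - around) <= window and wanted in rest.split():
            hits.append(line)
    return hits


if __name__ == "__main__":
    reported_at = datetime(2025, 5, 31, 23, 41)
    for line in find_lines(SAMPLE, "42", reported_at):
        print(line)
```

The time window matters more than it looks: customer-reported timestamps are rarely exact, so searching a few minutes on either side saves you from concluding "nothing happened" too early.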

*record scratch sound*

Wait, what? You sent it downstream? Okay, you found that there's something weird happening in the downstream service. What are we going to do? Redirect the incident to this downstream service? Sounds good, it's a problem on their end, nothing to do with us. But can we do something better than just get rid of it? Of course!

Let's open their logs! We know the timestamp, we know the request id, we know basically everything to identify the lines that we need! Let's get some shit done. We search for the call, we try to find anything strange - it's not that hard, the logs are written in plain English after all.
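Here is roughly what that search looks like when done by hand. Again, a minimal sketch under assumptions: the downstream service is assumed to log the same `request_id=...` token you sent it and to use `WARN`/`ERROR` level markers, and `trace_request` and the sample lines are hypothetical.

```python
# Hypothetical downstream log lines; a real service's format will differ.
DOWNSTREAM = [
    "2025-05-31T23:40:59 INFO request_id=abc123 fetching record from storage",
    "2025-05-31T23:40:59 WARN request_id=abc123 storage returned malformed record",
    "2025-05-31T23:40:59 ERROR request_id=abc123 cannot build response",
    "2025-05-31T23:41:00 INFO request_id=zzz999 fetching record from storage",
]


def trace_request(lines, request_id, suspicious=("WARN", "ERROR")):
    """Return (all lines for the request, the suspicious subset)."""
    wanted = f"request_id={request_id}"
    # Match whole tokens so abc123 does not also match abc1234.
    mine = [line for line in lines if wanted in line.split()]
    odd = [
        line for line in mine
        if any(level in line.split() for level in suspicious)
    ]
    return mine, odd


if __name__ == "__main__":
    mine, odd = trace_request(DOWNSTREAM, "abc123")
    print(f"{len(mine)} lines for the request, {len(odd)} look strange:")
    for line in odd:
        print(line)
```

Keeping the full list next to the suspicious subset is deliberate: the strange line usually only makes sense once you read what happened right before it.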

GOTCHA!

They receive some weird crap from the storage that breaks the user flow. Here we don't need to go any deeper (but we can, of course). We've just made sure that the issue isn't on our side. We took an extra step or went the extra mile, whatever. We got shit done. Let's write the report, attach everything we have, and finally go to sleep.