Minimal Viable Logging

This guide assumes that you have already dealt with log levels and have experience building and debugging complex systems on a decent scale.

My name is Anton Kulikalov, and I'm a professional Cloud Architect. In this article, I'll share my minimalistic approach to production-ready logging, which allows me

  • to manage multiple projects alone,
  • to diagnose bugs within minutes,
  • and to sleep tight, knowing that I'll be automatically notified only in case of emergency.

Production logs must answer two questions:

  1. Is everything okay?
  2. What went wrong?

The rest is very situational, usually unnecessary and could even be wasteful on a larger scale.

Is everything okay?

Essentially, everything is an "I/O device": everything has an input and an output. What is the surest, universally applicable indication that an I/O device is healthy? Its input and its output. Logging the crucial parts of both is usually enough to tell whether there is a problem.

products/GET(search_phrase=coffee)
✅ products/GET(count=10,products=[<product_id>,...])

products/GET(search_phrase=coffee,page=2)
✅ products/GET(count=10,products=[<product_id>,...])

products/GET(search_phrase=coffee,page=3)
✅ products/GET(count=10,products=[<product_id>,...])
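
One way to get log lines like these without sprinkling log calls through every handler is a small decorator. The following is a minimal sketch in Python, not a definitive implementation: the names `log_io`, `format_fields`, and `search_products`, the "app" logger, and the product IDs are all hypothetical and exist only for illustration.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("app")  # hypothetical logger name


def format_fields(fields):
    """Render a dict as key=value pairs, e.g. count=10,page=2."""
    return ",".join(f"{key}={value}" for key, value in fields.items())


def log_io(name):
    """Log the keyword arguments on entry and the returned dict (or the error) on exit."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            logger.info("%s(%s)", name, format_fields(kwargs))
            try:
                output = func(**kwargs)
            except Exception as exc:
                logger.error("❌ %s(message=%s)", name, exc)
                raise
            logger.info("✅ %s(%s)", name, format_fields(output))
            return output
        return wrapper
    return decorator


@log_io("products/GET")
def search_products(search_phrase, page=1):
    # Hypothetical handler: a real one would query a database here.
    products = ["p1", "p2"]
    # Return only the crucial parts of the output, since these get logged.
    return {"count": len(products), "products": products}
```

Calling `search_products(search_phrase="coffee")` logs one line for the input and one for the output. The wrapped function decides what counts as "crucial" by choosing what it returns, so large payloads never reach the log.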

What went wrong?

What would give you a definitive picture of what went wrong with an I/O device? The input. Sometimes the context matters too. If that's the case, then use a circular buffer to store the 10 most recent debug logs and add them to your error logs when an error actually occurs. Maybe even put this buffer into a separate process, so it won't crash along with your app.

products/GET(search_phrase=coffee)
🤖 validating input...
🤖 querying db...
...
❌ products/GET(code=500,message=db disconnected)
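
The circular buffer can be sketched with `collections.deque`, which drops the oldest item once `maxlen` is reached. This is an illustrative, in-process version only (it does not cover the separate-process variant), and the names `RecentLogsHandler`, `debug_log`, `error_log`, `get_products`, and `query_db` are assumptions made for the example.

```python
import collections
import logging


class RecentLogsHandler(logging.Handler):
    """Keep only the most recent `capacity` log lines in memory."""

    def __init__(self, capacity=10):
        super().__init__(level=logging.DEBUG)
        self.recent = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.recent.append(self.format(record))

    def drain(self):
        """Return the buffered lines, oldest first, and empty the buffer."""
        lines = list(self.recent)
        self.recent.clear()
        return lines


# Debug lines go to the in-memory buffer only and never reach the real log.
debug_log = logging.getLogger("app.debug")
debug_log.setLevel(logging.DEBUG)
debug_log.propagate = False
recent = RecentLogsHandler(capacity=10)
debug_log.addHandler(recent)

# Errors go to the normal logging pipeline.
error_log = logging.getLogger("app.error")


def get_products(search_phrase, query_db):
    # `query_db` is a hypothetical callable standing in for the database layer.
    debug_log.debug("validating input...")
    debug_log.debug("querying db...")
    try:
        products = query_db(search_phrase)
    except Exception as exc:
        # Attach the buffered context to the error log entry.
        context = "\n".join(recent.drain())
        error_log.error("❌ products/GET(code=500,message=%s)\n%s", exc, context)
        raise
    recent.drain()  # success: the context is no longer needed
    return products
```

On the happy path the debug lines cost a few appends to a bounded deque and are then discarded, so they add nothing to log volume.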

The critical log level is unreliable

You can't rely on a system to always be able to report on itself. However perfect your error handling is, the first hard drive crash or blackout will make your service go silent without any notice. Such events are rare, though; it's far more likely that you'll simply screw up somewhere and cause your service to go silent yourself. That's okay: we are all human, and we make mistakes. So instead of relying on critical logs, use external health checks. Whether your service is a web server or a data transformation pipeline, you can expose an API endpoint that makes the service run quick internal diagnostics and report back. Having an external monitor poll that endpoint is a reliable way of making sure your service is truly healthy.
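
As a rough illustration, such an endpoint can be built with nothing but Python's standard library. The `/healthz` path, port 8080, and the two placeholder checks below are assumptions for the sketch; real checks would ping your database, verify free disk space, and so on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical checks: each callable returns True when its dependency is healthy.
CHECKS = {
    "db_reachable": lambda: True,
    "disk_writable": lambda: True,
}


def run_diagnostics(checks=None):
    """Run every check; a check that raises counts as failed."""
    checks = CHECKS if checks is None else checks
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        healthy, results = run_diagnostics()
        body = json.dumps({"healthy": healthy, "checks": results}).encode()
        # 503 lets an external monitor detect trouble from the status code alone.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def serve(port=8080):
    """Blocking call; run it in a thread or a dedicated process."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

The important part is on the other side: an external monitor that polls this endpoint treats both a 503 and no answer at all as an alarm, which covers the case where the service is too broken to log anything.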

Group your logs

[<request_id>, <user_id>]products/GET(search_phrase=coffee)
[<request_id>, <user_id>]✅ products/GET(count=10,products=[<product_id>,...])

Can you easily trace a specific user's behavior using just logs? Can you easily trace the health of a particular function or method across all users and services?

Use log grouping or branching to make it possible.

Log the input as soon as the service receives it and mark it with a unique identifier. Then log the output or the error message with the same identifier, so you can filter out all irrelevant logs when debugging your system. There is usually no need to use hashing algorithms or generate globally unique identifiers for that purpose. It's enough if your identifier does not occur more than once a day. A more advanced way of using this technique is to branch your logs, so you can nest groups inside other groups.
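
In Python, one possible way to stamp every line with the same identifiers is a `contextvars` variable combined with a `logging.Filter`. The sketch below is only an illustration: `handle_request`, `configure`, the "grouped" logger, and the 8-character random ID from `secrets.token_hex` are assumptions, chosen because a short random ID is unlikely to repeat within a day at moderate traffic.

```python
import contextvars
import logging
import secrets

request_id_var = contextvars.ContextVar("request_id", default="-")
user_id_var = contextvars.ContextVar("user_id", default="-")


class ContextFilter(logging.Filter):
    """Copy the current request and user IDs onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        record.user_id = user_id_var.get()
        return True


def configure(handler):
    """Make a handler prefix each line with [<request_id>, <user_id>]."""
    handler.addFilter(ContextFilter())
    handler.setFormatter(
        logging.Formatter("[%(request_id)s, %(user_id)s]%(message)s")
    )
    return handler


logger = logging.getLogger("grouped")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(configure(logging.StreamHandler()))


def handle_request(user_id, search_phrase):
    # Short random ID: not globally unique, just unlikely to repeat soon.
    request_id = secrets.token_hex(4)
    request_id_var.set(request_id)
    user_id_var.set(user_id)
    logger.info("products/GET(search_phrase=%s)", search_phrase)
    product_count = 10  # placeholder for the real lookup
    logger.info("✅ products/GET(count=%d)", product_count)
    return request_id
```

Because the IDs live in context variables, code deeper in the call stack can log without having the identifiers passed down to it, and filtering the logs by one request ID returns the whole story of that request.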

Use Cloud Logging

There is a habit of keeping logs stored as files next to the running service. This is the default behavior for most OSs, so it's a natural thing to do. But it doesn't have to be the primary storage. It's like keeping all your money in a box under your bed instead of in a bank account: inconvenient and fragile. Instead, use a centralized logging service, such as Google Cloud Logging or similar. Google Cloud Logging comes with handy data storage, retention policies, and all kinds of filtering, querying and parsing capabilities. It can also visualize the data and email you whenever something important happens, or when something that is supposed to happen does not. You can even stream your client-side logs to it and build yourself a cozy place to observe your entire project's heartbeat.
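
A low-effort way to feed a centralized service is to write one JSON object per line to stdout and let the platform's log collector ship it. The sketch below assumes such a collector is in place; Google's managed runtimes such as Cloud Run are documented to parse JSON lines and to recognize the `severity` and `message` fields, but check the documentation for your own environment. `JsonFormatter`, `build_json_handler`, and the "service" logger are hypothetical names.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record):
        entry = {
            "severity": record.levelname,  # DEBUG, INFO, WARNING, ERROR, CRITICAL
            "message": record.getMessage(),
            "logger": record.name,
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, ensure_ascii=False)


def build_json_handler(stream=sys.stdout):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    return handler


logger = logging.getLogger("service")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(build_json_handler())
```

Structured fields are what make the filtering and alerting described above practical: an alert on `severity=ERROR` is a query on a field rather than a text search.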

Thanks!

Thank you for reading this to the end! If you have any comments, please feel free to shoot me an email.