How Avenue Makes Effective Logging Effortless with OpenTelemetry
Any engineer that has had to debug an issue in production will know that logs are a crucial component of the dev-ops toolkit. They’re almost certainly the first place you look when you’re notified that something in your app is broken, so the quality of your logs is highly correlated to the time it takes to diagnose, and ultimately fix, any issue that you run into. The only thing that can make a 2 AM on-call alert worse is the subsequent realization that you’re not going to be able to root cause the problem because it was not adequately logged. In an effort to help you avoid that situation, let’s go through the core traits of a successful logging solution below.
What makes a good log?
A good log has four key characteristics:
In most cases, a log should be as close as possible to the source of the error. This allows you to highlight the problem with the highest level of fidelity. As you bubble the error up through your call stack, you become more likely to conflate multiple sources of error into one generic log message. Ultimately it’s the difference between knowing that your API failed, and knowing the specific external dependency that caused said failure.
A log message should, at the very least, always contain the contents of the error that it is logging. In the example below, the log message on the bottom is exponentially more useful than the log message on top because it will provide the reader with substantially more information about the root cause of the failure:
Was it because the request was rate limited?
Was the request body malformed somehow?
Or was it something else?
Knowing the answer to these types of questions is fundamental to knowing how to fix the issue.
Today your logs are most likely aggregated into some persistent storage like DataDog, Splunk, or AWS CloudWatch. These log stores are fantastic tools for combing through the mountains of logs that your servers emit, but they’re far more useful if your logs are well-structured. In production you should emit your logs in a format like JSON so that your aggregator can properly index your logs. Likewise, rather than emitting a log as a monolithic string, emit it as an object with a key-value pair for each relevant piece of information.
This will allow you to explicitly search for each field, instead of always doing a fuzzy text-search on the content.
Lastly, a log message should be contextually rich. The definition of context will depend on your application, but it should ultimately serve to semantically identify a given log message from other log messages that may have been emitted at the same time or place in your app. In a nightmare scenario where you’re suddenly flooded with a series of errors, the first thing you’ll see is a chronologically-ordered glut of interleaved logs coming from a number of different places in your app.
It's incredibly important to have the capability to filter down these logs to a single trace of a given request, workflow, or other process in your app so that you can make sense of the noise.
Making It All Easy With OpenTelemetry
Ultimately, identifying and agreeing on the aspects of a good log message is only useful if those qualities are easy for the engineers to implement. Location, content, and structure can mostly be enforced through documented best practices and code reviews. However, making sure that each log message has adequate context is a non-trivial problem. Let’s use Avenue’s distributed workflow engine as an example.
Avenue’s workflow engine is a multi-tenant application that runs thousands of cron jobs every hour to evaluate the conditions of various Avenue monitors, trigger escalations, and more. For a given job, let’s define our base context as a jobName, a monitorId and an organizationId. It’s impractical and tedious to independently resolve these values every time we want to emit a log message. A natural alternative is to pass around a context object with these values throughout our entire call stack. This is the approach advised by the Go team and adopted by Zap, Uber’s popular Go logger.
The results are traceable and consistent logs, but the implementation still feels a little clumsy because you are required to pass a context object throughout your entire call stack. Luckily OpenTelemetry provides us with a way to achieve the same end product without any boilerplate!
How OpenTelemetry Helps
OpenTelemetry is a "collection of tools, APIs, and SDKs [used to] instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.” It’s a Cloud Native Foundation project, available in many languages, and an altogether fantastic collection of software. We can use OpenTelemetry’s NodeJS Context Manager to register and propagate our context object down our call-stack without explicitly passing the object to every function! The magic behind this module (@opentelemetry/context-async-hooks) is Node’s concept of Asynchronous Local Storage, which allows asynchronous resources (such as Promises and functions with callbacks) to inherit contextual information from their “parent” resource without explicitly passing around a map to store this metadata. If you’ve used Thread Local Storage in other programming languages, this concept should feel familiar!
In practice, the developer experience is as follows:
First, we instantiate a context object at the top of our workflow engine’s call stack for a given job like so:
Next, we use OpenTelemetry to propagate the context throughout our call stack like so:
Finally, wherever we’d like to include this context in a log message, we simply need to fetch the context and instantiate a child logger with that context included, like so:
Thus we can instantiate context once at the top of our call stack, provide it to the OpenTelemetry context manager, and leverage it whenever we need to emit a contextually rich log message. Adding context to upstream functions is seamless, and similarly, updating existing log lines to leverage this context took minimal refactoring. If you’ve made it this far, hopefully this pattern can help you bring a bit more structure and consistency to your logs! If you’re interested in learning more about the various applications of async local storage, async hooks, and persistent execution context, this 2020 talk by Shanee Radzewsky is an awesome resource.
Likewise if you’re passionate about observability, building resilient software, and are looking for something new - Avenue is hiring across engineering, design, growth, and more! Check out our open positions or feel free to shoot me an email!
Happy logging! 🪵