Tracing is a critical component of observability: it tracks requests as they move through complex systems, revealing bottlenecks and errors and enabling faster resolution. In a previous post of our Go web services series, we explored observability's significance. Today, we focus on tracing. Jaeger collects, stores, and visualizes traces from distributed systems, providing crucial insights into request flows across services. By integrating Jaeger with OpenTelemetry, developers can unify their tracing approach, ensuring consistent and comprehensive visibility. This integration simplifies diagnosing performance issues and enhances system reliability. In this post, we'll set up Jaeger, integrate it with OpenTelemetry in our application, and explore visualizing traces for deeper insights.
What we are working towards is a Jaeger dashboard that looks like this:
As we visit various parts of the app (on the Onehub frontend), the traces for those requests are collected (from the point they hit the grpc-gateway), along with a summary of each. We can even drill down into one of the traces for a finer, more detailed view. Look at the first request (for creating/sending a message in a topic):
Here we see all the components the request touches along with their entry/exit times and time taken in and out of the methods. Very powerful indeed.
TL;DR: To see this in action and validate the rest of the blog:
With OpenTelemetry instrumentation, our system will evolve to:
As we noted earlier, it is quite onerous for each service to use separate clients to send signals to specific vendors. Instead, with an OTel collector running separately, all (interested) services can simply send metrics/logs/traces to this collector, which then exports them to various backends as required -- in this case, Jaeger for traces.
Let us get started.
The first step is to add the OTel collector to our Docker environment along with Jaeger so that both are accessible.
Note: We have split our original all-encompassing config into two parts:
The two docker-compose environments are connected by a shared Docker network through which services in these environments can communicate with each other. With this separation, we only need to restart a subset of services upon changes, speeding up our development.
Simple enough, this sets up two services in our Docker environment:
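A minimal sketch of what these two services could look like in docker-compose (the image tags, ports, and file paths here are illustrative, not necessarily the exact Onehub setup):

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    volumes:
      - ./configs/otel-collector.yaml:/etc/otel-collector.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
```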
A few things to note:
OTel also needs to be configured with specific receivers, processors, and exporters. We will do that in configs/otel-collector.yaml.
We need to tell the OTel collector which receivers are to be activated. This is specified in the receivers section of the config.
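A sketch of ours, roughly (the Postgres connection details below are illustrative placeholders, not real credentials):

```yaml
receivers:
  # Push-based: applications send traces/metrics/logs here via the OTel SDK.
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Pull-based: periodically scrape Postgres for metrics (not needed for tracing).
  postgresql:
    endpoint: postgres:5432
    username: "<dbuser>"
    password: "<dbpassword>"
```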
This activates an OTLP receiver on ports 4317 and 4318 (gRPC and HTTP, respectively). There are many kinds of receivers that can be started. As an example, we have also added a postgresql receiver that will actively scrape Postgres for metrics (though that is not relevant for this post). Receivers can be pull- or push-based. Pull-based receivers periodically scrape specific targets (like the postgresql receiver here), whereas push-based receivers listen for and "receive" metrics/logs/traces sent by applications using the OTel client SDK.
That's it. Now our collector is ready to receive (or scrape) the appropriate signals.
Processors in OTel are a way to transform, map, batch, filter, and/or enrich received signals before exporting them. For example, processors can sample metrics, filter logs, or even batch signals for efficiency. By default, no processors are added (making the collector a pass-through). We will ignore this for now.
Now it is time to identify where we want the signals to be exported to: the backends best suited for the respective signals. Just like receivers, exporters can be pull- or push-based. Push-based exporters emit signals outbound to another receiver operating in push mode. Pull-based exporters expose endpoints that can be scraped by other pull-based receivers (e.g., a Prometheus scrape endpoint). We will add an exporter of each kind: one for tracing and one for Prometheus to scrape from (though Prometheus is not the topic of this post):
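Along these lines (the endpoints and verbosity settings are assumptions for illustration, not necessarily the exact Onehub values):

```yaml
exporters:
  # Push-based: send traces over OTLP (gRPC) to Jaeger's collector endpoint.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Pull-based: expose an endpoint on port 9090 for Prometheus to scrape.
  prometheus:
    endpoint: 0.0.0.0:9090

  # Dump signals to the standard output/error streams (useful for debugging).
  debug:
    verbosity: basic
```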
Here we have an exporter that pushes traces regularly to Jaeger, which runs its own OTLP collector endpoint. We are also adding a "scraper" endpoint on port 9090 from which Prometheus will scrape regularly.
The "debug" exporter is simply used for dumping signals to the standard output/error streams.
The receiver, processor, and exporter sections simply define the modules that the collector can enable; they are not yet invoked. To actually invoke/activate them, they must be referenced in "pipelines". Pipelines define how signals flow through and are processed by the collector. Our pipeline definitions (in the service section) will clarify this:
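Roughly (again a sketch, reusing the receiver/exporter names assumed above):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, postgresql]
      exporters: [prometheus]
```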
Here we are defining two pipelines. Note how similar the pipelines are while allowing two different exporting modes (Jaeger and Prometheus). Now we are seeing the power of OTel and of creating pipelines within it.
Jaeger provides a dashboard for visualizing trace data for all our requests. We expose it in the browser by enabling the following in our Nginx config. Again, though not the topic of this post, we are also exposing the Prometheus UI via Nginx under its own HTTP path prefix.
The Jaeger UI is quite comprehensive and has several features you can explore. Navigate to it in the browser and you will see an interface for searching and analyzing traces. Go ahead and familiarize yourself with the main sections, including the search bar and the trace list. You can look up traces by various search criteria and filter by service, time duration, components, etc.
Analyze the trace timelines in the different requests to understand the sequence of operations. Each span represents a unit of work, showing start and end times, duration, and related metadata. This detailed view is very helpful in identifying performance bottlenecks and errors within the trace.
So far we have set up our system to collect and visualize signals. However, our services are still not emitting those signals to OTel. Here we will integrate the (Golang) client SDK in various parts of our code. The SDK documentation is a fantastic place to first familiarize yourself with some of the concepts.
The key concepts we will deal with are described below.
A resource is the entity that produces a signal. In our case, the scope of the resource is the binary hosting the services. Currently, we have a single resource for the entirety of the Onehub service, though this could be split up later on.
This is defined in cmd/backend/obs.go. Note that the client SDK does not require us to spell out the resource definition explicitly. The standard resource helper lets us create a resource definition by inferring the most useful parts (like process name, pod name, etc.) at runtime.
We only had to override one thing: the environment variable for the service in docker-compose.yml.
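As a rough sketch of what such a resource setup looks like with the SDK helpers (the function and attribute names below are illustrative, not the exact contents of obs.go):

```go
package obs

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
)

// newResource builds the Resource shared by all providers, letting the SDK
// detectors infer most of the details at runtime.
func newResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithFromEnv(), // picks up OTEL_SERVICE_NAME / OTEL_RESOURCE_ATTRIBUTES
		resource.WithProcess(), // process name, pid, runtime, etc.
		resource.WithHost(),    // host/pod name
		resource.WithAttributes(attribute.String("deployment.environment", "dev")),
	)
}
```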
Context propagation is a very important topic in observability. The pillars become far more powerful when we can correlate signals across them to identify issues in our system. Think of contexts as extra bits of data tied to the signals: data that can be "joined" in some unique way to relate the various signals to a particular group (say, a request).
For each of the signals, OTel provides a Provider interface (a TracerProvider for exporting spans/traces, a MeterProvider for exporting metrics, a LoggerProvider for exporting logs, and so on). For each of these interfaces there can be several implementations, e.g., a debug provider for sending to the stdout/stderr streams, an OTel provider for exporting to another OTel endpoint (in a chain), or even direct vendor-specific exporters. However, in our case, we want to keep the choice of vendors out of our services and instead send all signals to the OTel collector running in our environment.
This is a simple wrapper that keeps track of the common aspects needed by the OTel SDK. Here we have the providers (Logger, Tracer, and Metric) as well as ways to propagate context (for tracing). The over-arching Resource used by all providers is also specified here. The shutdown functions are interesting: they are called by the providers when the underlying exporter has terminated (gracefully or due to an exit). The wrapper itself takes a type parameter so specific instantiators of this Setup can attach their own custom data.
The repo contains two implementations of this:
We will instantiate the second one in our app. We will not go into the details of the specific implementations as they were taken from the examples in the SDK with minor fixes and refactoring. Specifically, take a look at the otel-collector example for inspiration.
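For orientation, here is a heavily simplified sketch of what such a setup does for the tracing side, assuming the collector's OTLP gRPC endpoint is reachable at otel-collector:4317 (names and structure are illustrative, not the repo's exact code):

```go
package obs

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// setupTracing wires a TracerProvider that exports spans to the OTel collector
// over OTLP/gRPC and returns a shutdown function to flush spans on exit.
func setupTracing(ctx context.Context, res *resource.Resource, collectorAddr string) (func(context.Context) error, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(collectorAddr), // e.g. "otel-collector:4317"
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)

	// W3C trace-context + baggage propagation so trace IDs survive service hops.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	return tp.Shutdown, nil
}
```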
The essence of enabling the collector in our services is that some kind of OTel-related "context" is started at all the "entry" points. If this context is created at the start, it is passed to every downstream target called from there and propagated onwards (as long as we do the right thing).
In our case, the entry point is where the gRPC Gateway receives an API request from Nginx (we could start tracing from the point the HTTP request hits Nginx, to also highlight latencies at Nginx itself, but we will postpone that for just a bit).
As an example, let us see how our gateway service leverages this channel.
Instead of starting the HTTP server (for the grpc-gateway) as:
Pay attention to lines 9-14, where the server shutdown is watched for in a separate goroutine, and line 15, where any error from the server's exit is sent back via the "notification" channel that was passed as an argument to this method.
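For readers following along without the repo open, a stripped-down sketch of this pattern looks roughly like the following (the names are illustrative, and line numbers will not match the listing referenced above):

```go
package gateway

import (
	"context"
	"net/http"
	"time"
)

// startGatewayServer starts the HTTP server for the grpc-gateway, shuts it down
// when ctx is cancelled, and reports any exit error on the notification channel.
func startGatewayServer(ctx context.Context, addr string, handler http.Handler, errCh chan<- error) {
	srv := &http.Server{Addr: addr, Handler: handler}

	// Watch for shutdown in a separate goroutine.
	go func() {
		<-ctx.Done()
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		_ = srv.Shutdown(shutdownCtx)
	}()

	// If the server exited with an error, send it back on the channel.
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		errCh <- err
	}
}
```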
The various parts of our services now have access to an "active" OTLP connection to use whenever signals are to be sent.
Above, the http.Server instance used to start the gRPC Gateway uses a custom handler from the OTel HTTP instrumentation package (otelhttp). This handler takes an existing http.Handler, decorates it with the OTel context, and ensures its propagation to any other downstream that is called.
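Wrapping the gateway mux is essentially a one-liner; something along these lines (the operation name and surrounding function are assumptions for illustration):

```go
package gateway

import (
	"net/http"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newGatewayHandler wraps the grpc-gateway mux so every incoming HTTP request
// starts (or continues) an OTel trace before the gateway handles it.
func newGatewayHandler() http.Handler {
	mux := runtime.NewServeMux()
	// "grpc-gateway" becomes the operation name on the server-side spans.
	return otelhttp.NewHandler(mux, "grpc-gateway")
}
```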
By default, the gRPC Gateway library creates a "plain" context when creating/managing connections to the underlying gRPC services. After all, the gateway does not know anything about OTel. In this mode, the connection (from the user/browser) to the gRPC Gateway and the connection from the gateway to the gRPC service will be treated as two different traces.
So it is important to take the responsibility of creating gRPC connections away from the Gateway and instead provide it a connection that is already OTel-aware. We will do that now.
Prior to OTel integration, we were registering a Gateway handler for our gRPCs with:
What we have done is first create a client (Line 6) that acts as a connection factory for our gRPC server. The client is pretty simple; the only OTel-specific piece is the instrumented gRPC option used when creating the connection. This ensures that the context of the current trace, which began on a new HTTP request, is propagated to the gRPC server via this handler.
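A sketch of creating such an OTel-aware connection (the function name is made up for illustration; the gRPC stats-handler approach shown is one common way to do this and may differ from the repo's exact code):

```go
package gateway

import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newOTelGRPCConn returns a client connection with the otelgrpc stats handler
// attached, so the HTTP-side trace context is injected into every outgoing RPC.
func newOTelGRPCConn(grpcAddr string) (*grpc.ClientConn, error) {
	return grpc.NewClient(grpcAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithStatsHandler(otelgrpc.NewClientHandler()),
	)
}
```

The returned connection can then be passed to the gateway's generated handler-registration functions instead of letting the gateway dial its own.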
That's it. Now we should start seeing the new request to the Gateway and the Gateway->gRPC request in a single consolidated trace instead of as two different traces.
We are almost there. So far:
Now we need to emit spans for all "interesting" places in our code. Take the topic-listing method, for example (in services/topics.go):
We call the database to fetch the topics and return them. The database access method (in datastore/topicds.go) is similar:
Here we are mainly interested in how much time is spent in each of these methods. We simply create spans in each of them and we are done. Our additions to the service and datastore methods (respectively) are:
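In sketch form (the type and method names below are simplified stand-ins for the actual Onehub signatures):

```go
package services

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Topic and the store below are simplified stand-ins for the Onehub types.
type Topic struct{ ID, Name string }

type TopicStore struct{}

// Datastore layer: a span around the DB access, so time spent there shows up
// as a child span of the service method's span.
func (s *TopicStore) GetTopics(ctx context.Context) ([]*Topic, error) {
	_, span := otel.Tracer("onehub/datastore").Start(ctx, "TopicStore.GetTopics")
	defer span.End()
	// ... run the actual query with ctx here ...
	return nil, nil
}

type TopicService struct{ Store *TopicStore }

// Service layer: wrap the incoming context in a span and pass the new context down.
func (s *TopicService) ListTopics(ctx context.Context) ([]*Topic, error) {
	ctx, span := otel.Tracer("onehub/services").Start(ctx, "TopicService.ListTopics")
	defer span.End()
	return s.Store.GetTopics(ctx) // the wrapped ctx carries the trace
}
```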
Here, the incoming context is "wrapped" and a new context is returned. We can (and should) pass this new wrapped context to further methods, and we do just that when we call the datastore method.
Ending a span (whenever the method returns) ensures that the right finish times, status codes, etc. are recorded. We can also add tags and statuses to the span if necessary, to carry more information that aids debugging.
That's it. You can see your beautiful traces in Jaeger and get more and more insights into the performance of your requests, end to end!
We covered a lot in this post and still barely scratched the surface of OTel and tracing. Instead of overloading this (already loaded) post, we will introduce newer concepts and finer details in future posts. For now, give this a go in your own services and try playing with the other exporters and receivers in the otel-contrib repo.