
What is Distributed Tracing?

Distributed tracing is the technique of following a single request as it travels through every service, queue, and database that handles it, recording how long each hop took and whether it succeeded. It turns "the app feels slow" into a precise picture of which step, in which service, is consuming the time.

The problem it solves

In a single monolithic application, a slow request is relatively easy to diagnose — attach a profiler, read the logs, find the slow function. In a distributed system, one user action might fan out across a dozen microservices, several databases, a message queue, and a few third-party APIs. When that action is slow, logs from each service tell you what happened locally but not how the pieces connect. Distributed tracing exists to reconstruct that connection: it is the part of observability that answers "where did the time actually go?"

Traces and spans

A trace represents the full journey of one request. It is made of spans: each span is a single unit of work, such as an HTTP call to one service, a database query, or a significant function call. Every span records a start time, a duration, a status (success or error), a reference to its parent span, and attributes such as the endpoint or the SQL statement. Spans nest: the API gateway's span is the parent, the service it calls is a child, and the database query inside that service is a grandchild. Lay the spans out on a timeline and you get the familiar waterfall view, where a long bar points straight at the likely bottleneck.
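To make the parent, child, and grandchild structure concrete, here is a minimal Python sketch of a trace held as a flat list of spans linked by parent IDs. The Span class, its field names, the helper functions, and the sample timings are all hypothetical and chosen for illustration; real tracing libraries use richer models. The self-time calculation assumes a span's children run one after another without overlapping.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span model, not any library's real type.
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: float
    duration_ms: float
    ok: bool = True
    attributes: dict = field(default_factory=dict)


def depth(span, by_id):
    """Count how many ancestors a span has (root = 0)."""
    d = 0
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        d += 1
    return d


def self_time(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes children do not overlap in time.
    """
    children = sum(c.duration_ms for c in spans if c.parent_id == span.span_id)
    return span.duration_ms - children


def render_waterfall(spans, ms_per_char=10):
    """Draw one text bar per span, offset by start time, indented by depth."""
    by_id = {s.span_id: s for s in spans}
    t0 = min(s.start_ms for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        label = "  " * depth(s, by_id) + s.name
        offset = " " * int((s.start_ms - t0) / ms_per_char)
        bar = "#" * max(1, int(s.duration_ms / ms_per_char))
        lines.append(f"{label:<24}{offset}{bar} {s.duration_ms:.0f} ms")
    return "\n".join(lines)


# Made-up example trace: gateway -> orders service -> database query.
trace = [
    Span("a", None, "api-gateway", 0, 120),
    Span("b", "a", "orders-service", 5, 110),
    Span("c", "b", "SELECT orders", 20, 90, attributes={"db.system": "postgresql"}),
]

bottleneck = max(trace, key=lambda s: self_time(s, trace))
print(render_waterfall(trace))
print("most self time:", bottleneck.name)
```

Ranking by self time rather than raw duration matters because a parent span always looks at least as long as its children; subtracting the children shows where the time was actually spent.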

Trace context propagation

What ties spans from different services together is context propagation. When the first service receives a request, it generates a unique trace ID. As it calls downstream services, it passes that ID along with the ID of the calling span, typically in HTTP headers following the W3C Trace Context standard (the traceparent header), which has largely replaced earlier vendor-specific formats. Each service attaches its spans to the same trace ID and records the caller's span as the parent. Without propagation you would get a pile of disconnected single-service traces; with it you get one coherent end-to-end story, even across language and team boundaries.
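As an illustration of propagation, the sketch below builds and parses the W3C Trace Context traceparent header, whose version 00 layout is version, 32-hex-digit trace ID, 16-hex-digit parent span ID, and 2-hex-digit flags, joined by hyphens. The function names are hypothetical, the headers are assumed to be a plain dict with lowercase keys, and the parser only handles the four-field version 00 form. In practice an instrumentation library such as OpenTelemetry does this for you.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh random trace ID and span ID."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_headers(incoming):
    """Headers for a downstream call: same trace ID, this service's span ID."""
    ctx = parse_traceparent(incoming.get("traceparent", ""))
    if ctx is None:
        # No usable context arrived, so this service starts the trace.
        return {"traceparent": new_traceparent()}
    my_span_id = secrets.token_hex(8)
    flags = "01" if ctx["sampled"] else "00"
    return {"traceparent": f"00-{ctx['trace_id']}-{my_span_id}-{flags}"}


# Example header value taken from the W3C Trace Context specification.
incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
downstream = outgoing_headers(incoming)
print(downstream["traceparent"])
```

The trace ID survives every hop unchanged while the span ID field is replaced at each service, which is what lets the backend stitch spans into one tree.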

Sampling — you cannot trace everything

Recording a full trace for every request in a high-traffic system would generate an unaffordable volume of data, so tracing systems sample. Head-based sampling decides at the start of a request whether to record it. That is simple and cheap, but because the decision is made before the outcome is known, it might discard the one slow request you needed. Tail-based sampling buffers spans until the trace finishes, then keeps the trace only if it is interesting (slow, errored, or otherwise notable) and drops the routine majority. Tail-based sampling costs more to run, since every span must be held until the trace completes, but it is far better at retaining the traces an engineer will want during an incident.
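The difference between the two strategies can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function and class names, the 500 ms threshold, and the way the head sampler maps a trace ID to a number. It is not the algorithm of any particular tracing backend.

```python
from collections import defaultdict


def head_sample(trace_id, rate):
    """Head-based: decide up front, using only the trace ID.

    Maps the last 8 hex digits of the trace ID to a value in [0, 1) so
    that every service makes the same keep/drop decision for a trace.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


class TailSampler:
    """Tail-based: buffer spans, decide once the trace is complete."""

    def __init__(self, slow_ms=500.0):
        self.slow_ms = slow_ms  # hypothetical "interesting" threshold
        self._buffer = defaultdict(list)

    def add_span(self, trace_id, duration_ms, ok):
        # Spans are held in memory until the trace finishes; this
        # buffering is the extra running cost of tail-based sampling.
        self._buffer[trace_id].append((duration_ms, ok))

    def finish(self, trace_id):
        """Return True if the finished trace should be kept."""
        spans = self._buffer.pop(trace_id, [])
        errored = any(not ok for _, ok in spans)
        slow = any(duration >= self.slow_ms for duration, _ in spans)
        return errored or slow


sampler = TailSampler(slow_ms=500.0)

sampler.add_span("routine", 40.0, True)
sampler.add_span("routine", 12.0, True)
keep_routine = sampler.finish("routine")   # fast and successful

sampler.add_span("slow", 40.0, True)
sampler.add_span("slow", 900.0, True)
keep_slow = sampler.finish("slow")         # one span over the threshold

sampler.add_span("failed", 15.0, False)
keep_failed = sampler.finish("failed")     # one span reported an error

print(keep_routine, keep_slow, keep_failed)
```

Note that head_sample never sees durations or errors, so a slow trace is kept or dropped purely by chance, while TailSampler only decides after that information exists.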

The tooling landscape

The lineage runs from Google's internal Dapper system through open-source projects like Zipkin and Jaeger. Today most teams instrument with OpenTelemetry, the vendor-neutral standard, and send traces to a backend of their choice — Jaeger or Grafana Tempo on the open-source side, or commercial platforms like Honeycomb, Datadog, and New Relic. Because OpenTelemetry decouples instrumentation from the backend, you can change where traces are stored and visualized without touching the code that produces them.

At QUANT LAB

When we build multi-service systems under SaaS platform development or operate them under DevOps engineering, we wire in distributed tracing with OpenTelemetry so the team can follow a real user request end to end. It changes incident response from arguing about which service is at fault to opening the trace and reading the answer. Tracing also exposes the hidden cost of chatty service-to-service calls — the kind of pattern that a thoughtful caching layer or an API redesign can eliminate once you can finally see it.
