What is distributed tracing in one sentence?

Distributed tracing follows a single request as it moves through every service, queue, and database in a system, recording how long each step took and whether it succeeded, so you can see exactly where time is being spent and where errors occur.

What is a span?

A span is a single unit of work within a trace: one service call, one database query, one function. Each span records a start time, duration, status, and attributes. Spans nest into a tree that forms the complete trace of a request.

What is trace context propagation?

It is how a trace ID is passed from one service to the next, usually via HTTP headers following the W3C Trace Context standard. Without propagation, each service would create disconnected traces instead of one end-to-end picture.

What is the difference between tracing and logging?

Logs are independent records of events within a single service. Traces stitch events across services into the story of one request. A log tells you what happened in one place; a trace tells you the whole journey and where the time went.

What is tail-based sampling?

Tracing every request is expensive, so systems sample. Tail-based sampling waits until a trace is complete, then keeps it if it is interesting — slow or errored — and discards routine ones. This keeps the valuable traces without storing everything.


What is Distributed Tracing?

Distributed tracing is the technique of following a single request as it travels through every service, queue, and database that handles it, recording how long each hop took and whether it succeeded — turning "the app feels slow" into a precise picture of exactly which step, in which service, is eating the time.

The problem it solves

In a single monolithic application, a slow request is relatively easy to diagnose — attach a profiler, read the logs, find the slow function. In a distributed system, one user action might fan out across a dozen microservices, several databases, a message queue, and a few third-party APIs. When that action is slow, logs from each service tell you what happened locally but not how the pieces connect. Distributed tracing exists to reconstruct that connection: it is the part of observability that answers "where did the time actually go?"

Traces and spans

A trace represents the full journey of one request. It is made of spans — each span is a single unit of work, like an HTTP call to one service, a database query, or a meaningful function. Every span records a start time, a duration, a status (success or error), and attributes such as the endpoint or the SQL statement. Spans nest: the API gateway's span is the parent, the service it calls is a child, the database query inside that service is a grandchild. Lay the spans out on a timeline and you get the familiar waterfall view where a long bar instantly reveals the bottleneck.
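To make the parent, child, and grandchild structure concrete, here is a minimal sketch of a span tree rendered as a text waterfall. The Span class, its field names, and the example request are all hypothetical and chosen for illustration; real tracing SDKs use their own data models.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work. Field names are illustrative, not a real SDK's API."""
    name: str
    start_ms: int                 # offset from the start of the trace
    duration_ms: int
    status: str = "ok"            # "ok" or "error"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, *args, **kwargs) -> "Span":
        """Create a span nested under this one and return it."""
        span = Span(*args, **kwargs)
        self.children.append(span)
        return span


def walk(span, depth=0):
    """Yield (depth, span) pairs in parent-before-child order."""
    yield depth, span
    for c in span.children:
        yield from walk(c, depth + 1)


def render_waterfall(root, ms_per_char=10):
    """Return one text line per span: indented name, then a bar on a timeline."""
    lines = []
    for depth, span in walk(root):
        label = ("  " * depth + span.name).ljust(26)
        offset = " " * (span.start_ms // ms_per_char)
        bar = "#" * max(1, span.duration_ms // ms_per_char)
        lines.append(f"{label}{offset}{bar} {span.duration_ms}ms")
    return lines


# Hypothetical request: gateway -> cart service -> database query.
gateway = Span("api-gateway GET /cart", 0, 240)
service = gateway.child("cart-service", 10, 220)
service.child("db query", 20, 180, attributes={"db.system": "postgresql"})

print("\n".join(render_waterfall(gateway)))
```

In this made-up trace the database query's bar covers most of its parent's bar, which is the visual cue that the query, not the service code around it, accounts for the latency.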

Trace context propagation

The mechanism that ties spans across services together is context propagation. When the first service receives a request, it generates a unique trace ID. As it calls downstream services, it passes that ID along, together with the ID of the calling span so the receiver knows which parent its own spans belong under. This context typically travels in HTTP headers following the W3C Trace Context standard, which has largely replaced earlier vendor-specific formats. Each service attaches its spans to the same trace ID. Without propagation you would get a pile of disconnected single-service traces; with it you get one coherent end-to-end story, even across language and team boundaries.
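As a rough illustration of propagation, the sketch below parses and emits the W3C traceparent header, whose version 00 layout is version, trace ID, parent span ID, and flags, joined by hyphens. The function names and the plain lowercase-keyed header dictionary are assumptions made for this example, and only version 00 is handled; in practice a tracing library does this for you.

```python
import re
import secrets

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return the propagated context as a dict, or None if the header is unusable."""
    m = TRACEPARENT_RE.fullmatch(value.strip())
    if not m:
        return None
    trace_id, parent_id, flags = m.groups()
    # All-zero IDs are defined as invalid by the standard.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def handle_incoming(headers):
    """Continue the caller's trace if possible, otherwise start a new one.

    Assumes `headers` is a dict with lowercase keys (a simplification).
    Returns (this service's span, headers to send on downstream calls).
    """
    ctx = parse_traceparent(headers.get("traceparent", ""))
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    parent_id = ctx["parent_id"] if ctx else None
    sampled = ctx["sampled"] if ctx else True
    span_id = secrets.token_hex(8)  # new ID for the work done in this service

    span = {"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
    flags = "01" if sampled else "00"
    outgoing = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    return span, outgoing


# Example header value taken from the W3C Trace Context specification.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
span, outgoing = handle_incoming(incoming)
print(span)
print(outgoing)
```

The trace ID is copied through unchanged while the span ID is replaced at every hop, so the downstream service sees this service's span as its parent.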

Sampling — you cannot trace everything

Recording a full trace for every request in a high-traffic system would generate an unaffordable volume of data, so tracing systems sample. Head-based sampling decides at the start of a request whether to record it — simple, but it might discard the one slow request you needed. Tail-based sampling buffers spans until the trace finishes, then keeps it only if it is interesting — slow, errored, or otherwise notable — and drops the routine majority. Tail-based sampling costs more to run but is far better at retaining exactly the traces an engineer will want during an incident.
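The difference between the two strategies can be sketched in a few lines. Everything here is a simplified assumption for illustration: the head_sample function, the TailSampler class, the 500 ms threshold, and especially the idea that the root span arriving marks the trace as complete. Real collectors cannot know when a trace is finished and generally wait for a timeout instead.

```python
from collections import defaultdict


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide from the trace ID alone, before anything has run.

    Deriving the decision from the ID means every service reaches the same
    answer without coordination.
    """
    bucket = int(trace_id[-8:], 16) / 0x100000000  # value in [0, 1)
    return bucket < rate


class TailSampler:
    """Tail-based: buffer spans, then decide once the whole trace is known."""

    def __init__(self, slow_ms=500):
        self.slow_ms = slow_ms
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far
        self.kept = {}                   # trace_id -> spans we chose to store

    def add_span(self, trace_id, name, duration_ms, error=False, is_root=False):
        """Buffer a span. Returns the keep/drop decision when the root arrives."""
        self.buffer[trace_id].append(
            {"name": name, "duration_ms": duration_ms, "error": error}
        )
        if not is_root:
            return None
        # Assumption: the root span finishes last, so the trace is complete.
        spans = self.buffer.pop(trace_id)
        interesting = duration_ms >= self.slow_ms or any(s["error"] for s in spans)
        if interesting:
            self.kept[trace_id] = spans
        return interesting


sampler = TailSampler(slow_ms=500)

sampler.add_span("t1", "db query", 20)
print(sampler.add_span("t1", "GET /cart", 80, is_root=True))      # fast, no error

sampler.add_span("t2", "db query", 900)
print(sampler.add_span("t2", "GET /cart", 950, is_root=True))     # slow

sampler.add_span("t3", "charge card", 30, error=True)
print(sampler.add_span("t3", "POST /checkout", 60, is_root=True)) # errored

print(sorted(sampler.kept))
```

The buffer is where the extra cost of tail-based sampling comes from: every span of every in-flight trace has to be held in memory until the decision is made, whereas the head-based check needs no state at all.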

The tooling landscape

The lineage runs from Google's internal Dapper system through open-source projects like Zipkin and Jaeger. Today most teams instrument with OpenTelemetry, the vendor-neutral standard, and send traces to a backend of their choice — Jaeger or Grafana Tempo on the open-source side, or commercial platforms like Honeycomb, Datadog, and New Relic. Because OpenTelemetry decouples instrumentation from the backend, you can change where traces are stored and visualized without touching the code that produces them.

At QUANT LAB

When we build multi-service systems under SaaS platform development or operate them under DevOps engineering, we wire in distributed tracing with OpenTelemetry so the team can follow a real user request end to end. It changes incident response from arguing about which service is at fault to opening the trace and reading the answer. Tracing also exposes the hidden cost of chatty service-to-service calls — the kind of pattern that a thoughtful caching layer or an API redesign can eliminate once you can finally see it.
