What is Distributed Tracing?
Distributed tracing is the technique of following a single request as it travels through every service, queue, and database that handles it, recording how long each hop took and whether it succeeded. It turns "the app feels slow" into a precise picture of which step, in which service, is consuming the time.
The problem it solves
In a single monolithic application, a slow request is relatively easy to diagnose — attach a profiler, read the logs, find the slow function. In a distributed system, one user action might fan out across a dozen microservices, several databases, a message queue, and a few third-party APIs. When that action is slow, logs from each service tell you what happened locally but not how the pieces connect. Distributed tracing exists to reconstruct that connection: it is the part of observability that answers "where did the time actually go?"
Traces and spans
A trace represents the full journey of one request. It is made of spans — each span is a single unit of work, like an HTTP call to one service, a database query, or a meaningful function. Every span records a start time, a duration, a status (success or error), and attributes such as the endpoint or the SQL statement. Spans nest: the API gateway's span is the parent, the service it calls is a child, the database query inside that service is a grandchild. Lay the spans out on a timeline and you get the familiar waterfall view where a long bar instantly reveals the bottleneck.
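To make the span model concrete, here is a minimal sketch in Python. The `Span` class, its field names, and the sample trace are illustrative assumptions, not the data model of any particular tracing library. It renders a rough text waterfall and picks the span with the most "self time" (its duration minus that of its direct children).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    start_ms: float                    # offset from the start of the trace
    duration_ms: float
    parent_id: Optional[str] = None    # None marks the root span
    status: str = "ok"                 # "ok" or "error"
    attributes: dict = field(default_factory=dict)


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another; overlapping (parallel)
    children would need interval merging instead of a plain sum.
    """
    child_total = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - child_total


def waterfall(spans, width=40):
    """Render each span as a text bar positioned on a shared timeline."""
    end = max(s.start_ms + s.duration_ms for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = int(s.start_ms / end * width)
        length = max(1, int(s.duration_ms / end * width))
        lines.append(f"{' ' * offset}{'#' * length} {s.name} ({s.duration_ms:.0f} ms)")
    return lines


# Hypothetical three-level trace: gateway -> service -> database query.
trace = [
    Span("a", "api-gateway GET /checkout", 0, 480),
    Span("b", "orders-service", 10, 450, parent_id="a"),
    Span("c", "db query", 20, 400, parent_id="b",
         attributes={"db.statement": "SELECT * FROM orders WHERE user_id = ?"}),
]

bottleneck = max(trace, key=lambda s: self_time_ms(s, trace))
print("\n".join(waterfall(trace)))
print("bottleneck:", bottleneck.name)
```

Ranking by self time rather than raw duration matters here: the gateway span is the longest bar, but almost all of its time is spent waiting on the database query two levels down.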
Trace context propagation
The mechanism that ties spans across services together is context propagation. When the first service receives a request, it generates a unique trace ID. As it calls downstream services, it passes that ID along with the ID of the calling span, typically in HTTP headers following the W3C Trace Context standard (the traceparent header), which has largely replaced earlier vendor-specific formats. Each service attaches its spans to the same trace ID and records the caller's span as their parent. Without propagation you would get a pile of disconnected single-service traces; with it you get one coherent end-to-end story, even across language and team boundaries.
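As a rough illustration of propagation, the sketch below builds and parses a W3C Trace Context `traceparent` header using only the Python standard library. The header layout (version, 32-hex-digit trace ID, 16-hex-digit parent span ID, 2-hex-digit flags) comes from the standard. The helper names `inject` and `extract` and the plain-dict headers are assumptions made for this example. It handles only version `00` and ignores the companion `tracestate` header.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_trace_id() -> str:
    return secrets.token_hex(16)   # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)    # 8 random bytes -> 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the caller's trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The standard forbids version "ff" and all-zero IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Service A starts the trace and calls service B.
trace_id, span_a = new_trace_id(), new_span_id()
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B continues the same trace; its new span is a child of span A.
incoming_trace_id, parent_span_id, sampled = extract(outgoing)
span_b = new_span_id()
```

A real service would normally leave this to its tracing library's propagator; the point is only that the link between services is a small, well-defined string that any language can read and write.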
Sampling — you cannot trace everything
Recording a full trace for every request in a high-traffic system would generate an unaffordable volume of data, so tracing systems sample. Head-based sampling decides at the start of a request whether to record it. That is simple and cheap, but the decision is made before anyone knows how the request will turn out, so it might discard the one slow request you needed. Tail-based sampling buffers spans until the trace finishes, then keeps the trace only if it is interesting (slow, errored, or otherwise notable) and drops the routine majority. Tail-based sampling costs more to run, because every span has to be held until the decision is made, but it is far better at retaining exactly the traces an engineer will want during an incident.
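The difference between the two strategies can be sketched in a few lines of Python. Everything here is a simplified assumption for illustration: the function and class names, the 500 ms threshold, spans represented as plain dicts, and the caller explicitly signalling that a trace is complete (production systems often approximate that with a wait window).

```python
from collections import defaultdict


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide up front, using nothing but the trace ID.

    Deriving the decision from the ID makes it deterministic, so every
    service that sees the same trace ID reaches the same answer.
    """
    bucket = int(trace_id[-8:], 16) / 0x100000000   # value in [0, 1)
    return bucket < rate


class TailSampler:
    """Tail-based: buffer spans, decide once the whole trace is known."""

    def __init__(self, slow_ms: float = 500.0):
        self.slow_ms = slow_ms
        self._buffer = defaultdict(list)   # trace_id -> list of span dicts

    def on_span(self, trace_id: str, span: dict) -> None:
        self._buffer[trace_id].append(span)

    def on_trace_complete(self, trace_id: str) -> list:
        """Return the spans to keep; an empty list means the trace is dropped."""
        spans = self._buffer.pop(trace_id, [])
        interesting = any(
            s["status"] == "error" or s["duration_ms"] >= self.slow_ms
            for s in spans
        )
        return spans if interesting else []


sampler = TailSampler(slow_ms=500.0)
sampler.on_span("t1", {"name": "GET /cart", "duration_ms": 40, "status": "ok"})
sampler.on_span("t2", {"name": "GET /checkout", "duration_ms": 900, "status": "ok"})
sampler.on_span("t2", {"name": "db query", "duration_ms": 850, "status": "ok"})

kept_fast = sampler.on_trace_complete("t1")   # routine trace, dropped
kept_slow = sampler.on_trace_complete("t2")   # slow trace, kept in full
```

The buffer in `TailSampler` is where the extra running cost shows up: memory grows with the number of in-flight traces, whereas `head_sample` keeps no state at all.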
The tooling landscape
The lineage runs from Google's internal Dapper system through open-source projects like Zipkin and Jaeger. Today most teams instrument with OpenTelemetry, the vendor-neutral standard, and send traces to a backend of their choice — Jaeger or Grafana Tempo on the open-source side, or commercial platforms like Honeycomb, Datadog, and New Relic. Because OpenTelemetry decouples instrumentation from the backend, you can change where traces are stored and visualized without touching the code that produces them.
At QUANT LAB
When we build multi-service systems under SaaS platform development or operate them under DevOps engineering, we wire in distributed tracing with OpenTelemetry so the team can follow a real user request end to end. It changes incident response from arguing about which service is at fault to opening the trace and reading the answer. Tracing also exposes the hidden cost of chatty service-to-service calls — the kind of pattern that a thoughtful caching layer or an API redesign can eliminate once you can finally see it.
Hunting latency across services?
We instrument distributed systems with end-to-end tracing so slow requests stop being a mystery. Book a 30-minute call.