Backend Engineering · 2026

Background Jobs & Queues in Production: A 2026 Guide

Almost every real app needs to do work outside the request — send mail, process files, call third-party APIs. Doing it reliably is harder than it looks. This guide covers idempotency, retries, dead-letter queues, concurrency, and graceful shutdown for production workers.

By Bill Beltz, Founder & Principal Engineer · 12 min read

Quick answer

Move slow, external, or flaky work out of the request and into background jobs. Make every job idempotent because the queue will retry it, use exponential backoff with jitter on transient failures, and send exhausted jobs to a dead-letter queue you monitor. Bound concurrency so workers do not overwhelm downstreams, handle termination signals with graceful shutdown, and start with a database-backed queue before reaching for a dedicated broker you do not yet need.

A background job system looks trivial in a demo and reveals its complexity the first time a worker crashes mid-job in production. The patterns that matter are about failure, not the happy path. We build and operate these systems for a living, and our custom software practice treats reliability as the default. If your jobs react to domain events rather than direct enqueues, pair this with event-driven architecture for SaaS.

1. What belongs in a job

The request path should do the minimum to be correct and return fast. Everything slow or unreliable moves to a worker.

  • Sending email, SMS, and push notifications — never block a response on an email provider's latency.
  • Calling third-party APIs you do not control, where a timeout would otherwise become your timeout.
  • Report generation, PDF rendering, image and video processing, and bulk imports or exports.
  • Fan-out work: one user action that triggers many downstream updates.

The pattern in the request is always the same: validate, persist, enqueue, return. The job does the heavy lifting after the user already has their answer.
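
To make the four steps concrete, here is a minimal sketch of a request handler. The `db` and `queue` objects and their methods (`insertInvoice`, `enqueue`) are hypothetical interfaces invented for illustration, as are the `send-invoice-email` job name and the response shape; substitute whatever your framework and queue library provide.

```javascript
// Sketch of "validate, persist, enqueue, return".
// `db` and `queue` are hypothetical interfaces, passed in so the handler is easy to test.
async function createInvoice(input, { db, queue }) {
  // 1. Validate: reject bad input before touching storage.
  if (!input || typeof input.customerId !== "string" || !(input.amountCents > 0)) {
    return { status: 400, body: { error: "customerId and a positive amountCents are required" } };
  }

  // 2. Persist: the invoice row is the source of truth.
  const invoice = await db.insertInvoice({
    customerId: input.customerId,
    amountCents: input.amountCents,
  });

  // 3. Enqueue: only the ID goes on the queue; the worker re-fetches the row.
  await queue.enqueue("send-invoice-email", { invoiceId: invoice.id });

  // 4. Return: the user has an answer before any email is sent.
  return { status: 202, body: { invoiceId: invoice.id } };
}
```

Steps 2 and 3 are two separate writes in this sketch. With a database-backed queue they can share one transaction, so an invoice is never saved without its job.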

2. Idempotency: the non-negotiable

Every reliable queue retries, and a worker can crash after completing its side effect but before acknowledging the message. So every job will occasionally run twice. Idempotency makes that harmless.

// Idempotency key guards the side effect, not just the job
async function sendInvoiceEmail(job) {
  const key = `invoice-email:${job.invoiceId}`;
  const claimed = await db.idempotency.tryInsert(key); // unique row
  if (!claimed) return; // already claimed on a prior attempt

  try {
    await email.send(renderInvoice(job.invoiceId));
  } catch (err) {
    // Release the claim so the retry can send; `remove` is a hypothetical helper.
    await db.idempotency.remove(key);
    throw err;
  }
  // If we crash here, the key exists, so a retry safely no-ops.
}

  • Derive a stable idempotency key from the job's inputs, not a random value generated at run time.
  • Prefer naturally idempotent operations — an upsert, a set-status-to-X — over check-then-act where you can.
  • Pass idempotency keys through to third-party APIs that support them (Stripe, for example) so the provider dedupes too.
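
One way to derive a stable key from a job's inputs is sketched below. The `idempotencyKey` helper and its key format are assumptions for illustration, not a standard, and the sketch assumes flat inputs (nested objects would need their own canonical ordering).

```javascript
// Build an idempotency key from the job type plus its inputs.
// Input names are sorted so { a, b } and { b, a } produce the same key.
// Assumes flat inputs: nested objects are serialized in whatever order they arrive.
function idempotencyKey(jobType, inputs) {
  const parts = Object.keys(inputs)
    .sort()
    .map((name) => `${name}=${JSON.stringify(inputs[name])}`);
  return `${jobType}:${parts.join("&")}`;
}

// The same inputs always yield the same key, on every attempt and every worker.
const example = idempotencyKey("invoice-email", { invoiceId: 42 });
// example === "invoice-email:invoiceId=42"
```

Because nothing in the key depends on the clock or a random value, a retried job computes exactly the key its first attempt claimed.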

3. Retries, backoff, and dead-letter queues

Transient failures are normal — a momentary timeout, a rate-limited dependency, a brief network blip. Retry them, but intelligently.

// Exponential backoff with full jitter; the delay is capped at 30s.
// `attempt` is zero-based: 0 for the first retry.
function nextDelayMs(attempt) {
  const base = Math.min(30_000, 1000 * 2 ** attempt); // 1s, 2s, 4s, ... capped at 30s
  return Math.floor(Math.random() * base);            // full jitter: uniform in [0, base)
}

// On failure: if attempt < MAX and error is retryable, requeue
// with nextDelayMs(attempt); otherwise route to the dead-letter queue.

  • Use exponential backoff with jitter so retries from many workers do not synchronize into a thundering herd on the dependency.
  • Separate retryable errors (timeout, 503) from permanent ones (400, validation) — never retry a request that cannot succeed.
  • Cap attempts and route exhausted jobs to a dead-letter queue so one poison message cannot block the stream or loop forever.
  • Alert on DLQ depth. A dead-letter queue nobody watches is just a silent failure with extra steps.
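
The retry-or-dead-letter decision described above could look roughly like this. It is a sketch under stated assumptions: `MAX_ATTEMPTS`, `isRetryable`, `onFailure`, and the convention that errors carry an HTTP-style `status` or a Node-style `code` are all invented for illustration. The backoff helper is repeated so the block runs on its own.

```javascript
const MAX_ATTEMPTS = 5; // assumed retry budget; tune per queue

// Assumed convention: errors carry an HTTP-style `status` or a Node-style `code`.
function isRetryable(err) {
  if (err.code === "ETIMEDOUT" || err.code === "ECONNRESET") return true;
  if (err.status === 429) return true; // rate limited: worth retrying later
  return err.status >= 500 && err.status <= 599; // server-side failure
}

// Exponential backoff with full jitter, delay capped at 30s.
function nextDelayMs(attempt) {
  const base = Math.min(30_000, 1000 * 2 ** attempt);
  return Math.floor(Math.random() * base);
}

// Decide what happens to a failed job.
// `job.attempt` is zero-based: 0 means this was the first try.
function onFailure(job, err) {
  const attemptsMade = job.attempt + 1;
  if (isRetryable(err) && attemptsMade < MAX_ATTEMPTS) {
    return { action: "retry", attempt: attemptsMade, delayMs: nextDelayMs(job.attempt) };
  }
  return {
    action: "dead-letter",
    reason: isRetryable(err) ? "attempts exhausted" : "permanent error",
  };
}
```

Returning a decision instead of calling the queue directly keeps the policy easy to unit test; the worker applies the result with whatever requeue and dead-letter calls its queue library offers.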

4. Concurrency, scheduling, and shutdown

Once jobs run reliably, the operational concerns are how many run at once, when recurring ones fire, and what happens on deploy.

  • Bound concurrency. Limit workers per queue and per downstream so a backlog burst does not overwhelm your database or a third-party API. The same reasoning applies as in API rate limiting.
  • Scheduling. For recurring work, ensure only one instance fires each tick — a distributed lock or a single scheduler — so three app servers do not run the nightly job three times.
  • Graceful shutdown. On a termination signal, stop pulling new jobs, finish or checkpoint the one in flight, ack or release it, then exit — within a timeout under the platform's kill deadline.
  • Visibility timeout. Set it longer than your slowest job so the queue does not redeliver work that is still running.
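
Bounded concurrency inside a single worker process can be sketched with a small limiter. `createLimiter` is a hypothetical helper written for this example, not a library API, and it only bounds one process; a per-downstream limit across many workers needs shared state.

```javascript
// Minimal concurrency limiter: at most `limit` tasks run at once,
// and the rest wait their turn in FIFO order.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  function release() {
    const next = waiting.shift();
    if (next) next(); // hand the slot straight to the next waiter
    else active--;    // nobody waiting: free the slot
  }

  return async function run(task) {
    if (active < limit) {
      active++;
    } else {
      // The slot arrives when a finishing task calls release().
      await new Promise((resolve) => waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      release();
    }
  };
}
```

A worker might create one limiter per dependency, for example `const limitEmail = createLimiter(5)`, and wrap every call to that dependency in `limitEmail(() => ...)`. Handing the slot directly to the next waiter, rather than decrementing and re-incrementing, keeps a newly arriving task from slipping past the limit.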

Start with the queue you already operate

A Postgres-backed queue with SKIP LOCKED handles more load than most teams expect — and avoids new infrastructure you would have to run. Want help right-sizing your job system? Book a free scoping call.

Failure modes and their defenses

Failure mode | Defense
Duplicate run | Idempotency key on the side effect
Transient error | Retry with exponential backoff + jitter
Poison message | Capped attempts → dead-letter queue + alert
Overloaded downstream | Bounded concurrency and per-dependency limits
Deploy mid-job | Graceful shutdown + adequate visibility timeout
Duplicate schedule | Distributed lock or single scheduler per tick

Operational practices that hold over time

A queue is a system you operate, not a fire-and-forget library:

  • Watch queue depth and age. A growing backlog or rising oldest-message age is your earliest signal that capacity is short — see observability for startups.
  • Make jobs replayable. When you fix a bug, you want to drain the DLQ back through the corrected worker, not lose the work.
  • Keep payloads small. Enqueue an ID and re-fetch inside the job; a fat payload goes stale and bloats the queue.
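
Depth and oldest-message age are cheap to compute. The sketch below assumes a job shape of `{ status, enqueuedAtMs }` and a 60-second alert threshold; both are illustrative choices, not recommendations.

```javascript
// Compute backlog depth and oldest-message age from a list of jobs.
// The job shape ({ status, enqueuedAtMs }) is assumed for illustration.
function queueStats(jobs, nowMs = Date.now()) {
  const pending = jobs.filter((job) => job.status === "queued");
  if (pending.length === 0) return { depth: 0, oldestAgeMs: 0 };
  const oldest = pending.reduce(
    (min, job) => Math.min(min, job.enqueuedAtMs),
    Infinity
  );
  return { depth: pending.length, oldestAgeMs: nowMs - oldest };
}

// Flag a backlog when the oldest job has waited too long.
// The default threshold is an assumption to tune per queue.
function isBacklogged(stats, maxAgeMs = 60_000) {
  return stats.oldestAgeMs > maxAgeMs;
}
```

Age is often the more useful of the two signals: a deep queue of fast jobs can be healthy, while a single job that has waited minutes usually means workers are stuck or too few.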

If a job performs long-running database changes, the batching and idempotency rules from our zero-downtime migrations guide apply directly.

Frequently asked questions

What work belongs in a background job versus the request?

Move anything slow, external, or failure-prone out of the request path: sending email, generating reports or PDFs, calling third-party APIs, image and video processing, and bulk database operations. The request should do the minimum to be correct — validate, persist, enqueue — and return quickly. Keeping a slow or flaky operation inline ties your response time and reliability to a dependency you do not control, and blocks a web worker that could be serving other users.

Why must background jobs be idempotent?

Because every reliable queue retries, and retries mean a job can run more than once. A worker can crash after doing its work but before acknowledging the message, so the queue redelivers it. An idempotent job produces the same result whether it runs once or three times — by checking whether the work is already done, using a unique key on the side effect, or making the operation a safe upsert. Without idempotency, retries send duplicate emails and double-charge customers.

How should job retries and backoff work?

Retry transient failures automatically with exponential backoff and jitter so a flaky dependency is not hammered by synchronized retries. Cap the number of attempts, then move the job to a dead-letter queue rather than retrying forever. Distinguish retryable errors (a timeout, a 503) from permanent ones (a validation failure, a 400) — retrying a permanent error wastes capacity and delays the inevitable. Always pair retries with idempotency, or you multiply side effects.

What is a dead-letter queue?

A dead-letter queue (DLQ) is where messages go after they exhaust their retry budget or are rejected as un-processable. It prevents a single poison message — one that fails every time — from blocking the queue or looping forever. The DLQ turns a silent, repeating failure into a visible backlog you can inspect, fix, and replay. A DLQ with no alerting is just a place jobs go to die quietly, so always monitor its depth.
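
Replaying a DLQ after a fix might look like the sketch below. The `dlq` and `queue` objects and their `pull`, `ack`, and `enqueue` methods are hypothetical interfaces, and `limit` is an assumed safety cap so a replay cannot run unbounded.

```javascript
// Drain dead-lettered messages back onto the main queue after a fix ships.
// `dlq.pull()` is assumed to resolve to a message, or null when the DLQ is empty.
async function replayDeadLetters({ dlq, queue, limit = 100 }) {
  let replayed = 0;
  while (replayed < limit) {
    const dead = await dlq.pull();
    if (!dead) break;
    // Enqueue first, ack second: a crash in between duplicates the job
    // instead of losing it, and idempotency absorbs the duplicate.
    await queue.enqueue(dead.name, dead.payload);
    await dlq.ack(dead);
    replayed++;
  }
  return replayed;
}
```

Replaying in small batches and watching the DLQ depth between runs shows quickly whether the fix actually worked.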

How do you handle graceful shutdown of workers?

When a worker receives a termination signal during a deploy or scale-down, it should stop accepting new jobs, finish or safely checkpoint the job in flight, acknowledge or release it, and only then exit. Without graceful shutdown, an in-progress job is killed mid-execution, which either loses work or — if the queue redelivers — relies entirely on idempotency to avoid corruption. Set a shutdown timeout slightly under your platform's kill deadline so cleanup actually completes.
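
Here is a minimal sketch of that sequence for a single-job-at-a-time worker. `createWorker` and the `queue.pull`, `queue.ack`, and `queue.release` methods are hypothetical, and the 25-second timeout in the wiring comment assumes a platform that kills the process 30 seconds after the signal.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `queue.pull()` is assumed to resolve to a job, or null when the queue is empty.
function createWorker({ queue, handle, idleMs = 250 }) {
  let stopping = false;
  let loop = Promise.resolve();

  async function runLoop() {
    while (!stopping) {
      const job = await queue.pull();
      if (!job) {
        await sleep(idleMs); // nothing to do: wait before polling again
        continue;
      }
      try {
        await handle(job);
        await queue.ack(job);
      } catch (err) {
        await queue.release(job, err); // give the job back for a retry
      }
    }
  }

  return {
    start() {
      loop = runLoop();
    },
    // Stop pulling new jobs, then wait for the in-flight one, up to `timeoutMs`.
    async stop(timeoutMs) {
      stopping = true;
      let timer;
      const deadline = new Promise((resolve) => {
        timer = setTimeout(() => resolve("timeout"), timeoutMs);
      });
      const outcome = await Promise.race([loop.then(() => "drained"), deadline]);
      clearTimeout(timer);
      return outcome; // "drained" or "timeout"
    },
  };
}

// Wiring in a Node.js process, assuming a 30s kill deadline:
// process.once("SIGTERM", async () => {
//   const outcome = await worker.stop(25_000);
//   process.exit(outcome === "drained" ? 0 : 1);
// });
```

If `stop` returns "timeout", the in-flight job was not acknowledged, so the queue will redeliver it once its visibility timeout expires. That redelivery is exactly the case idempotency exists for.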

Should you use a database-backed queue or a dedicated broker?

For small to mid-sized SaaS, a database-backed queue (using SELECT ... FOR UPDATE SKIP LOCKED in Postgres) is often the right call — it reuses infrastructure you already operate, gives you transactional enqueue, and is simple to reason about. Move to a dedicated broker like Redis-backed queues, SQS, or Kafka when throughput, fan-out, or routing needs outgrow what a database table handles comfortably. Do not adopt heavy messaging infrastructure before the workload justifies the operational cost.
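
As a rough sketch of the database-backed approach, the query below claims one due job so that concurrent workers never pick the same row. The `jobs` table and its columns (`status`, `run_at`, `locked_at`, `attempts`, `payload`) are illustrative names, and `client` is assumed to expose `query(sql)` resolving to `{ rows }`, as node-postgres clients do.

```javascript
// Claim one due job atomically. Table and column names are illustrative.
// SKIP LOCKED makes a worker pass over rows another worker has already locked,
// so workers do not block each other or claim the same job.
const CLAIM_JOB_SQL = `
  UPDATE jobs
     SET status = 'running', locked_at = now(), attempts = attempts + 1
   WHERE id = (
     SELECT id
       FROM jobs
      WHERE status = 'queued' AND run_at <= now()
      ORDER BY run_at
      LIMIT 1
        FOR UPDATE SKIP LOCKED
   )
  RETURNING id, payload, attempts`;

// `client` is assumed to expose query(sql) -> { rows }.
async function claimNextJob(client) {
  const { rows } = await client.query(CLAIM_JOB_SQL);
  return rows[0] ?? null; // null means nothing is due right now
}
```

A job left in `running` by a crashed worker needs a separate sweep that requeues rows whose `locked_at` is older than the visibility timeout; that sweep is omitted here.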

Reliable work, even when things fail.

We build background job systems with idempotency, retries, and dead-letter handling designed in from the start. Book a free scoping call to talk through your workload.

Updated June 3, 2026