Skip to main content
QuantLab Logo

Database Engineering · 2026

Database Migrations Without Downtime: A 2026 Guide

A schema change should never require a maintenance window. This is the practitioner's guide to shipping migrations against a live production database — expand-and-contract, online DDL, batched backfills, dual writes, and rollbacks that don't lose data.

Bill Beltz, Founder & Principal Engineer
By , Founder & Principal EngineerPublished 12 min read

Quick answer

Ship zero-downtime migrations with the expand-and-contract pattern: add new schema in a backward-compatible way, deploy code that writes to both shapes, backfill historical rows in small batches, switch reads to the new shape, and only then drop the old one. Keep every step independently deployable and reversible, build indexes with CREATE INDEX CONCURRENTLY, and never couple a breaking schema change to the deploy that consumes it.

Most outages attributed to "the database" are not capacity problems — they are a single migration that took a table-level lock during peak traffic. The good news is that nearly every schema change can be expressed as a sequence of small, safe, backward-compatible steps. We build and operate production systems for a living, and our custom software practice treats "no maintenance window" as a baseline requirement, not a luxury. If you are still deciding on an engine, our Postgres vs MySQL comparison covers the operational tradeoffs that make online DDL easier or harder.

1. The expand-and-contract pattern

A breaking change — renaming a column, splitting a table, tightening a constraint — becomes safe when you decompose it into an expand phase, a transition phase, and a contract phase. The rule that makes it work: at every deploy boundary, both the previous and the next version of your application must run correctly against the schema that exists at that moment.

  • Expand. Add the new column, table, or index. This is purely additive and breaks nothing.
  • Transition. Deploy code that writes to both old and new, backfill historical data, then move reads to the new shape.
  • Contract. Once nothing reads or writes the old structure, drop it in a final migration.

A "rename" that would normally lock a table and break in-flight requests turns into four boring, reversible deploys. That is the whole trick.

2. Online DDL: the statements that lock

The danger is not the migration concept — it is specific DDL that takes an ACCESS EXCLUSIVE lock or rewrites the table. Adding an index without CONCURRENTLY blocks writes for the entire build. Knowing which statements are safe is most of the battle.

-- DANGEROUS — locks the table while the index builds
CREATE INDEX idx_orders_user ON orders (user_id);

-- SAFE — builds the index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_user ON orders (user_id);

-- DANGEROUS on older Postgres — rewrites every row
ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'open';

-- SAFE — split it: add nullable, set default, backfill, then enforce
ALTER TABLE orders ADD COLUMN status text;          -- instant
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'open';  -- instant
-- backfill in batches (see below), then:
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
  • Always build indexes with CONCURRENTLY on populated tables; it is slower but never blocks writes.
  • Add columns as nullable first. Setting a default is metadata-only on modern Postgres, but a backfill of existing rows is not.
  • Adding a foreign key or check constraint NOT VALID first, then VALIDATE CONSTRAINT, avoids a full-table lock.
  • Set a short lock_timeout on migration sessions so a blocked DDL fails fast instead of queueing behind production traffic.

3. Backfilling without bloat

A single UPDATE across millions of rows is the second most common self-inflicted outage. It holds row locks, balloons the write-ahead log, and can starve vacuum. Backfill in bounded, resumable batches instead.

// Batched, resumable backfill — a few thousand rows at a time
let lastId = 0;
const BATCH = 2000;
for (;;) {
  const rows = await db.query(
    `UPDATE orders SET status = 'open'
       WHERE id > $1 AND status IS NULL
       ORDER BY id
       LIMIT $2
     RETURNING id`,
    [lastId, BATCH],
  );
  if (rows.length === 0) break;
  lastId = rows[rows.length - 1].id;
  await sleep(50); // leave headroom for live traffic
}
  • Key batches on the primary key and commit between them — short transactions release locks and let vacuum keep up.
  • Make the backfill idempotent (the status IS NULL guard) so it can crash and resume without double-applying.
  • Throttle. A backfill that finishes in twenty minutes with zero user impact beats one that finishes in two and spikes latency.

4. Dual writes during the transition

While historical rows backfill, new rows keep arriving. The transition phase keeps both shapes correct by writing to each and reading from the old one until the new column is fully populated and verified.

  • Deploy code that writes the new column on every insert/update while still reading the old one. Now no new row is ever missing data.
  • Run the backfill to fill the gap left by old rows. When it completes, every row has the new value.
  • Flip reads to the new column behind a flag, watch your metrics, then stop writing the old one.
  • Prefer a database trigger or application-layer dual write over a long-lived ETL job — fewer moving parts to fail mid-migration.

This is the same discipline that keeps a sharded or partitioned system consistent. For the broader scaling picture, see our guide to scaling a SaaS database.

Mid-post: decouple the migration from the deploy

The single highest-leverage habit is running the schema change as its own step that lands before the code depending on it. Need a second set of eyes on a risky migration plan? Book a free scoping call.

Safe vs dangerous operations at a glance

OperationSafe approach
Add indexCREATE INDEX CONCURRENTLY
Add columnNullable first; set default + backfill separately
Add constraintAdd NOT VALID, then VALIDATE CONSTRAINT
Rename columnNew column + dual write + backfill + drop old
Change typeNew column of new type, migrate, swap, drop old
Drop columnStop using it in code first; drop in a later deploy

Operational practices that hold over time

Tooling discipline keeps migrations boring as a team grows:

  • Lint migrations in CI. A guard that rejects a non-concurrent index or a blocking type change stops the outage before review.
  • Test on production-shaped data. A migration that is instant on an empty staging table can lock for minutes on a hundred-million-row production table.
  • Observe the rollout. Watch lock waits, replication lag, and error rates during and after each step — covered in our observability for startups guide.

If your migrations run from a background worker or a one-off job runner, the same batching and idempotency rules apply — see background jobs and queues in production.

Frequently asked questions

How do you run a database migration without downtime?

Use the expand-and-contract pattern. First expand the schema in a fully backward-compatible way — add the new column or table while leaving the old one in place. Deploy code that writes to both and reads from the old shape. Backfill historical rows in small batches. Switch reads to the new shape. Only after the old shape is provably unused do you contract by dropping it. Every step is independently deployable and reversible, so no single deploy requires a maintenance window.

What is the expand-and-contract migration pattern?

Also called parallel-change, it splits a breaking schema change into a sequence of non-breaking ones. The expand phase adds new structures alongside the old. A transition phase runs code compatible with both shapes while data is backfilled and reads move over. The contract phase removes the old structures once nothing depends on them. Because old and new code coexist at every deploy boundary, you can roll a release backward or forward without corrupting data.

Why are some Postgres migrations dangerous on large tables?

Certain DDL statements take an ACCESS EXCLUSIVE lock or rewrite the whole table, which blocks reads and writes for the duration. Adding a NOT NULL column with a non-constant default on older Postgres, changing a column type, or creating an index without CONCURRENTLY can lock a large table for minutes. The fix is to add nullable columns, set defaults separately, backfill in batches, and build indexes with CREATE INDEX CONCURRENTLY so the table stays writable.

How do you backfill a column on a huge table safely?

Never run a single UPDATE across millions of rows — it holds locks and bloats the transaction log. Instead, backfill in bounded batches (a few thousand rows each) keyed on the primary key, commit between batches, and sleep briefly to leave headroom for production traffic. Make the operation idempotent and resumable so a crash mid-backfill can restart from the last processed id without double-applying.

Should migrations and application deploys be coupled?

Decouple them. Run the schema migration as its own step that lands before the code that depends on it, and design each migration so the currently running application version keeps working against the new schema. Coupling a breaking schema change to the same deploy that consumes it creates a window where old pods see a schema they cannot handle. The expand-and-contract sequence exists precisely to keep migration and deploy independent.

How do you roll back a database migration?

By making forward steps reversible rather than relying on a down-migration that drops data. Additive expand steps are trivially reversible — you simply stop using the new column. Destructive contract steps should run only once the new path has been verified in production for long enough to be confident. For data changes, prefer a feature flag that flips reads back to the old shape over a schema rollback that loses writes made in the interim.

Ship the schema change. Skip the maintenance window.

We design and run zero-downtime migrations on live production databases. Book a free scoping call and we'll map your change to a safe, reversible sequence.

All blog postsUpdated June 3, 2026