Skip to main content
QuantLab Logo
Glossary · Software

What is a Data Warehouse?

A data warehouse is a database engineered for analytical workloads — long, wide queries across years of history — so the business can ask hard questions of its data without bringing the production application to its knees.

Where the idea came from

Bill Inmon coined "data warehouse" in the 1990s and defined the basic shape: a subject-oriented, integrated, time-variant, and non-volatile collection of data. Translated out of textbook English — a single place to put everything your business has ever recorded, organized by what the business cares about (customers, orders, sessions) rather than by which application emitted the data. Ralph Kimball added the dimensional modeling approach that made warehouses queryable by non-engineers. The cloud era — Redshift in 2012, BigQuery in 2010, Snowflake in 2014 — made warehouses cheap enough that a Series A company could run one.

Why a separate database

The application database — the Postgres or MySQL instance running your product — is optimized for the kind of work an app does: write a row when a user signs up, read a row when they log in, update a balance when they pay. That workload is small, fast, and constant. The analytical workload is the opposite: read forty million rows, group by month and country, join four tables. Mixing the two on the same instance is how product engineers learn the phrase "the database is on fire" — a finance analyst running a quarterly report can lock tables, blow the page cache, and slow every customer login.

The warehouse is a separate copy of the data, refreshed on a schedule (often hourly or nightly), stored in columnar format so that analytical scans are 10 to 100 times faster, and sized for the analytical workload independently. The price is some staleness — the warehouse is rarely up-to-the-second — and a pipeline to keep it fresh.

The modern data stack around it

A warehouse never lives alone. A modern data stack typically has: extraction tools (Fivetran, Airbyte, custom scripts) that pull data out of Postgres, Stripe, HubSpot, Google Ads, and dozens of other sources; the warehouse itself (Snowflake, BigQuery, Databricks, or ClickHouse); a transformation layer (dbt is the standard) that turns raw landed tables into clean, modeled tables; and a visualization tool on top (Metabase, Looker, Hex, Tableau) where business users actually answer questions. The warehouse is the middle of that stack, not the whole of it.

When you actually need one

Three pretty reliable triggers. First, an analyst query takes more than a minute on the production database, or has caused a customer-facing slowdown. Second, the team starts asking questions that require joining the app, Stripe, the marketing platform, and the support tool — none of which share a database. Third, a board deck or investor update needs a report you cannot produce in twenty minutes because the data lives in seven places. Any one of those is enough; two of them is the sign you have already waited too long.

At QUANT LAB

Our cloud infrastructure practice builds warehouses for clients whose product database is no longer enough — usually BigQuery or Snowflake, with Fivetran or custom Python jobs for extraction and dbt for transformation. We also help trading and quant clients build research warehouses that store tick-level history and make backtests reproducible.

For early-stage SaaS companies we usually recommend starting with a single dbt project pointed at a managed warehouse rather than over-engineering a Kafka-based pipeline. Read our piece on multi-tenant SaaS with Postgres for how the application database is structured before the warehouse stage, and book a call if you want a one-hour review of your current analytics setup.

Outgrowing the app database?

We design analytics warehouses sized to your data and your team — not the over-engineered Kafka monster the consultant pitched.

Cloud infrastructure