Skip to main content
QuantLab Logo

AI Engineering · 2026

Building a RAG Pipeline: A 2026 Engineering Guide

Retrieval-augmented generation is how you make a language model answer from your data instead of guessing. This is the practitioner's guide to the pipeline that matters: chunking, embeddings, vector search, reranking, prompt assembly, and evaluation — with code and the failure modes that bite in production.

Bill Beltz, Founder & Principal Engineer
By , Founder & Principal EngineerPublished 14 min read

Quick answer

Build a RAG pipeline as two paths. Offline: chunk documents on semantic boundaries, attach metadata, embed each chunk, and store the vectors. Online: embed the query, retrieve the top candidates by vector similarity, rerank them with a cross-encoder, assemble a grounded prompt with citations, and generate the answer. The highest-leverage moves are a two-stage retrieve-then-rerank design and a real evaluation set — without evaluation you are tuning blind.

A language model only knows what it saw in training. RAG closes that gap by retrieving relevant passages from your own corpus at query time and putting them in the prompt, so the model answers from evidence you control and can cite. We build these systems for a living — our AI integration practice ships RAG over real client data, and our data engineering practice builds the ingestion plumbing behind it. The sections below follow the order you actually build in, not the order tutorials present.

1. Ingestion: chunk, then enrich

Ingestion is offline and unglamorous, and it decides your ceiling. Parse each source into clean text, split it into chunks on semantic boundaries, and attach metadata. The mistake here is fixed-size character splitting that cuts sentences in half and strips the context a chunk needs to be useful.

  • Chunk on headings and paragraphs; a 300–800 token target with ~10–15% overlap is a sane starting point.
  • Attach metadata to every chunk — source, section title, URL, timestamp — so you can filter retrieval and cite the origin.
  • Strip boilerplate (nav, footers, repeated headers) before embedding; it pollutes similarity scores.
  • Make ingestion idempotent and re-runnable so you can re-chunk the whole corpus when you change strategy.

2. Embeddings and the vector store

An embedding model maps each chunk to a vector so that similar meaning lands nearby in space. Pick one embedding model and use it for both ingestion and queries — mixing models silently destroys retrieval. Store the vectors with their metadata in a vector database or a Postgres extension.

// Ingest: embed each chunk once, store vector + metadata
for (const chunk of chunks) {
  const embedding = await embed(chunk.text);   // SAME model as query time
  await index.upsert({
    id: chunk.id,
    values: embedding,
    metadata: {
      source: chunk.source,
      title: chunk.title,
      url: chunk.url,
      tenantId: chunk.tenantId,                 // for per-tenant isolation
    },
  });
}

Choosing the store is its own decision — managed vector DBs, self-hosted engines, and Postgres + pgvector all trade off differently. Our vector database comparison walks the options for a production RAG workload.

3. Retrieval and reranking: the two-stage pattern

At query time, embed the question and pull the top 20–50 candidates by vector similarity. That stage optimizes recall — it casts a wide net. Then rerank: a cross-encoder scores each candidate against the query directly and surfaces the few that truly answer it. This two-stage design is the single biggest quality lever in most RAG systems.

// Stage 1: cheap, recall-oriented vector retrieval
const queryVec = await embed(question);
const candidates = await index.query({
  vector: queryVec,
  topK: 40,
  filter: { tenantId },          // never retrieve across tenants
});

// Stage 2: precise cross-encoder rerank, keep the best few
const reranked = await rerank(question, candidates.map((c) => c.text));
const context = reranked.slice(0, 6);   // top passages feed the prompt
  • Consider hybrid retrieval — combine dense vectors with keyword (BM25) search to catch exact terms, IDs, and rare names.
  • Always apply a metadata filter for the requesting user/tenantbefore ranking, not after.
  • Cap the number of passages you pass downstream; more context is not always better and costs tokens.

4. Prompt assembly and generation

Assemble a prompt that puts the retrieved passages in front of the model with clear instructions: answer only from the provided context, cite the source for each claim, and say "I don't know" when the context does not contain the answer. Label each passage so the model can cite it.

  • Instruct the model to ground its answer in the context and to refuse politely when evidence is missing.
  • Number or tag passages so citations map back to real sources.
  • Treat retrieved text as untrusted — a document can contain instructions aimed at hijacking the model. See our prompt-injection prevention guide.
  • Long context windows are not free; trimming and reranking keep cost down — see LLM cost optimization.

5. Evaluation: stop tuning blind

Evaluation is what separates a demo from a product. Measure retrieval and generation separately so you know which half to fix.

  • Retrieval: a labeled set of questions with known-good chunks; track recall and precision at k.
  • Faithfulness: does the answer stay grounded in retrieved context, or invent facts?
  • Answer relevance & citations: does it address the question and cite real sources?
  • Run an automated LLM-as-judge pass on every change, backed by a small human-reviewed golden set to catch judge error.

Mid-post: ship RAG that holds up in production

A working demo is a weekend. A RAG system that is accurate, secure, and cost-controlled at scale is engineering. Book a free scoping call and we'll map the right architecture for your data.

The RAG pipeline at a glance

StageWhat it does
ChunkSplit sources on semantic boundaries with overlap + metadata
EmbedMap chunks to vectors with one consistent model
RetrievePull top-k candidates by similarity, filtered by tenant
RerankCross-encoder rescores candidates for precision
GenerateGrounded prompt with citations; refuse when no evidence
EvaluateScore retrieval + faithfulness against a golden set

For where RAG fits among other ways to add intelligence to a product, see adding AI features to your SaaS.

Operational practices that hold over time

RAG quality decays as your corpus grows and drifts. Three habits keep it honest past launch:

  • Freshness. Re-ingest on a schedule and on document change; stale chunks produce confidently wrong answers.
  • Observability. Log the retrieved passages for every answer so you can debug a bad response instead of guessing.
  • Regression testing. Run the evaluation set in CI so a prompt or chunk-size change cannot silently degrade quality.

When the corpus is large and changes constantly, the storage layer matters — our data engineering practice builds the pipelines that keep an index fresh, and the choice of warehouse vs lake shapes where your source documents live.

Frequently asked questions

What is a RAG pipeline?

A retrieval-augmented generation (RAG) pipeline grounds a large language model in your own data instead of relying solely on what the model memorized during training. At query time it retrieves the most relevant passages from a knowledge base — usually via vector similarity search — and injects them into the prompt as context. The model then answers from that retrieved evidence, which reduces hallucination and lets you cite sources. A RAG pipeline has two halves: an offline ingestion path that chunks and embeds documents, and an online query path that retrieves, reranks, assembles a prompt, and generates an answer.

How should I chunk documents for RAG?

Chunk on semantic boundaries — headings, paragraphs, or logical sections — rather than a fixed character count that splits sentences mid-thought. A common starting point is 300 to 800 tokens per chunk with a small overlap of 10 to 15 percent so context is not lost at boundaries. Always attach metadata to each chunk (source document, section title, URL, timestamp) so you can filter retrieval and cite the origin. Tune chunk size against your evaluation set: too large and retrieval gets noisy, too small and chunks lose the context needed to be useful.

Do I need a reranker in my RAG pipeline?

Often yes. Vector search is fast but approximate — it retrieves passages that are semantically near the query, not necessarily the ones that best answer it. A cross-encoder reranker takes the top 20 to 50 candidates from vector search and rescores each one against the query directly, surfacing the genuinely relevant passages to the top. The two-stage pattern — cheap recall-oriented vector retrieval followed by precise reranking — is the single highest-leverage quality improvement in most production RAG systems.

How do I evaluate a RAG pipeline?

Evaluate retrieval and generation separately. For retrieval, build a labeled set of questions with known relevant chunks and measure recall and precision at k. For generation, measure faithfulness (does the answer stay grounded in the retrieved context?), answer relevance, and citation accuracy. Automated LLM-as-judge scoring plus a small human-reviewed golden set catches regressions before they ship. Without evaluation you are tuning chunk size and prompts blind, and quality silently drifts as your corpus grows.

RAG vs fine-tuning — which should I use?

They solve different problems. RAG injects knowledge that changes frequently or is too large to memorize — documentation, support tickets, product catalogs — and lets you update answers by updating data, with citations. Fine-tuning changes behavior, format, or tone, and teaches the model patterns rather than facts. Most production systems start with RAG because it is cheaper to iterate and easier to audit, then add light fine-tuning only when they need a consistent output style the prompt cannot reliably enforce.

How do I keep a RAG pipeline secure?

Treat retrieved content as untrusted input and apply per-user authorization to the index so a query only retrieves documents that user is allowed to see — RAG over a shared index is a classic data-leak path. Guard against prompt injection hidden inside retrieved documents, never let retrieved text silently override system instructions, and log what was retrieved for each answer so you can audit a bad response. We cover the injection threat in our prompt-injection guide.

Ground your model in your own data.

From ingestion to evaluation, we build RAG systems that are accurate, secure, and cost-controlled. Book a free scoping call and we'll cover the right architecture for your corpus.

Or email Bill at beltz@quantlabusa.dev
All blog postsUpdated June 3, 2026