

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

Retrieval-Augmented Generation (RAG) has become the default approach for building LLM applications that need to work with private or domain-specific data. The basic idea is straightforward: retrieve relevant documents, stuff them into the prompt, let the model generate an answer.
The gap between a RAG demo and a production RAG system is significant. Here's what I've learned building these systems.
The tutorial version of RAG looks like this:
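A minimal sketch of that tutorial loop. The `embed`, `similarity`, and `llm_complete` names are placeholders injected as callables, not a specific library:

```python
def naive_rag(query, documents, embed, similarity, llm_complete):
    # 1. Split documents into fixed-size chunks
    chunks = [doc[i:i + 500] for doc in documents
              for i in range(0, len(doc), 500)]
    # 2. Retrieve the chunks most similar to the query
    top = sorted(chunks,
                 key=lambda c: similarity(embed(query), embed(c)),
                 reverse=True)[:3]
    # 3. Stuff them into the prompt and generate
    prompt = ("Answer using these sources:\n\n"
              + "\n\n".join(top)
              + f"\n\nQuestion: {query}")
    return llm_complete(prompt)
```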
This works surprisingly well for demos. It breaks in production for predictable reasons: fixed-size chunks lose their surrounding context, pure vector search misses exact keyword matches, users write vague queries, and the model invents answers when the retrieved context doesn't contain them.
The single biggest improvement you can make to a RAG system is better chunking.
Instead of fixed-size chunks, preserve the document's structure:
```python
def hierarchical_chunk(document):
    sections = split_by_headers(document)
    chunks = []
    for section in sections:
        # Keep section context with each chunk
        for paragraph in section.paragraphs:
            chunks.append({
                "content": paragraph.text,
                "section_title": section.title,
                "document_title": document.title,
                "metadata": {
                    "source": document.source,
                    "date": document.date,
                    "section_path": section.full_path,
                },
            })
    return chunks
```

The key insight: every chunk should be understandable in isolation. If a chunk references "the above method" or "as mentioned earlier," it's a bad chunk.
Store chunks at multiple granularities. Retrieve at the paragraph level (more precise), but pass the parent section to the LLM (more context):
```python
# Embed small chunks for precise retrieval
small_chunks = split_into_paragraphs(document)
embed_and_store(small_chunks)

# On retrieval, expand to parent sections
results = vector_search(query, top_k=5)
expanded = [get_parent_section(chunk) for chunk in results]
deduplicated = deduplicate(expanded)
```

This gives you the precision of small chunks with the context of large ones.
Pure vector search misses exact keyword matches. Pure keyword search misses semantic relationships. Use both:
```python
def hybrid_search(query, top_k=10):
    vector_results = vector_store.search(
        embed(query), top_k=top_k
    )
    keyword_results = bm25_index.search(
        query, top_k=top_k
    )
    # Reciprocal Rank Fusion
    combined = reciprocal_rank_fusion(
        vector_results, keyword_results
    )
    return combined[:top_k]
```

Reciprocal Rank Fusion (RRF) is simple and works well. It doesn't require tuning weights between the two result sets.
Users write terrible queries. Before they hit the retrieval layer, transform them: rewrite conversational follow-ups into standalone questions, resolve pronouns against the chat history, and expand abbreviations.
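A common version of this is an LLM rewrite step. A sketch, with the model behind a plain `llm` callable (prompt string in, completion string out) so any client can be wired in; the prompt wording here is an illustration, not a tested template:

```python
def transform_query(raw_query, llm, chat_history=None):
    """Rewrite a user query into a standalone, retrieval-friendly form.

    `llm` is any callable mapping a prompt string to a completion
    string (hypothetical interface; wire in your own client).
    """
    history = "\n".join(chat_history or [])
    prompt = (
        "Rewrite the user's question as a standalone search query. "
        "Resolve pronouns, expand abbreviations, drop filler words.\n"
        f"Conversation so far:\n{history}\n"
        f"Question: {raw_query}\n"
        "Rewritten query:"
    )
    return llm(prompt).strip()
```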
Retrieval gets you candidates. Reranking orders them by actual relevance:
```python
candidates = hybrid_search(query, top_k=20)
reranked = cross_encoder.rerank(query, candidates, top_k=5)
```

Cross-encoder rerankers are significantly more accurate than embedding similarity alone. They're slower, which is why you use them on a small candidate set after initial retrieval.
How you present retrieved documents to the LLM affects answer quality:
```
Based on the following sources, answer the user's question.

Source 1 (Company Annual Report 2025, Section: Financial Overview):
[content]

Source 2 (Board Meeting Minutes, Date: 2025-11-15):
[content]

Question: What was the company's revenue growth in 2025?
```
Include source metadata. The model uses it for disambiguation and can cite sources in its response.
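A sketch of a prompt builder over the chunk dictionaries produced by the chunker above (the field names are the ones `hierarchical_chunk` emits; the exact prompt wording is illustrative):

```python
def build_prompt(question, chunks):
    # Label each source with its metadata so the model can
    # disambiguate between documents and cite them by number.
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        header = (
            f"Source {i} ({chunk['document_title']}, "
            f"Section: {chunk['section_title']})"
        )
        sources.append(f"{header}:\n{chunk['content']}")
    return (
        "Based on the following sources, answer the user's question.\n\n"
        + "\n\n".join(sources)
        + f"\n\nQuestion: {question}"
    )
```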
Production RAG systems must handle questions that can't be answered from the available context. The simplest approach:
```
If the provided sources don't contain enough information
to answer the question, say so explicitly. Do not make up
information that isn't supported by the sources.
```
This works better than elaborate prompt engineering. The model is already good at knowing when it lacks information — you just need to give it permission to say so.
RAG evaluation requires measuring both retrieval and generation quality.
Tools like Ragas and DeepEval can automate these evaluations. The key is building a golden dataset of question-answer-context triples that represents your actual use cases.
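Retrieval quality is the easier half to measure. A sketch of recall@k over a golden dataset, where each example pairs a question with the id of the chunk that should be retrieved; the dataset shape here is an assumption for illustration, not a Ragas or DeepEval format:

```python
def recall_at_k(golden, retrieve, k=5):
    """Fraction of golden questions whose gold chunk id appears in
    the top-k retrieved ids. `retrieve(question, k)` is any callable
    returning a list of chunk ids (hypothetical interface)."""
    hits = 0
    for example in golden:
        retrieved_ids = retrieve(example["question"], k)
        if example["gold_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden)
```

Run it on every retrieval change; a drop here usually explains a drop in end-to-end answer quality.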
A production RAG pipeline chains these pieces: transform the query, run hybrid retrieval, rerank the candidates, build the prompt with source metadata, and stream the generation.
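Glued together, the stages read as one short function. A sketch with each stage injected as a callable, since the concrete implementations vary per system:

```python
def rag_pipeline(query, transform, search, rerank, build_prompt, generate):
    # Each argument is a callable implementing one stage of the pipeline.
    standalone = transform(query)          # query transformation
    candidates = search(standalone)        # hybrid retrieval
    top = rerank(standalone, candidates)   # cross-encoder rerank
    prompt = build_prompt(query, top)      # prompt with source metadata
    return generate(prompt)                # (ideally streaming) generation
```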
A typical breakdown for a sub-2-second response:
| Step | Budget |
|---|---|
| Query transformation | 200ms |
| Retrieval | 100ms |
| Reranking | 300ms |
| Generation (streaming first token) | 400ms |
| Overhead | 100ms |
Stream the response. Users don't mind waiting 2 seconds for a complete answer, but they mind staring at a blank screen.
RAG isn't glamorous infrastructure work. But it's the foundation of most useful LLM applications, and getting it right is the difference between a demo that impresses and a product that delivers.
