

“Set your heart upon your work, but never on its reward.”
Bhagavad Gita

Retrieval-Augmented Generation (RAG) has become the default approach for building LLM applications that need to work with private or domain-specific data. The basic idea is straightforward: retrieve relevant documents, stuff them into the prompt, let the model generate an answer.
The gap between a RAG demo and a production RAG system is significant. Here's what I've learned building these systems.
The tutorial version of RAG looks like this:
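A minimal sketch of that tutorial loop. The `embed`, `similarity`, and `llm_complete` names are placeholders injected as callables, not a specific library:

```python
def naive_rag(query, documents, embed, similarity, llm_complete):
    # 1. Split documents into fixed-size chunks
    chunks = [doc[i:i + 500] for doc in documents
              for i in range(0, len(doc), 500)]
    # 2. Retrieve the chunks most similar to the query
    top = sorted(chunks,
                 key=lambda c: similarity(embed(query), embed(c)),
                 reverse=True)[:3]
    # 3. Stuff them into the prompt and generate
    prompt = ("Answer using these sources:\n\n"
              + "\n\n".join(top)
              + f"\n\nQuestion: {query}")
    return llm_complete(prompt)
```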
This works surprisingly well for demos. It breaks in production for predictable reasons: fixed-size chunks lose their surrounding context, pure vector search misses exact keyword matches, users write vague queries, and the model invents answers when the retrieved context doesn't contain them.
The single biggest improvement you can make to a RAG system is better chunking.
Instead of fixed-size chunks, preserve the document's structure:
```python
def hierarchical_chunk(document):
    sections = split_by_headers(document)
    chunks = []
    for section in sections:
        # Keep section context with each chunk
        for paragraph in section.paragraphs:
            chunks.append({
                "content": paragraph.text,
                "section_title": section.title,
                "document_title": document.title,
                "metadata": {
                    "source": document.source,
                    "date": document.date,
                    "section_path": section.full_path,
                },
            })
    return chunks
```

The key insight: every chunk should be understandable in isolation. If a chunk references "the above method" or "as mentioned earlier," it's a bad chunk.
Store chunks at multiple granularities. Retrieve at the paragraph level (more precise), but pass the parent section to the LLM (more context):
```python
# Embed small chunks for precise retrieval
small_chunks = split_into_paragraphs(document)
embed_and_store(small_chunks)

# On retrieval, expand to parent sections
results = vector_search(query, top_k=5)
expanded = [get_parent_section(chunk) for chunk in results]
deduplicated = deduplicate(expanded)
```

This gives you the precision of small chunks with the context of large ones.
Pure vector search misses exact keyword matches. Pure keyword search misses semantic relationships. Use both:
```python
def hybrid_search(query, top_k=10):
    vector_results = vector_store.search(
        embed(query), top_k=top_k
    )
    keyword_results = bm25_index.search(
        query, top_k=top_k
    )
    # Reciprocal Rank Fusion
    combined = reciprocal_rank_fusion(
        vector_results, keyword_results
    )
    return combined[:top_k]
```

Reciprocal Rank Fusion (RRF) is simple and works well. It doesn't require tuning weights between the two result sets.
Users write terrible queries. Before they hit the retrieval layer, transform them: rewrite conversational follow-ups into standalone questions, resolve pronouns against the chat history, and expand abbreviations.
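A common version of this is an LLM rewrite step. A sketch, with the model behind a plain `llm` callable (prompt string in, completion string out) so any client can be wired in; the prompt wording here is an illustration, not a tested template:

```python
def transform_query(raw_query, llm, chat_history=None):
    """Rewrite a user query into a standalone, retrieval-friendly form.

    `llm` is any callable mapping a prompt string to a completion
    string (hypothetical interface; wire in your own client).
    """
    history = "\n".join(chat_history or [])
    prompt = (
        "Rewrite the user's question as a standalone search query. "
        "Resolve pronouns, expand abbreviations, drop filler words.\n"
        f"Conversation so far:\n{history}\n"
        f"Question: {raw_query}\n"
        "Rewritten query:"
    )
    return llm(prompt).strip()
```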
Retrieval gets you candidates. Reranking orders them by actual relevance:
```python
candidates = hybrid_search(query, top_k=20)
reranked = cross_encoder.rerank(query, candidates, top_k=5)
```

Cross-encoder rerankers are significantly more accurate than embedding similarity alone. They're slower, which is why you use them on a small candidate set after initial retrieval.
How you present retrieved documents to the LLM affects answer quality:
```
Based on the following sources, answer the user's question.

Source 1 (Company Annual Report 2025, Section: Financial Overview):
[content]

Source 2 (Board Meeting Minutes, Date: 2025-11-15):
[content]

Question: What was the company's revenue growth in 2025?
```
Include source metadata. The model uses it for disambiguation and can cite sources in its response.
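A sketch of a prompt builder over the chunk dictionaries produced by the chunker above (the field names are the ones `hierarchical_chunk` emits; the exact prompt wording is illustrative):

```python
def build_prompt(question, chunks):
    # Label each source with its metadata so the model can
    # disambiguate between documents and cite them by number.
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        header = (
            f"Source {i} ({chunk['document_title']}, "
            f"Section: {chunk['section_title']})"
        )
        sources.append(f"{header}:\n{chunk['content']}")
    return (
        "Based on the following sources, answer the user's question.\n\n"
        + "\n\n".join(sources)
        + f"\n\nQuestion: {question}"
    )
```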
Production RAG systems must handle questions that can't be answered from the available context. The simplest approach:
```
If the provided sources don't contain enough information
to answer the question, say so explicitly. Do not make up
information that isn't supported by the sources.
```
This works better than elaborate prompt engineering. The model is already good at knowing when it lacks information — you just need to give it permission to say so.
RAG evaluation requires measuring both retrieval and generation quality.
Tools like Ragas and DeepEval can automate these evaluations. The key is building a golden dataset of question-answer-context triples that represents your actual use cases.
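Retrieval quality is the easier half to measure. A sketch of recall@k over a golden dataset, where each example pairs a question with the id of the chunk that should be retrieved; the dataset shape here is an assumption for illustration, not a Ragas or DeepEval format:

```python
def recall_at_k(golden, retrieve, k=5):
    """Fraction of golden questions whose gold chunk id appears in
    the top-k retrieved ids. `retrieve(question, k)` is any callable
    returning a list of chunk ids (hypothetical interface)."""
    hits = 0
    for example in golden:
        retrieved_ids = retrieve(example["question"], k)
        if example["gold_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden)
```

Run it on every retrieval change; a drop here usually explains a drop in end-to-end answer quality.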
A production RAG pipeline chains these pieces: transform the query, run hybrid retrieval, rerank the candidates, build the prompt with source metadata, and stream the generation.
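Glued together, the stages read as one short function. A sketch with each stage injected as a callable, since the concrete implementations vary per system:

```python
def rag_pipeline(query, transform, search, rerank, build_prompt, generate):
    # Each argument is a callable implementing one stage of the pipeline.
    standalone = transform(query)          # query transformation
    candidates = search(standalone)        # hybrid retrieval
    top = rerank(standalone, candidates)   # cross-encoder rerank
    prompt = build_prompt(query, top)      # prompt with source metadata
    return generate(prompt)                # (ideally streaming) generation
```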
A typical breakdown for a sub-2-second response:
| Step | Budget |
|---|---|
| Query transformation | 200ms |
| Retrieval | 100ms |
| Reranking | 300ms |
| Generation (streaming first token) | 400ms |
| Overhead | 100ms |
Stream the response. Users don't mind waiting 2 seconds for a complete answer, but they mind staring at a blank screen.
RAG isn't glamorous infrastructure work. But it's the foundation of most useful LLM applications, and getting it right is the difference between a demo that impresses and a product that delivers.
