A deep-dive into RAG system design for architects and tech leads - from naive pipelines to advanced retrieval patterns and agentic RAG, with concrete benchmarks and production trade-offs.

Every team building with LLMs hits the same wall. The model sounds confident, but its answers are stale, generic, or flat-out wrong. It can't access your internal docs, your product catalog, or last week's incident reports. Fine-tuning helps with tone and format, but it doesn't solve the knowledge problem - the model still doesn't know your data.

Retrieval-Augmented Generation (RAG) is the architecture pattern that solves this. RAG is a system design in which an LLM's generation step is preceded by a retrieval step that fetches relevant documents from an external knowledge base, injects them into the prompt as context, and grounds the model's output in actual data. It was introduced by Meta AI researchers in 2020 and has since become the default architecture for knowledge-grounded AI applications. This post walks through how RAG systems are actually built - from the naive pipeline that most teams start with, through advanced retrieval patterns, to agentic architectures where retrieval becomes just one tool in an autonomous loop.

The Naive RAG Pipeline - and Where It Breaks

Most RAG implementations follow a two-phase architecture: an offline ingestion pipeline and a real-time query pipeline.

Ingestion works like this:

  1. Chunk source documents into segments (typically 256-1024 tokens)
  2. Embed each chunk into a dense vector using an embedding model (OpenAI text-embedding-3-large, Cohere embed-v4, etc.)
  3. Store vectors in a vector database (Pinecone, Weaviate, Qdrant, OpenSearch, Elasticsearch)

Query time mirrors this flow:

  1. Embed the user's query using the same embedding model
  2. Retrieve the top-k most similar chunks via approximate nearest neighbor (ANN) search
  3. Inject the retrieved chunks into the LLM prompt as context
  4. Generate a response grounded in that context

This works surprisingly well for simple Q&A over a small, clean corpus. But it breaks down fast in production.
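
The whole naive loop fits in a few dozen lines. Everything below is a toy stand-in - the bag-of-words `embed` replaces a real embedding model, and the `index` list replaces a vector database - but the shape matches the two-phase pipeline above:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model (e.g. text-embedding-3-large):
    # a bag-of-words count vector is enough to show the pipeline shape.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# --- offline ingestion: chunk, embed, store ---
documents = [
    "Our enterprise tier includes SSO and a 99.9% uptime SLA.",
    "Rate limits are 1000 requests per minute on the Pro plan.",
    "Support tickets are answered within 4 business hours.",
]
index = [(doc, embed(doc)) for doc in documents]  # the "vector store"

# --- query time: embed the query, retrieve top-k, build the prompt ---
def retrieve(query: str, k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

context = "\n".join(retrieve("What are the rate limits?"))
prompt = f"Answer using only this context:\n{context}\n\nQ: What are the rate limits?"
```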

The failure modes are predictable. Fixed-size chunking splits paragraphs mid-thought, losing context. Embedding similarity alone retrieves chunks that are semantically adjacent but not actually relevant to the question. Single-shot retrieval gives you one chance to get the right documents - if the query is ambiguous or poorly phrased, you get garbage context. And there's no feedback loop: the system has no way to know whether what it retrieved was useful before it generates an answer.

Most teams get stuck at this stage. They tune chunk sizes, swap embedding models, fiddle with top-k - all of which helps marginally but doesn't fix the structural problems.

Advanced RAG Architecture Patterns

The gap between a naive RAG prototype and a production-grade system is filled by a set of well-studied retrieval techniques. These aren't optional enhancements - they're what separate systems that occasionally work from systems that reliably work.

Query Transformation

The user's raw query is often a poor retrieval key. It might be too short, too vague, or phrased differently from how the answer appears in your documents. Query transformation techniques rewrite the query before retrieval to close this gap.

HyDE (Hypothetical Document Embeddings) is one of the most effective approaches. Instead of embedding the query directly, you first ask the LLM to generate a hypothetical answer - a "fake" document that would contain the relevant information. You then embed that document and use it as the retrieval vector. This works because the hypothetical document is closer in embedding space to the actual answer than the short query would be. The original HyDE paper by Gao et al. showed consistent improvements across web search, QA, and fact verification tasks.
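
A minimal sketch of the HyDE flow; `llm_generate`, `embed`, and `vector_search` are stubs standing in for a real LLM client, embedding model, and vector store:

```python
import re

def llm_generate(prompt: str) -> str:
    # Stub: in production this is a call to your LLM provider.
    return "The Pro plan allows 1000 requests per minute per API key."

def embed(text: str) -> set[str]:
    # Stub embedding: a token set instead of a dense vector.
    return set(re.findall(r"\w+", text.lower()))

def vector_search(vector: set[str], top_k: int = 5) -> list[str]:
    # Stub ANN search: score a tiny corpus by token overlap.
    corpus = [
        "Rate limits are 1000 requests per minute on the Pro plan.",
        "Enterprise tier includes SSO and a 99.9% uptime SLA.",
    ]
    return sorted(corpus, key=lambda d: -len(vector & embed(d)))[:top_k]

def hyde_retrieve(query: str, k: int = 5) -> list[str]:
    # HyDE: embed a hypothetical LLM-written answer, not the raw query.
    # The fake answer sits closer in embedding space to real answer
    # passages than a short question does.
    hypothetical_doc = llm_generate(f"Write a short passage answering: {query}")
    return vector_search(embed(hypothetical_doc), top_k=k)
```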

Query decomposition breaks complex multi-hop questions into sub-questions. "How does our pricing compare to competitors for enterprise customers?" becomes three separate retrieval queries: one for your pricing, one for competitor pricing, one for enterprise tier definitions. Each sub-query retrieves its own context, and the LLM synthesizes across all of them.
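
Decomposition can be sketched the same way; `llm` and `retrieve` here are hypothetical callables standing in for your model client and retriever:

```python
def decompose_and_answer(question: str, llm, retrieve) -> str:
    # Ask the LLM to split the question into independent sub-questions.
    sub_questions = [
        q.strip()
        for q in llm(
            "Split this question into independent sub-questions, "
            f"one per line:\n{question}"
        ).splitlines()
        if q.strip()
    ]
    # Retrieve context for each sub-question separately, then synthesize.
    evidence = "\n\n".join(
        f"Sub-question: {q}\n" + "\n".join(retrieve(q)) for q in sub_questions
    )
    return llm(f"Using this evidence, answer: {question}\n\n{evidence}")
```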

Hybrid Retrieval and Re-ranking

Dense vector search captures semantic similarity but misses exact keyword matches. BM25 (the lexical ranking algorithm behind traditional search engines) catches exact terms but misses semantic equivalents. Hybrid retrieval combines both, and the reported benchmarks are consistent: hybrid BM25 + dense vector search can lift recall from roughly 0.72 to roughly 0.91 compared to either method alone - a 15-30% relative improvement with minimal added complexity.

The combination typically uses Reciprocal Rank Fusion (RRF) to merge the two ranked lists without requiring score normalization. Both Elasticsearch and OpenSearch support hybrid search with RRF natively.
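
RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as the conventional constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc IDs, best first. No score
    # normalization is needed - only rank positions are used.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]   # keyword (BM25) ranking
dense_hits = ["doc1", "doc9", "doc3"]  # dense vector ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that rank well in both lists (like doc1 here) float to the top, even though the two retrievers score on incompatible scales.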

After initial retrieval, a cross-encoder re-ranker scores each retrieved chunk against the original query with full attention (not just embedding similarity). This is computationally heavier than bi-encoder retrieval, so you run it on a small candidate set - retrieve 50 chunks with hybrid search, re-rank to the top 5. Cross-encoder re-ranking consistently adds 5-15% accuracy on top of hybrid retrieval, making it the default second stage in production RAG pipelines.
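
The two-stage shape can be sketched as follows; `search` and `score` are hypothetical callables standing in for your hybrid first stage and a cross-encoder model:

```python
def retrieve_then_rerank(query: str, search, score,
                         k: int = 50, top_n: int = 5) -> list[str]:
    # Stage 1: cast a wide, cheap net with hybrid search.
    candidates = search(query, k)
    # Stage 2: score each (query, doc) pair with the expensive
    # cross-encoder and keep only the best few.
    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_n]
```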

Contextual Chunking

Fixed-size chunking is the single biggest source of retrieval errors in naive RAG. Better strategies exist:

  • Parent-child chunking: Store small chunks for retrieval precision, but when a chunk is retrieved, expand the context window to include the parent section. This gives the LLM enough surrounding context to generate a coherent answer.
  • Semantic chunking: Split documents at natural boundaries - paragraph breaks, topic shifts, section headers - rather than at fixed token counts.
  • Sliding window with overlap: Chunks overlap by 10-20%, so information at chunk boundaries isn't lost.

The right strategy depends on your corpus. Structured documentation benefits from semantic chunking. Dense technical manuals need parent-child. Conversational content like support tickets works well with sliding windows.
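
Of the three, sliding windows are the simplest to sketch; the size and overlap defaults here are illustrative:

```python
def sliding_window_chunks(tokens: list[str], size: int = 256,
                          overlap: int = 32) -> list[list[str]]:
    # Each window starts `size - overlap` tokens after the previous one,
    # so text at a chunk boundary appears in two adjacent chunks.
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```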

Technique comparison - naive vs. advanced RAG:

  • Query handling - naive: direct embedding of the raw query; advanced: HyDE, decomposition, step-back prompting
  • Retrieval - naive: dense vector only (single-shot); advanced: hybrid (BM25 + dense), multi-stage
  • Ranking - naive: cosine similarity; advanced: cross-encoder re-ranking
  • Chunking - naive: fixed-size; advanced: semantic, parent-child, sliding window
  • Feedback loop - naive: none; advanced: re-ranking scores, optional self-evaluation

Agentic RAG: When Retrieval Becomes a Tool

Standard RAG - even advanced RAG - follows a fixed pipeline: transform query, retrieve, re-rank, generate. Agentic RAG breaks this linearity. In an agentic RAG system, an LLM-based agent treats retrieval as one tool among many and autonomously decides when, what, and how to retrieve based on the evolving state of its reasoning. The agent operates in a loop, not a pipeline.

The survey on Agentic RAG by Singh et al. (2025) identifies three defining properties:

  1. Autonomous strategy - the agent dynamically selects retrieval approaches without being locked into a single predefined workflow
  2. Iterative execution - the agent runs multiple retrieval rounds, adapting based on intermediate results
  3. Interleaved tool use - retrieval, computation, API calls, and reasoning are interleaved in a ReAct-style thought-action-observation loop

In practice, this means the agent might: receive a user question, attempt a retrieval, judge the results insufficient, reformulate the query, retrieve again from a different source, then synthesize. Or it might decompose a complex question into sub-tasks, route each to a specialized knowledge base (product docs vs. support tickets vs. API reference), and merge the results.
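
That thought-action-observation loop can be sketched as below. The string protocol (`SEARCH: ...` / `ANSWER: ...`) is purely illustrative - real systems use structured tool-calling - and `llm` and `tools` are hypothetical stand-ins:

```python
def agentic_answer(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    # `tools` maps an action name (e.g. "search") to a callable.
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        decision = llm(transcript)          # e.g. "SEARCH: api rate limits"
        action, _, arg = decision.partition(": ")
        if action == "ANSWER":
            return arg
        observation = tools[action.lower()](arg)  # retrieval, API call, ...
        transcript += f"\n{decision}\nObservation: {observation}"
    # Step budget exhausted: force a best-effort answer.
    return llm(transcript + "\nGive your best final answer now.")
```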

Multi-source routing is where agentic RAG shines for organizations with heterogeneous knowledge bases. Rather than dumping everything into a single vector index, you maintain separate indexes with different retrieval strategies, and let the agent decide which to query. A question about API rate limits goes to the API docs index. A question about a customer's deployment goes to the CRM knowledge base. The routing decision itself becomes part of the agent's reasoning chain.

Self-correcting retrieval closes the feedback gap that naive RAG leaves open. The agent evaluates whether retrieved documents actually answer the question - using an LLM-as-judge step or a lightweight classifier - and retries with different parameters if they don't. This is the pattern behind Self-RAG (Asai et al., 2023) and Corrective RAG (CRAG), and it meaningfully reduces hallucination rates.
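
The retry loop itself is simple; the grading step is where the value lives. `retrieve`, `grade`, and `rewrite` below are hypothetical callables - in practice `grade` is an LLM-as-judge prompt or a small classifier:

```python
def corrective_retrieve(question: str, retrieve, grade, rewrite,
                        max_tries: int = 3) -> list[str]:
    query = question
    for _ in range(max_tries):
        docs = retrieve(query)
        if grade(question, docs):   # do these docs actually answer it?
            return docs
        query = rewrite(query)      # reformulate and try again
    return docs  # fall back to the last attempt (caller may warn the user)
```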

How the three approaches compare, dimension by dimension:

  • Flow control - naive: fixed pipeline; advanced: fixed pipeline with better components; agentic: dynamic agent loop
  • Retrieval - naive: single-shot; advanced: multi-stage with re-ranking; agentic: multi-step, multi-source, self-correcting
  • Query strategy - naive: direct embedding; advanced: transformation (HyDE, decomposition); agentic: agent-selected per step
  • Failure handling - naive: none; advanced: re-ranking filters weak results; agentic: agent detects and retries
  • Complexity - naive: low; advanced: medium; agentic: high
  • Latency - naive: lowest; advanced: moderate; agentic: highest (multiple LLM calls)

RAG in Production: Evaluation, Observability, and Knowing When Not to Use RAG

Building a RAG system that works in a demo is straightforward. Keeping it working in production - across changing data, diverse queries, and cost constraints - is a different problem entirely.

Evaluation: Two Separate Problems

RAG evaluation splits into retrieval quality and generation quality. Conflating them is a common mistake.

Retrieval metrics measure whether you got the right documents:

  • Recall@k - what fraction of relevant documents appear in the top-k results
  • MRR (Mean Reciprocal Rank) - how high the first relevant document ranks
  • nDCG - a graded measure of ranking quality
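
The first two are easy to compute directly, given a ranked result list and the set of known-relevant document IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant documents that appear in the top-k results.
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant document (0.0 if none found).
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

In practice these run over a labeled evaluation set of (query, relevant-docs) pairs and are averaged.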

Generation metrics measure whether the LLM used those documents well:

  • Faithfulness - is every claim in the output grounded in the retrieved context?
  • Answer relevance - does the output actually address the user's question?
  • Hallucination rate - percentage of generated statements not supported by retrieved context

Tools like RAGAS and DeepEval automate these measurements. The practical advice: run retrieval evals and generation evals separately. A low-faithfulness score with high-recall retrieval means your prompting is the problem, not your retrieval. Low recall with high faithfulness means your retrieval needs work but your generation is solid.

When NOT to Use RAG

RAG is not always the right answer. Three alternatives cover most of the cases where teams reach for RAG but shouldn't:

  • Fine-tuning works better when the problem is behavioral - wrong output format, inconsistent tone, poor classification accuracy. Fine-tune for style, policy, and decision behavior. Don't fine-tune for constantly changing facts.
  • Long-context prompting beats RAG when your total knowledge base fits in the context window. For corpora under roughly 200K tokens, stuffing everything into the prompt (with prompt caching) is often simpler, cheaper, and more accurate than building retrieval infrastructure. But beware: accuracy drops 10-20 percentage points when relevant information sits in the middle of long contexts rather than at the start or end.
  • Knowledge graphs handle relational queries ("which customers use product X in region Y?") that vector similarity search fundamentally cannot. GraphRAG - open-sourced by Microsoft - combines knowledge graph traversal with RAG retrieval and has shown 4-10% F1 improvements over vector-only approaches on multi-hop reasoning benchmarks.

At a glance - what each approach is best for, and its limitations:

  • RAG - best for grounding LLMs in large, changing knowledge bases; limitations: retrieval errors propagate to generation, latency overhead
  • Fine-tuning - best for behavioral consistency (format, tone, policy); limitations: expensive to update, doesn't add new knowledge dynamically
  • Long-context - best for small, stable corpora (under ~200K tokens); limitations: lost-in-the-middle effect, cost scales linearly with context size
  • Knowledge graphs - best for relational and multi-hop reasoning; limitations: requires structured data, high upfront modeling cost

What's Coming

The multimodal RAG tooling market is projected to hit $4.18 billion in 2026, up from $3.32 billion in 2025. The direction is clear: RAG pipelines are expanding beyond text to images, tables, audio, and video. GraphRAG is becoming standard for enterprise use cases that need relational reasoning. And the line between "RAG system" and "AI agent" continues to blur - modern architectures treat RAG as one capability inside a broader agent framework, not as a standalone pipeline.

Key Takeaways

  • RAG grounds LLM outputs in real data by retrieving relevant documents before generation - solving hallucination, staleness, and the private data problem.
  • Naive RAG (embed-retrieve-generate) works for prototypes but breaks in production due to poor chunking, single-shot retrieval, and no relevance feedback.
  • Hybrid retrieval (BM25 + dense vectors) with cross-encoder re-ranking is the proven default for production systems, delivering 15-30% recall improvement over single-method search.
  • Query transformation techniques like HyDE and decomposition close the gap between user queries and document representations.
  • Agentic RAG gives the LLM control over the retrieval process itself, enabling multi-step, multi-source, self-correcting retrieval - at the cost of higher latency and complexity.
  • Evaluate retrieval and generation separately. Different failure modes require different fixes.
  • RAG isn't always the answer. Fine-tuning handles behavioral issues. Long-context prompting works for small corpora. Knowledge graphs handle relational queries.