What is RAG (Retrieval-Augmented Generation)?

Large language models hallucinate. Their training data has a cutoff date. They know nothing about your internal documents. Retrieval-Augmented Generation — RAG — tackles all three problems at once.

The idea is straightforward: before the LLM generates a response, retrieve relevant information from external sources and include it in the prompt. The model's answer is then grounded in actual data — not just what it memorized during training. The result is more accurate, more current, and traceable to specific sources. That's why RAG has become the default architecture for connecting LLMs to organizational knowledge.

How RAG Works

A RAG pipeline has two phases.

Indexing Phase

Documents get split into chunks, converted into vector embeddings using an embedding model, and stored in a vector database or search engine like Elasticsearch or OpenSearch. Metadata — source, date, category — is preserved alongside the vectors for filtering.
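The indexing phase can be sketched in a few lines of Python. Everything here is a toy stand-in for illustration: the hashed bag-of-words `embed` function replaces a real embedding model, and a plain Python list replaces the vector database, but the shape of the pipeline — chunk, embed, store with metadata — is the same.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy embedding: hashed bag-of-words, unit-normalized.
    # A real pipeline would call an embedding model here instead.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 40) -> list[str]:
    # Fixed-size word chunking; production systems usually chunk on
    # sentence or section boundaries, often with overlap.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

index: list[dict] = []  # in-memory stand-in for a vector database

def add_document(doc_id: str, text: str, metadata: dict) -> None:
    # Each chunk is stored with its vector AND its metadata,
    # so later queries can filter by source, date, category, etc.
    for i, c in enumerate(chunk(text)):
        index.append({"id": f"{doc_id}#{i}", "text": c,
                      "vector": embed(c), "meta": metadata})
```

In a real deployment the store would be Elasticsearch, OpenSearch, or a dedicated vector database, but the indexing contract is identical: chunks in, vectors plus metadata out.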

Query Phase

  1. The user's question is embedded using the same model.
  2. The system searches the vector store for the most semantically similar chunks.
  3. Retrieved chunks are assembled into a prompt alongside the original question.
  4. The LLM generates a response grounded in that context.
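The four steps above can be sketched end to end. The hashed bag-of-words embedding below is a toy stand-in for a real embedding model (note that query and index must share the same one), the sample chunks are invented, and the function returns the assembled prompt rather than calling an LLM — the last step would hand this prompt to the model.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy hashed bag-of-words embedding, unit-normalized.
    # The SAME embedding must be used at indexing and query time.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-norm, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# A tiny pre-indexed store (the indexing phase has already run).
store = [{"text": t, "vector": embed(t)} for t in [
    "Employees accrue 20 vacation days per year.",
    "The VPN requires two-factor authentication.",
    "Expense reports are due by the 5th of each month.",
]]

def build_prompt(question: str, k: int = 2) -> str:
    q = embed(question)                                    # 1. embed the question
    hits = sorted(store, key=lambda c: cosine(q, c["vector"]),
                  reverse=True)[:k]                        # 2. retrieve top-k chunks
    context = "\n".join(f"- {h['text']}" for h in hits)    # 3. assemble the prompt
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}")                       # 4. this goes to the LLM
```

The prompt template here is deliberately minimal; production systems typically add citation instructions, fallback behavior for empty retrievals, and metadata filters on the search itself.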

Every step matters. Chunking strategy, embedding model choice, retrieval method, prompt construction — each one directly impacts answer quality. Getting retrieval right is often harder than getting generation right.

Why RAG Works

  1. Reduced Hallucination: Grounding responses in retrieved evidence cuts fabrication dramatically. Not to zero — but enough to make the difference between useful and unreliable.

  2. Access to Current and Proprietary Data: Connect LLMs to live knowledge bases, internal documents, and databases that weren't in the training set. No fine-tuning required.

  3. Transparency: Retrieved sources can be cited. Users verify claims and trace answers back to specific documents. That traceability matters for trust.

  4. Cheap Knowledge Updates: Index new documents and they're immediately available. Far cheaper and faster than retraining a model.

  5. Domain Specialization: A general-purpose LLM gives expert-level answers when connected to a specialized knowledge base. Medical literature, legal case law, internal engineering docs — the model doesn't need to have been trained on them.

RAG Patterns

Not all RAG is created equal.

  • Naive RAG: Retrieve chunks, stuff them in a prompt, generate. Simple. Works for straightforward Q&A.
  • Advanced RAG: Add query rewriting, re-ranking, hybrid search (vector + keyword), and iterative retrieval. This is where most production systems land.
  • Modular RAG: Each component — retrieval, augmentation, generation — is swappable. Experiment with individual stages without rebuilding the pipeline. Shraga is a good example of this philosophy.
  • Agentic RAG: An AI agent decides when and how to retrieve, which sources to query, and whether the information is sufficient. Reasoning meets retrieval.
  • GraphRAG: Build a knowledge graph over your data and traverse relationships during retrieval. Handles multi-hop questions that standard chunk-based RAG can't.
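One concrete building block behind Advanced RAG's hybrid search is reciprocal rank fusion (RRF), which Elasticsearch and OpenSearch both support: it merges a vector-search ranking and a keyword-search ranking by rank position alone, sidestepping their incompatible score scales. A minimal sketch (the document IDs are invented):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranked list contributes 1 / (k + rank) to a document's score.
    # k=60 is the constant commonly used in the RRF literature; it damps
    # the influence of any single list's top positions.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc_a", "doc_c", "doc_b"]   # from semantic search
keyword_hits = ["doc_c", "doc_d", "doc_a"]   # from BM25/keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc_c ranks first: it appears high in both lists.
```

Documents that surface in both retrievers float to the top, which is exactly the behavior hybrid search is after.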

Use Cases

  • Enterprise Knowledge Assistants: Internal chatbots answering employee questions from company wikis, policy documents, and knowledge bases.
  • Customer Support: AI assistants pulling answers from product docs, FAQs, and support ticket history.
  • Legal and Compliance: Retrieve relevant regulations, case law, or policy documents to support legal research.
  • Healthcare: Ground clinical decision support in medical literature, drug databases, and clinical guidelines.
  • Technical Documentation: Answer questions about APIs, codebases, and infrastructure by retrieving from docs and code. Developers are already using this daily.

RAG in the Broader Stack

RAG sits at the intersection of search and generative AI. It depends on solid retrieval infrastructure — vector databases, search engines like Elasticsearch and OpenSearch, embedding models — and benefits from advances on both the search and LLM sides. Frameworks like LangChain and LangGraph provide the orchestration layer, while tools like Langfuse add the observability needed to keep RAG systems reliable in production.
