A practical LangGraph tutorial on building self-correcting RAG agents: document grading, query rewriting loops, and hybrid search in production retrieval systems.

There is no shortage of tutorials explaining what Retrieval Augmented Generation is. There is a shortage of content showing what it actually looks like to build one - the data wrangling, the mapping errors, the design decisions, the dead ends. That gap is what our video series "Building Agentic RAG with LangGraph and OpenSearch" is trying to close: a live-coded, unscripted build of a production-grade RAG system over ClickHouse, OpenSearch, and Elasticsearch documentation, where things fail, get debugged, and get fixed on camera.

We created a video tutorial series - still ongoing - that guides you through building an Agentic RAG application that could withstand production. This post walks through the architecture and engineering decisions behind the series - what we're building, why we chose LangGraph, how the system evolves from a naive pipeline to a self-correcting agentic graph, and what challenges lie ahead.

From Simple RAG to Agentic RAG

Standard RAG is a pipeline: take a user question, embed it, search a vector store, stuff the top-k results into a prompt, call an LLM. It's a straight line. For well-behaved questions against a clean, single-source corpus, it works. But as the series demonstrates early on, this falls apart fast when your queries are ambiguous, your corpus spans multiple documentation sources with different conventions, or the search engine returns tangentially related content. The LLM doesn't know the context is bad - it just generates a confident, plausible-sounding answer that might be wrong.

Agentic RAG is what happens when you add decision-making to that pipeline. Instead of blindly passing retrieved documents to the LLM, the system reasons about them. Are these documents actually relevant to the question? Should the query be rewritten and retrieval attempted again? Is the question even within scope? These are the kinds of judgment calls that turn a static pipeline into an adaptive one - a LangGraph RAG system where the graph can loop, branch, and self-correct based on intermediate results.

Why LangGraph

Developers often confuse LangGraph and LangChain, so the difference is worth being precise about. LangChain is a library of components: LLM wrappers, document loaders, text splitters, retrievers, embedding models. The building blocks. LangGraph is the orchestration layer that coordinates those blocks into stateful workflows with branching and looping. They're complementary - a LangGraph retrieval agent's nodes typically invoke LangChain components internally.

The distinction that matters: LangChain chains execute as directed acyclic graphs where data flows forward. LangGraph supports cycles. A node can route execution back to an earlier node, which is exactly what you need when retrieved documents score poorly and you want to rewrite the query and try again. That loop is impossible in a chain. In a graph, it's a single conditional edge.

LangGraph's core primitives are straightforward. A StateGraph holds a typed state object that every node reads and writes. Nodes are plain Python functions. Edges connect them - either unconditionally or via routing functions that inspect the state and decide where to go next. After defining the graph, .compile() produces an executable that supports .invoke(), .stream(), and async equivalents.

The Self-Correcting RAG Architecture

The graph being built in the series follows the adaptive RAG pattern - sometimes called self-correcting RAG. Here is how the nodes connect:

START -> Validator -> Keyword Extraction -> Retrieve -> Grade Documents
                                                               |
                                                 ┌─────────────┴─────────────┐
                                                 v                           v
                                             Generate               Rewrite Query -> Retrieve (loop)
                                                 |
                                                END

The validator runs first. A cheap, fast model (Amazon Nova Pro or Claude Haiku via Amazon Bedrock) checks whether the question is about ClickHouse, OpenSearch, or Elasticsearch. Off-topic or nonsense inputs get declined immediately, burning zero retrieval or generation tokens. Keyword extraction then generates 1-5 structured search term pairs from the question rather than passing the raw query to the search engine. Multiple keyword pairs enable parallel queries that expand recall before grading filters results down.
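One possible shape for the graph state and the validator's routing function - the field names here are our guesses, not necessarily what the series uses:

```python
from typing import List, Tuple, TypedDict


class RAGState(TypedDict, total=False):
    question: str
    is_in_scope: bool                # set by the validator node (cheap model)
    keywords: List[Tuple[str, str]]  # 1-5 structured search term pairs
    documents: List[dict]            # retrieved (and later graded) documents
    answer: str


def route_valid(state: RAGState) -> str:
    # The validator node has already set is_in_scope using a cheap, fast
    # model; this routing function only inspects that flag and picks an edge.
    return "in_scope" if state.get("is_in_scope") else "out_of_scope"
```

Off-topic questions hit END straight from this routing function, which is how they burn zero retrieval or generation tokens.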

Retrieval executes searches against OpenSearch (which replaced ChromaDB early in the series - the rationale being that OpenSearch is well-understood, easy to spin up locally via Docker, and has native support for hybrid search). Grading evaluates each retrieved document for relevance using structured LLM output - a binary score. If documents pass, the graph proceeds to generation, which is explicitly instructed to answer using only the provided context with source citations. If documents fail grading, a conditional edge routes to the query rewrite node, which reformulates the question and loops back to retrieval.
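A sketch of the decide_to_generate routing behind that conditional edge, under two assumptions of ours: each graded document carries a boolean relevant flag, and there is a cap on rewrites so the loop can't run forever (the series hasn't pinned down either detail):

```python
from typing import List, TypedDict


class GradedState(TypedDict, total=False):
    documents: List[dict]  # each assumed to carry a binary "relevant" grade
    rewrite_count: int

MAX_REWRITES = 2  # assumed cap; without one the rewrite loop could spin indefinitely


def decide_to_generate(state: GradedState) -> str:
    relevant = [d for d in state.get("documents", []) if d.get("relevant")]
    if relevant:
        return "generate"
    if state.get("rewrite_count", 0) >= MAX_REWRITES:
        # Out of retries: generate anyway and let the context-only prompt
        # admit that the retrieved documents don't answer the question.
        return "generate"
    return "rewrite"
```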

In code, the graph assembly looks like this:

from langgraph.graph import StateGraph, START, END

workflow = StateGraph(RAGState)

workflow.add_node("validate", validate_scope)
workflow.add_node("extract_keywords", extract_keywords)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("grade", grade_documents)
workflow.add_node("generate", generate_response)
workflow.add_node("rewrite", rewrite_query)

workflow.add_edge(START, "validate")
workflow.add_conditional_edges("validate", route_valid, {
    "in_scope": "extract_keywords",
    "out_of_scope": END,
})
workflow.add_edge("extract_keywords", "retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", decide_to_generate, {
    "generate": "generate",
    "rewrite": "rewrite",
})
workflow.add_edge("rewrite", "retrieve")  # the self-correcting loop
workflow.add_edge("generate", END)

graph = workflow.compile()


The design is deliberately modular. Cheap models handle validation and keyword extraction; expensive models (Claude Sonnet or Opus) handle generation. Context-only generation is the core hallucination guard - the model cannot synthesize knowledge it wasn't given in the retrieved documents.
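That guard is ultimately just prompt construction. A hypothetical version - the wording is ours, not the series':

```python
def build_generation_prompt(question: str, documents: list) -> str:
    # Context-only generation: the model is told to answer exclusively
    # from the retrieved documents and to cite them, or to admit the
    # context doesn't contain the answer rather than guess.
    context = "\n\n".join(f"[{d['source']}]\n{d['text']}" for d in documents)
    return (
        "Answer the question using ONLY the context below. "
        "Cite sources in [brackets]. If the context does not contain "
        "the answer, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```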

Starting Simple and Failing Forward

The series begins with scaffolding via Claude Code, which generated the boilerplate FastAPI application (with /health and /query endpoints), the LangGraph graph structure with stub nodes, a Poetry project file with all dependencies, and an ingestion script stub. This saved the first hours of setup so the series could focus on the real problems.

Episode 3 tackles data preparation - the unglamorous work that precedes any real RAG development. A data prep pipeline loads documentation from five sources (ClickHouse docs, OpenSearch docs, OpenSearch website and blog content, Elasticsearch docs split across two repos, and BigData Boutique's internal knowledge base), each with different frontmatter conventions and folder structures. Per-repository cleaners strip JSX imports from ClickHouse docs, remove Docusaurus-specific syntax from OpenSearch website content, and handle other source-specific noise.

When the ingest script runs, approximately 5,000 documents load into OpenSearch - and problems surface immediately. OpenSearch's dynamic mapping infers a schema from the data, causing mapper_parsing_exception errors when fields like applies_to appear as an object in some documents and a string in others. The auto-generated mapping also picks up dozens of irrelevant fields inherited from various frontmatter conventions across the repos. The fix: ingest a sample, inspect the mapping, curate it down to useful fields, and re-create the index with dynamic: false - telling OpenSearch to silently ignore undeclared fields rather than rejecting documents outright.
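In index-creation terms, the fix might look like the following sketch. The field list is illustrative (the series curates its own), and the client call is the standard opensearch-py one:

```python
# Curated mapping: declare only the fields the retrieval path actually
# queries, and disable dynamic mapping so undeclared frontmatter fields
# (like the conflicting applies_to) are silently ignored instead of
# triggering mapper_parsing_exception or bloating the schema.
index_body = {
    "mappings": {
        "dynamic": False,  # serialized as "dynamic": false in JSON
        "properties": {
            "title":   {"type": "text"},
            "content": {"type": "text"},
            "source":  {"type": "keyword"},
            "url":     {"type": "keyword"},
        },
    }
}

# Applied with, e.g., opensearch-py:
# client.indices.create(index="docs", body=index_body)
```

Note that dynamic: false differs from strict mode, which would still reject documents carrying undeclared fields.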

The current system has basic BM25 retrieval, no vector search, no proper grading, raw user queries going straight to the index - and is not going to produce good results. And that's the point. The series will demonstrate why it fails before showing how to fix it, every step of the way. Which is the only good way to learn.

What's Coming Next

The series has a clear roadmap ahead, and each step brings its own category of expected failures.

Embeddings and hybrid search are first. The plan is to add dense vector embeddings via Cohere Embed and combine them with BM25 keyword search using Reciprocal Rank Fusion (RRF) in OpenSearch. Pure BM25 misses semantically similar content; pure vector search surfaces loosely related noise. Hybrid search should improve recall, but tuning the balance between keyword and vector scoring will be a challenge - and getting chunking right per documentation source (long enough for semantic meaning, short enough for embedding precision) is a problem the series has flagged as unsolved.
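OpenSearch can fuse keyword and vector results server-side through its search pipelines, but the RRF math itself is simple enough to sketch client-side:

```python
def rrf_fuse(result_lists, k: int = 60):
    """Reciprocal Rank Fusion: each document scores the sum of
    1 / (k + rank) over every result list it appears in; k=60 is the
    conventional smoothing constant from the original RRF paper."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by both BM25 and vector search ("b" below) outscores one ranked first by only a single retriever - which is exactly the behavior hybrid search is after.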

Evaluation is planned as a major focus. Without a way to measure retrieval quality (precision, recall) and answer correctness, every change is a shot in the dark. The series will build feedback loops to understand whether retrieved documents actually answer the question, and iterate based on real measurements rather than vibes.
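The baseline retrieval metrics are easy to state precisely. A minimal precision/recall-at-k helper (illustrative, not from the series):

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked doc ids; relevant: set of doc ids judged
    relevant for the question; k: evaluation cutoff."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0           # how much of top-k is useful
    recall = hits / len(relevant) if relevant else 0.0  # how much we found
    return precision, recall
```

The hard part isn't the arithmetic - it's building the judged question/document pairs that make `relevant` meaningful, which is where the planned feedback loops come in.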

Agentic extensions are the big milestone. Once the core RAG produces reliable answers, the system will connect to Pulse to enable actions on live clusters - not just "here's what you should do about your shard allocation" but actually executing the fix. This is where the self-correcting RAG graph becomes a genuine agent: reasoning over both documentation and live cluster state, deciding what needs to happen, and doing it. Expect new failure modes around tool invocation, safety guardrails for destructive actions, and the challenge of grounding agent decisions in real-time metrics alongside static documentation.

Production hardening rounds it out - rate limiting, input validation, guardrails, and everything else that separates a demo from a system you'd trust with cluster operations.

The series is published on the BigData Boutique YouTube channel and built with LangGraph, FastAPI, OpenSearch, and Amazon Bedrock. The OpenSearch VS Code extension used throughout for cluster inspection and query testing is available on the VS Code Marketplace.