A bolt-on playbook for adding a knowledge graph to an existing vector RAG stack: extraction pipelines, storage choices, hybrid retrievers, evaluation, and day-2 ops with concrete LangChain and LlamaIndex code.
Most "Graph RAG" content stops at the architecture diagram. This is the bolt-on playbook: how to add a knowledge graph alongside an existing vector retriever without rewriting the pipeline you already shipped. If you are still weighing whether the pattern fits, start with our RAG architecture explainer. This post assumes the decision is made and you need the implementation playbook.
The thesis: a vector store handles broad semantic recall, a graph store handles precise relational queries and multi-hop traversal. Neither replaces the other. The wins come from running them together, joined on entity IDs.
Why Bolt a KG onto an Existing Vector RAG Stack
Pure vector retrieval fails on a recurring set of queries. Multi-hop questions like "Who manages the team that owns the service that caused last week's outage?" require structured joins that cosine similarity cannot produce. Aggregation queries across entities collapse into noise. Entity disambiguation breaks when two products, two people, or two contracts share embedding space. Temporal questions get answered against stale snapshots.
A KG buys four things: structured joins and traversal paths; provenance where every answer traces through explicit edges back to source chunks; type-safe filters at retrieval time (only return Person nodes with role = "engineer"); and reduced hallucination because entities are referenced through canonical IDs rather than resampled from text.
Three cases where you should not add a KG: short-document QA where a single chunk answers the question; single-entity lookup with no relational context; unstable schema where the cost of re-extraction outpaces the lift. If the team cannot sustain day-2 graph maintenance, skip it - a stale graph is worse than no graph.
Reference Architecture: The Dual-Index Pattern
The pattern that holds up in production is dual-index. The vector store is the recall engine: top-k chunks by semantic similarity. The graph is the precision and traversal engine: entity neighborhoods, multi-hop paths, type-filtered subgraphs. A router or fusion layer sits in front of both.
Three placements are useful. As a pre-retrieval filter, the graph performs entity linking and narrows vector search scope. As a parallel retriever, both stores run concurrently and candidate sets are fused. As a post-retrieval re-ranker, graph edges validate or boost vector hits with a structural connection to the query entities. Most teams settle on a mix: parallel retrieval as the default, entity-linking pre-filter for queries the router classifies as relational.
The minimal contract is the join key. Every text chunk stores the entity IDs it mentions; every graph node stores the source chunk IDs it was extracted from. Without this bidirectional linkage, you cannot trace from a graph traversal back to a quotable chunk - which means you cannot cite, which means the whole exercise loses its grounding advantage.
# Vector chunk metadata
{
  "chunk_id": "doc_42#chunk_7",
  "text": "...",
  "entities": ["entity:person:itamar_synhershko", "entity:company:bigdataboutique"]
}

# Graph node properties (Cypher fragment)
MERGE (p:Person {id: "entity:person:itamar_synhershko"})
SET p.source_chunks = ["doc_42#chunk_7", "doc_91#chunk_2"],
    p.canonical_name = "Itamar Syn-Hershko"
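The join-key contract is easy to state and easy to break silently, so it helps to check it in CI or in a nightly job. The following is a minimal sketch that assumes chunk metadata and node properties have been exported into plain dictionaries; the function name `find_broken_links` and the sample IDs are made up for illustration.

```python
# Hypothetical in-memory exports of vector-store metadata and graph node properties.
chunks = {
    "doc_42#chunk_7": {"entities": ["entity:person:a", "entity:company:b"]},
    "doc_91#chunk_2": {"entities": ["entity:person:a"]},
}
nodes = {
    "entity:person:a": {"source_chunks": ["doc_42#chunk_7", "doc_91#chunk_2"]},
    "entity:company:b": {"source_chunks": []},  # back-reference is missing
}


def find_broken_links(chunks, nodes):
    """Return (chunk_id, entity_id) pairs where the linkage is not bidirectional."""
    broken = set()
    # Chunk -> node direction: every mentioned entity must point back at the chunk.
    for chunk_id, meta in chunks.items():
        for entity_id in meta["entities"]:
            node = nodes.get(entity_id)
            if node is None or chunk_id not in node["source_chunks"]:
                broken.add((chunk_id, entity_id))
    # Node -> chunk direction: every source chunk must list the entity.
    for entity_id, props in nodes.items():
        for chunk_id in props["source_chunks"]:
            chunk = chunks.get(chunk_id)
            if chunk is None or entity_id not in chunk["entities"]:
                broken.add((chunk_id, entity_id))
    return sorted(broken)


print(find_broken_links(chunks, nodes))
```

Any pair this returns is a citation you would not be able to produce at answer time.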
Extracting a KG from Unstructured Text
The first decision is schema-guided versus open extraction. Open extraction produces noisy, inconsistent triples that explode the entity space. Schema-guided constrains output to known types, which is what you want for retrieval. Start with five to fifteen entity types and ten to twenty relation types. If you do not yet know the domain ontology, run open extraction on a sample, cluster the output, then lock the schema.
LangChain's LLMGraphTransformer is the most direct path for property graphs. Constrain it with allowed_nodes and allowed_relationships:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)
transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=["Person", "Team", "Service", "Incident"],
    allowed_relationships=["MANAGES", "OWNS", "CAUSED_BY", "MEMBER_OF"],
    node_properties=["role", "severity"],
)
# chunks is a list of LangChain Document objects produced by your chunker
graph_documents = transformer.convert_to_graph_documents(chunks)
LlamaIndex takes a similar shape with SchemaLLMPathExtractor, where you pass the allowed entity and relation types and it produces typed triplets. Both libraries persist to Neo4j: LangChain through Neo4jGraph.add_graph_documents, LlamaIndex through Neo4jPropertyGraphStore.
A note on LangChain: GraphCypherQAChain was the textbook way to do graph QA in early releases. The current direction is to wrap the same logic in a LangGraph workflow with explicit nodes for routing, Cypher generation, execution, and answer synthesis, with retry on Cypher errors. The replacement is more code but gives you validation and retry surfaces the pure chain lacks. Treat the LangGraph rewrite as scheduled work, not urgent migration.
Every extracted edge must carry source_chunk_id and ideally character span offsets - the citation contract. Without it, generated answers cite a graph path with no ground truth in any chunk. Pipeline shape: chunker -> extractor -> deduplicator -> graph writer with batch upsert.
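One way to enforce the citation contract is to reject any edge whose cited span does not mention both endpoints. The sketch below assumes each edge carries canonical endpoint names plus character offsets into the chunk text; the `Edge` dataclass and `evidence_supports` are hypothetical names, and the substring check is deliberately naive (it will miss aliases and pronouns).

```python
from dataclasses import dataclass


@dataclass
class Edge:
    source: str           # canonical name of the head entity
    relation: str
    target: str           # canonical name of the tail entity
    source_chunk_id: str
    span: tuple           # (start, end) character offsets into the chunk text


def evidence_supports(edge, chunk_text):
    """True if the cited span is valid and mentions both endpoints (case-insensitive)."""
    start, end = edge.span
    if not (0 <= start < end <= len(chunk_text)):
        return False
    evidence = chunk_text[start:end].lower()
    return edge.source.lower() in evidence and edge.target.lower() in evidence


text = "Dana manages the Payments team. The team owns checkout-api."
good = Edge("Dana", "MANAGES", "Payments", "doc_1#chunk_0", (0, 31))
bad = Edge("Dana", "OWNS", "checkout-api", "doc_1#chunk_0", (0, 31))

print(evidence_supports(good, text), evidence_supports(bad, text))
```

Edges that fail the check can go to a review queue rather than being dropped, which gives you a running measure of extractor hallucination.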
Entity resolution is the work most teams underestimate. Three layers compose well: blocking on entity type plus fuzzy name match; embedding similarity over node names with a merge threshold; LLM adjudication for ambiguous pairs. Open-source dedupe.io covers the classical blocking and active-learning pipeline; the Zilliz write-up on entity resolution with LLMs describes the embedding-plus-LLM variant. Skip canonicalization and the graph degenerates into thousands of near-duplicate nodes within weeks.
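The first of those three layers can be sketched with the standard library alone. This is an illustration under simple assumptions, not a replacement for a dedicated tool: nodes are dictionaries with `id`, `type`, and `name`, names are compared with `difflib.SequenceMatcher`, and the 0.85 threshold is an arbitrary starting point. Pairs it returns would then go on to the embedding and LLM layers.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and drop punctuation and whitespace before comparing names."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def candidate_pairs(nodes, threshold=0.85):
    """Block by entity type, then score name similarity inside each block."""
    blocks = defaultdict(list)
    for node in nodes:
        blocks[node["type"]].append(node)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 2)))
    return pairs


nodes = [
    {"id": "n1", "type": "Person", "name": "Itamar Syn-Hershko"},
    {"id": "n2", "type": "Person", "name": "Itamar Syn Hershko"},
    {"id": "n3", "type": "Company", "name": "Itamar Syn-Hershko"},  # different type, never compared
    {"id": "n4", "type": "Person", "name": "Dana Levi"},
]
print(candidate_pairs(nodes))
```

Blocking by type keeps the comparison count manageable; within large blocks you would add a second blocking key (for example a name prefix) before the pairwise pass.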
LLM extraction is expensive. Batch chunks per call, distill a smaller model from GPT-4-class outputs once the schema is stable, and gate re-extraction on content hashes so you only pay for changed documents.
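Hash gating is the cheapest of those three levers to implement. A minimal sketch, assuming chunks arrive as an id-to-text mapping and that previously seen hashes live in some persistent key-value store (a plain dict here); `chunks_to_extract` is a hypothetical helper name.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_extract(chunks, seen_hashes):
    """Return ids of new or changed chunks and record their hashes in seen_hashes."""
    changed = []
    for chunk_id, text in chunks.items():
        digest = content_hash(text)
        if seen_hashes.get(chunk_id) != digest:
            changed.append(chunk_id)
            seen_hashes[chunk_id] = digest
    return changed


seen = {}
first = chunks_to_extract({"c1": "alpha", "c2": "beta"}, seen)
second = chunks_to_extract({"c1": "alpha", "c2": "beta v2"}, seen)
print(first, second)
```

In a real pipeline you would record the hash only after extraction succeeds, so a failed LLM call does not mark the chunk as processed.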
Choosing Graph Storage
Five options cover most teams. The choice depends on operational preference, AWS lock-in, scale, and whether you want vectors and graph in one store.
| Store | Query language | Vector index | Best for | Watch out for |
|---|---|---|---|---|
| Neo4j (Aura or self-host) | Cypher | Native | Default; broad LangChain/LlamaIndex support | License terms for self-hosted Enterprise |
| Amazon Neptune Analytics | openCypher | Native | AWS-native deployments, Bedrock integration | Less ecosystem tooling outside AWS; Gremlin is a Neptune Database feature, not Neptune Analytics |
| Memgraph | Cypher | Native | In-memory speed, agentic toolkits | Memory sizing; not for cold archives |
| Kuzu | Cypher | Native | Embedded, single-process, prototyping | Single-writer model |
| FalkorDB | Cypher | Native | Low-latency caching, sparse-matrix engine | Smaller community than Neo4j |
Neo4j is the default because the ecosystem is broadest. Cypher is the de facto property-graph query language, the Graph Data Science library covers the algorithms you need (PageRank, community detection, shortest paths), and both LangChain and LlamaIndex have first-class integrations. Native vector indexes mean you can colocate node embeddings with the graph.
Amazon Neptune Analytics is the AWS-native option. Neptune's own docs do not lead with "GraphRAG"; the canonical references are the AWS blog Build GraphRAG applications using Amazon Bedrock Knowledge Bases and the Bedrock Knowledge Bases GraphRAG GA announcement, plus the Neptune + LlamaIndex integration. If retrieval already runs in Bedrock Knowledge Bases, the graph layer is a configuration toggle.
Memgraph, Kuzu, and FalkorDB cover the lightweight end. Memgraph is in-memory and Cypher-compatible; the langchain-memgraph package and the Memgraph AI Toolkit ship LangChain and MCP tooling. Kuzu is embedded - one process, no infrastructure, ideal for prototyping - and LlamaIndex ships KuzuPropertyGraphStore. FalkorDB is the successor to RedisGraph (EOL early 2025); the migration guide covers legacy code paths. It uses sparse-matrix algebra and targets low-latency subgraph caching.
For RAG specifically, property graphs beat RDF/SPARQL on developer ergonomics and tooling. Choose RDF only if you have existing RDF infrastructure, regulatory requirements, or genuine cross-organization federated-query needs.
On embedding placement: keep entity-level embeddings inside the graph store (single-query entity linking) and chunk-level embeddings in the vector store you already run.
Retrieval Patterns That Actually Work
Entity linking at query time is the first hop. NER over the user query produces candidate spans; dense retrieval over node aliases resolves them to graph IDs; an LLM fallback handles NER misses. Cache hot entities in Redis to keep linking under 20ms.
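The resolution step can be sketched without any model in the loop. This illustration swaps the dense alias retrieval for a fuzzy string match from `difflib`, and uses `functools.lru_cache` as an in-process stand-in for the Redis cache; the `ALIASES` table and `link_entity` function are assumptions for the example, and NER is taken as already done.

```python
from difflib import get_close_matches
from functools import lru_cache

# Hypothetical alias table: lowercase surface form -> canonical graph ID.
ALIASES = {
    "checkout api": "entity:service:checkout",
    "checkout-api": "entity:service:checkout",
    "payments team": "entity:team:payments",
}


@lru_cache(maxsize=10_000)
def link_entity(mention):
    """Resolve a mention to a graph ID: exact alias first, fuzzy match second."""
    key = mention.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    close = get_close_matches(key, list(ALIASES), n=1, cutoff=0.8)
    return ALIASES[close[0]] if close else None


print(link_entity("Checkout API"), link_entity("paymnts team"), link_entity("kubernetes"))
```

Mentions that return None are the ones to hand to the LLM fallback; logging them also tells you which aliases the table is missing.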
Two paths exist for graph queries: templated and text-to-Cypher. Templated queries cover ~80% of patterns - parameterized Cypher with entity IDs slotted in. They are predictable, fast, and hard to exploit. Text-to-Cypher is flexible but introduces three problems at once: hallucinated Cypher that fails to parse, expensive queries that scan the whole graph, and graph query injection. The Neo4j text2cypher dataset on HuggingFace is the reference benchmark for accuracy expectations. If you ship text-to-Cypher, run it in a read-only transaction, set query timeouts, and validate the parsed AST against an allow-list of node and relationship types before execution.
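As a first line of defense before the read-only transaction, a pre-flight check can reject obviously unsafe generated Cypher. The sketch below is a regex-based approximation, not a true AST validation: it is conservative (it will reject some legitimate queries, for example one with a write keyword inside a string literal) and it only inspects the first label or relationship type in each pattern. The allow-lists reuse the schema from the extraction example; `is_safe_cypher` is a hypothetical name.

```python
import re

ALLOWED_LABELS = {"Person", "Team", "Service", "Incident"}
ALLOWED_RELS = {"MANAGES", "OWNS", "CAUSED_BY", "MEMBER_OF"}
WRITE_KEYWORDS = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|CALL|LOAD|FOREACH)\b",
    re.IGNORECASE,
)


def is_safe_cypher(query):
    """Conservative pre-flight check for generated Cypher. Not a real parser."""
    # Reject statement separators and anything that looks like a write or procedure call.
    if ";" in query or WRITE_KEYWORDS.search(query):
        return False
    # Collect the first label in each node pattern and the first type in each relationship.
    labels = set(re.findall(r"\(\s*\w*\s*:\s*(\w+)", query))
    rels = set(re.findall(r"\[\s*\w*\s*:\s*(\w+)", query))
    return labels <= ALLOWED_LABELS and rels <= ALLOWED_RELS


print(is_safe_cypher("MATCH (p:Person)-[:MANAGES]->(t:Team) RETURN p.name LIMIT 5"))
print(is_safe_cypher("MATCH (n) DETACH DELETE n"))
```

Treat this as a cheap filter in front of the real controls. The read-only transaction and the query timeout are what actually bound the damage.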
Multi-hop traversal needs hard limits. Two to three hops is usually right; beyond that, latency and noise both spike. Cap fan-out per node, prune low-relevance edges with type filters or learned weights, and return a subgraph - not the full traversal path - to the prompt. 200ms is a reasonable per-query budget.
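The limits above are simple to express as a bounded breadth-first expansion. This sketch runs over an in-memory adjacency map purely to show the control flow; in production the same caps would live in the Cypher template. The relation names and the `bounded_subgraph` function are illustrative assumptions.

```python
from collections import deque


def bounded_subgraph(adjacency, start, max_hops=2, max_fanout=3, allowed_types=None):
    """Breadth-first expansion with a hop limit, a per-node fan-out cap, and a type filter."""
    seen = {start}
    edges = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget exhausted for this branch
        candidates = [
            (rel, dst)
            for rel, dst in adjacency.get(node, [])
            if allowed_types is None or rel in allowed_types
        ]
        for rel, dst in candidates[:max_fanout]:
            edges.append((node, rel, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, depth + 1))
    return edges


adjacency = {
    "incident:42": [("CAUSED_BY", "service:checkout")],
    "service:checkout": [("OWNED_BY", "team:payments"), ("DEPENDS_ON", "service:ledger")],
    "team:payments": [("MANAGED_BY", "person:dana")],
}
relational = {"CAUSED_BY", "OWNED_BY", "MANAGED_BY"}
print(bounded_subgraph(adjacency, "incident:42", max_hops=3, allowed_types=relational))
```

The function returns the set of edges visited, not every path through them, which matches the advice to hand the prompt a subgraph.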
Subgraph serialization into the prompt has three options: triple list (token-efficient, harder for the LLM to reason over), natural-language narrative (easier reasoning, doubles token count), and structured JSON (middle ground). Cap subgraph context at one to two thousand tokens; summarize larger ones with a smaller LLM call.
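Two of the three formats, plus a cap, fit in a few lines. The sketch below uses a character budget as a crude stand-in for a token budget (swap in your tokenizer for real counts); the function names and the sample edges are assumptions for illustration.

```python
import json

edges = [
    ("incident:42", "CAUSED_BY", "service:checkout"),
    ("service:checkout", "OWNED_BY", "team:payments"),
    ("team:payments", "MANAGED_BY", "person:dana"),
]


def as_triples(edges):
    """Token-efficient form: one arrow-style line per edge."""
    return "\n".join(f"({s})-[{r}]->({o})" for s, r, o in edges)


def as_json(edges):
    """Structured form: explicit keys, at the cost of extra characters."""
    return json.dumps([{"subject": s, "relation": r, "object": o} for s, r, o in edges])


def cap_triples(edges, max_chars):
    """Keep whole triples, in order, until the character budget is used up."""
    kept, used = [], 0
    for edge in edges:
        cost = len(as_triples([edge])) + 1  # +1 for the newline
        if used + cost > max_chars:
            break
        kept.append(edge)
        used += cost
    return kept


print(as_triples(cap_triples(edges, 100)))
print(len(as_triples(edges)), len(as_json(edges)))
```

Because the cap keeps edges in order, sort them by relevance before calling it so the budget goes to the edges that matter.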
Hybrid retrieval is where the lift compounds. Run vector and graph retrievers in parallel, fuse with Reciprocal Rank Fusion or a weighted linear combination, then rerank the fused candidates. Measure the lift on your own eval set - hybrid does not always win, and knowing where it does is what justifies the investment.
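Reciprocal Rank Fusion needs only the rank positions from each retriever, which is why it is a convenient first fusion method when vector scores and graph scores are not comparable. A minimal sketch, using the commonly cited constant k=60; the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["c1", "c2", "c3"]
graph_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that both retrievers agree on rise to the top, and chunks seen by only one retriever still survive into the rerank stage.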
Implementation Walkthrough
The LangChain side, in current shape, looks like this:
from typing import TypedDict

from langchain_neo4j import Neo4jGraph, Neo4jVector
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langgraph.graph import StateGraph, END

# The query timeout (in seconds) is set on the connection; query() takes no timeout argument.
graph = Neo4jGraph(url=NEO4J_URI, username=USER, password=PWD, timeout=2)
vector = Neo4jVector.from_existing_index(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI, username=USER, password=PWD,
    index_name="chunk_embeddings",
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# A typed state gives each key its own channel, so parallel nodes can write different keys.
class RAGState(TypedDict, total=False):
    question: str
    intent: str       # set by the router before the workflow is invoked
    entities: list
    graph_ctx: list
    vec_ctx: list
    answer: str

# ner_and_resolve, render_template, and prompt are your own helpers (not shown).
# Each node returns only the keys it updates.
def link_entities(state: RAGState):
    return {"entities": ner_and_resolve(state["question"], graph)}

def graph_retrieve(state: RAGState):
    cypher = render_template(state["intent"], state["entities"])
    return {"graph_ctx": graph.query(cypher)}

def vector_retrieve(state: RAGState):
    return {"vec_ctx": vector.similarity_search(state["question"], k=8)}

def synthesize(state: RAGState):
    return {"answer": llm.invoke(prompt(state)).content}

workflow = StateGraph(RAGState)
workflow.add_node("link", link_entities)
workflow.add_node("graph", graph_retrieve)
workflow.add_node("vector", vector_retrieve)
workflow.add_node("answer", synthesize)
workflow.set_entry_point("link")
workflow.add_edge("link", "graph")
workflow.add_edge("link", "vector")
# Wait for both retrievers before synthesizing.
workflow.add_edge(["graph", "vector"], "answer")
workflow.add_edge("answer", END)
app = workflow.compile()
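This replaces the legacy GraphCypherQAChain pattern with explicit nodes, which gives you places to add retry and validation that the chain does not expose. The two retrieve nodes run concurrently because both branch from the link node, and the answer node runs once both have finished.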
The LlamaIndex side leans on PropertyGraphIndex:
from llama_index.core import PropertyGraphIndex, Document
from llama_index.core.indices.property_graph import (
    SchemaLLMPathExtractor,
    VectorContextRetriever,
    LLMSynonymRetriever,
)
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

# llm, embed, and docs (a list of Document objects) are assumed to be defined elsewhere.
pg_store = Neo4jPropertyGraphStore(username=USER, password=PWD, url=NEO4J_URI)
extractor = SchemaLLMPathExtractor(
    llm=llm,
    possible_entities=["PERSON", "TEAM", "SERVICE", "INCIDENT"],
    possible_relations=["MANAGES", "OWNS", "CAUSED_BY"],
)
index = PropertyGraphIndex.from_documents(
    docs,
    property_graph_store=pg_store,
    kg_extractors=[extractor],
)
retriever = index.as_retriever(
    sub_retrievers=[
        VectorContextRetriever(index.property_graph_store, embed_model=embed),
        LLMSynonymRetriever(index.property_graph_store, llm=llm),
    ]
)
For the embedded prototyping path, swap Neo4jPropertyGraphStore for KuzuPropertyGraphStore from llama-index-graph-stores-kuzu. The retriever interface is identical, so the dev-to-prod path is painless.
Microsoft GraphRAG is the drop-in alternative for large corpora with thematic questions ("What are the recurring failure patterns across all post-mortems last year?"). It does community detection over the extracted graph, pre-summarizes communities, and routes queries to local search (specific entities) or global search (themes). Opinionated - you trade fine-grained retrieval control for less code. The Azure-Samples graphrag-accelerator and LazyGraphRAG cover the managed-on-Azure path.
Evaluation: Use Your Graph to Test Itself
Generic RAG eval sets miss the point of GraphRAG. The differentiated story is generating multi-hop questions from your own knowledge graph. Ragas TestsetGenerator does exactly this: it builds an enriched knowledge graph from your documents, traverses it with scenario synthesizers, and produces questions whose reference answers require following two or three edges. Ten document nodes typically expand into ~50 nodes and several hundred relationships after default transformations.
Aim for 200 to 500 eval questions covering the entity types and relation patterns in your domain. Go beyond answer accuracy:
- Retrieval hit@k on entities: did the graph retriever surface the right nodes?
- Path faithfulness: does the retrieved subgraph contain the reasoning path for the gold answer?
- Citation precision: do cited chunks actually support the generated claim?
- Latency p50 and p95 for graph queries, broken out from total RAG latency.
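The first three metrics reduce to set operations once you have gold labels. The sketch below shows one plausible definition of each; the function names are hypothetical, and you may prefer recall-style variants (for example, the fraction of gold entities found) depending on how your eval set is labeled.

```python
def hit_at_k(retrieved, gold, k):
    """1.0 if any gold entity appears in the top-k retrieved entities, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(gold) else 0.0


def path_faithful(subgraph_edges, gold_path):
    """True if every edge of the gold reasoning path is present in the retrieved subgraph."""
    return set(gold_path) <= set(subgraph_edges)


def citation_precision(cited, supporting):
    """Share of distinct cited chunks that are labeled as supporting the claim."""
    cited = set(cited)
    if not cited:
        return 0.0
    return len(cited & set(supporting)) / len(cited)


gold_path = [("incident:42", "CAUSED_BY", "service:checkout")]
retrieved_edges = gold_path + [("service:checkout", "OWNED_BY", "team:payments")]
print(hit_at_k(["e3", "e1", "e2"], ["e1"], k=2))
print(path_faithful(retrieved_edges, gold_path))
print(citation_precision(["c1", "c2"], ["c1"]))
```

Averaging each metric over the eval set gives one number per retriever configuration, which is what the A/B comparison needs.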
Langfuse does not have a graph-specific primitive - instrument Cypher and Gremlin calls as custom spans, attach the rendered query and result count as attributes, and graph latency becomes queryable in your existing trace dashboard. Pair with Ragas for offline metrics and TruLens for groundedness feedback.
The A/B that earns the KG its keep: run the same eval set through hybrid and vector-only pipelines, compare answer accuracy, retrieval recall, latency, and cost per query. Use at least 200 questions; with fewer, a lift of a few points is hard to separate from noise. Less than 5% lift after 60 days of tuning means the graph is not paying for itself.
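Because both pipelines answer the same questions, the comparison is paired, and a paired test is the appropriate way to judge whether the lift is real. The sketch below uses an exact McNemar-style sign test on the questions where the two pipelines disagree. This particular test is an addition for illustration, not something the workflow above prescribes, and the per-question results are synthetic.

```python
from math import comb


def paired_lift(hybrid_correct, vector_correct):
    """Return (accuracy lift, two-sided exact p-value) for paired 0/1 outcomes."""
    n = len(hybrid_correct)
    lift = (sum(hybrid_correct) - sum(vector_correct)) / n
    # Only questions where the pipelines disagree carry information about the difference.
    b = sum(1 for h, v in zip(hybrid_correct, vector_correct) if h and not v)
    c = sum(1 for h, v in zip(hybrid_correct, vector_correct) if v and not h)
    discordant = b + c
    if discordant == 0:
        return lift, 1.0
    tail = sum(comb(discordant, i) for i in range(min(b, c) + 1)) / 2 ** discordant
    return lift, min(1.0, 2 * tail)


# Synthetic example: 200 questions, hybrid wins 20 disagreements and loses 10.
hybrid = [1] * 130 + [1] * 20 + [0] * 10 + [0] * 40
vector = [1] * 130 + [0] * 20 + [1] * 10 + [0] * 40
lift, p_value = paired_lift(hybrid, vector)
print(f"lift={lift:.3f} p={p_value:.3f}")
```

In this synthetic case a five-point lift on 200 questions does not clear a 0.05 threshold, which is a useful reminder that the eval set size and the lift threshold need to be chosen together.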
Day-2 Operations
Incremental ingestion is the operational backbone. CDC from source systems triggers re-extraction; hash-based change detection skips unchanged chunks. Upsert by entity ID, tombstone removed entities, version the schema on every node.
Graph drift kills retrieval quality faster than expected. Schedule weekly jobs that find orphans (zero inbound edges), dead nodes (not referenced by any chunk), and duplicate clusters (embedding similarity above threshold across same-type nodes). Alert on rate of change rather than absolute counts - an orphan-ratio spike usually means an extractor regression.
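The orphan and dead-node checks, and the rate-of-change alert, can be sketched over an exported snapshot. This assumes nodes are a dict of properties keyed by ID and edges are (source, relation, target) tuples; `drift_report`, `ratio_spiked`, and the 50% relative-increase threshold are illustrative choices. Duplicate-cluster detection is left out because it depends on your embedding setup.

```python
def drift_report(nodes, edges):
    """nodes: {id: {"source_chunks": [...]}}; edges: [(src, rel, dst)]."""
    has_inbound = {dst for _, _, dst in edges}
    orphans = sorted(n for n in nodes if n not in has_inbound)  # zero inbound edges
    dead = sorted(n for n, props in nodes.items() if not props["source_chunks"])
    return {
        "orphans": orphans,
        "dead": dead,
        "orphan_ratio": len(orphans) / len(nodes) if nodes else 0.0,
    }


def ratio_spiked(previous, current, max_relative_increase=0.5):
    """Alert on the week-over-week change rather than on the absolute ratio."""
    if previous == 0:
        return current > 0
    return (current - previous) / previous > max_relative_increase


snapshot_nodes = {
    "a": {"source_chunks": ["c1"]},
    "b": {"source_chunks": []},
    "c": {"source_chunks": ["c2"]},
}
snapshot_edges = [("a", "OWNS", "b")]
report = drift_report(snapshot_nodes, snapshot_edges)
print(report, ratio_spiked(0.10, 0.20))
```

Storing each weekly report makes the comparison in `ratio_spiked` a lookup, and gives you a history to check when an extractor change ships.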
Schema evolution splits into additive (new entity or relation types - just update extractor config) and breaking (rename, merge types - run a migration script). Tag every node with the schema version that produced it. Full-corpus reprocessing is only for fundamental ontology restructuring.
Access control matters more in graphs than flat vector stores because edges leak information. A user who can see node A and node B but not the relationship between them still learns from a graph that exposes the edge. Apply RBAC at property level where possible (Neo4j Enterprise supports this), tag nodes with access groups, filter at query time. PII gets a dedicated subgraph with stricter access; redact or tokenize PII properties on extraction.
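Query-time filtering has to treat the edge as its own protected object, not just the two endpoints. A minimal sketch, assuming nodes and edges are each tagged with access groups and that anything untagged is denied by default; the group names, IDs, and the `visible_edges` function are made up for the example.

```python
def visible_edges(edges, node_groups, edge_groups, user_groups):
    """Keep an edge only if the user may see both endpoints and the edge itself."""
    user_groups = set(user_groups)

    def allowed(groups):
        # Default deny: an object with no tags is visible to nobody.
        return bool(user_groups & set(groups))

    return [
        (s, r, o)
        for (s, r, o) in edges
        if allowed(node_groups.get(s, ()))
        and allowed(node_groups.get(o, ()))
        and allowed(edge_groups.get((s, r, o), ()))
    ]


member = ("person:dana", "MEMBER_OF", "team:payments")
reported = ("person:dana", "REPORTED", "incident:42")
acl_edges = [member, reported]
node_groups = {
    "person:dana": ["eng", "security"],
    "team:payments": ["eng"],
    "incident:42": ["eng", "security"],
}
edge_groups = {member: ["eng"], reported: ["security"]}

print(visible_edges(acl_edges, node_groups, edge_groups, ["eng"]))
```

The engineering user can see both endpoints of the REPORTED edge but not the edge itself. Without the edge check, that is the leak this paragraph warns about.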
Default budgets we use: entity linking under 20ms, graph traversal under 200ms, total hybrid retrieval under 500ms. Cost model: graph DB hosting + extraction LLM calls (one-time plus incremental) + ongoing maintenance compute. Distillation drops extraction costs by an order of magnitude once the schema is stable.
Rollout Playbook and Common Pitfalls
A 30-60-90 plan keeps the project honest:
- Days 1-30: Pick a pilot domain - a single document collection, 1K to 10K docs. Extract the KG, stand up the graph store, build a basic entity-linking retriever. Run shadow mode: log graph results alongside vector-only production answers, no user-facing change.
- Days 31-60: Build the eval set (200+ multi-hop questions). Run the A/B comparison. Tune extraction schema, entity resolution thresholds, traversal depth. Wire the hybrid retriever behind a feature flag.
- Days 61-90: Production cutover for the pilot domain. CDC pipeline, monitoring dashboards, alerting on drift metrics. Document runbooks. Plan expansion to the next domain.
The eight gotchas that show up in every engagement:
- Over-extraction. Too many entity types creates a noisy graph and slow queries. Start with five to fifteen entity types; expand only when retrieval misses prove the need.
- LLM-hallucinated edges. The extractor invents relationships not in the source text. Validate every edge against its source span; reject edges whose evidence does not contain both endpoints.
- Entity explosion. Skipping canonicalization produces thousands of near-duplicates. Invest in entity resolution from week one, not week twelve.
- Cypher timeouts. Unbounded traversals on dense subgraphs hang the whole pipeline. Always set depth limits and per-query timeouts; treat timeouts as alerts, not warnings.
- Stale embeddings. Node embeddings drift as the graph evolves. Schedule periodic re-embedding for any node whose properties or neighborhood changes materially.
- Prompt-context bloat. Serialized subgraphs blow the token budget. Cap, summarize, prefer triples over narrative when token-tight.
- Evaluation drift. The eval set goes stale as the corpus grows. Regenerate quarterly with the latest graph snapshot.
- Graph query injection. Text-to-Cypher is an attack surface. A user query like "... ; MATCH (n) DETACH DELETE n" should never reach the database. Run all generated Cypher in a read-only transaction, validate against an AST allow-list, and rate-limit per user.
The honest exit criterion: if the eval shows under 5% accuracy lift after 60 days of tuning, the domain has no meaningful relational structure, or the team cannot sustain the day-2 ops, retire the KG and go back to vector-only. A well-tuned vector retriever beats a half-maintained graph every time.
Key Takeaways
- The dual-index pattern (vector for recall, graph for precision and traversal) is what holds up in production. Join the two on entity IDs.
- Schema-guided extraction with five to fifteen entity types and ten to twenty relation types beats open extraction on every retrieval metric that matters.
- LangChain's GraphCypherQAChain is the legacy path; the LangGraph workflow with explicit nodes for entity linking, Cypher generation, validation, and synthesis is the current shape.
- Neo4j is the safe default; Kuzu via KuzuPropertyGraphStore is the right embedded path for prototyping; FalkorDB is the RedisGraph successor; Neptune Analytics plus Bedrock Knowledge Bases is the AWS-native managed route.
- Templated Cypher covers most queries safely. Text-to-Cypher needs read-only transactions, AST validation, and timeout guards; treat it as an attack surface.
- Generate your eval set from your own graph using Ragas TestsetGenerator with KG-based generation. 200+ multi-hop questions are the floor for statistical significance.
- Graph query injection is the gotcha that rarely gets covered. Sanitize, sandbox, and run read-only.
- If hybrid does not show 5%+ lift after 60 days, the graph is not paying for itself. Roll back rather than carry the ops cost.