
Semantic Chunking: The Key to Better RAG Results

Most RAG quality complaints get blamed on the model. The answers are vague, they miss the point, they hallucinate facts that were never in the source. Teams reach for a bigger model or a longer context window. Often the real problem sits one stage earlier, in the chunker.

A retriever never sees your documents. It sees chunks: the fixed slices you cut a document into before embedding. If those slices fragment a single idea across two chunks, or staple two unrelated ideas into one, retrieval quality is capped before the LLM ever runs. Semantic chunking splits text on embedding-derived meaning boundaries rather than character or token counts, producing variable-length chunks that line up with topic shifts. It often improves retrieval on narrative and heterogeneous corpora. On short, well-structured documents it is frequently a wash, and sometimes the extra compute is not worth it. This post covers why chunking is the bottleneck, how the main strategies compare, and how to ship semantic chunking without the common foot-guns.

Why Chunking Caps Retrieval Quality

A chunk is the atomic unit of context an LLM receives in a RAG pipeline. The embedding model turns each chunk into a single vector, and that vector has to summarize everything in the chunk. When a chunk holds one coherent idea, its embedding is a clean signal. When it spans two ideas, the embedding becomes a blurred average that matches neither query well.
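
To make the "blurred average" concrete, here is a minimal sketch using hand-written 3-dimensional vectors. The vectors are made-up values, not output from a real model, and treating a two-topic chunk as the mean of its two topics is a simplifying assumption. Real encoders are not exact averagers, but the direction of the effect is the same.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings" (illustrative values only).
topic_a = [1.0, 0.0, 0.0]   # e.g. a passage about billing
topic_b = [0.0, 1.0, 0.0]   # e.g. a passage about authentication
query_a = [0.9, 0.1, 0.0]   # a query that is clearly about topic A

# Assumption: a chunk spanning both topics sits near the mean of the two.
mixed_chunk = [(x + y) / 2 for x, y in zip(topic_a, topic_b)]

sim_clean = cosine(query_a, topic_a)
sim_mixed = cosine(query_a, mixed_chunk)
print(f"clean chunk: {sim_clean:.3f}  mixed chunk: {sim_mixed:.3f}")
```

The single-topic chunk scores noticeably higher for the query than the mixed one, even though the mixed chunk contains the same passage.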

Bad chunk boundaries produce three recurring failure modes:

  • Topic drift. A chunk spans two unrelated subjects, so its embedding sits in the semantic gap between them and ranks poorly for both. The relevant passage is in your index, but the retriever cannot surface it.
  • Context starvation. A chunk gets cut mid-thought. The retriever finds it, but the LLM lacks the surrounding sentences needed to answer faithfully, so it guesses.
  • Redundant retrieval. Heavy overlap means several near-identical chunks claim multiple top-k slots, crowding out passages that carry the rest of the answer.

The classic symptom: a user asks something specific, the retriever returns five chunks that each contain part of the answer, and none contains the whole thing. A larger model helps synthesize fragments, but it cannot recover information that fragmentation pushed out of the top-k entirely. Faithfulness and context-recall scores degrade with poor boundaries regardless of how capable the generator is. The retrieval stage is the ceiling, which is why it pays to get chunking right before tuning anything downstream in the RAG pipeline.

From Fixed-Size to Semantic Splitting

Fixed-size splitting cuts every N tokens with a fixed overlap. It is fast and trivial, and it ignores sentences, paragraphs, and topic shifts entirely. The defaults people copy around - 512 tokens, 100-character overlap - get applied without evaluation and cut straight through the middle of arguments.
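
A minimal sketch of fixed-size splitting, assuming whitespace-separated words as a stand-in for real tokens. The function name and the toy text are illustrative, not taken from any library.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Cut a token list into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

text = "The cache is write-through. Evictions use LRU. Billing runs nightly at 02:00 UTC."
tokens = text.split()  # whitespace split, standing in for a real tokenizer
for chunk in fixed_size_chunks(tokens, size=6, overlap=2):
    print(" ".join(chunk))
```

Run it and the second window starts mid-sentence and ends mid-sentence: the splitter has no idea where one statement stops and the next begins.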

Recursive character splitting is the common upgrade. LangChain's RecursiveCharacterTextSplitter walks a hierarchy of separators (by default "\n\n", then "\n", then " ", then the empty string) and falls back to raw character count only when a segment is still too long. It respects paragraph and line boundaries when the formatting cooperates, and sentence boundaries if you add a separator such as ". " to the list. What it cannot do is notice a topic shift inside a long, well-punctuated paragraph. The split is structural, not semantic.
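
The separator hierarchy is easier to see in code. This is a simplified sketch of the idea, not LangChain's implementation: the separator list (including the ". " sentence separator) is an assumption chosen for illustration, and separators are dropped at chunk boundaries rather than preserved.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; go finer only for parts that are still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: raw character windows.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate              # still fits: keep growing this chunk
            continue
        if current:
            chunks.append(current)           # flush the chunk built so far
        if len(part) <= max_len:
            current = part
        else:
            chunks.extend(recursive_split(part, max_len, finer))  # too long: try finer separators
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "Para one sentence one. Para one sentence two.\n\nPara two is short."
print(recursive_split(sample, max_len=30))
```

Every decision here is driven by length and punctuation. Nothing in the function looks at what the text is about.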

Semantic chunking attacks exactly that gap.

Semantic chunking is a text-splitting strategy that embeds each sentence or small group of sentences, measures the cosine distance between adjacent embeddings, and inserts a chunk boundary wherever that distance spikes. A large jump signals a topic shift, so the resulting chunks vary in length and align with coherent ideas rather than fixed character budgets.
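
The whole loop fits in a few lines. The sketch below follows the steps just described, with one loud assumption: toy_embed is a bag-of-words counter standing in for a real embedding model so the example runs offline. All function names are illustrative.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Bag-of-words counts. A stand-in for a real embedding model, used only for illustration."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile(values, pct):
    """Percentile with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def semantic_chunks(sentences, embed=toy_embed, pct=95):
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [embed(s) for s in sentences]
    # Distance between each sentence and the one that follows it.
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:          # spike in distance: treat as a topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The cache stores hot keys in memory.",
    "The cache evicts cold keys from memory.",
    "Invoices are emailed to customers monthly.",
    "Customers can download invoices as PDF.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Note that the threshold is relative to the document's own distances, so a percentile rule will usually find somewhere to split even in fairly uniform text.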

One point of terminology that trips people up: semantic chunking is not semantic search. Semantic chunking is an ingest-time splitting decision. Semantic search is query-time retrieval over dense vectors. They both lean on embeddings, but they operate at different stages and solve different problems. (For the retrieval side, see how dense and sparse vectors actually differ.)

The mental model worth borrowing here is Greg Kamradt's 5 Levels of Text Splitting: character count, recursive, document-specific, semantic (embedding-based), and agentic (LLM-driven). Each level trades compute for semantic fidelity. You climb the ladder until the gains stop paying for the cost on your corpus, not until you reach the top.

RAG Chunking Strategies Compared at a Glance

No single strategy dominates. Pick based on document shape and your ingest budget, not on what is fashionable.

| Strategy | Recall | Precision | Ingest cost | Effort | Best fit |
|---|---|---|---|---|---|
| Fixed-size | Low-medium | Low | Negligible | Trivial | Short, uniform docs; throwaway prototypes |
| Recursive character | Medium | Medium | Negligible | Low | Markdown/HTML with clear structure; default starting point |
| Semantic (embedding breakpoints) | Medium-high | Medium-high | O(n) embeddings | Medium | Long narrative docs, transcripts, mixed-format corpora |
| LLM-based / agentic | High | High | Very high (LLM per segment) | High | High-value, low-volume: legal, medical, compliance |
| Late chunking | Medium-high | High | One long-context embed pass | Medium | Docs that fit a long-context model (around 8K tokens) |
| Hierarchical (small-to-big) | High | High | Adds a parent index | Medium | When small chunks retrieve well but answers need wider context |

A rough decision tree:

  • Short structured docs (under ~500 tokens, clear headers) - recursive or structure-aware splitting. Semantic chunking buys little.
  • Long narrative docs with no headers - semantic chunking.
  • Mixed-format docs (tables, prose, code) - structure-aware pre-split, then semantic refinement on the prose.
  • Real-time ingest under tight latency SLAs - recursive, since it makes no embedding calls.
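
If it helps to pin the tree down, here is one way to encode it. The function name, the order in which the checks run, and the final fallback are assumptions added for illustration. The 500-token figure is the rule of thumb from the list above, not a hard limit.

```python
def pick_chunking_strategy(avg_tokens, has_clear_headers, mixed_format, realtime_ingest):
    """Encode the rough decision tree. Check order and the final fallback are assumptions."""
    if realtime_ingest:
        return "recursive"  # no embedding calls at ingest
    if mixed_format:
        return "structure-aware pre-split, then semantic refinement on prose"
    if avg_tokens < 500 and has_clear_headers:
        return "recursive or structure-aware"
    if not has_clear_headers:
        return "semantic"
    # Assumption: long docs with reliable headers start from the cheap baseline.
    return "recursive"

print(pick_chunking_strategy(avg_tokens=4000, has_clear_headers=False,
                             mixed_format=False, realtime_ingest=False))
```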

That last point matters more than people expect. Semantic chunking costs one embedding call per sentence or sentence group, so ingest cost grows as O(n) embedding calls in the number of sentences, where fixed-size and recursive splitting make none. In practice it runs several times slower than recursive splitting at ingest. The cost is paid once and is invisible at query time, but it is real, and on very large corpora with tight ingest windows it can tip the decision back to recursive.
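
For a back-of-the-envelope feel for that ingest cost, a rough sketch that counts how many inputs a semantic chunker would need to embed for a piece of text. The regex sentence splitter and the group_size parameter are simplifying assumptions. Real APIs usually accept batches, so the number of network requests can be lower than the number of embedded inputs.

```python
import math
import re

def embedding_inputs(text, group_size=1):
    """Count the sentence groups a semantic chunker would embed for `text`."""
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return math.ceil(len(sentences) / group_size)

doc = "One. Two! Three? Four."
print(embedding_inputs(doc), embedding_inputs(doc, group_size=3))
```

A recursive splitter embeds nothing at split time, so for it the same count is zero regardless of document length.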

How Embedding Breakpoints Work in Practice

The two dominant libraries expose the same core idea with slightly different knobs.

LangChain's SemanticChunker (in langchain_experimental.text_splitter) supports four breakpoint_threshold_type values: percentile (the default - break where the distance exceeds the Nth percentile), standard_deviation (break beyond mean plus k sigma), interquartile (a more outlier-robust variant), and gradient (break where the rate of change in distance spikes, useful for dense domain text where absolute distances stay compressed). Percentile is the safest place to start.
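
To see what the four rules actually compute, here is a plain-Python sketch of each threshold applied to a list of adjacent-sentence distances. It approximates the behaviour described above and is not the library's code. The default amounts in DEFAULT_AMOUNT and the "mean plus k times IQR" form of the interquartile rule are assumptions to check against the version you run.

```python
import math
import statistics

# Assumed default amounts per threshold type; verify against your library version.
DEFAULT_AMOUNT = {"percentile": 95, "standard_deviation": 3, "interquartile": 1.5, "gradient": 95}

def percentile(values, pct):
    """Percentile with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def gradient(values):
    """Central differences in the interior, one-sided differences at the two ends."""
    if len(values) < 2:
        return [0.0] * len(values)
    inner = [(values[i + 1] - values[i - 1]) / 2 for i in range(1, len(values) - 1)]
    return [values[1] - values[0]] + inner + [values[-1] - values[-2]]

def breakpoints(distances, kind="percentile", amount=None):
    """Return the indices whose value exceeds the threshold for the chosen rule."""
    if not distances:
        return []
    amount = DEFAULT_AMOUNT.get(kind) if amount is None else amount
    series = distances
    if kind == "percentile":
        threshold = percentile(distances, amount)
    elif kind == "standard_deviation":
        threshold = statistics.fmean(distances) + amount * statistics.pstdev(distances)
    elif kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        threshold = statistics.fmean(distances) + amount * iqr
    elif kind == "gradient":
        series = gradient(distances)      # threshold the rate of change, not the distance
        threshold = percentile(series, amount)
    else:
        raise ValueError(f"unknown threshold type: {kind}")
    return [i for i, value in enumerate(series) if value > threshold]

distances = [0.10, 0.12, 0.11, 0.90, 0.10, 0.13, 0.12, 0.11]
for kind in DEFAULT_AMOUNT:
    print(kind, breakpoints(distances, kind))
```

On this eight-value toy series the 3-sigma rule finds no split at all, because a single outlier in so few points cannot sit three standard deviations above the mean. Lowering the amount to 2 recovers the spike. In the library itself, the same choice is a single constructor argument: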

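  from langchain_experimental.text_splitter import SemanticChunker
  from langchain_openai import OpenAIEmbeddings

  # OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
  chunker = SemanticChunker(
      OpenAIEmbeddings(),
      breakpoint_threshold_type="percentile",  # or standard_deviation, interquartile, gradient
      breakpoint_threshold_amount=95,          # split only above the 95th-percentile distance
  )

  # Placeholder input: load whatever document you want to chunk.
  with open("document.txt", encoding="utf-8") as f:  # hypothetical path
      long_text = f.read()

  docs = chunker.create_documents([long_text])
  # Print the number of chunks and the mean chunk length in characters.
  print(len(docs), sum(len(d.page_content) for d in docs) / len(docs))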

LlamaIndex's SemanticSplitterNodeParser exposes buffer_size (how many sentences on each side are pooled into one embedding before comparison - larger smooths noise) and breakpoint_percentile_threshold (the percentile cutoff; 95 means only the top 5% of distances trigger a split). Those two parameters are your main levers on granularity.
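
The buffer is the less intuitive of the two, so here is a sketch of the pooling idea in isolation. It illustrates the description above and is not LlamaIndex's code; pool_with_buffer is a hypothetical helper name.

```python
def pool_with_buffer(sentences, buffer_size=1):
    """Join each sentence with up to `buffer_size` neighbours on each side before embedding."""
    pooled = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        pooled.append(" ".join(sentences[start:end]))
    return pooled

print(pool_with_buffer(["A.", "B.", "C.", "D."], buffer_size=1))
# ['A. B.', 'A. B. C.', 'B. C. D.', 'C. D.']
```

Each pooled string is what gets embedded and compared with its neighbour. Because adjacent windows share sentences, a single odd sentence moves the distance less, which is the smoothing effect a larger buffer buys.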

Two things are worth knowing before you commit. First, the embedding model used for chunking does not have to match the one used for retrieval. A lightweight model like all-MiniLM-L6-v2 is plenty for boundary detection, and heavier models add ingest latency for marginal gains. What matters is consistency: use the same chunker model across re-indexes, because changing it shifts every boundary. Second, always set a hard max-chunk-length ceiling. A long stretch of uniform text can produce one runaway chunk that blows past your embedding model's token limit. Cap it.
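
A minimal sketch of such a ceiling, applied as a post-processing pass over whatever the semantic chunker returns. It counts whitespace-separated words as a stand-in for tokens, which is an assumption: in a real pipeline you would count with the tokenizer of your embedding model. The function name is illustrative.

```python
def enforce_max_length(chunks, max_words):
    """Split any chunk longer than `max_words` into consecutive windows; leave the rest untouched."""
    capped = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_words:
            capped.append(chunk)
            continue
        for start in range(0, len(words), max_words):
            capped.append(" ".join(words[start:start + max_words]))
    return capped

print(enforce_max_length(["a b c", "a b c d e f g"], max_words=3))
# ['a b c', 'a b c', 'd e f', 'g']
```

Hard windows are the bluntest possible fallback. Splitting the oversized chunk with a recursive splitter instead keeps sentence boundaries intact at the cost of a little more code.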

When Semantic Chunking Is Worth It (and When It Is Not)

Here is the part the hype usually skips. Chroma's chunking evaluation study found that semantic chunking does not reliably beat simpler methods. A plain RecursiveCharacterTextSplitter at 200 tokens with zero overlap was competitive across their tests, and the default semantic chunker underperformed until it was tuned. Chunking strategy alone produced up to a 9% swing in recall, and the choice of embedding model moved results about as much. The paper "Is Semantic Chunking Worth the Computational Cost?" reaches a similar verdict: across document retrieval, evidence retrieval, and answer generation, the extra compute is not justified by consistent gains.

So the honest framing is this. Semantic chunking often helps on long narrative documents, transcripts, and heterogeneous corpora where structural cues are unreliable. It is frequently a wash on short, well-structured content, and on a cost-adjusted basis it can lose on huge corpora with tight ingest budgets. Treat it as a tool to reach for when boundaries genuinely matter, not as a default.
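
Deciding whether boundaries "genuinely matter" on your corpus comes down to measuring. Below is a minimal sketch of a recall-at-k check against a small golden Q/A set. The keyword-overlap retriever is a toy stand-in for your real retriever, and the data is invented for illustration.

```python
def recall_at_k(golden, retrieve, k=5):
    """Fraction of golden questions whose expected snippet appears in a top-k retrieved chunk."""
    hits = 0
    for question, expected_snippet in golden:
        if any(expected_snippet in chunk for chunk in retrieve(question, k)):
            hits += 1
    return hits / len(golden) if golden else 0.0

def make_keyword_retriever(chunks):
    """Toy retriever: rank chunks by how many words they share with the question."""
    def retrieve(question, k):
        q_words = set(question.lower().split())
        ranked = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
        return ranked[:k]
    return retrieve

golden = [
    ("how long do refunds take", "within five days"),
    ("how often are passwords rotated", "every ninety days"),
]
coherent = ["refunds are processed within five days",
            "passwords must be rotated every ninety days"]
fragmented = ["refunds are processed",
              "within five days passwords must",
              "be rotated every ninety days"]

for name, chunks in [("coherent", coherent), ("fragmented", fragmented)]:
    print(name, recall_at_k(golden, make_keyword_retriever(chunks), k=1))
```

Swap in each candidate chunker, keep the embedding model and retriever fixed, and compare the numbers before paying for the more expensive option.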

Two adjacent techniques attack the same fragmentation problem from different angles:

  • Anthropic Contextual Retrieval prepends an LLM-generated, document-aware summary to each chunk before indexing it. It is enrichment, not splitting, and it directly fights context starvation. Anthropic reports failed retrievals dropping 35% with contextual embeddings, 49% when combined with BM25, and 67% with reranking added on top. The cost is one LLM call per chunk at ingest.
  • Late chunking (Jina AI, arXiv 2409.04701) inverts the order: embed the whole document first with a long-context model, then segment the token-level embeddings into chunks. Each chunk vector keeps awareness of the full document. It fits documents that sit inside the model's window (jina-embeddings-v2 handles roughly 8,192 tokens).
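
The pooling step of late chunking is simple enough to sketch. The example assumes you already have one contextualised vector per token from a single long-context pass over the whole document, plus the token span of each chunk. The vectors here are made-up numbers, the function name is hypothetical, and mean pooling is assumed as the aggregation.

```python
def late_chunk_vectors(token_vectors, spans):
    """Mean-pool contextualised token vectors over each (start, end) span: one vector per chunk."""
    chunk_vectors = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        chunk_vectors.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return chunk_vectors

# Four token vectors from one (hypothetical) full-document forward pass.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries expressed in token positions
print(late_chunk_vectors(token_vectors, spans))
# [[2.0, 0.0], [0.0, 3.0]]
```

The difference from ordinary chunking is entirely in where token_vectors come from: each token was encoded with the whole document in view, so the pooled chunk vector inherits that context.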

The thread connecting all three is that the failure they target - chunks that lose their surrounding meaning - is real and measurable, even though no single fix wins everywhere.

Key Takeaways

  • The retriever operates on chunks, not documents. Chunk quality sets an upper bound on retrieval quality that no downstream model can lift.
  • Recursive character splitting is a strong, cheap default and the right baseline to beat. Do not skip past it.
  • Semantic chunking often helps on long, narrative, or mixed-format corpora and is frequently a wash on short, structured docs. Benchmark it on your own data before committing.
  • Embedding model choice can move retrieval metrics as much as chunking strategy does. Hold it constant when you compare strategies.
  • Always cap max chunk length, version your chunking config (model, threshold, buffer) alongside your index, and re-evaluate against a golden Q/A set as the corpus changes.
  • For context starvation specifically, Contextual Retrieval and late chunking are worth evaluating beyond plain embedding breakpoints.
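
On versioning the chunking config: one lightweight approach is to hash the settings and store the hash with the index build. This is a sketch under assumptions. The class, its field names, and the example values are all hypothetical, so adapt them to whatever your chunker actually exposes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ChunkingConfig:
    chunker_model: str        # embedding model used for boundary detection
    threshold_type: str
    threshold_amount: float
    buffer_size: int
    max_chunk_tokens: int

    def fingerprint(self):
        """Stable short hash of the config, suitable for tagging an index build."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

config = ChunkingConfig("all-MiniLM-L6-v2", "percentile", 95, 1, 512)
print(config.fingerprint())
```

If the fingerprint stored with an index differs from the current config's fingerprint, the boundaries have shifted and the golden-set evaluation should be rerun before trusting old numbers.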