Hybrid Search Explained: Combining Vector and Keyword Retrieval

Hybrid search runs lexical (BM25) and dense vector retrieval side by side and fuses the two ranked lists into one. This guide covers the architecture, OpenSearch and Elasticsearch implementations, fusion choices, weight tuning, and how to measure the uplift.

Pure vector search has a blind spot. Ask a dense retriever for ERR_CONN_RST or a product code like SKU-44819, and it will happily return semantically adjacent documents that never contain the exact token you typed. BM25 nails those queries and fails the opposite ones: paraphrase, synonymy, and long natural-language questions where the words a user picks rarely match the words in the corpus.

Hybrid search is the answer, and it has become the production default across OpenSearch, Elasticsearch, Weaviate, Vespa, and Qdrant. It runs a lexical retriever and a dense retriever in parallel, then fuses their two ranked lists into a single ordering. This post is about the engineering: how the two retrievers combine, how to wire it up in OpenSearch and Elasticsearch, how to pick a fusion method, and how to prove the result beats either signal on its own.

Why neither keyword nor vector search is enough

The case for hybrid starts with where each retriever breaks.

Hybrid search is a retrieval architecture that runs a lexical retriever (BM25 or a learned sparse model) and a dense vector retriever against the same corpus, then merges their ranked results into one list. The goal is to recover the exact-match precision of keyword search and the semantic recall of embeddings in a single query.

BM25 still wins on a specific class of queries: exact identifiers, SKUs, error codes, acronyms, and rare long-tail terms that an embedding model never saw in training. Term frequency and inverse document frequency give a strong, well-understood signal on short precise queries, and BM25 needs no labeled data, so it works on a cold-start corpus from day one. For a deeper treatment of how the two index types differ, see our breakdown of sparse vs dense vectors.

Dense vectors win where lexical overlap disappears. "Cheap flights" and "budget airfare" share no tokens but the same intent. Cross-lingual retrieval works without per-language synonym lists. Long, conversational questions, the kind RAG systems generate, dilute BM25 across many low-signal terms while a bi-encoder captures the overall meaning.

The failure mode that pushed teams to hybrid is out-of-domain embeddings. The BEIR benchmark (Thakur et al., NeurIPS 2021) showed that dense retrievers trained on one domain frequently underperform BM25 when evaluated zero-shot on another. That is a generalization gap: an embedding model that is confident about similarity inside its training distribution becomes unreliable once your corpus drifts away from it. Hybrid hedges against that by keeping a lexical signal that does not depend on any model's training data.

What hybrid search actually means

Strip away the vendor terminology and hybrid is two parts: parallel retrieval, then fusion.

Both retrievers run against the corpus and each returns its own candidate set with its own scoring function. BM25 produces scores in an unbounded range driven by term statistics. A dense retriever produces cosine similarities or dot products in a completely different range. The two scores are not comparable, which is exactly why the second part matters.

Fusion is the step where hybrid happens. You take two ranked lists and merge them into one. There are two families. Score-based fusion normalizes each retriever's scores onto a common scale, then combines them, usually as a weighted sum. Rank-based fusion ignores score magnitude entirely and combines documents by their position in each list. Reciprocal Rank Fusion (RRF) is the dominant rank-based method.

A few terms get conflated here, so it is worth being precise:

Sparse-dense: a single index holding both a sparse representation (BM25 or SPLADE) and a dense vector per document. The hybrid happens at query time.
Multi-vector: ColBERT-style late interaction, where a query matches against many token-level vectors. This is a different retrieval model, not fusion.
Multi-stage: retrieve broadly with hybrid, then rerank the top candidates with a cross-encoder. This is the dominant production pattern, and the reranker is where most of the precision gain comes from. We cover that stage in RAG reranking with cross-encoders.

Choosing a fusion method

This is the decision that most affects relevance, and it comes down to whether your retrievers produce comparable, well-calibrated scores.

Weighted sum (alpha-blend) is the intuitive option: final = α * norm(lexical) + (1 - α) * norm(dense). It only works after normalization, because raw BM25 and cosine scores live on different scales. Common choices are min-max (sensitive to outliers, since one runaway score compresses everything else) and L2. When you can trust your scores, weighted sum is the most expressive method because it lets you upweight the retriever that performs better on your data.
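As a rough illustration of the formula above, here is a minimal Python sketch of min-max normalization followed by an alpha blend. The document IDs, the scores, and the choice to score a missing document as 0 for that retriever are all assumptions made for the example, not something any particular engine prescribes.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores onto [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Every score is identical, so there is no ordering to preserve.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_sum(lexical: Dict[str, float], dense: Dict[str, float], alpha: float) -> Dict[str, float]:
    """final = alpha * norm(lexical) + (1 - alpha) * norm(dense)."""
    lex_n, den_n = min_max(lexical), min_max(dense)
    fused = {}
    for doc in set(lex_n) | set(den_n):
        # Assumption: a document missing from one list contributes 0 from that retriever.
        fused[doc] = alpha * lex_n.get(doc, 0.0) + (1 - alpha) * den_n.get(doc, 0.0)
    return fused


# Hypothetical raw scores: BM25 is unbounded, cosine similarity is not.
bm25 = {"doc_a": 12.4, "doc_b": 7.1, "doc_c": 3.0}
cosine = {"doc_b": 0.91, "doc_c": 0.88, "doc_d": 0.52}

fused = weighted_sum(bm25, cosine, alpha=0.3)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

With alpha at 0.3 the dense signal carries most of the weight, so doc_a, the top BM25 hit that the dense retriever never returned, falls behind both documents that appear in the dense list.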

Reciprocal Rank Fusion (RRF) sums 1 / (k + rank) for each document across retrievers, where rank is the document's 1-based position in a retriever's list and k is a smoothing constant (60 is a near-universal default from the original paper). A document that a retriever did not return contributes nothing for that retriever. Because it uses rank position rather than score, it needs no normalization and is robust to wildly different score distributions. That same property is its limit: it throws away score magnitude, so it cannot exploit a well-calibrated retriever that knows document A is far better than document B. RRF comes from Cormack, Clarke, and Büttcher (SIGIR 2009), and we have a dedicated explainer on how RRF works and when to use it if you want the math and tuning detail.
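The RRF sum is short enough to write out in full. The sketch below is a plain-Python version of the formula, assuming 1-based ranks and made-up document IDs. It is meant to show the arithmetic, not to mirror how any engine implements it internally.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(ranked_lists: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) for every list a document appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc] += 1.0 / (k + rank)
    return dict(scores)


# Hypothetical result lists, best hit first.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

rrf_scores = rrf([bm25_ranking, dense_ranking])
rrf_ranking = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
print(rrf_ranking)
```

Here doc_b scores 1/62 + 1/61 and doc_a scores 1/61 + 1/63, so doc_b edges ahead. Documents found by both retrievers beat documents found by only one, which is the behavior that makes RRF a sensible baseline.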

Cross-encoder reranking is not really fusion; it is a third stage. Retrieve a broad candidate set with hybrid, then re-score the top 50-200 with a cross-encoder that reads query and document together. This is the precision layer for high-stakes RAG, e-commerce, and legal or medical search.
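The control flow of that third stage can be sketched without committing to a model. In the Python below, rerank takes any function that scores a (query, document) pair. The toy_overlap_scorer is a deliberately crude placeholder based on token overlap, not a cross-encoder, and every name here is hypothetical. In a real system you would pass a function that calls your cross-encoder instead.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    score_pair: Callable[[str, str], float],
    top_n: int,
) -> List[str]:
    """Re-score (doc_id, text) candidates against the query and keep the best top_n ids."""
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates]
    # sort() is stable, so tied documents keep the order hybrid retrieval gave them.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


def toy_overlap_scorer(query: str, text: str) -> float:
    """Placeholder scorer (token Jaccard). NOT a cross-encoder; swap in a real model."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


# Hypothetical output of the hybrid stage: (doc_id, text), already fused.
hybrid_candidates = [
    ("doc_a", "wireless earbuds with charging case"),
    ("doc_b", "noise cancelling headphones over ear"),
    ("doc_c", "headphones stand"),
]

top = rerank("noise cancelling headphones", hybrid_candidates, toy_overlap_scorer, top_n=2)
print(top)
```

Keeping the scorer pluggable also makes it easy to measure the reranker in isolation: run the same candidate sets through it with and without the model and compare.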

Method	Needs normalization	Exploits score magnitude	Best when
Weighted sum	Yes	Yes	Scores are calibrated; one retriever dominates a query class
RRF	No	No	Retrievers produce incomparable scores; you want a zero-tuning baseline
Cross-encoder rerank	N/A (re-scores)	Yes (learned)	Precision at the top of the result list matters more than recall or latency

Start with RRF as a baseline, confirm hybrid beats both single signals, then move to weighted sum only if you have an evaluation set that shows it helps.

Hybrid search in OpenSearch

OpenSearch implements hybrid through a hybrid query type plus a search pipeline that fuses the sub-query results. Each sub-query (a match for BM25, a knn for vectors) is scored on its own path, and the pipeline combines them.

You first define a search pipeline with a normalization-processor. It supports min_max and l2 normalization, and combination via arithmetic_mean, geometric_mean, or harmonic_mean, with a configurable weights array. Then you reference that pipeline on the search request:

PUT /_search/pipeline/hybrid-pipeline
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.3, 0.7] }
        }
      }
    }
  ]
}

GET /products/_search?search_pipeline=hybrid-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "description": "noise cancelling headphones" } },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.42, -0.13, 0.88, "..."],
              "k": 50
            }
          }
        }
      ]
    }
  }
}

The weights array maps positionally to the sub-queries, so [0.3, 0.7] biases toward the dense signal. To generate description_embedding at index time, attach an ingest pipeline that calls a model deployed on ML nodes, so embeddings are written alongside the text without a round-trip to your application.

OpenSearch 2.19 added rank-based fusion through the score-ranker-processor, which applies RRF instead of score normalization. By default each sub-query gets an equal weight of 1, and you can override per-query weights. Reach for it when your sub-query score distributions are too different for normalization to behave, or when you want a baseline with nothing to tune. For background on the vector side of this setup, see our introduction to vector search in OpenSearch and Elasticsearch.

Hybrid search in Elasticsearch

Elasticsearch took a different route with the retriever API. A retriever is a node in a tree that describes how top documents are produced, and retrievers nest, so you compose a hybrid query rather than hand-merging results in the client.

Retrievers and the rrf retriever were introduced in 8.14 and reached general availability in 8.16. The rrf retriever wraps child retrievers and fuses them server-side using RRF, with rank_constant (default 60) controlling the smoothing and rank_window_size setting how many documents each child contributes before fusion:

GET products/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": { "match": { "description": "noise cancelling headphones" } }
          }
        },
        {
          "knn": {
            "field": "description_embedding",
            "query_vector": [0.42, -0.13, 0.88],
            "k": 50,
            "num_candidates": 100
          }
        }
      ],
      "rank_constant": 60,
      "rank_window_size": 50
    }
  }
}

The standard and knn retrievers run in parallel and Elasticsearch fuses their results server-side, so there is no client-side merging. Filters go inside each child retriever or at the top level.

A third retriever is where Elasticsearch's stack gets interesting. ELSER, Elastic's learned sparse encoder, produces sparse token expansions that catch paraphrase without the latency of a dense model. Adding ELSER as a third child under the rrf retriever (BM25 + ELSER + dense kNN) is Elastic's recommended high-recall configuration. Note the licensing: retrievers are available in all tiers, but ELSER needs a Platinum or Enterprise license, or Elastic Cloud. We compared ELSER against external embeddings if you are weighing that trade-off.

Other engines follow the same two patterns. Weaviate exposes an alpha parameter (0 is pure keyword, 1 is pure vector) and two fusion methods, rankedFusion (RRF-style) and relativeScoreFusion, the latter being the default since v1.24 because it preserves score magnitude. Qdrant's Query API uses prefetch to run sparse and dense sub-queries, then fuses with rrf or dbsf (Distribution-Based Score Fusion), which normalizes each retriever's scores using its mean and standard deviation before combining.
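To make the distribution-based idea concrete, here is a small Python sketch that rescales each retriever's scores using its own mean and standard deviation, then sums them. The window of three standard deviations either side of the mean, the clamping to [0, 1], and the use of population standard deviation are assumptions for illustration. This shows the general technique and should not be read as Qdrant's exact implementation.

```python
from statistics import mean, pstdev
from typing import Dict


def dbsf_normalize(scores: Dict[str, float], width: float = 3.0) -> Dict[str, float]:
    """Map scores onto [0, 1] using mean +/- width * std as the limits (assumed window)."""
    if not scores:
        return {}
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    if sigma == 0:
        # No spread: put every document at the midpoint.
        return {doc: 0.5 for doc in scores}
    lo, hi = mu - width * sigma, mu + width * sigma
    # Clamp so outliers beyond the window cannot leave [0, 1].
    return {doc: min(1.0, max(0.0, (s - lo) / (hi - lo))) for doc, s in scores.items()}


def dbsf_fuse(sparse: Dict[str, float], dense: Dict[str, float]) -> Dict[str, float]:
    """Sum the per-retriever normalized scores; a missing document adds 0."""
    sparse_n, dense_n = dbsf_normalize(sparse), dbsf_normalize(dense)
    return {
        doc: sparse_n.get(doc, 0.0) + dense_n.get(doc, 0.0)
        for doc in set(sparse_n) | set(dense_n)
    }


# Hypothetical scores on very different scales.
sparse_scores = {"doc_a": 12.0, "doc_b": 8.0, "doc_c": 4.0}
dense_scores = {"doc_b": 0.9, "doc_c": 0.8, "doc_d": 0.7}

dbsf_scores = dbsf_fuse(sparse_scores, dense_scores)
dbsf_ranking = sorted(dbsf_scores, key=dbsf_scores.get, reverse=True)
print(dbsf_ranking)
```

Compared with min-max, a single runaway score shifts the mean and standard deviation but does not pin the whole scale to itself, which is the practical appeal of this family of normalizers.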

Tuning weights and measuring the uplift

Two failure modes dominate hybrid rollouts: shipping it without measuring, and tuning weights against intuition instead of data.

Start with RRF at k=60 as a zero-tuning baseline and benchmark it against BM25-only and dense-only. If hybrid does not beat both, stop and find out why before adding more machinery. Once hybrid wins, weighted sum is worth trying when one retriever consistently dominates a query class. Short navigational queries (one or two tokens) usually favor BM25; long natural-language questions favor dense. Some teams classify queries at runtime and route SKU lookups to a BM25-heavy blend and descriptive queries to a vector-heavy one.
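A runtime router of the kind described above can start as a few lines of heuristics. The Python sketch below picks the lexical weight alpha for a weighted-sum blend. The identifier test (a token containing a digit or an underscore) and the three alpha values are hypothetical starting points that you would replace with values tuned on your own judged set.

```python
import re


def looks_like_identifier(token: str) -> bool:
    """Heuristic: SKUs and error codes tend to contain digits or underscores."""
    return bool(re.search(r"\d", token)) or "_" in token


def choose_alpha(query: str) -> float:
    """Return the lexical weight for final = alpha * lexical + (1 - alpha) * dense."""
    tokens = query.split()
    if any(looks_like_identifier(t) for t in tokens):
        return 0.8  # hypothetical BM25-heavy blend for exact lookups
    if len(tokens) <= 2:
        return 0.6  # hypothetical: short navigational queries lean lexical
    return 0.3  # hypothetical vector-heavy blend for descriptive queries


for q in ["SKU-44819", "ERR_CONN_RST", "headphones", "what are good headphones for a noisy office"]:
    print(q, "->", choose_alpha(q))
```

Whatever rules you settle on, log the chosen class with each query so the per-class metrics line up with the routing decisions actually made in production.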

Watch for score drift. When you re-embed with a new model, the dense score distribution shifts and any normalization parameters you tuned go stale. Version your embeddings in separate fields per model, run shadow scoring during migration, and alert on relevance regressions rather than discovering them in production.

To measure, you need a judged set: 50-100 queries with their top ~20 documents graded on a 0-3 scale. LLM-as-judge gets you there fast; validate a sample with human raters before trusting it. The metric depends on the architecture. Use nDCG@10 for end-user ranking quality. Use Recall@100 when hybrid is the first stage before a cross-encoder, since recall at the cutoff is what bounds the reranker's ceiling. Use MRR for navigational, single-answer queries. Track these per query class, not just as a corpus-wide average that hides where hybrid is hurting.
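All three metrics fit in a few lines of Python, which is handy for sanity-checking whatever evaluation tooling you use. This sketch assumes graded judgments on the 0-3 scale, the exponential gain form (2^grade - 1) for nDCG, and unjudged documents counting as grade 0. Other conventions exist, so treat these as labeled assumptions. The judgments and ranking are made up.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(grades: List[int], k: int) -> float:
    """Discounted cumulative gain with exponential gain (an assumed convention)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(ranked_ids: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    gains = [judgments.get(doc, 0) for doc in ranked_ids]  # unjudged counts as 0
    ideal = dcg_at_k(sorted(judgments.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 100) -> float:
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant hit; average this over queries to get MRR."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical judged query: grades on the 0-3 scale.
judgments = {"doc_a": 3, "doc_b": 2, "doc_c": 0, "doc_d": 1}
hybrid_ranking = ["doc_b", "doc_a", "doc_d", "doc_c"]

ndcg = ndcg_at_k(hybrid_ranking, judgments, k=10)
recall = recall_at_k(hybrid_ranking, {"doc_a", "doc_d"}, k=2)
rr = reciprocal_rank(hybrid_ranking, {"doc_a"})
print(round(ndcg, 4), recall, rr)
```

To get the per-class view, compute these per query, tag each query with its class, and average within each class before looking at the overall mean.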

Budget for the real cost. With parallel execution, hybrid latency is roughly max(BM25, ANN) + fusion, but memory is additive: you carry both an inverted index and an HNSW graph. Quantized vectors (int8 or binary) and pruned graphs claw some of that back.
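A back-of-envelope calculation makes the budget tangible. The Python below estimates raw vector storage at different precisions and the parallel-latency formula from the paragraph above. The corpus size, dimensionality, and millisecond figures are hypothetical, and the storage number deliberately excludes HNSW graph links and the inverted index, both of which add to it.

```python
GIB = 1024 ** 3


def vector_bytes(num_docs: int, dims: int, bytes_per_dim: float) -> float:
    """Raw vector payload only; graph links and the inverted index are extra."""
    return num_docs * dims * bytes_per_dim


def parallel_latency_ms(bm25_ms: float, ann_ms: float, fusion_ms: float) -> float:
    """With both retrievers running in parallel, the slower one sets the pace."""
    return max(bm25_ms, ann_ms) + fusion_ms


# Hypothetical corpus: 10 million documents, 768-dimensional embeddings.
num_docs, dims = 10_000_000, 768

float32_gib = vector_bytes(num_docs, dims, 4) / GIB
int8_gib = vector_bytes(num_docs, dims, 1) / GIB
binary_gib = vector_bytes(num_docs, dims, 1 / 8) / GIB

print(f"float32: {float32_gib:.2f} GiB, int8: {int8_gib:.2f} GiB, binary: {binary_gib:.2f} GiB")
print("latency:", parallel_latency_ms(bm25_ms=12, ann_ms=35, fusion_ms=2), "ms")
```

The ratios matter more than the absolute numbers: int8 cuts the vector payload to a quarter of float32 and binary to one thirty-second, usually at some cost in recall that your judged set should quantify.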

Key takeaways

Hybrid search runs a lexical and a dense retriever in parallel and fuses their ranked lists. It recovers BM25's exact-match precision and dense retrieval's semantic recall in one query.
BM25 wins on identifiers, acronyms, rare terms, and cold-start corpora; dense wins on paraphrase, synonymy, and cross-lingual queries. BEIR showed dense retrievers underperform BM25 out of domain, which is the core argument for keeping a lexical signal.
Use RRF (rank-based, no normalization) as a baseline. Switch to weighted sum only when scores are calibrated and an evaluation set proves it helps. Add a cross-encoder reranker when precision at the top of the result list justifies the latency.
OpenSearch uses a hybrid query plus a search pipeline (normalization-processor for score fusion, score-ranker-processor for RRF since 2.19). Elasticsearch uses the retriever API with an rrf retriever (introduced in 8.14, GA in 8.16).
Measure before shipping. Build a 50-100 query judged set, track nDCG@10 or Recall@100 per query class, and budget for the additive memory of running two indices.

If you are weighing where hybrid retrieval fits in a larger search modernization effort, our guide to moving beyond keyword search with OpenSearch puts it in context. And when you are ready to ship hybrid retrieval that demonstrably beats either signal alone, our search team does exactly this work.