A practical guide to building AI-powered search with Elasticsearch - inference endpoints, semantic_text, hybrid retrieval with RRF, embedding strategy choices, and production considerations.

Keyword search works until it doesn't. A user types "lightweight container orchestration" and your BM25 index returns nothing because your docs say "minimal Docker management." The words don't match, but the intent is identical. This gap between what users mean and what keyword search finds is the core problem that AI-powered search solves.

Elasticsearch has shipped a stack of AI capabilities over the past two years - inference endpoints, vector search, ELSER, retrievers, semantic_text - and they work. But the real challenge isn't whether these features exist. It's knowing which pieces to combine, in what order, and what the production trade-offs look like. This post walks through the practical architecture of AI-powered search in Elasticsearch, from building blocks to production realities.

The Building Blocks

Three components form the foundation of AI search in Elasticsearch today.

Inference endpoints provide a unified API for connecting ML models, regardless of where they run. You can point an inference endpoint at ELSER (Elastic's native sparse model), an external provider like OpenAI or Cohere, or a model you've uploaded via Eland. The task types cover sparse_embedding, text_embedding, rerank, and completion. Elasticsearch ships with preconfigured endpoints (.elser-2-elasticsearch, .multilingual-e5-small-elasticsearch, .rerank-v1-elasticsearch) that work out of the box on ML nodes. We previously wrote about ELSER vs. external vector embeddings - the core trade-offs still hold.
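Creating an endpoint for an external provider is a single request to the inference API. As a sketch - the endpoint name, model, and key below are placeholders - this body, sent to PUT _inference/text_embedding/my-openai-embeddings, registers an OpenAI text embedding endpoint:

{
  "service": "openai",
  "service_settings": {
    "api_key": "<your-api-key>",
    "model_id": "text-embedding-3-small"
  }
}

Downstream features then reference an inference endpoint by its ID alone.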

The semantic_text field type, GA since Elasticsearch 8.18, eliminates the boilerplate of manual vector search setup. Without it, you need to create an ingest pipeline with an inference processor, manually configure dense_vector or sparse_vector mappings with dimensions and similarity functions, and handle text chunking yourself. With semantic_text, a two-line mapping does all of that:

{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "my-elser-endpoint"
      }
    }
  }
}

It auto-detects whether to use sparse or dense vectors based on the inference endpoint, configures dimensions and similarity, and handles chunking. Drop to manual dense_vector fields only when you need fine-grained control over HNSW parameters, quantization settings, or custom similarity functions.
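Querying a semantic_text field is just as terse. A minimal sketch of a semantic query against the mapping above - the query text is embedded automatically through the same inference endpoint:

{
  "query": {
    "semantic": {
      "field": "content",
      "query": "lightweight container orchestration"
    }
  }
}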

The retrievers framework (GA since 8.16) makes search pipelines composable. Instead of bolting together queries with scripts and rescoring, retrievers let you stack retrieval stages declaratively: a standard retriever wraps any existing Query DSL query, knn handles vector search, rrf fuses multiple ranked lists, and text_similarity_reranker applies semantic reranking as a final pass. This composability is what makes hybrid search practical. For an introduction to how vector search works under the hood, see our earlier primer.

Hybrid Search: Where the Real Gains Are

Neither pure BM25 nor pure vector search wins on its own. BM25 excels at exact matches - product SKUs, error codes, proper nouns, technical terms. Vector search excels at intent - finding documents about "container orchestration" when the user searches for "Docker management." Hybrid search combines both and consistently outperforms either alone.

Reciprocal Rank Fusion (RRF) is the practical way to merge these. It operates on rank positions rather than raw scores, which sidesteps the problem of normalizing BM25 scores (unbounded) against cosine similarity scores (0 to 1). Each document's final score is the sum of 1 / (rank_constant + rank) across all retrievers.
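For example, with the default rank_constant of 60, a document ranked 1st by BM25 and 3rd by the vector retriever scores 1/(60+1) + 1/(60+3) ≈ 0.032, while a document that appears at rank 1 in only one list scores 1/61 ≈ 0.016. Agreement across retrievers is rewarded, and neither score scale can dominate the other.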

Here's a concrete hybrid retrieval pipeline combining BM25, kNN vector search, and a reranking stage:

{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "rrf": {
          "retrievers": [
            {
              "standard": {
                "query": {
                  "match": { "content": "container orchestration" }
                }
              }
            },
            {
              "knn": {
                "field": "content_embedding",
                "query_vector_builder": {
                  "text_embedding": {
                    "model_id": "my-embedding-model",
                    "model_text": "container orchestration"
                  }
                },
                "k": 10,
                "num_candidates": 100
              }
            }
          ],
          "rank_window_size": 100,
          "rank_constant": 60
        }
      },
      "field": "content",
      "inference_id": "my-rerank-model",
      "inference_text": "container orchestration",
      "rank_window_size": 50
    }
  }
}

The inner rrf retriever fuses BM25 and kNN results. The outer text_similarity_reranker takes the top 50 fused results and re-scores them with a cross-encoder model. This two-stage approach keeps the expensive reranking step focused on a small candidate set.

Note the reranker adds latency - typically 50-200ms depending on model and rank_window_size. Start without reranking. Add it only after measuring whether the relevance gains justify the latency cost.

Choosing Your Embedding Strategy

The embedding model decision is architectural, not just a feature toggle. Three paths, each with distinct operational implications.

ELSER (sparse embeddings) is the simplest on-ramp. It runs natively on Elasticsearch ML nodes, requires no external dependencies, handles chunking automatically, and produces sparse vectors that are memory-efficient. The trade-offs: English only, limited to 512 tokens per field, and throughput peaks around 26 docs/sec per allocation. For English-language use cases where you want zero model management overhead, ELSER is the right starting point.

Third-party APIs (OpenAI, Cohere, Google, etc.) make sense when you need multilingual support, already have a vendor relationship, or want access to the latest foundation models without managing infrastructure. The cost model shifts from ML node compute to per-token API pricing, which can be cheaper at low volumes but expensive at scale. You're also adding an external dependency to your indexing and search paths - if the API is down, your search pipeline stalls.

Self-hosted dense models (via Eland or custom deployments) give you full control: data stays on your infrastructure, you can fine-tune on domain-specific corpora, and per-query costs drop at scale. The trade-off is operational complexity - you own GPU provisioning, model versioning, and inference scaling. This path makes sense for organizations with data sovereignty requirements or domain-specific vocabularies where general-purpose models underperform.

A pragmatic approach: start with ELSER or semantic_text with the default endpoint. Measure relevance with a judgment list. If results fall short on non-English content or domain-specific queries, switch the inference endpoint to a dense model - semantic_text makes this a mapping change, not a rewrite.
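As a sketch of that swap, here is the same mapping pointing at the preconfigured multilingual dense endpoint - existing documents still need to be reindexed so their embeddings are regenerated:

{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".multilingual-e5-small-elasticsearch"
      }
    }
  }
}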

Going Deeper: Production Realities

Indexing throughput. Inference at ingest time is the bottleneck. ELSER v2 processes roughly 26 documents per second per allocation. For bulk migrations, scale num_allocations on your ML nodes (linear gains up to about 8 allocations), use the Bulk API with appropriately sized batches, and consider the Elastic Inference Service for GPU-accelerated throughput.
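As a sketch - the endpoint name and allocation count are illustrative - a dedicated ELSER endpoint sized for a bulk run can be created with a request to PUT _inference/sparse_embedding/bulk-elser:

{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 4,
    "num_threads": 1
  }
}

If you'd rather let Elasticsearch scale allocations with load, service_settings also accepts an adaptive_allocations block in place of a fixed num_allocations.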

Query latency budget. BM25 queries typically return in single-digit milliseconds. Adding kNN vector search pushes latency to 10-50ms depending on index size and HNSW parameters. Adding a reranker adds another 50-200ms. Profile each stage independently and set a latency ceiling that your application can tolerate. Not every query needs reranking - consider applying it selectively to ambiguous queries.

Model versioning. When you change embedding models, old vectors and new vectors live in incompatible spaces. There is no gradual migration - you need to reindex. Plan for this from day one. Your embedding strategy will evolve, and each evolution requires a full reindex. Keeping source text alongside vectors in _source makes reindexing straightforward.
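A sketch of that migration, assuming docs-v2 is a new index whose semantic_text field points at the new inference endpoint (index names are illustrative) - a POST _reindex replays the stored text through the new model:

{
  "source": { "index": "docs-v1" },
  "dest": { "index": "docs-v2" }
}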

Cost at scale. ML nodes for ELSER start at 4 GB RAM minimum. Dense vector storage at float32 with 1024 dimensions costs roughly 4 KB per document; int8 quantization cuts that to 1 KB; BBQ (Better Binary Quantization, GA in Elasticsearch 9.0) goes further, saving roughly 95% relative to float32. For large-scale vector search considerations, see our guide on scaling vector search from millions to billions.
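If you've dropped to a manual dense_vector mapping, quantization is an index_options choice. A sketch using int8_hnsw - the field name and dimensions are illustrative, and bbq_hnsw is the analogous option on versions where BBQ is available:

{
  "mappings": {
    "properties": {
      "content_embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "similarity": "cosine",
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}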

Relevance tuning. AI-powered search is not "set and forget." You still need judgment lists and offline evaluation using metrics like nDCG@10 and MRR. Establish a baseline with your current keyword search, measure after adding semantic retrieval, and iterate. Without quantitative evaluation, you're guessing.
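The ranking evaluation API covers the offline side without extra tooling. A sketch that computes nDCG@10 for one judged query - index name, document IDs, and ratings are illustrative - sent to GET /my-index/_rank_eval; run the same judgment list against each pipeline variant and compare:

{
  "requests": [
    {
      "id": "container_orchestration",
      "request": {
        "query": { "match": { "content": "container orchestration" } }
      },
      "ratings": [
        { "_index": "my-index", "_id": "doc-42", "rating": 3 },
        { "_index": "my-index", "_id": "doc-7", "rating": 0 }
      ]
    }
  ],
  "metric": {
    "dcg": { "k": 10, "normalize": true }
  }
}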

Query intent understanding. For advanced use cases, an LLM can decompose natural language queries into structured filters and semantic components before they hit Elasticsearch. A search for "red dress under $50" becomes a color filter, a category match, and a price range - dramatically improving precision without changing the index.
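A sketch of what that decomposition could translate to on the Elasticsearch side - field names are illustrative, and the structured parts become cheap filters while the category match stays a scored query:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "category": "dress" } }
      ],
      "filter": [
        { "term": { "color": "red" } },
        { "range": { "price": { "lte": 50 } } }
      ]
    }
  }
}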

Key Takeaways

  • Start with hybrid search (BM25 + semantic) fused with RRF. It outperforms either approach alone with minimal tuning.
  • semantic_text + inference endpoints is the fastest path to production. Drop to manual vector field configuration only when you need control over HNSW parameters or quantization.
  • Budget for reindexing. Your embedding strategy will change as models improve and requirements evolve.
  • Measure relevance quantitatively before and after every change. "It feels smarter" is not an engineering metric.

Need help building AI-powered search? The experts at BigData Boutique are always ready for a new challenge. Get in touch.