Amazon OpenSearch Service can back vector indexes with S3 Vectors for cheap hybrid search at scale. Setup, query examples, limits, and the cost-latency tradeoff.

S3 Vectors with OpenSearch: Cost-Efficient Hybrid Vector Search

RAG and agentic systems need hybrid search, not just vector lookups. Pure k-NN gives you semantic recall but no usable ranking signal when the corpus is large or noisy. Lexical scoring, filters, and aggregations are still doing most of the work that makes retrieval actually useful, and they need to live next to the vectors. The hard part is keeping that whole stack affordable when the corpus reaches hundreds of millions of embeddings.

Amazon OpenSearch Service now lets you back a knn_vector field directly with Amazon S3 Vectors, keeping document fields and the lexical index in OpenSearch while pushing the vector storage into a serverless, S3-priced tier. You keep the OpenSearch query API, hybrid search, filters, and aggregations. You give up some latency, some recall tuning, and several index lifecycle features. For agentic retrieval, where a 100-200ms tail is usually fine and cost dominates, that tradeoff often makes sense.

Why hybrid search, and why this matters for cost

Vector search alone is bad at precision. Embeddings collapse meaning into a single similarity score, which is fine for "give me 50 candidates" but weak for "rank these candidates correctly." Production retrieval pipelines for RAG combine BM25-style lexical scoring, structured filters (tenant, language, ACL, freshness), aggregations, and dense retrieval, and then often rerank the merged set. OpenSearch's hybrid query and the Search Pipelines framework are built for exactly that workflow.

The cost problem shows up when the vector index gets large. HNSW graphs traditionally live in RAM to keep latency low, which means RAM-priced storage for billions of floats plus per-vector graph overhead. Disk-based ANN, quantization, and tiered storage help, but a hot HNSW cluster sized for 1B vectors is still an expensive piece of infrastructure. S3 Vectors targets the other end of the spectrum: cheap, durable, serverless storage with native vector APIs, optimized for up to 90% lower cost on large vector datasets at the price of higher and less predictable query latency.
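
To make that concrete, here is the raw storage math for a billion-vector corpus. No pricing is assumed, and the HNSW graph figure is a rough estimate (actual overhead depends on m and the engine):

  # Back-of-envelope sizing for 1B float32 embeddings at 384 dimensions.
  NUM_VECTORS = 1_000_000_000
  DIM = 384
  BYTES_PER_FLOAT = 4

  raw = NUM_VECTORS * DIM * BYTES_PER_FLOAT
  print(f"raw vectors: {raw / 1024**4:.2f} TiB")            # ~1.40 TiB

  # HNSW keeps neighbor lists per vector; with m=16 the base layer alone
  # stores roughly 2*m four-byte edges per vector (a rough estimate).
  M = 16
  graph = NUM_VECTORS * 2 * M * 4
  print(f"HNSW graph overhead: {graph / 1024**3:.0f} GiB")  # ~119 GiB

Keeping that ~1.5 TiB resident in RAM-priced cluster storage is exactly the bill S3 Vectors is designed to avoid.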

The cost-latency tradeoff in practice

Agentic workloads tolerate latency that interactive search does not. An agent typically issues retrieval as one step inside a multi-second reasoning loop, often with model-side latencies of 500ms to several seconds per tool call. Adding a vector lookup that takes 100ms instead of 10ms is rarely the bottleneck. The same is true for batch RAG indexing pipelines, internal knowledge bases, and long-tail search where freshness matters more than P50.

Where the S3-backed engine fits, and where it does not:

| Workload | Hot OpenSearch (HNSW) | OpenSearch + S3 Vectors |
| --- | --- | --- |
| Interactive autocomplete, recommendations | Strong fit | Avoid |
| Real-time semantic search (P95 <50ms) | Strong fit | Avoid |
| Agentic retrieval, tool-call RAG | Works, often overkill | Strong fit |
| Long-tail / archival semantic search | Expensive | Strong fit |
| Hybrid lexical + vector at billions of docs | Expensive | Strong fit |
| Deep retrieval (k > 100) with reranking | Works | Not supported |

A useful mental model: S3 Vectors is the cold-to-warm tier for embeddings. Hot HNSW is the serving tier. Many teams will run both and route based on age, popularity, or tenant.
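
A minimal routing sketch of that two-tier setup, assuming a hot HNSW index, a cold s3vector index, and a tenant keyword field; the index names, endpoint, and popularity set are illustrative, not part of the AWS integration:

  from opensearchpy import OpenSearch  # pip install opensearch-py

  client = OpenSearch(
      hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder endpoint
      use_ssl=True,
  )

  HOT_INDEX = "products-hot"     # HNSW-backed, serves latency-sensitive tenants
  COLD_INDEX = "products-s3vec"  # s3vector-backed, serves everyone else

  def tiered_knn(tenant, query_vector, hot_tenants):
      """Route a k-NN query to the hot or cold tier based on tenant popularity."""
      index = HOT_INDEX if tenant in hot_tenants else COLD_INDEX
      body = {
          "size": 10,
          "query": {
              "knn": {
                  "embedding": {
                      "vector": query_vector,
                      "k": 10,
                      "filter": {"term": {"tenant": tenant}},
                  }
              }
          },
      }
      return client.search(index=index, body=body)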

Setting up OpenSearch with S3 Vectors

The integration is exposed as a new knn_vector engine called s3vector. It is available on Amazon OpenSearch Service managed domains running OpenSearch 2.19 or later, with the S3 Vectors engine option enabled on the domain. Self-managed open-source OpenSearch does not currently support this engine. OpenSearch Serverless uses a separate import/export path described later in this post. The full reference is in the AWS docs for the S3 vector engine.

Create the index

The engine has to be set at index creation. You cannot promote an existing field to s3vector later; migrating means creating a new index with the right mapping and reindexing into it (a _reindex sketch follows the mapping example).

PUT products-s3vec
  {
    "settings": {
      "index": {
        "knn": true
      }
    },
    "mappings": {
      "properties": {
        "title":    { "type": "text" },
        "category": { "type": "keyword" },
        "price":    { "type": "float" },
        "embedding": {
          "type": "knn_vector",
          "dimension": 384,
          "space_type": "cosinesimil",
          "method": { "engine": "s3vector" }
        }
      }
    }
  }
  

Supported distance spaces are l2 and cosinesimil, and dimension must be between 1 and 4096.
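
If an existing corpus needs to move onto the engine, the standard _reindex path applies: create the new index with the s3vector mapping first, then copy documents across (index names illustrative):

POST _reindex
  {
    "source": { "index": "products" },
    "dest":   { "index": "products-s3vec" }
  }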

Index documents

Indexing looks identical to a standard OpenSearch index. Non-vector fields stay in the OpenSearch shards; the embedding field is offloaded to S3 Vectors transparently. (The vectors below are truncated to three components for readability; real documents must carry the full declared dimension, 384 here.)

POST _bulk
  { "index": { "_index": "products-s3vec", "_id": "1" } }
  { "title": "Wireless noise cancelling headphones", "category": "audio",    "price": 129.0, "embedding": [0.12, -0.03, 0.44] }
  { "index": { "_index": "products-s3vec", "_id": "2" } }
  { "title": "USB-C travel charger",                  "category": "chargers", "price": 39.0,  "embedding": [0.08,  0.11, -0.22] }
  

Query: k-NN, filtered k-NN, and hybrid

Standard _search works without changes. Vector queries against the embedding field are routed to S3 Vectors; everything else stays in OpenSearch.

GET products-s3vec/_search
  {
    "size": 10,
    "query": {
      "knn": {
        "embedding": {
          "vector": [0.10, -0.02, 0.41],
          "k": 10,
          "filter": {
            "bool": {
              "filter": [
                { "term":  { "category": "audio" } },
                { "range": { "price": { "lte": 200 } } }
              ]
            }
          }
        }
      }
    }
  }
  

Hybrid queries combine lexical and vector scoring through the standard OpenSearch hybrid query:

GET products-s3vec/_search
  {
    "size": 10,
    "query": {
      "hybrid": {
        "queries": [
          { "match": { "title": "noise cancelling headphones" } },
          { "knn":   { "embedding": { "vector": [0.10, -0.02, 0.41], "k": 10 } } }
        ]
      }
    }
  }
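
One OpenSearch detail the example glosses over: hybrid scores are normalized and combined by a search pipeline, so you need one configured. A minimal setup; the pipeline name, min_max/arithmetic_mean choice, and weights are illustrative defaults:

PUT _search/pipeline/hybrid-norm
  {
    "description": "Normalize lexical and vector scores, then combine them",
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": { "technique": "min_max" },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": { "weights": [0.3, 0.7] }
          }
        }
      }
    ]
  }

Reference it per request with ?search_pipeline=hybrid-norm, or set it as the index's index.search.default_pipeline.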
  

Two things to keep in mind. Maximum k is 100, which rules out the "retrieve 500, rerank to 20" pattern common in research-grade RAG. And filters are applied as post-filters with oversampling heuristics, so very selective filters can hurt recall. Both are documented in the AWS engine reference.

Limitations to plan around

The S3-backed engine is a different beast than HNSW. Latency is the obvious tradeoff, but the operational surface is also smaller. Realistic expectations, based on the AWS positioning and field behavior:

| Metric | Hot OpenSearch (HNSW) | OpenSearch + S3 Vectors |
| --- | --- | --- |
| P50 latency | 5-20 ms | 50-120 ms |
| P95 latency | 20-60 ms | 120-300 ms |
| P99 latency | 50-150 ms | 300-800+ ms |
| Max k | thousands | 100 |
| Filter mode | pre-filter (efficient) | post-filter with oversampling |
| Recall tuning | ef_search, m, IVF probes | none exposed |
| Snapshots, split, shrink, clone | supported | not supported |
| UltraWarm migration, CCR, radial search | supported | not supported |
| Engine choice mutable after creation | yes (reindex-free for some changes) | no, fixed at index creation |

A few of these bite harder than they look on paper. No snapshots means your backup strategy has to live in the data pipeline, not in OpenSearch. Post-filtering with hard ceilings on k means heavy filters compound: a tight tenant filter on a 100-result set leaves very few candidates. And cold-vs-warm behavior is real - first queries against a cold partition pay an S3 fetch, so bursty workloads see uneven tail latency until caches fill.

There are also workloads where you should not use this at all. Anything that needs sub-20ms latency, very high QPS per shard, deep retrieval with reranking, custom script scoring on vectors, or radial search belongs on a hot HNSW index. Build for the workload; do not pick the engine first.

Promoting hot data to OpenSearch Serverless

The cleanest production RAG pattern is tiered: keep the bulk of embeddings in S3 Vectors for cost, and move the hot subset into a fast tier when it starts taking real query traffic. Amazon OpenSearch Serverless supports importing vector data directly from S3 vector buckets, which removes the need to run a custom backfill job whenever a corpus or partition gets promoted.

Concretely: store the cold corpus in S3 Vectors, query it through OpenSearch with the s3vector engine for agentic retrieval, and trigger an import into a Serverless vector collection when a tenant, time window, or document set crosses a query-volume threshold. The Serverless collection then serves the latency-sensitive traffic with hot HNSW; the cold corpus stays in S3. Most teams do not need this from day one, but it is the pattern that holds up at scale.
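
A sketch of the promotion trigger. The import kickoff is left as a stub because it runs through the OpenSearch Serverless import workflow, not a single client call; the threshold, window, and function names here are illustrative:

  from collections import Counter

  QUERY_THRESHOLD = 10_000  # per-window volume that justifies the hot tier (illustrative)
  query_counts = Counter()  # tenant -> queries in the current window

  def record_query(tenant):
      """Call on every retrieval; promote tenants that outgrow the cold tier."""
      query_counts[tenant] += 1
      if query_counts[tenant] == QUERY_THRESHOLD:
          promote(tenant)

  def promote(tenant):
      # Stub: start the S3 Vectors -> OpenSearch Serverless import for this
      # tenant's partition, then flip the query router to the hot collection
      # once the import completes.
      print(f"promote {tenant}: trigger Serverless import from the S3 vector bucket")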

Key takeaways

  • Hybrid search beats pure k-NN for RAG and agentic retrieval because lexical scoring, filters, and aggregations carry most of the ranking signal.
  • The s3vector engine in Amazon OpenSearch Service backs knn_vector fields with S3 Vectors, trading latency and operational flexibility for substantially lower vector storage cost.
  • It works well for agentic workloads, batch RAG indexing, long-tail semantic search, and billion-vector hybrid corpora. It does not fit interactive low-latency search.
  • The engine choice is fixed at index creation, snapshots and several index ops are unavailable, k caps at 100, and filters are post-filters with oversampling.
  • A tiered architecture - S3 Vectors for cold, OpenSearch HNSW or Serverless for hot - is the production pattern, and AWS supports direct import from S3 Vectors into OpenSearch Serverless.

If you are designing or scaling a hybrid search stack on AWS and want a second opinion on where the S3-backed tier fits, we do this for a living.