Amazon OpenSearch Service can now back vector indexes with S3 Vectors, enabling cheap hybrid search at scale. This post covers setup, query examples, limits, and the cost-latency tradeoff.
RAG and agentic systems need hybrid search, not just vector lookups. Pure k-NN gives you semantic recall but no usable ranking signal when the corpus is large or noisy. Lexical scoring, filters, and aggregations are still doing most of the work that makes retrieval actually useful, and they need to live next to the vectors. The hard part is keeping that whole stack affordable when the corpus reaches hundreds of millions of embeddings.
Amazon OpenSearch Service now lets you back a `knn_vector` field directly with Amazon S3 Vectors, keeping document fields and the lexical index in OpenSearch while pushing vector storage into a serverless, S3-priced tier. You keep the OpenSearch query API, hybrid search, filters, and aggregations. You give up some latency, some recall tuning, and several index lifecycle features. For agentic retrieval, where a 100-200ms tail is usually fine and cost dominates, that tradeoff often makes sense.
## Why hybrid search, and why this matters for cost
Vector search alone is bad at precision. Embeddings collapse meaning into a single similarity score, which is fine for "give me 50 candidates" but weak for "rank these candidates correctly." Production retrieval pipelines for RAG combine BM25-style lexical scoring, structured filters (tenant, language, ACL, freshness), aggregations, and dense retrieval, and then often rerank the merged set. OpenSearch's hybrid query and the Search Pipelines framework are built for exactly that workflow.
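The fusion step is conceptually simple; OpenSearch's normalization processor does a version of this server-side. Below is a standalone sketch of the idea — min-max normalization of each result list followed by a weighted combination — not OpenSearch's internal implementation, and the function names are ours:

```python
def min_max_normalize(scores):
    """Scale a list of raw scores into [0, 1] (min-max normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(bm25_hits, knn_hits, w_lexical=0.5, w_vector=0.5):
    """Combine lexical and vector result lists by normalized weighted sum.

    Each argument maps doc_id -> raw score. A doc missing from one list
    contributes 0 for that component, mirroring how hybrid fusion treats
    non-overlapping candidate sets.
    """
    bm25_norm = dict(zip(bm25_hits, min_max_normalize(list(bm25_hits.values())))) if bm25_hits else {}
    knn_norm = dict(zip(knn_hits, min_max_normalize(list(knn_hits.values())))) if knn_hits else {}
    fused = {}
    for doc in set(bm25_norm) | set(knn_norm):
        fused[doc] = w_lexical * bm25_norm.get(doc, 0.0) + w_vector * knn_norm.get(doc, 0.0)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

The point of normalizing first is that BM25 scores and cosine similarities live on incompatible scales; adding them raw lets one signal silently dominate.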
The cost problem shows up when the vector index gets large. HNSW graphs traditionally live in RAM to keep latency low, which means RAM-priced storage for billions of floats plus per-vector graph overhead. Disk-based ANN, quantization, and tiered storage help, but a hot HNSW cluster sized for 1B vectors is still an expensive piece of infrastructure. S3 Vectors targets the other end of the spectrum: cheap, durable, serverless storage with native vector APIs, optimized for up to 90% lower cost on large vector datasets at the price of higher and less predictable query latency.
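To see why, it helps to put rough numbers on the hot tier. A back-of-envelope sketch, using the ~1.1 × (4·d + 8·M) bytes-per-vector HNSW sizing rule of thumb from the OpenSearch k-NN documentation (exact overhead varies by engine, M, and quantization):

```python
def hnsw_ram_bytes(num_vectors: int, dimension: int, m: int = 16) -> int:
    """Estimate RAM for an fp32 HNSW index: roughly 1.1 * (4*d + 8*M)
    bytes per vector (vector data plus graph links plus ~10% overhead).
    A sizing heuristic, not an exact figure."""
    return int(1.1 * (4 * dimension + 8 * m) * num_vectors)

# 1B vectors at 768 dimensions with M=16: ~3.5 TB of RAM-class storage,
# before replicas. That is the bill S3 Vectors is trying to shrink.
terabytes = hnsw_ram_bytes(1_000_000_000, 768) / 1e12
```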
## The cost-latency tradeoff in practice
Agentic workloads tolerate latency that interactive search does not. An agent typically issues retrieval as one step inside a multi-second reasoning loop, often with model-side latencies of 500ms to several seconds per tool call. Adding a vector lookup that takes 100ms instead of 10ms is rarely the bottleneck. The same is true for batch RAG indexing pipelines, internal knowledge bases, and long-tail search where freshness matters more than P50.
Where the S3-backed engine fits, and where it does not:
| Workload | Hot OpenSearch (HNSW) | OpenSearch + S3 Vectors |
|---|---|---|
| Interactive autocomplete, recommendations | Strong fit | Avoid |
| Real-time semantic search (P95 <50ms) | Strong fit | Avoid |
| Agentic retrieval, tool-call RAG | Works, often overkill | Strong fit |
| Long-tail / archival semantic search | Expensive | Strong fit |
| Hybrid lexical + vector at billions of docs | Expensive | Strong fit |
| Deep retrieval (k > 100) with reranking | Works | Not supported |
A useful mental model: S3 Vectors is the cold-to-warm tier for embeddings. Hot HNSW is the serving tier. Many teams will run both and route based on age, popularity, or tenant.
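That routing can be as simple as a pure function in the ingestion path. A sketch with made-up thresholds — the hot-tenant set, age cutoff, and query-count proxy are all placeholders you would tune, not AWS guidance:

```python
def choose_tier(doc_age_days, queries_last_7d, hot_tenants, tenant,
                max_hot_age_days=30, min_hot_queries=100):
    """Route an embedding partition to the hot HNSW tier or the
    S3-backed tier. Thresholds are illustrative placeholders."""
    if tenant in hot_tenants:
        return "hot-hnsw"          # latency-sensitive tenants always stay hot
    if doc_age_days <= max_hot_age_days and queries_last_7d >= min_hot_queries:
        return "hot-hnsw"          # fresh and popular: worth RAM pricing
    return "s3vector"              # everything else goes to the cheap tier
```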
## Setting up OpenSearch with S3 Vectors
The integration is exposed as a new `knn_vector` engine called `s3vector`. It is available on Amazon OpenSearch Service managed domains running OpenSearch 2.19 or later, with the S3 Vectors engine option enabled on the domain. Self-managed open-source OpenSearch does not currently support this engine. OpenSearch Serverless uses a separate import/export path described later in this post. The full reference is in the AWS docs for the S3 vector engine.
### Create the index
The engine has to be set at index creation. You cannot promote an existing field to `s3vector` later; that requires a reindex.
```json
PUT products-s3vec
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "category": { "type": "keyword" },
      "price": { "type": "float" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "space_type": "cosinesimil",
        "method": { "engine": "s3vector" }
      }
    }
  }
}
```
Supported distance spaces are `l2` and `cosinesimil`. Dimensions must be between 1 and 4096.
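If you generate mappings programmatically, it is worth enforcing those constraints client-side so a bad dimension fails fast instead of at index creation. A small helper — the function name and structure are ours, the JSON shape mirrors the mapping above:

```python
def s3vector_mapping(field: str, dimension: int, space_type: str = "cosinesimil") -> dict:
    """Build the knn_vector mapping fragment for the s3vector engine,
    validating the documented constraints (dimension 1-4096, l2 or
    cosinesimil). Helper name is illustrative, not an AWS API."""
    if not 0 < dimension <= 4096:
        raise ValueError("s3vector supports dimensions 1 through 4096")
    if space_type not in ("l2", "cosinesimil"):
        raise ValueError("s3vector supports space_type l2 or cosinesimil")
    return {
        field: {
            "type": "knn_vector",
            "dimension": dimension,
            "space_type": space_type,
            "method": {"engine": "s3vector"},
        }
    }
```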
### Index documents
Indexing looks identical to a standard OpenSearch index. Non-vector fields stay in the OpenSearch shards; the embedding field is offloaded to S3 Vectors transparently. (The example vectors below are truncated to three components for readability; real documents must match the mapped dimension.)
```json
POST _bulk
{ "index": { "_index": "products-s3vec", "_id": "1" } }
{ "title": "Wireless noise cancelling headphones", "category": "audio", "price": 129.0, "embedding": [0.12, -0.03, 0.44] }
{ "index": { "_index": "products-s3vec", "_id": "2" } }
{ "title": "USB-C travel charger", "category": "chargers", "price": 39.0, "embedding": [0.08, 0.11, -0.22] }
```
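If you are not using a client library's bulk helper, the NDJSON payload is straightforward to build by hand. A minimal sketch — in production, prefer the official opensearch-py bulk helpers:

```python
import json

def bulk_body(index: str, docs: list) -> str:
    """Serialize docs (dicts carrying an '_id' key) into the NDJSON
    _bulk payload format: one action line, then one source line, per
    document, terminated by a newline."""
    lines = []
    for doc in docs:
        doc = dict(doc)  # copy so we can pop _id without mutating input
        action = {"index": {"_index": index, "_id": doc.pop("_id")}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```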
### Query: k-NN, filtered k-NN, and hybrid
Standard `_search` works without changes. Vector queries against the `embedding` field are routed to S3 Vectors; everything else stays in OpenSearch.
```json
GET products-s3vec/_search
{
  "size": 10,
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.10, -0.02, 0.41],
        "k": 10,
        "filter": {
          "bool": {
            "filter": [
              { "term": { "category": "audio" } },
              { "range": { "price": { "lte": 200 } } }
            ]
          }
        }
      }
    }
  }
}
```
Hybrid queries combine lexical and vector scoring through the standard OpenSearch hybrid query:
```json
GET products-s3vec/_search
{
  "size": 10,
  "query": {
    "hybrid": {
      "queries": [
        { "match": { "title": "noise cancelling headphones" } },
        { "knn": { "embedding": { "vector": [0.10, -0.02, 0.41], "k": 10 } } }
      ]
    }
  }
}
```
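Note that OpenSearch scores hybrid queries through a search pipeline containing a normalization-processor, which normalizes and combines the sub-query scores. A minimal pipeline might look like this (the pipeline name and the weights are illustrative; check the OpenSearch hybrid search docs for your version):

```json
PUT _search/pipeline/hybrid-minmax
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.3, 0.7] }
        }
      }
    }
  ]
}
```

Attach it per request with `?search_pipeline=hybrid-minmax`, or set it as the index's default search pipeline.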
Two things to keep in mind. Maximum `k` is 100, which rules out the "retrieve 500, rerank to 20" pattern common in research-grade RAG. And filters are applied as post-filters with oversampling heuristics, so very selective filters can hurt recall. Both limits are documented in the AWS engine reference.
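The interaction between the `k` ceiling and post-filtering is worth quantifying. Under a uniformity assumption — filter matches spread evenly through the ANN candidates; the real oversampling heuristics are internal to the service — the expected usable result count falls linearly with filter selectivity:

```python
def expected_survivors(k: int, filter_selectivity: float,
                       oversample_factor: float = 1.0) -> float:
    """Expected results left after post-filtering an ANN candidate set,
    assuming matches are uniformly distributed among candidates. The
    oversample factor models engines that fetch extra candidates before
    filtering; the fetch is still capped by the hard k=100 ceiling."""
    fetched = min(int(k * oversample_factor), 100)
    return fetched * filter_selectivity

# A 1% tenant filter on k=100 leaves ~1 usable hit on average,
# no matter how much you oversample: the ceiling binds first.
```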
## Limitations to plan around
The S3-backed engine is a different beast than HNSW. Latency is the obvious tradeoff, but the operational surface is also smaller. Realistic expectations, based on the AWS positioning and field behavior:
| Metric | Hot OpenSearch (HNSW) | OpenSearch + S3 Vectors |
|---|---|---|
| P50 latency | 5-20 ms | 50-120 ms |
| P95 latency | 20-60 ms | 120-300 ms |
| P99 latency | 50-150 ms | 300-800+ ms |
| Max `k` | thousands | 100 |
| Filter mode | pre-filter (efficient) | post-filter with oversampling |
| Recall tuning | `ef_search`, `m`, IVF probes | none exposed |
| Snapshots, split, shrink, clone | supported | not supported |
| UltraWarm migration, CCR, radial search | supported | not supported |
| Engine choice mutable after creation | yes (reindex-free for some changes) | no, fixed at index creation |
A few of these bite harder than they look on paper. No snapshots means your backup strategy has to live in the data pipeline, not in OpenSearch. Post-filtering with a hard ceiling on `k` means heavy filters compound: a tight tenant filter on a 100-result set leaves very few candidates. And cold-versus-warm behavior is real: first queries against a cold partition pay an S3 fetch, so bursty workloads see uneven tail latency until caches fill.
There are also workloads where you should not use this at all. Anything that needs sub-20ms latency, very high QPS per shard, deep retrieval with reranking, custom script scoring on vectors, or radial search belongs on a hot HNSW index. Build for the workload; do not pick the engine first.
## Promoting hot data to OpenSearch Serverless
The cleanest production RAG pattern is tiered: keep the bulk of embeddings in S3 Vectors for cost, and move the hot subset into a fast tier when it starts taking real query traffic. Amazon OpenSearch Serverless supports importing vector data directly from S3 vector buckets, which removes the need to run a custom backfill job whenever a corpus or partition gets promoted.
Concretely: store the cold corpus in S3 Vectors, query it through OpenSearch with the s3vector engine for agentic retrieval, and trigger an import into a Serverless vector collection when a tenant, time window, or document set crosses a query-volume threshold. The Serverless collection then serves the latency-sensitive traffic with hot HNSW; the cold corpus stays in S3. Most teams do not need this from day one, but it is the pattern that holds up at scale.
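The promotion trigger itself can be trivial. A sketch of the decision step with an illustrative threshold — the actual import into a Serverless collection is the AWS-side operation; this only decides when to kick it off:

```python
def should_promote(query_counts: dict, threshold: int = 10_000) -> list:
    """Return partitions (tenants, time windows, document sets) whose
    recent query volume crosses an illustrative threshold and should be
    imported into a hot OpenSearch Serverless vector collection."""
    return sorted(p for p, n in query_counts.items() if n >= threshold)
```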
## Key takeaways
- Hybrid search beats pure k-NN for RAG and agentic retrieval because lexical scoring, filters, and aggregations carry most of the ranking signal.
- The `s3vector` engine in Amazon OpenSearch Service backs `knn_vector` fields with S3 Vectors, trading latency and operational flexibility for substantially lower vector storage cost.
- It works well for agentic workloads, batch RAG indexing, long-tail semantic search, and billion-vector hybrid corpora. It does not fit interactive low-latency search.
- The engine choice is fixed at index creation, snapshots and several index ops are unavailable, `k` caps at 100, and filters are post-filters with oversampling.
- A tiered architecture, with S3 Vectors for cold data and OpenSearch HNSW or Serverless for hot data, is the production pattern, and AWS supports direct import from S3 Vectors into OpenSearch Serverless.
If you are designing or scaling a hybrid search stack on AWS and want a second opinion on where the S3-backed tier fits, we do this for a living.