Learn how vector search optimization strategies evolve from millions to billions of vectors. Practical guide covering HNSW tuning, sharding, compression, and distributed querying for production ANN systems at scale.

Vector search looks deceptively simple. You embed data, store vectors, run nearest-neighbor search, and get semantically relevant results. At small scale, this works almost effortlessly - a single index, a few tuning knobs, and latency looks great.

Then data grows.

Suddenly, queries slow down, memory usage explodes, recall becomes unstable, and the system behaves very differently from what intuition suggests. Teams often respond by "turning knobs up": more shards, higher recall settings, bigger machines - which frequently makes things worse.

This article focuses on practical, production-grade vector search performance, assuming approximate nearest neighbor (ANN) search from the start. We'll walk through how optimization strategies evolve as you move from millions, to hundreds of millions, to billion-scale vector systems, and - more importantly - why the advice changes at each stage.

The Basics

Before talking about scale, it’s important to understand the most common mistakes teams make when deploying vector search.

What usually hurts performance

Treating vector search like keyword search

Keyword search relies on inverted indexes and term statistics. Vector search relies on distance computations and graph or clustering traversal. Concepts like "search all shards just in case" or "always return exact matches" don’t translate well. ANN systems are probabilistic by design, and forcing them into deterministic mental models leads to inefficient configurations.

Over-indexing with frequent updates

ANN indexes - especially graph-based ones - assume a relatively stable dataset. Constant streaming inserts and deletes degrade graph structure over time, which increases traversal cost and hurts recall. Unlike keyword indexes, more updates don’t just cost CPU - they degrade search quality itself.

Searching too many shards "just in case"

By "shard" we mean a partition of the data - essentially a self-contained group of vectors searched as one unit. Every shard you query adds:

  • network hops
  • CPU scheduling
  • result merging cost

At scale, fan-out dominates latency far more than distance computation.

Chasing 99.99% recall

That last fraction of recall is brutally expensive. ANN recall curves are steep at first and flatten quickly. Going from 95% to 97% might cost 20% more latency. Going from 97% to 99.9% can cost multiples.

Ignoring memory layout and CPU cache behavior

Distance computation is a tight loop over floating-point arrays. Cache misses, branch mispredictions, and memory alignment matter far more than most people expect.

General principles you should apply everywhere

Prefer ANN over exact search

Exact kNN requires comparing the query vector against every candidate. Beyond small datasets, this is fundamentally unscalable. ANN trades a small amount of accuracy for massive reductions in computation - and is the only viable approach at scale.
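
To make the cost concrete, here is a minimal brute-force kNN sketch in NumPy (the corpus size and dimensionality are illustrative assumptions): every query has to touch every vector, so work grows linearly with both corpus size and dimensionality.

    import numpy as np

    # Illustrative scale: 1M x 768-dim float32 vectors is ~3 GB,
    # and every single query scans all of it.
    corpus = np.random.rand(1_000_000, 768).astype(np.float32)
    query = np.random.rand(768).astype(np.float32)

    def exact_knn(query, corpus, k=10):
        # O(N * D) work per query: one distance per stored vector, no index.
        dists = np.linalg.norm(corpus - query, axis=1)
        return np.argpartition(dists, k)[:k]   # indices of the k nearest (unordered)

    top_ids = exact_knn(query, corpus)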

Tune recall–latency trade-offs intentionally

Recall is not a free metric. Every point of recall has a cost in CPU cycles, memory accesses, or IO. Define acceptable recall early and optimize towards it - not beyond it.

Measure end-to-end latency

Vector search rarely lives alone. Real pipelines include:

  • embedding lookup
  • metadata filters
  • ANN retrieval
  • reranking
  • network aggregation

Optimizing only the ANN portion often misses the real bottleneck.

Separate indexing from querying

Index build speed wants batching, relaxed consistency, and parallelism. Query speed wants locality, predictability, and stability. Mixing these concerns leads to poor results on both sides.

These principles hold regardless of whether you use OpenSearch, Elasticsearch, Vespa, Weaviate, Pinecone, or any other engine.

Millions of Vectors (≈1M–50M)

At this scale, vector search performance is mostly about not over-engineering. Anything below a few million vectors can work well with exact kNN - brute-force vector search. Once you pass, say, the 5M mark, you are in the ANN world (unless you are willing to pay for exactness in latency), and this is where the fun begins.

Index & algorithm choices

HNSW is the right default

Graph-based ANN methods like HNSW shine here. They provide excellent recall with very low latency because the entire graph fits in memory and traversal depth remains small.

Why HNSW tuning matters

Parameters like M, ef_search, and ef_construction directly control graph density and traversal cost:

  • A denser graph improves recall but increases memory and traversal overhead.
  • Higher ef_search explores more neighbors per query, improving recall at the cost of CPU time.

At this scale, conservative tuning usually outperforms aggressive defaults.
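
As a concrete illustration, here is how those knobs map onto the hnswlib library; the dataset is synthetic and the parameter values are starting points to sweep, not recommendations for your data.

    import hnswlib
    import numpy as np

    dim, num_elements = 768, 100_000        # illustrative sizes
    data = np.random.rand(num_elements, dim).astype(np.float32)

    # M controls graph density; ef_construction controls build-time effort.
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=num_elements, M=16, ef_construction=200)
    index.add_items(data, np.arange(num_elements))

    # ef_search (set_ef) controls how many candidates each query explores:
    # higher values buy recall with CPU time.
    index.set_ef(100)
    labels, distances = index.knn_query(data[:1], k=10)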

Avoid premature quantization

Quantization introduces approximation error and operational complexity. If your vectors fit comfortably in RAM, raw floats are usually faster and simpler.

Hardware & memory

RAM residency is non-negotiable

ANN relies on pointer-heavy graph traversal. Once nodes spill to disk, latency becomes dominated by page faults rather than computation.

High clock speed beats core count

Queries are latency-sensitive and often single-threaded per request. Faster cores matter more than more cores.

NUMA awareness matters earlier than expected

Cross-socket memory access adds measurable latency. Binding threads to memory-local cores improves tail latency even at modest scale.

Data & embeddings

Dimensionality is a hidden tax

Every extra dimension adds cost to every distance computation. PCA or model-level dimension reduction often yields better performance with negligible quality loss.

Normalize once, not per query

Cosine similarity becomes a dot product when vectors are normalized. Doing this at ingest saves CPU on every query.
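
A minimal sketch of the idea in NumPy (array sizes are illustrative): normalize at ingest, and cosine similarity at query time collapses to a dot product.

    import numpy as np

    def l2_normalize(vectors: np.ndarray) -> np.ndarray:
        # Done once at ingest; afterwards cosine similarity == dot product.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.clip(norms, 1e-12, None)

    corpus = l2_normalize(np.random.rand(100_000, 768).astype(np.float32))
    query = l2_normalize(np.random.rand(1, 768).astype(np.float32))[0]

    # Query time: no per-vector norm computation, just one matrix-vector product.
    scores = corpus @ query
    top10 = np.argsort(-scores)[:10]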

Keep the hot path simple

Mixing sparse and dense scoring inside the ANN loop complicates execution and hurts predictability.

Query optimization

Large k multiplies cost

ANN doesn’t magically make large result sets cheap. Asking for 200 neighbors instead of 20 often means 10× more work downstream.

Filters are double-edged

Highly selective filters can reduce search space. Weak filters add overhead without pruning enough candidates.

Cache where distribution is skewed

Many workloads have repeated or similar queries. Caching centroids or results pays off quickly.

Index lifecycle

Batch inserts preserve graph quality

Streaming inserts fragment the graph and increase traversal depth.

Periodic rebuilds are healthy

Fresh graphs are faster and more accurate than heavily mutated ones.

Separate hot and cold data

Fresh content and archival data have different access patterns - treat them differently.

Tens to Hundreds of Millions (≈50M–500M)

This is the scale where vector search stops being "just an index" and starts behaving like a distributed system. Up to a few tens of millions of vectors, most problems can be solved by adding memory and tuning ANN parameters. Beyond that, latency and stability are dominated by shard fan-out, memory pressure, and coordination overhead - not distance computation itself.

Index structure

Shard by data, not by query patterns

ANN indexes rely on dense, coherent vector spaces. Sharding by tenant, category, or query semantics fragments that space and forces you to query more shards to recover recall. A better model is simple: each shard holds a complete ANN index over a subset of the data, and queries fan out to all shards in parallel.

Why shard size matters

Shard size is a balancing act. Large shards increase graph traversal depth and reduce cache locality. Small shards increase coordination cost and tail latency. In practice, many production systems converge on 10–30 million vectors per shard as a workable middle ground.
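
A quick back-of-the-envelope check makes the trade-off tangible. The figures below are illustrative assumptions (768-dim float32 vectors and a crude estimate of HNSW link overhead), not measurements from any particular engine:

    # Rough memory estimate for one shard of raw float32 vectors plus graph links.
    dim = 768                       # assumed embedding dimensionality
    vectors_per_shard = 20_000_000  # mid-range of the 10-30M guideline above
    bytes_per_vector = dim * 4      # float32
    graph_overhead = 2 * 16 * 4     # crude: ~M=16 bidirectional links, 4 bytes each

    total_gb = vectors_per_shard * (bytes_per_vector + graph_overhead) / 1e9
    print(f"~{total_gb:.0f} GB per shard before any compression")   # ~64 GB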

Over-sharding kills tail latency

Average latency often looks fine even when over-sharded. p99 does not. Every shard adds variance, and at scale the slowest shard determines query latency. Over-sharding multiplies that risk.

Memory efficiency

At this scale, raw float vectors become prohibitively expensive. Memory, not CPU, becomes the primary bottleneck.

Compression becomes mandatory

Two techniques are typically combined:

  • PCA reduces dimensionality.
  • Quantization reduces per-dimension storage.

The two attack memory pressure from different angles, which is why they complement each other so well.
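
A minimal sketch of combining the two with scikit-learn and NumPy; the target dimensionality and the simple symmetric int8 scheme are illustrative assumptions, and in practice most engines expose their own quantization options.

    import numpy as np
    from sklearn.decomposition import PCA

    vectors = np.random.rand(100_000, 768).astype(np.float32)   # illustrative corpus

    # 1. PCA: fewer dimensions means fewer bytes and fewer FLOPs per distance.
    pca = PCA(n_components=256)
    reduced = pca.fit_transform(vectors).astype(np.float32)

    # 2. Scalar quantization: store each dimension as int8 instead of float32.
    scale = np.abs(reduced).max() / 127.0
    codes = np.round(reduced / scale).astype(np.int8)

    # 768 dims * 4 bytes = 3072 bytes/vector  ->  256 dims * 1 byte = 256 bytes/vector (~12x)
    print(vectors.nbytes // len(vectors), codes.nbytes // len(codes))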

Why two-phase retrieval works

Most candidates do not deserve expensive computation. Compressed vectors are cheap and cache-friendly, while original vectors are expensive and only useful for the final ranking. Two-phase retrieval embraces this asymmetry: use compressed vectors for broad candidate generation, then rerank only the top-K with full-precision vectors. This delivers most of the quality at a fraction of the cost. First pass: cheap, approximate, wide. Second pass: expensive, exact, narrow.
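
A sketch of the pattern, assuming int8-compressed vectors like those above are kept in memory and the original float32 vectors are kept on the side (all names and sizes here are illustrative):

    import numpy as np

    def two_phase_search(query_f32, query_i8, codes, originals, k=10, candidates=200):
        # Phase 1: cheap, approximate, wide - score every int8 code.
        coarse = codes.astype(np.int32) @ query_i8.astype(np.int32)
        shortlist = np.argpartition(-coarse, candidates)[:candidates]

        # Phase 2: expensive, exact, narrow - rerank only the shortlist in float32.
        exact = originals[shortlist] @ query_f32
        return shortlist[np.argsort(-exact)[:k]]

    rng = np.random.default_rng(0)
    originals = rng.random((100_000, 256), dtype=np.float32)
    scale = originals.max() / 127.0
    codes = np.round(originals / scale).astype(np.int8)
    query = rng.random(256, dtype=np.float32)
    top = two_phase_search(query, np.round(query / scale).astype(np.int8), codes, originals)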

Memory-mapping cold segments

When supported, memory-mapping colder data reduces RAM pressure while keeping hot data fast - a useful compromise once datasets grow large.

Query execution

Early termination matters

ANN search has diminishing returns. Once recall targets are reached, continuing traversal wastes CPU for marginal gains and directly increases tail latency under load.

SIMD and AVX are real wins

Distance computation maps well to vectorized CPU instructions. At this scale, low-level efficiency improvements translate directly into lower latency and higher throughput.

Operational tuning

Dedicated vector nodes

Vector search is highly sensitive to cache pollution and memory bandwidth contention. Isolating vector workloads improves latency stability.

Throttle indexing aggressively

Indexing competes with querying in subtle ways - memory bandwidth, cache locality, allocator pressure. Throttling indexing during peak query hours often yields better overall performance than adding hardware; better yet, perform indexing on separate infrastructure altogether.

Monitor recall drift

As data distributions change, ANN quality degrades quietly. Without explicit recall monitoring, relevance issues often surface late and unexpectedly.
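
One practical way to catch drift is to periodically compare ANN results against brute-force ground truth on a small sampled query set. A hedged sketch, where ann_search stands in for whatever query call your engine exposes:

    import numpy as np

    def recall_at_k(ann_search, sample_queries, corpus, k=10):
        # Estimate recall@k against exact ground truth; run it offline or off-peak,
        # since the brute-force pass is expensive by design.
        hits = 0
        for q in sample_queries:
            true_ids = np.argpartition(np.linalg.norm(corpus - q, axis=1), k)[:k]
            ann_ids = ann_search(q, k)          # engine-specific call (assumed)
            hits += len(set(true_ids) & set(ann_ids))
        return hits / (k * len(sample_queries))

    # Track this number over time and alert when it drops below your recall target.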

Billion Scale (≈500M–10B+)

At billion scale, a hard limit appears: you cannot search everything anymore, even approximately. Architecture now matters more than individual ANN parameters.

Architecture

Hierarchical retrieval is unavoidable

Billion-scale systems must first narrow the search space using coarse clustering (e.g., IVF-style centroids), then run ANN within the selected regions. Blind fan-out across the entire corpus is no longer viable.
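
As one concrete illustration, FAISS expresses this pattern directly: a coarse quantizer assigns vectors to IVF cells, and each query searches only the nprobe closest cells instead of the whole corpus. The cell count and PQ settings below are illustrative, not tuned recommendations.

    import faiss
    import numpy as np

    d = 768
    train = np.random.rand(200_000, d).astype(np.float32)   # illustrative training sample

    # Coarse layer: nlist centroids partition the space; PQ compresses what's inside.
    nlist, m, nbits = 4096, 64, 8            # 64 sub-vectors x 8 bits = 64 bytes/vector
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(train)
    index.add(train)

    # Fan-out control: only the closest nprobe cells are searched per query.
    index.nprobe = 16
    distances, ids = index.search(train[:1], 10)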

Multi-tier indexes reflect reality

Hot data deserves fast, high-quality indexes. Cold data must be aggressively compressed. Treating all data the same is both expensive and slow.

Fan-out control is existential

Every unnecessary shard or cluster touched adds network hops, coordination, and tail risk. Strict fan-out control determines whether the system scales or collapses.

Index & compression

Quantization is no longer optional

To remain memory-resident, vectors must shrink by an order of magnitude or more. Techniques like IVF-PQ or IVF-SQ are table stakes at this scale.
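
The arithmetic behind "no longer optional" is blunt; with illustrative numbers (1B vectors, 768 dimensions):

    # Illustrative memory math for 1B x 768-dim vectors.
    n, d = 1_000_000_000, 768
    raw_tb = n * d * 4 / 1e12        # ~3.1 TB of float32 - not realistically RAM-resident
    pq_gb = n * 64 / 1e9             # ~64 GB with 64-byte PQ codes (IVF-PQ style)
    print(f"raw: {raw_tb:.1f} TB, PQ-compressed: {pq_gb:.0f} GB")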

Centroid tuning is delicate

Too few centroids hurt recall. Too many increase routing and coordination overhead. Finding the balance is iterative and data-dependent.

Asymmetric Distance Computation (ADC) trades compute for memory locality

ADC computes distances between full-precision queries and compressed vectors, preserving acceptable accuracy while improving memory locality and speed.
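
A minimal NumPy sketch of the idea for product-quantized codes (sub-vector count and codebook size are illustrative): the full-precision query is expanded once into per-subspace lookup tables, and each compressed vector's distance is then just a sum of table lookups - no decompression of the database side.

    import numpy as np

    def adc_distances(query, codebooks, codes):
        # query:     (d,) float32 - kept at full precision
        # codebooks: (m, 256, d // m) - one 256-entry codebook per sub-vector
        # codes:     (n, m) uint8 - each database vector stored as m codebook indices
        m, ksub, dsub = codebooks.shape
        q_sub = query.reshape(m, dsub)

        # Build lookup tables once per query: squared distance from each query
        # sub-vector to every codebook entry. Shape: (m, 256).
        tables = ((codebooks - q_sub[:, None, :]) ** 2).sum(axis=-1)

        # Distance to each compressed vector = sum of m table lookups.
        return tables[np.arange(m), codes].sum(axis=1)

    m, dsub, n = 8, 96, 100_000                          # 8 x 96 = 768 dims
    codebooks = np.random.rand(m, 256, dsub).astype(np.float32)
    codes = np.random.randint(0, 256, size=(n, m), dtype=np.uint8)
    dists = adc_distances(np.random.rand(m * dsub).astype(np.float32), codebooks, codes)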

Distributed querying

Route queries intentionally

Metadata, clustering, or ownership maps prevent blind broadcast.

Minimize network payloads

IDs and scores are cheap; vectors are not. Move as little data as possible between nodes - respond with IDs and metadata rather than full documents or source vectors.

Hierarchical aggregation reduces variance

Local aggregation before global merging reduces coordination overhead and stabilizes tail latency.

Hardware strategy

Memory dominates everything

Once vectors no longer fit comfortably in memory, performance degrades sharply. No amount of CPU or disk can compensate.

GPUs are situational

GPUs excel at high-throughput batch workloads, but are often a poor fit for low-QPS, tail-latency-sensitive systems. They are a clear win for indexing and re-indexing, but not always a good fit for serving queries.

Network quality matters more than expected

Tail latency often correlates with network jitter at this scale.

Summary

Vector search optimization is not about finding the "best" index or parameter. It’s about matching architecture, algorithms, and operational discipline to scale.

Specific workloads, such as RAG systems, bring their own tuning concerns as well - for instance, optimizing chunk size and extraction strategy, or applying cheap filters in a prefilter step before the ANN search. The use case matters.

To summarize, here's a useful rule of thumb:

  • Millions → HNSW, RAM-resident, minimal complexity
  • Hundreds of millions → Sharding, compression, reranking
  • Billions → Hierarchical ANN, mandatory quantization, ruthless fan-out control

And remember - most performance failures come not from bad algorithms, but from applying yesterday’s assumptions to today’s scale.