pgvector in Production: When PostgreSQL Replaces a Dedicated Vector Database (and When It Shouldn't)

pgvector turns PostgreSQL into a capable vector store for RAG and semantic search. Here is the honest threshold: where HNSW indexes, halfvec, and iterative scans make Postgres enough for production, and where a dedicated vector database earns its keep.

Every team building a RAG pipeline or semantic search feature eventually asks the same question: do we need a dedicated vector database, or can we just put the embeddings in the database we already run? For a large share of production workloads, the answer is the one nobody selling a vector database wants to give you. PostgreSQL with the pgvector extension is enough.

It is not enough for everyone. The honest version of this story has a threshold, and the threshold is concrete: corpus size, recall targets, write churn, and how much memory you are willing to pay for. This post walks that boundary - the index types pgvector ships, the tuning knobs that move recall and latency, the version-specific features that changed the calculus in the last year, and the failure modes that should push you toward Pinecone, Qdrant, or OpenSearch instead.

What pgvector Actually Gives You

pgvector is a PostgreSQL extension that adds vector data types and approximate nearest neighbor (ANN) search to a standard Postgres instance. You store embeddings in a column, build an index over them, and query with distance operators inside ordinary SQL. No separate service, no second consistency model, no extra thing to back up.

The extension defines several column types. The core vector type holds single-precision floats and supports up to 16,000 dimensions for storage, though indexing caps at 2,000 dimensions. To index higher-dimensional embeddings, pgvector added halfvec, a half-precision (16-bit float) type that can be indexed up to 4,000 dimensions while cutting index size roughly in half. There is also sparsevec for sparse vectors and bit for binary vectors used in binary quantization. Distance is expressed through operators: <-> for L2/Euclidean, <=> for cosine, <#> for negative inner product, and <+> for L1.

A minimal end-to-end setup looks like this:

CREATE EXTENSION vector;
  
  CREATE TABLE documents (
      id        bigserial PRIMARY KEY,
      tenant_id int,
      body      text,
      embedding vector(1536)
  );
  
  -- HNSW index using cosine distance
  CREATE INDEX ON documents
      USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64);
  
  -- nearest-neighbor query
  SELECT id, body
  FROM documents
  ORDER BY embedding <=> '[0.012, -0.43, ...]'
  LIMIT 10;

The operator class in the index (vector_cosine_ops, vector_l2_ops, vector_ip_ops, or the halfvec_* variants) must match the operator you query with. Index on cosine and query with <=>. Mismatch the two and the planner silently skips the index and falls back to a sequential scan, which is the single most common reason people think "pgvector is slow."

HNSW vs IVFFlat: Pick HNSW Almost Every Time

pgvector offers two ANN index types, and the choice between them is mostly settled. HNSW (Hierarchical Navigable Small World) arrived in pgvector 0.5.0 in August 2023 and changed the extension from "interesting" to "production-ready." IVFFlat predates it.

HNSW builds a multi-layer proximity graph. It delivers better recall-versus-latency than IVFFlat at the cost of slower builds and higher memory use. Critically, it needs no training step, so you can create the index on an empty table and let it fill incrementally as rows arrive. IVFFlat partitions vectors into lists clusters and probes a subset at query time. It builds faster and uses less memory, but it must be built (or rebuilt) against representative data to choose good cluster centers, and recall degrades as the data drifts from those centers.

Dimension	HNSW	IVFFlat
Recall at low latency	Higher	Lower
Build speed	Slower	Faster
Memory footprint	Higher	Lower
Training / data needed at build	None	Yes (clusters from data)
Handles incremental inserts well	Yes	Degrades; needs rebuild
Key build params	`m`, `ef_construction`	`lists`
Key query param	`hnsw.ef_search`	`ivfflat.probes`

For most production RAG and semantic-search workloads, HNSW is the right default. Reach for IVFFlat only when build time and memory are the binding constraints and you can tolerate periodic rebuilds. We go deeper on the algorithmic trade-offs in HNSW vs IVFFlat: How to Choose the Right Vector Index.

Tuning Recall and Latency

ANN search trades exactness for speed. The whole game is dialing in how much exactness you give up. pgvector exposes that dial through a handful of parameters, split between build time and query time.

For HNSW, m (default 16) sets the number of connections per node in the graph. Higher m improves recall and increases both index size and build time. ef_construction (default 64) is the size of the candidate list during index construction; raising it builds a better graph at the cost of longer builds. The most important runtime knob is hnsw.ef_search (default 40), the candidate list size at query time. It is the lever you reach for first.

-- widen the search to lift recall on this session
  SET hnsw.ef_search = 100;
  
  SELECT id FROM documents
  ORDER BY embedding <=> :query
  LIMIT 10;

Raising ef_search increases recall and latency together; it is a per-query speed/accuracy slider you can tune without rebuilding. For IVFFlat the analogous runtime knob is ivfflat.probes, the number of lists scanned, paired with the build-time lists count. A common starting point is lists = rows / 1000 for up to a million rows, with probes raised until recall meets target.

Standalone definition: Recall in ANN search is the fraction of true nearest neighbors that the approximate index actually returns. A recall of 0.95 means the index found 95 of the 100 genuinely closest vectors. Tuning pgvector is the act of buying recall with latency and memory, then stopping at the cheapest point that meets your quality bar.

Two pitfalls dominate. First, building an HNSW index over a large table with a low maintenance_work_mem is painfully slow because the build spills out of memory; raise it for the duration of the build. Second, people benchmark recall against a tiny dataset, ship defaults, and watch recall collapse at scale. Measure recall against a held-out query set on production-sized data, not on a 10,000-row sample.

Filtered Search, Quantization, and the 0.8.0 Inflection

The hardest problem in production vector search is not pure nearest-neighbor lookup. It is filtered search: "find me the closest documents for this tenant and in this date range." Naively, an ANN index returns the top candidates by distance and then a WHERE clause throws most of them away, so you can ask for 10 results and get 2. This "overfiltering" was pgvector's sharpest edge for years.

pgvector 0.8.0, released in October 2024, addressed it directly with iterative index scans. When a filtered query does not return enough rows, the index keeps scanning until it does, governed by hnsw.iterative_scan (modes off, strict_order, or relaxed_order) and a ceiling of hnsw.max_scan_tuples. IVFFlat gets the equivalent through ivfflat.iterative_scan and ivfflat.max_probes. The same release improved the planner's cost estimation so Postgres more often picks a plain B-tree index when a filter is highly selective, instead of forcing the vector index.

The numbers AWS published for pgvector 0.8.0 on Amazon Aurora PostgreSQL show why this mattered. On a 10-million-product dataset with 384-dimensional embeddings, they report filtered-query result completeness rising from as low as 1% to 100% versus the 0.7.4 baseline, and p99 latency on one query dropping from 123.3 ms to 13.1 ms with ef_search=40 and relaxed_order. (AWS Database Blog) These are vendor figures on a favorable workload, so treat them as direction, not a promise. The direction is real: filtered search in pgvector went from a known weakness to broadly usable in one release.

Quantization is the other lever for fitting more vectors in memory. Half-precision with halfvec halves index size with minor recall loss and is often the easiest win. Binary quantization goes further, casting embeddings to bit via binary_quantize() for very large indexes where a fast first-pass with re-ranking on full precision is acceptable. If you have not internalized the underlying mechanics, our primer on what a vector database is covers the concepts these features implement.

Where pgvector Breaks, and What to Use Instead

The case for pgvector is strongest under a few conditions, and it weakens predictably outside them.

pgvector is enough when:

Your corpus is in the low millions of vectors, not hundreds of millions. HNSW indexes are memory-resident for good latency; once the index no longer fits in RAM, query latency falls off a cliff as it pages from disk.
You already run PostgreSQL and value one operational story: one backup, one replication setup, one access-control model, transactional consistency between your vectors and the rows they describe.
Your workload is read-heavy with moderate write churn. Heavy, continuous inserts and updates force HNSW index maintenance that competes with query traffic.
You need vectors joined to relational data: multi-tenant SaaS where every query filters by tenant_id, or embeddings sitting next to the business records they came from.

Reach for a dedicated vector database when:

You are at tens to hundreds of millions of vectors or beyond, where purpose-built systems shard the index, manage memory tiers, and keep tail latency flat. Our walkthrough on scaling vector search from millions to billions covers that regime.
You need first-class hybrid search that fuses BM25 lexical scoring with dense vectors at scale, which is OpenSearch and Elasticsearch territory.
Write churn is high and constant, or you need dense-index replication across regions, where Postgres index maintenance and physical replication of large indexes become the bottleneck.

Factor	pgvector (PostgreSQL)	Dedicated vector DB (Pinecone, Qdrant, OpenSearch)
Corpus sweet spot	Up to ~10M vectors	10M to billions
Operational overhead	Reuse existing Postgres	New service to run and learn
Transactional consistency with relational data	Native	External / eventual
Filtered + multi-tenant search	Strong (0.8.0+)	Strong
Hybrid lexical + vector at scale	Limited	First-class (OpenSearch)
Horizontal scale of the index	Manual / limited	Built-in sharding
High write churn	Index maintenance pressure	Designed for it

A pragmatic middle path is underused: keep metadata, access control, and the source of truth in PostgreSQL, and offload only the dense index to a specialized engine once you cross the threshold. You do not have to use the same store for the document and its embedding. If you are still deciding the broader data architecture, How to Choose a Database for Your Next Project frames that decision, and Vector Database Comparison 2026 lines up the dedicated options side by side.

Key Takeaways

pgvector turns PostgreSQL into a real vector store. For corpora up to roughly 10 million vectors with moderate write churn, it is genuinely enough for production RAG and semantic search.
Use HNSW, not IVFFlat, for almost all workloads. HNSW needs no training, handles incremental inserts, and wins on recall-versus-latency.
Match the index operator class to the query operator (vector_cosine_ops with <=>), or the planner silently sequential-scans.
Tune recall with hnsw.ef_search at query time; tune build quality with m and ef_construction. Measure recall on production-sized data, not a sample.
pgvector 0.8.0's iterative index scans fixed filtered-search overfiltering, the feature that previously pushed multi-tenant workloads off Postgres.
Use halfvec to index high-dimensional embeddings (up to 4,000 dims) and to halve index memory; binary quantization goes further for very large indexes.
Move to a dedicated vector database for hundreds of millions of vectors, hybrid lexical+vector at scale, high constant write churn, or multi-region dense-index replication.

Choosing the right store for embeddings, and tuning it so recall and latency both hold up in production, is exactly the kind of work BigData Boutique does with engineering teams. If you are weighing pgvector against a dedicated vector database for a RAG or search workload, we can help you find the threshold for your data.