PostgreSQL vs Dedicated Vector Databases: Where the Line Actually Is

Every team shipping an AI feature hits the same fork: keep embeddings in the Postgres you already run, or stand up Pinecone, Qdrant, Weaviate, or Milvus next to it. I've spent the past year benchmarking pgvector on dedicated hardware and watching teams migrate in both directions, and my answer has gotten more specific, not more diplomatic: below about a million vectors, a dedicated engine usually costs you more in glue code than it returns in performance. Above that, the conversation gets real.

The annoying part is that both camps argue with adjectives. "Blazing fast." "Web scale." "Production grade." So here's the decision with measured numbers attached, including the ones that make Postgres look bad.

What the dedicated engines genuinely do better

Three things, and they're real.

Scale past what one Postgres node can hold. pgvector's HNSW index is built and served on a single node, and the build is memory-bound. I've measured this directly: an 8 vCPU/16 GB machine builds a 1M × 1536-dim HNSW index in about 9 minutes, and a 2M index on the same box simply will not build. Milvus, Qdrant, and Pinecone were architected around this exact wall: sharded indexes across nodes, product quantization, disk-backed graph variants in the DiskANN family. At 100M+ vectors, that architecture isn't a nice-to-have. It's the only way the workload exists.
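As a rough sanity check on why the build is memory-bound, the raw vector payload alone can be estimated in SQL. This is a back-of-envelope sketch: it assumes pgvector's documented storage size of 4 bytes per dimension plus 8 bytes per vector, and it ignores the HNSW graph links and page overhead, so a real index will be larger.

```sql
-- Lower-bound estimate of raw vector bytes for 1536-dim embeddings.
-- Assumes 4 * dims + 8 bytes per vector (pgvector's documented storage size);
-- HNSW neighbor lists and page overhead are NOT included.
SELECT n AS vectors,
       pg_size_pretty(n * (4 * 1536 + 8)) AS raw_vector_bytes
FROM (VALUES (250000::bigint), (1000000), (2000000)) AS t(n)
ORDER BY n;
```

At 1M vectors that works out to 6,152,000,000 bytes, a little over 6 GB before any graph structure, which fits with a 16 GB node being the practical floor for that build.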

Filtered search at enormous scale. Dedicated engines do filter-aware graph traversal, so a highly selective filter over hundreds of millions of vectors doesn't gut recall or force a giant over-fetch. pgvector 0.8 added iterative index scans, which fixed the worst "my filter ate the result set" cases at small and medium scale, but if you're combining high-cardinality filters with nine-figure vector counts, the purpose-built engines are ahead and I won't pretend otherwise.
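For reference, iterative scans are opt-in. A minimal sketch, assuming pgvector 0.8.0 or later, a chunks table with a hypothetical id column, and bind parameters $1 (query embedding) and $2 (tenant id) supplied by the client:

```sql
BEGIN;
-- Keep scanning the index until enough rows pass the filter (pgvector >= 0.8.0).
-- strict_order keeps results in exact distance order; relaxed_order trades that for speed.
SET LOCAL hnsw.iterative_scan = strict_order;
-- Approximate cap on tuples visited before the scan ends (20000 is the default).
SET LOCAL hnsw.max_scan_tuples = 20000;

SELECT c.id, c.embedding <=> $1 AS distance
FROM chunks c
WHERE c.tenant_id = $2
ORDER BY c.embedding <=> $1
LIMIT 10;
COMMIT;
```

Raising the tuple cap buys recall on very selective filters at the cost of latency, so it is worth measuring rather than guessing.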

Ops you don't have to think about. Pinecone's serverless tier means no capacity planning, no index-sizing spreadsheet, no failover drills. You pay per use, which cuts both ways (your bill scales with your success, and you can't tune what you can't see), but for spiky workloads on a team with zero infrastructure appetite, it's a legitimate draw.

What they don't do better, despite the marketing: latency at moderate scale (numbers below), filtering at moderate scale (a B-tree index beats a bolted-on metadata system), or anything that touches your relational data. Which brings me to the other side.

What Postgres does better

The production query is never "give me the 10 nearest neighbors." It's the 10 nearest neighbors for this tenant, that this user is allowed to see, from documents that aren't archived, joined back to titles and URLs for the prompt. In Postgres that's one query inside one short transaction:

BEGIN;
-- Transaction-scoped, so the setting holds even behind a transaction pooler.
SET LOCAL hnsw.ef_search = 80;

-- $1 = query embedding, $2 = tenant id
SELECT d.title, d.url, c.content,
       c.embedding <=> $1 AS distance
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE c.tenant_id = $2
  AND d.status = 'published'
  AND d.language = 'en'
ORDER BY c.embedding <=> $1
LIMIT 10;

COMMIT;

In a dedicated vector store, that query becomes: denormalize tenant_id, status, and language into metadata at write time, keep that copy in sync forever, run the filtered ANN query, then make a second round trip to Postgres to join back the document fields. Every schema change to documents now has a shadow in your vector store.

A detail in that SQL is load-bearing: SET LOCAL, not SET. Most managed Postgres puts you behind PgBouncer in transaction-pooling mode (on Rivestack, the $29 Solo tier connects through port 6432 exactly this way). There, a session-level SET hnsw.ef_search = 80 is effectively lost: the server connection goes back to the pool when the transaction ends, and your next query can land on a different connection running at the default. Your search still "works," just at a recall you didn't choose, and nothing logs a warning. Use SET LOCAL inside the transaction, or set it once with ALTER DATABASE app SET hnsw.ef_search = 80. There's a longer treatment in the HNSW tuning guide.
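A quick way to confirm which value a query actually sees is to read the setting back. This sketch assumes a database named app, as in the text, and uses current_setting with its missing_ok flag so that an unset parameter does not raise an error:

```sql
-- Transaction-scoped: visible only until COMMIT or ROLLBACK.
BEGIN;
SET LOCAL hnsw.ef_search = 80;
SELECT current_setting('hnsw.ef_search', true) AS inside_txn;
COMMIT;

-- After the transaction ends, the value reverts to the session or database
-- default (it may be empty or NULL if pgvector was never loaded in this session).
SELECT current_setting('hnsw.ef_search', true) AS after_txn;

-- Durable default for connections to this database.
-- It applies to sessions opened after this statement runs.
ALTER DATABASE app SET hnsw.ef_search = 80;
```

Because the database-level default is applied at connection start, server connections already sitting in a pool keep their old value until they are recycled.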

Then there are transactions. Delete a user and their chunks cascade in the same commit: no orphaned vectors, no reconciliation job. In a two-store architecture the delete propagates eventually, and "our chatbot cited a document the customer deleted last week" is not an inconvenience, it's a GDPR incident with your name on the postmortem.
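A minimal schema sketch of that cascade, assuming hypothetical users, documents, and chunks tables. The column names beyond those in the query above are illustrative, not the author's actual schema:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE users (
    id bigint PRIMARY KEY
);

CREATE TABLE documents (
    id       bigint PRIMARY KEY,
    owner_id bigint NOT NULL REFERENCES users (id) ON DELETE CASCADE,
    title    text NOT NULL,
    url      text NOT NULL,
    status   text NOT NULL DEFAULT 'draft',
    language text NOT NULL DEFAULT 'en'
);

CREATE TABLE chunks (
    id          bigint PRIMARY KEY,
    document_id bigint NOT NULL REFERENCES documents (id) ON DELETE CASCADE,
    tenant_id   bigint NOT NULL,
    content     text NOT NULL,
    embedding   vector(1536) NOT NULL
);

-- Without these, every cascaded delete scans the child table to find its rows.
CREATE INDEX ON documents (owner_id);
CREATE INDEX ON chunks (document_id);

-- One commit removes the user, their documents, and every embedding derived from them.
BEGIN;
DELETE FROM users WHERE id = 42;  -- 42 is a placeholder user id
COMMIT;
```

If the transaction rolls back, nothing is deleted at all, which is the property a two-store setup cannot offer.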

And cost. Vector-store pricing is usually metered per vector, per write, or per query. Postgres pricing is hardware: a dedicated 2 vCPU/4 GB NVMe node at $49/month serves 250k embeddings at around 1,600 QPS with 0.93 recall@10 (16 concurrent clients). Most products never exceed that workload, and the bill never moves.

The two-system tax, itemized

Teams budget for the vector database. They rarely budget for the seams.

The sync pipeline. Dual-write or CDC, pick your poison. Dual-writes fail halfway and you end up writing a reconciliation job anyway. CDC means Debezium plus Kafka, or a vendor connector, another moving part with lag, schema-evolution pain, and its own on-call page.

Consistency holes. The vector store is, structurally, a stale index over your real data. Stale means wrong results. Wrong results that include rows the requesting user shouldn't see means your retrieval layer is now a security surface.

Two backup stories. Postgres with pgBackRest restores to Tuesday at 14:02. Your vector store's snapshot is from 03:00. Now your disaster-recovery plan includes a re-sync procedure you've never tested, executed during an incident.

Two monitoring stacks. When retrieval p95 spikes, which system is it? Postgres gives you pg_stat_statements and EXPLAIN ANALYZE; vector-DB observability ranges from decent to a single latency graph.
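On the Postgres side, a first look might be something like the following. It assumes the pg_stat_statements extension is installed and preloaded (PostgreSQL 13 or later for these column names), and it uses a hypothetical chunk with id = 1 as the query vector so the statement runs without bind parameters:

```sql
-- Which vector queries cost the most overall?
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(max_exec_time::numeric, 2)  AS max_ms,
       left(query, 80)                   AS query_start
FROM pg_stat_statements
WHERE query LIKE '%<=>%'
ORDER BY total_exec_time DESC
LIMIT 5;

-- Is the HNSW index used, and how many buffers were shared hits versus reads?
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.id
FROM chunks c
ORDER BY c.embedding <=> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;
```

A plan that shows a sequential scan plus sort instead of an index scan on the HNSW index, or a large share of buffer reads, usually points at the index or the memory budget rather than the network.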

Model migrations hit twice. Re-embedding with a new model means backfilling two systems and cutting over atomically, which, across two systems, you can't, exactly.

None of these is fatal. Together they're something like an engineer-month a year, indefinitely. That's a fine price if the second system buys you something Postgres can't do. It's a terrible price for 400k vectors.

The honest numbers: where single-node pgvector stops

These are measurements, not extrapolations: June 2026, dedicated NVMe nodes on Rivestack, same-region clients, clustered 1536-dim embeddings, HNSW at m=16 / ef_construction=64, recall@10 scored against exact KNN. Full methodology in the NVMe vs cloud SSD benchmark post.
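For reference, the index configuration described above corresponds to a build along these lines. The table and index names are assumptions, and the memory and worker settings are illustrative values to size against your own node, not the settings used in these benchmarks:

```sql
-- Give the build as much memory as the node can spare; once the graph
-- outgrows maintenance_work_mem, the build slows down sharply.
SET maintenance_work_mem = '8GB';
-- Parallel workers speed up HNSW builds (pgvector >= 0.6.0).
SET max_parallel_maintenance_workers = 7;

CREATE INDEX chunks_embedding_hnsw_idx
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- From another session: watch build progress.
SELECT phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;
```

These are session-level settings, so a build like this is best run over a direct connection, not through a transaction pooler.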

Node	RAM	Hot-serving capacity (1536d)	Build reality
2 vCPU / 4 GB	4 GB	~350k vectors	250k builds in ~5 min; 500k took ~4.2 hours; 1M fails
4 vCPU / 8 GB	8 GB	~600k vectors	500k in ~40 min; 1M fails (OOM)
8 vCPU / 16 GB	16 GB	~1M vectors	1M in ~9 min; 2M will not build

That 4.2-hour number isn't a typo. I watched a 500k HNSW build grind for most of an afternoon on a 4 GB box because the build spilled past RAM. The index worked afterward. The build window did not.

Throughput, with recall attached, because a QPS figure without its recall is marketing, not measurement:

4 GB node, 250k vectors: ~1,600 QPS at recall 0.93 (16 clients). Throughput climbs further if you trade recall down. Same-region networking adds only ~0.4 ms over the in-datacenter floor.
8 GB node, 250k vectors: ~2,950 QPS at recall 0.94 (16 clients).
16 GB node, 250k vectors: ~4,465 QPS at recall 0.95 (16 clients). At 1M vectors: ~3,600 QPS at recall 0.74, p50 4.2 ms (16 clients).
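Recall@10 against exact KNN can be spot-checked for a single query directly in SQL. This is a sketch, not the benchmark harness: it assumes a chunks table with an id column and at least 10 rows, uses a hypothetical existing row (id = 1) as the query vector, and relies on disabling index scans to force an exact search. A real measurement would average over many query vectors.

```sql
BEGIN;
SET LOCAL hnsw.ef_search = 80;

-- Approximate top 10 via the HNSW index.
CREATE TEMP TABLE approx_top10 ON COMMIT DROP AS
SELECT id
FROM chunks
ORDER BY embedding <=> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;

-- Exact top 10: with index scans disabled, Postgres sorts every row by distance.
SET LOCAL enable_indexscan = off;
CREATE TEMP TABLE exact_top10 ON COMMIT DROP AS
SELECT id
FROM chunks
ORDER BY embedding <=> (SELECT embedding FROM chunks WHERE id = 1)
LIMIT 10;

-- Fraction of the exact neighbors that the index also returned.
SELECT count(*) / 10.0 AS recall_at_10
FROM approx_top10
JOIN exact_top10 USING (id);
COMMIT;
```

Rerunning this with different ef_search values gives a feel for the recall side of the trade-off before looking at throughput.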

And the failure mode nobody benchmarks: chase recall 0.99 on a 4 GB node by cranking ef_search to 200 and the working set spills the cache. Throughput collapses to somewhere between 25 and 300 QPS with p50 at 36 to 43 ms. The cliff is sudden and it's why "just raise ef_search" is bad advice without a memory budget.

One more limit, stated plainly: replicas don't buy you read throughput by default. In most managed HA setups, Rivestack's included today, added nodes are streaming-replication standbys for automatic failover, and the load balancer routes every query, reads included, to the primary. Vanilla Postgres can serve reads from hot standbys if you build the routing and accept replication lag, but don't pencil "add a node, double the QPS" into your capacity plan. Scale within a node first.
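If you do route reads to a hot standby yourself, two built-in functions are worth checking from the connection your application actually uses. A small sketch:

```sql
-- true on a standby, false on the primary: confirms where this connection landed.
SELECT pg_is_in_recovery() AS on_standby;

-- On a standby: time since the last replayed transaction was committed upstream.
-- This overstates lag when the primary is idle, and is NULL on a primary.
SELECT now() - pg_last_xact_replay_timestamp() AS approx_replay_lag;
```

If the first query returns false through your "read" endpoint, the load balancer is sending reads to the primary, as described above.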

So the practical line for 1536-dim embeddings is roughly one million vectors per node at 16 GB of RAM. Past it, your in-Postgres options are bigger hardware, shorter embeddings (Matryoshka-truncated 512d roughly triples effective capacity and usually costs less retrieval quality than people fear), or partitioning by tenant. At 10M+ with no natural partition key, the dedicated-engine conversation is legitimate. At 100M, it's over. Go distributed.
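One way to try shorter embeddings without re-embedding is an expression index over a prefix of the stored vector. This sketch assumes pgvector 0.7.0 or later (for subvector), a chunks table holding 1536-dim embeddings from a model trained for Matryoshka truncation, and a hypothetical row with id = 1 as the query vector:

```sql
-- Index only the first 512 dimensions; the full 1536-dim column stays in the table.
CREATE INDEX chunks_embedding_512_idx
    ON chunks USING hnsw ((subvector(embedding, 1, 512)::vector(512)) vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- The ORDER BY expression must match the indexed expression for the index to be used.
SELECT id
FROM chunks
ORDER BY subvector(embedding, 1, 512)::vector(512)
         <=> (SELECT subvector(embedding, 1, 512)::vector(512) FROM chunks WHERE id = 1)
LIMIT 10;
```

Cosine distance ignores vector length, so the truncated prefix does not need re-normalizing here. Note that this shrinks the index, not the table: to cut table storage as well, you would store the truncated vector in its own vector(512) column.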

The middle path: Postgres first, measure, then peel

The boring strategy is the right one for most teams:

1. Put embeddings next to the rows they describe. A vector(1536) column, an HNSW index, ship it. If you want this without running the box yourself, managed pgvector is exactly this shape.
2. Measure with your data. Clustered real-world embeddings behave differently from the uniform random vectors in vendor benchmarks. We open-sourced pgvector-bench for this: it reports QPS at measured recall instead of QPS alone.
3. Set exit criteria in advance. "p95 over 50 ms at recall 0.90," or "corpus passes the node's build ceiling." Decided calmly, before the incident.
4. If you cross them, move only the vector workload. Postgres stays the source of truth; the vector engine becomes a derived, disposable index rebuilt from Postgres on demand. Sync becomes one-directional, reconciliation becomes "rebuild it," and roughly half the two-system tax disappears because the second system is a cache, not a peer.
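The corpus-size criterion is easier to honor when it is a query you can schedule. A sketch, assuming a hypothetical index name chunks_embedding_hnsw_idx and using the 16 GB node's roughly one-million-vector ceiling from the table above as the threshold:

```sql
SELECT count(*) AS vectors,
       pg_size_pretty(pg_relation_size('chunks_embedding_hnsw_idx')) AS hnsw_index_size,
       count(*) > 1000000 AS past_build_ceiling
FROM chunks;
```

The latency criterion needs client-side or proxy-side measurement, since pg_stat_statements exposes mean and max execution time but not percentiles.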

What I'd avoid is the inverse: starting with two systems "to be safe." You pay the sync tax from day one, before product-market fit, for scale you may never reach, and you can't get that engineering time back.

What I'd actually pick, scenario by scenario

RAG over internal docs, multi-tenant SaaS, 50k to 500k chunks. Postgres, no hesitation. The tenant filter is a WHERE clause, deletes are transactional, and the whole thing fits a $49 to $85 node. If you're currently bumping the limits of a shared or pooled Postgres tier, a dedicated node fixes the ceiling without changing your architecture. Want to feel the latency yourself? ask.rivestack.io is semantic search over Hacker News running on a stock Postgres + pgvector node.

Product search, ~2M items, heavy faceted filtering. Borderline, and I'll resolve it: 2M at 1536 dims won't build on any of the nodes above, so it's either 512d Matryoshka embeddings on a 16 GB node, partitioning by category, or Qdrant for the vectors with Postgres as source of truth. I'd try shorter embeddings first, since that keeps you on one system and the recall hit is usually in the single-digit percentage points.
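Partitioning by category could look roughly like this. The products table, its columns, and the category values are hypothetical. The point is that each partition gets its own smaller HNSW index, and a category filter prunes the search to one of them:

```sql
CREATE TABLE products (
    id        bigint NOT NULL,
    category  text   NOT NULL,
    embedding vector(512) NOT NULL,
    PRIMARY KEY (category, id)  -- a primary key must include the partition key
) PARTITION BY LIST (category);

CREATE TABLE products_shoes   PARTITION OF products FOR VALUES IN ('shoes');
CREATE TABLE products_jackets PARTITION OF products FOR VALUES IN ('jackets');
CREATE TABLE products_other   PARTITION OF products DEFAULT;

-- Declared once on the parent; Postgres creates one index per partition.
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops);

-- The equality filter on the partition key lets the planner prune to one partition.
SELECT id
FROM products
WHERE category = 'shoes'
ORDER BY embedding <=> (SELECT embedding FROM products WHERE id = 1 LIMIT 1)
LIMIT 10;
```

The trade-off is that a query with no category filter has to search every partition's index, so this only helps when most searches are scoped to a category.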

100M+ embeddings, pure semantic retrieval, no relational context. Dedicated engine from day one. Milvus or Qdrant if you have an infra team, Pinecone serverless if you don't. This is the workload those systems were built for, and forcing it into single-node Postgres is malpractice in the opposite direction.

Already on Pinecone with under 1M vectors, fighting metadata sync. This migration runs in the direction the marketing doesn't mention. The workload fits a $85 to $159 Postgres node, you get joins and transactional deletes back, and the sync pipeline gets deleted rather than maintained. Rivestack will move the data for you, free, under its Vector Rescue program; the detailed comparison lives at /pinecone-alternative.

The pattern across all four: the dedicated engines win on a dimension (raw scale) that most products never reach, and Postgres wins on dimensions (correctness, simplicity, cost) that every product needs from day one. Choose for the workload you have, with an exit you've already designed.

Frequently asked questions

How many vectors can pgvector realistically handle on one node?

For hot HNSW serving at 1536 dimensions: roughly 350k on a 4 GB node, 600k on 8 GB, and 1M on 16 GB. The build ceilings are harsher than the serving ceilings: a 500k build on 4 GB took about 4.2 hours in my testing, and 1M fails outright on anything under 16 GB. Shorter embeddings raise all of these roughly in proportion to the dimension cut. Past a few million vectors on the largest node you can buy, you're into partitioning or dedicated-engine territory.

Is pgvector slower than Pinecone?

Not at the scale most teams operate. Measured on an 8 vCPU/16 GB NVMe node: 1M 1536-dim vectors served at ~3,600 QPS with p50 4.2 ms at recall 0.74 (16 clients). At that level the comparison is dominated by hardware, network distance, and ef_search settings, not the engine. Dedicated engines pull ahead at tens of millions of vectors and on heavily filtered search at that scale. Whatever vendor numbers you're shown, insist they come with recall and client count attached. QPS alone is meaningless.

Can I add Postgres replicas to scale vector search reads?

Not automatically, and this catches people. In most managed HA configurations, including Rivestack's today, standby nodes exist for automatic failover, and the load balancer sends all queries, reads included, to the primary. Self-managed Postgres can serve reads from hot standbys if you build the routing and tolerate replication lag. Either way, do your capacity math against a single node.

Do I need a separate vector database for RAG?

Almost never. RAG retrieval is approximate nearest-neighbor search plus metadata filters plus a join back to source documents, exactly the shape Postgres handles in one query. The cases that justify a second system are a corpus past a few million chunks or sustained throughput beyond what one node delivers (~3,600 QPS at recall 0.74 with 16 clients on 16 GB-class hardware). Below that, the second system is mostly sync code and a bigger bill.

What's the biggest operational gotcha running pgvector in production?

PgBouncer transaction pooling silently discarding session-level settings. SET hnsw.ef_search = 100 over a transaction-pooled connection applies to one pooled session and vanishes; your queries quietly run at the default and recall drops with no error anywhere. Use SET LOCAL inside the query's transaction or ALTER DATABASE ... SET hnsw.ef_search to make it durable. The tuning guide covers this and the other parameters worth touching.
