What Is pgvector? A Practitioner's Introduction

pgvector is an open-source extension that teaches PostgreSQL to store and search embeddings, the arrays of floats that AI models produce when they encode text, images, or audio. With it installed, "find the documents most similar to this query" becomes a one-line ORDER BY, sitting right next to your joins, your WHERE clauses, and your transactions.

I've run pgvector in production for a while now, benchmarked it across node sizes, and watched plenty of teams adopt it. The short version: for the majority of applications (semantic search, RAG, recommendations under a million vectors) it's not just "good enough," it's the boring, correct choice. But it has real limits and a few sharp edges that the typical intro post skips entirely. This one won't.

What the extension actually gives you

Strip away the marketing and pgvector adds three concrete things to Postgres:

1. A vector column type. You declare the dimensionality up front:

CREATE EXTENSION vector;

CREATE TABLE documents (
  id        bigserial PRIMARY KEY,
  content   text NOT NULL,
  embedding vector(1536)   -- OpenAI text-embedding-3-small
);

A vector(1536) is 1536 single-precision floats plus a small header, about 6 KB per row before TOAST and index overhead. That number matters more than you'd think; we'll come back to it when we talk about memory.
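The "about 6 KB" figure is plain arithmetic, and a rough sketch makes it easy to redo for other dimension counts. This assumes the layout pgvector documents for the vector type (4 bytes per dimension plus an 8-byte header) and counts only raw vector data; the function names are made up for illustration.

```python
def vector_bytes(dims: int) -> int:
    """Raw storage for one pgvector `vector(dims)` value, assuming
    4 bytes per single-precision float plus an 8-byte header."""
    return 4 * dims + 8


def corpus_gib(rows: int, dims: int) -> float:
    """Raw vector data for `rows` embeddings, in GiB.
    Ignores index, tuple, and TOAST overhead, so treat it as a floor."""
    return rows * vector_bytes(dims) / 2**30


per_row = vector_bytes(1536)              # 6152 bytes, i.e. about 6 KB
one_million = corpus_gib(1_000_000, 1536)
print(f"{per_row} bytes per vector, {one_million:.2f} GiB per million rows")
```

The per-million figure is a lower bound on what has to stay cached, which is why the row size keeps coming up once memory enters the picture.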

2. Distance operators. Three core ones (recent releases add a few more, such as L1 distance), each measuring "how far apart" two vectors are in a different sense:

<->: Euclidean (L2) distance
<=>: cosine distance (1 minus cosine similarity)
<#>: negative inner product

For text embeddings from OpenAI, Cohere, or open models like bge and e5, you almost always want cosine, so <=> is the operator you'll type a hundred times. Lower result = more similar.
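If the three operators feel abstract, here is roughly what each one computes, written as plain Python. This is only a reference sketch of the math (the extension does this in C over float4 arrays); the function names are illustrative.

```python
import math


def l2_distance(a, b):
    """What <-> returns: straight-line (Euclidean) distance."""
    return math.dist(a, b)


def cosine_distance(a, b):
    """What <=> returns: 1 minus cosine similarity (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def negative_inner_product(a, b):
    """What <#> returns: the inner product with its sign flipped,
    so that 'smaller is more similar' still holds for ORDER BY."""
    return -sum(x * y for x, y in zip(a, b))


same = [3.0, 4.0]
orthogonal_a, orthogonal_b = [1.0, 0.0], [0.0, 1.0]
print(cosine_distance(same, same))                  # identical direction
print(cosine_distance(orthogonal_a, orthogonal_b))  # unrelated direction
```

Cosine distance ignores vector length, which is why it suits embeddings that differ in magnitude but agree in direction.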

3. Two index types for approximate nearest-neighbor search. Without an index, ORDER BY embedding <=> $1 scans every row and computes every distance: exact, but O(n) per query. The indexes trade a little accuracy for a lot of speed:

HNSW builds a layered graph over your vectors. Best query speed and recall, more expensive to build. This is the production default.
IVFFlat clusters vectors into lists and only searches the nearest few lists. Cheaper and faster to build, but recall is touchier and it degrades as data changes, because the cluster centers are frozen at build time.
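To see why IVFFlat recall is touchy, here is a toy version of the idea in Python. It is a sketch under heavy assumptions: the centroids are hand-picked rather than learned with k-means, the data is three 2-D points, and the `probes` argument merely mirrors the role of the extension's probes setting.

```python
import math


def build_lists(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, probes=1, k=1):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    candidates = [vid for c in order[:probes] for vid in lists[c]]
    return sorted(candidates, key=lambda vid: math.dist(query, vectors[vid]))[:k]


# Hypothetical toy data: "a" sits just on the wrong side of a cluster boundary.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {"a": (4.9, 0.0), "b": (1.0, 0.0), "c": (9.0, 0.0)}
lists = build_lists(vectors, centroids)
query = (5.2, 0.0)

one_probe = ivf_search(query, vectors, centroids, lists, probes=1)
two_probes = ivf_search(query, vectors, centroids, lists, probes=2)
```

With one probe the search only looks in the list holding "c" and misses "a", the true nearest neighbor. Probing both lists finds it, at the cost of scanning more rows.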

My advice: use HNSW unless you have a specific reason not to. The build cost is real, but you pay it once; query quality you pay for on every request.

Everything else (ACID, foreign keys, row-level security, logical backups, EXPLAIN) is just Postgres. That's the entire pitch. Your vectors aren't in a second system with its own auth, its own backup story, and its own consistency model. They're rows.

A real end-to-end example

Here's the whole loop with actual OpenAI embeddings, not pseudocode. First, generate and insert (Python, but the shape is identical in any language):

from openai import OpenAI
import psycopg

client = OpenAI()
docs = [
    "PostgreSQL uses MVCC for concurrency control.",
    "HNSW is a graph-based approximate nearest neighbor index.",
    "Croissants are made with laminated dough.",
]

resp = client.embeddings.create(model="text-embedding-3-small", input=docs)

with psycopg.connect("postgresql://...") as conn:
    for doc, item in zip(docs, resp.data):
        conn.execute(
            "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
            (doc, str(item.embedding)),
        )

pgvector accepts the vector as a string literal ('[0.018, -0.024, ...]') so there's no special driver requirement, though pgvector-python and friends give you nicer type handling.
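If you go the string-literal route, a small helper keeps the formatting in one place. This is a minimal sketch, not part of any driver: the function name is made up, and the finiteness check is a defensive assumption so bad values fail in Python instead of at insert time.

```python
import math


def to_pgvector_literal(values):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.5,-1.0,2.0]'."""
    floats = [float(v) for v in values]
    if not all(math.isfinite(f) for f in floats):
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(f) for f in floats) + "]"


literal = to_pgvector_literal([0.5, -1, 2])
print(literal)
```

The result can be passed as an ordinary query parameter in place of str(item.embedding).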

Once you have data loaded (and ideally after bulk loading, more on that below), build the index:

CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

Note vector_cosine_ops. The operator class must match the operator you query with: a cosine index does nothing for <-> queries. This is the single most common silent failure in first deployments.

Now query:

SELECT content, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

With $1 bound to the embedding of "how does Postgres handle concurrent writes?", you get back something like:

                     content                      | distance
--------------------------------------------------+----------
 PostgreSQL uses MVCC for concurrency control.    |   0.3107
 HNSW is a graph-based approximate nearest ...    |   0.5972
 Croissants are made with laminated dough.        |   0.8841

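No keyword overlap between "concurrent writes" and "MVCC": that's the embedding model doing semantic matching, and pgvector ranking by it. And because this is SQL, filtering needs no new syntax (this query assumes documents also has tenant_id and published_at columns, which the minimal CREATE TABLE above leaves out):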

SELECT content
FROM documents
WHERE tenant_id = 42 AND published_at > now() - interval '90 days'
ORDER BY embedding <=> $1
LIMIT 5;

Try to express that in a dedicated vector database and you're suddenly reading docs about metadata filter syntax and pre- vs post-filtering semantics. In Postgres it's a WHERE clause. If you want to feel this on real data, ask.rivestack.io is a live semantic search over Hacker News running on exactly this setup.

How approximate search works, intuitively

HNSW stands for Hierarchical Navigable Small World, which sounds academic but maps to a simple picture: highway, road, street.

The index is a stack of graphs. The top layer has very few nodes with long-range connections: the highway network. Each layer down gets denser and more local. A search starts at the top, greedily hops toward the query vector ("which of my neighbors is closest? move there, repeat"), then drops a layer and refines, until it reaches the bottom layer where every vector lives. It's how you'd navigate a city: highway to roughly the right area, surface streets to the door.
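The highway-to-street descent can be sketched in a few lines of Python. This is a deliberately simplified toy, not pgvector's implementation: it keeps a single candidate per layer (real HNSW keeps a list of candidates at the bottom layer), and the graphs are hand-built over eight points on a line.

```python
import math


def greedy_layer_search(graph, vectors, entry, query):
    """Hop to whichever neighbor is closest to the query until none improves."""
    current, current_dist = entry, math.dist(vectors[entry], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph.get(current, ()):
            d = math.dist(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:          # local minimum on this layer
            return current
        current, current_dist = best, best_dist


def layered_search(layers, vectors, entry, query):
    """Descend from the sparse top layer to the dense bottom layer,
    starting each layer where the previous one stopped."""
    node = entry
    for graph in layers:
        node = greedy_layer_search(graph, vectors, node, query)
    return node


# Hypothetical toy index: ids 0..7 placed at x = 0..7.
vectors = {i: (float(i), 0.0) for i in range(8)}
highway = {0: [4], 4: [0]}
roads = {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]}
streets = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

found = layered_search([highway, roads, streets], vectors, entry=0, query=(5.2, 0.0))
```

Starting from node 0, the search jumps to 4 on the top layer, moves to 6 on the middle layer, and settles on 5 at the bottom, touching only a handful of the eight nodes.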

Two build parameters shape the graph: m (connections per node) and ef_construction (search width while building). The defaults of 16 and 64 are genuinely fine for most workloads.

The parameter you'll actually touch is hnsw.ef_search: how many candidate nodes the search keeps in hand at query time. It's a live dial between speed and recall, the fraction of the true nearest neighbors the approximate search actually finds. Recall@10 of 0.93 means that, on average, 9.3 of the true top-10 made it into your results.

This tradeoff isn't theoretical. On a 2 vCPU / 4 GB node with 250k 1536-dimension vectors, ef_search=80 measured recall@10 of 0.93 at ~1,600 QPS with 16 concurrent clients. Lower the recall target and the same box gives you more QPS; raise it and you pay the throughput back. Same data, same index. The dial just moved. There's a full write-up of the methodology in our NVMe vs cloud-SSD benchmark post, and the harness is open source (pgvector-bench) if you'd rather measure your own hardware than trust anyone's blog.

One asymmetry worth internalizing early: recall gets exponentially expensive at the top. Going from 0.90 to 0.93 is cheap. Going from 0.95 to 0.99 can wreck you: on that same 4 GB node, pushing ef_search to 200 spills the working set out of cache and throughput collapses to a small fraction of its former self while p50 jumps by roughly an order of magnitude. If your product genuinely needs recall 0.99, you need exact search or much more RAM, not a bigger ef_search.

The operational realities nobody mentions

HNSW index builds are memory-bound, and the failure mode is brutal. The build wants the graph in RAM. When it fits, builds are fast; when it doesn't, they don't slow down gracefully. They fall off a cliff or die. From our measurements on dedicated nodes: a 4 GB machine built a 250k × 1536-dim index in about 5 minutes, but 500k took hours on the same box, and 1M failed outright. An 8 GB node handled 500k in ~40 minutes but still couldn't build 1M. You need 16 GB before 1M × 1536 builds cleanly (~9 minutes there), and even that box won't build 2M. I've watched a 500k build sit at 100% CPU for an afternoon because the node was one size too small. Size your node for the build, not just the steady state, and set maintenance_work_mem accordingly.

Serving has a RAM budget too. Hot HNSW search assumes index and vectors are cached. As rough in-RAM capacity at 1536 dimensions: ~350k vectors on 4 GB, ~600k on 8 GB, ~1M on 16 GB. Past that, queries start touching disk and the latency distribution grows an ugly tail. Fast NVMe softens the cliff; network-attached cloud storage does not.

Connection poolers eat your session settings. This one costs people real debugging hours. SET hnsw.ef_search = 100 is a session-level command, and over PgBouncer in transaction-pooling mode, your next query may run on a different backend connection where the setting never happened. It fails silently: no error, just default recall. The fix is SET LOCAL hnsw.ef_search = 100 inside the same transaction as the query, or pin it with ALTER DATABASE mydb SET hnsw.ef_search = 100 (which applies to new sessions, not ones already open). (On Rivestack this bites specifically on the pooled port 6432; it's covered in more depth in the HNSW tuning guide.)
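In application code, the SET LOCAL fix looks roughly like the sketch below. It assumes a psycopg 3 connection (the transaction() context manager and execute() come from that library) and the documents table from earlier; the helper names are invented for illustration. The SET value is inlined after an int() cast because it is built as text instead of being sent as a bound parameter.

```python
def pinned_search_statements(ef_search: int, k: int = 5):
    """Build the two statements that must share one transaction."""
    set_stmt = f"SET LOCAL hnsw.ef_search = {int(ef_search)}"
    select_stmt = (
        "SELECT content FROM documents "
        "ORDER BY embedding <=> %s::vector "
        f"LIMIT {int(k)}"
    )
    return set_stmt, select_stmt


def search(conn, query_literal: str, ef_search: int = 100, k: int = 5):
    """Run the query with ef_search scoped to this transaction only.
    `conn` is assumed to be a psycopg 3 connection."""
    set_stmt, select_stmt = pinned_search_statements(ef_search, k)
    with conn.transaction():
        # Both statements run inside one transaction, so a transaction-mode
        # pooler keeps them on the same backend connection.
        conn.execute(set_stmt)
        return conn.execute(select_stmt, (query_literal,)).fetchall()
```

Because SET LOCAL expires at commit, nothing leaks onto whichever client borrows that backend next.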

Recall is invisible unless you measure it. Approximate search never errors when it's wrong. It just quietly returns the 8th-best result instead of the 1st. Take a sample of queries, compute exact KNN with a sequential scan as ground truth, and compare. Ten lines of SQL, and it turns "the search feels off" into a number you can tune against.
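The comparison itself is a set overlap. Here is a minimal in-memory sketch of the idea in Python, assuming cosine distance and a tiny made-up corpus; in practice the exact ids would come from a query that bypasses the index and the approximate ids from the indexed query.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def exact_top_k(query, vectors, k):
    """Ground truth: brute-force scan over every vector, sorted by distance."""
    return sorted(vectors, key=lambda vid: cosine_distance(query, vectors[vid]))[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical data: three documents and one query.
vectors = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
truth = exact_top_k((1.0, 0.0), vectors, k=2)

# Pretend the index returned "a" and "c" for the same query.
score = recall_at_k(["a", "c"], truth)
```

Average that score over a few hundred sampled queries and you have the recall@k number to tune ef_search against.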

When pgvector is enough, and when it isn't

Here's the honest framing, with the thresholds I'd actually use:

pgvector is the right call when your corpus is under roughly 1M vectors at 1536 dimensions (or proportionally more at lower dimensions), you already run Postgres, and your queries mix similarity with relational filters. That covers most RAG applications, most product search, and most recommendation features. At those sizes the numbers are strong: we've measured ~4,465 QPS at recall@10 0.95 (16 clients, 250k vectors) on an 8 vCPU / 16 GB node, and ~3,600 QPS at recall 0.74, p50 4.2 ms on 1M vectors (16 clients) on the same hardware. Very few products need more than that from their retrieval layer, and you get it without operating a second database.

Start looking elsewhere when:

You're past a few million vectors and growing. A single Postgres node hits the RAM wall; 2M × 1536 won't even build on 16 GB. Dedicated engines like Milvus or Vespa shard and use disk-based or quantized indexes (DiskANN-style) designed for exactly this.
You need to rebuild indexes constantly at scale. Multi-hour HNSW builds on large tables are painful when your corpus churns hard.
Vector search is the entire product and you'll exploit engine-specific features Postgres doesn't have.

What I'd push back on is the reflexive "Postgres can't do vector search at scale" you see in vendor comparison pages. Below a million vectors it demonstrably can, with single-digit-millisecond p50s, and the operational simplicity of one database is worth a lot. If you're weighing this against Pinecone specifically, we wrote up a direct comparison with the same measured-numbers discipline; there's a Supabase one too.

A clarification that trips people up on any hosted Postgres with HA: standby nodes are for automatic failover, not read scaling. On Rivestack, for instance, the load balancer routes all queries, reads included, to the primary; the replicas exist so a node failure doesn't take you down. If someone's capacity math assumes "add replicas, multiply QPS," redo the math.

Mistakes I see in almost every first deployment

Mismatched operator class and operator. Index with vector_l2_ops, query with <=>, and Postgres silently falls back to a sequential scan. Always check with EXPLAIN that you see an Index Scan using ... hnsw. If you see Seq Scan on a million-row table, this is usually why.
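That EXPLAIN check is easy to automate, for instance in a smoke test that runs after migrations. The sketch below only inspects plan text that has already been fetched; the sample plans are abbreviated, hypothetical output (real plans carry cost figures), and the index name is made up.

```python
def plan_uses_index(plan_text: str) -> bool:
    """True if any plan node is an index scan. Deliberately crude:
    it matches on the node label EXPLAIN prints."""
    return any(
        line.lstrip(" ->").startswith("Index Scan using")
        for line in plan_text.splitlines()
    )


# Abbreviated, illustrative plans.
good_plan = """Limit
  ->  Index Scan using documents_embedding_idx on documents
        Order By: (embedding <=> $1)"""

bad_plan = """Limit
  ->  Sort
        ->  Seq Scan on documents"""

print(plan_uses_index(good_plan), plan_uses_index(bad_plan))
```

Run EXPLAIN on the exact query text your application sends, because a plan for a slightly different query proves nothing.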

Building the index before bulk loading. Inserting a million rows into an existing HNSW index is dramatically slower than loading first and building once, and pgvector parallelizes fresh builds. Load, then CREATE INDEX.

Breaking the ORDER BY ... LIMIT shape. The index kicks in for ORDER BY embedding <=> $1 LIMIT k. Wrap the distance in an expression, stick it only in the SELECT list, or omit the LIMIT, and you can lose the index path. Keep the canonical shape.

Trusting session SET through a pooler. Covered above; worth repeating because it's invisible. SET LOCAL or ALTER DATABASE.

Mixing embedding models. Vectors from text-embedding-3-small and from bge-large live in unrelated spaces. Comparing them returns confident garbage. One model per column; re-embed everything when you switch, and treat the model name as part of your schema.

Assuming exact results. ANN search is approximate by design. If a known-relevant document occasionally vanishes from the top 10, that's recall, not a bug. Measure it, then raise ef_search deliberately instead of in a panic.

None of these are exotic. They're the same six issues, every time, and now you'll recognize them in the first hour instead of the first incident.

Frequently asked questions

What is pgvector used for?

Storing embeddings and finding the most similar ones with SQL. The big four use cases: retrieval-augmented generation (fetch relevant chunks to ground an LLM's answer), semantic search (match by meaning, not keywords), recommendations ("more like this"), and near-duplicate detection. Its distinguishing strength is combining similarity with ordinary relational filters (tenant, date, permissions) in a single query against data you already have in Postgres.

Is pgvector free?

Yes. It's open source under the PostgreSQL license, with no paid tier. You pay only for the Postgres it runs on: self-hosted, or any managed provider that ships it. On Rivestack's managed pgvector it comes pre-installed (0.8.x on PostgreSQL 17), including on the free tier.

How many vectors can pgvector handle?

The honest answer is "as many as fit in RAM for your dimension count." At 1536 dimensions, expect comfortable fast-path serving around 350k vectors on a 4 GB node, 600k on 8 GB, and 1M on 16 GB, and remember the HNSW build is the binding constraint (1M won't even build below 16 GB in our testing). At 384 or 768 dimensions those ceilings rise proportionally. Past a few million high-dimensional vectors on a single node, a dedicated engine starts earning its operational cost.
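To turn "rise proportionally" into a number, here is a rough sketch that scales the 1536-dimension figures above inversely with dimension count. The inverse scaling is an assumption taken from that sentence, it ignores fixed per-row and per-node overhead, and only the 1536-dimension baselines are measured figures from this post.

```python
BASELINE_DIMS = 1536
# In-RAM serving capacity at 1536 dimensions, keyed by node RAM in GB.
BASELINE_CAPACITY = {4: 350_000, 8: 600_000, 16: 1_000_000}


def estimated_capacity(ram_gb: int, dims: int) -> int:
    """Rule-of-thumb vector count for a node size at a given dimension
    count, assuming capacity is inversely proportional to dimensions."""
    return BASELINE_CAPACITY[ram_gb] * BASELINE_DIMS // dims


for dims in (384, 768, 1536):
    print(dims, estimated_capacity(16, dims))
```

Treat the output as a starting point for a benchmark, not a guarantee, since build memory remains the tighter constraint.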

Should I use HNSW or IVFFlat?

HNSW, almost always. It delivers better recall at a given latency, handles inserts and updates gracefully, and has one main query knob (ef_search). IVFFlat builds faster and uses less build memory, but its cluster centers are fixed at creation, so recall decays as your data drifts and you end up rebuilding. Choose IVFFlat only when build resources are the hard constraint.

Do I need a GPU to run pgvector?

No. Generating embeddings may use a GPU (usually on the model provider's side; you just call an API), but storing and searching them in pgvector is pure CPU work. Distance computations are SIMD-accelerated on modern x86 and ARM, which is how the 8 vCPU node in the numbers above sustains ~4,465 QPS at recall@10 0.95 (16 clients) over 250k vectors.
