Building a Production RAG Stack on PostgreSQL

Every RAG architecture diagram I see has at least five boxes: a vector database, a document store, a metadata service, a cache, an orchestration framework. The RAG systems I've actually shipped have one box. PostgreSQL holds the documents, the chunks, the embeddings, the permissions, and the keyword index, and the application talks to all of it in SQL.

This is not minimalism for its own sake. A single database changes what your retrieval layer can do: permission-aware search becomes a join, ingestion becomes a transaction, and your backup story stops being two backup stories that disagree. Here's the whole pipeline, with the schema and queries I'd put in production, the tuning knobs that actually matter, and an honest accounting of where this architecture stops working.

Why one database beats a bolt-on vector store

The standard counterargument to Postgres-as-vector-store is performance, and we'll get to real numbers. But first the part that decides most projects, because it has nothing to do with QPS.

Permissions are a join, not a sync job. Real RAG apps almost never search "all documents." They search "documents this user in this tenant is allowed to see." With a separate vector database you either copy ACL metadata into every vector's payload (and now have a cache-invalidation problem every time someone's access changes) or you over-fetch and post-filter in application code (and now sometimes return three results when the user asked for eight). In Postgres, the ACL table is right there.

Ingestion is transactional. A document, its forty chunks, and their forty embeddings get written in one transaction. Either all of it lands or none of it does. With a bolt-on vector DB, there is always a window where Postgres has the document and the vector store doesn't, or, worse, where you deleted the document and orphaned vectors keep surfacing in retrieval. That failure mode is insidious because nothing errors: the assistant just keeps confidently quoting content that no longer exists, and the first person to notice is usually a user.

One backup, one restore, one consistent point in time. Point-in-time recovery on a single database restores documents and embeddings to the same moment. Try coordinating a PITR restore across Postgres and an external vector store to a consistent state. You can't, not really.

There are workloads where a dedicated vector database earns its keep. I cover them at the end, and there's a longer comparison at /pinecone-alternative. For the large majority of RAG apps under a few million vectors, the operational simplicity wins outright.

The schema

Two tables, three indexes. The tsvector column is generated, so full-text indexing is maintenance-free.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  tenant_id   bigint NOT NULL,
  source_uri  text NOT NULL,
  title       text,
  updated_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
  id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  document_id  bigint NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  tenant_id    bigint NOT NULL,
  ordinal      int    NOT NULL,           -- position within the document
  heading_path text[],                    -- e.g. ARRAY['Pricing','Refunds']
  content      text   NOT NULL,
  embedding    vector(1536) NOT NULL,
  tsv          tsvector GENERATED ALWAYS AS
                 (to_tsvector('english', content)) STORED
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
CREATE INDEX ON chunks USING gin (tsv);
CREATE INDEX ON chunks (tenant_id, document_id);

Two deliberate choices here. ON DELETE CASCADE means deleting a document atomically removes its chunks from both the vector index and the keyword index, so the orphaned-vector bug is structurally impossible. And ordinal plus heading_path are retrieval features, not decoration: ordinal lets you fetch the chunks before and after a hit to widen context, and the heading path goes into the text you embed (more on that next).

Chunking, where most of the quality lives

I'll be blunt: teams spend weeks tuning ef_search when their real problem is that they split documents every 1,000 characters mid-sentence. Chunking decisions dominate RAG quality. What works:

300 to 500 tokens per chunk, 10-15% overlap. Smaller chunks embed more precisely but lose context; bigger chunks dilute the embedding. This range is the boring, correct default for prose.
Split on structure, not byte counts. Headings first, then paragraphs, then sentences. A chunk should be a thought, not a window.
Prepend context to the embedded text. Embed "Acme Handbook > Pricing > Refunds\n\n" + chunk_text, store the raw chunk in content. A paragraph that says "this policy does not apply to annual plans" is unsearchable without knowing which policy. This one change moves retrieval quality more than any index parameter.
Never split tables or code blocks. Half a table embeds as noise. If a table exceeds your chunk size, keep it whole and let it be a big chunk.
Store provenance. document_id, ordinal, heading_path. When the LLM answers, you want to cite the source, and when a retrieval looks wrong, you want to find the chunk in the original document in seconds.
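
The rules above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a production chunker: it assumes Markdown-style "#" headings separated from paragraphs by blank lines, uses word counts as a rough stand-in for tokens, and leaves out overlap and table/code-block detection. The names Chunk, chunk_document, and max_words are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    ordinal: int             # position within the document
    heading_path: list[str]  # e.g. ["Pricing", "Refunds"]
    content: str             # raw text, stored as-is
    embed_text: str          # heading context + content, sent to the embedder


def chunk_document(title: str, text: str, max_words: int = 350) -> list[Chunk]:
    """Split on '#' headings, then paragraphs; never cut inside a paragraph."""
    chunks: list[Chunk] = []
    path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if not buffer:
            return
        content = "\n\n".join(buffer)
        context = " > ".join([title, *path])
        chunks.append(Chunk(len(chunks), list(path), content, f"{context}\n\n{content}"))
        buffer.clear()

    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        heading = re.match(r"^(#+)\s+(.*)$", block)
        if heading:
            flush()  # a heading always closes the current chunk
            depth = len(heading.group(1))
            path[:] = path[: depth - 1] + [heading.group(2).strip()]
            continue
        words = sum(len(p.split()) for p in buffer) + len(block.split())
        if buffer and words > max_words:
            flush()  # start a new chunk at a paragraph boundary
        buffer.append(block)
    flush()
    return chunks
```

The embed_text field is what goes to the embedding model; content is what lands in the chunks table. A paragraph longer than max_words stays whole, which matches the "let it be a big chunk" rule.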

Embed with whatever model fits your stack. OpenAI's text-embedding-3-small at 1536 dimensions is the common default the schema above assumes. Write the document row and all its chunk rows in one transaction.
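
To make the all-or-nothing behavior concrete, here is a small runnable sketch. It deliberately swaps in SQLite from the Python standard library as a stand-in for Postgres so it runs anywhere; the table layout is trimmed, embeddings are stored as JSON text rather than a vector column, and open_db and ingest are hypothetical helpers. The transactional shape is the point, not the driver.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE documents (
  id         INTEGER PRIMARY KEY,
  tenant_id  INTEGER NOT NULL,
  source_uri TEXT NOT NULL
);
CREATE TABLE chunks (
  id          INTEGER PRIMARY KEY,
  document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  tenant_id   INTEGER NOT NULL,
  ordinal     INTEGER NOT NULL,
  content     TEXT NOT NULL,
  embedding   TEXT NOT NULL   -- JSON stand-in for vector(1536)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for CASCADE
    conn.executescript(SCHEMA)
    return conn


def ingest(conn, tenant_id, source_uri, chunks):
    """Insert a document and its (content, embedding) chunks atomically."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "INSERT INTO documents (tenant_id, source_uri) VALUES (?, ?)",
            (tenant_id, source_uri),
        )
        doc_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO chunks (document_id, tenant_id, ordinal, content, embedding)"
            " VALUES (?, ?, ?, ?, ?)",
            [
                (doc_id, tenant_id, i, content, json.dumps(embedding))
                for i, (content, embedding) in enumerate(chunks)
            ],
        )
    return doc_id
```

If any chunk row fails, the document row disappears with it, and deleting a document later takes its chunks along through the cascade. Against Postgres the same structure applies with a transaction block in your driver of choice.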

Retrieval with real filters

The query that justifies the whole architecture combines vector similarity, tenant isolation, and a live permission check in one statement. It assumes a document_acl table of (document_id, user_id) grants, which the schema above leaves out:

SELECT c.content, c.heading_path, d.source_uri,
       c.embedding <=> $3 AS distance
FROM chunks c
JOIN documents d     ON d.id = c.document_id
JOIN document_acl a  ON a.document_id = d.id
                    AND a.user_id = $2
WHERE c.tenant_id = $1
ORDER BY c.embedding <=> $3
LIMIT 8;

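If someone's access was revoked one millisecond ago, this query reflects it. No sync lag, no payload-metadata drift, no application-side post-filtering that silently shrinks your result set. This is the thing external vector stores genuinely cannot give you. One caveat: pgvector applies filters after the HNSW scan, so for selective filters enable the iterative index scans added in pgvector 0.8 (hnsw.iterative_scan) to keep the LIMIT full.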
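
The difference between "fetch the top k, then filter" and "filter, then take the top k" is easy to see with a toy in-memory corpus. This is an illustrative sketch only: it ranks by brute-force cosine distance in plain Python, the function names and the dict layout are hypothetical, and it says nothing about how the Postgres planner executes the join.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k_then_filter(query, chunks, allowed_docs, k):
    """Bolt-on pattern: take the global top-k, then drop what the user can't see."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(query, c["embedding"]))
    return [c for c in ranked[:k] if c["document_id"] in allowed_docs]


def filter_then_top_k(query, chunks, allowed_docs, k):
    """Join pattern: restrict to permitted documents, then rank."""
    visible = [c for c in chunks if c["document_id"] in allowed_docs]
    return sorted(visible, key=lambda c: cosine_distance(query, c["embedding"]))[:k]
```

When the nearest chunks all belong to documents the user cannot see, the first function returns fewer than k results (possibly none) while the second still returns k permitted ones.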

Hybrid search: pgvector + tsvector + RRF

Pure vector search has a known blind spot: exact tokens. Error codes, SKUs, function names, people's names: an embedding model smears these into semantic mush, while tsvector nails them. Reciprocal Rank Fusion combines both rankings without any score normalization, which matters because cosine distances and ts_rank_cd scores live on incomparable scales.

WITH vec AS (
  SELECT id, row_number() OVER (ORDER BY embedding <=> $1) AS rank
  FROM chunks
  WHERE tenant_id = $2
  ORDER BY embedding <=> $1
  LIMIT 50
),
fts AS (
  SELECT id, row_number() OVER
           (ORDER BY ts_rank_cd(tsv, query) DESC) AS rank
  FROM chunks, plainto_tsquery('english', $3) AS query
  WHERE tenant_id = $2
    AND tsv @@ query
  ORDER BY ts_rank_cd(tsv, query) DESC  -- without this, LIMIT keeps an arbitrary 50
  LIMIT 50
)
SELECT c.id, c.content,
       coalesce(1.0 / (60 + vec.rank), 0) +
       coalesce(1.0 / (60 + fts.rank), 0) AS rrf_score
FROM vec
FULL OUTER JOIN fts USING (id)
JOIN chunks c USING (id)
ORDER BY rrf_score DESC
LIMIT 8;

The constant 60 is the standard RRF damping factor; I've never found a corpus where tuning it beat fixing chunking instead. The FULL OUTER JOIN is the important part: a chunk that ranks third on keywords but misses the vector top-50 entirely still surfaces. For a query like "what does error PGRST301 mean," that chunk is usually the answer. There's a deeper treatment in hybrid search with pgvector and Postgres.
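
If it helps to see the fusion arithmetic outside SQL, here is the same idea as a small Python function. It is a sketch for intuition, assuming each ranking is a list of chunk IDs ordered best-first; rrf_fuse is a hypothetical name, and the tie-break by ID is my addition for determinism.

```python
def rrf_fuse(rankings, k=60, limit=8):
    """Fuse ranked ID lists (best first) with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

An ID that appears in only one list still gets a score from that list, which is the Python equivalent of the FULL OUTER JOIN plus coalesce in the query.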

Reranking: cheap insurance on top

HNSW retrieval optimizes for "roughly the right neighborhood, fast." A reranker optimizes for "actually the best eight." The pattern: over-fetch 50 candidates with the hybrid query above, score each (query, chunk) pair with a cross-encoder (Cohere Rerank, a hosted BGE reranker, whatever) and keep the top 8 for the prompt.

Two practical notes. First, rerankers read the stored chunk text, so this is where ordinal pays off: you can expand each candidate with its neighboring chunks before scoring, giving the reranker full paragraphs instead of fragments. Second, reranking buys you slack everywhere upstream: with a reranker in place, approximate recall of 0.93 at the HNSW layer is comfortably enough, because the reranker fixes the ordering within the candidate pool. That has direct consequences for the next section.
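
The expand-then-score pattern can be sketched as follows. Everything here is an assumption for illustration: candidates are (document_id, ordinal) pairs, chunks_by_pos is an in-memory dict standing in for a lookup against the chunks table, and score_pair is a placeholder for whatever cross-encoder you call; it is not any vendor's API.

```python
def expand_with_neighbors(candidate, chunks_by_pos, window=1):
    """Join a chunk with its neighbors (by ordinal) from the same document."""
    doc_id, ordinal = candidate
    parts = [
        chunks_by_pos[(doc_id, o)]
        for o in range(ordinal - window, ordinal + window + 1)
        if (doc_id, o) in chunks_by_pos
    ]
    return "\n\n".join(parts)


def rerank(query, candidates, chunks_by_pos, score_pair, top_n=8):
    """Score (query, expanded text) pairs and keep the best top_n candidates."""
    scored = [
        (score_pair(query, expand_with_neighbors(c, chunks_by_pos)), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]
```

In a real pipeline the neighbor lookup would be one extra query keyed on (document_id, ordinal), and score_pair would batch its calls to the reranker rather than scoring one pair at a time.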

The quality knobs: ef_search and the recall cliff

hnsw.ef_search controls how much of the graph each query explores. More exploration, better recall, more latency. The tradeoff is real and measurable. These are numbers from our own benchmark runs (June 2026, 250k chunks of 1536-dim embeddings, same-region client, methodology in the NVMe benchmark post):

On a 2 vCPU / 4 GB node, ef_search = 80 gives recall@10 of 0.93 at roughly 1,600 QPS (16 concurrent clients).
Lower the recall target and the same node trades recall for more throughput; raise it and you pay throughput back. That is the whole knob.
Chase recall 0.99 with ef_search = 200 on that same 4 GB node and throughput collapses: the working set spills cache and every query starts touching disk, so latency jumps by roughly an order of magnitude and throughput drops several-fold. The recall curve is not linear; there's a cliff, and on small nodes it's close.

The lesson: with hybrid search and a reranker downstream, target recall around 0.9 at the index and stop. Recall 0.99 at the HNSW layer means paying cliff prices for precision the reranker already provides for tens of milliseconds.
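
Since recall is the number to watch, it helps to be precise about how it is computed. A minimal sketch, assuming you have, for each test query, the IDs returned by the HNSW index and the IDs from an exact search over the same data (for example, the same ORDER BY run with index scans disabled); recall_at_k and mean_recall are hypothetical helper names.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that the approximate search also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall(pairs, k=10):
    """Average recall@k over (approx_ids, exact_ids) pairs, one per test query."""
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)
```

Run it over a few hundred representative queries at each ef_search value you are considering; a single query tells you very little.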

One gotcha that costs people real debugging hours: if you connect through PgBouncer in transaction pooling mode (port 6432 on Rivestack's Solo plan, and common everywhere), a session-level SET hnsw.ef_search = 80 does not follow your client. It sticks to whichever server connection happened to run it, and your next query may run on a different backend with the default value. Use SET LOCAL inside the query's transaction:

BEGIN;
SET LOCAL hnsw.ef_search = 80;
SELECT ... ORDER BY embedding <=> $1 LIMIT 50;
COMMIT;

Or set it once at the database level with ALTER DATABASE mydb SET hnsw.ef_search = 80, which takes effect for new server connections. If your recall measurements look mysteriously like the defaults no matter what you set, this is why. Full tuning walkthrough: HNSW tuning on managed Postgres.
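
From application code, the transaction-scoped setting can be wrapped in a helper. This is a minimal sketch, assuming a Python DB-API connection with autocommit off and %s placeholders (psycopg behaves this way by default); search_with_ef is a hypothetical name. It uses set_config(..., true), the function form of SET LOCAL, because SET itself does not accept server-side bind parameters.

```python
def search_with_ef(conn, query_sql, params, ef_search=80):
    """Run one query with hnsw.ef_search scoped to the same transaction.

    Assumes a DB-API connection with autocommit off, so the first execute
    opens a transaction that conn.commit() or conn.rollback() closes.
    """
    cur = conn.cursor()
    try:
        # set_config(name, value, is_local=true) behaves like SET LOCAL
        # and, unlike SET, accepts the value as a bind parameter.
        cur.execute(
            "SELECT set_config('hnsw.ef_search', %s, true)", (str(ef_search),)
        )
        cur.execute(query_sql, params)
        rows = cur.fetchall()
        conn.commit()
        return rows
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Because the setting and the query share one transaction, a transaction-mode pooler has to run both on the same backend, and the value is discarded at commit instead of leaking to the next client.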

The latency budget of a RAG request

Here's an end-to-end RAG request, roughly to scale:

Stage	Typical cost
Embed the user's question (API call)	50 to 200 ms
Hybrid retrieval in Postgres	4 to 10 ms
Rerank 50 candidates (API call)	30 to 150 ms
LLM generation, time to first token	400 ms to 2 s

Same-region network between your app and the database adds about 0.4 ms round trip (0.38 ms in our measurements, over the pooled port with TLS).

Look at the proportions. Database retrieval is about two percent of the request at the very most, and usually well under one. This is why "is pgvector fast enough for RAG" is usually the wrong question. At chat latencies, a $29/month 1 vCPU node serving p50 3.8 ms at recall 0.90 is invisible next to the embedding call that precedes it and the generation that follows. The retrieval numbers that do matter are recall (are the right chunks in the candidate set at all?) and concurrency headroom (what happens when 200 users hit the assistant at once?). Spend your tuning effort there, and spend your latency effort on streaming the LLM response and caching question embeddings for repeated queries.
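
The proportions are quick to check from the table. A small worked calculation, using only the ranges listed above (the sub-millisecond network hop is left out); STAGES and retrieval_share are hypothetical names.

```python
# (low_ms, high_ms) per stage, taken from the table above
STAGES = {
    "embed question": (50, 200),
    "hybrid retrieval": (4, 10),
    "rerank": (30, 150),
    "LLM first token": (400, 2000),
}


def retrieval_share(stages, name="hybrid retrieval"):
    """Return (best-case, midpoint, worst-case) share of the named stage."""
    low, high = stages[name]
    others_low = sum(lo for n, (lo, hi) in stages.items() if n != name)
    others_high = sum(hi for n, (lo, hi) in stages.items() if n != name)
    mid = (low + high) / 2
    others_mid = (others_low + others_high) / 2
    return (
        low / (low + others_high),    # fast retrieval, slow everything else
        mid / (mid + others_mid),     # midpoints throughout
        high / (high + others_low),   # slow retrieval, fast everything else
    )
```

With these ranges the share works out to roughly 0.2% in the best case, 0.5% at the midpoints, and 2% in the worst case.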

If you want to feel this rather than read it, ask.rivestack.io is a live semantic search over Hacker News running on a real Rivestack database.

Where this runs out of road

I'd rather you hit these limits with a map than discover them at 2 a.m.

HNSW builds are memory-bound, and the wall is abrupt. In our benchmark runs, a 4 GB box built a 250k-vector HNSW index in about five minutes, and then took hours at 500k. At 1M vectors on 4 GB, the build doesn't finish slowly; it fails. Rough in-RAM capacity for fast search at 1536 dimensions: ~350k vectors on 4 GB, ~600k on 8 GB, ~1M on 16 GB. A 16 GB node builds 1M in about 9 minutes and serves it hot at 16 clients: ~3,600 QPS at recall 0.74 (p50 4.2 ms). It will not build 2M. Size the node to the index build, not the steady state, and stay a comfortable margin below the ceiling.

Reads don't scale by adding nodes, at least not yet. On most managed Postgres HA setups, including Rivestack's, additional nodes are streaming-replication standbys for automatic failover; every query, reads included, lands on the primary. HA buys you uptime, not throughput. If your QPS outgrows one node's capacity, the honest options are a bigger node, fewer dimensions (a 768-dim Matryoshka truncation roughly doubles capacity, usually with modest recall cost), or partitioning the corpus by tenant.
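
The capacity figures above, and the 768-dimension option, are easier to reason about with the raw arithmetic in hand. This is a back-of-the-envelope sketch, not a sizing tool: it counts only the vectors themselves at 4 bytes per dimension plus an 8-byte header per vector (pgvector's documented storage size), and ignores HNSW graph links, row overhead, and everything else the node holds in memory. The function names are hypothetical.

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_float=4, header_bytes=8):
    """Bytes for the vectors alone, before HNSW graph links and row overhead."""
    return n_vectors * (dims * bytes_per_float + header_bytes)


def gib(n_bytes):
    return n_bytes / 2**30
```

By this count, 350k vectors at 1536 dimensions are about 2 GiB before any index structure, which is already half of a 4 GB node, and the same 350k at 768 dimensions are about 1 GiB. The measured ceilings above are the numbers to plan against; the arithmetic only explains why they sit where they do.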

Past the low millions of vectors per logical shard, sharded purpose-built vector databases start justifying their operational cost. Below that, which is most RAG applications most of the time, you're trading away transactions, joins, and a single backup story for headroom you don't need.

Everything in this post runs on stock PostgreSQL 17 with pgvector 0.8. If you'd rather not operate that yourself, managed pgvector on dedicated NVMe is what we built Rivestack for, but the architecture is yours either way.

Frequently asked questions

Do I really not need a separate vector database for RAG?

For most RAG applications, no. The retrieval workload is moderate (thousands of QPS at most, against hundreds of thousands to low millions of chunks), and Postgres handles it on a single modest node while giving you transactional ingestion, permission-aware retrieval via joins, and one backup covering documents and embeddings together. A dedicated vector store starts making sense in the multi-million-vector range per shard, or when vector search is your product rather than a feature of it.

What chunk size should I use for RAG?

Start at 300 to 500 tokens with 10-15% overlap, split along document structure (headings, then paragraphs) rather than fixed character counts, and prepend the document title and heading path to the text you embed. That last step, contextual chunk headers, typically improves retrieval more than any index tuning. Keep tables and code blocks intact even when they exceed your target size.

Why is my hnsw.ef_search setting being ignored?

Almost certainly PgBouncer in transaction pooling mode. A session-level SET doesn't survive transaction pooling, so your next query can land on a different server backend with the default ef_search of 40. Wrap the setting and the query in one transaction with SET LOCAL hnsw.ef_search = ..., or persist it with ALTER DATABASE ... SET hnsw.ef_search = ....

How many vectors can one Postgres node actually handle?

For 1536-dimensional embeddings with an in-RAM HNSW index: plan on roughly 350k vectors on a 4 GB node, 600k on 8 GB, and 1M on 16 GB. The binding constraint is usually the index build, which is memory-bound: a 4 GB node builds 250k in minutes, takes hours at 500k, and fails outright at 1M. You can stretch these limits meaningfully by using lower-dimensional embeddings or partitioning chunks by tenant.

Is hybrid search worth the extra complexity?

Yes, and it's less complexity than it looks: one generated tsvector column, one GIN index, and the RRF query above. Pure vector search reliably misses exact-token queries: error codes, product SKUs, names, identifiers. Full-text search catches those, RRF merges the two rankings without any score-normalization headaches, and the whole thing stays a single SQL statement against a single database.
