All posts Building a RAG Stack on PostgreSQL
·Rivestack Team· 3 min read

Building a RAG Stack on PostgreSQL

PostgreSQL
AI
pgvector
RAG

Retrieval-augmented generation (RAG) has a reputation for needing a sprawling stack — a vector database, a metadata store, an orchestration layer, a cache. You can build a production RAG system with far less: PostgreSQL, the pgvector extension, and your application code. Here is the architecture, end to end.

The whole stack is one table (plus an index)

At its core, RAG retrieval is "find the chunks most similar to this question." In PostgreSQL, that is a table with a vector column and an HNSW index:

CREATE TABLE chunks (
  id          bigserial PRIMARY KEY,
  document_id bigint NOT NULL,
  tenant_id   bigint NOT NULL,
  content     text NOT NULL,
  embedding   vector(1536)
);
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON chunks (tenant_id);

Everything else — documents, users, permissions — lives in the same database, so retrieval can filter and join against it directly.

Step 1: Chunk and embed

Split source documents into chunks of a few hundred tokens with some overlap, then embed each chunk with your model of choice (OpenAI text-embedding-3-small at 1536 dimensions is a common default). Insert the chunk text and its embedding in the same row — in the same transaction as the document it belongs to, so they can never drift apart.

Step 2: Retrieve with filters

This is where keeping vectors in PostgreSQL pays off. The retrieval query combines vector similarity with ordinary SQL filters:

SELECT content
FROM chunks
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 8;

The tenant_id filter uses a B-tree index; the similarity ranking uses the HNSW index. No separate metadata system, no "filter then search" gymnastics — the planner handles it.

Step 3: Assemble and generate

Take the retrieved chunks, build your prompt, and call the LLM. Because retrieval is just SQL, you can add re-ranking, recency boosts, or hybrid search (combine vector similarity with PostgreSQL full-text search) without leaving the database.

What about performance?

Two things matter: keep the HNSW index in memory where possible, and run on storage with low random-read latency. HNSW traversal is random-read bound, so NVMe beats cloud block storage noticeably on tail latency once the index spills past RAM. Tune hnsw.ef_search to trade a little latency for recall.

Frameworks, if you want them

LangChain and LlamaIndex both ship a PGVector store, so you can wire retrieval in a few lines and still drop to raw SQL when you need a join or filter the abstraction hides. See pgvector with LangChain and building RAG with pgvector and OpenAI.

FAQ

Can I build RAG entirely on PostgreSQL?

Yes. Storage, embeddings, similarity search, and metadata filtering all live in PostgreSQL with pgvector. The only external piece is the embedding and generation model, which you call over an API.

How do I add metadata filtering to RAG retrieval?

Add ordinary columns (tenant, language, document type, timestamp) to your chunks table and index them. Then combine them with the vector search in one query — WHERE tenant_id = $1 ORDER BY embedding <=> $2. The B-tree and HNSW indexes work together.

What embedding dimension should I use?

Match your model. OpenAI text-embedding-3-small is 1536 dimensions; many open-source models are 384–768. Declare the column as vector(N) with N equal to the model's output size. pgvector indexes support up to 2,000 dimensions.


Ready to build? Spin up a PostgreSQL database with pgvector free, or read the PostgreSQL for AI overview.