All posts pgvector with LlamaIndex: Build a RAG Pipeline on PostgreSQL
ยทRivestack Teamยท 10 min read

pgvector with LlamaIndex: Build a RAG Pipeline on PostgreSQL

pgvector
LlamaIndex
RAG
PostgreSQL
python
embeddings

LlamaIndex was built with one goal in mind: make it easy to connect your data to large language models. Where LangChain gives you a general-purpose toolkit for chaining LLM calls, LlamaIndex focuses specifically on the data layer โ€” ingestion, indexing, retrieval, and query pipelines. For teams building RAG applications, that focus shows.

And if you're already running PostgreSQL, there's no reason to introduce a separate vector database. pgvector turns your existing Postgres instance into a capable vector store, and LlamaIndex has first-class support for it. Everything lives in one place: your documents, embeddings, metadata, and application data. You get SQL joins, atomic transactions, and your existing infrastructure, all without a separate sync pipeline.

This guide walks through the LlamaIndex PGVectorStore integration from scratch. You'll ingest documents, run semantic queries, filter by metadata, and wire it into a full query engine.

LangChain or LlamaIndex?

A common question before any RAG project. The short answer: both work well with pgvector, and the choice comes down to what you're building.

LangChain is more general. It gives you tools for agents, chains, memory, tools, and output parsing. It's well-suited if your application does more than just retrieve and answer, or if you want to mix models, APIs, and custom logic in complex ways.

LlamaIndex is more focused. Its data abstractions (nodes, indexes, retrievers, query engines) are purpose-built for ingestion and retrieval. You get more out of the box for document-centric apps: automatic chunking, document deduplication, structured metadata handling, and composable query pipelines.

If you're building a knowledge base, documentation assistant, or any app where the primary work is "load documents, answer questions about them," LlamaIndex tends to require less glue code.

If you'd like the LangChain version of this guide, see pgvector with LangChain.

Prerequisites

You'll need:

  • PostgreSQL 15 or later with the vector extension
  • Python 3.9+
  • An OpenAI API key (or any embeddings model supported by LlamaIndex)

Install the required packages:

pip install llama-index llama-index-vector-stores-postgres llama-index-embeddings-openai psycopg2-binary

The llama-index-vector-stores-postgres package is the maintained integration. It uses asyncpg or psycopg2 under the hood and maps directly to a pgvector table in your database.

Enable the vector extension in your PostgreSQL instance:

CREATE EXTENSION IF NOT EXISTS vector;

On Rivestack, the vector extension is enabled by default on every instance. On local Postgres, you may need to install the pgvector package first and then run the command above.

Connecting LlamaIndex to pgvector

LlamaIndex's PGVectorStore takes a connection string and a few configuration parameters:

import psycopg2
from llama_index.vector_stores.postgres import PGVectorStore

connection_string = "postgresql://user:password@localhost:5432/mydb"

vector_store = PGVectorStore.from_params(
    host="localhost",
    port=5432,
    database="mydb",
    user="user",
    password="password",
    table_name="documents",
    embed_dim=1536,  # matches text-embedding-3-small dimensions
)

The table_name parameter controls which table in your database holds the vectors. LlamaIndex will create this table on first use if it doesn't already exist. The embed_dim must match the dimensionality of your embedding model: 1536 for text-embedding-3-small, 3072 for text-embedding-3-large.

Setting Up the Embedding Model

LlamaIndex uses a global service context to configure shared resources like the embedding model and LLM. In newer versions of the library, you configure these explicitly:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key="your-openai-api-key"
)

Settings.llm = OpenAI(
    model="gpt-4o-mini",
    api_key="your-openai-api-key"
)

This sets the embedding model and LLM globally for the session. You can override these per-index if you need different models for different indexes.

Loading and Ingesting Documents

LlamaIndex has readers for many common data sources: PDFs, Markdown files, Notion databases, websites, and more. For this walkthrough, we'll load local text files, which covers the common case of ingesting your own documents.

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_dir="./docs",
    recursive=True
).load_data()

print(f"Loaded {len(documents)} documents")

SimpleDirectoryReader handles the file I/O and wraps each file in a Document object. The recursive=True flag walks subdirectories. It auto-detects file type and uses the appropriate parser.

Chunking and Indexing

Raw documents need to be split into smaller chunks before embedding. LlamaIndex handles this automatically when you build a VectorStoreIndex:

from llama_index.core import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
)

Under the hood, LlamaIndex:

  1. Splits each document into nodes (chunks) using its default chunking strategy
  2. Calls the embedding model for each node
  3. Inserts each node and its embedding vector into the pgvector table

The default chunk size is 1024 tokens with 20 tokens of overlap. You can adjust this:

from llama_index.core.node_parser import SentenceSplitter

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=50)
    ]
)

Smaller chunks give more precise retrieval at the cost of more API calls during ingestion. Larger chunks carry more context per result but may dilute relevance scores. For most technical documentation, 512 to 768 tokens with 50 to 100 tokens of overlap is a reasonable starting point.

Querying the Index

Once the documents are ingested, you can query the index using a query engine:

query_engine = index.as_query_engine(similarity_top_k=4)

response = query_engine.query(
    "How do I configure connection pooling in PostgreSQL?"
)

print(response)

similarity_top_k=4 retrieves the four most semantically similar chunks to the query, then passes them as context to the LLM for synthesis. The query engine handles both the retrieval and the generation steps.

The response object includes both the synthesized answer and the source nodes used to construct it:

print(response.response)

for node in response.source_nodes:
    print(f"\nScore: {node.score:.3f}")
    print(f"Source: {node.metadata.get('file_name', 'unknown')}")
    print(node.get_content()[:300])

Inspecting source nodes is important during development. If the answer is wrong or incomplete, the source nodes tell you whether retrieval found the right content (a retrieval problem) or whether the LLM synthesized it poorly (a generation problem). These require different fixes.

Loading an Existing Index

When you restart your application, you don't want to re-ingest everything. Load the existing index from the vector store instead:

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

vector_store = PGVectorStore.from_params(
    host="localhost",
    port=5432,
    database="mydb",
    user="user",
    password="password",
    table_name="documents",
    embed_dim=1536,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store,
    storage_context=storage_context
)

from_vector_store connects to the existing table without re-embedding. This is how you use the index in production: ingest once (or incrementally), query many times.

Metadata Filtering

One of LlamaIndex's strengths is metadata filtering: the ability to restrict retrieval to a subset of your documents before running the vector similarity search. This is useful when your index spans many document types, users, or categories.

Adding Metadata at Ingestion

You can attach metadata to documents when loading them:

from llama_index.core import Document

documents = [
    Document(
        text="PostgreSQL supports multiple isolation levels...",
        metadata={
            "source": "postgres-docs",
            "category": "transactions",
            "version": "15",
        }
    ),
    Document(
        text="pgvector adds a new data type called vector...",
        metadata={
            "source": "pgvector-docs",
            "category": "extensions",
            "version": "0.7",
        }
    ),
]

LlamaIndex stores metadata alongside the vector in Postgres. The metadata column in the vector table is a JSONB field, so you can filter on any key at query time.

Filtering at Query Time

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

filters = MetadataFilters(filters=[
    MetadataFilter(key="category", value="transactions"),
])

query_engine = index.as_query_engine(
    similarity_top_k=4,
    filters=filters
)

response = query_engine.query(
    "What is the default transaction isolation level?"
)

The filter is applied in PostgreSQL before the vector similarity ranking. Only nodes matching category = 'transactions' are candidates for retrieval. This is more efficient than post-filtering: the database does the work, and your top-k results come from a narrower, more relevant pool.

For multi-tenant applications, this is the natural way to scope retrieval per user or per workspace: add a user_id or workspace_id to every document's metadata at ingestion time, then filter by it at query time.

Retriever vs. Query Engine

LlamaIndex separates retrieval from synthesis. The VectorIndexRetriever retrieves nodes; the query engine wraps a retriever and adds LLM synthesis on top. You can use them separately when you need more control.

Using the Retriever Directly

from llama_index.core.retrievers import VectorIndexRetriever

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=6
)

nodes = retriever.retrieve("connection pooling strategies")

for node in nodes:
    print(f"Score: {node.score:.3f} | {node.node.get_content()[:200]}")

The retriever returns raw scored nodes without calling the LLM. This is useful when you want to:

  • Inspect what retrieval returns before generation
  • Post-process or rerank nodes yourself
  • Feed the nodes into a different synthesis pipeline

Custom Response Synthesis

You can compose a custom query engine by combining a retriever with a response synthesizer:

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer

synthesizer = get_response_synthesizer(
    response_mode="tree_summarize"
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer
)

LlamaIndex supports several response modes:

  • refine: Iterates over each retrieved node, refining the answer progressively. Slow but thorough.
  • compact: Packs as many nodes as possible into a single context window, then generates one answer. Faster and cheaper.
  • tree_summarize: Builds a tree of summaries from the retrieved nodes. Best for large result sets.
  • simple_summarize: Single LLM call with all nodes concatenated. Works well when the result set is small.

For most production use cases, compact is the default and a good choice.

Incremental Ingestion

Production systems rarely ingest once. Documents get updated, new ones arrive, old ones are removed. LlamaIndex supports incremental ingestion using a document store to track what has already been indexed.

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.storage.docstore.postgres import PostgresDocumentStore

docstore = PostgresDocumentStore.from_uri(
    "postgresql://user:password@localhost:5432/mydb"
)

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=50),
    ],
    vector_store=vector_store,
    docstore=docstore,
)

new_documents = SimpleDirectoryReader("./new_docs").load_data()

nodes = pipeline.run(documents=new_documents, show_progress=True)
print(f"Ingested {len(nodes)} new nodes")

The pipeline checks whether each document has already been indexed (by hashing its content and metadata). If a document is unchanged, it's skipped. If it's new or modified, the pipeline re-chunks, re-embeds, and re-inserts it. This keeps ingestion costs proportional to what has actually changed, not the size of your entire corpus.

Reranking Results

Semantic similarity in vector space is not the same as relevance to the user's actual intent. A reranker is a second-pass model that scores the retrieved nodes more accurately, at higher computational cost. You run it after the initial vector search, on a small candidate pool.

LlamaIndex makes this straightforward with postprocessors:

from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",
    top_n=3
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)

This retrieves the top 10 by vector similarity, then passes all 10 through the cross-encoder reranker, which keeps only the top 3. The retrieval step runs in pgvector; the reranking step runs locally in the Python process. Together, they produce higher-quality context for the LLM than either step alone.

For cross-encoder models, cross-encoder/ms-marco-MiniLM-L-2-v2 is fast and lightweight. For better accuracy at higher cost, cross-encoder/ms-marco-MiniLM-L-6-v2 is a common upgrade.

Checking the Data in PostgreSQL

One of the advantages of using pgvector is that you can inspect your data directly:

-- Check how many nodes are stored
SELECT COUNT(*) FROM documents;

-- Inspect the metadata for a sample of nodes
SELECT id, metadata, LEFT(text, 200)
FROM documents
LIMIT 10;

-- See the embedding dimensions for a row
SELECT array_length(embedding, 1)
FROM documents
LIMIT 1;

The table schema that LlamaIndex creates looks roughly like this:

CREATE TABLE documents (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    text        TEXT,
    metadata    JSONB,
    node_id     TEXT,
    embedding   VECTOR(1536)
);

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

The HNSW index is created automatically by LlamaIndex when you instantiate the vector store with hnsw_kwargs set. If you're ingesting large volumes, you may want to create the index after bulk ingestion rather than maintaining it incrementally:

vector_store = PGVectorStore.from_params(
    host="localhost",
    port=5432,
    database="mydb",
    user="user",
    password="password",
    table_name="documents",
    embed_dim=1536,
    hnsw_kwargs={
        "hnsw_m": 16,
        "hnsw_ef_construction": 64,
        "hnsw_ef_search": 40,
        "hnsw_dist_method": "vector_cosine_ops",
    }
)

For a deep dive on HNSW tuning parameters, see our pgvector HNSW tuning guide.

Putting It All Together

A minimal end-to-end script that you can run against your own documents:

import os
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.postgres import PGVectorStore

# Configure models
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o-mini")

# Connect to pgvector
vector_store = PGVectorStore.from_params(
    host=os.getenv("PGHOST", "localhost"),
    port=int(os.getenv("PGPORT", 5432)),
    database=os.getenv("PGDATABASE", "mydb"),
    user=os.getenv("PGUSER", "user"),
    password=os.getenv("PGPASSWORD", ""),
    table_name="knowledge_base",
    embed_dim=1536,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load and index documents
documents = SimpleDirectoryReader("./docs").load_data()

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
)

# Query
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query("What are the main configuration options?")

print(response.response)
print("\nSources:")
for node in response.source_nodes:
    print(f"  {node.metadata.get('file_name')} (score: {node.score:.3f})")

Running this for the first time will ingest and index all documents in ./docs. On subsequent runs, swap from_documents for from_vector_store to skip the ingestion step and query the existing index directly.

pgvector as Your Production Vector Store

The pattern above scales well. You get:

  • Atomic ingestion: document metadata and vectors insert in the same transaction
  • SQL-native filtering: filter on any JSONB metadata key before the vector search runs
  • Standard backups: your vectors are just Postgres rows; pg_dump, point-in-time recovery, and logical replication all work unchanged
  • No sync layer: no risk of embeddings drifting out of sync with your application data

The main limitation is raw throughput at very high scale: tens of millions of vectors with thousands of queries per second. At that scale, a purpose-built vector database starts to make sense. Below it, the operational simplicity of staying on PostgreSQL is worth considerably more than the marginal throughput gain.

For a more thorough comparison, see pgvector vs. Pinecone.

If you need a managed PostgreSQL instance with pgvector already configured, Rivestack provisions production-ready Postgres with vector support, daily backups, and connection pooling out of the box. No pgvector setup, no extension management, no infrastructure to maintain.