Rivestack Team

pgvector with LangChain: Build a RAG Pipeline on PostgreSQL

pgvector
LangChain
RAG
PostgreSQL
python

LangChain has a vectorstore abstraction that lets you swap out the underlying vector database without rewriting your application logic. Swap Chroma for Pinecone, Pinecone for Weaviate, whatever. In theory, it's clean. In practice, most teams end up staying with whatever they picked first, because migration is never as simple as swapping a class name.

So the decision matters. And if you're building on PostgreSQL, the answer is almost always: use pgvector. Your embeddings live in the same database as your users, documents, and application state. No sync pipeline. No eventual consistency. Full SQL.

This guide walks through the LangChain PGVector integration from scratch, including document loading, embedding, similarity search, metadata filtering, and wiring it into a working retrieval chain.

Why pgvector Over a Dedicated Vectorstore

Before the code, let's be direct about the tradeoff.

Dedicated vectorstores like Pinecone are fast and scale to billions of vectors without you thinking about infrastructure. If you're building something where vectors are the entire product, they're a reasonable choice.

But most applications aren't like that. You have users. You have documents. You have metadata. You have access control logic. When your vectors live in Pinecone and everything else lives in PostgreSQL, you've just created a sync problem. Document gets deleted from Postgres, the embedding stays in Pinecone. You want to filter by user_id, now you need to implement that in a separate system. You want a transaction that inserts a document and its embedding atomically, you can't have one.

pgvector collapses this. Everything lives in one place. JOINs work. Transactions work. Your existing backup strategy, monitoring setup, and connection pooler all work with zero changes.

The one real downside: raw vector throughput at extreme scale (hundreds of millions of vectors, thousands of QPS) is slower than a purpose-built vector database. Below that, the difference is small enough to ignore. See our pgvector benchmarks post for concrete numbers.

Setting Up

You'll need:

  • PostgreSQL with the vector extension enabled
  • Python 3.9+
  • An OpenAI API key (or any LangChain-compatible embeddings model)

Install dependencies:

pip install langchain langchain-openai langchain-postgres psycopg

The langchain-postgres package is the current home for the PGVector integration. The older langchain-community version still exists but is being deprecated in favor of the dedicated package.

If you're on a managed PostgreSQL service, the vector extension is likely already available. On Rivestack, it's enabled by default. On a local install, run this once:

CREATE EXTENSION IF NOT EXISTS vector;

Connecting LangChain to pgvector

The PGVector class takes a connection string and an embeddings object:

from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

connection_string = "postgresql+psycopg://user:password@localhost:5432/mydb"

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
    use_jsonb=True,
)

A few things worth noting here:

collection_name maps to a logical grouping within the langchain_pg_embedding table. You can have multiple collections in one database, each with its own embeddings.

use_jsonb=True stores the document metadata as JSONB rather than a plain string, which unlocks proper metadata filtering later. Always turn this on.

The first time you create the vectorstore, LangChain creates the langchain_pg_collection and langchain_pg_embedding tables automatically. You don't need to run any migrations.

Loading and Embedding Documents

LangChain's document loaders produce Document objects with page_content and metadata. You can load from a directory of text files, a PDF, a web page, whatever your use case requires.

Here's a minimal example using plain text:

from langchain_core.documents import Document

docs = [
    Document(
        page_content="PostgreSQL supports ACID transactions and is MVCC-based.",
        metadata={"source": "postgres-intro", "category": "database"}
    ),
    Document(
        page_content="pgvector adds vector similarity search to PostgreSQL using HNSW and IVFFlat indexes.",
        metadata={"source": "pgvector-intro", "category": "extensions"}
    ),
    Document(
        page_content="LangChain provides a unified interface for building LLM-powered applications.",
        metadata={"source": "langchain-intro", "category": "framework"}
    ),
]

vectorstore.add_documents(docs)

add_documents generates an embedding for each document's page_content by calling the embeddings model, then inserts both the embedding and the metadata into Postgres. For larger document sets, this will be batched automatically.
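If you're embedding a very large corpus and want explicit control over batch boundaries — so a single failure partway through doesn't lose hours of embedding work — a generic batching helper is easy to write. This is a sketch, not part of the LangChain API, and the batch size of 100 is an arbitrary starting point:

```python
from itertools import islice

# Hypothetical helper: yield fixed-size batches from any iterable of documents.
def batched(iterable, batch_size=100):
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# Usage sketch: call add_documents once per batch instead of once overall.
# for batch in batched(split_docs, batch_size=100):
#     vectorstore.add_documents(batch)
```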

For real use cases, you'll almost always want to split long documents first:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = splitter.split_documents(raw_docs)
vectorstore.add_documents(split_docs)

Chunking matters a lot for retrieval quality. Chunks that are too large return too much context and dilute relevance. Chunks that are too small lose the surrounding context the LLM needs. A chunk_size of 800 to 1200 characters with chunk_overlap of 100 to 200 is a reasonable starting point.
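To make the overlap concrete, here's a stripped-down sketch of fixed-size chunking with overlap. The real RecursiveCharacterTextSplitter is smarter — it tries to break on paragraph and sentence boundaries first — but the sliding-window mechanics are the same. chunk_text is a hypothetical helper, not a LangChain function:

```python
# Simplified sketch of fixed-size chunking with overlap. Each step advances
# by (chunk_size - chunk_overlap), so consecutive chunks share the overlap.
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap
    return chunks

# A 2500-character text with these settings yields 3 chunks,
# starting at offsets 0, 800, and 1600.
chunks = chunk_text("x" * 2500, chunk_size=1000, chunk_overlap=200)
```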

Similarity Search

Once documents are embedded, querying is straightforward:

results = vectorstore.similarity_search(
    "How does pgvector index work?",
    k=4
)

for doc in results:
    print(doc.page_content)
    print(doc.metadata)
    print("---")

LangChain embeds the query string using the same embeddings model, then runs a cosine similarity search in Postgres using pgvector's <=> operator. The top k results are returned as Document objects.

If you want the similarity scores alongside the documents:

results_with_scores = vectorstore.similarity_search_with_score(
    "How does pgvector index work?",
    k=4
)

for doc, score in results_with_scores:
    print(f"Score: {score:.4f} - {doc.page_content[:80]}")

Lower scores mean higher similarity when using cosine distance. A score near 0 is very similar; a score near 1 is not.
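To see why, here's the metric behind pgvector's <=> operator in pure Python: cosine distance is 1 minus cosine similarity. This is for illustration only — in production the computation happens inside Postgres:

```python
import math

# Cosine distance, the metric pgvector's <=> operator computes:
# 1 - (a . b) / (|a| * |b|). Lower distance means more similar.
def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 1.0
```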

Metadata Filtering

This is where pgvector with LangChain gets interesting, and where a lot of tutorials skip ahead too fast.

If you're building a multi-tenant application, you need to scope searches to a specific user or organization. If you have a document management system, you might need to filter by date, category, or access level. The filter parameter handles this:

results = vectorstore.similarity_search(
    "database indexing",
    k=4,
    filter={"category": "database"}
)

With use_jsonb=True, this translates to a JSONB containment query in Postgres, which can be indexed. For more complex filters:

# Filter by multiple fields
results = vectorstore.similarity_search(
    "vector search performance",
    k=4,
    filter={"category": "extensions", "source": "pgvector-intro"}
)

If you need something the LangChain filter syntax doesn't cover, you can always drop down to raw SQL against the underlying langchain_pg_embedding table and write your own WHERE clause. Since it's Postgres, you always have that escape hatch.

MMR Search for Diverse Results

Maximum Marginal Relevance (MMR) search is useful when you want results that are relevant but not all saying the same thing. Standard similarity search can return five chunks from the same document that all cover the same paragraph. MMR penalizes redundancy.

results = vectorstore.max_marginal_relevance_search(
    "PostgreSQL extensions",
    k=4,
    fetch_k=20,
    lambda_mult=0.5
)

fetch_k controls how many candidates are retrieved from Postgres before MMR re-ranking. lambda_mult is the diversity trade-off: 0 means maximum diversity, 1 means maximum relevance. Start at 0.5 and tune from there.

For a RAG pipeline where you're worried about the context window getting bloated with repetitive chunks, MMR is worth the small extra latency.
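The algorithm itself is a greedy loop, which a short pure-Python sketch makes clear (this is an illustration of the technique, not LangChain's implementation): at each step, pick the candidate whose relevance to the query, minus its similarity to anything already selected, is highest.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Greedy MMR re-ranking sketch: each step picks the candidate that best
# balances relevance to the query against redundancy with prior picks.
def mmr_rerank(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine_sim(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine_sim(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

# With a low lambda_mult, the near-duplicate second vector is skipped in
# favor of the more diverse third one.
picked = mmr_rerank([1.0, 0.0], [[1.0, 0.0], [1.0, 0.01], [0.7, 0.7]],
                    k=2, lambda_mult=0.3)  # -> [0, 2]
```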

Building a Retrieval Chain

A vectorstore on its own is just retrieval. The full RAG pattern is: retrieve relevant documents, then pass them as context to a language model that generates an answer grounded in that context.

Here's a minimal working chain using LangChain's expression language (LCEL):

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context:

{context}

Question: {question}
""")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = chain.invoke("What indexes does pgvector support?")
print(answer)

This is the core RAG loop: the question goes in, the retriever pulls relevant chunks from Postgres, those chunks get formatted into the prompt, the LLM generates an answer, and the output parser returns a plain string.

The as_retriever method wraps the vectorstore in LangChain's BaseRetriever interface, which is compatible with the entire LCEL ecosystem including streaming, async execution, and tracing.

To use MMR retrieval in the chain:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5}
)

And to filter by metadata inside the retriever:

retriever = vectorstore.as_retriever(
    search_kwargs={"k": 4, "filter": {"user_id": "user_123"}}
)

This is the pattern for multi-tenant RAG: scope every retrieval to the authenticated user's documents.

HNSW Indexing for Production

By default, pgvector uses exact k-NN search, which scans the entire table. For small datasets (under 100,000 vectors) this is fast enough. For anything larger, you'll want an HNSW index.

The PGVector class doesn't create the index automatically. Run this SQL once after your initial load:

CREATE INDEX ON langchain_pg_embedding
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

Then set ef_search at query time for the accuracy vs. speed tradeoff:

SET hnsw.ef_search = 100;

Higher ef_search values are more accurate but slower. Start at 64 to 100 and tune based on your recall requirements.

If your dataset is large enough to need the index, also make sure your Postgres server has enough maintenance_work_mem to build it without spilling to disk:

SET maintenance_work_mem = '512MB';

For a deeper look at pgvector indexing and performance, see our post on getting started with pgvector and our NVMe vs cloud SSD benchmarks.

Connection Pooling

LangChain's PGVector opens a database connection per instance. In a production web application, every request handler that creates a PGVector instance will try to open a new connection. Postgres has a hard connection limit and connections are not free.

The right approach is to pass a pre-configured SQLAlchemy engine with a connection pool:

from sqlalchemy import create_engine

engine = create_engine(
    connection_string,
    pool_size=10,
    max_overflow=20,
    pool_pre_ping=True,
)

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=engine,
    use_jsonb=True,
)

pool_pre_ping=True sends a lightweight SELECT 1 before each connection use, which lets the pool recover from stale connections after a database restart or network blip.

On Rivestack, you have PgBouncer available as a connection pooler at the infrastructure level, which handles pooling outside your application and works well with serverless or high-concurrency environments. If you're running on a provider that doesn't include a pooler, running PgBouncer alongside your application is worth the setup.

Async Usage

For async web frameworks (FastAPI, Starlette), use PGVector's async interface:

from langchain_postgres import PGVector

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=async_engine,
    use_jsonb=True,
    async_mode=True,
)

results = await vectorstore.asimilarity_search("your query", k=4)

The chain also supports async invocation:

answer = await chain.ainvoke("What indexes does pgvector support?")

For high-throughput applications, running the entire pipeline async prevents the embedding and LLM calls from blocking your event loop.

Upsert and Document Management

In real applications, documents change. Called without ids, add_documents always inserts new records. To update an existing document's embedding when the source content changes, pass the ids parameter:

vectorstore.add_documents(
    updated_docs,
    ids=["doc_id_1", "doc_id_2"]
)

When you provide an ids list, PGVector will upsert: update if the id exists, insert if it doesn't. This is how you keep your vectorstore in sync with your source data without rebuilding it from scratch.
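For upserts to work, your ids have to be stable: the same source chunk must always map to the same id across re-ingestion runs. One common approach — a sketch, not part of the LangChain API — is deriving a deterministic UUID from the source identifier and chunk position:

```python
import uuid

# Derive a stable id from a source identifier plus chunk index, so
# re-embedding the same document overwrites the old row instead of
# inserting a duplicate.
def stable_doc_id(source: str, chunk_index: int) -> str:
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{chunk_index}"))

# Re-running with the same inputs always yields the same UUID string.
print(stable_doc_id("postgres-intro", 0))
```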

To delete documents:

vectorstore.delete(ids=["doc_id_1"])

Or delete an entire collection:

vectorstore.delete_collection()

Putting It Together

Here's a condensed version of the full pattern for a production RAG application:

from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from sqlalchemy import create_engine

# Configure once at startup
engine = create_engine(connection_string, pool_size=10, pool_pre_ping=True)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=engine,
    use_jsonb=True,
)

# Build the retrieval chain
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "filter": {"user_id": current_user_id}}
)

prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n\n{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

# Use per request
answer = chain.invoke(user_question)

The vectorstore and engine are initialized once at application startup. The chain is stateless and safe to share across requests.

What You Get With Managed PostgreSQL

The setup above works with any PostgreSQL 14+ instance that has pgvector installed. Where things get tedious is operations: making sure the extension is available, keeping the database patched, handling connection limits under load, PITR backups, and monitoring index health.

On Rivestack, pgvector is pre-installed on every database, PgBouncer is included, and backups are handled automatically. If you want to get from "I have a LangChain app" to "it's running in production on a properly configured database" without a week of infrastructure work, that's the fastest path.

You can spin up a free database at rivestack.io and have it connected to LangChain in a few minutes.

Summary

pgvector with LangChain is a mature, production-ready combination. The langchain-postgres package handles embeddings storage, similarity search, MMR, and metadata filtering. The underlying database is PostgreSQL, which means you get transactions, JOINs, a real query planner, and a thirty-year ecosystem of operational tooling.

The main things to get right in production: use use_jsonb=True for metadata filtering, create an HNSW index before your dataset grows large, configure a connection pool at the engine level, and scope retrievers to the authenticated user for multi-tenant workloads.

If you're also using psycopg3 or SQLAlchemy directly alongside LangChain, see our guide on using pgvector with Python for the lower-level patterns.