Open-source CLI to measure HNSW latency (p50/p95/p99), throughput under concurrency, and recall@k against your own PostgreSQL. Single Go binary, interactive wizard, self-contained HTML report. Your vectors never leave your machine.
Vector workloads only. The numbers belong to your database, not the benchmark client — timing is captured inside the worker goroutine, never in the UI thread.
Single-threaded round-trip timing captured inside the worker goroutine so the animated UI never inflates the numbers. The TUI and --json modes report identical metrics.
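To make the p50/p95/p99 figures concrete, here is a minimal Python sketch of timing each call inside the worker loop and summarizing the samples afterwards. The names `time_queries` and `percentile` are hypothetical, and the nearest-rank percentile method is an assumption; this page does not say which method the tool uses.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def time_queries(run_query, n):
    """Call run_query n times back-to-back; return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()  # clock starts and stops around the call only
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


if __name__ == "__main__":
    # Stand-in for a real database round trip (assumption for this demo).
    samples = time_queries(lambda: time.sleep(0.001), 50)
    for p in (50, 95, 99):
        print(f"p{p}: {percentile(samples, p):.2f} ms")
```

Because the clock wraps only the call itself, anything that renders progress afterwards cannot leak into the samples.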
Ramps through your --concurrency levels via pgxpool. Reports sustained QPS per level and the saturation point — the level after which gain drops below 10%.
Computes exact KNN inside a transaction with index scans disabled (verified with EXPLAIN), then compares to the ANN result. If the planner refuses to seq-scan, we skip recall instead of misreporting it.
Pass a comma list (--ef-search 40,100,200) and you get a clean (recall, p95, QPS) tradeoff table. The speed/quality knob, visualized for your workload, not someone else's.
One static binary. No runtime, no dependencies.
Detects your OS / arch and drops the binary into /usr/local/bin. Set PGVB_INSTALL_DIR to override.
curl -fsSL https://rivestack.io/install.sh | sh
pgvector-bench
# interactive wizard: paste your URL, pick existing
# table or synthetic, hit enter to run the benchmark
No flags to memorize, no shell-quoting traps. Paste your URL, pick existing table or synthetic, hit enter. The equivalent run command is printed at the end so you can save and rerun.
pgvector-bench run \
  --url 'postgres://user:pass@host:5432/db?sslmode=require' \
  --table documents --column embedding --metric cosine
pgvector-bench run \
  --url '...' \
  --synthetic --rows 1000000 --dim 1536
Here's every byte that could leave your machine. The binary is open source: you can grep net/http and find one file.
grep -rn 'net/http' .
Returns zero hits in the CLI source. The only network code path is the pgx connection to your --url. Errors are scrubbed of URLs, hostnames, and IPs before they ever reach stderr.
We tried hard to report what your database can do, not what the benchmark client can do.
Each worker holds one Postgres connection for the duration of the level and submits queries back-to-back. Reported QPS is queries-completed / wall-clock. Saturation is the level after which gain over the prior drops below 10%.
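The QPS and saturation rules above can be sketched in a few lines of Python. This is one reading of "the level after which gain over the prior drops below 10%": the last level whose successor improves QPS by less than 10%. The function names and the sample numbers are hypothetical.

```python
def qps(queries_completed, wall_clock_seconds):
    """Sustained throughput for one concurrency level."""
    return queries_completed / wall_clock_seconds


def saturation_level(results, min_gain=0.10):
    """results: list of (concurrency, qps) in ramp order.

    Returns the first level whose next level gains less than min_gain
    (relative), or None if throughput is still scaling at the last level.
    """
    for (level, current), (_, following) in zip(results, results[1:]):
        if current <= 0:
            continue  # cannot compute a relative gain from zero throughput
        gain = (following - current) / current
        if gain < min_gain:
            return level
    return None


if __name__ == "__main__":
    # Made-up ramp: 1 -> 4 workers scales well, 4 -> 16 adds under 10%.
    ramp = [(1, 1000.0), (4, 3600.0), (16, 3800.0)]
    print(saturation_level(ramp))
```

A `None` result means the ramp never flattened, which suggests adding higher --concurrency levels.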
The animated terminal UI and --json mode print the same numbers. The UI's only job is to make them nice to look at.
We open a transaction with index scans disabled and re-run the same ORDER BY col <=> $1 LIMIT k query. We verify with EXPLAIN on the first query that the planner is actually seq-scanning. If it still picks the index, we skip recall rather than report a misleading number.
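As a rough sketch of that flow, the Python below builds the statement sequence for one exact-KNN pass and classifies the EXPLAIN output. It is not the tool's code: the `id` column, the function names, and the plan-text matching are assumptions, and the table and column names are interpolated as trusted identifiers.

```python
def exact_knn_statements(table, column, k):
    """Statements for one exact-KNN pass; $1 is the query vector parameter."""
    query = f"SELECT id FROM {table} ORDER BY {column} <=> $1 LIMIT {int(k)}"
    return [
        "BEGIN",
        "SET LOCAL enable_indexscan = off",  # applies to this transaction only
        f"EXPLAIN {query}",                  # check the plan before trusting results
        query,
        "ROLLBACK",
    ]


def plan_is_exact(plan_lines):
    """True if the plan text shows a sequential scan and no index-based scan node."""
    text = "\n".join(plan_lines)
    index_nodes = ("Index Scan", "Index Only Scan", "Bitmap Index Scan")
    uses_index = any(node in text for node in index_nodes)
    return "Seq Scan" in text and not uses_index


if __name__ == "__main__":
    for stmt in exact_knn_statements("documents", "embedding", 10):
        print(stmt)
```

If `plan_is_exact` returns False for the first query, the safe move is the one described above: skip recall rather than compare the index against itself.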
pgvector ships two ANN index types. HNSW builds a hierarchical graph; its build tunables are m (neighbors per node) and ef_construction (search width at build time); its only query tunable is ef_search. IVFFlat clusters the dataset; its build tunable is lists and its query tunable is probes. pgvector-bench detects whichever index already exists on your column and reports against it — recall, p95, and QPS at each --ef-search value you pass — so you can plot the speed-vs-quality frontier for your data, not someone else's.
In a tuned HNSW setup the only knob you turn at query time is hnsw.ef_search. Drop it to 40 and queries finish faster but miss some true neighbors; push it to 200 and recall climbs at the cost of p95. There is no single right value — it depends on your dataset, your m, your embedding model, and what your application can tolerate. pgvector-bench prints the full table so you can pick the point on the curve that fits your SLO.
We don't detect NVMe vs SSD over a remote connection — it isn't reliably knowable. We don't subtract network RTT; if your DB is across the Atlantic, your p95 reflects that. Numbers projected against Rivestack reference benchmarks are clearly labeled and only shown when the workload shape is within tolerance.
Recall, ef_search, HNSW vs IVFFlat, realistic p95 targets — the questions that actually matter when you're tuning pgvector.
Recall@k measures how many of the true k nearest neighbors a vector index returns. If exact KNN says the top-10 neighbors are A,B,C,...,J and pgvector's HNSW index returns A,B,C,...,I plus one wrong row, that's recall@10 = 0.9. pgvector-bench computes recall by running the same ORDER BY ... LIMIT k query twice — once with the index, once with sequential scan (enable_indexscan = off) — and comparing the result sets.
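The comparison step is small enough to show in full. This Python sketch reproduces the A..J example above; `recall_at_k` is a hypothetical name, and dividing by the number of exact neighbors actually returned (rather than by k) is an assumption that only matters when the table has fewer than k rows.

```python
def recall_at_k(exact_ids, ann_ids, k):
    """Fraction of the true top-k neighbors that the ANN result also contains."""
    exact = set(exact_ids[:k])
    if not exact:
        raise ValueError("exact result set is empty")
    hits = len(exact & set(ann_ids[:k]))
    return hits / len(exact)


if __name__ == "__main__":
    exact = list("ABCDEFGHIJ")       # exact KNN: A..J
    ann = list("ABCDEFGHI") + ["Z"]  # index result: A..I plus one wrong row
    print(recall_at_k(exact, ann, 10))  # → 0.9
```

Order within the top k is ignored here; only set membership counts.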
In pgvector HNSW, ef_search controls how many candidate neighbors the graph traversal explores. Lower values are faster but miss some true neighbors; higher values are slower but recall climbs. A typical sweep on a 1M-vector dataset shows ef_search=40 → ~0.92 recall at ~3ms p95, ef_search=100 → ~0.97 recall at ~6ms p95, ef_search=200 → ~0.99 recall at ~10ms p95. pgvector-bench will sweep arbitrary values with --ef-search 40,100,200 and print the tradeoff table for your own workload.
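Once you have the tradeoff table, choosing a value is a simple filter. The Python sketch below picks the smallest ef_search that meets a recall floor and an optional p95 budget, using the illustrative sweep figures quoted above rather than measured data. The function name and the row layout are hypothetical.

```python
def pick_ef_search(rows, min_recall, max_p95_ms=None):
    """rows: iterable of (ef_search, recall, p95_ms).

    Returns the smallest ef_search that satisfies both constraints,
    or None if no row qualifies.
    """
    candidates = [
        ef for ef, recall, p95 in rows
        if recall >= min_recall and (max_p95_ms is None or p95 <= max_p95_ms)
    ]
    return min(candidates) if candidates else None


if __name__ == "__main__":
    # Illustrative figures from the text above, not a measurement.
    sweep = [(40, 0.92, 3.0), (100, 0.97, 6.0), (200, 0.99, 10.0)]
    print(pick_ef_search(sweep, min_recall=0.95))  # → 100
```

A `None` result means no swept value fits the SLO, so either the recall floor, the latency budget, or the index build parameters need to change.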
Yes. pgvector-bench detects whichever index type already exists on the target column (HNSW or IVFFlat) and reports latency / throughput / recall against it. Create the same data with both index types in separate tables and run the tool twice to compare. For HNSW the relevant tunables are m and ef_construction (build) plus ef_search (query). For IVFFlat the build tunable is lists and the query tunable is probes.
On well-tuned PostgreSQL with HNSW on local NVMe and a typical embedding model (768–1536 dims, cosine), p95 latency for 10-NN queries is usually in the 2–8 ms range at ef_search=100. Network-attached SSDs (AWS gp3, GCP pd-balanced) typically add 5–20 ms because HNSW graph traversal is pointer-chasing: every page that misses the cache costs a network round-trip to storage. If you see p95 over 50 ms, the bottleneck is almost always storage, or a shared_buffers setting too small to hold the index.
A default run is ~2–4 minutes: ~5 seconds for connect + introspection, ~10 seconds for warmup + latency (1000 queries single-threaded), ~24 seconds for the throughput ramp (three 8-second levels), ~10–30 seconds for the recall sample depending on dataset size. With --synthetic at 100k rows, add ~30 seconds for index build. Larger datasets lengthen the recall pass, and an --ef-search sweep with N values multiplies recall time by N (one pass per ef_search value).
Yes — any PostgreSQL with the vector extension installed. Tested against Supabase Pro, Neon Launch, AWS RDS PostgreSQL with pgvector, GCP Cloud SQL with pgvector, and self-hosted PostgreSQL 14/15/16/17. The tool only opens a Postgres connection to the URL you pass; it does not need IAM credentials, an SDK, or a control-plane API. If your provider exposes a Postgres connection string and the vector extension is installed, it works.
MIT licensed. Two minutes to your first benchmark. Star us on GitHub if it saves you a Slack thread.