
PostgreSQL High Availability: A Practical Guide for Production
Your application is live. Customers are using it. The database goes down.
How long before traffic routes around the failure? Ten seconds? Five minutes? Never, because you're paged at 2 AM and have to manually promote a replica while the on-call engineer Slacks you asking if the database is "doing a thing"?
PostgreSQL high availability is one of those topics that looks straightforward in blog posts and turns out to be deeply humbling when you actually implement it in production. This guide covers how PostgreSQL HA actually works, the main tools people use, what typically goes wrong, and when the complexity of DIY HA stops being worth it.
What "High Availability" Means for PostgreSQL
High availability means your database keeps serving requests even when individual components fail. For PostgreSQL, that typically requires three things working together:
- Data replication — at least one copy of your data exists on a server other than the primary
- Failure detection — something notices when the primary is unreachable
- Automatic failover — the replica promotes itself to primary without a human in the loop
PostgreSQL ships with excellent replication primitives but no built-in automatic failover. The replication part is solid and well-understood. The failover part is where teams get into trouble.
Streaming Replication: The Foundation
PostgreSQL replication is based on Write-Ahead Log (WAL) shipping. Every write to the primary is first written to the WAL. Replicas connect to the primary and stream that WAL in near-real-time, replaying it to stay current.
Setting up a basic standby looks like this in postgresql.conf on the primary:
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB
And in pg_hba.conf, you allow the replica to connect for replication:
# pg_hba.conf on the primary
host replication replicator 10.0.0.2/32 scram-sha-256
The replica connects with a primary_conninfo in its configuration and starts streaming WAL from the primary. Once streaming, the replica is typically only milliseconds behind.
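One common way to create the standby and generate that configuration is pg_basebackup with the -R flag, which clones the primary and writes primary_conninfo and standby.signal for you. A minimal sketch, with the primary's address and paths illustrative:
# On the replica: clone the primary and write recovery settings (-R)
pg_basebackup -h 10.0.0.1 -U replicator \
  -D /var/lib/postgresql/data \
  --wal-method=stream -R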
Synchronous vs Asynchronous Replication
By default, PostgreSQL replication is asynchronous: the primary commits a transaction and returns success to the client before confirming the replica received the data. If the primary dies at exactly the wrong moment, you can lose the last few transactions.
Synchronous replication waits for at least one replica to confirm it has received and written the WAL before reporting the commit as successful:
# postgresql.conf on the primary
synchronous_standby_names = 'replica1'
synchronous_commit = on
This gives you zero-RPO (recovery point objective) — no committed data is ever lost. The tradeoff is latency: every write waits for a round trip to the replica. On a local network this is typically 1–5ms. Across availability zones it can be 10–30ms depending on the cloud provider.
Most production setups use synchronous replication for the hot standby and asynchronous replication for additional read replicas or DR standbys that are geographically distant.
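That mix is expressed through synchronous_standby_names. Quorum syntax lets any one of a named set of standbys acknowledge each commit, and standbys not listed stay asynchronous. A sketch with illustrative standby names:
# postgresql.conf on the primary
synchronous_standby_names = 'ANY 1 (replica1, replica2)'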
The Problem: PostgreSQL Doesn't Fail Over Itself
With streaming replication running, you have your data in two places. But if the primary goes down, PostgreSQL doesn't automatically promote the replica. You have two choices:
- Manual failover — someone runs pg_ctl promote or SELECT pg_promote() on the replica (see the sketch below). Fast if someone is awake, catastrophic if not.
- Automated failover via an HA tool — a separate process watches the primary and promotes the replica when it detects a failure.
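For reference, a minimal sketch of manual promotion (data directory path illustrative):
# On the replica, as the postgres OS user
pg_ctl promote -D /var/lib/postgresql/data
# Or from any SQL session on the replica (PostgreSQL 12+)
psql -c "SELECT pg_promote();"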
Almost every production PostgreSQL HA setup uses one of three tools for automated failover: Patroni, pg_auto_failover, or repmgr. They all solve the same problem; they have meaningfully different complexity and tradeoff profiles.
Patroni: The Industry Standard (and Why It's Hard)
Patroni is what most teams with serious PostgreSQL HA requirements end up using. It's battle-tested, highly configurable, and runs at scale. It's also genuinely complex to operate.
Patroni uses a distributed consensus store — either etcd, Consul, or ZooKeeper — to maintain cluster state and elect a primary. When the primary fails, the Patroni agents on the replicas race to acquire a distributed lock in etcd. The winner promotes itself to primary. The others reconfigure to follow the new primary.
Here's the high-level architecture:
┌────────────────────────────────────────┐
│ etcd cluster (3 nodes, for DCS quorum) │
└────────────────────────────────────────┘
       │              │              │
  ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
  │ Patroni │    │ Patroni │    │ Patroni │
  │ Primary │    │ Replica │    │ Replica │
  │ (PG 17) │    │ (PG 17) │    │ (PG 17) │
  └─────────┘    └─────────┘    └─────────┘
A minimal patroni.yml looks something like this:
scope: prod-cluster
namespace: /db/
name: pg-node-1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008
etcd3:
  hosts: 10.0.1.1:2379,10.0.1.2:2379,10.0.1.3:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
    synchronous_mode: true
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password:
    superuser:
      username: postgres
      password:
What You're Actually Running
Here's the part that surprises teams: Patroni requires you to also run etcd (or Consul or ZooKeeper). And etcd itself needs to be highly available, which means a minimum of 3 etcd nodes for quorum.
So a minimal production Patroni setup is:
- 3 etcd nodes (for the consensus store)
- 2–3 PostgreSQL nodes running Patroni agents
- A load balancer or HAProxy to route connections to the current primary
- Optionally, a separate VIP (virtual IP) that floats to the current primary
That's 5–6 servers before you've even started. Each has its own failure modes. etcd has its own operational considerations. HAProxy needs to be configured to do health checks against the Patroni REST API (not just the PostgreSQL port) so it actually knows which node is the current primary.
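Here's a sketch of that HAProxy configuration, assuming the three-node layout above (names and addresses illustrative). Patroni's REST API answers GET /primary with 200 only on the current leader, so HAProxy routes writes to exactly one node:
# haproxy.cfg (sketch): send traffic to whichever node Patroni says is primary
listen postgres_primary
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-node-1 10.0.0.1:5432 check port 8008
    server pg-node-2 10.0.0.2:5432 check port 8008
    server pg-node-3 10.0.0.3:5432 check port 8008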
Patroni is genuinely good software. But "run Patroni in production" is a weeks-long project, not an afternoon task.
pg_auto_failover: Simpler, More Opinionated
pg_auto_failover takes a different approach. Instead of an external consensus store, it uses a dedicated monitor node — a PostgreSQL instance that tracks the state of your cluster and makes failover decisions.
The architecture is simpler:
      ┌──────────────────────┐
      │  Monitor Node (PG)   │
      │    (tracks state)    │
      └──────────────────────┘
         │              │
   ┌─────┴───┐     ┌────┴────┐
   │ Primary │     │ Standby │
   │ (PG 17) │     │ (PG 17) │
   └─────────┘     └─────────┘
Setting up a pg_auto_failover cluster:
# On the monitor node
pg_autoctl create monitor \
--pgdata /var/lib/postgresql/monitor \
--pgport 5000 \
--hostname monitor.internal
# On the primary node
pg_autoctl create postgres \
--pgdata /var/lib/postgresql/data \
--monitor postgres://autoctl_node@monitor.internal:5000/pg_auto_failover \
--hostname primary.internal \
--pgport 5432
# On the standby node
pg_autoctl create postgres \
--pgdata /var/lib/postgresql/data \
--monitor postgres://autoctl_node@monitor.internal:5000/pg_auto_failover \
--hostname standby.internal \
--pgport 5432
Check cluster status:
pg_autoctl show state --pgdata /var/lib/postgresql/monitor
Name | Port  | Group | Node  | Current State     | Assigned State
-----+-------+-------+-------+-------------------+------------------
pg   | 5432  |     0 |     1 | primary           | primary
pg   | 5432  |     0 |     2 | secondary         | secondary
pg_auto_failover handles failover, replication slot management, and some connection routing automatically. It's easier to set up than Patroni, but it has real limitations: the monitor itself is a single point of failure (you need to make it durable separately), and it's less flexible for complex topologies.
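You can also trigger a controlled failover from the monitor to watch the state machine work; a sketch, path illustrative:
# On the monitor node: orchestrate a failover, then watch the states settle
pg_autoctl perform failover --pgdata /var/lib/postgresql/monitor
pg_autoctl show state --pgdata /var/lib/postgresql/monitor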
For teams that want something that works without running etcd, pg_auto_failover is a solid option. For teams that need fine-grained control over failover behavior, replica priorities, or complex multi-datacenter topologies, Patroni is the better choice.
What Goes Wrong in Production
Both tools are mature and reasonably reliable. The failure modes that actually bite teams in production are usually one of these:
Split-Brain
The most dangerous failure mode. The network between the primary and replica is interrupted, but both are still reachable externally. The primary keeps accepting writes. The replica, thinking the primary is dead, promotes itself and starts accepting writes. Now you have two primaries, both accepting writes, with diverged data.
Patroni prevents this with the distributed lock in etcd — a node can only act as primary if it holds the lock. If it can't reach etcd to renew the lock, it demotes itself. This is why the etcd cluster needs its own quorum.
pg_auto_failover handles split-brain by having the monitor make all promotion decisions. A standby won't promote unless the monitor tells it to.
Both approaches work, but they rely on the consensus mechanism being available. If your etcd cluster goes down, Patroni won't fail over (by design). If the pg_auto_failover monitor is unreachable, failover is blocked.
Replication Lag at Failover Time
When synchronous replication is enabled, the replica that promotes will have all committed data. When it's asynchronous (or when a lagged replica promotes), you lose the transactions that hadn't been replicated yet.
Patroni's maximum_lag_on_failover setting controls this: a replica won't be eligible for promotion if it's more than N bytes behind the primary. This protects against promoting a stale replica but can also block failover if all replicas are lagged (e.g., after a network partition that affected replication traffic).
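You can watch that lag directly on the primary. A minimal query (replay_lsn is available in PostgreSQL 10+):
-- On the primary: bytes of WAL each standby still has to replay
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;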
Application Not Reconnecting
This one is less about the HA tools and more about application behavior. After failover, the old primary is gone. If your application holds a connection open to the old primary's IP address, it will keep getting errors until it reconnects to the new primary.
The typical solution is a connection string that points to a VIP or a load balancer:
postgresql://ha-proxy.internal:5432/mydb?target_session_attrs=read-write
The target_session_attrs=read-write parameter tells libpq to reject connections to read-only servers (i.e., replicas), which helps clients find the current primary after failover.
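If you'd rather not run a proxy, libpq also accepts a list of hosts and tries each until it finds a writable one (hostnames illustrative):
postgresql://pg-node-1.internal,pg-node-2.internal,pg-node-3.internal:5432/mydb?target_session_attrs=read-write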
Testing Your Failover
If you haven't actually tested your failover, you don't have HA. You have a plan that might work.
Test it like this:
# While watching application logs and response times:
# Take down the primary. In a Patroni cluster, Patroni manages the postgres
# process itself, so stop the Patroni service rather than a postgresql unit:
sudo systemctl stop patroni
# Watch the Patroni logs on the replica
sudo journalctl -u patroni -f
# Check the cluster state
patronictl -c /etc/patroni/config.yml list
# Verify the application is still serving traffic
# (ideally you have automated monitoring that catches this)
The first time you do this in a staging environment, you will find at least one thing that doesn't work the way you expected.
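Once the hard-kill test passes, exercise the controlled path too. Patroni supports a planned switchover, which is safe enough to run on a schedule (candidate name illustrative):
# Planned switchover to a chosen replica; the old primary rejoins as a replica
patronictl -c /etc/patroni/config.yml switchover --candidate pg-node-2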
The Failover Time Question
"How fast does failover happen?" is the most common question, and the honest answer is: it depends on your configuration, but typically 10–30 seconds with Patroni in a well-tuned setup.
The timeline looks like this:
- Primary goes unreachable (0s)
- Patroni TTL expires — Patroni considers the primary dead (default: 30s)
- Replica acquires the DCS lock and promotes (1–2s)
- HAProxy/VIP detects the new primary (2–5s)
- Application connections reconnect to the new primary (depends on connection pool settings)
You can reduce the TTL to speed up failure detection, but aggressive settings increase false positives — a brief network blip triggers an unnecessary failover, which can cause its own disruption. Most teams settle on a TTL of 15–30 seconds.
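In Patroni these knobs live in the DCS configuration and can be changed at runtime with patronictl edit-config. A sketch of a more aggressive profile (values illustrative; Patroni expects ttl to be at least loop_wait + 2 * retry_timeout, so the three move together):
# Applied via: patronictl -c /etc/patroni/config.yml edit-config
ttl: 15
loop_wait: 5
retry_timeout: 5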
When Managed PostgreSQL High Availability Makes Sense
DIY PostgreSQL HA is absolutely viable. Large companies run Patroni clusters at massive scale. But the operational overhead is real: you're running 5+ nodes, maintaining etcd or a monitor, testing failovers regularly, and debugging whatever edge case manifests itself at the worst possible time.
The inflection point is usually headcount. If you have a team with dedicated database reliability engineering experience, DIY HA is worth the control it gives you. If you're a startup or a team where the database is infrastructure rather than a product, the math often favors managed.
As we've broken down before, the real cost of self-hosted PostgreSQL with HA includes the infrastructure, the tooling, and the engineering time — and the engineering time dominates.
Rivestack's HA clusters handle streaming replication, automatic failover, and connection routing for you. Failover happens in seconds. Your connection string doesn't change. No etcd cluster to maintain, no HAProxy configuration to debug, no 2 AM pages because the DCS quorum lost a member.
HA clusters are available starting at $99/month, which includes NVMe storage, automated backups, point-in-time recovery, and monitoring. If you're running a production workload that can't afford multi-minute outages, that's a reasonable trade.
What to Do If You're Starting From Scratch
If you're setting up PostgreSQL HA for the first time, here's the practical path:
For a managed setup: Try Rivestack — spin up an HA cluster, verify failover works (you can test it from the dashboard), and move on to building your application.
For a self-managed setup:
- Start with pg_auto_failover if you have 2–3 nodes and want something operational quickly.
- Move to Patroni if you need more control, multi-datacenter support, or are operating at scale.
- Use HAProxy with health checks against the Patroni REST API (port 8008) for connection routing — don't rely on DNS TTL for failover.
- Enable synchronous replication if your application can't tolerate any data loss.
- Set up automated failover testing in your staging environment before relying on it in production.
The PostgreSQL documentation on high availability is thorough and worth reading fully before you touch a production cluster. The Patroni documentation is also excellent.
The Bottom Line
PostgreSQL has excellent building blocks for high availability — streaming replication is fast, reliable, and well-understood. The gap is in the automated failover layer, and filling that gap with Patroni or pg_auto_failover works, but it adds meaningful operational complexity.
If you're already running multiple PostgreSQL clusters with dedicated infrastructure tooling, that complexity is probably justified. If you're a team of three trying to ship a product, it probably isn't.
Either way: whatever your HA setup, test your failover before you need it. The worst time to discover your replica is 10 minutes behind is during an actual outage.
If you want to skip the infrastructure work and get directly to building, Rivestack handles the HA layer for you — including pgvector if you're building AI applications. See the getting started guide if that's your stack.