A practical guide to building a multi-node PostgreSQL HA cluster with streaming replication, Patroni, and etcd - covering architecture, failover mechanics, and tool selection.
PostgreSQL does not ship with built-in multi-node clustering. A single instance handles reads and writes, and if that instance goes down, your application goes down with it. Building a production-grade high availability (HA) setup means assembling several components yourself: streaming replication for data redundancy, a consensus store to prevent split-brain, and a cluster manager to automate failover.
This post walks through how these pieces fit together, compares the main HA tooling options, and covers the failure scenarios you need to plan for.
How PostgreSQL Streaming Replication Works
PostgreSQL high availability starts with streaming replication. The primary server writes every change to the Write-Ahead Log (WAL). Standby nodes connect to the primary and continuously receive a stream of WAL records, replaying them to maintain a near-identical copy of the database.
Two replication modes exist, and the choice between them defines your RPO (Recovery Point Objective):
| | Asynchronous Replication | Synchronous Replication |
|---|---|---|
| How it works | Primary doesn't wait for standby acknowledgment | Primary waits for at least one standby to confirm WAL write |
| RPO | Seconds of potential data loss | Zero data loss (RPO = 0) |
| Write latency | No additional latency | Added round-trip latency to standby |
| Availability risk | Standby failure doesn't affect primary | Standby failure can stall writes unless fallback is configured |
| Best for | Read-heavy workloads tolerating brief data loss | Financial systems, order processing, anything requiring zero-loss guarantees |
Asynchronous replication is the default. Switching to synchronous mode means configuring synchronous_standby_names in postgresql.conf (synchronous_commit = on is already the default; just verify it has not been relaxed to off or local). Patroni can manage this dynamically.
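A minimal sketch of the relevant postgresql.conf settings, assuming a quorum-style setup with hypothetical standby names node2 and node3:

```ini
# postgresql.conf -- synchronous replication sketch (standby names are illustrative)
synchronous_commit = on                            # the default; listed for clarity
synchronous_standby_names = 'ANY 1 (node2, node3)' # any one of the two must ack each commit
```

The ANY quorum form (PostgreSQL 10+) means a commit waits for acknowledgment from any one of the listed standbys, so a single standby outage does not stall writes.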
One detail that catches people off guard: replication slots. Without a replication slot, the primary can recycle WAL segments before a slow standby has consumed them, forcing a full base backup to resync. Replication slots prevent this by tracking each standby's replay position. The trade-off is that a disconnected standby causes WAL to accumulate on the primary indefinitely, which can fill the disk. Patroni handles slot lifecycle automatically, but you should still monitor WAL retention size.
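One way to watch WAL retention per slot is a query against pg_replication_slots on the primary; this is a monitoring sketch, not the only approach:

```sql
-- Run on the primary: WAL retained on behalf of each replication slot.
-- An inactive slot with ever-growing retained_wal is the disk-filling scenario.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

On PostgreSQL 13 and later, max_slot_wal_keep_size can cap how much WAL a slot may retain, at the cost of invalidating a slot that falls too far behind.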
HA Cluster Architecture: The Moving Parts
A PostgreSQL HA cluster that actually works in production has three layers:
Database nodes - One primary and one or more standby instances running PostgreSQL with streaming replication. A minimum of two database nodes is required; three nodes provide better fault tolerance and allow maintenance without losing redundancy.
Consensus store - A distributed key-value store like etcd, ZooKeeper, or Consul. This is what prevents split-brain. The cluster manager acquires a leader lock in the consensus store, and only the lock holder can act as primary. If the primary fails to renew its lock, another node can take over. Run at least three consensus store nodes on separate infrastructure.
Cluster manager - Software like Patroni that runs on each database node, monitors health, manages replication configuration, and orchestrates failover. The cluster manager is the brain of the operation.
On top of these, you need a connection routing layer. Applications should not connect directly to a specific database node because the primary can change at any time. HAProxy, PgBouncer, or a virtual IP (VIP) sits between the application and the database, routing writes to whichever node currently holds the leader lock. HAProxy with health checks against Patroni's REST API is the most common pattern - Patroni exposes endpoints that return HTTP 200 only for the current primary, making routing straightforward.
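A sketch of the HAProxy side of this pattern, assuming Patroni's REST API on its default port 8008 and illustrative node addresses; Patroni's /primary endpoint returns HTTP 200 only on the current leader:

```
# haproxy.cfg sketch -- route writes to whichever node Patroni reports as primary.
# Addresses and node names are illustrative.
listen postgres_primary
    bind *:5432
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.0.1:5432 check port 8008
    server node2 10.0.0.2:5432 check port 8008
    server node3 10.0.0.3:5432 check port 8008
```

A second listener checking a replica endpoint can route read-only traffic to standbys the same way.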
A minimal production setup looks like this: three PostgreSQL nodes (one primary, two standbys), three etcd nodes, and two HAProxy instances for connection routing. That is eight processes across a minimum of three to five servers.
Patroni, repmgr, and pg_auto_failover Compared
Three open-source tools dominate PostgreSQL HA management. They solve the same core problem - automated failover - but differ substantially in architecture and operational complexity.
| | Patroni | repmgr | pg_auto_failover |
|---|---|---|---|
| Architecture | Agent on each node + external DCS (etcd/ZooKeeper/Consul) | Daemon on each node, no external DCS required | Monitor node + agents on data nodes |
| Consensus mechanism | Delegated to etcd/ZooKeeper/Consul | Relies on repmgrd voting (less robust) | Single monitor makes all decisions |
| Failover automation | Fully automatic with leader election | Automatic via repmgrd, needs careful config | Fully automatic via state machine |
| Kubernetes support | Native support, widely used | Limited | Limited |
| Citus integration | Yes, since Patroni 3.0 | No | Yes (originally built by Citus team) |
| Complexity | High (requires managing DCS cluster) | Low to medium | Low |
| Community adoption | Largest; used by GitLab, Zalando, many others | Established but declining | Smaller but growing |
Patroni is the most widely deployed option. It delegates consensus to a battle-tested distributed store, which makes its split-brain prevention robust. The cost is operational complexity: you need to deploy and maintain the etcd (or equivalent) cluster alongside your database nodes. For teams already running Kubernetes, Patroni integrates naturally - operators like Zalando's postgres-operator and Crunchy Data's PGO build on it.
repmgr is simpler to set up. It doesn't require an external consensus store, which means fewer moving parts. The downside: its failover voting mechanism is less rigorous than true distributed consensus, making it more susceptible to edge cases in network partition scenarios. For small clusters in a single datacenter where simplicity matters most, it works fine.
pg_auto_failover takes a different approach with a dedicated monitor node that acts as the single source of truth. It was originally built by the Citus team at Microsoft. The state machine design is elegant and correct, but the single monitor is itself a potential single point of failure (though it can be made redundant).
For most production deployments, Patroni with etcd is the standard choice. If you are running a small cluster and want minimal operational overhead, pg_auto_failover is worth evaluating.
Failover, Switchover, and Recovery
Two distinct operations exist in any HA cluster, and conflating them causes confusion:
- Switchover is a planned, graceful promotion of a standby to primary. The old primary is demoted cleanly. Zero data loss. Used during maintenance windows.
- Failover is an unplanned reaction to a primary failure. The cluster manager detects the outage, selects the best standby, and promotes it.
How Patroni Handles Automatic Failover
When the primary node fails, Patroni's failover sequence works like this:
- The primary stops renewing its leader lock in etcd. The lock has a TTL (default: 30 seconds).
- After the TTL expires, Patroni on the standby nodes detects the leader lock is available.
- Patroni selects the standby with the least replication lag.
- The selected standby acquires the leader lock and promotes itself using pg_ctl promote.
- Remaining standbys reconfigure to replicate from the new primary.
- HAProxy health checks detect the topology change and route writes to the new primary.
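The leader-lock mechanics above can be sketched as a toy simulation. This is not Patroni or etcd code, just an illustration of how a TTL-based lock plus least-lag selection produces a safe, automatic promotion:

```python
class LeaderLock:
    """Toy model of a DCS leader lock with a TTL (not real etcd/Patroni code)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        # The lock is free if unheld or its TTL lapsed without a renewal.
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def renew(self, node, now):
        # Only the current holder can renew, and only before expiry.
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


def elect(lock, standbys_by_lag, now):
    # Promote the standby with the least replication lag that can take the lock.
    for node, _lag in sorted(standbys_by_lag.items(), key=lambda kv: kv[1]):
        if lock.acquire(node, now):
            return node
    return None


lock = LeaderLock(ttl=30)
lock.acquire("node1", now=0)     # node1 is primary
lock.renew("node1", now=10)      # normal operation: renew well before the TTL
# node1 crashes at t=10; its lock expires at t=40 (10 + 30)
new_leader = elect(lock, {"node2": 0.5, "node3": 2.0}, now=45)
print(new_leader)                # node2: least lag, lock expired, so it wins
```

The real system layers health checks and fencing on top of this, but the core safety argument is the same: no node can act as primary without holding an unexpired lock.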
The total failover time depends on your configuration. With default settings, expect 30-60 seconds. Tuning ttl and loop_wait can bring this under 20 seconds, though aggressive settings increase the risk of false failovers. Amazon RDS Multi-AZ failover takes 60-120 seconds by comparison, so a well-tuned Patroni cluster is competitive with managed services.
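A sketch of the relevant Patroni dynamic configuration (applied via patronictl edit-config); the values are illustrative and deliberately on the aggressive side, and Patroni requires that loop_wait + 2 * retry_timeout not exceed ttl:

```yaml
# Patroni dynamic config sketch -- faster failover detection,
# at the cost of higher false-positive risk on flaky networks.
ttl: 20
loop_wait: 5
retry_timeout: 5
```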
Reintegrating the Old Primary with pg_rewind
After failover, the old primary cannot simply rejoin as a standby - its WAL history has diverged. You have two options: take a full base backup (slow for large databases) or use pg_rewind.
pg_rewind compares the data directories of the old and new primaries, identifies only the blocks that changed after the timeline fork, and copies just those blocks. On a 500 GB database where the divergence is a few megabytes, pg_rewind finishes in seconds rather than the hours a full base backup would take.
To use pg_rewind, you need either wal_log_hints = on or data checksums enabled on the cluster. Patroni configures this automatically when use_pg_rewind is set to true in its configuration (which it is by default).
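For reference, a manual invocation looks roughly like this; Patroni runs the equivalent automatically, and the data directory path and connection string here are illustrative:

```shell
# Run on the failed old primary, with PostgreSQL stopped on that node.
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server="host=new-primary port=5432 user=postgres dbname=postgres"
```

After a successful rewind, the node restarts as a standby following the new primary's timeline.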
Common Pitfalls
Failing to test failover. Schedule regular failover drills. A Patroni switchover (patronictl switchover) is non-destructive and validates the full chain: promotion, replication reconfiguration, and connection routing.
Ignoring replication lag monitoring. During normal operation, async replication lag should stay under 1 second. If it regularly exceeds this, a failover will lose more data than you expect. Monitor pg_stat_replication.replay_lag on the primary.
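A starting point for that monitoring, run on the primary (the lag columns are available from PostgreSQL 10 onward):

```sql
-- Per-standby replication health: alert when replay_lag exceeds your threshold.
SELECT application_name,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```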
Running etcd on the same nodes as PostgreSQL. This works for dev/test but is risky in production. If a server has a disk failure, you lose both a database node and a consensus voter simultaneously. Dedicated etcd nodes (or at least dedicated disks) reduce correlated failures.
Synchronous replication without a fallback. If you use synchronous replication with a single standby and that standby goes down, writes stall on the primary. Set synchronous_standby_names to use ANY 1 (node2, node3) with multiple standbys, or configure Patroni's synchronous_mode_strict: false to allow fallback to async when no sync standby is available.
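In Patroni terms, the fallback behavior is a sketch like the following in the dynamic configuration; with strict mode off (the default), the cluster degrades to asynchronous replication instead of stalling writes when no synchronous standby is available:

```yaml
# Patroni dynamic config sketch -- synchronous replication with async fallback.
synchronous_mode: true
synchronous_mode_strict: false
```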
Key Takeaways
- PostgreSQL HA requires assembling multiple components: streaming replication, a consensus store (etcd), a cluster manager (Patroni), and a connection proxy (HAProxy).
- Choose between synchronous replication (zero data loss, added write latency) and asynchronous replication (potential data loss, no latency penalty) based on your RPO requirements.
- Patroni with etcd is the industry standard for production PostgreSQL HA. repmgr is simpler but less robust; pg_auto_failover is a solid lightweight alternative.
- Typical failover time with Patroni is 20-60 seconds depending on configuration, competitive with managed cloud offerings.
- Use pg_rewind to reintegrate failed primaries quickly without full base backups.
- Test failover regularly, monitor replication lag, and keep etcd on separate infrastructure from your database nodes.