A practical guide to building a multi-node PostgreSQL HA cluster with streaming replication, Patroni, and etcd - covering architecture, failover mechanics, and tool selection.
PostgreSQL does not ship with built-in multi-node clustering. A single instance handles reads and writes, and if that instance goes down, your application goes down with it. Building a production-grade high availability (HA) setup means assembling several components yourself: streaming replication for data redundancy, a consensus store to prevent split-brain, and a cluster manager to automate failover.
This post walks through how these pieces fit together, compares the main HA tooling options, and covers the failure scenarios you need to plan for.
How PostgreSQL Streaming Replication Works
PostgreSQL high availability starts with streaming replication. The primary server writes every change to the Write-Ahead Log (WAL). Standby nodes connect to the primary and continuously receive a stream of WAL records, replaying them to maintain a near-identical copy of the database.
Two replication modes exist, and the choice between them defines your RPO (Recovery Point Objective):
| | Asynchronous Replication | Synchronous Replication |
|---|---|---|
| How it works | Primary doesn't wait for standby acknowledgment | Primary waits for at least one standby to confirm WAL write |
| RPO | Seconds of potential data loss | Zero data loss (RPO = 0) |
| Write latency | No additional latency | Added round-trip latency to standby |
| Availability risk | Standby failure doesn't affect primary | Standby failure can stall writes unless fallback is configured |
| Best for | Read-heavy workloads tolerating brief data loss | Financial systems, order processing, anything requiring zero-loss guarantees |
Asynchronous replication is the default. Switching to synchronous mode means configuring synchronous_standby_names in postgresql.conf (synchronous_commit = on is already the default; just verify it has not been relaxed to off or local). Patroni can manage this dynamically.
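A minimal sketch of the relevant postgresql.conf settings, assuming a quorum-style setup with hypothetical standby names node2 and node3:

```ini
# postgresql.conf -- synchronous replication sketch (standby names are illustrative)
synchronous_commit = on                            # the default; listed for clarity
synchronous_standby_names = 'ANY 1 (node2, node3)' # any one of the two must ack each commit
```

The ANY quorum form (PostgreSQL 10+) means a commit waits for acknowledgment from any one of the listed standbys, so a single standby outage does not stall writes.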
One detail that catches people off guard: replication slots. Without a replication slot, the primary can recycle WAL segments before a slow standby has consumed them, forcing a full base backup to resync. Replication slots prevent this by tracking each standby's replay position. The trade-off is that a disconnected standby causes WAL to accumulate on the primary indefinitely, which can fill the disk. Patroni handles slot lifecycle automatically, but you should still monitor WAL retention size.
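One way to watch WAL retention per slot is a query against pg_replication_slots on the primary; this is a monitoring sketch, not the only approach:

```sql
-- Run on the primary: WAL retained on behalf of each replication slot.
-- An inactive slot with ever-growing retained_wal is the disk-filling scenario.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

On PostgreSQL 13 and later, max_slot_wal_keep_size can cap how much WAL a slot may retain, at the cost of invalidating a slot that falls too far behind.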
HA Cluster Architecture: The Moving Parts
A PostgreSQL HA cluster that actually works in production has three layers:
Database nodes - One primary and one or more standby instances running PostgreSQL with streaming replication. A minimum of two database nodes is required; three nodes provide better fault tolerance and allow maintenance without losing redundancy.
Consensus store - A distributed key-value store like etcd, ZooKeeper, or Consul. This is what prevents split-brain. The cluster manager acquires a leader lock in the consensus store, and only the lock holder can act as primary. If the primary fails to renew its lock, another node can take over. Run at least three consensus store nodes on separate infrastructure.
Cluster manager - Software like Patroni that runs on each database node, monitors health, manages replication configuration, and orchestrates failover. The cluster manager is the brain of the operation.
On top of these, you need a connection routing layer. Applications should not connect directly to a specific database node because the primary can change at any time. HAProxy, PgBouncer, or a virtual IP (VIP) sits between the application and the database, routing writes to whichever node currently holds the leader lock. HAProxy with health checks against Patroni's REST API is the most common pattern - Patroni exposes endpoints that return HTTP 200 only for the current primary, making routing straightforward.
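A sketch of the HAProxy side of this pattern, assuming Patroni's REST API on its default port 8008 and illustrative node addresses; Patroni's /primary endpoint returns HTTP 200 only on the current leader:

```
# haproxy.cfg sketch -- route writes to whichever node Patroni reports as primary.
# Addresses and node names are illustrative.
listen postgres_primary
    bind *:5432
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.0.1:5432 check port 8008
    server node2 10.0.0.2:5432 check port 8008
    server node3 10.0.0.3:5432 check port 8008
```

A second listener checking a replica endpoint can route read-only traffic to standbys the same way.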
A minimal production setup looks like this: three PostgreSQL nodes (one primary, two standbys), three etcd nodes, and two HAProxy instances for connection routing. That is eight processes across a minimum of three to five servers.
Patroni, repmgr, and pg_auto_failover Compared
Three open-source tools dominate PostgreSQL HA management. They solve the same core problem - automated failover - but differ substantially in architecture and operational complexity.
| | Patroni | repmgr | pg_auto_failover |
|---|---|---|---|
| Architecture | Agent on each node + external DCS (etcd/ZooKeeper/Consul) | Daemon on each node, no external DCS required | Monitor node + agents on data nodes |
| Consensus mechanism | Delegated to etcd/ZooKeeper/Consul | Relies on repmgrd voting (less robust) | Single monitor makes all decisions |
| Failover automation | Fully automatic with leader election | Automatic via repmgrd, needs careful config | Fully automatic via state machine |
| Kubernetes support | Native support, widely used | Limited | Limited |
| Citus integration | Yes, since Patroni 3.0 | No | Yes (originally built by Citus team) |
| Complexity | High (requires managing DCS cluster) | Low to medium | Low |
| Community adoption | Largest; used by GitLab, Zalando, many others | Established but declining | Smaller but growing |
Patroni is the most widely deployed option. It delegates consensus to a battle-tested distributed store, which makes its split-brain prevention robust. The cost is operational complexity: you need to deploy and maintain the etcd (or equivalent) cluster alongside your database nodes. For teams already running Kubernetes, Patroni integrates naturally - operators like Zalando's postgres-operator and Crunchy Data's PGO build on it.
repmgr is simpler to set up. It doesn't require an external consensus store, which means fewer moving parts. The downside: its failover voting mechanism is less rigorous than true distributed consensus, making it more susceptible to edge cases in network partition scenarios. For small clusters in a single datacenter where simplicity matters most, it works fine.
pg_auto_failover takes a different approach with a dedicated monitor node that acts as the single source of truth. It was originally built by the Citus team at Microsoft. The state machine design is elegant and correct, but the single monitor is itself a potential single point of failure (though it can be made redundant).
For most production deployments, Patroni with etcd is the standard choice. If you are running a small cluster and want minimal operational overhead, pg_auto_failover is worth evaluating.
Failover, Switchover, and Recovery
Two distinct operations exist in any HA cluster, and conflating them causes confusion:
- Switchover is a planned, graceful promotion of a standby to primary. The old primary is demoted cleanly. Zero data loss. Used during maintenance windows.
- Failover is an unplanned reaction to a primary failure. The cluster manager detects the outage, selects the best standby, and promotes it.
How Patroni Handles Automatic Failover
When the primary node fails, Patroni's failover sequence works like this:
- The primary stops renewing its leader lock in etcd. The lock has a TTL (default: 30 seconds).
- After the TTL expires, Patroni on the standby nodes detects the leader lock is available.
- Patroni selects the standby with the least replication lag.
- The selected standby acquires the leader lock and promotes itself using pg_ctl promote.
- Remaining standbys reconfigure to replicate from the new primary.
- HAProxy health checks detect the topology change and route writes to the new primary.
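The leader-lock mechanics above can be sketched as a toy simulation. This is not Patroni or etcd code, just an illustration of how a TTL-based lock plus least-lag selection produces a safe, automatic promotion:

```python
class LeaderLock:
    """Toy model of a DCS leader lock with a TTL (not real etcd/Patroni code)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, node, now):
        # The lock is free if unheld or its TTL lapsed without a renewal.
        if self.holder is None or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False

    def renew(self, node, now):
        # Only the current holder can renew, and only before expiry.
        if self.holder == node and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


def elect(lock, standbys_by_lag, now):
    # Promote the standby with the least replication lag that can take the lock.
    for node, _lag in sorted(standbys_by_lag.items(), key=lambda kv: kv[1]):
        if lock.acquire(node, now):
            return node
    return None


lock = LeaderLock(ttl=30)
lock.acquire("node1", now=0)     # node1 is primary
lock.renew("node1", now=10)      # normal operation: renew well before the TTL
# node1 crashes at t=10; its lock expires at t=40 (10 + 30)
new_leader = elect(lock, {"node2": 0.5, "node3": 2.0}, now=45)
print(new_leader)                # node2: least lag, lock expired, so it wins
```

The real system layers health checks and fencing on top of this, but the core safety argument is the same: no node can act as primary without holding an unexpired lock.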
The total failover time depends on your configuration. With default settings, expect 30-60 seconds. Tuning ttl and loop_wait can bring this under 20 seconds, though aggressive settings increase the risk of false failovers. Amazon RDS Multi-AZ failover takes 60-120 seconds by comparison, so a well-tuned Patroni cluster is competitive with managed services.
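A sketch of the relevant Patroni dynamic configuration (applied via patronictl edit-config); the values are illustrative and deliberately on the aggressive side, and Patroni requires that loop_wait + 2 * retry_timeout not exceed ttl:

```yaml
# Patroni dynamic config sketch -- faster failover detection,
# at the cost of higher false-positive risk on flaky networks.
ttl: 20
loop_wait: 5
retry_timeout: 5
```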
Reintegrating the Old Primary with pg_rewind
After failover, the old primary cannot simply rejoin as a standby - its WAL history has diverged. You have two options: take a full base backup (slow for large databases) or use pg_rewind.
pg_rewind compares the data directories of the old and new primaries, identifies only the blocks that changed after the timeline fork, and copies just those blocks. On a 500 GB database where the divergence is a few megabytes, pg_rewind finishes in seconds rather than the hours a full base backup would take.
To use pg_rewind, you need either wal_log_hints = on or data checksums enabled on the cluster. Patroni configures this automatically when use_pg_rewind is set to true in its configuration (which it is by default).
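For reference, a manual invocation looks roughly like this; Patroni runs the equivalent automatically, and the data directory path and connection string here are illustrative:

```shell
# Run on the failed old primary, with PostgreSQL stopped on that node.
pg_rewind --target-pgdata=/var/lib/postgresql/data \
          --source-server="host=new-primary port=5432 user=postgres dbname=postgres"
```

After a successful rewind, the node restarts as a standby following the new primary's timeline.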
Common Pitfalls
Failing to test failover. Schedule regular failover drills. A Patroni switchover (patronictl switchover) is non-destructive and validates the full chain: promotion, replication reconfiguration, and connection routing.
Ignoring replication lag monitoring. During normal operation, async replication lag should stay under 1 second. If it regularly exceeds this, a failover will lose more data than you expect. Monitor pg_stat_replication.replay_lag on the primary.
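A starting point for that monitoring, run on the primary (the lag columns are available from PostgreSQL 10 onward):

```sql
-- Per-standby replication health: alert when replay_lag exceeds your threshold.
SELECT application_name,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```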
Running etcd on the same nodes as PostgreSQL. This works for dev/test but is risky in production. If a server has a disk failure, you lose both a database node and a consensus voter simultaneously. Dedicated etcd nodes (or at least dedicated disks) reduce correlated failures.
Synchronous replication without a fallback. If you use synchronous replication with a single standby and that standby goes down, writes stall on the primary. Set synchronous_standby_names to use ANY 1 (node2, node3) with multiple standbys, or configure Patroni's synchronous_mode_strict: false to allow fallback to async when no sync standby is available.
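In Patroni terms, the fallback behavior is a sketch like the following in the dynamic configuration; with strict mode off (the default), the cluster degrades to asynchronous replication instead of stalling writes when no synchronous standby is available:

```yaml
# Patroni dynamic config sketch -- synchronous replication with async fallback.
synchronous_mode: true
synchronous_mode_strict: false
```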
Key Takeaways
- PostgreSQL HA requires assembling multiple components: streaming replication, a consensus store (etcd), a cluster manager (Patroni), and a connection proxy (HAProxy).
- Choose between synchronous replication (zero data loss, added write latency) and asynchronous replication (potential data loss, no latency penalty) based on your RPO requirements.
- Patroni with etcd is the industry standard for production PostgreSQL HA. repmgr is simpler but less robust; pg_auto_failover is a solid lightweight alternative.
- Typical failover time with Patroni is 20-60 seconds depending on configuration, competitive with managed cloud offerings.
- Use pg_rewind to reintegrate failed primaries quickly without full base backups.
- Test failover regularly, monitor replication lag, and keep etcd on separate infrastructure from your database nodes.