Why Self-Host Database High Availability?
Every production system eventually faces the same reality: a single database node is a single point of failure. When that node crashes, your application goes down, in-flight transactions fail, and users walk away. The traditional answer has been to pay a premium for managed database services that handle replication and failover automatically. But managed services come with steep costs, opaque pricing models, and limited control over how your data is actually protected.
Self-hosting database high availability puts you back in the driver’s seat. You choose the replication topology, control the failover behavior, decide where data lives, and avoid the vendor markup that can triple your infrastructure bill at scale. Whether you’re running a SaaS platform, an e-commerce storefront, or an internal tooling stack, the open-source HA solutions covered in this guide give you enterprise-grade resilience without the enterprise price tag.
Key benefits of self-hosting database HA:
- Full control over failover logic — tune RPO and RTO to your exact requirements instead of accepting a provider’s defaults
- Transparent costs — no per-connection fees, storage premiums, or cross-region data transfer charges
- Data sovereignty — keep every byte within your own infrastructure, satisfying GDPR, HIPAA, and SOC 2 mandates
- Custom replication topologies — build cascading replicas, cross-region clusters, or hybrid read/write setups
- No lock-in — open-source tools that work with standard PostgreSQL and MySQL/MariaDB, not proprietary extensions
In this guide, we compare the three most widely deployed open-source database HA solutions: Patroni for PostgreSQL, Galera Cluster for MySQL and MariaDB, and repmgr for PostgreSQL. Each takes a fundamentally different approach to keeping your database online, and understanding those differences is critical for building a resilient architecture.
How Database High Availability Works
Before diving into specific tools, it helps to understand the two broad strategies for database replication:
Asynchronous replication sends changes to replica nodes without waiting for confirmation. The primary node continues processing immediately, which means fast write performance but the risk of losing unreplicated transactions if the primary crashes. The replication lag is typically measured in milliseconds but can spike under heavy load.
Synchronous replication waits for at least one replica to confirm receipt of each transaction before committing. This guarantees zero data loss (RPO = 0) but adds latency to every write operation. Most production setups use synchronous replication for the nearest replica and asynchronous for distant ones.
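This mixed model can be expressed directly in PostgreSQL configuration. A minimal sketch (the standby names are placeholders for your own replicas):

```ini
# postgresql.conf on the primary -- illustrative values
synchronous_commit = on

# ANY 1 (...) commits once one of the listed standbys confirms receipt.
# Standbys NOT listed here (e.g. a distant cross-region replica)
# continue to replicate asynchronously.
synchronous_standby_names = 'ANY 1 (replica_local1, replica_local2)'
```

Listing only nearby replicas keeps commit latency low while distant replicas trail asynchronously.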
The third key concept is automatic failover: when the primary node becomes unreachable, a promotion process elevates a replica to primary status, reconfigures the remaining replicas, and redirects application traffic. The speed and reliability of this process determines your Recovery Time Objective (RTO).
| Concept | Definition | Typical Target |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss | 0 (sync) to seconds (async) |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | 10–60 seconds |
| Quorum | Minimum nodes needed for consensus | Majority (N/2 + 1) |
| Split brain | Two primaries accepting writes | Must be prevented at all costs |
Patroni: PostgreSQL HA with Distributed Consensus
Patroni is widely considered the gold standard for PostgreSQL high availability. Originally developed by Zalando, it has become the go-to solution for production PostgreSQL clusters. Patroni uses a distributed consensus store — etcd, Consul, ZooKeeper, or the Kubernetes API — to manage cluster state and coordinate leader elections.
Architecture
Patroni runs as a daemon alongside each PostgreSQL instance. The cluster elects one node as the leader (primary) using a distributed lock in the consensus store. The leader holds a TTL-based lease; if it fails to renew the lease within the configured timeout, the lease expires and a new leader election begins.
Replication is handled by PostgreSQL’s native streaming replication (physical) or logical replication. Patroni manages the entire lifecycle: initializing replicas, promoting a new leader, reconfiguring postgresql.conf and pg_hba.conf dynamically, and running custom scripts on state transitions.
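The lease and lifecycle behavior described above is driven by a small YAML file on each node. A trimmed sketch with illustrative hostnames and values (not a complete production config):

```yaml
# patroni.yml -- core settings per node (values are illustrative)
scope: demo-cluster        # cluster name, shared by all members
name: node1                # unique per node

etcd3:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30                # leader lease in seconds; expiry triggers election
    loop_wait: 10          # seconds between HA loop iterations
    retry_timeout: 10      # retry window for DCS/PostgreSQL operations
    postgresql:
      use_pg_rewind: true  # lets a failed primary rejoin as a replica

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/data

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008
```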
Docker Compose Setup
Here is the layout of a Patroni cluster with three PostgreSQL nodes and a three-node etcd quorum:
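A condensed sketch showing one etcd node and one Patroni node — repeat the pattern for etcd2/etcd3 and patroni2/patroni3. The Patroni image name is a placeholder (you typically build an image containing PostgreSQL plus Patroni), and the environment variables follow Patroni's standard `PATRONI_*` conventions; verify them against your image:

```yaml
# docker-compose.yml (excerpt)
services:
  etcd1:
    image: quay.io/coreos/etcd:v3.5.9
    command: >
      etcd --name etcd1
      --listen-client-urls http://0.0.0.0:2379
      --advertise-client-urls http://etcd1:2379
      --listen-peer-urls http://0.0.0.0:2380
      --initial-advertise-peer-urls http://etcd1:2380
      --initial-cluster etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380
      --initial-cluster-state new

  patroni1:
    image: my-patroni-image        # placeholder: PostgreSQL + Patroni
    environment:
      PATRONI_NAME: patroni1
      PATRONI_SCOPE: demo-cluster
      PATRONI_ETCD3_HOSTS: etcd1:2379,etcd2:2379,etcd3:2379
      PATRONI_RESTAPI_CONNECT_ADDRESS: patroni1:8008
      PATRONI_POSTGRESQL_CONNECT_ADDRESS: patroni1:5432
      PATRONI_SUPERUSER_USERNAME: postgres
      PATRONI_SUPERUSER_PASSWORD: change-me
      PATRONI_REPLICATION_USERNAME: replicator
      PATRONI_REPLICATION_PASSWORD: change-me
    volumes:
      - patroni1_data:/var/lib/postgresql/data
    depends_on: [etcd1]

volumes:
  patroni1_data:
```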
The HAProxy configuration routes writes to the leader and reads to all replicas:
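A sketch of the HAProxy side, using Patroni's REST API role endpoints as health checks: `/primary` returns 200 only on the current leader, `/replica` only on healthy replicas. Hostnames and ports are illustrative:

```
# haproxy.cfg (excerpt)
defaults
    mode tcp
    timeout connect 5s
    timeout client  30m
    timeout server  30m

# Writes: only the node answering 200 on /primary is eligible
listen postgres_write
    bind *:5000
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server patroni1 patroni1:5432 check port 8008
    server patroni2 patroni2:5432 check port 8008
    server patroni3 patroni3:5432 check port 8008

# Reads: round-robin across nodes answering 200 on /replica
listen postgres_read
    bind *:5001
    balance roundrobin
    option httpchk GET /replica
    http-check expect status 200
    server patroni1 patroni1:5432 check port 8008
    server patroni2 patroni2:5432 check port 8008
    server patroni3 patroni3:5432 check port 8008
```

The `on-marked-down shutdown-sessions` directive kills stale write connections immediately after a failover, so applications reconnect to the new leader rather than hanging.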
Key Patroni Features
- Automatic failover in 10–30 seconds using distributed consensus
- Dynamic configuration — change PostgreSQL settings without restarting via the REST API
- Custom scripts — run shell scripts on leader change, restart, or switchover events
- Standby cluster support for cross-region disaster recovery
- pgBackRest and WAL-G integration for automated backup management
- Kubernetes native — Patroni can use the Kubernetes API itself as its consensus store, so no separate etcd cluster is needed on K8s
Monitoring and Operations
Patroni exposes a REST API on port 8008 with health endpoints:
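Typical checks against those endpoints (hostnames are illustrative):

```shell
# Liveness: returns 200 whenever the Patroni daemon is running
curl -s http://patroni1:8008/health

# Role checks: 200 only if this node currently holds that role
curl -s -o /dev/null -w '%{http_code}\n' http://patroni1:8008/primary
curl -s -o /dev/null -w '%{http_code}\n' http://patroni1:8008/replica

# Full cluster state as JSON: members, roles, lag, timeline
curl -s http://patroni1:8008/cluster

# Or use the CLI from any cluster node
patronictl -c /etc/patroni.yml list
```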
Galera Cluster: Synchronous Multi-Master for MySQL and MariaDB
Galera Cluster takes a fundamentally different approach from Patroni. Instead of a single primary with read replicas, Galera provides true multi-master replication — every node can accept writes simultaneously, and changes are synchronously replicated across the cluster using a certification-based replication protocol.
Architecture
Galera integrates directly into MySQL and MariaDB through a wsrep (write-set replication) plugin. When a transaction commits on any node, the write set is broadcast to all other nodes via group communication. Each node independently certifies the write set against its local state — if the certification passes, the transaction is applied locally; if it fails (due to a conflict), the transaction is rolled back on the originating node.
This model eliminates the need for leader elections entirely. Every node is equal, and the cluster continues operating as long as a majority of nodes remain connected.
Docker Compose Setup
A three-node Galera Cluster with MariaDB:
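A condensed sketch showing the first node — galera2 and galera3 follow the same pattern with their own names. The `wsrep_*` options are standard Galera settings, but verify the provider library path for your image, and note that the very first node must be started once with `--wsrep_new_cluster` to bootstrap:

```yaml
# docker-compose.yml (excerpt)
services:
  galera1:
    image: mariadb:11.4
    environment:
      MARIADB_ROOT_PASSWORD: change-me
    # Add --wsrep_new_cluster to this node ONLY for the initial bootstrap
    command: >
      --wsrep_on=ON
      --wsrep_provider=/usr/lib/galera/libgalera_smm.so
      --wsrep_cluster_name=demo-galera
      --wsrep_cluster_address=gcomm://galera1,galera2,galera3
      --wsrep_node_name=galera1
      --wsrep_node_address=galera1
      --binlog_format=ROW
      --default_storage_engine=InnoDB
      --innodb_autoinc_lock_mode=2
    volumes:
      - galera1_data:/var/lib/mysql

volumes:
  galera1_data:
```

Row-based binlogs, InnoDB, and `innodb_autoinc_lock_mode=2` are hard requirements for Galera's certification-based replication.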
For the Galera HAProxy configuration, all nodes accept writes, so the backend is simple round-robin:
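A minimal sketch — one backend serves both reads and writes. The `mysql-check` option assumes a dedicated `haproxy` user exists in the cluster; a plain TCP check also works:

```
# haproxy.cfg (excerpt)
listen galera
    bind *:3306
    mode tcp
    balance roundrobin
    option mysql-check user haproxy
    server galera1 galera1:3306 check
    server galera2 galera2:3306 check
    server galera3 galera3:3306 check
```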
Key Galera Features
- True multi-master — every node accepts reads and writes simultaneously
- Synchronous replication — zero data loss guarantee (RPO = 0)
- Automatic node provisioning — new nodes sync via SST (State Snapshot Transfer) using MariaDB Backup or rsync
- Parallel replication — Galera 4 supports parallel apply for improved throughput
- Flow control — automatically throttles fast writers if replicas fall behind
- No single point of failure — no leader election, no consensus store required
Monitoring Galera
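Galera exposes its health through `wsrep_*` status variables, queryable on any node. The key ones to watch (the 0.1 alerting threshold is a common rule of thumb, not an official value):

```sql
-- Cluster health: size should equal your node count, status 'Primary'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';

-- This node's state: 'Synced' means fully caught up
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

-- Fraction of time writes were paused by flow control
-- (alert if it climbs above ~0.1 -- a slow node is throttling the cluster)
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';

-- Whether this node is ready to accept queries
SHOW GLOBAL STATUS LIKE 'wsrep_ready';
```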
repmgr: PostgreSQL Replication Manager
repmgr is the traditional PostgreSQL replication management tool, developed by 2ndQuadrant (now part of EDB). Unlike Patroni’s distributed consensus approach, repmgr manages replication through a metadata schema stored inside the PostgreSQL cluster itself, combined with a monitoring daemon that watches node health.
Architecture
repmgr registers each node in a dedicated repmgr database that tracks the cluster topology: which node is primary, which are standbys, and their replication relationships. The repmgrd daemon runs on each node, monitoring the upstream connection and triggering failover when the primary becomes unreachable.
The key distinction from Patroni is that repmgr’s failover requires a witness node or a tie-breaking mechanism to prevent split brain, and the failover process relies on monitoring daemons rather than a distributed consensus protocol.
Docker Compose Setup
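A sketch using the Bitnami `postgresql-repmgr` image, which wires up repmgr and `repmgrd` through environment variables — one primary and one standby shown, with a witness following the same pattern. Variable names follow that image's conventions; verify them against its documentation:

```yaml
# docker-compose.yml (excerpt)
services:
  pg-1:
    image: bitnami/postgresql-repmgr:16
    environment:
      POSTGRESQL_PASSWORD: change-me
      REPMGR_PASSWORD: change-me
      REPMGR_PRIMARY_HOST: pg-1
      REPMGR_NODE_NAME: pg-1
      REPMGR_NODE_NETWORK_NAME: pg-1
      REPMGR_PARTNER_NODES: pg-1,pg-2
    volumes:
      - pg1_data:/bitnami/postgresql

  pg-2:
    image: bitnami/postgresql-repmgr:16
    environment:
      POSTGRESQL_PASSWORD: change-me
      REPMGR_PASSWORD: change-me
      REPMGR_PRIMARY_HOST: pg-1
      REPMGR_NODE_NAME: pg-2
      REPMGR_NODE_NETWORK_NAME: pg-2
      REPMGR_PARTNER_NODES: pg-1,pg-2
    volumes:
      - pg2_data:/bitnami/postgresql

volumes:
  pg1_data:
  pg2_data:
```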
Key repmgr Features
- Simple topology management — register, unregister, and follow commands for manual control
- Built-in monitoring daemon — repmgrd watches the primary and can auto-promote
- Standby cloning — repmgr standby clone creates replicas from any healthy node
- Witness server — a lightweight node that breaks ties during failover without holding data
- Event notifications — hook scripts for failover, switchover, and node state changes
- Lower complexity — no external consensus store required, just PostgreSQL itself
Operations with repmgr
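Day-to-day administration runs through the `repmgr` CLI. The commands below are standard; hostnames and file paths are illustrative:

```shell
# Inspect the cluster topology and replication status
repmgr -f /etc/repmgr.conf cluster show

# Clone a new standby from the primary, then register it
repmgr -f /etc/repmgr.conf -h primary-host -U repmgr -d repmgr standby clone
repmgr -f /etc/repmgr.conf standby register

# Planned role swap with no data loss (run on the standby being promoted)
repmgr -f /etc/repmgr.conf standby switchover

# Manual promotion if repmgrd is not handling failover automatically
repmgr -f /etc/repmgr.conf standby promote
```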
Head-to-Head Comparison
| Feature | Patroni | Galera Cluster | repmgr |
|---|---|---|---|
| Database | PostgreSQL only | MySQL / MariaDB | PostgreSQL only |
| Replication Model | Primary + async/sync replicas | Multi-master (all nodes RW) | Primary + async streaming |
| Consensus | etcd / Consul / ZooKeeper / K8s | Built-in (group communication) | repmgr metadata DB + witness |
| Failover Time | 10–30 seconds | Instant (no failover needed) | 30–60 seconds |
| Data Loss Risk | Depends on sync config | Zero (synchronous) | Depends on sync config |
| Write Scalability | Single writer | All nodes accept writes | Single writer |
| Read Scalability | N-1 replicas | All N nodes | N-1 replicas |
| Network Partitions | Handled by consensus quorum | Cluster may split; needs bootstrap | Manual intervention often needed |
| External Dependencies | Yes (consensus store) | None | None (optional witness) |
| Configuration Complexity | Medium (YAML + etcd) | Medium (my.cnf + wsrep) | Low (repmgr.conf) |
| Automatic Rejoin | Yes (former primary becomes replica) | N/A (all nodes equal) | Manual or via hook scripts |
| Backup Integration | pgBackRest, WAL-G, Barman | Mariabackup, mysqldump, XtraBackup | pgBackRest, WAL-G, Barman |
| Kubernetes Support | Excellent (native operator) | Moderate (StatefulSet) | Moderate (custom operator) |
| Geographic Distribution | Good (standby cluster mode) | Poor (latency-sensitive sync) | Moderate (cascading replicas) |
| License | MIT | GPLv2 (Galera) + GPL (MariaDB) | GPLv3 |
| Best For | Production PostgreSQL, K8s | MySQL/MariaDB, multi-master workloads | Simpler PostgreSQL setups |
Choosing the Right Solution
Choose Patroni if:
- You run PostgreSQL and need production-grade automatic failover
- You are deploying on Kubernetes (the Patroni Operator is excellent)
- You want the ability to dynamically reconfigure PostgreSQL at runtime
- You need geographic distribution with standby clusters
- You already operate etcd or Consul for other services
Patroni is the safest choice for most PostgreSQL deployments. The distributed consensus approach eliminates split-brain scenarios, and the REST API enables deep automation. The learning curve comes from managing the consensus store, but that investment pays off in operational reliability.
Choose Galera Cluster if:
- You run MySQL or MariaDB and need multi-master write capability
- Your application writes through multiple nodes simultaneously (e.g., behind a round-robin load balancer)
- Zero data loss is non-negotiable and you can tolerate the write latency
- You want HA without external dependencies like etcd
- Your workload is read-heavy with moderate write volume
Galera excels when you need every node to accept writes. However, be aware that write-heavy workloads with large transactions will experience certification conflicts and rollbacks. The synchronous replication also means that network latency between nodes directly impacts write performance — keep nodes in the same data center or use low-latency connections.
Choose repmgr if:
- You want the simplest possible PostgreSQL HA setup
- You do not want to manage an external consensus store
- Your team is already familiar with PostgreSQL administration
- You prefer manual or semi-automatic failover with explicit control
- You are running a small cluster (2–3 nodes) where the overhead of etcd is unjustified
repmgr is the most straightforward option but requires more operational discipline. Without a consensus store, you need to be careful about split-brain scenarios and ensure your witness node or tie-breaking strategy is properly configured.
Production Best Practices
Regardless of which tool you choose, these practices are essential for production database HA:
1. Test failover regularly. Schedule quarterly failover drills. Automated monitoring tells you when things are broken, but only an actual failover test proves your recovery process works.
2. Monitor replication lag continuously. Use built-in replication delay metrics and alert when lag exceeds your RPO threshold. For Patroni, check pg_stat_replication. For Galera, monitor wsrep_flow_control_paused.
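Example lag queries for both ecosystems (run the PostgreSQL query on the primary; columns shown exist in PostgreSQL 10+):

```sql
-- PostgreSQL: per-replica lag in bytes and as a time interval
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- Galera (MariaDB/MySQL): fraction of time writes were flow-control paused
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
```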
3. Use connection pooling. Place PgBouncer or ProxySQL in front of your cluster to handle connection management, failover re-routing, and connection limits. Applications should never connect directly to database nodes in an HA setup.
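A minimal PgBouncer sketch for the pattern above — the pooler targets a load-balancer write endpoint (host and port are illustrative) so failover re-routing happens underneath the pool:

```ini
; pgbouncer.ini (excerpt) -- illustrative hosts and limits
[databases]
appdb = host=haproxy port=5000 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction      ; server connections reused per transaction
max_client_conn = 1000       ; clients the pooler will accept
default_pool_size = 20       ; actual connections per database/user pair
```

Transaction pooling lets a thousand application connections share a few dozen real database connections, which also shortens reconnection storms after a failover.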
4. Separate WAL/binlog from data. Store transaction logs on a different physical disk or volume than the data directory. This prevents I/O contention and simplifies backup strategies.
5. Plan your backup strategy. HA is not a backup strategy. You still need point-in-time recovery capable backups. For PostgreSQL, use WAL-G or pgBackRest with off-site storage. For MariaDB, use Mariabackup with incremental backups.
6. Size your consensus store properly. If using Patroni with etcd, run at least three etcd nodes on separate failure domains. The etcd cluster should have dedicated fast storage (SSD) since it handles frequent write operations for leader lease renewal.
7. Document your runbook. When the primary fails at 3 AM, your team should not be reading documentation for the first time. Maintain a runbook with step-by-step recovery procedures, including manual promotion commands, DNS update steps, and application reconnection procedures.
Summary
| Scenario | Recommended Tool |
|---|---|
| PostgreSQL on Kubernetes | Patroni (with K8s operator) |
| PostgreSQL, simple setup | repmgr |
| PostgreSQL, production-grade | Patroni + etcd + HAProxy |
| MySQL/MariaDB, multi-master writes | Galera Cluster |
| MySQL/MariaDB, standard HA | MariaDB primary + replicas (no Galera) |
| Cross-region PostgreSQL | Patroni with standby cluster |
| Lowest operational complexity | repmgr |
| Highest availability guarantee | Patroni or Galera (depending on database) |
Database high availability is not a set-and-forget concern. The tools covered here give you the building blocks, but achieving reliable HA requires testing, monitoring, and ongoing operational discipline. Start with the setup that matches your team’s expertise, test failover thoroughly before going live, and iterate as your requirements evolve.
Frequently Asked Questions (FAQ)
Which one should I choose in 2026?
The best choice depends on your specific requirements:
- For production PostgreSQL: Patroni with etcd is the safest default
- For MySQL/MariaDB: Galera Cluster if you need multi-master writes; otherwise standard primary/replica replication
- For small PostgreSQL clusters: repmgr keeps operational overhead low
- For privacy and compliance: all three are fully open source and self-hosted, with no telemetry
Refer to the comparison table above for detailed feature breakdowns.
Can I migrate between these tools?
Patroni and repmgr both manage standard PostgreSQL streaming replication, so you can move a cluster between them without dumping and reloading data. Migrating between PostgreSQL and MySQL/MariaDB, however, requires a full logical export and import. In either case, always:
- Back up your current data
- Test the migration on a staging environment
- Check official migration guides in the documentation
Are there free versions available?
All three tools are free and fully open source. Commercial support subscriptions are available from companies in each ecosystem if you need an SLA behind your cluster.
How do I get started?
- Review the comparison table to identify your requirements
- Visit the official documentation for the tool you choose
- Start with a Docker Compose setup for easy testing
- Join the community forums for troubleshooting