MySQL replication is the foundation of database high availability, but managing replication topologies at scale quickly becomes complex. Manual failover, replication lag monitoring, and topology changes require specialized tools. This guide compares three self-hosted solutions for MySQL replication topology management that automate failover, visualize replication graphs, and keep your databases running.

The Challenge of MySQL Replication Management

Running MySQL in production means dealing with:

  • Primary failover: When the primary goes down, promoting a replica without data loss
  • Replication lag: Detecting and resolving delays between primary and replicas
  • Topology changes: Adding/removing replicas, switching from star to chain topology
  • GTID management: Tracking Global Transaction IDs during failover events
  • Semi-sync configuration: Balancing consistency vs. latency with synchronous replication

Without proper tooling, a primary failure means minutes of manual intervention — or worse, split-brain scenarios where two nodes believe they are primary.

Comparison Overview

| Feature                  | MySQL Orchestrator | Percona PMM           | MySQL Shell          |
|--------------------------|--------------------|-----------------------|----------------------|
| Automatic failover       | Yes                | No (monitoring only)  | Semi-automatic       |
| Topology visualization   | Web UI             | Grafana dashboards    | CLI text output      |
| Replication lag alerting | Built-in           | Advanced (Prometheus) | Manual checks        |
| GTID-aware failover      | Yes                | N/A                   | Yes                  |
| Semi-sync management     | Yes                | Monitoring only       | Manual configuration |
| API access               | REST API           | Grafana API           | Python/JS scripting  |
| Cluster support          | 1000+ nodes        | Limited by Prometheus | Per-session          |
| Open source              | Apache 2.0         | Apache 2.0            | GPL v2               |
| GitHub stars             | 5,700+             | 1,000+                | Bundled with MySQL   |

MySQL Orchestrator: Automated Topology Management

MySQL Orchestrator (5,700+ GitHub stars) is the gold standard for MySQL replication topology management. It discovers your replication graph automatically, provides a web-based topology visualization, and supports both manual and automatic failover with safety checks.

How Orchestrator Works

Orchestrator connects to each MySQL instance, reads replication metadata (SHOW SLAVE STATUS and SHOW MASTER STATUS, renamed SHOW REPLICA STATUS and SHOW BINARY LOG STATUS in recent MySQL releases), and builds a complete topology graph. It stores this graph in a backend database (SQLite or MySQL; Consul can additionally serve as a key-value store), and exposes it via a REST API and web UI.
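The discovery step can be sketched as assembling a tree from per-instance (host, source) pairs. This is a toy illustration with hypothetical host names; the real Orchestrator stores far more state per instance and handles co-masters, downed nodes, and discovery races.

```python
# Sketch: building a replication tree from per-instance metadata, the way a
# topology manager assembles its graph from SHOW REPLICA STATUS output.
from collections import defaultdict

# (instance, source) pairs; source is None for the primary. Hypothetical data.
instances = [
    ("mysql-primary:3306", None),
    ("mysql-replica1:3306", "mysql-primary:3306"),
    ("mysql-replica2:3306", "mysql-primary:3306"),
    ("mysql-replica3:3306", "mysql-replica1:3306"),  # chained replica
]

children = defaultdict(list)
for host, source in instances:
    children[source].append(host)

def render(host, depth=0):
    """Print the replication tree, indenting one level per hop."""
    print("  " * depth + host)
    for child in children.get(host, []):
        render(child, depth + 1)

for root in children[None]:  # primaries have no upstream source
    render(root)
```

Walking this structure is all the web UI's topology map needs; failover planning is then a question of re-parenting subtrees.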

When a failover is triggered, Orchestrator:

  1. Detects the primary failure through health checks
  2. Identifies the most up-to-date replica using GTID position
  3. Promotes the selected replica with CHANGE MASTER TO (CHANGE REPLICATION SOURCE TO on MySQL 8.0.23+)
  4. Reconfigures remaining replicas to point to the new primary
  5. Updates topology visualization and sends alerts
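Step 2 above boils down to a GTID set comparison: the safest promotion candidate is the replica whose executed GTID set covers every other replica's set. The sketch below simplifies GTID sets to one transaction counter per source UUID (real sets are interval lists) and is not Orchestrator's actual algorithm, which also weighs binlog settings, versions, and promotion rules.

```python
# Simplified promotion-candidate selection by GTID position.
# Each replica's executed set is modeled as {server_uuid: highest_txn_id}.

def covers(a, b):
    """True if GTID set `a` contains everything in GTID set `b`."""
    return all(a.get(uuid, 0) >= txn for uuid, txn in b.items())

def pick_promotion_candidate(replicas):
    """Return the replica whose executed set covers all others."""
    for name, gtids in replicas.items():
        if all(covers(gtids, other) for other in replicas.values()):
            return name
    raise RuntimeError("no replica covers all others; manual resolution needed")

# Hypothetical state at the moment the primary dies:
replicas = {
    "replica1": {"uuid-a": 120, "uuid-b": 45},
    "replica2": {"uuid-a": 118, "uuid-b": 45},
}
print(pick_promotion_candidate(replicas))  # replica1: most up to date
```

If no single replica covers all others (diverged writes), automatic promotion is unsafe, which is exactly why the error path exists.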

Docker Compose Deployment

version: "3.8"

services:
  orchestrator:
    image: ghcr.io/openark/orchestrator:latest
    ports:
      - "3000:3000"
    volumes:
      - ./orchestrator/conf:/usr/local/orchestrator/resources/conf
      - ./orchestrator/data:/usr/local/orchestrator/data
    environment:
      - ORCHESTRATOR_DEBUG=false
      - ORCHESTRATOR_BACKEND_DB_TYPE=sqlite
      - ORCHESTRATOR_DISCOVER_SEEDS=mysql-primary:3306,mysql-replica1:3306
    restart: unless-stopped
    depends_on:
      - mysql-primary

  mysql-primary:
    image: mysql:8.4
    ports:
      - "3306:3306"
    environment:
      - MYSQL_ROOT_PASSWORD=rootpass
      # Note: the official mysql image ignores these two variables; create the
      # replication user via an init script in /docker-entrypoint-initdb.d
      - MYSQL_REPLICATION_USER=repl
      - MYSQL_REPLICATION_PASSWORD=replpass
    volumes:
      - ./mysql-primary/data:/var/lib/mysql
      - ./mysql-primary/conf.d:/etc/mysql/conf.d
    command: >
      --server-id=1
      --gtid-mode=ON
      --enforce-gtid-consistency=ON
      --log-bin=mysql-bin
      --binlog-format=ROW
      --plugin-load-add=semisync_source.so
      --rpl-semi-sync-source-enabled=1

networks:
  default:
    driver: bridge

Orchestrator Configuration

{
  "Debug": false,
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "orch_pass",
  "DetectionPeriodSeconds": 5,
  "InstancePollSeconds": 5,
  "ForgetUnseenAgentDifferential": 50,
  "DiscoverByShowSlaveHosts": true,
  "FailureDetectionPeriodBlockMinutes": 60,
  "RecoveryPeriodBlockSeconds": 300,
  "RecoveryIgnoreHostnameFilters": [],
  "ApplyMySQLPromotionAfterMasterFailover": true,
  "MasterFailoverDetachReplicaMasterHost": true,
  "MasterFailoverLostInstancesDowntimeMinutes": 10
}

Key failover safety settings:

  • RecoveryPeriodBlockSeconds: Minimum interval between automated recoveries on the same cluster, preventing cascading failovers (300s in the example above)
  • FailureDetectionPeriodBlockMinutes: Cooldown before the same failure is re-detected and re-announced
  • ApplyMySQLPromotionAfterMasterFailover: Runs RESET SLAVE ALL and clears read_only on the promoted replica
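The cooldown behind RecoveryPeriodBlockSeconds can be sketched in a few lines. This is illustrative only; Orchestrator actually tracks recovery state in its backend database, not in process memory.

```python
# Sketch of the RecoveryPeriodBlockSeconds idea: after an automated recovery
# on a cluster, block further automatic recoveries until the cooldown passes.
import time

RECOVERY_PERIOD_BLOCK_SECONDS = 300
_last_recovery = {}  # cluster name -> timestamp of last recovery

def may_recover(cluster, now=None):
    """True if enough time has passed since this cluster's last recovery."""
    now = time.time() if now is None else now
    last = _last_recovery.get(cluster)
    return last is None or now - last >= RECOVERY_PERIOD_BLOCK_SECONDS

def record_recovery(cluster, now=None):
    _last_recovery[cluster] = time.time() if now is None else now

record_recovery("production", now=1000.0)
print(may_recover("production", now=1100.0))  # False: inside the 300s cooldown
print(may_recover("production", now=1400.0))  # True: 400s have elapsed
```

The point of the block is that a flapping primary triggers at most one promotion per window, rather than a chain of promotions that shreds the topology.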

Percona Monitoring and Management: Replication Observability

Percona PMM (1,000+ GitHub stars) provides deep visibility into MySQL replication health through Grafana dashboards. While it doesn’t perform automatic failover, it excels at detecting replication issues before they become outages.

Replication Monitoring Features

PMM collects over 1,000 MySQL metrics including:

  • Replication lag (seconds behind primary)
  • Relay log growth rate
  • SQL thread and I/O thread status
  • GTID execution sets
  • Semi-sync replication status
  • Binlog write latency
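As an illustration of alerting on these metrics, a Prometheus rule on replication lag might look like the following. The metric name `mysql_slave_status_seconds_behind_master` comes from mysqld_exporter, which PMM builds on; treat the exact metric name and threshold as assumptions to verify against your PMM release.

```yaml
groups:
  - name: mysql-replication
    rules:
      - alert: MySQLReplicationLagHigh
        # mysqld_exporter metric; confirm the name in your PMM version
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} is lagging behind the primary"
```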

Docker Compose for PMM Server

version: "3.8"

services:
  pmm-server:
    image: percona/pmm-server:2
    container_name: pmm-server
    hostname: pmm-server
    ports:
      - "443:443"
    volumes:
      - pmm-data:/srv
    environment:
      - PERCONA_TEST_DBAAS=1
      - PMM_AGENT_CONFIG_FILE=/usr/local/percona/pmm2/config/pmm-agent.yaml
    restart: unless-stopped

volumes:
  pmm-data: {}

Adding MySQL Instances to PMM

# Install PMM client on each MySQL host
docker pull percona/pmm-client:2

# Register MySQL instance with PMM server
docker run -d \
  -p 42000:42000 \
  --name pmm-client \
  -e PMM_AGENT_SERVER_URL=https://pmm-server:443 \
  -e PMM_AGENT_SERVER_INSECURE_TLS=1 \
  -e PMM_AGENT_CONFIG_FILE=/usr/local/percona/pmm2/config/pmm-agent.yaml \
  percona/pmm-client:2

# Add MySQL monitoring
pmm-admin add mysql --username=pmm --password=pmpass --query-source=slowlog mysql-primary:3306

PMM’s replication dashboard shows lag trends, thread status history, and binlog throughput — essential for capacity planning and performance tuning.

MySQL Shell: Scripted Topology Management

MySQL Shell (bundled with MySQL 8.0+) provides a JavaScript/Python interface for managing InnoDB Cluster and replication topologies. It supports the AdminAPI, which enables programmatic cluster management including automated failover through MySQL Router.

InnoDB Cluster Setup

# Run inside MySQL Shell in Python mode (mysqlsh --py);
# the shell and dba objects are built-in globals there, so no import is needed.

# Connect to the primary instance
shell.connect('admin@mysql-primary:3306')

# Create an InnoDB Cluster
cluster = dba.create_cluster('production')

# Add replicas
cluster.add_instance('admin@mysql-replica1:3306')
cluster.add_instance('admin@mysql-replica2:3306')

# Check cluster status
cluster.status()

# Perform a controlled switchover
cluster.set_primary_instance('mysql-replica1:3306')
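The status() call returns a nested dictionary, so scripts can gate operations on cluster health. The keys and status strings below follow the InnoDB Cluster AdminAPI conventions but are abbreviated and should be verified against your Shell version; the helper and sample dict are hypothetical.

```python
# Sketch: decide whether the cluster can survive a member failure, given a
# dict shaped like the output of cluster.status() (heavily abbreviated).

HEALTHY = {"OK", "OK_PARTIAL"}  # "OK_NO_TOLERANCE" = running, but no failover headroom

def can_tolerate_failure(status):
    """True if the cluster would survive losing one member."""
    return status["defaultReplicaSet"]["status"] in HEALTHY

# Hypothetical two-node cluster: online, but one loss would cost quorum.
sample = {"defaultReplicaSet": {"status": "OK_NO_TOLERANCE"}}
print(can_tolerate_failure(sample))  # False: add a third member for headroom
```

A check like this is a natural pre-flight step before running set_primary_instance() or rolling maintenance across members.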

Docker Compose with InnoDB Cluster

version: "3.8"

services:
  mysql-1:
    image: mysql:8.4
    environment:
      - MYSQL_ROOT_PASSWORD=rootpass
      - MYSQL_USER=icadmin
      - MYSQL_PASSWORD=icpass
    volumes:
      - ./mysql-1/data:/var/lib/mysql
    command: >
      --server-id=1
      --gtid-mode=ON
      --enforce-gtid-consistency=ON
      --log-bin=mysql-bin
      --binlog-format=ROW
      --mysqlx=ON
      --plugin-load-add=mysql_clone.so

  mysql-2:
    image: mysql:8.4
    environment:
      - MYSQL_ROOT_PASSWORD=rootpass
      - MYSQL_USER=icadmin
      - MYSQL_PASSWORD=icpass
    volumes:
      - ./mysql-2/data:/var/lib/mysql
    depends_on:
      - mysql-1
    command: >
      --server-id=2
      --gtid-mode=ON
      --enforce-gtid-consistency=ON
      --log-bin=mysql-bin
      --binlog-format=ROW

  mysql-shell:
    image: mysql/mysql-shell:latest
    depends_on:
      - mysql-1
      - mysql-2
    entrypoint: ["mysqlsh", "--py"]

Decision Matrix

Choose MySQL Orchestrator when:

  • You need automated failover with safety guarantees
  • You manage 10+ MySQL instances with complex topologies
  • You want a web-based topology map for operations teams
  • You operate mixed MySQL versions and need unified management

Choose Percona PMM when:

  • Your priority is observability over automation
  • You already use Grafana/Prometheus for monitoring
  • You need deep query-level replication analysis
  • You want historical trend analysis for capacity planning

Choose MySQL Shell/InnoDB Cluster when:

  • You run MySQL 8.0+ exclusively
  • You want native MySQL Group Replication
  • You prefer programmatic management via Python/JS APIs
  • You need integrated MySQL Router for connection routing

For more database management guides, see our database monitoring comparison, database query profiling tools, and version-controlled databases guide.

Why Self-Host Replication Management?

Commercial MySQL management tools like Amazon RDS Multi-AZ or Google Cloud SQL handle failover automatically but lock you into specific cloud providers and charge premium pricing. Self-hosted replication management tools give you the same automation capabilities on any infrastructure — bare metal, VMs, or Kubernetes — without vendor lock-in.

For organizations running MySQL across multiple data centers or hybrid cloud environments, self-hosted tools provide a unified management layer that works everywhere. MySQL Orchestrator’s REST API integrates with existing runbook automation, while PMM’s Grafana dashboards fit into existing observability stacks.

Cost comparison: A managed MySQL service with automatic failover typically costs 2-3× the base database price. Self-hosted Orchestrator + PMM on a small monitoring server ($30-50/month) serves hundreds of MySQL instances at a fraction of the managed service cost.

Migration from manual to automated replication management typically takes one to two sprint cycles. Start by deploying Orchestrator in read-only discovery mode to map your existing topology, then gradually enable semi-automatic failover with operator approval before switching to full automation.

FAQ

Can Orchestrator perform automatic failover without human approval?

Yes. Automated recovery is enabled by whitelisting clusters in RecoverMasterClusterFilters (for example, ["*"] for all clusters); RecoveryPeriodBlockSeconds then throttles how often recoveries may repeat. However, many production deployments use semi-automatic mode, where Orchestrator detects failures and recommends actions but requires operator approval via the REST API or web UI before executing failover.

What is the difference between Orchestrator and MySQL InnoDB Cluster?

Orchestrator works with traditional MySQL replication (asynchronous or semi-synchronous) and manages topology through external coordination. InnoDB Cluster uses MySQL Group Replication (synchronous, Paxos-based consensus) with built-in failure detection. Orchestrator is more flexible with existing deployments; InnoDB Cluster requires MySQL 8.0+ and Group Replication.

How does PMM detect replication lag?

PMM reads the Seconds_Behind_Master value from SHOW SLAVE STATUS on each replica, combined with GTID position comparison against the primary. It also monitors relay log apply rate to predict when lag will resolve.
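The apply-rate prediction is simple arithmetic: a replica's backlog shrinks at the rate by which applying outpaces production. The helper below is a hypothetical illustration of that calculation, not a PMM API.

```python
def estimated_catchup_seconds(lag_seconds, apply_rate, produce_rate):
    """Rough time for a replica to catch up to its primary.

    lag_seconds:  current Seconds_Behind_Master
    apply_rate:   events/s the replica's SQL thread applies
    produce_rate: events/s the primary writes to its binlog
    Returns float('inf') if the replica can never catch up.
    """
    if apply_rate <= produce_rate:
        return float("inf")
    # The backlog is lag_seconds worth of primary traffic
    # (lag_seconds * produce_rate events) and shrinks at
    # (apply_rate - produce_rate) events per second.
    return lag_seconds * produce_rate / (apply_rate - produce_rate)

print(estimated_catchup_seconds(60, 1500, 1000))  # 120.0 seconds to catch up
```

The infinite case matters operationally: if the SQL thread cannot outpace the primary's write rate, lag only grows, and alerting should escalate rather than wait.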

Can I use Orchestrator with MariaDB?

Orchestrator has experimental MariaDB support, but it’s primarily designed for MySQL. MariaDB has its own MaxScale proxy with similar topology management capabilities. For MariaDB-specific deployments, consider MaxScale or Galera Cluster management tools.

How do I handle split-brain scenarios during failover?

Orchestrator uses GTID-based position comparison to ensure only the most up-to-date replica is promoted. It also implements a RecoveryPeriodBlockSeconds cooldown to prevent cascading failovers. For additional safety, enable semi-synchronous replication with rpl_semi_sync_master_wait_point=AFTER_SYNC to ensure at least one replica has acknowledged each transaction before the primary commits.

What monitoring metrics should I track for replication health?

Critical metrics include: replication lag (should be <1s for synchronous workloads), relay log size (growth indicates lag), SQL thread running status, IO thread running status, and GTID execution set comparison between primary and replicas. PMM provides all of these in pre-built Grafana dashboards.