When your application depends on asynchronous messaging, a single-node message broker is a ticking time bomb. One server failure and your entire event pipeline grinds to a halt. That’s why production deployments demand high availability (HA) configurations that survive node failures without dropping messages.

This guide compares three battle-tested self-hosted message brokers — RabbitMQ, Apache Kafka, and NATS — focusing specifically on their high availability architectures. We’ll look at how each handles leader election, data replication, failover, and partition tolerance so you can make an informed choice for your infrastructure.

Message Broker HA Architectures Compared

Each broker takes a fundamentally different approach to achieving high availability:

| Feature | RabbitMQ | Apache Kafka | NATS |
| --- | --- | --- | --- |
| HA Mechanism | Quorum Queues (Raft) | ISR-based Replication | Raft Clustering + JetStream |
| Consistency Model | Strong (Raft) | Strong (ISR) | Strong (Raft for JetStream) |
| Leader Election | Built-in Raft leader | Controller-based | Built-in Raft leader |
| Minimum Nodes for HA | 3 | 3 | 3 (5 recommended) |
| Data Replication | Queue-level | Topic/Partition-level | Stream-level |
| Automatic Failover | Yes (sub-second) | Yes (controller election) | Yes (sub-second) |
| Split-Brain Protection | Raft majority quorum | ISR + ZooKeeper/KRaft | Raft majority quorum |
| Disk Persistence | Optional (durable queues) | Mandatory (log segments) | Optional (JetStream) |
| Primary Language | Erlang | Java/Scala | Go |
| GitHub Stars | 13,680+ | 32,687+ | 19,937+ |
| Memory Footprint (idle) | ~80 MB | ~1 GB | ~20 MB |

RabbitMQ: Quorum Queues and Classic Mirrored Queues

RabbitMQ has offered two HA strategies. The legacy one, classic mirrored queues, was deprecated and then removed in RabbitMQ 4.0. The modern approach uses quorum queues, a Raft-based replicated queue type introduced in RabbitMQ 3.8. A quorum queue commits each message on a majority of its replicas and uses the Raft consensus protocol for leader election and log replication.

# docker-compose.yml — RabbitMQ 3-node cluster
version: "3.8"
services:
  rabbitmq1:
    image: rabbitmq:4.0-management
    hostname: rabbitmq1
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-secret-cookie"
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: securepassword
    ports:
      - "5672:5672"
      - "15672:15672"
    volumes:
      - rabbitmq1_data:/var/lib/rabbitmq

  rabbitmq2:
    image: rabbitmq:4.0-management
    hostname: rabbitmq2
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-secret-cookie"
    volumes:
      - rabbitmq2_data:/var/lib/rabbitmq

  rabbitmq3:
    image: rabbitmq:4.0-management
    hostname: rabbitmq3
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-secret-cookie"
    volumes:
      - rabbitmq3_data:/var/lib/rabbitmq

volumes:
  rabbitmq1_data:
  rabbitmq2_data:
  rabbitmq3_data:

After starting the containers, join nodes into a cluster:

# On rabbitmq2 and rabbitmq3
docker exec rabbitmq2 rabbitmqctl stop_app
docker exec rabbitmq2 rabbitmqctl join_cluster rabbit@rabbitmq1
docker exec rabbitmq2 rabbitmqctl start_app

docker exec rabbitmq3 rabbitmqctl stop_app
docker exec rabbitmq3 rabbitmqctl join_cluster rabbit@rabbitmq1
docker exec rabbitmq3 rabbitmqctl start_app

# Verify that all three nodes have joined the cluster
docker exec rabbitmq1 rabbitmqctl cluster_status

Queue type cannot be set by policy, and the ha-mode policy keys only ever applied to classic mirrored queues, which RabbitMQ 4.0 no longer has. Declare the queue as a quorum queue instead and set its initial replica count in the declaration arguments:

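# Python (pika): declare a quorum queue
import pika

# Host and credentials match the compose file above; adjust for your setup.
credentials = pika.PlainCredentials('admin', 'securepassword')
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost', credentials=credentials)
)
channel = connection.channel()
channel.queue_declare(
    queue='orders.quorum',
    durable=True,  # quorum queues must be durable
    arguments={
        'x-queue-type': 'quorum',
        'x-quorum-initial-group-size': 3,  # initial number of replicas
    },
)
connection.close()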

Apache Kafka: ISR Replication with KRaft

Kafka achieves HA through in-sync replicas (ISR) — each topic partition has one leader and multiple follower replicas. The Kafka controller (now running via KRaft instead of ZooKeeper) monitors broker health and triggers leader re-election when a broker fails.
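
The failover step can be pictured with a small model. This is a simplified sketch, not Kafka's controller code: the Partition class and handle_broker_failure function are hypothetical names, and the rule "promote the first surviving in-sync replica in assignment order" is an assumption that approximates the behavior when unclean leader election is disabled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    """Toy model of one topic partition (hypothetical, for illustration)."""
    replicas: List[int]            # broker IDs in assignment order
    isr: List[int]                 # brokers currently in sync with the leader
    leader: Optional[int] = None   # None means the partition is offline


def handle_broker_failure(partition: Partition, failed_broker: int) -> Partition:
    """Drop a failed broker from the ISR and elect a new leader if needed."""
    partition.isr = [b for b in partition.isr if b != failed_broker]
    if partition.leader == failed_broker:
        # Only in-sync replicas are eligible, so acknowledged writes survive.
        candidates = [b for b in partition.replicas if b in partition.isr]
        partition.leader = candidates[0] if candidates else None
    return partition


p = Partition(replicas=[1, 2, 3], isr=[1, 2, 3], leader=1)
handle_broker_failure(p, failed_broker=1)
print(p.leader, p.isr)  # 2 [2, 3]
```

If every in-sync replica is lost, the model leaves the partition without a leader rather than promoting a stale replica, which is the trade-off of favoring consistency over availability.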

# docker-compose.yml — Kafka 3-node cluster with KRaft
version: "3.8"
services:
  kafka1:
    image: apache/kafka:3.9.0
    hostname: kafka1
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: "controller,broker"
      KAFKA_LISTENERS: "PLAINTEXT://kafka1:9092,CONTROLLER://kafka1:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka1:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
    volumes:
      - kafka1_data:/var/lib/kafka/data

  kafka2:
    image: apache/kafka:3.9.0
    hostname: kafka2
    environment:
      KAFKA_NODE_ID: 2
      KAFKA_PROCESS_ROLES: "controller,broker"
      KAFKA_LISTENERS: "PLAINTEXT://kafka2:9092,CONTROLLER://kafka2:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka2:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
    volumes:
      - kafka2_data:/var/lib/kafka/data

  kafka3:
    image: apache/kafka:3.9.0
    hostname: kafka3
    environment:
      KAFKA_NODE_ID: 3
      KAFKA_PROCESS_ROLES: "controller,broker"
      KAFKA_LISTENERS: "PLAINTEXT://kafka3:9092,CONTROLLER://kafka3:9093"
      KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://kafka3:9092"
      KAFKA_CONTROLLER_LISTENER_NAMES: "CONTROLLER"
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT"
      KAFKA_CONTROLLER_QUORUM_VOTERS: "1@kafka1:9093,2@kafka2:9093,3@kafka3:9093"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
      KAFKA_DEFAULT_REPLICATION_FACTOR: 3
      KAFKA_MIN_INSYNC_REPLICAS: 2
    volumes:
      - kafka3_data:/var/lib/kafka/data

volumes:
  kafka1_data:
  kafka2_data:
  kafka3_data:

Create a topic with replication factor 3 and min ISR of 2:

# Create topic with HA configuration
docker exec kafka1 /opt/kafka/bin/kafka-topics.sh \
  --create --topic orders-topic \
  --bootstrap-server kafka1:9092 \
  --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2

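Kafka’s ISR model ties durability to producer settings. With acks=all, the leader acknowledges a write only after every replica currently in the ISR has it, and min.insync.replicas sets the smallest ISR size at which such writes are still accepted; below that, the broker rejects them. The cost of this durability is some added latency.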
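
To make the interaction between acks and min.insync.replicas concrete, here is a minimal sketch of the decision a partition leader makes for a produce request. The function name write_outcome is hypothetical and the model ignores timeouts and retries; it only encodes the rules described above.

```python
def write_outcome(isr_size: int, min_insync_replicas: int, acks: str) -> str:
    """Describe what happens to a produce request in this simplified model."""
    if acks == "all":
        if isr_size < min_insync_replicas:
            # Kafka refuses the write (NotEnoughReplicas) instead of
            # accepting a record it cannot replicate widely enough.
            return "rejected"
        return f"acknowledged after {isr_size} in-sync replicas have the record"
    if acks == "1":
        return "acknowledged after the leader alone has the record"
    if acks == "0":
        return "not acknowledged (fire and forget)"
    raise ValueError(f"unsupported acks value: {acks!r}")


# Replication factor 3 with min.insync.replicas=2, as in the topic above.
for isr in (3, 2, 1):
    print(isr, write_outcome(isr, min_insync_replicas=2, acks="all"))
```

With these settings the topic keeps accepting writes after one broker fails, and turns read-only for acks=all producers once a second broker drops out of the ISR.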

NATS: Raft Clustering with JetStream

NATS achieves HA through JetStream — its built-in persistence layer that uses the Raft consensus protocol for stream replication. The core NATS server is stateless and provides at-most-once delivery. JetStream adds at-least-once and exactly-once semantics with durable storage.

# docker-compose.yml — NATS 3-node cluster with JetStream
version: "3.8"
services:
  nats1:
    image: nats:2.10-alpine
    command:
      - "--name=nats1"
      - "--cluster_name=nats-cluster"
      - "--cluster=nats://0.0.0.0:6222"
      - "--routes=nats://nats2:6222,nats://nats3:6222"
      - "--jetstream"
      - "--store_dir=/data/jetstream"
      - "--cluster_advertise=nats1:6222"
      - "--server_name=nats1"
    ports:
      - "4222:4222"
    volumes:
      - nats1_data:/data

  nats2:
    image: nats:2.10-alpine
    command:
      - "--name=nats2"
      - "--cluster_name=nats-cluster"
      - "--cluster=nats://0.0.0.0:6222"
      - "--routes=nats://nats1:6222,nats://nats3:6222"
      - "--jetstream"
      - "--store_dir=/data/jetstream"
      - "--cluster_advertise=nats2:6222"
      - "--server_name=nats2"
    volumes:
      - nats2_data:/data

  nats3:
    image: nats:2.10-alpine
    command:
      - "--name=nats3"
      - "--cluster_name=nats-cluster"
      - "--cluster=nats://0.0.0.0:6222"
      - "--routes=nats://nats1:6222,nats://nats2:6222"
      - "--jetstream"
      - "--store_dir=/data/jetstream"
      - "--cluster_advertise=nats3:6222"
      - "--server_name=nats3"
    volumes:
      - nats3_data:/data

volumes:
  nats1_data:
  nats2_data:
  nats3_data:

Create a replicated JetStream stream:

# Create stream with R3 replication
nats stream create ORDERS \
  --subjects "orders.>" \
  --storage file \
  --replicas 3 \
  --retention limits \
  --max-msgs 1000000 \
  --max-age 7d

NATS creates a Raft group for each stream. With 3 replicas, a stream tolerates the loss of 1 node. On a cluster of 5 or more nodes, you can use R5 replication and survive 2 node failures.
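
The failure-tolerance numbers follow directly from majority-quorum arithmetic, which applies to any Raft group (RabbitMQ quorum queues and the KRaft controller quorum included). A short sketch, with function names chosen for illustration:

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return replicas - quorum_size(replicas)


for r in (1, 3, 4, 5):
    print(f"R{r}: quorum={quorum_size(r)}, tolerates {tolerated_failures(r)} failure(s)")
```

An even replica count buys nothing extra: four replicas need a quorum of three and still tolerate only one failure, which is why odd sizes are the norm.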

Choosing the Right Broker for Your HA Requirements

When to Choose RabbitMQ

RabbitMQ excels when you need flexible routing patterns (topic exchanges, headers exchanges, dead-letter exchanges) and per-message acknowledgments. Quorum queues provide strong consistency with sub-second failover. RabbitMQ’s Erlang/OTP foundation gives it battle-tested distributed systems primitives built into the runtime.

When to Choose Apache Kafka

Kafka is the choice for high-throughput event streaming and log-based architectures. Its ISR replication model allows tuning consistency vs. availability on a per-topic basis. The KRaft consensus mode (replacing ZooKeeper) simplifies operations while maintaining the same strong guarantees. Kafka shines when you need to replay events, maintain strict ordering within partitions, or integrate with stream processing frameworks.

When to Choose NATS

NATS is the lightweight champion — a 20 MB binary that starts in milliseconds and handles millions of messages per second. JetStream adds persistence and HA with minimal operational overhead. NATS is ideal for edge computing, IoT deployments, and microservice architectures where you want a message bus that “just works” without tuning dozens of parameters.

Why Self-Host Your Message Broker HA?

Running your own highly available message broker cluster gives you complete control over your messaging infrastructure. Cloud-managed alternatives like Amazon MQ, Confluent Cloud, or Google Pub/Sub charge per-message or per-partition fees that scale unpredictably with traffic. A self-hosted 3-node cluster on modest VPS instances ($20-40/month each) can handle millions of messages per day at a fixed cost.

Data sovereignty is another critical factor. Sensitive event data stays within your network perimeter rather than flowing through a third-party cloud provider. For regulated industries handling PII, financial transactions, or healthcare data, this alone justifies the operational overhead.

Finally, self-hosting eliminates vendor lock-in. You’re free to migrate between brokers, adjust replication strategies, or integrate with any tool in your stack without being constrained by a cloud provider’s supported integrations. For teams already managing their own infrastructure, adding a message broker HA cluster is a natural extension of existing operational practices.

For a deeper understanding of messaging patterns, see our brokerless messaging comparison. If you’re interested in event-driven architectures, check out our event gateway comparison and our Kafka operations guide.

FAQ

How many nodes do I need for a production message broker cluster?

Three nodes is the minimum for all three brokers. A majority quorum is then 2 out of 3 (for RabbitMQ quorum queues, the Kafka KRaft controller quorum, and JetStream Raft groups alike), so the cluster survives one node failure. For NATS, a 5-node cluster is recommended for production JetStream deployments because it gives you R5 replication with tolerance for 2 simultaneous failures.

What happens to in-flight messages during a leader failover?

RabbitMQ quorum queues use Raft: a publish is confirmed only after a majority of replicas have written it, so confirmed messages survive a leader change, while publishes that were never confirmed may be lost and should be retried by the publisher. Kafka with acks=all acknowledges a message only after every broker currently in the ISR has it, so acknowledged messages survive as long as one in-sync replica remains and unclean leader election stays disabled. NATS JetStream similarly uses Raft and guarantees that acknowledged messages are committed to a majority of replicas.
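
On the client side, a publish that fails or times out during a failover should be treated as unconfirmed and sent again. Below is a minimal, broker-agnostic sketch of that pattern. The send callable is a hypothetical wrapper around your client library's confirmed publish, and the retry counts and delays are arbitrary assumptions.

```python
import time
from typing import Callable


def publish_with_retry(
    send: Callable[[bytes], bool],
    payload: bytes,
    attempts: int = 5,
    base_delay: float = 0.2,
) -> bool:
    """Call `send` until the broker confirms or the attempts run out.

    `send` is a hypothetical callable: it returns True once the broker has
    confirmed the message and may raise ConnectionError or TimeoutError
    while a new leader is being elected.
    """
    for attempt in range(attempts):
        try:
            if send(payload):
                return True
        except (ConnectionError, TimeoutError):
            pass  # unconfirmed: fall through and retry
        if attempt < attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return False


# Simulated broker that fails twice during a leader election, then confirms.
outcomes = iter([ConnectionError("no leader"), ConnectionError("no leader"), True])


def fake_send(payload: bytes) -> bool:
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


print(publish_with_retry(fake_send, b'{"order_id": 42}', base_delay=0.01))  # True
```

Retrying can deliver the same message twice when the first attempt was written but its confirmation was lost, so consumers should be idempotent or deduplicate on a message ID.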

Can I mix ARM and x86 nodes in the same cluster?

RabbitMQ supports mixed architectures in the same cluster as long as the Erlang versions match. Kafka officially supports mixed-architecture clusters from version 3.x onward with KRaft. NATS supports mixed architectures natively — the Go binary compiles identically for both ARM64 and AMD64. Always test thoroughly before deploying mixed-architecture clusters to production.

How do I monitor cluster health?

All three brokers can feed Prometheus. RabbitMQ provides a /metrics endpoint through the rabbitmq_prometheus plugin. Kafka exposes metrics via JMX, typically scraped through a Prometheus JMX exporter sidecar. NATS has built-in /varz, /connz, /routez, and /jsz HTTP monitoring endpoints once the monitoring port is enabled, and the separate prometheus-nats-exporter translates them into Prometheus metrics. Set up alerting for partition counts, under-replicated partitions (Kafka), and Raft leader changes.

Should I use a load balancer in front of my broker cluster?

RabbitMQ clients can connect to any node in the cluster. The Erlang distribution protocol routes requests to the queue leader automatically, so a TCP load balancer (HAProxy, Traefik) is sufficient. Kafka clients need to connect directly to the partition leader broker, so a load balancer only works for the bootstrap connection. After that, the client discovers broker addresses from cluster metadata, which means every broker's advertised listener must be reachable from the client. NATS clients connect to any server and the cluster routes messages internally, so simple round-robin DNS or a TCP load balancer works well.
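
Where a load balancer is not available, a client can also carry a list of node addresses and use the first one that accepts a TCP connection. The helper below is a minimal sketch of that idea using only the standard library. The function name first_reachable is hypothetical, and a successful TCP connect only shows that the port is open, not that the broker is healthy.

```python
import socket
from typing import Iterable, Optional, Tuple

Node = Tuple[str, int]


def first_reachable(nodes: Iterable[Node], timeout: float = 1.0) -> Optional[Node]:
    """Return the first (host, port) that accepts a TCP connection, else None."""
    for host, port in nodes:
        try:
            # The connection is closed again as soon as the probe succeeds.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            continue  # refused, unresolvable, or timed out: try the next node
    return None
```

For the RabbitMQ cluster above you might call first_reachable([("rabbitmq1", 5672), ("rabbitmq2", 5672), ("rabbitmq3", 5672)]) and pass the result to your client. Most client libraries for these brokers accept a list of servers directly, which is usually the better option when available.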

