Every growing infrastructure eventually hits the same wall: services multiply, configurations scatter across machines, and keeping track of what’s running where becomes a nightmare. Hardcoded endpoints in config files don’t scale. That’s where distributed coordination and service discovery come in.
Three projects dominate this space: etcd, Consul, and Apache ZooKeeper. All three solve fundamentally the same problem — maintaining a consistent, shared state across multiple nodes — but they differ significantly in architecture, feature set, and operational complexity.
If you’re running a self-hosted infrastructure and need reliable service discovery, configuration management, or distributed locks, this guide will help you choose the right tool and deploy it today.
Why Self-Host Your Service Discovery Layer
Service discovery is the backbone of any modern infrastructure. Instead of hardcoding IP addresses and ports, services register themselves with a central directory and query it at runtime. But why self-host instead of using managed cloud services?
Complete data sovereignty. Service registries contain your entire infrastructure topology — every service name, every endpoint, every health check result. For regulated industries and privacy-conscious organizations, keeping this metadata on your own servers isn’t optional.
No vendor lock-in. Cloud-native service discovery tools tie you to a specific provider’s ecosystem. etcd, Consul, and ZooKeeper all run identically on bare metal, VMs, or any cloud. Move your entire stack between providers without rewriting a single config.
Cost at scale. Managed service discovery pricing scales with the number of registered services and API calls. A medium-sized infrastructure with hundreds of microservices can easily spend thousands per month on a managed solution. Self-hosted, the only cost is the hardware you already own.
Integration with existing tools. Kubernetes uses etcd natively. Traefik and Caddy integrate with Consul. Kafka depends on ZooKeeper (historically). Running the same coordination layer across your stack simplifies operations and reduces the learning curve.
Works offline and in air-gapped environments. Self-hosted service discovery doesn’t need an internet connection. For edge computing, industrial IoT, or military-grade air-gapped networks, it’s the only option.
etcd: The Kubernetes Native Choice
etcd is a distributed, consistent key-value store designed specifically for distributed systems. It’s best known as the backing store for Kubernetes, but it stands perfectly well on its own.
Architecture
etcd uses the Raft consensus algorithm for strong consistency guarantees. Every node in an etcd cluster maintains an identical copy of the data. Writes go through the leader node, which replicates the log entry to followers. A write is committed only when a quorum (majority) of nodes acknowledge it.
The Raft protocol gives etcd several important properties:
- Linearizable reads — every read returns the most recent committed value
- Partition tolerance — the cluster remains available as long as a majority of nodes can communicate
- Automatic leader election — if the leader fails, a new one is elected within seconds
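This quorum arithmetic also explains the standard advice to run odd-sized clusters: adding a fourth node raises the quorum size without raising fault tolerance. A quick sketch:

```shell
# Fault tolerance of a majority-quorum (Raft/ZAB) cluster of n nodes:
# quorum = floor(n/2) + 1, tolerated failures = n - quorum
for n in 1 2 3 4 5 6 7; do
  q=$(( n / 2 + 1 ))
  f=$(( n - q ))
  echo "$n nodes: quorum=$q, tolerates $f failure(s)"
done
# 3 nodes tolerate 1 failure; 4 still tolerate only 1; 5 tolerate 2.
```

This is why three and five are the common cluster sizes for all three tools.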
Unlike Consul and ZooKeeper, etcd deliberately limits its scope. It provides a key-value API, watch mechanism, and lease system. Everything else — service discovery, leader election, distributed locks — is built on top of these primitives using conventions.
Self-Hosted Deployment
The simplest way to run etcd is via Docker. Here’s a single-node setup for development:
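A minimal sketch using Docker Compose, assuming the `quay.io/coreos/etcd` image and the default client port 2379 (pin whatever version you actually deploy):

```yaml
# docker-compose.yml — single-node etcd for local development (not production)
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.17        # version tag is an assumption
    command:
      - /usr/local/bin/etcd
      - --name=etcd-dev
      - --data-dir=/etcd-data
      - --listen-client-urls=http://0.0.0.0:2379
      - --advertise-client-urls=http://localhost:2379
    ports:
      - "2379:2379"
    volumes:
      - etcd-data:/etcd-data

volumes:
  etcd-data:
```

Verify it’s up with `etcdctl --endpoints=http://localhost:2379 endpoint health`.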
For production, a three-node cluster is the minimum recommended configuration:
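A sketch of node 1 of a hypothetical three-node cluster, configured via environment variables (the 10.0.0.x IPs and the cluster token are placeholders):

```yaml
# docker-compose.yml on node 1 (10.0.0.1); repeat on nodes 2 and 3
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.17
    command: /usr/local/bin/etcd
    network_mode: host                  # peers talk directly over the host network
    environment:
      ETCD_NAME: etcd-1
      ETCD_DATA_DIR: /etcd-data
      ETCD_LISTEN_PEER_URLS: http://0.0.0.0:2380
      ETCD_LISTEN_CLIENT_URLS: http://0.0.0.0:2379
      ETCD_INITIAL_ADVERTISE_PEER_URLS: http://10.0.0.1:2380
      ETCD_ADVERTISE_CLIENT_URLS: http://10.0.0.1:2379
      ETCD_INITIAL_CLUSTER: etcd-1=http://10.0.0.1:2380,etcd-2=http://10.0.0.2:2380,etcd-3=http://10.0.0.3:2380
      ETCD_INITIAL_CLUSTER_STATE: new
      ETCD_INITIAL_CLUSTER_TOKEN: my-etcd-cluster
    volumes:
      - /var/lib/etcd:/etcd-data
```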
Adjust the IPs for nodes 2 and 3 accordingly. The key parameters are:
| Parameter | Purpose |
|---|---|
| `ETCD_NAME` | Unique node identifier in the cluster |
| `ETCD_INITIAL_CLUSTER` | Complete member list for bootstrapping |
| `ETCD_INITIAL_CLUSTER_STATE` | `new` for a fresh cluster, `existing` when adding nodes |
| `ETCD_INITIAL_CLUSTER_TOKEN` | Shared token that prevents accidental cluster merges |
Using etcd for Service Discovery
etcd doesn’t have a built-in “service registry” concept. Instead, you use its key-value store with conventions and the watch mechanism:
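A sketch of the pattern with `etcdctl` (v3 API); the key path, address, and TTL are illustrative:

```shell
# register: create a 15-second lease and attach the instance key to it
LEASE_ID=$(etcdctl lease grant 15 | awk '{print $2}')
etcdctl put /services/web-app/instance-1 '10.0.0.5:8080' --lease="$LEASE_ID"

# keep the lease alive while the service is healthy (runs in the background)
etcdctl lease keep-alive "$LEASE_ID" &

# discover: list every registered instance of web-app
etcdctl get /services/web-app/ --prefix

# watch: block and print PUT/DELETE events as instances come and go
etcdctl watch /services/web-app/ --prefix
```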
The pattern is simple: services write their own endpoint under a hierarchical key, attach it to a lease with a TTL, and periodically renew the lease. If a service crashes, the lease expires and the key is automatically deleted. Watching clients get notified immediately.
When etcd Shines
etcd excels when you need strong consistency, simple API semantics, and tight Kubernetes integration. Its watch mechanism is exceptionally reliable — clients get real-time notifications with guaranteed ordering and no missed events. The gRPC-based API is fast and supports multiplexed streams over a single connection.
The tradeoff is operational discipline: etcd requires careful sizing of its storage backend (it uses bbolt, an embedded B-tree), and it doesn’t tolerate high write volumes well. Individual requests are capped at 1.5 MB by default, and the storage quota defaults to 2 GB (configurable up to about 8 GB). It’s designed for coordination metadata, not for storing application data.
Consul: The Feature-Rich Contender
Consul, developed by HashiCorp, is a full-featured service mesh and service discovery platform. Where etcd provides primitives, Consul provides a complete solution out of the box.
Architecture
Consul uses the Raft consensus algorithm like etcd, but adds a gossip protocol (Serf) for intra-datacenter communication. This dual-protocol design gives Consul unique capabilities:
- Multi-datacenter awareness. Consul natively federates across data centers with WAN gossip between them.
- Built-in health checking. Consul actively probes registered services and automatically deregisters unhealthy instances.
- DNS interface. In addition to the HTTP API, Consul answers DNS queries — any application that can resolve a hostname can discover services without SDK changes.
- Service mesh (Connect). Consul provides mTLS, traffic management, and observability between services through its Connect feature.
- Key-value store. Like etcd, Consul includes a KV store for configuration management.
The gossip protocol means Consul can detect node failures faster than pure Raft-based systems. Serf’s gossip-based probing spreads lightweight failure-detection messages through the cluster, detecting failures in seconds rather than relying on Raft election timeouts.
Self-Hosted Deployment
Consul’s Docker image includes both server and agent modes. Here’s a production-ready single-datacenter deployment:
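A sketch of the first server of a planned three-server cluster (the image tag and the 10.0.0.1 bind address are assumptions):

```yaml
# docker-compose.yml on the first server
services:
  consul:
    image: hashicorp/consul:1.20              # tag is an assumption; pin your own
    network_mode: host
    command: >
      agent -server -ui -bootstrap-expect=3
      -bind=10.0.0.1 -client=0.0.0.0
      -datacenter=dc1 -data-dir=/consul/data
    volumes:
      - consul-data:/consul/data

volumes:
  consul-data:
```

The web UI becomes available on port 8500 once the cluster has elected a leader.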
For a multi-node server cluster, add more server containers and configure them to join via the retry-join parameter:
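Servers 2 and 3 reuse the same definition with their own bind address plus `retry-join` entries pointing at their peers (IPs are placeholders):

```yaml
# docker-compose.yml on the second server
services:
  consul:
    image: hashicorp/consul:1.20
    network_mode: host
    command: >
      agent -server -bootstrap-expect=3
      -bind=10.0.0.2 -client=0.0.0.0
      -retry-join=10.0.0.1 -retry-join=10.0.0.3
    volumes:
      - consul-data:/consul/data

volumes:
  consul-data:
```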
Using Consul for Service Discovery
Consul’s service discovery is more ergonomic than etcd’s. You define services declaratively and Consul handles the rest:
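A sketch of a service definition (the service name, port, and health endpoint are illustrative). Save it as, say, `web-app.json` and load it with `consul services register web-app.json`, or drop it in the agent’s configuration directory:

```json
{
  "service": {
    "name": "web-app",
    "port": 8080,
    "tags": ["v1"],
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "2s"
    }
  }
}
```

Consul probes the `check` endpoint on the declared interval and deregisters the instance if it keeps failing.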
Discover services via multiple interfaces:
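For example (assuming a local agent with the default ports 8600 for DNS and 8500 for HTTP):

```shell
# DNS: SRV records carry the address and port of healthy instances
dig @127.0.0.1 -p 8600 web-app.service.consul SRV

# HTTP API: only instances passing their health checks
curl 'http://127.0.0.1:8500/v1/health/service/web-app?passing'

# CLI: list everything in the catalog
consul catalog services
```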
The DNS interface is Consul’s killer feature for self-hosted infrastructure. Legacy applications, database drivers, and third-party services that can’t integrate with an HTTP API still benefit from service discovery through standard DNS resolution.
When Consul Shines
Consul is the right choice when you need a complete service discovery and mesh platform with minimal custom development. Its built-in health checks, DNS interface, multi-datacenter support, and service mesh capabilities cover requirements that would take months to build on top of etcd.
The tradeoff is complexity. Consul has more moving parts (Raft + Serf + health check scheduler + DNS server + Connect proxies), which means more configuration surface area and more things to monitor. The resource footprint is also larger — a Consul server node typically uses 2-3x more memory than an etcd node under similar load.
Apache ZooKeeper: The Battle-Tested Veteran
ZooKeeper is the oldest of the three, originally developed at Yahoo! and now an Apache project. It pioneered the distributed coordination pattern and remains the backbone of major distributed systems including Apache Kafka, Apache HBase, and Apache Solr.
Architecture
ZooKeeper uses a custom consensus protocol called ZAB (ZooKeeper Atomic Broadcast), which predates Raft but shares similar principles. It maintains a hierarchical namespace of znodes (like a filesystem) that all servers in the ensemble replicate.
Key architectural features:
- Hierarchical data model. Znodes form a tree structure (`/services/web-app/instance-1`) with support for ephemeral nodes (deleted when the client disconnects) and sequential nodes (auto-incrementing names).
- Watchers. Clients can set one-time watches on znodes to receive notifications when data changes.
- High write throughput. ZAB is optimized for write-heavy workloads and can sustain higher throughput than Raft-based systems under certain conditions.
- Java ecosystem. Being a Java project, ZooKeeper integrates naturally with the broader Apache ecosystem.
ZooKeeper’s znode types give it unique expressiveness:
| Type | Behavior | Use Case |
|---|---|---|
| Persistent | Survives client disconnect | Configuration data |
| Ephemeral | Deleted on client disconnect | Service registration |
| Persistent Sequential | Auto-appends sequence number | Distributed queues |
| Ephemeral Sequential | Both ephemeral and sequential | Leader election |
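The classic leader-election recipe, for instance, combines the last row’s node type with a sort. A sketch in the bundled `zkCli.sh` shell (paths and data are illustrative):

```shell
create /election ""
# each candidate creates an ephemeral sequential znode;
# the server appends the sequence number, e.g. /election/candidate-0000000000
create -e -s /election/candidate- "node-a"

# the candidate whose znode has the lowest sequence number is the leader;
# every other candidate watches the znode immediately preceding its own
ls /election
```

Because the znodes are ephemeral, a crashed leader’s znode disappears and the next candidate in sequence takes over automatically.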
Self-Hosted Deployment
ZooKeeper’s Docker deployment requires more configuration than etcd or Consul, but the result is equally robust:
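A sketch using the official `zookeeper` image and its `ZOO_MY_ID`/`ZOO_SERVERS` environment variables (the `zk-1`…`zk-3` hostnames are placeholders):

```yaml
# docker-compose.yml on node 1; nodes 2 and 3 differ only in ZOO_MY_ID
services:
  zookeeper:
    image: zookeeper:3.9                      # tag is an assumption; pin your own
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: "server.1=zk-1:2888:3888;2181 server.2=zk-2:2888:3888;2181 server.3=zk-3:2888:3888;2181"
    ports:
      - "2181:2181"       # client port
      - "2888:2888"       # follower connections
      - "3888:3888"       # leader election
    volumes:
      - zk-data:/data
      - zk-datalog:/datalog

volumes:
  zk-data:
  zk-datalog:
```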
For bare metal deployment without Docker, the configuration uses zoo.cfg:
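A sketch of a three-node ensemble configuration (paths and IPs are placeholders). Each server additionally needs a `myid` file in `dataDir` containing its own server number:

```properties
# /etc/zookeeper/zoo.cfg
tickTime=2000        # base time unit in milliseconds
initLimit=10         # ticks a follower may take to sync with the leader
syncLimit=5          # ticks a follower may lag before being dropped
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=24   # hours between snapshot cleanups
server.1=10.0.0.1:2888:3888
server.2=10.0.0.2:2888:3888
server.3=10.0.0.3:2888:3888
```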
Using ZooKeeper for Service Discovery
The ZooKeeper client API is more verbose than etcd or Consul, but the hierarchical model maps naturally to service discovery patterns:
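A sketch of the pattern in the bundled `zkCli.sh` shell (paths and payloads are illustrative); in application code, the raw client or Apache Curator’s service-discovery recipe wraps the same calls:

```shell
# register: parent znodes are persistent, the instance znode is ephemeral
create /services ""
create /services/web-app ""
create -e /services/web-app/instance-1 "10.0.0.5:8080"

# discover: list the live instances
ls /services/web-app

# watch (one-time): -w fires a single notification, then must be re-set
ls -w /services/web-app
```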
When ZooKeeper Shines
ZooKeeper is the best choice when you’re already running Apache ecosystem tools that depend on it (Kafka, HBase, Solr, Dubbo), or when you need high write throughput with a hierarchical data model. Its battle-tested track record — running in production at some of the largest tech companies for over 15 years — speaks to its reliability.
The main drawbacks are operational: ZooKeeper requires Java, has a steeper learning curve, and its one-time watch semantics (you must re-register a watch after each notification) add complexity compared to etcd’s continuous watches. The community has also slowed in recent years as newer alternatives have gained traction.
Head-to-Head Comparison
| Feature | etcd | Consul | ZooKeeper |
|---|---|---|---|
| Consensus Protocol | Raft | Raft + Serf gossip | ZAB |
| API | gRPC + HTTP/JSON | HTTP/JSON + DNS + gRPC | Custom binary protocol |
| Data Model | Flat key-value | Key-value + service catalog | Hierarchical znodes |
| Health Checks | No (build on leases) | Built-in (HTTP, TCP, script, TTL) | No (rely on ephemeral nodes) |
| Service Discovery | Convention-based | First-class feature | Convention-based |
| Multi-Datacenter | No | Native | No |
| Service Mesh | No (use Istio separately) | Built-in (Connect) | No |
| Watch Mechanism | Continuous | Continuous | One-time (must re-register) |
| KV Store Limits | 2 GB default quota (~8 GB max); 1.5 MB per request | 512 KB per value | ~1 MB max znode size |
| Write Throughput | ~10K ops/sec | ~5K ops/sec | ~30K ops/sec |
| Resource Usage | Low (~50 MB RAM per node) | Medium (~150 MB RAM per node) | High (~200 MB RAM per node + JVM) |
| Language | Go | Go | Java |
| Binary Size | ~80 MB | ~120 MB | ~35 MB + JVM (~100 MB) |
| UI/Dashboard | etcd-manager (third-party) | Built-in web UI | No (use third-party tools) |
| Kubernetes Integration | Native (backing store) | Via consul-k8s | Not recommended |
| TLS/mTLS | Native | Native | Via external proxy |
| ACL System | RBAC (v3.5+) | Built-in token-based | SASL/Digest |
| License | Apache 2.0 | BUSL 1.1 (source-available) | Apache 2.0 |
Important Licensing Note
As of 2026, Consul uses the Business Source License (BUSL 1.1) for releases after September 2023. This is a source-available license — not open source in the OSI definition. For most self-hosted internal use cases, BUSL is permissive enough. However, if you need to offer Consul as a managed service to customers, or if your organization has strict open-source-only policies, etcd or ZooKeeper (both Apache 2.0) are the safer choices.
Decision Framework
Choose etcd if:
- You’re running Kubernetes and want a unified coordination layer
- You need strong consistency with a minimal, well-defined API
- Your data fits comfortably within etcd’s storage quota (coordination and configuration data, not application data)
- You prefer Apache 2.0 licensing
- Your team knows Go and values simplicity
Choose Consul if:
- You need a complete service discovery platform with health checks and DNS
- You operate across multiple data centers
- You want built-in service mesh capabilities (mTLS, traffic splitting)
- You have legacy applications that benefit from DNS-based discovery
- You value the built-in web UI for operational visibility
Choose ZooKeeper if:
- You’re already running Kafka, HBase, or Solr
- You need high write throughput with hierarchical data organization
- You’re invested in the Java/Apache ecosystem and use Curator
- You need the proven reliability of a 15+ year old system
- Apache 2.0 licensing is a hard requirement
Monitoring Your Cluster
Whichever tool you choose, monitoring is essential. Here are the key metrics to track:
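All three expose Prometheus-format metrics. A sketch of the scrape jobs, with key metrics noted in comments (targets are placeholders; ZooKeeper needs its built-in Prometheus metrics provider enabled, available since 3.6):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: etcd                    # watch: etcd_server_has_leader,
    static_configs:                   # etcd_server_leader_changes_seen_total,
      - targets: ["10.0.0.1:2379"]    # etcd_disk_wal_fsync_duration_seconds
  - job_name: consul                  # watch: consul_raft_leader_lastcontact,
    metrics_path: /v1/agent/metrics   # consul_autopilot_healthy
    params:
      format: ["prometheus"]
    static_configs:
      - targets: ["10.0.0.1:8500"]
  - job_name: zookeeper               # watch: znode_count, avg_latency,
    static_configs:                   # outstanding_requests, quorum_size
      - targets: ["10.0.0.1:7000"]
```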
For all three, deploy a Grafana dashboard with pre-built panels. The Prometheus community maintains dashboards for etcd and Consul, and ZooKeeper exposes JVM-level metrics via prometheus/jmx_exporter (or its built-in Prometheus metrics provider from version 3.6 onward).
Final Thoughts
The self-hosted service discovery landscape has matured significantly. In 2026, you have three production-grade options, each with distinct strengths:
etcd wins on simplicity and Kubernetes integration. Its minimal API and small footprint make it easy to operate.
Consul wins on feature completeness. Its service mesh, health checks, and DNS interface provide a turnkey solution that reduces development overhead.
ZooKeeper wins on proven reliability and throughput. If your stack already includes Apache ecosystem tools, it’s often the pragmatic choice.
For most new self-hosted projects, etcd is the recommended starting point. It’s simple, well-documented, and the conventions for service discovery on top of its KV store are well-established. If you need the additional features — especially health checks and multi-datacenter support — Consul is worth the added complexity.
Whichever you choose, the key is consistency: run the same coordination layer across your entire infrastructure, automate the deployment with infrastructure-as-code, and invest in monitoring before issues arise.
Frequently Asked Questions (FAQ)
Which one should I choose in 2026?
The best choice depends on your specific requirements:
- For Kubernetes-centric stacks: etcd, since it already backs the control plane
- For a turnkey platform: Consul, whose health checks, DNS interface, and web UI cover the most ground out of the box
- For Apache ecosystem deployments: ZooKeeper, if Kafka, HBase, or Solr already depend on it
- For strict open-source policies: etcd or ZooKeeper, both licensed under Apache 2.0
Refer to the comparison table above for detailed feature breakdowns.
Can I migrate between these tools?
There’s no automated conversion path between the three, but all expose their data through APIs, so migration is scriptable. Always:
- Back up your current data first (e.g. `etcdctl snapshot save` or `consul kv export`)
- Run the new cluster alongside the old one and re-register services against it
- Test the cutover on a staging environment before touching production
Are there free versions available?
etcd and ZooKeeper are free and open source (Apache 2.0). Consul’s Community Edition is free to self-host but is source-available (BUSL 1.1) rather than OSI open source; HashiCorp also sells Consul Enterprise with additional features, support, and managed hosting.
How do I get started?
- Review the comparison table to identify your requirements
- Visit the official documentation for the tool you choose
- Start with a Docker Compose setup for easy testing
- Join the community forums for troubleshooting