etcd is the backbone of every Kubernetes cluster — it stores all cluster state, configuration data, and secrets. If your etcd data is lost, your entire Kubernetes cluster becomes unrecoverable. Yet many operators rely solely on etcd’s built-in snapshot capability without a comprehensive backup strategy.
This guide covers every aspect of etcd backup and disaster recovery: manual snapshots, automated backup tools, off-site storage, cluster restoration procedures, and production-tested best practices to ensure your cluster state is always recoverable.
Why etcd Backup Matters
etcd stores everything that makes your Kubernetes cluster function:
- Pod and service definitions — every deployment, service, ingress, and configmap
- Secrets and certificates — TLS certs, API tokens, database credentials
- RBAC policies — role bindings, service accounts, cluster roles
- CRD data — custom resource definitions and their instances
- Cluster state — node status, endpoint mappings, lease information
Without a backup, a disk failure or operator error on etcd means rebuilding your cluster from scratch and re-deploying every workload manually.
For Kubernetes disaster recovery strategies, see our Kanister vs K8up vs Stash guide which covers broader cluster-level backup solutions.
Quick Comparison: etcd Backup Tools
| Tool | Method | Schedule | Off-site | Restore | Min. etcd Version |
|---|---|---|---|---|---|
| etcdctl snapshot | CLI manual | Manual | Via rsync/cron | Full cluster | 3.0+ |
| etcd-backup-operator | Kubernetes Operator | Cron-based | S3/GCS | Full cluster | 3.2+ |
| Velero etcd plugin | Velero integration | Cron-based | S3/GCS/Azure | Full cluster | 3.3+ |
| kubeadm backup | kubeadm etcd | Manual | Via copy | Full cluster | 3.4+ |
| Automated cron | Systemd timer | Scheduled | S3/rsync | Full cluster | Any |
Method 1: Manual Snapshots with etcdctl
The simplest and most reliable method uses etcd's built-in snapshot capability.
Taking a Snapshot
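A minimal example, assuming a kubeadm-provisioned control plane where etcd's client certificates live under /etc/kubernetes/pki/etcd (adjust the endpoint and certificate paths for your deployment):

```bash
# etcd snapshots require the v3 API
export ETCDCTL_API=3

# Take a point-in-time snapshot of the running member
etcdctl snapshot save /var/backups/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```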
Verify the result with etcdctl snapshot status, which reports the snapshot's hash, revision, total key count, and size:
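For example (the file name follows the pattern used above):

```bash
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot-20250101-000000.db \
  --write-out=table
```

Illustrative output; the hash, revision, and sizes will differ per cluster:

```
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| fe01cf57 |    10035 |       1242 |      25 MB |
+----------+----------+------------+------------+
```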
Automated Snapshot via Cron
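One way to wire this up, assuming a small wrapper script at /usr/local/bin/etcd-snapshot.sh (a hypothetical path) invoked from a root cron entry:

```bash
# /etc/cron.d/etcd-backup
0 */6 * * * root /usr/local/bin/etcd-snapshot.sh
```

```bash
#!/usr/bin/env bash
# /usr/local/bin/etcd-snapshot.sh
set -euo pipefail

# Snapshot with a timestamped file name
ETCDCTL_API=3 etcdctl snapshot save "/var/backups/etcd-$(date +%Y%m%d-%H%M).db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Prune snapshots older than 7 days
find /var/backups -name 'etcd-*.db' -mtime +7 -delete
```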
This takes a snapshot every 6 hours and removes backups older than 7 days.
Restoring from a Snapshot
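A sketch of a single-node restore on a kubeadm control plane; the static-pod manifest workflow and paths are assumptions, so adjust for your topology:

```bash
# 1. Stop the API server and etcd (on kubeadm, moving the static pod
#    manifests out of the manifests directory stops the containers)
mv /etc/kubernetes/manifests/kube-apiserver.yaml \
   /etc/kubernetes/manifests/etcd.yaml /tmp/

# 2. Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# 3. Swap the restored data into place
mv /var/lib/etcd /var/lib/etcd.old
mv /var/lib/etcd-restored /var/lib/etcd

# 4. Bring etcd and the API server back up
mv /tmp/kube-apiserver.yaml /tmp/etcd.yaml /etc/kubernetes/manifests/
```

For a multi-member cluster, run snapshot restore once per member with matching --name, --initial-cluster, and --initial-advertise-peer-urls flags so the members bootstrap as a new cluster.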
Method 2: etcd-backup-operator
The etcd-backup-operator runs as a Kubernetes operator and manages automated backups with configurable schedules and S3/GCS storage.
Installation
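The operator comes from the coreos/etcd-operator project (now archived, so pin and review the version you deploy). A sketch of installing from the repository's example manifests, assuming cluster-admin access; verify the script and manifest paths against the tag you check out:

```bash
# Fetch the operator sources and set up its RBAC
git clone https://github.com/coreos/etcd-operator.git
cd etcd-operator
example/rbac/create_role.sh

# Deploy the backup operator
kubectl create -f example/etcd-backup-operator/deployment.yaml
```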
Backup Schedule Configuration
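A sketch of an EtcdBackup resource with a periodic policy, based on the operator's v1beta2 CRD; the endpoint, bucket, and Secret names are assumptions, and field names can differ between operator versions:

```yaml
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdBackup
metadata:
  name: etcd-periodic-backup
spec:
  etcdEndpoints:
    - https://etcd-cluster-client:2379
  clientTLSSecret: etcd-client-tls   # Secret with etcd client certs (assumed name)
  storageType: S3
  backupPolicy:
    backupIntervalInSecond: 21600    # every 6 hours
    maxBackups: 28                   # keeps roughly 7 days of history
  s3:
    path: my-etcd-backups/snapshots  # bucket/prefix (assumed name)
    awsSecret: aws-credentials       # Secret created in the next step
```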
S3 Credentials Secret
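The operator reads standard AWS credentials and config files from a Secret whose name matches spec.s3.awsSecret above:

```bash
kubectl create secret generic aws-credentials \
  --from-file=credentials=$HOME/.aws/credentials \
  --from-file=config=$HOME/.aws/config
```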
Method 3: Velero Integration
Velero is a popular Kubernetes backup tool that can include etcd snapshots as part of full cluster backups.
Install Velero with etcd Plugin
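Velero's install flow itself is standard; the etcd plugin image below is a placeholder (check the plugin's documentation for the published image name), and the bucket, region, and credentials file are assumptions:

```bash
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0,example.io/velero-plugin-etcd:v0.1.0 \
  --bucket my-velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1
```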
Create a Backup Schedule
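For example, a schedule matching the six-hour cadence recommended below, with 30 days of retention:

```bash
velero schedule create etcd-backup \
  --schedule="0 */6 * * *" \
  --ttl 720h
```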
Restore a Cluster
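A sketch; the backup name is illustrative, so list your backups first and restore from the one you want:

```bash
# Find the backup to restore from
velero backup get

# Restore it
velero restore create --from-backup etcd-backup-20250101000000

# Watch restore progress
velero restore get
```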
Method 4: Automated Cron with Systemd Timer
For non-Kubernetes etcd deployments or when you want OS-level backup scheduling:
Systemd Timer Unit
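A sketch, assuming the service unit in the next step is installed as etcd-backup.service:

```ini
# /etc/systemd/system/etcd-backup.timer
[Unit]
Description=Periodic etcd snapshot

[Timer]
# Fire at 00:00, 06:00, 12:00, and 18:00; run missed jobs after downtime
OnCalendar=*-*-* 00/6:00:00
Persistent=true

[Install]
WantedBy=timers.target
```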
Systemd Service Unit
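The oneshot service the timer triggers, pointing at the backup script below (the script path is an assumption):

```ini
# /etc/systemd/system/etcd-backup.service
[Unit]
Description=etcd snapshot backup
After=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/etcd-backup.sh
```

Enable the pair with systemctl daemon-reload && systemctl enable --now etcd-backup.timer.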
Backup Script
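A sketch of /usr/local/bin/etcd-backup.sh; the endpoint, certificate paths, and S3 bucket are assumptions, and the upload step requires the AWS CLI:

```bash
#!/usr/bin/env bash
set -euo pipefail

BACKUP_DIR=/var/backups/etcd
SNAPSHOT="${BACKUP_DIR}/etcd-$(date +%Y%m%d-%H%M%S).db"
mkdir -p "${BACKUP_DIR}"

# Take the snapshot
ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT}" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/server.crt \
  --key=/etc/etcd/pki/server.key

# Verify before trusting it; an unreadable file fails here and aborts the script
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT}" --write-out=table

# Ship off-site (bucket name is an assumption)
aws s3 cp "${SNAPSHOT}" "s3://my-etcd-backups/$(basename "${SNAPSHOT}")"

# Prune local copies older than 7 days
find "${BACKUP_DIR}" -name 'etcd-*.db' -mtime +7 -delete
```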
Multi-Node Cluster Backup Considerations
For etcd clusters with 3 or 5 nodes:
- Back up from one node only: Snapshots are consistent across the cluster, so you only need to back up one member
- Rotate the backup source: Periodically back up from different nodes to catch any replication lag
- Test restore to a new cluster: Periodically verify your backups by restoring to a fresh environment
- Monitor snapshot size: A sudden increase in snapshot size may indicate a runaway controller creating too many objects
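To compare members and spot a lagging one before choosing a backup source (endpoints and certificate paths are assumptions):

```bash
# Raft term/index and DB size per member; a member whose raft index
# trails the others is lagging
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/server.crt \
  --key=/etc/etcd/pki/server.key \
  --write-out=table
```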
Disaster Recovery Scenarios
Scenario 1: Single Node Failure
If one etcd node in a 3-node cluster fails:
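Quorum survives, so no snapshot restore is needed; replace the failed member instead. Member names, IDs, and URLs below are placeholders:

```bash
# On a healthy member: find and remove the failed member
ETCDCTL_API=3 etcdctl member list --write-out=table
ETCDCTL_API=3 etcdctl member remove <failed-member-id>

# Register the replacement node
ETCDCTL_API=3 etcdctl member add etcd-3 --peer-urls=https://10.0.0.3:2380

# On the replacement node: start etcd with an empty data directory and
# --initial-cluster-state=existing so it joins instead of bootstrapping
```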
Scenario 2: Complete Cluster Loss
When all etcd nodes are unrecoverable:
- Stop all Kubernetes control plane components
- Restore etcd from the latest snapshot on a new node (see Method 1 restore steps)
- Restart the API server, controller manager, and scheduler
- Verify cluster state: `kubectl get nodes` and `kubectl get pods --all-namespaces`
Scenario 3: Corrupted Data
If etcd data is corrupted but the service is still running:
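A sketch for a single-node, systemd-managed etcd; a multi-member cluster needs the full per-member restore flow from Method 1:

```bash
# Confirm the damage before acting
ETCDCTL_API=3 etcdctl alarm list
ETCDCTL_API=3 etcdctl endpoint health

# Stop etcd, set the corrupted data aside, and restore the latest
# verified snapshot
systemctl stop etcd
mv /var/lib/etcd /var/lib/etcd.corrupted
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-latest.db \
  --data-dir=/var/lib/etcd
systemctl start etcd
```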
Best Practices
- Backup at least every 6 hours: etcd changes rapidly; a 24-hour backup window risks significant data loss
- Store off-site: Never keep backups on the same server as etcd — use S3, GCS, or a separate NFS share
- Test restores quarterly: A backup you haven’t tested is not a backup
- Encrypt backups at rest: etcd contains secrets, so encrypt backup files with age or GPG before uploading (see the example after this list)
- Monitor backup success: Set up alerts for failed backup jobs — silent backup failures are worse than no backups
- Keep at least 30 days of history: This covers delayed discovery of issues and provides multiple restore points
- Document the restore procedure: In a disaster, your team needs clear, tested instructions — not wiki pages they hope are current
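For example, with age (the recipient key is a placeholder for your own public key):

```bash
# Encrypt a snapshot before uploading it
age -r age1ql3z7hjy54pw3hyww5ayyfg7zqgvc7w3j2elw8zmrj2kg5sfn9aqmcac8p \
  -o etcd-snapshot.db.age etcd-snapshot.db

# Decrypt during a restore drill
age -d -i key.txt -o etcd-snapshot.db etcd-snapshot.db.age
```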
For broader Kubernetes backup strategies that cover application data alongside etcd, see our Velero vs Stash vs Volsync guide.
FAQ
How often should I back up etcd?
At minimum every 6 hours. In high-change environments (CI/CD clusters with frequent deployments), every 1-2 hours is recommended. etcd snapshots are fast (typically 1-5 seconds) and have minimal performance impact.
Can I back up etcd while it is running?
Yes. etcd snapshots are consistent point-in-time captures that work on a live cluster. The etcdctl snapshot save command is safe to run during normal operations and does not require stopping etcd.
How large are etcd snapshots?
For a typical Kubernetes cluster with 100-500 pods, snapshots range from 10-50 MB. Very large clusters with thousands of resources may produce 100-500 MB snapshots. The size correlates with the number of keys in etcd, not the number of nodes.
What happens if I restore an old etcd snapshot?
Restoring an old snapshot rolls back all cluster state to that point in time. Any resources created, modified, or deleted after the snapshot will be lost. Always restore the most recent verified snapshot, and be prepared to re-apply any changes made since the backup.
Should I encrypt etcd backups?
Absolutely. etcd stores Kubernetes Secrets in plaintext (unless you use encryption at rest, which many clusters don’t). Backup files contain all your secrets, TLS certificates, and service account tokens. Always encrypt backups with a tool like age before storing them off-site.
How do I verify an etcd backup is valid?
Run `etcdctl snapshot status <backup-file> --write-out=table` to check the snapshot's hash, revision, total keys, and total size. A valid snapshot will display these values. You can also test by restoring to a temporary directory, starting a throwaway etcd instance against the restored data, and running `etcdctl get / --prefix --keys-only` to list all keys.