Container checkpoint and restore is the ability to save the complete runtime state of a running container — memory, file descriptors, network connections, process state — to disk, then restore it later on the same or a different host. This capability enables live migration, zero-downtime updates, and stateful container recovery that traditional container restarts cannot achieve.
The technology behind this is CRIU (Checkpoint/Restore in Userspace), a Linux userspace tool that relies on kernel interfaces such as ptrace and /proc. It has matured significantly and is now integrated into most major container runtimes.
What Is Container Checkpoint & Restore?
When you checkpoint a container, the system captures:
- Process memory — all RAM pages for running processes
- Open file descriptors — sockets, pipes, file handles
- Network state — TCP connections, listening ports
- Process tree — parent-child relationships, PID mapping
- Mount namespaces — filesystem mount points and state
When restored, the container resumes exactly where it left off — including established TCP connections, in-memory caches, and running computations. This is fundamentally different from docker stop && docker start, which terminates the process and loses all runtime state.
CRIU — The Foundation
CRIU is the core Linux technology that enables checkpoint/restore in userspace. With 3,800+ GitHub stars, it underpins the checkpoint/restore features of Podman, Docker, CRI-O, and other Linux container runtimes.
Key Features:
- Checkpoint and restore of entire process trees
- Network connection migration (TCP state preservation)
- File descriptor migration (open files, pipes, sockets)
- Namespace support (PID, mount, network, IPC)
- Lazy pages support (page in memory on demand during restore)
- User namespace support for rootless containers
Installation:
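A minimal sketch of installing CRIU from distribution packages. It assumes the package is named `criu` on the distributions listed; availability varies by release, so check your distribution's repositories. The helper name `criu_install_command` is hypothetical.

```shell
# Print the CRIU install command for a distribution ID (the ID field in
# /etc/os-release). Assumption: the package is named "criu" on these distros.
criu_install_command() {
    case "$1" in
        debian|ubuntu)      echo "apt-get install -y criu" ;;
        fedora|rhel|centos) echo "dnf install -y criu" ;;
        *) echo "no known criu package for: $1" >&2; return 1 ;;
    esac
}

# Usage, as root:
#   . /etc/os-release && sh -c "$(criu_install_command "$ID")"
#   criu check    # asks CRIU whether the running kernel has what it needs
```

Printing the command instead of running it keeps the helper safe to source and easy to review before anything is installed.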
Basic Usage:
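A minimal sketch of dumping and restoring a process tree with the `criu` CLI, run as root. The function names and log file names are illustrative choices. By default `criu dump` stops the process tree once the images are written, which is what allows the later restore to reuse the same PIDs.

```shell
# Dump the process tree rooted at PID $1 into the image directory $2.
criu_dump_tree() {
    mkdir -p "$2" || return 1
    # --shell-job is needed when the process was started from a terminal.
    # Add --tcp-established to include established TCP connections.
    criu dump --tree "$1" --images-dir "$2" --shell-job \
        --log-file dump.log -v4
}

# Restore the tree from the image directory $1 and detach from it.
criu_restore_tree() {
    criu restore --images-dir "$1" --shell-job --restore-detached \
        --log-file restore.log -v4
}

# Usage, as root:
#   criu_dump_tree "$(pidof myapp)" /var/tmp/myapp-images
#   criu_restore_tree /var/tmp/myapp-images
```

The log files land in the image directory, which is the first place to look when a dump or restore fails.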
Docker Compose Setup for CRIU:
Podman Checkpoint & Restore
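Podman has built-in support for CRIU-based checkpoint and restore, making it the easiest container runtime to use for this workflow. Although Podman is best known for running rootless, its checkpoint and restore commands have traditionally required root, so confirm rootless support in your Podman and CRIU versions before depending on it.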
Installation:
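Podman and CRIU are normally installed from distribution packages (for example `dnf install -y podman criu` on Fedora). After installing, a quick readiness check can save a failed checkpoint later. The sketch below uses a hypothetical helper name and assumes checkpointing will be run as root.

```shell
# Return 0 if this host looks ready for `podman container checkpoint`.
checkpoint_prereqs_ok() {
    for tool in podman criu; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing: $tool" >&2
            return 1
        }
    done
    # Assumption: checkpoint and restore are run as root.
    [ "$(id -u)" -eq 0 ] || { echo "run as root" >&2; return 1; }
}

# Usage:
#   checkpoint_prereqs_ok && echo "ready to checkpoint"
```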
Usage:
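A minimal sketch of the Podman workflow: checkpoint a container into a portable archive, then restore from that archive on the same or another host. The wrapper names and the container name in the usage comment are illustrative.

```shell
# Checkpoint container $1 and export its state to archive $2.
# The container is stopped once the checkpoint is written.
podman_checkpoint_to() {
    podman container checkpoint --export "$2" "$1"
}

# Recreate and resume a container from archive $1.
podman_restore_from() {
    podman container restore --import "$1"
}

# Usage, as root:
#   podman_checkpoint_to web /var/tmp/web-checkpoint.tar.gz
#   podman_restore_from /var/tmp/web-checkpoint.tar.gz
```

Adding `--tcp-established` to both commands asks CRIU to carry established TCP connections across as well, and `--leave-running` on the checkpoint keeps the original container alive.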
Docker Compose Alternative (Podman Compose):
Kubernetes Container Migration
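Kubernetes has no built-in live migration for pods. The kubelet can checkpoint an individual container (the ContainerCheckpoint feature, introduced as alpha in v1.25), but restoring that checkpoint on another node is left to the container runtime and the operator. Several projects and techniques help fill the gap: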
Kube-CRIU Integration
The CRI (Container Runtime Interface) exposes a container checkpoint operation through the CRI API, and some runtime implementations, such as CRI-O, provide this functionality.
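One way to reach the CRI checkpoint operation is the kubelet's checkpoint endpoint. The sketch below assumes the ContainerCheckpoint feature is enabled on the node, the runtime supports it, and the kubelet listens on its default port 10250. The environment variables `KUBELET_CA`, `KUBELET_CERT`, and `KUBELET_KEY` are hypothetical names for credentials the kubelet accepts.

```shell
# Build the kubelet checkpoint URL: $1 = node, $2 = namespace,
# $3 = pod, $4 = container.
kubelet_checkpoint_url() {
    echo "https://$1:10250/checkpoint/$2/$3/$4"
}

# POST to the endpoint; the kubelet writes the archive on the node itself.
kubelet_checkpoint() {
    curl -sS -X POST \
        --cacert "$KUBELET_CA" --cert "$KUBELET_CERT" --key "$KUBELET_KEY" \
        "$(kubelet_checkpoint_url "$@")"
}

# Usage:
#   kubelet_checkpoint node1 default web nginx
```

The archive is written on the node under the kubelet's root directory (by default `/var/lib/kubelet/checkpoints`), so moving it to another node remains a separate step.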
Using crictl for Checkpoint:
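A minimal sketch using `crictl checkpoint`, assuming a crictl release that includes the subcommand and a runtime with checkpoint support enabled. The wrapper name is hypothetical.

```shell
# Checkpoint the first running container whose name matches $1 into archive $2.
crictl_checkpoint_by_name() {
    cid=$(crictl ps --name "$1" --quiet | head -n 1)
    [ -n "$cid" ] || { echo "no running container named $1" >&2; return 1; }
    crictl checkpoint --export="$2" "$cid"
}

# Usage, as root on the node:
#   crictl_checkpoint_by_name nginx /var/tmp/nginx-checkpoint.tar
```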
KubeVirt Live Migration
For VM-based workloads, KubeVirt provides native live migration between Kubernetes nodes, which is often a more practical approach for production stateful workloads.
Trigger Migration:
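A minimal sketch of triggering a KubeVirt live migration by creating a `VirtualMachineInstanceMigration` object. The helper name and the VM name `my-vm` are illustrative, and the VM must be eligible for live migration (for example, its storage must be reachable from the target node).

```shell
# Print a migration manifest for the VirtualMachineInstance named $1.
vmi_migration_manifest() {
    cat <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  generateName: migrate-$1-
spec:
  vmiName: $1
EOF
}

# Usage (generateName requires "create", not "apply"):
#   vmi_migration_manifest my-vm | kubectl create -f -
# Equivalent shortcut with the KubeVirt CLI:
#   virtctl migrate my-vm
```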
Comparison Table
| Feature | CRIU (Direct) | Podman Checkpoint | KubeVirt Migration |
|---|---|---|---|
| Checkpoint Target | Processes | Containers | Virtual Machines |
| Restore Location | Same or different host | Same or different host | Different K8s node |
| TCP State Preservation | ✅ Yes | ✅ Yes | ✅ Yes |
| File Descriptor Migration | ✅ Yes | ✅ Yes | ✅ Yes |
| Rootless Support | ✅ (with user ns) | ⚠️ Limited (has traditionally required root) | N/A (VM-level) |
| Orchestration Integration | Manual | Podman Compose | Kubernetes native |
| Live Migration | Manual (dump + copy + restore) | Manual (tar + copy + restore) | ✅ Zero-downtime |
| Setup Complexity | High | Low | Medium |
| Production Ready | ✅ Yes | ✅ Yes | ✅ Yes |
| GitHub Stars | 3,800+ | 31,500+ (Podman) | 17,000+ (KubeVirt) |
| Last Updated | April 2026 | May 2026 | May 2026 |
Use Cases
Zero-Downtime Container Updates
Instead of stopping a container and starting a new version (and losing its in-memory state), checkpoint the running container, start the new version, and restore the old container’s state as a separate copy for debugging.
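The update flow can be sketched with Podman as follows. The function name and the `-next` and `-debug` name suffixes are illustrative choices. If the original container publishes host ports, the restored debug copy would need different ports to avoid clashing with the new version.

```shell
# $1 = running container, $2 = image for the new version, $3 = archive path.
checkpoint_then_upgrade() {
    # Save the old container's runtime state; this stops the old container.
    podman container checkpoint --export "$3" "$1" || return 1
    # Start the new version under a different name.
    podman run --detach --name "$1-next" "$2" || return 1
    # Bring the old state back as a separate container for inspection.
    podman container restore --import "$3" --name "$1-debug"
}

# Usage, as root:
#   checkpoint_then_upgrade web registry.example.com/web:2 /var/tmp/web.tar.gz
```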
Disaster Recovery
Save container state regularly as a backup strategy. If a node fails, restore the most recent checkpoint on a healthy node.
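A minimal sketch of that backup loop with Podman. The function names and archive naming scheme are assumptions, and the backup directory is expected to live on storage that survives the node (for example a network share). The UTC timestamp in the file name makes the newest archive sort last.

```shell
# Checkpoint container $1 into directory $2 without stopping it.
# Prints the path of the new archive on success.
backup_checkpoint() {
    mkdir -p "$2" || return 1
    archive="$2/$1-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
    podman container checkpoint --leave-running --export "$archive" "$1" \
        && echo "$archive"
}

# Restore the newest archive for container $1 found in directory $2.
restore_latest_checkpoint() {
    latest=$(ls -1 "$2"/"$1"-*.tar.gz 2>/dev/null | sort | tail -n 1)
    [ -n "$latest" ] || { echo "no checkpoint for $1 in $2" >&2; return 1; }
    podman container restore --import "$latest"
}

# Usage: call backup_checkpoint from cron on the source node, then run
#   restore_latest_checkpoint web /mnt/backups
# on a healthy node after a failure.
```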
Cross-Datacenter Migration
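Migrate workloads between datacenters by checkpointing on the source, transferring the checkpoint archive, and restoring on the destination. In-memory state carries over intact. Established TCP connections survive only if the container keeps the same IP address at the destination, which typically requires a network that spans both sites.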
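The checkpoint, transfer, and restore steps can be sketched with Podman, `scp`, and `ssh`. The function name and the `/tmp` archive path are illustrative. The sketch assumes root SSH access to the destination, compatible Podman and CRIU versions on both hosts, and that the destination can obtain the same container image.

```shell
# $1 = container name, $2 = SSH destination such as root@dest-host.
migrate_container() {
    archive="/tmp/$1-checkpoint.tar.gz"
    # 1. Checkpoint on the source; the container stops here.
    podman container checkpoint --export "$archive" "$1" || return 1
    # 2. Copy the archive to the destination host.
    scp "$archive" "$2:$archive" || return 1
    # 3. Restore on the destination.
    ssh "$2" podman container restore --import "$archive"
}

# Usage, as root on the source host:
#   migrate_container web root@dest-host
```

The container is unavailable between steps 1 and 3, so downtime grows with the archive size and the link speed.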
Why Container Checkpoint Matters
Traditional container orchestration treats containers as ephemeral and stateless. When you need to move a container, you stop it and start a fresh instance — losing all runtime state. Checkpoint and restore changes this paradigm:
- Preserves TCP connections — clients don’t see connection drops during migration
- Maintains in-memory caches — no cold-start performance penalty after restore
- Enables live migration — move workloads between hosts without downtime
- Simplifies debugging — save a failing container’s exact state for offline analysis
For container runtime comparisons, see our containerd vs CRI-O vs Podman guide and Incus vs LXD vs Podman comparison. For container sandboxing approaches that complement checkpoint/restore security, check our gVisor vs Kata Containers guide.
FAQ
What is the difference between checkpoint and snapshot?
A snapshot captures the filesystem state of a container at a point in time (like docker commit). A checkpoint captures the complete runtime state — memory, network connections, file descriptors, and process state. Snapshots let you restart from a filesystem baseline; checkpoints let you resume exactly where the container left off, including active network connections.
Does Docker support checkpoint and restore?
Docker added experimental checkpoint/restore support using CRIU in version 1.13, but it has never been promoted to stable: the docker checkpoint command is available only when the daemon runs with experimental features enabled. Podman has better native support. For Docker-based workflows, either enable the experimental feature, use CRIU directly, or switch to Podman.
What are the limitations of container checkpoint/restore?
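Not all applications can be checkpointed. Limitations can include applications with open GPU contexts, some database engines with memory-mapped files, containers with mounted FUSE filesystems, and applications that rely on kernel features CRIU handles only partially (userfaultfd and inotify are commonly cited, though support varies by CRIU and kernel version). CRIU provides a pre-check (criu check) that verifies the running kernel offers the features CRIU needs. It does not prove that a particular application can be checkpointed, so a test checkpoint of the real workload is still necessary.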
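A minimal sketch of wrapping `criu check` so that scripts can gate on it. The function name and messages are illustrative.

```shell
# Return 0 when `criu check` reports that the kernel is usable.
criu_kernel_ready() {
    if criu check >/dev/null 2>&1; then
        echo "criu check passed"
    else
        echo "criu check failed; rerun 'criu check' to see what is missing" >&2
        return 1
    fi
}

# Usage, as root:
#   criu_kernel_ready && podman container checkpoint --export /var/tmp/web.tar.gz web
```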
Can I checkpoint a running database?
Technically yes, but it is risky. Databases like PostgreSQL and MySQL use memory-mapped files and transaction logs. Checkpointing a running database may result in inconsistent state on restore if the data files on disk no longer match the checkpointed memory. It is safer to use the database’s native backup tools (pg_dump, mysqldump), or physical backups combined with WAL or binary log archiving, for consistent backups.
How large is a checkpoint file?
Checkpoint size depends on the container’s memory usage. A container using 512MB of RAM will produce a checkpoint of roughly 200-500MB (compressed). With lazy pages support, you can restore without immediately loading all memory pages, reducing restore time significantly.
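Because the size depends so heavily on the workload, it is worth measuring a real archive. A minimal sketch, with a hypothetical helper name, that reports an exported checkpoint's size in MiB, rounded up:

```shell
# Print the size of file $1 in MiB, rounded up to the next whole MiB.
checkpoint_size_mib() {
    bytes=$(wc -c < "$1") || return 1
    echo $(( ($bytes + 1048575) / 1048576 ))
}

# Usage:
#   checkpoint_size_mib /var/tmp/web-checkpoint.tar.gz
```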
Is container checkpoint/restore production-ready?
CRIU and Podman checkpoint are production-ready for supported workloads. Kubernetes-native migration is still emerging — KubeVirt provides the most mature solution for VM workloads, while container-level migration in Kubernetes requires manual orchestration or custom operators. Always test checkpoint/restore with your specific application before relying on it in production.