The internet loses millions of pages every year. Link rot is real — studies show that over 50% of URLs referenced in academic papers from the early 2000s are now dead. Social media posts vanish, news articles get paywalled, documentation sites restructure, and entire platforms shut down without warning. If you rely on the web for research, compliance, or personal knowledge management, you need a strategy to preserve content before it disappears.
While the Internet Archive’s Wayback Machine is the best-known web archiving service, it has limitations: you cannot control what gets archived, you cannot search your private archives efficiently, and you are at the mercy of a third party’s infrastructure and policies. Self-hosting your own web archiving solution gives you full ownership, instant access, searchable storage, and complete control over what you preserve.
Why Self-Host Your Web Archive
There are compelling reasons to run your own web archiving infrastructure instead of relying on external services.
Complete ownership of your data. When you archive pages to your own server, the content never leaves your control. You decide retention policies, access controls, and backup strategies. This is especially critical for legal compliance, research data, and sensitive business intelligence.
Full-text search across your entire archive. Most third-party archiving services do not provide robust search capabilities across your saved pages. With a self-hosted solution, you can build a personal knowledge base where every archived page is searchable by content, tag, date, or source.
Archive anything, including private pages. The Wayback Machine cannot archive pages behind authentication, paywalls, or corporate intranets. Your own archiving server can be configured with credentials to capture content from any source you have access to.
No rate limits or quotas. External archiving services impose limits on how many pages you can save and how frequently. A self-hosted instance lets you archive as much as your storage and bandwidth allow.
Preserve interactive and dynamic content. Modern web pages rely heavily on JavaScript, WebAssembly, and dynamic loading. Self-hosted archiving tools can be configured to wait for JavaScript execution, capture screenshots, and save multiple output formats to ensure you preserve the page as it appeared.
What Is ArchiveBox?
ArchiveBox is the most comprehensive open-source self-hosted web archiving solution available. It takes a URL and captures a full snapshot of the page in multiple formats:
- HTML — raw page source and cleaned DOM output
- Wget mirror — complete recursive download with all assets
- SingleFile — entire page saved as a single self-contained HTML file
- PDF — rendered page output via headless browser
- Screenshot — PNG capture of the page viewport
- WARC — raw HTTP response archive (the web archiving standard)
- Media — embedded video and audio extraction via yt-dlp
- Git — repository clones for GitHub, GitLab, and similar hosts
- DOM — post-JavaScript-rendered HTML snapshot
ArchiveBox is not just a downloader — it is a full-featured archiving platform with a web interface, REST API, scheduling capabilities, and support for importing bookmarks from browser export files, Pocket, Pinboard, Raindrop, Shaarli, Delicious, and many other sources.
ArchiveBox vs Alternative Archiving Tools
| Feature | ArchiveBox | Wayback Machine | SingleFile CLI | warcprox | Browsh |
|---|---|---|---|---|---|
| Self-hosted | ✅ Yes | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
| Multi-format capture | ✅ 9 formats | ✅ Limited | ✅ Single HTML | ✅ WARC only | ❌ Browser only |
| Web UI | ✅ Built-in | ✅ Web only | ❌ CLI only | ❌ CLI only | ✅ Built-in |
| Scheduling/cron | ✅ Built-in | ❌ No | ❌ Manual | ❌ No | ❌ No |
| Full-text search | ✅ Yes | ✅ Yes | ❌ No | ❌ No | ❌ No |
| API access | ✅ REST API | ✅ API | ❌ No | ❌ No | ❌ No |
| Bookmark import | ✅ 15+ sources | ❌ No | ❌ No | ❌ No | ❌ No |
| JavaScript rendering | ✅ Headless Chrome | ✅ Yes | ✅ Via extension | ❌ No | ✅ Yes |
| Media extraction | ✅ yt-dlp | ❌ No | ❌ No | ❌ No | ❌ No |
| Docker support | ✅ Official image | ❌ N/A | ✅ Community | ✅ Yes | ✅ Yes |
| Active development | ✅ Very active | ✅ Active | ✅ Active | ⚠️ Low activity | ⚠️ Low activity |
For most users who want a complete self-hosted web archiving solution, ArchiveBox is the clear choice. It combines the capture capabilities of multiple tools into a single platform with a polished interface and active community.
Prerequisites
Before deploying ArchiveBox, ensure your server meets these requirements:
- CPU: 2+ cores recommended (headless browser rendering is CPU-intensive)
- RAM: 4 GB minimum, 8 GB recommended for concurrent archiving
- Storage: Depends on archive size. A typical page takes 2-10 MB across all formats. Plan for at least 100 GB for serious archiving.
- OS: Linux (Ubuntu 22.04+, Debian 12+, or any distro with Docker)
- Docker and Docker Compose installed
The following commands assume an Ubuntu/Debian server. Adjust package names for your distribution.
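The storage guidance above can be turned into a rough capacity estimate. The sketch below is a back-of-the-envelope calculation only, assuming the 2-10 MB per page range quoted in the prerequisites and decimal units (1 GB = 1000 MB); the function name is illustrative and not part of ArchiveBox.

```python
def pages_that_fit(storage_gb: float, mb_per_page: float) -> int:
    """Return how many archived pages fit in storage_gb at mb_per_page each."""
    if mb_per_page <= 0:
        raise ValueError("mb_per_page must be positive")
    # Assumption: decimal units, 1 GB = 1000 MB.
    return int(storage_gb * 1000 // mb_per_page)


if __name__ == "__main__":
    # Range quoted in the prerequisites: 2-10 MB per page across all formats.
    for mb in (2, 10):
        print(f"{mb} MB/page: {pages_that_fit(100, mb)} pages in 100 GB")
```

Under these assumptions, 100 GB holds roughly 10,000 to 50,000 pages, depending on how heavy the captured pages are.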
Installation: Docker Compose (Recommended)
The Docker Compose deployment is the fastest and most reliable way to run ArchiveBox. It bundles all dependencies — Python, Chromium, Node.js, yt-dlp, and SingleFile — into a single container.
Step 1: Create the Project Directory
Step 2: Write the Docker Compose File
Create a docker-compose.yml file:
The SYS_CHROOT capability is required for Chromium to run in sandboxed mode inside the container. The volume mount persists all archived data on your host filesystem.
Step 3: Initialize the Archive
Initialization (archivebox init --setup, run inside the container) creates the SQLite database and generates the initial configuration. You will be prompted to create an admin username, email, and password.
Step 4: Start the Service
ArchiveBox is now running at http://your-server-ip:8080. Log in with the admin credentials you created during initialization.
Step 5: Verify the Installation
This should display your archive statistics, including the number of snapshots, total disk usage, and configured extractors.
Configuration and Optimization
Out of the box, ArchiveBox works well for casual use. For production deployments with heavy archiving loads, you should tune several settings.
Core Configuration Options
Configuration is managed through the web UI or via environment variables in your docker-compose.yml. Key settings include:
| Setting | Default | Recommended | Description |
|---|---|---|---|
| CHROME_TIME_LIMIT | 60s | 120s | Max time for headless browser rendering per page |
| WGET_TIMEOUT | 60s | 120s | Max time for wget download per page |
| YOUTUBEDL_TIMEOUT | 60s | 120s | Max time for media extraction |
| RESOLUTION | 1440,2000 | 1440,2000 (or 1920,1080 for a desktop-sized viewport) | Browser viewport size for screenshots and PDFs |
| SAVE_ARCHIVE_DOT_ORG | True | False | Whether to also submit URLs to the Internet Archive; set to False to keep the archive self-hosted only |
| CHROME_HEADLESS | True | True | Run Chrome in headless mode |
| SAVE_FAVICON | True | True | Download site favicons for visual identification |
Adding Full-Text Search with Sonic
For archives with thousands of pages, the default search backend can become slow. ArchiveBox supports Sonic, a lightweight full-text search backend:
After adding Sonic, rebuild the search index:
Using ArchiveBox
Adding URLs via the Web UI
The web interface is the simplest way to archive pages. Log in and use the “Add URL” button in the top right corner. You can:
- Add a single URL
- Paste multiple URLs (one per line)
- Import a bookmark export file (HTML, JSON, or Netscape format)
ArchiveBox immediately begins processing and shows real-time progress.
Archiving via the CLI
For automation and scripting, the CLI is more powerful:
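As one way to script this, the sketch below builds the command line for adding URLs through Docker Compose from Python. It assumes the Compose service is named archivebox and that your version accepts the --depth flag with values 0 or 1; the helper name is hypothetical, and the command is only printed here, not executed.

```python
import shlex
from typing import List, Sequence


def build_add_command(urls: Sequence[str], depth: int = 0) -> List[str]:
    """Build the argv for adding URLs through the Docker Compose service."""
    if depth not in (0, 1):
        raise ValueError("depth must be 0 (page only) or 1 (page plus its links)")
    if not urls:
        raise ValueError("at least one URL is required")
    # Assumption: "archivebox" is the service name in docker-compose.yml.
    return ["docker", "compose", "run", "--rm", "archivebox",
            "add", f"--depth={depth}", *urls]


if __name__ == "__main__":
    cmd = build_add_command(["https://example.com/article"], depth=0)
    # Print a copy-pasteable shell command; use subprocess.run(cmd, check=True)
    # from the project directory to actually run it.
    print(shlex.join(cmd))
```

Building the argument list instead of a shell string avoids quoting problems with URLs that contain characters such as & or ?.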
Importing Bookmarks
ArchiveBox can import bookmarks from virtually any source:
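ArchiveBox can read export files directly, but it is sometimes useful to inspect or filter the links first. The following is a minimal sketch, assuming a Netscape-style HTML bookmark export (the format most browsers produce); the class and function names are illustrative and not part of ArchiveBox.

```python
from html.parser import HTMLParser
from typing import List


class BookmarkLinkParser(HTMLParser):
    """Collect href values from <a> tags in a Netscape-style bookmark export."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only web links; skip javascript:, place:, and similar schemes.
        if href and href.startswith(("http://", "https://")):
            self.urls.append(href)


def extract_urls(bookmark_html: str) -> List[str]:
    """Return unique http(s) URLs in first-seen order."""
    parser = BookmarkLinkParser()
    parser.feed(bookmark_html)
    return list(dict.fromkeys(parser.urls))
```

The resulting list can be written one URL per line and pasted into the web UI, or passed to the CLI.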
Scheduling Regular Archives
Set up automatic archiving with cron jobs on your host machine:
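A host cron entry for this could be generated as sketched below. This is an illustration under stated assumptions: the project directory, the archivebox service name, and the daily 03:00 schedule are placeholders, and the helper is hypothetical. The -T flag tells docker compose run not to allocate a TTY, which cron jobs lack.

```python
import shlex


def cron_line(project_dir: str, url: str, hour: int = 3, minute: int = 0) -> str:
    """Return a crontab entry that archives `url` once a day."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    command = (
        f"cd {shlex.quote(project_dir)} && "
        f"docker compose run --rm -T archivebox add {shlex.quote(url)}"
    )
    # cron treats an unescaped % as a newline, so escape any in the URL.
    command = command.replace("%", r"\%")
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Assumed paths and URL; paste the printed line into `crontab -e`.
    print(cron_line("/opt/archivebox", "https://example.com/feed"))
```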
Advanced Use Cases
Archiving Pages Behind Authentication
To archive content behind login walls, configure ArchiveBox with browser cookies:
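Cookie files for this purpose are usually in the Netscape cookies.txt format. The sketch below writes such a file with Python's standard library, assuming your ArchiveBox version reads a cookies.txt through its COOKIES_FILE setting; the domain, cookie name, and value are placeholders. In practice you would more likely export the file from a logged-in browser, and either way it should be treated as a secret.

```python
import time
from http.cookiejar import Cookie, MozillaCookieJar


def write_cookies_file(path: str, domain: str, name: str, value: str,
                       days_valid: int = 30) -> None:
    """Write a single cookie to `path` in Netscape cookies.txt format."""
    jar = MozillaCookieJar(path)
    jar.set_cookie(Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True,
        # An explicit expiry is needed: cookies without one are skipped on save.
        expires=int(time.time()) + days_valid * 86400,
        discard=False, comment=None, comment_url=None, rest={},
    ))
    jar.save()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    write_cookies_file("cookies.txt", ".example.com", "sessionid", "REPLACE_ME")
```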
For more complex authentication flows, you can configure a browser user data directory with saved sessions:
Archiving Entire Sites
For comprehensive site archiving, use wget’s recursive mode through ArchiveBox:
For large sites, consider using wget directly with ArchiveBox as the post-processor:
Integrating with Your Existing Stack
ArchiveBox provides a REST API for integration with other tools:
Browser Extension for One-Click Archiving
Install the ArchiveBox browser extension for Chrome or Firefox to archive any page with a single click. The extension sends the current tab’s URL directly to your ArchiveBox instance.
Configure it by setting the ArchiveBox URL and API token in the extension settings. After that, clicking the extension icon immediately queues the current page for archiving.
Monitoring and Maintenance
Keep your archive healthy with these maintenance tasks:
Set up monitoring with Uptime Kuma or similar tools to track ArchiveBox availability:
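If you only need a basic availability probe, for example from a cron job, a small script can stand in for a full monitoring tool. The following is a minimal sketch, not Uptime Kuma's configuration; it assumes the instance is reachable over plain HTTP at a URL you supply and treats any response below HTTP 500 as "up".

```python
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with any HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 500
    except urllib.error.HTTPError as err:
        # A 4xx response (for example, login required) still means the server is alive.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False


if __name__ == "__main__":
    # Address taken from the installation step; adjust to your server.
    print(is_up("http://your-server-ip:8080/"))
```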
Reverse Proxy Setup with HTTPS
For production use, you should put ArchiveBox behind a reverse proxy with TLS. Here is an NGINX configuration:
With this configuration and Let’s Encrypt certificates, your archive is accessible at https://archive.example.com with full encryption.
Backup Strategy
Your archive is only as valuable as your backup strategy. ArchiveBox stores everything in the data/ directory, making backups straightforward:
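A simple backup of this kind can be sketched as follows: it packs the data directory into a timestamped tarball and keeps only the most recent few. The paths, the file-name pattern, and the retention count are assumptions for illustration, not ArchiveBox conventions.

```python
import tarfile
import time
from pathlib import Path
from typing import List


def backup_data_dir(data_dir: str, backup_dir: str, keep: int = 7) -> Path:
    """Archive data_dir into a timestamped .tar.gz and keep the newest `keep` files."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source = Path(data_dir)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_path = target_dir / f"archivebox-{stamp}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps paths inside the tarball relative (e.g. "data/...").
        tar.add(source, arcname=source.name)

    # Timestamped names sort chronologically, so the oldest come first.
    backups: List[Path] = sorted(target_dir.glob("archivebox-*.tar.gz"))
    for old in backups[:-keep]:
        old.unlink()
    return archive_path


if __name__ == "__main__":
    # Assumed locations; adjust to your deployment.
    print(backup_data_dir("/opt/archivebox/data", "/var/backups/archivebox"))
```

Because the index is a SQLite database, it is safest to take the copy while no archiving job is writing to it, for example by stopping the container first.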
Run this script daily via cron. For larger archives, consider using restic or borg for deduplicated backups.
Storage Management
Web archives grow quickly. Here are strategies to manage storage:
A practical approach is to archive everything in full for the first 30 days, then keep only the PDF and SingleFile formats for long-term storage, discarding heavier WARC and full wget mirrors.
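The retention idea above could be automated roughly as sketched below. It rests on assumptions you should verify against your own instance: that each snapshot lives in its own folder under an archive root such as data/archive/, that the folder's modification time is a usable proxy for snapshot age, and that the heavy outputs have predictable names (passed in explicitly here, for example a warc folder). The function defaults to a dry run and is not an ArchiveBox feature.

```python
import shutil
import time
from pathlib import Path
from typing import Iterable, List


def prune_old_snapshots(archive_root: str, remove_names: Iterable[str],
                        max_age_days: int = 30, dry_run: bool = True) -> List[Path]:
    """Delete the named outputs from snapshot folders older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    names = list(remove_names)
    removed: List[Path] = []
    for snapshot in sorted(Path(archive_root).iterdir()):
        # Skip stray files and snapshots still inside the full-retention window.
        if not snapshot.is_dir() or snapshot.stat().st_mtime >= cutoff:
            continue
        for name in names:
            target = snapshot / name
            if not target.exists():
                continue
            removed.append(target)
            if dry_run:
                continue  # report only, delete nothing
            if target.is_dir():
                shutil.rmtree(target)
            else:
                target.unlink()
    return removed


if __name__ == "__main__":
    # Dry run: list what would be deleted without touching anything.
    for path in prune_old_snapshots("data/archive", ["warc"], dry_run=True):
        print("would remove", path)
```

Files deleted this way may still be listed in the ArchiveBox index, so take a backup and test on a copy before running it with dry_run=False.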
Conclusion
Self-hosted web archiving with ArchiveBox gives you a personal Wayback Machine that you fully control. Whether you are a researcher preserving source material, a compliance officer maintaining records, or simply someone who values digital preservation, ArchiveBox provides a robust, open-source foundation for building your own archive.
The combination of multi-format capture, full-text search, bookmark importing, REST API, and scheduling makes it the most complete self-hosted web archiving solution available. With Docker deployment, you can have a working instance running in under five minutes.
Start archiving today: the page you bookmark now might be gone tomorrow.
Frequently Asked Questions (FAQ)
Which one should I choose in 2026?
The best choice depends on your specific requirements:
- For beginners: Start with the simplest option that covers your core use case
- For production: Choose the solution with the most active community and documentation
- For teams: Look for collaboration features and user management
- For privacy: Prefer fully open-source, self-hosted options with no telemetry
Refer to the comparison table above for detailed feature breakdowns.
Can I migrate between these tools?
Most tools support data import/export. Always:
- Back up your current data
- Test the migration on a staging environment
- Check official migration guides in the documentation
Are there free versions available?
All tools in this guide offer free, open-source editions. Some also provide paid plans with additional features, priority support, or managed hosting.
How do I get started?
- Review the comparison table to identify your requirements
- Visit the official documentation (links provided above)
- Start with a Docker Compose setup for easy testing
- Join the community forums for troubleshooting