Modern organizations manage hundreds of databases, data lakes, warehouses, and streaming pipelines. Finding the right dataset, understanding its lineage, knowing who owns it, and trusting its quality are daily challenges for data teams. Commercial data catalogs like Alation, Collibra, and Informatica Enterprise Data Catalog charge premium prices and lock your metadata behind proprietary formats.
Open-source data catalogs solve this problem differently: they give you full control over your metadata, run on your own infrastructure, and integrate with the tools you already use. This guide compares the three leading open-source data catalog platforms — Amundsen, DataHub, and OpenMetadata — with practical deployment instructions so you can choose and deploy the right solution for your organization.
Why You Need a Self-Hosted Data Catalog
A data catalog is a centralized inventory of all your data assets — tables, columns, dashboards, pipelines, and machine learning models — enriched with metadata that makes them discoverable, understandable, and trustworthy. Here is why self-hosting matters:
- Data privacy and compliance. Metadata often reveals sensitive information about your business: table schemas, data lineage, ownership, and usage patterns. Self-hosting keeps this intelligence within your network, avoiding exposure to third-party SaaS vendors.
- No per-user licensing fees. Commercial catalogs charge per seat, making them prohibitively expensive for large organizations. Open-source tools scale to hundreds or thousands of users at zero additional license cost.
- Deep integration flexibility. Open-source catalogs expose APIs and plugin architectures that let you connect to internal systems, custom metadata sources, and proprietary tools that commercial products often do not support out of the box.
- Metadata ownership. Your metadata graph — the relationships between tables, pipelines, dashboards, and teams — is a strategic asset. Self-hosting means you own this graph permanently, with no risk of vendor lock-in or service discontinuation.
- Air-gapped and regulated environments. Government, healthcare, and financial organizations often operate in air-gapped or heavily regulated environments where SaaS data catalogs are not an option.
What to Look for in a Data Catalog
Not all data catalogs are equal. When evaluating platforms, focus on these capabilities:
| Capability | Why It Matters |
|---|---|
| Automated metadata ingestion | Manual metadata entry does not scale. The catalog must pull schema, statistics, and lineage from your data sources automatically. |
| Data lineage visualization | Understanding where data comes from and where it flows is essential for debugging pipelines and regulatory compliance. |
| Search and discovery | Users must find datasets quickly using natural language, tag search, or column-level search. |
| Column-level metadata | Descriptions, data types, PII tags, and ownership at the column level, not just the table level. |
| Data quality integration | Integration with data quality tools (Great Expectations, dbt tests) so users see health scores when browsing datasets. |
| Access control | Fine-grained permissions for who can view, edit, or manage metadata. |
| Business glossary | A shared vocabulary that maps business terms to technical assets, bridging the gap between analysts and engineers. |
| Extensibility | Plugin architecture for custom ingestion sources, metadata transformers, and UI extensions. |
Overview: Amundsen vs DataHub vs OpenMetadata
Amundsen — The Pioneer
Amundsen was created by Lyft in 2019 and donated to the Linux Foundation’s LF AI & Data organization. It was one of the first open-source data catalogs and remains focused on data discovery and search. Its architecture uses Elasticsearch for search, Neo4j for the metadata graph, and a React-based frontend.
Strengths:
- Simple, clean search-first interface
- Strong integration with Presto, Hive, and BigQuery
- Mature preview functionality for seeing sample data
- Active community with Lyft engineering backing
- Lightweight deployment footprint compared to alternatives
Weaknesses:
- No built-in data lineage visualization (requires third-party plugins)
- Limited metadata editing capabilities in the UI
- No native business glossary feature
- Smaller plugin ecosystem compared to DataHub
- Development pace has slowed in recent years
DataHub — The Modern Platform
DataHub was open-sourced by LinkedIn in 2020 and is now maintained by Acryl Data and an active open-source community. It has become the most feature-complete open-source data catalog, with extensive ingestion capabilities, real-time metadata streaming via Kafka, and a robust GraphQL API.
Strengths:
- Most comprehensive ingestion framework with 80+ built-in source connectors
- Real-time metadata updates via Kafka-based event streaming
- Rich data lineage visualization (column-level, multi-hop)
- Native business glossary and data product concepts
- Active development with frequent releases
- Strong API-first design with GraphQL and REST endpoints
- Built-in data quality insights integration
Weaknesses:
- Complex architecture with many dependencies (Kafka, Elasticsearch, MySQL, ZooKeeper)
- Higher operational overhead for self-hosting
- Steeper learning curve for administrators
- UI can feel overwhelming with its density of features
OpenMetadata — The Collaborative Choice
OpenMetadata was launched in 2021 by engineers who previously built data infrastructure at Uber, and it is backed commercially by their company, Collate. It positions itself as a metadata standard and collaboration platform. Its distinguishing features include a standardized metadata schema (using JSON Schema), built-in data quality monitoring, team collaboration tools, and an intuitive user interface.
Strengths:
- Standardized metadata schema based on JSON Schema definitions
- Built-in data quality tests and observability features
- Clean, intuitive user interface with collaborative editing
- Team and ownership workflows built into the core product
- Automated metadata ingestion with a well-documented ingestion framework
- Active open-source community and commercial support from the creators
- Native integration with dbt, Airflow, Great Expectations, and more
Weaknesses:
- Newer project with a shorter track record than Amundsen
- Fewer third-party integrations than DataHub
- The ingestion framework uses a custom Python SDK rather than a generic connector model
- Community size is smaller than DataHub’s
Feature Comparison
| Feature | Amundsen | DataHub | OpenMetadata |
|---|---|---|---|
| Search engine | Elasticsearch | Elasticsearch | Elasticsearch |
| Metadata store | Neo4j | MySQL/MariaDB | MySQL/PostgreSQL |
| Data lineage | Plugin-based | Native, column-level | Native, column-level |
| Business glossary | No | Yes | Yes |
| Data quality | Limited | Via Great Expectations | Built-in tests |
| API | REST | GraphQL + REST | REST |
| Ingestion sources | ~25 | 80+ | 50+ |
| Real-time updates | No | Yes (Kafka) | Partial |
| Access control | Basic | RBAC + Policies | RBAC |
| UI customization | Limited | Moderate | Good |
| Docker deployment | Yes | Yes (docker-compose) | Yes (docker-compose) |
| Kubernetes support | Via Helm | Via Helm | Via Helm |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Deployment Guide: Self-Hosting Each Platform
Deploying Amundsen with Docker Compose
Amundsen is the lightest of the three to deploy. Its Docker Compose setup brings up the frontend, metadata service, search service, and their dependencies.
The default setup includes:
- Frontend (port 5000): React-based search and discovery UI
- Metadata Service (port 5002): Flask-based API for metadata CRUD
- Search Service (port 5001): search API backed by an Elasticsearch instance (port 9200)
- Neo4j (port 7687): graph database for metadata relationships
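Once the stack is up, a quick way to confirm that the services in the list above are reachable is to probe their ports. The following is a minimal sketch using only the Python standard library. It assumes the stack runs on `localhost` with the default ports, and the names `AMUNDSEN_PORTS`, `is_listening`, and `check_services` are illustrative, not part of Amundsen.

```python
import socket

# Ports taken from the default-setup list above; the host is an assumption
# (a stack started locally with Docker Compose).
AMUNDSEN_PORTS = {
    "frontend": 5000,
    "metadata service": 5002,
    "Elasticsearch": 9200,
    "Neo4j (Bolt)": 7687,
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host: str = "localhost", ports: dict = AMUNDSEN_PORTS) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:18} {'up' if up else 'DOWN'}")
```

An open port only shows that a process is listening, not that the service is healthy, so treat this as a first smoke test before opening the UI.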
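To ingest metadata from a PostgreSQL database, write a job with Amundsen's `databuilder` library: an extractor reads table and column metadata from PostgreSQL, and publishers load it into Neo4j and Elasticsearch.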
Deploying DataHub with Docker Compose
DataHub requires more infrastructure but provides a richer feature set. The official quickstart uses Docker Compose with Kafka, Elasticsearch, MySQL, and ZooKeeper.
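To ingest metadata from Snowflake, write an ingestion recipe (a YAML file that names a `snowflake` source and a `datahub-rest` sink) and run it with `datahub ingest -c <recipe file>`.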
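As an illustration of what a Snowflake recipe contains, the sketch below builds one as a Python dictionary and prints it as JSON. Because JSON is valid YAML, the output can be saved as a recipe file. The field names (`account_id`, `warehouse`, `role`, `username`, `password`) follow the Snowflake source configuration as commonly documented, and `http://localhost:8080` is the usual quickstart address; treat both as assumptions and check them against the documentation for your DataHub version. The function name and the sample account values are hypothetical.

```python
import json


def snowflake_recipe(account_id: str, warehouse: str, role: str,
                     server: str = "http://localhost:8080") -> dict:
    """Build a DataHub recipe: a Snowflake source feeding the REST sink."""
    return {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": account_id,
                "warehouse": warehouse,
                "role": role,
                # Left as ${VAR} placeholders so no secret is written to disk;
                # DataHub expands environment variables when it loads a recipe.
                "username": "${SNOWFLAKE_USER}",
                "password": "${SNOWFLAKE_PASSWORD}",
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


if __name__ == "__main__":
    # Hypothetical account values; JSON output is valid YAML.
    recipe = snowflake_recipe("xy12345", "COMPUTE_WH", "DATAHUB_ROLE")
    print(json.dumps(recipe, indent=2))
```

Redirect the output to a file such as `recipe.yml`, export the two environment variables, and pass the file to `datahub ingest -c`.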
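For dbt metadata ingestion, use a recipe with the `dbt` source that points at the `manifest.json` and `catalog.json` artifacts dbt writes to its `target/` directory, and run it with `datahub ingest -c <recipe file>`.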
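A common failure with dbt ingestion is a missing artifact, since `catalog.json` only exists after `dbt docs generate` has run. The sketch below checks for both files before building the recipe. The config keys `manifest_path`, `catalog_path`, and `target_platform` are the ones the dbt source is documented to use, but verify them for your DataHub version; the function name `dbt_recipe` is hypothetical.

```python
from pathlib import Path


def dbt_recipe(target_dir: str, target_platform: str,
               server: str = "http://localhost:8080") -> dict:
    """Build a DataHub recipe for the dbt source from a dbt target directory."""
    target = Path(target_dir)
    manifest = target / "manifest.json"  # produced by dbt compile/run/build
    catalog = target / "catalog.json"    # produced by `dbt docs generate`
    missing = [p.name for p in (manifest, catalog) if not p.is_file()]
    if missing:
        # Fail early with a clear message instead of a half-ingested project.
        raise FileNotFoundError(f"missing dbt artifacts in {target}: {missing}")
    return {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": str(manifest),
                "catalog_path": str(catalog),
                # Platform the dbt models are materialized on, e.g. "snowflake".
                "target_platform": target_platform,
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }
```

Calling `dbt_recipe("target", "snowflake")` from the dbt project root returns a dictionary that can be dumped to JSON (which is valid YAML) and passed to `datahub ingest -c`.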
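To deploy DataHub on Kubernetes with Helm, install the project's two charts in order: `datahub-prerequisites` (Kafka, Elasticsearch, and MySQL) and then `datahub` itself.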
Deploying OpenMetadata with Docker Compose
OpenMetadata offers a clean deployment experience with well-organized Docker Compose files and a straightforward ingestion framework.
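To ingest metadata from a PostgreSQL database, describe the source, the sink, and the OpenMetadata server connection in a YAML workflow file, then run it with `metadata ingest -c <workflow file>` from the `openmetadata-ingestion` package.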
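To make the shape of such a workflow file concrete, here is a rough sketch that assembles one as a Python dictionary and prints it as JSON, which is also valid YAML. The layout (`source`, `sink`, `workflowConfig`) and the field names reflect the PostgreSQL connector configuration as I understand it, and they have changed between OpenMetadata releases, so treat every key as an assumption to verify against the connector documentation for your version. The function name and all angle-bracket values are placeholders.

```python
import json


def postgres_workflow(service_name: str, host_port: str, database: str,
                      om_api: str = "http://localhost:8585/api") -> dict:
    """Build an OpenMetadata metadata-ingestion workflow for PostgreSQL."""
    return {
        "source": {
            "type": "postgres",
            "serviceName": service_name,  # name shown in the OpenMetadata UI
            "serviceConnection": {
                "config": {
                    "type": "Postgres",
                    "hostPort": host_port,
                    "database": database,
                    # Placeholders: fill in or inject from a secrets store.
                    "username": "<postgres-user>",
                    "authType": {"password": "<postgres-password>"},
                }
            },
            "sourceConfig": {"config": {"type": "DatabaseMetadata"}},
        },
        "sink": {"type": "metadata-rest", "config": {}},
        "workflowConfig": {
            "openMetadataServerConfig": {
                "hostPort": om_api,
                "authProvider": "openmetadata",
                # Placeholder for the ingestion bot's JWT token.
                "securityConfig": {"jwtToken": "<ingestion-bot-jwt>"},
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical service name, host, and database.
    print(json.dumps(postgres_workflow("local_postgres", "localhost:5432", "analytics"), indent=2))
```

Generating the file from code rather than editing YAML by hand makes it easier to produce one workflow per database with consistent server settings.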
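To run data quality tests with OpenMetadata, define test cases against a table and execute them as a test suite through the ingestion framework. The results appear alongside the table in the UI.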
Real-World Usage Scenarios
Small Data Team (5-15 people)
For a small team, OpenMetadata offers the best balance of features and operational simplicity. Its Docker Compose deployment is straightforward, the UI is intuitive, and the built-in data quality features cover most needs without additional tooling. A single engineer can manage the deployment, and the team benefits from collaborative metadata editing.
Mid-Size Organization (50-200 data users)
At this scale, DataHub shines. Its extensive connector ecosystem means it can ingest metadata from virtually every data source in your stack. The real-time metadata streaming ensures that catalog updates propagate immediately, and the business glossary feature helps maintain a shared vocabulary across teams.
Enterprise (500+ users, regulated industry)
For regulated enterprises, DataHub or Amundsen with custom plugins is often the choice. DataHub's policy framework enables fine-grained metadata governance. Amundsen's built-in access control is basic, so meeting access-control and audit requirements with it usually means adding custom plugins. Some organizations run both, with Amundsen for discovery and DataHub for governance, to get comprehensive coverage.
Cost Comparison
Self-hosting eliminates per-user licensing fees, but infrastructure costs still apply:
| Platform | Minimum Infrastructure | Estimated Monthly Cost (cloud) |
|---|---|---|
| Amundsen | 2 CPU, 8 GB RAM, Elasticsearch + Neo4j | $80-150 |
| DataHub | 4 CPU, 16 GB RAM, Kafka + ES + MySQL | $150-300 |
| OpenMetadata | 4 CPU, 12 GB RAM, ES + MySQL | $120-250 |
Compare this to commercial alternatives: Alation starts at $15,000/year for 10 users, and Collibra pricing typically exceeds $50,000/year. For organizations with 100+ data users, the savings from self-hosting are substantial.
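To put the monthly estimates on the same footing as the annual license prices, the arithmetic can be sketched as follows. The monthly figures are the ones from the table above; the helper names are illustrative, and the calculation deliberately ignores engineering time, which is a real cost of self-hosting.

```python
# Monthly infrastructure estimates (USD, low and high) from the table above.
MONTHLY_COST = {
    "Amundsen": (80, 150),
    "DataHub": (150, 300),
    "OpenMetadata": (120, 250),
}


def annual_range(monthly: tuple) -> tuple:
    """Convert a (low, high) monthly estimate into a (low, high) annual one."""
    low, high = monthly
    return (low * 12, high * 12)


def cost_per_user(annual: tuple, users: int) -> tuple:
    """Spread a (low, high) annual cost evenly across a number of users."""
    low, high = annual
    return (round(low / users, 2), round(high / users, 2))


if __name__ == "__main__":
    for platform, monthly in MONTHLY_COST.items():
        low, high = annual_range(monthly)
        print(f"{platform}: ${low:,}-${high:,} per year")
    # prints:
    # Amundsen: $960-$1,800 per year
    # DataHub: $1,800-$3,600 per year
    # OpenMetadata: $1,440-$3,000 per year
```

For 100 users, `cost_per_user(annual_range(MONTHLY_COST["DataHub"]), 100)` gives $18 to $36 per user per year in infrastructure, against the $1,500 per user implied by the Alation figure quoted above.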
Making the Decision
Choose Amundsen if:
- You want the simplest possible deployment
- Search and discovery are your primary needs
- You have a Presto/Hive-heavy stack
- You prefer a lightweight, focused tool
Choose DataHub if:
- You need the broadest possible integration coverage
- Real-time metadata updates are important
- Your organization needs a business glossary and data products
- You have the engineering resources to manage a complex deployment
Choose OpenMetadata if:
- You value a clean, collaborative user interface
- Built-in data quality monitoring is a priority
- You want a standardized metadata schema
- You prefer a balance between features and operational simplicity
All three platforms are Apache 2.0 licensed, meaning you can use, modify, and distribute them freely. The best approach is to run each platform’s Docker Compose quickstart, ingest metadata from your most critical data source, and evaluate the experience with your actual data. The catalog that feels most intuitive to your team is the right choice.
Keeping Your Catalog Updated
A data catalog is only useful if its metadata is current. Set up automated ingestion pipelines that run on a schedule, for example from cron or an Airflow DAG that invokes each platform's ingestion CLI.
Monitor ingestion health through the platform’s built-in dashboards or by scraping ingestion logs into your existing monitoring stack (Prometheus, Grafana). Set up alerts for failed ingestion runs so metadata gaps are caught early rather than discovered by frustrated analysts.
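One simple way to make failed runs visible is to wrap the ingestion command in a small script that a scheduler calls. The sketch below is one possible shape, using only the standard library. The function name `run_ingestion`, the logger name, and the one-hour timeout are assumptions, and the command passed in is whatever ingestion CLI invocation you already use.

```python
import logging
import subprocess
import sys

log = logging.getLogger("catalog-ingest")


def run_ingestion(command: list, timeout_s: int = 3600) -> bool:
    """Run one ingestion command; return True on success, log details on failure."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Command not found, not executable, or still running after the timeout.
        log.error("ingestion could not complete: %s", exc)
        return False
    if result.returncode != 0:
        # Keep only the tail of stderr so alerts stay readable.
        log.error("ingestion failed (exit %d): %s",
                  result.returncode, result.stderr[-2000:])
        return False
    log.info("ingestion succeeded")
    return True
```

A scheduled script can call `run_ingestion(["datahub", "ingest", "-c", "recipe.yml"])` and exit non-zero when it returns `False`, so that cron mail, an Airflow task failure, or a log-based alert picks up the problem.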
Frequently Asked Questions (FAQ)
Which one should I choose in 2026?
The best choice depends on your specific requirements:
- For beginners: Start with the simplest option that covers your core use case
- For production: Choose the solution with the most active community and documentation
- For teams: Look for collaboration features and user management
- For privacy: Prefer fully open-source, self-hosted options with no telemetry
Refer to the comparison table above for detailed feature breakdowns.
Can I migrate between these tools?
Most tools support data import/export. Always:
- Back up your current data
- Test the migration on a staging environment
- Check official migration guides in the documentation
Are there free versions available?
All tools in this guide offer free, open-source editions. Some also provide paid plans with additional features, priority support, or managed hosting.
How do I get started?
- Review the comparison table to identify your requirements
- Visit each project's official documentation
- Start with a Docker Compose setup for easy testing
- Join the community forums for troubleshooting