Data contracts define the schema, quality expectations, and SLAs between data producers and consumers. As organizations scale their data pipelines, informal agreements about data structure break down — column types change without notice, null values appear in non-nullable fields, and downstream dashboards silently return wrong results. This guide compares three self-hosted tools for defining, validating, and enforcing data contracts across your data infrastructure.
What Are Data Contracts?
A data contract is a formal agreement between a data producer (e.g., an application team writing to a database) and data consumers (e.g., analytics teams building dashboards). It specifies:
- Schema: Column names, types, constraints, and allowed values
- Quality rules: Null thresholds, freshness requirements, uniqueness guarantees
- Terms: Ownership, update frequency, deprecation policies
- SLAs: Maximum acceptable downtime, data freshness targets
Without data contracts, a producer team changing a column type from INT to VARCHAR can break dozens of downstream pipelines before anyone notices.
Comparison Overview
| Feature | Data Contract CLI | Soda Core | dbt Tests |
|---|---|---|---|
| Contract definition format | YAML | YAML + SodaCL | YAML (dbt schema.yml) |
| Schema validation | Yes (JSON Schema, SQL DDL) | Yes (via checks) | Yes (test macros) |
| Data quality checks | Built-in | Extensive (SodaCL) | Built-in + custom SQL |
| Freshness monitoring | Yes | Yes | Yes (source freshness) |
| CI/CD integration | GitHub Actions, GitLab CI | GitHub Actions, any CI runner | Native (dbt Cloud/CI) |
| Alerting | CLI exit codes, webhooks | Soda Cloud, webhooks | dbt Slack/Discord hooks |
| Supported databases | PostgreSQL, MySQL, BigQuery, Snowflake, DuckDB | 15+ data platforms | Any dbt-supported adapter |
| Open source | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| GitHub Stars | 870+ | 2,300+ | 8,500+ (dbt-core) |
Data Contract CLI: Contract-First Data Engineering
Data Contract CLI (870+ GitHub stars) is a purpose-built tool for defining and validating data contracts. It uses a YAML-based contract format that describes schemas, quality rules, and service level agreements, then validates actual data against those contracts.
Contract Definition
A data contract in YAML format:
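The sketch below follows the Data Contract Specification format that the CLI consumes. The dataset, server, owner, and field names are illustrative, and individual options should be checked against the current specification:

```yaml
# orders.datacontract.yaml — illustrative contract sketch
dataContractSpecification: 1.1.0
id: orders-contract
info:
  title: Orders
  version: 1.0.0
  owner: checkout-team
servers:
  production:
    type: postgres
    host: db.internal       # hostname is illustrative
    port: 5432
    database: shop
    schema: public
models:
  orders:
    type: table
    fields:
      order_id:
        type: bigint
        required: true
        unique: true
      customer_id:
        type: bigint
        required: true
      status:
        type: string
        enum: [pending, shipped, delivered, cancelled]
      created_at:
        type: timestamp
        required: true
servicelevels:
  freshness:
    threshold: 24h
    timestampField: orders.created_at
```

The contract is database-agnostic: the `servers` block binds the same model definition to a concrete backend at validation time.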
Docker Compose for CI/CD Validation
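A minimal Compose sketch that runs the CLI against a contract checked into the repository. The `datacontract/cli` image tag, mount path, and credential variable names are assumptions to adapt to your setup:

```yaml
# docker-compose.yml — illustrative validation service
services:
  datacontract:
    image: datacontract/cli:latest
    volumes:
      - ./contracts:/home/datacontract
    environment:
      # Credential variables are assumptions; see the CLI docs for
      # the exact names your server type expects.
      DATACONTRACT_POSTGRES_USERNAME: analytics
      DATACONTRACT_POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    command: test orders.datacontract.yaml
```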
CI/CD Pipeline Integration
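An illustrative GitHub Actions workflow that lints and tests contracts on every pull request that touches them. The `postgres` extra, file paths, and secret name are assumptions:

```yaml
# .github/workflows/datacontract.yml — sketch
name: Validate data contracts
on:
  pull_request:
    paths:
      - "contracts/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install 'datacontract-cli[postgres]'
      # Fail fast on syntax errors, then validate against live data
      - run: datacontract lint contracts/orders.datacontract.yaml
      - run: datacontract test contracts/orders.datacontract.yaml
        env:
          DATACONTRACT_POSTGRES_PASSWORD: ${{ secrets.POSTGRES_PASSWORD }}
```

A non-zero exit code from `datacontract test` fails the job, blocking the merge until producer and consumers agree on the change.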
Soda Core: Comprehensive Data Quality Engine
Soda Core (2,300+ GitHub stars) is a data quality engine that scans datasets for anomalies, schema changes, and quality violations. While Soda started as a cloud service, Soda Core is fully open source and runs entirely self-hosted.
SodaCL (Soda Checks Language)
SodaCL provides a declarative language for expressing data quality checks:
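For example, checks for a hypothetical `orders` dataset might look like the following (dataset and column names are illustrative):

```yaml
# checks.yml — SodaCL checks for an illustrative orders dataset
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
  - missing_percent(customer_id) < 1%
  - freshness(created_at) < 1d
  - invalid_count(status) = 0:
      valid values: [pending, shipped, delivered, cancelled]
  - schema:
      fail:
        when required column missing: [order_id, customer_id, created_at]
        when wrong column type:
          order_id: bigint
```

The `schema` check is the contract-enforcement piece: it fails the scan when columns disappear or change type, catching exactly the INT-to-VARCHAR scenario described earlier.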
Docker Compose for Soda Core
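A sketch of a Compose service that runs a Soda scan in CI; the `sodadata/soda-core` image name and mount paths are assumptions to verify against Soda's published images:

```yaml
# docker-compose.yml — illustrative Soda Core scan service
services:
  soda:
    image: sodadata/soda-core:latest
    volumes:
      - ./soda:/sodacl       # configuration.yml and checks.yml live here
    working_dir: /sodacl
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    command: scan -d orders_db -c configuration.yml checks.yml
```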
Soda Configuration
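Data source connections live in a `configuration.yml`. A Postgres example, with illustrative host and credentials (Soda resolves `${VAR}` references from the environment):

```yaml
# configuration.yml — Postgres data source sketch
data_source orders_db:
  type: postgres
  host: db.internal
  port: 5432
  username: analytics
  password: ${POSTGRES_PASSWORD}
  database: shop
  schema: public
```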
Running Soda Scans
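A scan points the engine at one data source and one or more check files; flag spellings should be confirmed against `soda scan --help` for your installed version:

```shell
# Validate the checks in checks.yml against the orders_db data source.
# The command exits non-zero when any check fails, so CI jobs can gate on it.
soda scan -d orders_db -c configuration.yml checks.yml

# Increase verbosity when debugging a failing check
soda scan -d orders_db -c configuration.yml -V checks.yml
```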
dbt Tests: Transformation-Native Validation
dbt (data build tool) includes built-in testing capabilities that serve as lightweight data contracts. While dbt is primarily a transformation tool, its test framework validates data quality as part of the ELT pipeline.
dbt Schema and Test Definitions
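A sketch of a `schema.yml` combining column tests, an enforced model contract (available in dbt 1.5+), and source freshness; model, source, and column names are illustrative:

```yaml
# models/schema.yml — illustrative tests and contract
version: 2

models:
  - name: orders
    config:
      contract:
        enforced: true    # dbt fails the build if the model's types drift
    columns:
      - name: order_id
        data_type: bigint
        tests:
          - unique
          - not_null
      - name: status
        data_type: varchar
        tests:
          - accepted_values:
              values: [pending, shipped, delivered, cancelled]
      - name: customer_id
        data_type: bigint
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id

sources:
  - name: shop
    tables:
      - name: raw_orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```

With `contract: enforced`, dbt compares the declared `data_type` of each column against what the model actually produces, which makes the schema portion of the contract part of every build.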
Docker Compose for dbt
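An illustrative Compose service using dbt Labs' published adapter images; the image tag, adapter, and mount paths are assumptions to adapt to your project layout:

```yaml
# docker-compose.yml — illustrative dbt test runner
services:
  dbt:
    image: ghcr.io/dbt-labs/dbt-postgres:1.8.latest
    volumes:
      - ./dbt_project:/usr/app        # dbt_project.yml and models/
      - ./profiles:/root/.dbt         # profiles.yml with connection details
    working_dir: /usr/app
    command: test
```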
Running dbt Tests
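The core commands, assuming a working dbt project and profile:

```shell
# Run every test defined in schema.yml files
dbt test

# Test a single model and everything downstream of it
dbt test --select orders+

# Check source freshness against loaded_at_field thresholds
dbt source freshness

# Build models and run their tests together, in dependency order,
# so a failing test stops downstream models from building
dbt build
```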
When to Choose Each Tool
Choose Data Contract CLI when:
- You want contract-first data engineering with formal YAML specifications
- You need to share contracts between teams as version-controlled artifacts
- Your focus is on schema validation and freshness monitoring
- You want database-agnostic contracts that work across PostgreSQL, MySQL, and cloud warehouses
Choose Soda Core when:
- You need comprehensive data quality checks with a mature expression language
- Your priority is detecting data anomalies and quality regressions
- You want to scan data without transforming it (separate from dbt)
- You operate across 15+ data platforms and need a unified scanning layer
Choose dbt Tests when:
- You already use dbt for data transformations
- You want tests integrated into your transformation pipeline
- Your team prefers SQL-based test definitions
- You need source freshness monitoring as part of your ELT pipeline
For related data engineering topics, see our data pipeline orchestration guide, data quality tools comparison, and data observability platforms.
Why Self-Host Data Contract Validation?
Cloud-based data quality and contract platforms charge per scan, per user, or per data volume — costs that scale unfavorably as your data grows. Self-hosted data contract validation runs entirely within your infrastructure, processing unlimited scans at the cost of compute resources alone.
For organizations with strict data governance requirements (HIPAA, GDPR, SOC 2), self-hosted validation ensures contract definitions and quality results never leave your network. The YAML-based contract format serves as living documentation that version control systems track, enabling code review workflows for data schema changes.
The shift-left approach — validating data contracts in CI/CD before data reaches production — prevents quality issues from reaching downstream consumers. A single broken pipeline can affect dozens of dashboards, ML models, and business reports. Contract validation catches these issues at the source.
FAQ
What is the difference between data contracts and data quality checks?
Data contracts are formal agreements that define what data consumers can expect — schema, quality thresholds, freshness, and ownership terms. Data quality checks are the technical implementation that verifies whether data meets those expectations. A data contract specifies "order_id must be unique"; a quality check executes `SELECT COUNT(*) - COUNT(DISTINCT order_id)` to verify it.
Can I use Data Contract CLI without a database connection?
Yes. Data Contract CLI supports linting and schema validation without connecting to a database. Run datacontract lint contract.yaml to validate the contract syntax and structure. You can also export contracts to SQL DDL, JSON Schema, or Avro format for offline use.
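For instance (the `--format` values shown are assumptions to confirm against `datacontract export --help`):

```shell
# Validate contract syntax and structure offline — no database needed
datacontract lint contract.yaml

# Export the schema for use by other tools
datacontract export --format sql contract.yaml
datacontract export --format jsonschema contract.yaml
datacontract export --format avro contract.yaml
```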
How often should data contract validations run?
For production data pipelines, run validations on every pipeline execution (event-driven). For scheduled batch pipelines, run after each batch completes. Additionally, schedule periodic contract reviews (weekly or monthly) to catch drift between contract definitions and actual data evolution.
Can Soda Core and dbt Tests be used together?
Yes, they complement each other well. dbt tests validate data as part of the transformation pipeline (testing output of transformations), while Soda Core scans source data independently of transformations. Together they provide end-to-end coverage: Soda validates raw source data quality, dbt tests validate transformed data correctness.
How do I handle breaking changes to data contracts?
Version your contracts alongside your code. When a producer team needs to make a breaking change:
- Create a new contract version (v2) alongside the existing one (v1)
- Run both contracts in parallel during a transition period
- Notify all consumers of the upcoming change
- After consumers migrate, deprecate v1 and remove validation
- Update the contract specification version in the YAML file
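During the transition period, the two versions can live side by side as separate files; the `status` field shown here is an illustrative convention for signaling deprecation to consumers:

```yaml
# contracts/orders.v1.yaml — kept running while consumers migrate
info:
  title: Orders
  version: 1.4.2
  status: deprecated
```

```yaml
# contracts/orders.v2.yaml — the breaking change lands here
info:
  title: Orders
  version: 2.0.0
  status: active
```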
What databases does Data Contract CLI support?
Data Contract CLI supports PostgreSQL, MySQL, BigQuery, Snowflake, DuckDB, and any database with a SQLAlchemy driver. It also supports file-based formats like Parquet, CSV, and JSON for local testing and development. The contract definition is database-agnostic — the same YAML contract works across different database backends.