← Back to posts
comparison guide self-hosted · · 13 min read

DVC vs LakeFS vs Pachyderm: Best Self-Hosted Data Versioning 2026

Complete guide to self-hosted data versioning tools in 2026. Compare DVC, LakeFS, and Pachyderm with hands-on setup instructions, feature comparison, and real-world use cases.

OS
Editorial Team

Managing data in modern projects — whether for machine learning pipelines, analytics, or ETL workflows — is notoriously difficult. Code has Git. Data has a completely different set of challenges: large files, binary formats, slow transfers, and the need to reproduce exact dataset states months later. That is where data versioning tools come in.

This guide compares three leading open-source solutions for self-hosted data versioning: DVC (Data Version Control), LakeFS, and Pachyderm. We will walk through installation, configuration, key features, and help you pick the right tool for your use case.

Why Self-Host Your Data Versioning

Storing data alongside your code repository seems convenient until your dataset exceeds a few megabytes. Git was never designed for large binary files. Git LFS helps, but it still struggles with datasets in the gigabyte or terabyte range, and it ties your data storage to your code hosting provider.

Self-hosting your data versioning infrastructure gives you several advantages that managed services cannot match:

  • Full data sovereignty — your data never leaves your infrastructure. This is critical for regulated industries, healthcare, and finance.
  • No bandwidth limits — managed data platforms charge by storage and egress. Self-hosting means you control the cost.
  • Custom storage backends — connect to any S3-compatible object store, NFS share, or local disk without vendor lock-in.
  • Deep integration — wire data versioning directly into your CI/CD pipelines, internal tools, and existing infrastructure.
  • Audit trails — keep complete logs of who changed what data, when, and why. Essential for compliance.
  • Reproducibility — pin exact dataset states to experiments, reports, or production models. Anyone can recreate the results.

If you work with datasets larger than a few hundred megabytes, or need to track data changes over time, a dedicated data versioning tool pays for itself quickly.

DVC: Git for Data

DVC is the most widely adopted open-source data versioning tool. It treats data like Git treats code: you get branches, commits, and diffs, but the actual data files live in external storage (S3, GCS, local disk, SSH, or any S3-compatible backend).

Key Features

  • Git-native workflow — DVC sits on top of Git and uses .dvc metafiles tracked in your repository
  • Lightweight and simple — no server infrastructure required for basic usage
  • Experiment tracking — built-in experiment management with hyperparameter logging
  • Pipeline orchestration — define multi-stage data pipelines in dvc.yaml
  • Remote storage support — S3, GCS, Azure Blob, HDFS, SSH, webdav, local
  • Data sharing — dvc pull and dvc push for team collaboration

Installation

Install DVC via pip, Homebrew, or your package manager:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
# Python (recommended)
pip install dvc[s3,gs,azure]

# macOS
brew install dvc

# With specific remote support
pip install dvc[s3]  # S3 only
pip install dvc[gs]  # Google Cloud Storage
pip install dvc[all] # All remotes

Self-Hosted Setup with Local S3 Backend

For a fully self-hosted setup, combine DVC with MinIO (S3-compatible object storage):

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# 1. Start MinIO
[docker](https://www.docker.com/) run -d --name minio-dvc \
  -p 9000:9000 -p 9001:9001 \
  -v /data/dvc-storage:/data \
  -e MINIO_ROOT_USER=dvcadmin \
  -e MINIO_ROOT_PASSWORD=dvc-secret-password \
  quay.io/minio/minio server /data --console-address ":9001"

# 2. Create the bucket
mc alias set local http://localhost:9000 dvcadmin dvc-secret-password
mc mb local/dvc-remote

# 3. Configure DVC in your project
mkdir my-project && cd my-project
git init
dvc init

# 4. Add the remote storage
dvc remote add -d minio http://localhost:9000/dvc-remote
dvc remote modify minio access_key_id dvcadmin
dvc remote modify minio secret_access_key dvc-secret-password

# 5. Track a dataset
cp ~/data/training.csv .
dvc add training.csv
git add training.csv.dvc .dvc/.gitignore
git commit -m "Add training dataset v1"

# 6. Push to remote
dvc push

Defining a Pipeline

DVC pipelines let you chain data processing steps with automatic dependency tracking:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# dvc.yaml
stages:
  ingest:
    cmd: python ingest.py data/raw/ data/processed/
    deps:
      - data/raw/
      - ingest.py
    outs:
      - data/processed/

  train:
    cmd: python train.py --data data/processed/ --model models/model.pkl
    deps:
      - data/processed/
      - train.py
    outs:
      - models/model.pkl

  evaluate:
    cmd: python evaluate.py models/model.pkl metrics.json
    deps:
      - models/model.pkl
      - evaluate.py
    metrics:
      - metrics.json:
          cache: false

Run the full pipeline:

1
dvc repro

DVC automatically skips stages whose inputs have not changed, making incremental runs fast.

Pros and Cons

Pros:

  • Zero server overhead for basic usage
  • Integrates seamlessly with existing Git workflows
  • Large community and extensive documentation
  • Supports virtually any storage backend
  • Experiment tracking is built in
  • Free and open source (Apache 2.0)

Cons:

  • No built-in access control or multi-tenant support
  • Data operations can be slow with millions of small files
  • No SQL-like querying over data
  • Collaboration requires shared remote storage setup
  • No data catalog or metadata management

LakeFS: Git for Data Lakes

LakeFS takes a different approach. Instead of sitting on top of Git, it provides a Git-like versioning layer directly on object storage. You get branches, commits, merges, and rollbacks — but the data lives in S3 (or compatible storage) and is accessed through a familiar S3 API.

Key Features

  • Zero-copy branching — branches are instant, regardless of dataset size
  • S3-compatible API — existing tools (Spark, Pandas, Trino, Presto) work without modification
  • Atomic commits — multi-file commits that are all-or-nothing
  • Garbage collection — reclaim storage from deleted or unreferenced data
  • Access control — fine-grained policies for teams
  • Webhook-based hooks — run validation or transformation on commit
  • Metadata search — query data by commit metadata

Self-Hosted Installation

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# 1. Create a docker-compose.yml
cat > docker-compose.yml << 'EOF'
services:
  lakefs:
    image: treeverse/lakefs:latest
    ports:
      - "8000:8000"
    environment:
      - LAKEFS_DATABASE_TYPE=local
      - LAKEFS_AUTH_ENCRYPT_SECRET_KEY=your-encryption-secret-key
      - LAKEFS_BLOCKSTORE_TYPE=s3
      - LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://minio:9000
      - LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
      - LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=lakefsadmin
      - LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=lakefs-secret
    depends_on:
      - minio

  minio:
    image: quay.io/minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=lakefsadmin
      - MINIO_ROOT_PASSWORD=lakefs-secret
    volumes:
      - lakefs-data:/data

  setup:
    image: minio/mc:latest
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c "
      mc alias set local http://minio:9000 lakefsadmin lakefs-secret;
      mc mb local/lakefs-bucket;
      exit 0;
      "

volumes:
  lakefs-data:
EOF

# 2. Start everything
docker compose up -d

# 3. Access the UI at http://localhost:8000
# Default credentials: access_key_id = AKIAIOSFODNN7EXAMPLE
#                      secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Creating Branches and Commits

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# Install the lakeFS CLI
pip install lakefs-sdk

# Using Python SDK
import lakefs

# Connect to your self-hosted instance
client = lakefs.LakeFS(
    endpoint="http://localhost:8000",
    username="AKIAIOSFODNN7EXAMPLE",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
)

# Create a repository (backed by your S3 bucket)
repo = client.repositories.create("my-data-lake", "s3://lakefs-bucket/")

# Create a branch from main
branch = repo.branches.create("experiment/new-features", source_ref="main")

# Upload data (zero-copy — no data is duplicated)
branch.upload("data/raw/users.parquet", "/local/path/users.parquet")

# Commit the changes
branch.commit("Add user dataset with new feature columns",
              metadata={"author": "data-team", "pipeline": "v2"})

# Merge back to main
repo.refs.merge("experiment/new-features", "main",
                message="Merge new feature columns")

Integration with Spark and Pandas

Since lakeFS exposes an S3-compatible API, your existing code works with minimal changes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
import pandas as pd

# Read from a specific lakeFS branch
df = pd.read_parquet(
    "s3a://my-data-lake/main/data/raw/users.parquet",
    storage_options={
        "key": "AKIAIOSFODNN7EXAMPLE",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "client_kwargs": {"endpoint_url": "http://localhost:8000"}
    }
)

# Read from a specific commit (reproducible snapshot)
df_v2 = pd.read_parquet(
    "s3a://my-data-lake/abc123def456/data/raw/users.parquet",
    storage_options={
        "key": "AKIAIOSFODNN7EXAMPLE",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "client_kwargs": {"endpoint_url": "http://localhost:8000"}
    }
)

Pros and Cons

Pros:

  • Zero-copy branching — instant regardless of data size
  • S3-compatible — no tool changes required
  • Built-in access control and policies
  • Garbage collection for storage management
  • Web UI for browsing and managing repositories
  • Works with Spark, Trino, Presto, DuckDB, Pandas out of the box
  • Open source (Apache 2.0)

Cons:

  • Requires server infrastructure (Docker container)
  • Tied to S3 or S3-compatible storage
  • No pipeline orchestration built in
  • Steeper learning curve than DVC
  • Garbage collection requires careful configuration

Pachyderm: Data Pipelines with Provenance

Pachyderm takes the most opinionated approach. It combines data versioning with pipeline orchestration — every pipeline stage automatically tracks its inputs, outputs, and the code that produced themkubernetes it as “DVC plus Kubernetes-native pipelines.”

Key Features

  • Automatic provenance — every output file traces back to exact input data and code versions
  • Kubernetes-native — runs on your existing K8s cluster
  • Content-addressed storage — data is deduplicated automatically
  • Incremental processing — only reprocess changed data
  • Cron and event triggers — schedule or trigger pipelines on data changes
  • Spouts — streaming data ingestion
  • Enterprise features — RBAC, audit logs, SSO (commercial edition)

Self-Hosted Installation

Pachyderm runs on Kubernetes. For local development, use the Docker-based version:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
# 1. Install the CLI
curl -fsSL https://raw.githubusercontent.com/pachyderm/pachyderm/main/etc/installer/get-pachyderm.sh | sh
sudo mv pachctl /usr/local/bin

# 2. Start Pachyderm locally (Docker-based)
pachctl deploy local

# Wait for the cluster to be ready
pachctl version

# 3. Or deploy to your Kubernetes cluster
pachctl deploy k8s \
  --namespace pachyderm \
  --object-store s3 \
  --s3-endpoint http://minio:9000 \
  --s3-bucket pachyderm-data \
  --s3-access-key-id pachadmin \
  --s3-secret-access-key pach-secret \
  --disk-size 50Gi

Creating a Data Pipeline

Pachyderm pipelines are defined in JSON or YAML:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
{
  "pipeline": {
    "name": "etl-pipeline"
  },
  "description": "Process raw data into cleaned features",
  "transform": {
    "image": "python:3.11-slim",
    "cmd": ["bash"],
    "stdin": [
      "pip install pandas pyarrow",
      "python /pfs/code/transform.py /pfs/raw/ /pfs/out/"
    ]
  },
  "input": {
    "cross": [
      {
        "pfs": {
          "repo": "raw-data",
          "glob": "/*"
        }
      },
      {
        "pfs": {
          "repo": "code",
          "glob": "/*"
        }
      }
    ]
  }
}

Deploy and watch:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# Create the pipeline
pachctl create pipeline -f etl-pipeline.json

# Upload data (triggers the pipeline automatically)
pachctl put file raw-data@master:/dataset.csv -f ./dataset.csv

# Watch pipeline processing
pachctl list job

# Inspect output
pachctl get file etl-pipeline@master:/result.parquet > result.parquet

# Check provenance — see exactly what produced this output
pachctl inspect file etl-pipeline@master:/result.parquet --history

Data Branching in Pachyderm

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
# Create a branch
pachctl create branch raw-data@dev

# Upload to the dev branch (does not trigger production pipeline)
pachctl put file raw-data@dev:/dataset.csv -f ./new_dataset.csv

# Merge when ready
pachctl create branch raw-data@master -f raw-data@dev

# Rollback to a previous version
pachctl create branch raw-data@master \
  -f raw-data@master@v1.2.0

Pros and Cons

Pros:

  • Automatic provenance tracking — unmatched for debugging and compliance
  • Kubernetes-native — scales with your cluster
  • Incremental processing saves compute
  • Pipelines trigger automatically on data changes
  • Content-addressed storage deduplicates data
  • Strong data lineage for regulatory compliance
  • Open source core (Apache 2.0)

Cons:

  • Kubernetes dependency — significant operational overhead
  • Complex setup compared to DVC
  • Learning curve for Pachyderm concepts
  • No standalone mode (always needs K8s or Docker emulation)
  • Enterprise features locked behind commercial license
  • Resource-intensive for small projects

Comparison Table

FeatureDVCLakeFSPachyderm
Core modelGit overlayGit on object storageK8s data pipelines
Versioning.dvc metafiles in GitBranches/commits on S3Content-addressed repos
Zero-copy branchesNoYesYes
Pipeline orchestrationYes (dvc.yaml)NoYes (automatic)
Infrastructure requiredNone (client-only)Docker containerKubernetes cluster
S3 compatibilityClient-sideNative APIStorage backend
Access controlNo (use storage ACLs)Yes (built-in RBAC)Yes (RBAC, commercial)
Provenance trackingManual (experiments)Via commit metadataAutomatic, built-in
Incremental processingYes (stage caching)NoYes (automatic)
Storage backendsS3, GCS, Azure, SSH, local, webdav, HDFSS3-compatible onlyS3, GCS, Azure
Experiment trackingBuilt-inNoNo
Web UIDVC Studio (cloud)Yes (self-hosted)Console (self-hosted)
Best forIndividuals, ML teamsData lakes, analyticsProduction data pipelines
LicenseApache 2.0Apache 2.0Apache 2.0

Choosing the Right Tool

Choose DVC if:

  • You want the simplest possible setup with no server infrastructure
  • Your team already uses Git and wants a familiar workflow
  • You need experiment tracking alongside data versioning
  • You work primarily with Python and ML frameworks
  • Your datasets fit within a single storage backend
  • You prefer a client-only tool with minimal operational overhead

DVC is the best starting point for most teams. It is easy to adopt, integrates with existing workflows, and has the largest community.

Choose LakeFS if:

  • You manage a data lake with multiple teams and datasets
  • You need zero-copy branching for large datasets (terabytes+)
  • You want S3 compatibility so existing tools work without changes
  • You need access control and multi-tenant support
  • Your workflow involves Spark, Trino, DuckDB, or other SQL engines
  • You want a web UI for browsing and managing data repositories

LakeFS shines when you have large datasets and multiple consumers. The S3-compatible API means you do not need to rewrite existing code.

Choose Pachyderm if:

  • You need automatic provenance tracking for compliance
  • You already run Kubernetes and want native integration
  • Your pipelines must trigger automatically on data changes
  • You need strong data lineage for regulatory requirements
  • You process data incrementally and want to save compute
  • Your team builds production-grade data pipelines

Pachyderm is the most powerful but also the most complex. It is the right choice when data provenance and pipeline automation are critical.

Combining Tools

These tools are not mutually exclusive. Common combinations include:

  • DVC + LakeFS: Use DVC for experiment tracking and ML workflows, with LakeFS as the remote storage backend. DVC pushes to a LakeFS branch, giving you both experiment management and zero-copy branching.
  • DVC + Pachyderm: Use DVC locally for development and experiment tracking, then deploy to Pachyderm for production pipeline execution.
  • LakeFS + Pachyderm: Use LakeFS as the storage layer for Pachyderm, combining LakeFS branching with Pachyderm pipeline orchestration.

Conclusion

Data versioning is no longer optional for teams working with large datasets. The right tool depends on your infrastructure, team size, and workflow requirements:

  • DVC is the easiest to adopt and works for most ML teams starting their data versioning journey.
  • LakeFS provides the most flexible versioning layer for data lakes with S3-compatible access.
  • Pachyderm delivers the strongest provenance tracking and pipeline automation for production workloads.

All three are open source, self-hostable, and free to use. The best approach is to start simple — try DVC first — and graduate to LakeFS or Pachyderm as your needs grow.

Frequently Asked Questions (FAQ)

Which one should I choose in 2026?

The best choice depends on your specific requirements:

  • For beginners: Start with the simplest option that covers your core use case
  • For production: Choose the solution with the most active community and documentation
  • For teams: Look for collaboration features and user management
  • For privacy: Prefer fully open-source, self-hosted options with no telemetry

Refer to the comparison table above for detailed feature breakdowns.

Can I migrate between these tools?

Most tools support data import/export. Always:

  1. Backup your current data
  2. Test the migration on a staging environment
  3. Check official migration guides in the documentation

Are there free versions available?

All tools in this guide offer free, open-source editions. Some also provide paid plans with additional features, priority support, or managed hosting.

How do I get started?

  1. Review the comparison table to identify your requirements
  2. Visit the official documentation (links provided above)
  3. Start with a Docker Compose setup for easy testing
  4. Join the community forums for troubleshooting
Advertise here