When a single GPU can no longer keep up with model training workloads, distributed training becomes essential. Whether you’re running large-scale recommendation systems, computer vision pipelines, or natural language processing models on a self-hosted GPU cluster, choosing the right distributed training framework determines your hardware utilization, training speed, and operational complexity.

This guide compares the three most widely adopted self-hosted distributed training solutions: Uber Horovod, Microsoft DeepSpeed, and PyTorch Fully Sharded Data Parallel (FSDP). We’ll cover architecture differences, provide Docker deployment configurations, and help you choose the right approach for your training infrastructure.

For model experiment tracking with these frameworks, see our ML experiment tracking guide. If you need model versioning alongside training, our model registry comparison covers the options. And for GPU hardware monitoring during training runs, check our GPU monitoring tools guide.

Why Distributed Training Matters

Training modern models on a single GPU can take weeks or months. Distributed training splits the workload across multiple GPUs — potentially across multiple machines — reducing training time from weeks to hours. The challenge is that different frameworks approach distribution in fundamentally different ways, each with trade-offs in memory efficiency, communication overhead, and code complexity.

The three approaches covered here represent distinct philosophies:

  • Horovod — Data parallelism with ring-allreduce communication, minimal code changes
  • DeepSpeed — Hybrid parallelism (data, tensor, pipeline, ZeRO) for extreme-scale training
  • PyTorch FSDP — Native sharded data parallelism built into PyTorch core

Horovod: Ring-Allreduce Data Parallelism

Horovod was created by Uber to make distributed training straightforward. It uses the ring-allreduce algorithm for gradient synchronization and works with TensorFlow, PyTorch, Keras, and MXNet.

Architecture

Horovod wraps the training loop so that each GPU processes a different mini-batch, computes local gradients, and then averages gradients across all workers using NCCL (NVIDIA Collective Communications Library). Every worker ends up with identical updated parameters.
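
To make that flow concrete, here is a minimal PyTorch-flavored sketch using Horovod's torch API. The model, dataset, and hyperparameters are placeholders for illustration, not from an actual Horovod example:

import torch
import horovod.torch as hvd

hvd.init()                                     # one process per GPU
torch.cuda.set_device(hvd.local_rank())        # pin this process to its local GPU

# Toy model and dataset; replace with your own
model = torch.nn.Linear(10, 1).cuda()
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

# Each worker reads a different shard of the data
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across workers via ring-allreduce (NCCL)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)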

# Install Horovod with NCCL GPU operations (TensorFlow or PyTorch must already be installed)
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow]
# or for PyTorch
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]

Docker Deployment

# docker-compose-horovod.yml
version: "3.8"

services:
  horovod-trainer:
    image: horovod/horovod:latest-gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./training-scripts:/workspace
      - /data/datasets:/datasets:ro
    working_dir: /workspace
    command: >
      horovodrun -np 4 -H localhost:4
      python train.py --epochs 100 --batch-size 256
    environment:
      - NCCL_DEBUG=INFO
      - HOROVOD_GPU_ALLREDUCE=NCCL
      - HOROVOD_FUSION_THRESHOLD=16777216
    shm_size: "8g"

Code Integration

Horovod requires minimal code changes — typically 5-10 lines:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin GPU to local process
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Scale learning rate by number of workers
opt = tf.keras.optimizers.SGD(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
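
One step the snippet above omits is broadcasting the initial state from rank 0 so every worker starts from identical weights. A hedged sketch with horovod.tensorflow, assuming a Keras model object named model alongside the opt defined above:

# After building the model (or after the first training step), synchronize state
hvd.broadcast_variables(model.variables, root_rank=0)
hvd.broadcast_variables(opt.variables(), root_rank=0)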

Key Strengths

  • Minimal code modifications required
  • Excellent scaling efficiency with NCCL ring-allreduce
  • Multi-framework support (TensorFlow, PyTorch, MXNet)
  • Automatic gradient compression and fusion
  • Elastic training support for dynamic worker scaling

DeepSpeed: ZeRO-Powered Extreme-Scale Training

DeepSpeed, developed by Microsoft Research, goes far beyond simple data parallelism. Its ZeRO (Zero Redundancy Optimizer) technology partitions optimizer states, gradients, and parameters across GPUs, enabling training of models with hundreds of billions of parameters.

Architecture

DeepSpeed implements a 3D parallelism strategy:

  1. ZeRO Data Parallelism — Partitions optimizer states, gradients, and parameters
  2. Tensor Parallelism — Splits individual layers across GPUs
  3. Pipeline Parallelism — Splits model layers across GPU stages

Docker Deployment

# docker-compose-deepspeed.yml
version: "3.8"

services:
  deepspeed-trainer:
    image: deepspeed/deepspeed:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./training-scripts:/workspace
      - /data/datasets:/datasets:ro
    working_dir: /workspace
    command: >
      deepspeed --num_gpus=8
      train.py
      --deepspeed
      --deepspeed_config ds_config.json
    environment:
      - NCCL_DEBUG=INFO
      - OMP_NUM_THREADS=4
      - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    shm_size: "16g"

DeepSpeed Configuration (ZeRO Stage 3)

{
  "train_batch_size": 2048,
  "gradient_accumulation_steps": 16,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 1e6,
    "stage3_prefetch_bucket_size": 0.94e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "gradient_clipping": 1.0,
  "steps_per_print": 100,
  "wall_clock_breakdown": false
}
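
In the training script, this config is consumed through deepspeed.initialize(), which returns an engine that owns the backward pass and optimizer step. A minimal sketch, assuming the script is launched with the deepspeed launcher shown above; the toy model and data loader are placeholders:

import torch
import deepspeed

# Toy model and data; replace with your real model and loader
model = torch.nn.Linear(1024, 1024)
data = torch.utils.data.TensorDataset(torch.randn(256, 1024), torch.randn(256, 1024))
loader = torch.utils.data.DataLoader(data, batch_size=8)

# deepspeed.initialize wires up ZeRO, FP16, and offloading from the JSON config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for inputs, labels in loader:
    # fp16 is enabled in ds_config.json, so cast inputs and targets to half precision
    inputs = inputs.to(model_engine.local_rank).half()
    labels = labels.to(model_engine.local_rank).half()
    loss = torch.nn.functional.mse_loss(model_engine(inputs), labels)
    model_engine.backward(loss)   # DeepSpeed applies loss scaling internally
    model_engine.step()           # optimizer step plus gradient zeroing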

Key Strengths

  • ZeRO-3 enables training models that don’t fit in a single GPU’s memory
  • CPU offloading for optimizer states and parameters
  • ZeRO-Infinity extends offloading to NVMe storage
  • Mixed precision (FP16, BF16) with automatic loss scaling
  • Built-in communication compression (1-bit Adam, 1-bit LAMB)
  • Activation checkpointing for memory savings

PyTorch FSDP: Native Sharded Data Parallelism

PyTorch Fully Sharded Data Parallel (FSDP) is PyTorch’s native answer to distributed training, integrated directly into the torch.distributed package since PyTorch 1.11. It implements the same core concept as DeepSpeed’s ZeRO-3 but as a first-class PyTorch feature.

Architecture

FSDP shards model parameters, gradients, and optimizer states across all participating processes. During the forward pass, each process fetches only the parameter shards it needs, computes, and then discards them. This means the full model never needs to fit in any single GPU’s memory.

Docker Deployment

# docker-compose-fsdp.yml
version: "3.8"

services:
  fsdp-trainer:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./training-scripts:/workspace
      - /data/datasets:/datasets:ro
    working_dir: /workspace
    command: >
      torchrun --nproc_per_node=8 --nnodes=1
      train_fsdp.py --batch-size 128 --epochs 50
    environment:
      - NCCL_DEBUG=INFO
      - TORCH_DISTRIBUTED_DEBUG=DETAIL
      - OMP_NUM_THREADS=4
      - PYTHONUNBUFFERED=1
    shm_size: "8g"

Code Integration

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    CPUOffload,
    BackwardPrefetch,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Wrap model with FSDP
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
    auto_wrap_policy=size_based_auto_wrap_policy,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
)
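
Once wrapped, the training loop needs no special calls; FSDP gathers and reshards parameter shards around each forward and backward pass automatically. A brief continuation sketch (the data loader here is a placeholder, not from the original article):

# Construct the optimizer after FSDP wrapping so it sees the flattened, sharded params
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for inputs, labels in loader:                       # loader is assumed to exist
    outputs = model(inputs.cuda())
    loss = torch.nn.functional.cross_entropy(outputs, labels.cuda())
    loss.backward()                                 # gradient shards are reduce-scattered here
    optimizer.step()
    optimizer.zero_grad()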

Key Strengths

  • Zero additional dependencies — part of PyTorch core
  • Multiple sharding strategies (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD)
  • Automatic mixed precision support
  • CPU offloading for parameters
  • Activation checkpointing for memory savings
  • Native integration with PyTorch ecosystem (torch.compile, torch.profiler)

Comparison Table

Feature              | Horovod                   | DeepSpeed                    | PyTorch FSDP
---------------------|---------------------------|------------------------------|------------------------------
Parallelism Type     | Data Parallel             | ZeRO + 3D Parallel           | Sharded Data Parallel
Max Model Size       | GPU memory limit          | 100B+ params (ZeRO-3)        | 100B+ params
Framework Support    | TF, PyTorch, MXNet, Keras | PyTorch                      | PyTorch only
CPU Offloading       | No                        | Yes (optimizer + params)     | Yes (params)
NVMe Offloading      | No                        | Yes (ZeRO-Infinity)          | No
Code Changes         | ~10 lines                 | Config file + wrapper        | ~15 lines
Communication        | Ring-allreduce (NCCL)     | Ring/Tree + compression      | Ring-allreduce (NCCL)
Mixed Precision      | Yes (AMP)                 | Yes (FP16, BF16)             | Yes (FP16, BF16)
Gradient Compression | Yes (FP16, 1-bit)         | Yes (1-bit Adam, 1-bit LAMB) | No
Elastic Training     | Yes                       | Limited                      | No
Multi-Node           | Yes                       | Yes                          | Yes
PyTorch Native       | No                        | No                           | Yes
Active Maintainer    | Uber (community)          | Microsoft                    | PyTorch core team
GitHub Stars         | 14,700+                   | 42,300+                      | Built into PyTorch (99,600+)
Last Update          | 2025-12                   | 2026-05                      | 2026-05

Multi-Node Cluster Setup

For production deployments spanning multiple machines, each framework has different setup requirements.

Horovod Multi-Node

# Launch from one node; Horovod reaches the others over passwordless SSH and requires identical environments
horovodrun -np 32 -H node1:8,node2:8,node3:8,node4:8 \
  python train.py --epochs 100

# Inside a SLURM allocation, horovodrun can discover hosts from the environment
horovodrun -np 64 \
  python train.py --epochs 100

DeepSpeed Multi-Node

Create a hostfile listing each node and its GPU slots:

node1 slots=8
node2 slots=8
node3 slots=8

Then launch from the first node:

deepspeed --hostfile=hostfile --num_nodes=3 train.py \
  --deepspeed --deepspeed_config ds_config.json

PyTorch FSDP Multi-Node

# On each node (NODE_RANK ranges 0-3; MASTER_ADDR is the IP of the rank-0 node)
torchrun \
  --nproc_per_node=8 \
  --nnodes=4 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train_fsdp.py

When to Choose Each Framework

Choose Horovod when:

  • You need multi-framework support (TensorFlow + PyTorch in the same team)
  • You want minimal code changes to existing training scripts
  • Your models fit in GPU memory and you just need faster training
  • You need elastic training for spot instance clusters

Choose DeepSpeed when:

  • Your models exceed single-GPU memory capacity
  • You need ZeRO-3 with CPU/NVMe offloading for massive models
  • You want advanced optimizations like activation checkpointing and communication compression
  • You’re already in the PyTorch ecosystem and need maximum memory efficiency

Choose PyTorch FSDP when:

  • You’re building new PyTorch training code and want zero external dependencies
  • You want native integration with the PyTorch ecosystem (torch.compile, profiler)
  • You need straightforward sharded data parallelism without the complexity of 3D parallelism
  • Long-term maintenance and PyTorch version compatibility are priorities

Why Self-Host Distributed Training Infrastructure

Running distributed training on self-hosted infrastructure offers several advantages over cloud-based alternatives:

Cost Control: GPU instances in the cloud are expensive. A self-hosted cluster of 4-8 RTX 4090 or A6000 GPUs can handle most training workloads at a fraction of the ongoing cloud cost. For teams running continuous training pipelines, the payback period is often under 6 months.

Data Sovereignty: Training datasets often contain sensitive information — proprietary business data, customer records, or regulated content. Self-hosted training keeps data within your network perimeter, eliminating transfer costs and compliance risks.

No Queue Times: Cloud GPU availability fluctuates, especially for high-end GPUs like A100/H100. Self-hosted clusters provide immediate access without waiting in provider queues or dealing with capacity constraints during peak demand.

Custom Hardware: Self-hosted clusters let you mix GPU generations, use consumer GPUs (which cloud providers rarely offer), and optimize cooling and power for your specific workload patterns.

For related infrastructure topics, see our distributed file systems comparison for storage backends and our distributed task scheduling guide for scheduling training jobs across the cluster.

FAQ

What is the main difference between data parallelism and model parallelism?

Data parallelism replicates the full model on each GPU and splits the input data across workers. Each GPU computes gradients on its batch, then gradients are averaged. Model parallelism splits the model itself across GPUs — each GPU holds only a portion of the model’s layers. Horovod uses data parallelism, DeepSpeed supports both, and FSDP uses sharded data parallelism (a hybrid that shards parameters across data-parallel workers).
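
To make the data-parallel half concrete, this is roughly what happens under the hood after each backward pass. A hedged sketch using raw torch.distributed; the frameworks above do the same thing far more efficiently, with fused buffers and communication overlap:

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Assumes the process group is already initialized (e.g. torchrun + init_process_group)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum each gradient tensor across all workers, then divide to get the average
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size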

Can I mix Horovod, DeepSpeed, and FSDP in the same project?

Technically no — they each wrap the training loop differently. However, you can use them in different experiments or projects. Some teams use Horovod for TensorFlow workloads and DeepSpeed or FSDP for PyTorch workloads.

Does FSDP replace DeepSpeed?

Not entirely. FSDP provides similar functionality to DeepSpeed’s ZeRO-3 for parameter sharding. However, DeepSpeed offers additional features like ZeRO-Infinity (NVMe offloading), 3D parallelism (tensor + pipeline parallelism), and advanced communication compression that FSDP does not yet implement. For most use cases under 10B parameters, FSDP is sufficient.

How many GPUs do I need before distributed training becomes worthwhile?

Distributed training becomes worthwhile at 2+ GPUs for models that take more than a few hours to train on a single GPU. For smaller models, the communication overhead may outweigh the parallelism benefit. As a rule of thumb: if single-GPU training takes more than 4 hours, distributed training will likely speed things up.

What network bandwidth do I need for multi-node distributed training?

Ring-allreduce communication is bandwidth-intensive. For efficient multi-node training, you should have at least 10 Gbps Ethernet (25 Gbps recommended). InfiniBand (100+ Gbps) provides the best scaling for large clusters. Horovod’s gradient compression and DeepSpeed’s communication compression can reduce bandwidth requirements significantly.
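
As a back-of-the-envelope check, ring-allreduce moves roughly 2 × (N−1)/N × gradient size per worker per step. A small sketch; the model size and step time below are made-up example numbers, not measurements:

def allreduce_gbps(params_billion: float, bytes_per_param: int, step_seconds: float, workers: int) -> float:
    """Approximate per-worker bandwidth needed to complete gradient sync within one step."""
    grad_bytes = params_billion * 1e9 * bytes_per_param
    traffic = 2 * (workers - 1) / workers * grad_bytes   # ring-allreduce volume per worker
    return traffic * 8 / step_seconds / 1e9              # gigabits per second

# Example: 1B params, FP16 gradients, 1 s/step, 8 workers -> roughly 28 Gbps per worker
print(round(allreduce_gbps(1.0, 2, 1.0, 8), 1))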

Can I run distributed training without NVIDIA GPUs?

Horovod supports AMD GPUs via RCCL (ROCm Collective Communications Library). DeepSpeed and FSDP are developed and tested primarily against CUDA and NCCL on NVIDIA GPUs; ROCm support in the PyTorch ecosystem exists but is less mature. For AMD-based clusters, Horovod is the best choice.

How do I monitor distributed training performance across multiple nodes?

Use our GPU monitoring tools to track per-GPU utilization. Combine with distributed logging (each worker logs to a shared backend) and metrics collection (Prometheus + Grafana) for a complete view of cluster-wide training performance.