When a single GPU can no longer keep up with model training workloads, distributed training becomes essential. Whether you’re running large-scale recommendation systems, computer vision pipelines, or natural language processing models on a self-hosted GPU cluster, choosing the right distributed training framework determines your hardware utilization, training speed, and operational complexity.

This guide compares the three most widely adopted self-hosted distributed training solutions: Uber Horovod, Microsoft DeepSpeed, and PyTorch Fully Sharded Data Parallel (FSDP). We’ll cover architecture differences, provide Docker deployment configurations, and help you choose the right approach for your training infrastructure.

For model experiment tracking with these frameworks, see our ML experiment tracking guide. If you need model versioning alongside training, our model registry comparison covers the options. And for GPU hardware monitoring during training runs, check our GPU monitoring tools guide.

Why Distributed Training Matters

Training modern models on a single GPU can take weeks or months. Distributed training splits the workload across multiple GPUs — potentially across multiple machines — reducing training time from weeks to hours. The challenge is that different frameworks approach distribution in fundamentally different ways, each with trade-offs in memory efficiency, communication overhead, and code complexity.

The three approaches covered here represent distinct philosophies:

  • Horovod — Data parallelism with ring-allreduce communication, minimal code changes
  • DeepSpeed — Hybrid parallelism (data, tensor, pipeline, ZeRO) for extreme-scale training
  • PyTorch FSDP — Native sharded data parallelism built into PyTorch core

Horovod: Ring-Allreduce Data Parallelism

Horovod was created by Uber to make distributed training straightforward. It uses the ring-allreduce algorithm for gradient synchronization and works with TensorFlow, PyTorch, Keras, and MXNet.

Architecture

Horovod wraps the training loop so that each GPU processes a different mini-batch, computes local gradients, and then averages gradients across all workers using NCCL (NVIDIA Collective Communications Library). Every worker ends up with identical updated parameters.
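
To make that flow concrete, here is a minimal PyTorch-flavored sketch using Horovod's torch API. The model, dataset, and hyperparameters are placeholders for illustration, not from an actual Horovod example:

import torch
import horovod.torch as hvd

hvd.init()                                     # one process per GPU
torch.cuda.set_device(hvd.local_rank())        # pin this process to its local GPU

# Toy model and dataset; replace with your own
model = torch.nn.Linear(10, 1).cuda()
dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

# Each worker reads a different shard of the data
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across workers via ring-allreduce (NCCL)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)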

# Install Horovod with NCCL GPU operations (TensorFlow or PyTorch must already be installed)
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow]
# or for PyTorch
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]

Docker Deployment

# docker-compose-horovod.yml
version: "3.8"

services:
  horovod-trainer:
    image: horovod/horovod:latest-gpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./training-scripts:/workspace
      - /data/datasets:/datasets:ro
    working_dir: /workspace
    command: >
      horovodrun -np 4 -H localhost:4
      python train.py --epochs 100 --batch-size 256
    environment:
      - NCCL_DEBUG=INFO
      - HOROVOD_GPU_ALLREDUCE=NCCL
      - HOROVOD_FUSION_THRESHOLD=16777216
    shm_size: "8g"

Code Integration

Horovod requires minimal code changes — typically 5-10 lines:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin GPU to local process
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Scale learning rate by number of workers
opt = tf.keras.optimizers.SGD(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
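
One step the snippet above omits is broadcasting the initial state from rank 0 so every worker starts from identical weights. A hedged sketch with horovod.tensorflow, assuming a Keras model object named model alongside the opt defined above:

# After building the model (or after the first training step), synchronize state
hvd.broadcast_variables(model.variables, root_rank=0)
hvd.broadcast_variables(opt.variables(), root_rank=0)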

Key Strengths

  • Minimal code modifications required
  • Excellent scaling efficiency with NCCL ring-allreduce
  • Multi-framework support (TensorFlow, PyTorch, MXNet)
  • Automatic gradient compression and fusion
  • Elastic training support for dynamic worker scaling

DeepSpeed: ZeRO-Powered Extreme-Scale Training

DeepSpeed, developed by Microsoft Research, goes far beyond simple data parallelism. Its ZeRO (Zero Redundancy Optimizer) technology partitions optimizer states, gradients, and parameters across GPUs, enabling training of models with hundreds of billions of parameters.

Architecture

DeepSpeed implements a 3D parallelism strategy:

  1. ZeRO Data Parallelism — Partitions optimizer states, gradients, and parameters
  2. Tensor Parallelism — Splits individual layers across GPUs
  3. Pipeline Parallelism — Splits model layers across GPU stages

Docker Deployment

# docker-compose-deepspeed.yml
version: "3.8"

services:
  deepspeed-trainer:
    image: deepspeed/deepspeed:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./training-scripts:/workspace
      - /data/datasets:/datasets:ro
    working_dir: /workspace
    command: >
      deepspeed --num_gpus=8
      train.py
      --deepspeed
      --deepspeed_config ds_config.json
    environment:
      - NCCL_DEBUG=INFO
      - OMP_NUM_THREADS=4
      - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    shm_size: "16g"

DeepSpeed Configuration (ZeRO Stage 3)

{
  "train_batch_size": 2048,
  "gradient_accumulation_steps": 16,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 1e6,
    "stage3_prefetch_bucket_size": 0.94e6,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "gradient_clipping": 1.0,
  "steps_per_print": 100,
  "wall_clock_breakdown": false
}
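
In the training script, this config is consumed through deepspeed.initialize(), which returns an engine that owns the backward pass and optimizer step. A minimal sketch, assuming the script is launched with the deepspeed launcher shown above; the toy model and data loader are placeholders:

import torch
import deepspeed

# Toy model and data; replace with your real model and loader
model = torch.nn.Linear(1024, 1024)
data = torch.utils.data.TensorDataset(torch.randn(256, 1024), torch.randn(256, 1024))
loader = torch.utils.data.DataLoader(data, batch_size=8)

# deepspeed.initialize wires up ZeRO, FP16, and offloading from the JSON config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for inputs, labels in loader:
    # fp16 is enabled in ds_config.json, so cast inputs and targets to half precision
    inputs = inputs.to(model_engine.local_rank).half()
    labels = labels.to(model_engine.local_rank).half()
    loss = torch.nn.functional.mse_loss(model_engine(inputs), labels)
    model_engine.backward(loss)   # DeepSpeed applies loss scaling internally
    model_engine.step()           # optimizer step plus gradient zeroing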

Key Strengths

  • ZeRO-3 enables training models that don’t fit in a single GPU’s memory
  • CPU offloading for optimizer states and parameters
  • ZeRO-Infinity extends offloading to NVMe storage
  • Mixed precision (FP16, BF16) with automatic loss scaling
  • Built-in communication compression (1-bit Adam, 1-bit LAMB)
  • Activation checkpointing for memory savings

PyTorch FSDP: Native Sharded Data Parallelism

PyTorch Fully Sharded Data Parallel (FSDP) is PyTorch’s native answer to distributed training, integrated directly into the torch.distributed package since PyTorch 1.11. It implements the same core concept as DeepSpeed’s ZeRO-3 but as a first-class PyTorch feature.

Architecture

FSDP shards model parameters, gradients, and optimizer states across all participating processes. During the forward pass, each process fetches only the parameter shards it needs, computes, and then discards them. This means the full model never needs to fit in any single GPU’s memory.

Docker Deployment

# docker-compose-fsdp.yml
version: "3.8"

services:
  fsdp-trainer:
    image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./training-scripts:/workspace
      - /data/datasets:/datasets:ro
    working_dir: /workspace
    command: >
      torchrun --nproc_per_node=8 --nnodes=1
      train_fsdp.py --batch-size 128 --epochs 50
    environment:
      - NCCL_DEBUG=INFO
      - TORCH_DISTRIBUTED_DEBUG=DETAIL
      - OMP_NUM_THREADS=4
      - PYTHONUNBUFFERED=1
    shm_size: "8g"

Code Integration

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    CPUOffload,
    BackwardPrefetch,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Wrap model with FSDP
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    cpu_offload=CPUOffload(offload_params=True),
    auto_wrap_policy=size_based_auto_wrap_policy,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    mixed_precision=MixedPrecision(
        param_dtype=torch.float16,
        reduce_dtype=torch.float16,
        buffer_dtype=torch.float16,
    ),
)
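
Once wrapped, the training loop needs no special calls; FSDP gathers and reshards parameter shards around each forward and backward pass automatically. A brief continuation sketch (the data loader here is a placeholder, not from the original article):

# Construct the optimizer after FSDP wrapping so it sees the flattened, sharded params
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for inputs, labels in loader:                       # loader is assumed to exist
    outputs = model(inputs.cuda())
    loss = torch.nn.functional.cross_entropy(outputs, labels.cuda())
    loss.backward()                                 # gradient shards are reduce-scattered here
    optimizer.step()
    optimizer.zero_grad()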

Key Strengths

  • Zero additional dependencies — part of PyTorch core
  • Multiple sharding strategies (FULL_SHARD, SHARD_GRAD_OP, NO_SHARD)
  • Automatic mixed precision support
  • CPU offloading for parameters
  • Activation checkpointing for memory savings
  • Native integration with PyTorch ecosystem (torch.compile, torch.profiler)

Comparison Table

Feature              | Horovod                   | DeepSpeed                    | PyTorch FSDP
---------------------|---------------------------|------------------------------|------------------------------
Parallelism Type     | Data Parallel             | ZeRO + 3D Parallel           | Sharded Data Parallel
Max Model Size       | GPU memory limit          | 100B+ params (ZeRO-3)        | 100B+ params
Framework Support    | TF, PyTorch, MXNet, Keras | PyTorch                      | PyTorch only
CPU Offloading       | No                        | Yes (optimizer + params)     | Yes (params)
NVMe Offloading      | No                        | Yes (ZeRO-Infinity)          | No
Code Changes         | ~10 lines                 | Config file + wrapper        | ~15 lines
Communication        | Ring-allreduce (NCCL)     | Ring/Tree + compression      | Ring-allreduce (NCCL)
Mixed Precision      | Yes (AMP)                 | Yes (FP16, BF16)             | Yes (FP16, BF16)
Gradient Compression | Yes (FP16, 1-bit)         | Yes (1-bit Adam, 1-bit LAMB) | No
Elastic Training     | Yes                       | Limited                      | No
Multi-Node           | Yes                       | Yes                          | Yes
PyTorch Native       | No                        | No                           | Yes
Active Maintainer    | Uber (community)          | Microsoft                    | PyTorch core team
GitHub Stars         | 14,700+                   | 42,300+                      | Built into PyTorch (99,600+)
Last Update          | 2025-12                   | 2026-05                      | 2026-05

Multi-Node Cluster Setup

For production deployments spanning multiple machines, each framework has different setup requirements.

Horovod Multi-Node

# Launch from one node; Horovod reaches the others over passwordless SSH and requires identical environments
horovodrun -np 32 -H node1:8,node2:8,node3:8,node4:8 \
  python train.py --epochs 100

# Inside a SLURM allocation, horovodrun can discover hosts from the environment
horovodrun -np 64 \
  python train.py --epochs 100

DeepSpeed Multi-Node

Create a hostfile listing each node and its GPU slots:

node1 slots=8
node2 slots=8
node3 slots=8

Then launch from the first node:

deepspeed --hostfile=hostfile --num_nodes=3 train.py \
  --deepspeed --deepspeed_config ds_config.json

PyTorch FSDP Multi-Node

# On each node (NODE_RANK ranges 0-3; MASTER_ADDR is the IP of the rank-0 node)
torchrun \
  --nproc_per_node=8 \
  --nnodes=4 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train_fsdp.py

When to Choose Each Framework

Choose Horovod when:

  • You need multi-framework support (TensorFlow + PyTorch in the same team)
  • You want minimal code changes to existing training scripts
  • Your models fit in GPU memory and you just need faster training
  • You need elastic training for spot instance clusters

Choose DeepSpeed when:

  • Your models exceed single-GPU memory capacity
  • You need ZeRO-3 with CPU/NVMe offloading for massive models
  • You want advanced optimizations like activation checkpointing and communication compression
  • You’re already in the PyTorch ecosystem and need maximum memory efficiency

Choose PyTorch FSDP when:

  • You’re building new PyTorch training code and want zero external dependencies
  • You want native integration with the PyTorch ecosystem (torch.compile, profiler)
  • You need straightforward sharded data parallelism without the complexity of 3D parallelism
  • Long-term maintenance and PyTorch version compatibility are priorities

Why Self-Host Distributed Training Infrastructure

Running distributed training on self-hosted infrastructure offers several advantages over cloud-based alternatives:

Cost Control: GPU instances in the cloud are expensive. A self-hosted cluster of 4-8 RTX 4090 or A6000 GPUs can handle most training workloads at a fraction of the ongoing cloud cost. For teams running continuous training pipelines, the payback period is often under 6 months.

Data Sovereignty: Training datasets often contain sensitive information — proprietary business data, customer records, or regulated content. Self-hosted training keeps data within your network perimeter, eliminating transfer costs and compliance risks.

No Queue Times: Cloud GPU availability fluctuates, especially for high-end GPUs like A100/H100. Self-hosted clusters provide immediate access without waiting in provider queues or dealing with capacity constraints during peak demand.

Custom Hardware: Self-hosted clusters let you mix GPU generations, use consumer GPUs (which cloud providers rarely offer), and optimize cooling and power for your specific workload patterns.

For related infrastructure topics, see our distributed file systems comparison for storage backends and our distributed task scheduling guide for scheduling training jobs across the cluster.

FAQ

What is the main difference between data parallelism and model parallelism?

Data parallelism replicates the full model on each GPU and splits the input data across workers. Each GPU computes gradients on its batch, then gradients are averaged. Model parallelism splits the model itself across GPUs — each GPU holds only a portion of the model’s layers. Horovod uses data parallelism, DeepSpeed supports both, and FSDP uses sharded data parallelism (a hybrid that shards parameters across data-parallel workers).
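
To make the data-parallel half concrete, this is roughly what happens under the hood after each backward pass. A hedged sketch using raw torch.distributed; the frameworks above do the same thing far more efficiently, with fused buffers and communication overlap:

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Assumes the process group is already initialized (e.g. torchrun + init_process_group)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum each gradient tensor across all workers, then divide to get the average
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size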

Can I mix Horovod, DeepSpeed, and FSDP in the same project?

Technically no — they each wrap the training loop differently. However, you can use them in different experiments or projects. Some teams use Horovod for TensorFlow workloads and DeepSpeed or FSDP for PyTorch workloads.

Does FSDP replace DeepSpeed?

Not entirely. FSDP provides similar functionality to DeepSpeed’s ZeRO-3 for parameter sharding. However, DeepSpeed offers additional features like ZeRO-Infinity (NVMe offloading), 3D parallelism (tensor + pipeline parallelism), and advanced communication compression that FSDP does not yet implement. For most use cases under 10B parameters, FSDP is sufficient.

How many GPUs do I need before distributed training becomes worthwhile?

Distributed training becomes worthwhile at 2+ GPUs for models that take more than a few hours to train on a single GPU. For smaller models, the communication overhead may outweigh the parallelism benefit. As a rule of thumb: if single-GPU training takes more than 4 hours, distributed training will likely speed things up.

What network bandwidth do I need for multi-node distributed training?

Ring-allreduce communication is bandwidth-intensive. For efficient multi-node training, you should have at least 10 Gbps Ethernet (25 Gbps recommended). InfiniBand (100+ Gbps) provides the best scaling for large clusters. Horovod’s gradient compression and DeepSpeed’s communication compression can reduce bandwidth requirements significantly.
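
As a back-of-the-envelope check, ring-allreduce moves roughly 2 × (N−1)/N × gradient size per worker per step. A small sketch; the model size and step time below are made-up example numbers, not measurements:

def allreduce_gbps(params_billion: float, bytes_per_param: int, step_seconds: float, workers: int) -> float:
    """Approximate per-worker bandwidth needed to complete gradient sync within one step."""
    grad_bytes = params_billion * 1e9 * bytes_per_param
    traffic = 2 * (workers - 1) / workers * grad_bytes   # ring-allreduce volume per worker
    return traffic * 8 / step_seconds / 1e9              # gigabits per second

# Example: 1B params, FP16 gradients, 1 s/step, 8 workers -> roughly 28 Gbps per worker
print(round(allreduce_gbps(1.0, 2, 1.0, 8), 1))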

Can I run distributed training without NVIDIA GPUs?

Horovod supports AMD GPUs via RCCL (ROCm Collective Communications Library). DeepSpeed and FSDP are developed and tested primarily against CUDA and NCCL on NVIDIA GPUs; ROCm support in the PyTorch ecosystem exists but is less mature. For AMD-based clusters, Horovod is the best choice.

How do I monitor distributed training performance across multiple nodes?

Use our GPU monitoring tools to track per-GPU utilization. Combine with distributed logging (each worker logs to a shared backend) and metrics collection (Prometheus + Grafana) for a complete view of cluster-wide training performance.