Distributed Training in PyTorch (DDP, FSDP, ZeRO) Cheat Sheet

Updated 2026-05-21

Next Topic: Domain-Specific Language Models Cheat Sheet

Distributed training in PyTorch enables scaling deep learning workloads across multiple GPUs and nodes by dividing data, model layers, or both across devices. It lives at the intersection of systems programming and machine learning, and is essential for training models that are too large for a single GPU or too slow to train on one. The key mental model is that every parallelism strategy trades some combination of memory, communication bandwidth, and implementation complexity — understanding that tradeoff upfront determines which API to reach for first.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 106 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Parallelism Strategy SelectionTable 2: Process Group InitializationTable 3: DistributedDataParallel (DDP) Core UsageTable 4: FSDP Sharding StrategiesTable 5: FSDP Configuration and WrappingTable 6: FSDP2 (fully_shard) — Next-Generation APITable 7: ZeRO Optimizer Stages (DeepSpeed)Table 8: Gradient Accumulation in Distributed TrainingTable 9: Mixed Precision TrainingTable 10: Activation CheckpointingTable 11: torchrun — Launching Distributed JobsTable 12: Collective Communication OperationsTable 13: Debugging and Profiling Distributed JobsTable 14: Distributed CheckpointingTable 15: FSDP vs DeepSpeed — Decision GuideTable 16: DeviceMesh and Multi-Dimensional Parallelism

Table 1: Parallelism Strategy Selection

Choosing the right strategy before writing a line of training code saves significant rework. The decision tree is almost always model size first: if the full model fits on one GPU, DDP is optimal; if not, FSDP or ZeRO is needed; only when those hit scaling limits should tensor or pipeline parallelism be added.

Strategy	Example	Description
DistributedDataParallel (DDP)	`ddp_model = DDP(model, device_ids=[rank])`	• Replicates full model on every GPU • synchronizes gradients via all-reduce after each backward pass • Best choice when model fits on one GPU
FullyShardedDataParallel (FSDP)	`fsdp_model = FSDP(model)`	• Shards parameters, gradients, and optimizer states across all ranks • all-gathers before compute, reduce-scatters after • Required when model does not fit on one GPU
ZeRO (DeepSpeed)	`ds_config = {"zero_optimization": {"stage": 2}}`	• Microsoft's Zero Redundancy Optimizer stages 1–3 progressively shard optimizer states, gradients, and parameters • ZeRO-Infinity adds NVMe offloading

Table 1: Parallelism Strategy Selection

Strategy	Example	Description
DistributedDataParallel (DDP)	`ddp_model = DDP(model, device_ids=[rank])`	• Replicates full model on every GPU • synchronizes gradients via all-reduce after each backward pass • Best choice when model fits on one GPU
FullyShardedDataParallel (FSDP)	`fsdp_model = FSDP(model)`	• Shards parameters, gradients, and optimizer states across all ranks • all-gathers before compute, reduce-scatters after • Required when model does not fit on one GPU
ZeRO (DeepSpeed)	`ds_config = {"zero_optimization": {"stage": 2}}`	• Microsoft's Zero Redundancy Optimizer stages 1–3 progressively shard optimizer states, gradients, and parameters • ZeRO-Infinity adds NVMe offloading