Distributed training in PyTorch enables scaling deep learning workloads across multiple GPUs and nodes by dividing data, model layers, or both across devices. It lives at the intersection of systems programming and machine learning, and is essential for training models that are too large for a single GPU or too slow to train on one. The key mental model is that every parallelism strategy trades some combination of memory, communication bandwidth, and implementation complexity β understanding that tradeoff upfront determines which API to reach for first.
What This Cheat Sheet Covers
This topic spans 16 focused tables and 106 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Parallelism Strategy Selection
Choosing the right strategy before writing a line of training code saves significant rework. The decision tree is almost always model size first: if the full model fits on one GPU, DDP is optimal; if not, FSDP or ZeRO is needed; only when those hit scaling limits should tensor or pipeline parallelism be added.
| Strategy | Example | Description |
|---|---|---|
ddp_model = DDP(model, device_ids=[rank]) | Replicates full model on every GPU; synchronizes gradients via all-reduce after each backward pass. Best choice when model fits on one GPU. | |
fsdp_model = FSDP(model) | Shards parameters, gradients, and optimizer states across all ranks; all-gathers before compute, reduce-scatters after. Required when model does not fit on one GPU. | |
ds_config = {"zero_optimization": {"stage": 2}} | Microsoft's Zero Redundancy Optimizer stages 1β3 progressively shard optimizer states, gradients, and parameters; ZeRO-Infinity adds NVMe offloading. |