Distributed Training in PyTorch (DDP, FSDP, ZeRO) Cheat Sheet

Updated 2026-05-21

Distributed training in PyTorch scales deep learning workloads across multiple GPUs and nodes by splitting the data, the model, or both across devices. It sits at the intersection of systems programming and machine learning, and is essential for models that are too large to fit on a single GPU or too slow to train on one. The key mental model is that every parallelism strategy trades some combination of memory, communication bandwidth, and implementation complexity; understanding that tradeoff up front determines which API to reach for first.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 106 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Parallelism Strategy Selection
  • Table 2: Process Group Initialization
  • Table 3: DistributedDataParallel (DDP) Core Usage
  • Table 4: FSDP Sharding Strategies
  • Table 5: FSDP Configuration and Wrapping
  • Table 6: FSDP2 (fully_shard) - Next-Generation API
  • Table 7: ZeRO Optimizer Stages (DeepSpeed)
  • Table 8: Gradient Accumulation in Distributed Training
  • Table 9: Mixed Precision Training
  • Table 10: Activation Checkpointing
  • Table 11: torchrun - Launching Distributed Jobs
  • Table 12: Collective Communication Operations
  • Table 13: Debugging and Profiling Distributed Jobs
  • Table 14: Distributed Checkpointing
  • Table 15: FSDP vs DeepSpeed - Decision Guide
  • Table 16: DeviceMesh and Multi-Dimensional Parallelism

Table 1: Parallelism Strategy Selection

Choosing the right strategy before writing a line of training code saves significant rework. The decision almost always starts with model size: if the full model, together with its gradients and optimizer state, fits on one GPU, DDP is usually the simplest and fastest choice; if it does not, FSDP or ZeRO is needed; only when those hit scaling limits should tensor or pipeline parallelism be added.
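The model-size-first rule above can be sketched as a small helper. This is a rough heuristic, not part of any PyTorch API: the function name `pick_strategy`, the 16 bytes-per-parameter figure (mixed-precision Adam: 2 bytes for weights, 2 for gradients, 12 for optimizer state) and the 30% activation headroom are all assumptions chosen for illustration.

```python
# Assumption: mixed-precision training with Adam, roughly 16 bytes of
# model state per parameter (2 weights + 2 gradients + 12 optimizer state).
BYTES_PER_PARAM = 16


def pick_strategy(num_params, gpu_mem_gib, activation_fraction=0.3):
    """Hypothetical helper: suggest a starting strategy from model size.

    activation_fraction is the share of GPU memory reserved for
    activations and workspace; 0.3 is an illustrative guess.
    """
    budget_bytes = gpu_mem_gib * 2**30 * (1 - activation_fraction)
    needed_bytes = num_params * BYTES_PER_PARAM
    # If the replicated model states fit, plain data parallelism is enough.
    return "DDP" if needed_bytes <= budget_bytes else "FSDP or ZeRO"


print(pick_strategy(350_000_000, 80))    # ~5.6 GB of model state
print(pick_strategy(7_000_000_000, 80))  # ~112 GB of model state
```

In practice the activation footprint depends heavily on batch size, sequence length, and whether activation checkpointing is on, so the result is only a first guess to be confirmed by profiling.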

Strategy | Example | Description
--- | --- | ---
DistributedDataParallel (DDP) | ddp_model = DDP(model, device_ids=[rank]) | Replicates the full model on every GPU; synchronizes gradients via all-reduce, overlapped with the backward pass. Best choice when the model fits on one GPU.
FullyShardedDataParallel (FSDP) | fsdp_model = FSDP(model) | Shards parameters, gradients, and optimizer states across all ranks; all-gathers parameters before compute, reduce-scatters gradients after. Needed when the model does not fit on one GPU.
ZeRO (DeepSpeed) | ds_config = {"zero_optimization": {"stage": 2}} | Microsoft's Zero Redundancy Optimizer; stages 1-3 progressively shard optimizer states, gradients, and parameters. ZeRO-Infinity adds NVMe offloading.
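To make the memory side of this comparison concrete, the following sketch estimates per-GPU model-state memory for each ZeRO stage. It assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state, and it ignores activations, buffers, and fragmentation. The function name `model_state_bytes_per_gpu` is hypothetical. Stage 0 corresponds to full replication as in DDP, and stage 3 is roughly comparable to FSDP's full sharding.

```python
def model_state_bytes_per_gpu(num_params, world_size, zero_stage):
    """Estimate per-GPU model-state memory in bytes (illustrative only).

    zero_stage 0 replicates everything; each higher stage shards one
    more component across world_size ranks.
    """
    weights = 2 * num_params   # fp16/bf16 parameters
    grads = 2 * num_params     # fp16/bf16 gradients
    optim = 12 * num_params    # fp32 master weights + Adam moments
    if zero_stage >= 1:
        optim /= world_size    # stage 1: shard optimizer states
    if zero_stage >= 2:
        grads /= world_size    # stage 2: also shard gradients
    if zero_stage >= 3:
        weights /= world_size  # stage 3: also shard parameters
    return weights + grads + optim


# Example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = model_state_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB")
# stage 0: 120.0 GB
# stage 1: 31.4 GB
# stage 2: 16.6 GB
# stage 3: 1.9 GB
```

The savings come at the cost of extra communication: sharding parameters means they must be all-gathered before each use, which is why the fully replicated approach remains attractive when the model already fits.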
