DeepSpeed is Microsoft's open-source deep learning optimization library that makes distributed training and inference of massive models efficient and accessible. It addresses the fundamental problem of GPU memory limitations β modern LLMs require far more memory than any single GPU can hold β through the Zero Redundancy Optimizer (ZeRO) and a layered ecosystem of parallelism, offloading, and inference tools. The key mental model: ZeRO does not copy; it partitions β each GPU holds only its slice of optimizer states, gradients, or parameters, reducing per-GPU memory proportionally to the number of GPUs while keeping communication overhead manageable.
What This Cheat Sheet Covers
This topic spans 16 focused tables and 103 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: ZeRO Optimizer Stages β Core Memory Partitioning
ZeRO (Zero Redundancy Optimizer) is the foundation of DeepSpeed's training efficiency. Each progressive stage partitions more of the model's memory footprint across data-parallel GPUs, with the critical distinction that only Stage 3 partitions the model parameters themselves.
| Technique | Example | Description |
|---|---|---|
"zero_optimization": {"stage": 1, "reduce_bucket_size": 5e8} | Partitions optimizer states (32-bit weights, Adam first and second moment estimates) across ranks; ~4x memory reduction for Adam; no extra communication vs. baseline. | |
"zero_optimization": {"stage": 2, "overlap_comm": true, "contiguous_gradients": true} | Partitions optimizer states and gradients; each rank retains only the gradients matching its optimizer slice; same communication volume as Stage 1 β effectively a free upgrade. | |
"zero_optimization": {"stage": 3, "stage3_max_live_parameters": 1e9, "stage3_prefetch_bucket_size": 5e7} | Partitions optimizer states, gradients, and model parameters; enables 16x+ memory reduction; parameters are all-gathered before use and discarded after each forward/backward pass. | |
"zero_optimization": {"stage": 0} | ZeRO disabled; standard data parallelism with full replication of all states on every GPU; useful as a correctness baseline. | |
"reduce_bucket_size": 5e8 | Maximum number of elements reduced/all-reduced at once; trades communication frequency for GPU memory; lower values save memory but increase round trips. |