DeepSpeed Cheat Sheet

Updated 2026-05-21


DeepSpeed is Microsoft's open-source deep learning optimization library for efficient distributed training and inference of very large models. It addresses GPU memory limits (modern LLMs need far more memory than a single GPU can hold) through the Zero Redundancy Optimizer (ZeRO) and a layered ecosystem of parallelism, offloading, and inference tools. The key mental model: ZeRO partitions instead of replicating. Each GPU holds only its slice of the optimizer states, gradients, or parameters, so the memory for each partitioned state shrinks in proportion to the number of GPUs while communication overhead stays manageable.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 103 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: ZeRO Optimizer Stages — Core Memory Partitioning
Table 2: ZeRO-Infinity and Offloading
Table 3: ZeRO Stage 3 Advanced Parameters
Table 4: ZeRO++ — Quantized Communication
Table 5: 3D Parallelism — Data, Tensor, and Pipeline
Table 6: DeepSpeed Initialization and Training Loop
Table 7: ds_config.json — Core Configuration Parameters
Table 8: Mixed Precision — FP16 and BF16 Configuration
Table 9: Activation Checkpointing
Table 10: Compressed Communication Optimizers
Table 11: DeepSpeed-Inference Engine
Table 12: DeepSpeed-MII — Managed Model Inference
Table 13: DeepSpeed-Chat — End-to-End RLHF Pipeline
Table 14: Hugging Face Trainer and Accelerate Integration
Table 15: Profiling and Monitoring Tools
Table 16: Data Efficiency and Advanced Training Features

Table 1: ZeRO Optimizer Stages — Core Memory Partitioning

ZeRO (Zero Redundancy Optimizer) is the foundation of DeepSpeed's training efficiency. Each progressive stage partitions more of the model's memory footprint across data-parallel GPUs, with the critical distinction that only Stage 3 partitions the model parameters themselves.
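
The effect of each stage is easiest to see as arithmetic. The sketch below assumes the mixed-precision Adam accounting commonly used to describe ZeRO: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer states (fp32 weights plus the two Adam moments). The function name `zero_model_state_gb` and the 7.5B-parameter, 64-GPU figures are illustrative choices, not part of DeepSpeed, and the estimate ignores activations, temporary buffers, and fragmentation.

```python
def zero_model_state_gb(num_params, num_gpus, stage, optim_bytes=12):
    """Estimate per-GPU model-state memory in GB for a given ZeRO stage.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and `optim_bytes` bytes/param for
    optimizer states (fp32 weights + Adam first and second moments).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2 * num_params
    grads = 2 * num_params
    optim = optim_bytes * num_params

    # Each stage partitions one more component across the data-parallel ranks.
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus

    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Hypothetical 7.5B-parameter model on 64 data-parallel GPUs.
    for stage in range(4):
        gb = zero_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the per-GPU model-state footprint drops from 120 GB (Stage 0) to about 31.4 GB (Stage 1), 16.6 GB (Stage 2), and 1.9 GB (Stage 3), which is where the approximate 4x and 8x reductions for Stages 1 and 2 come from.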

Technique	Example	Description
ZeRO Stage 1	`"zero_optimization": {"stage": 1,` `"reduce_bucket_size": 5e8}`	• Partitions optimizer states (32-bit weights, Adam first and second moment estimates) across ranks • up to ~4x reduction in model-state memory with mixed-precision Adam • no extra communication vs. baseline data parallelism
ZeRO Stage 2	`"zero_optimization": {"stage": 2,` `"overlap_comm": true,` `"contiguous_gradients": true}`	• Partitions optimizer states and gradients • each rank retains only the gradients matching its optimizer slice • up to ~8x reduction in model-state memory • same communication volume as Stage 1, so effectively a free upgrade
ZeRO Stage 3	`"zero_optimization": {"stage": 3,` `"stage3_max_live_parameters": 1e9,` `"stage3_prefetch_bucket_size": 5e7}`	• Partitions optimizer states, gradients, and model parameters • model-state memory shrinks roughly linearly with the number of data-parallel GPUs (16x+ at 16 or more GPUs) • parameters are all-gathered just before each layer's forward/backward computation and released afterward • about 1.5x the communication volume of standard data parallelism
ZeRO Stage 0	`"zero_optimization": {"stage": 0}`	• ZeRO disabled • standard data parallelism with full replication of all states on every GPU • useful as a correctness baseline
reduce_bucket_size	`"reduce_bucket_size": 5e8`	• Maximum number of elements reduced/all-reduced at once • trades communication frequency for GPU memory • lower values save memory but increase round trips
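
As a rough illustration of how the Example column fits together, the sketch below builds the `zero_optimization` section for a chosen stage and serializes it as JSON. The helper `make_zero_config` and the `STAGE_EXAMPLES` mapping are hypothetical names for this example only. The keys and values are the ones shown in the table, and a real `ds_config.json` would carry additional sections that are not shown here.

```python
import json

# Per-stage settings copied from the Example column of the table above.
STAGE_EXAMPLES = {
    0: {},
    1: {"reduce_bucket_size": 5e8},
    2: {"overlap_comm": True, "contiguous_gradients": True},
    3: {"stage3_max_live_parameters": 1e9, "stage3_prefetch_bucket_size": 5e7},
}


def make_zero_config(stage, **overrides):
    """Return a dict holding only the zero_optimization section.

    `overrides` lets the caller add or replace keys, for example
    reduce_bucket_size on a Stage 2 config.
    """
    if stage not in STAGE_EXAMPLES:
        raise ValueError(f"unsupported ZeRO stage: {stage!r}")
    zero = {"stage": stage, **STAGE_EXAMPLES[stage], **overrides}
    return {"zero_optimization": zero}


if __name__ == "__main__":
    # Stage 2 with a smaller reduce bucket to trade round trips for memory.
    config = make_zero_config(2, reduce_bucket_size=2e8)
    print(json.dumps(config, indent=2))
```

Keeping the per-stage defaults in one mapping makes it easy to diff a Stage 0 baseline run against a partitioned run, since only the `zero_optimization` section changes.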
