DeepSpeed Cheat Sheet


Updated 2026-05-21

DeepSpeed is Microsoft's open-source deep learning optimization library that makes distributed training and inference of massive models efficient and accessible. It addresses the fundamental problem of GPU memory limitations (modern LLMs require far more memory than any single GPU can hold) through the Zero Redundancy Optimizer (ZeRO) and a layered ecosystem of parallelism, offloading, and inference tools. The key mental model: ZeRO does not copy; it partitions. Each GPU holds only its slice of optimizer states, gradients, or parameters, reducing per-GPU memory proportionally to the number of GPUs while keeping communication overhead manageable.
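To make the "partition, not copy" model concrete, here is a rough sketch of the per-GPU memory arithmetic for model states. It assumes mixed-precision Adam with the byte counts used in the ZeRO paper (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer states). The function name `zero_memory_gb` is hypothetical and not part of DeepSpeed, and the estimate ignores activations, temporary buffers, and fragmentation.

```python
def zero_memory_gb(num_params, num_gpus, stage, optimizer_bytes=12):
    """Approximate per-GPU memory (in GB, 1 GB = 1e9 bytes) for model states.

    Assumption: mixed-precision Adam, i.e. 2 bytes/param for FP16 weights,
    2 bytes/param for FP16 gradients, and `optimizer_bytes` per param for
    FP32 weights plus Adam momentum and variance.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = float(optimizer_bytes) * num_params

    # Each stage partitions one more component across the data-parallel GPUs.
    if stage >= 1:
        optim /= num_gpus   # Stage 1: optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2: gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: parameters

    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Illustrative numbers: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Under these assumptions, a 7.5B-parameter model on 64 GPUs needs roughly 120 GB per GPU with no partitioning, about 31.4 GB at Stage 1, about 16.6 GB at Stage 2, and about 1.9 GB at Stage 3. Only the Stage 3 figure shrinks in full proportion to the GPU count.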

What This Cheat Sheet Covers

This topic spans 16 focused tables and 103 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: ZeRO Optimizer Stages - Core Memory Partitioning
  • Table 2: ZeRO-Infinity and Offloading
  • Table 3: ZeRO Stage 3 Advanced Parameters
  • Table 4: ZeRO++ - Quantized Communication
  • Table 5: 3D Parallelism - Data, Tensor, and Pipeline
  • Table 6: DeepSpeed Initialization and Training Loop
  • Table 7: ds_config.json - Core Configuration Parameters
  • Table 8: Mixed Precision - FP16 and BF16 Configuration
  • Table 9: Activation Checkpointing
  • Table 10: Compressed Communication Optimizers
  • Table 11: DeepSpeed-Inference Engine
  • Table 12: DeepSpeed-MII - Managed Model Inference
  • Table 13: DeepSpeed-Chat - End-to-End RLHF Pipeline
  • Table 14: Hugging Face Trainer and Accelerate Integration
  • Table 15: Profiling and Monitoring Tools
  • Table 16: Data Efficiency and Advanced Training Features

Table 1: ZeRO Optimizer Stages - Core Memory Partitioning

ZeRO (Zero Redundancy Optimizer) is the foundation of DeepSpeed's training efficiency. Each progressive stage partitions more of the model's memory footprint across data-parallel GPUs, with the critical distinction that only Stage 3 partitions the model parameters themselves.

| Technique | Example | Description |
| --- | --- | --- |
| ZeRO Stage 1 | "zero_optimization": {"stage": 1, "reduce_bucket_size": 5e8} | Partitions optimizer states (32-bit weights, Adam first and second moment estimates) across ranks; up to ~4x memory reduction with mixed-precision Adam; no extra communication volume compared with standard data parallelism. |
| ZeRO Stage 2 | "zero_optimization": {"stage": 2, "overlap_comm": true, "contiguous_gradients": true} | Partitions optimizer states and gradients; each rank retains only the gradients matching its optimizer slice; same communication volume as Stage 1, so it is effectively a free upgrade. |
| ZeRO Stage 3 | "zero_optimization": {"stage": 3, "stage3_max_live_parameters": 1e9, "stage3_prefetch_bucket_size": 5e7} | Partitions optimizer states, gradients, and model parameters; memory reduction scales with the number of data-parallel GPUs (16x and beyond); each layer's parameters are all-gathered just before use and released once that layer's forward or backward computation finishes, at the cost of roughly 1.5x the communication volume of Stages 1 and 2. |
| ZeRO Stage 0 | "zero_optimization": {"stage": 0} | ZeRO disabled; standard data parallelism with full replication of all states on every GPU; useful as a correctness baseline. |
| reduce_bucket_size | "reduce_bucket_size": 5e8 | Maximum number of elements reduced/all-reduced at once; trades communication frequency for GPU memory; lower values save memory but increase round trips. |
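The fragments in the table can be assembled into a complete configuration. The sketch below builds a minimal config dict for a chosen stage and writes it out as JSON. The helper `build_zero_config` is hypothetical and not part of DeepSpeed, and the numeric values are simply the example values from the table, not tuned recommendations. `train_micro_batch_size_per_gpu` is included because DeepSpeed needs some batch-size setting.

```python
import json


def build_zero_config(stage, micro_batch_size=1):
    """Build a minimal DeepSpeed config dict for the given ZeRO stage.

    Hypothetical helper for illustration; values mirror the table's examples.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    zero = {"stage": stage}
    if stage >= 1:
        # Bucket size is a count of elements, so keep it an integer.
        zero["reduce_bucket_size"] = int(5e8)
    if stage >= 2:
        zero["overlap_comm"] = True
        zero["contiguous_gradients"] = True
    if stage == 3:
        zero["stage3_max_live_parameters"] = int(1e9)
        zero["stage3_prefetch_bucket_size"] = int(5e7)

    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "zero_optimization": zero,
    }


if __name__ == "__main__":
    config = build_zero_config(stage=2)
    with open("ds_config.json", "w") as f:
        json.dump(config, f, indent=2)
    print(json.dumps(config, indent=2))
```

Either the resulting dict or the path to the written file can then be supplied as the `config` argument of `deepspeed.initialize`, along with a model and an optimizer (or an optimizer section in the config). Starting from Stage 0 and moving upward one stage at a time makes it easier to tell a configuration problem from a model problem.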
