© 2026 CheatGrid™. All rights reserved.

DeepSpeed Cheat Sheet

Updated 2026-05-21

DeepSpeed is Microsoft's open-source deep learning optimization library for efficient distributed training and inference of very large models. It targets the GPU memory wall: the parameters, gradients, and optimizer states of a modern LLM often exceed what any single GPU can hold. DeepSpeed addresses this with the Zero Redundancy Optimizer (ZeRO) and a layered ecosystem of parallelism, offloading, and inference tools. The key mental model: ZeRO does not replicate, it partitions. Each GPU holds only its slice of the optimizer states, gradients, or parameters, so per-GPU memory for the partitioned states shrinks roughly in proportion to the number of data-parallel GPUs while communication overhead stays manageable.
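The partitioning idea can be illustrated with a small pure-Python sketch. It is not DeepSpeed's implementation: `shard_bounds` is a hypothetical helper that simply splits a flat buffer of `numel` elements into near-equal contiguous slices, one per rank.

```python
from typing import Tuple


def shard_bounds(numel: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a flat buffer owned by `rank`.

    Hypothetical helper for illustration only; DeepSpeed's real
    partitioning logic (padding, alignment, bucketing) is more involved.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Ceil-divide so every rank gets the same nominal shard size.
    shard = -(-numel // world_size)
    # Clamp so trailing ranks get a shorter (possibly empty) slice.
    start = min(rank * shard, numel)
    end = min(start + shard, numel)
    return start, end


if __name__ == "__main__":
    # Each rank owns one slice; together the slices cover the whole buffer.
    for rank in range(4):
        print(rank, shard_bounds(10, 4, rank))
```

Every element belongs to exactly one rank, which is the sense in which ZeRO removes redundancy rather than copying state to every GPU.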

What This Cheat Sheet Covers

This cheat sheet is organized into 16 tables covering 103 indexed concepts, from foundational ideas to advanced details. The outline below lists every table.

Table 1: ZeRO Optimizer Stages — Core Memory Partitioning
Table 2: ZeRO-Infinity and Offloading
Table 3: ZeRO Stage 3 Advanced Parameters
Table 4: ZeRO++ — Quantized Communication
Table 5: 3D Parallelism — Data, Tensor, and Pipeline
Table 6: DeepSpeed Initialization and Training Loop
Table 7: ds_config.json — Core Configuration Parameters
Table 8: Mixed Precision — FP16 and BF16 Configuration
Table 9: Activation Checkpointing
Table 10: Compressed Communication Optimizers
Table 11: DeepSpeed-Inference Engine
Table 12: DeepSpeed-MII — Managed Model Inference
Table 13: DeepSpeed-Chat — End-to-End RLHF Pipeline
Table 14: Hugging Face Trainer and Accelerate Integration
Table 15: Profiling and Monitoring Tools
Table 16: Data Efficiency and Advanced Training Features

Table 1: ZeRO Optimizer Stages — Core Memory Partitioning

ZeRO (Zero Redundancy Optimizer) is the foundation of DeepSpeed's training efficiency. Each progressive stage partitions more of the model's memory footprint across data-parallel GPUs, with the critical distinction that only Stage 3 partitions the model parameters themselves.
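A rough way to see what each stage buys is to estimate model-state memory per GPU. The sketch below assumes mixed-precision Adam with the accounting used in the ZeRO paper: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for optimizer state (FP32 master weights, momentum, and variance). It ignores activations, temporary buffers, and fragmentation, and `model_state_bytes_per_gpu` is an illustrative name, not a DeepSpeed API.

```python
def model_state_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Estimate model-state bytes per GPU for a given ZeRO stage.

    Assumes mixed-precision Adam (2 + 2 + 12 bytes per parameter).
    Activations and communication buffers are not included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2.0 * num_params   # FP16 parameters
    grads = 2.0 * num_params    # FP16 gradients
    optim = 12.0 * num_params   # FP32 master weights + Adam momentum + variance

    if stage >= 1:
        optim /= num_gpus       # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus       # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus      # Stage 3: also partition parameters
    return params + grads + optim


if __name__ == "__main__":
    # Example: a 7.5B-parameter model on 64 data-parallel GPUs.
    for stage in range(4):
        gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions the example prints roughly 120, 31.4, 16.6, and 1.9 GB for stages 0 through 3, which shows why only Stage 3 scales model-state memory down with the full GPU count.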

Technique | Example | Description
--- | --- | ---
ZeRO Stage 1 | "zero_optimization": {"stage": 1, "reduce_bucket_size": 5e8} | Partitions optimizer states (FP32 master weights and Adam's first and second moment estimates) across ranks; up to ~4x reduction in model-state memory with mixed-precision Adam; same communication volume as standard data parallelism
ZeRO Stage 2 | "zero_optimization": {"stage": 2, "overlap_comm": true, "contiguous_gradients": true} | Partitions optimizer states and gradients; each rank retains only the gradients matching its optimizer slice; same communication volume as Stage 1, so it is effectively a free upgrade in communication terms
ZeRO Stage 3 | "zero_optimization": {"stage": 3, "stage3_max_live_parameters": 1e9, "stage3_prefetch_bucket_size": 5e7} | Partitions optimizer states, gradients, and model parameters; model-state memory reduction scales linearly with the number of data-parallel GPUs (for example, 16x on 16 GPUs); parameters are all-gathered just before a layer needs them in the forward or backward pass and released right after; about 1.5x the communication volume of standard data parallelism
ZeRO Stage 0 | "zero_optimization": {"stage": 0} | ZeRO disabled; standard data parallelism with full replication of all states on every GPU; useful as a correctness baseline
reduce_bucket_size | "reduce_bucket_size": 5e8 | Maximum number of elements reduced/all-reduced at once; trades communication frequency for GPU memory; lower values save memory but increase the number of communication rounds
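The fragments in the table can be assembled programmatically. The following is a minimal sketch, assuming you want the same example values shown above for each stage; `make_zero_config` is a hypothetical helper, and a real ds_config.json also needs settings outside this table (for example batch size and precision), which are deliberately left out here.

```python
import json


def make_zero_config(stage: int) -> dict:
    """Build the "zero_optimization" section using the table's example values.

    Hypothetical helper for illustration; the values are the examples
    from the table, not tuned recommendations.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    zero = {"stage": stage}
    if stage >= 1:
        # Cap on elements reduced per collective call.
        zero["reduce_bucket_size"] = int(5e8)
    if stage >= 2:
        # Overlap gradient reduction with backward compute and
        # keep gradients in a contiguous buffer.
        zero["overlap_comm"] = True
        zero["contiguous_gradients"] = True
    if stage == 3:
        zero["stage3_max_live_parameters"] = int(1e9)
        zero["stage3_prefetch_bucket_size"] = int(5e7)
    return {"zero_optimization": zero}


if __name__ == "__main__":
    # Print a Stage 3 fragment that could be merged into a full config.
    print(json.dumps(make_zero_config(3), indent=2))
```

The resulting dictionary can be merged into a complete configuration and written to a JSON file, or passed as a Python dict through the `config` argument of `deepspeed.initialize`.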
