Diffusion Models Cheat Sheet

Updated 2026-05-25

Diffusion models are a class of generative models that create data by learning to reverse a gradual noising process, transforming random noise into structured outputs through iterative denoising. Grounded in score-based modeling and stochastic differential equations, they have achieved state-of-the-art results in image, video, and audio generation. Unlike GANs, which require adversarial training, or VAEs, which compress data into fixed latent spaces, diffusion models iteratively refine noise using learned score functions, enabling highly controllable generation with stable training dynamics. The key architectural shift in 2024–2026 has been the widespread adoption of Multimodal Diffusion Transformers (MMDiT), which replace U-Net backbones with scalable transformer architectures. Transformer-based models such as Stable Diffusion 3, FLUX.1, and SANA scale to billions of parameters while maintaining practical inference speed.

What This Cheat Sheet Covers

This topic spans 14 focused tables and 120 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Forward Diffusion Process
Table 2: Reverse Diffusion and Denoising
Table 3: Training Objectives and Loss Functions
Table 4: Model Parameterization and Prediction Targets
Table 5: Sampling Methods and Algorithms
Table 6: Architecture Components (U-Net)
Table 7: Architecture Evolution (Transformers)
Table 8: Latent Diffusion Models
Table 9: Conditioning Techniques
Table 10: Advanced Training Techniques
Table 11: Distillation and Acceleration
Table 12: Evaluation Metrics
Table 13: Applications and Variants
Table 14: Personalization and Fine-Tuning

Table 1: Forward Diffusion Process

The forward process defines how clean data is progressively corrupted into pure noise — the trajectory the model must learn to reverse. Choosing the right noise schedule and timestep sampling strategy is one of the most impactful decisions in training a diffusion model.
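The table that follows uses $\beta_t$ and $\bar{\alpha}_t$ without spelling out how they relate. As a short worked note, assuming the standard DDPM convention (an assumption, since this sheet does not state it), the per-step transition and the closed-form marginal are:

```latex
\begin{aligned}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \\
\alpha_t &= 1-\beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right), \\
x_t &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).
\end{aligned}
```

The last line is the reparameterized form of the marginal, which is why $x_t$ can be sampled at any timestep in one shot rather than by applying $t$ successive noising steps.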

Concept	Example	Description
Forward diffusion	$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$	• Gradually adds Gaussian noise to data over $T$ timesteps until reaching pure noise • defines the corruption trajectory the model learns to reverse
Noise schedule	Linear: $\beta_t = 0.0001 \to 0.02$ Cosine: $\bar{\alpha}_t = f(t)/f(0)$, $f(t) = \cos^2(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2})$	• Controls the variance $\beta_t$ of the noise added at each step • linear raises $\beta_t$ uniformly; cosine adds noise more slowly near both endpoints, preserving signal longer
Signal-to-noise ratio (SNR)	$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$	• Ratio of signal variance to noise variance at timestep $t$ • determines how much original structure remains versus noise corruption
Timestep $t$	Uniform: $t \sim \mathcal{U}\{1, \dots, T\}$ Importance: sample near $\log\text{SNR} \approx 0$	• Index of the corruption level; the model predicts noise conditioned on $t$ • importance sampling focuses training on challenging mid-noise timesteps
Logit-normal timestep sampling	$t \sim \text{LogitNormal}(\mu, \sigma)$	• Biases timestep sampling toward perceptually relevant intermediate scales • used in SD3 and rectified flow training to improve convergence and image quality
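
The rows above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular library's implementation: all function names are hypothetical, timesteps are 0-based array indices, `s=0.008` is the offset commonly used with the cosine schedule, and `mu=0.0`, `sigma=1.0` are illustrative defaults. The logit-normal sampler assumes a continuous $t \in (0, 1)$, as in rectified flow setups.

```python
import numpy as np


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced."""
    return np.linspace(beta_start, beta_end, T)


def alpha_bar_from_betas(betas):
    """Cumulative signal fraction: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - betas)


def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar_t = f(t) / f(0), with
    f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2). Returns entries for t = 1..T."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return (f / f[0])[1:]


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.

    x0: array of shape (batch, ...); t: integer indices of shape (batch,).
    Returns (x_t, eps) so eps can serve as the regression target.
    """
    eps = rng.standard_normal(x0.shape)
    # Reshape alpha_bar[t] to (batch, 1, 1, ...) so it broadcasts over x0.
    ab = alpha_bar[t].reshape(-1, *([1] * (x0.ndim - 1)))
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps


def snr(alpha_bar):
    """Signal-to-noise ratio: alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)


def sample_timesteps_uniform(batch, T, rng):
    """Uniform discrete timestep indices in {0, ..., T-1}."""
    return rng.integers(0, T, size=batch)


def sample_t_logit_normal(batch, rng, mu=0.0, sigma=1.0):
    """Continuous t in (0, 1): t = sigmoid(u) with u ~ N(mu, sigma^2)."""
    u = rng.normal(mu, sigma, size=batch)
    return 1.0 / (1.0 + np.exp(-u))


# Example usage with a toy batch of "images".
rng = np.random.default_rng(0)
T = 1000
alpha_bar = alpha_bar_from_betas(linear_beta_schedule(T))
x0 = rng.standard_normal((4, 3, 8, 8))
t = sample_timesteps_uniform(4, T, rng)
x_t, eps = q_sample(x0, t, alpha_bar, rng)
```

Because the logit-normal density peaks at intermediate values of $t$, training batches drawn this way see mid-noise levels more often than a uniform sampler would. With the cosine schedule, $\bar{\alpha}_T$ is essentially zero, so practical implementations usually clip the implied $\beta_t$ to avoid a singular final step.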
