Diffusion models are a class of generative models that create data by learning to reverse a gradual noising process, transforming random noise into structured outputs through iterative denoising. Grounded in stochastic differential equations and score-based modeling, they have achieved state-of-the-art results in image, video, and audio generation. Unlike GANs, which require adversarial training, or VAEs, which compress data into a fixed latent space, diffusion models iteratively refine noise using learned score functions, enabling highly controllable generation with stable training dynamics. The key architectural shift in 2024–2026 has been the widespread adoption of Multimodal Diffusion Transformers (MMDiT), which replace U-Net backbones with scalable transformer architectures; models such as Stable Diffusion 3, FLUX.1, and SANA scale to billions of parameters while maintaining practical inference speed.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 120 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Forward Diffusion Process
The forward process defines how clean data is progressively corrupted into pure noise — the trajectory the model must learn to reverse. Choosing the right noise schedule and timestep sampling strategy is one of the most impactful decisions in training a diffusion model.
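
As a notation aid, the symbols used in this table can be related as follows. This is a sketch that assumes the standard DDPM parameterization, which the formulas here appear to follow; the per-step kernel $q(x_t \mid x_{t-1})$ is not stated in the table itself.

$$
\begin{aligned}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \\
\alpha_t &= 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)
\;\Longleftrightarrow\;
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).
\end{aligned}
$$

Under this parameterization, $x_t$ can be drawn directly from $x_0$ in one step, without simulating the intermediate steps.
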
| Concept | Example | Description |
|---|---|---|
| Forward process | $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ | • Gradually adds Gaussian noise to the data over $T$ timesteps until it reaches pure noise • Defines the corruption trajectory the model learns to reverse |
| Noise schedule | Linear: $\beta_t: 0.0001 \to 0.02$; Cosine: $\bar{\alpha}_t = f(t)/f(0)$, $f(t) = \cos^2\!\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$ | • Controls the variance $\beta_t$ of the noise added at each step • Linear increases $\beta_t$ uniformly; cosine changes the noise level more slowly near the endpoints, preserving signal longer |
| Signal-to-noise ratio (SNR) | $\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$ | • Ratio of signal variance to noise variance at timestep $t$ • Determines how much original structure remains relative to the noise |
| Timestep sampling | Uniform: $t \sim \mathcal{U}[0, T]$; Importance: sample near $\log\text{SNR} \approx 0$ | • $t$ indexes the corruption level; the model predicts noise conditioned on $t$ • Importance sampling focuses training on the challenging mid-noise timesteps |
| Logit-normal timestep sampling | $t \sim \text{LogitNormal}(\mu, \sigma)$ | • Biases timestep sampling toward perceptually relevant intermediate noise levels • Used in SD3 and rectified flow training to improve convergence and image quality |
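
The rows above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: all function names are hypothetical, the cosine offset `s = 0.008` is an assumed value that the table does not give, and the cosine schedule is normalized as $f(t)/f(0)$ so that $\bar{\alpha}_0 = 1$.

```python
import numpy as np


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear schedule: beta_t rises evenly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, T)


def alpha_bar_from_betas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), for t = 1..T."""
    return np.cumprod(1.0 - betas)


def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar_t = f(t) / f(0) for t = 0..T.

    s = 0.008 is an assumed offset; the table does not specify a value.
    """
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]


def snr(alpha_bar):
    """SNR(t) = alpha_bar_t / (1 - alpha_bar_t); undefined where alpha_bar_t == 1."""
    return alpha_bar / (1.0 - alpha_bar)


def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps


def sample_uniform_timesteps(n, T, rng):
    """Uniform integer timesteps in {1, ..., T}."""
    return rng.integers(1, T + 1, size=n)


def sample_logit_normal(n, rng, mu=0.0, sigma=1.0):
    """Continuous t in (0, 1): the sigmoid of a Normal(mu, sigma) draw."""
    u = rng.normal(mu, sigma, size=n)
    return 1.0 / (1.0 + np.exp(-u))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 1000
    alpha_bar = alpha_bar_from_betas(linear_betas(T))
    x0 = rng.standard_normal((4, 8))  # stand-in for a batch of clean data
    t = sample_uniform_timesteps(1, T, rng)[0]
    x_t, eps = q_sample(x0, alpha_bar[t - 1], rng)  # index t-1 holds alpha_bar_t
    print(t, snr(alpha_bar[t - 1]), x_t.shape)
```

During training, the returned `eps` would typically serve as the regression target for a noise-prediction model. The logit-normal sampler returns continuous times in $(0, 1)$, so pairing it with a discrete schedule would require an extra mapping step that is not shown here.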