Stable Diffusion is an open-source latent diffusion model family that generates images from text descriptions. The original SD 1.x/2.x series uses a U-Net backbone operating in a VAE-compressed latent space and conditioned on CLIP-family text encoders; SD 3 and later shifted to a Multimodal Diffusion Transformer (MMDiT) architecture with triple text-encoder conditioning (CLIP-L, OpenCLIP-G, T5-XXL). Competing models such as FLUX.1 and the newer FLUX.2 (Black Forest Labs) apply a transformer-based flow-matching design, with FLUX.2 coupling a 32B rectified flow transformer to a vision-language model for multi-reference editing. Understanding generation parameters, from CFG scale to sampling schedulers, gives precise control over output, while extensions like ControlNet, LoRA, and IP-Adapter enable advanced customization without retraining the full model.
What This Cheat Sheet Covers
This cheat sheet spans 21 focused tables and 170 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Model Versions
The diffusion model ecosystem evolved rapidly from SD 1.5's modest 983M parameters (U-Net plus text encoder) to transformer giants like the 32B-parameter FLUX.2 [dev]. Each model's architecture, VRAM requirements, and license determine whether it is practical for your hardware and use case. Models are ordered from most widely deployed to most specialized.
| Model | Description |
|---|---|
| stabilityai/stable-diffusion-xl-base-1.0 | • 1024×1024 native resolution • 3.5B parameters • dual text encoders (CLIP ViT-L + OpenCLIP ViT-bigG) • optional refiner model; largest extension ecosystem after SD 1.5 |
| runwayml/stable-diffusion-v1-5 | • 512×512 base resolution • 983M parameters • largest LoRA/embedding/extension ecosystem overall • fastest on low VRAM |
| black-forest-labs/FLUX.1-dev | • 12B parameter flow-matching transformer (Black Forest Labs) • guidance-distilled; excels at text rendering and photorealism • 20–50 steps; non-commercial license |
| black-forest-labs/FLUX.1-schnell | • Step-distilled Flux variant; 1–4 steps for rapid generation • Apache 2.0 license • slight quality tradeoff vs dev |
| black-forest-labs/FLUX.2-dev | • 32B parameter rectified flow transformer (Black Forest Labs, Nov 2025) • multi-reference support (up to 10 images), image generation + editing in one model • up to 4MP output; couples with Mistral-3 24B VLM for world knowledge • FP8 quantization runs on RTX 4090 via weight streaming; non-commercial license |
| Flux Kontext dev (12B) | • In-context image editing model (Black Forest Labs, May 2025) • edits existing images via text instructions • maintains character/style consistency across edits |
| HiDream-ai/HiDream-I1-Full | • 17B parameter sparse Diffusion Transformer with dynamic MoE architecture (HiDream.ai, April 2025) • MIT license; SOTA on DPG-Bench and GenEval • four text encoders: OpenCLIP ViT-bigG + CLIP ViT-L + T5-XXL + Llama-3.1-8B • variants: Full (50 steps), Dev (28 steps), Fast (16 steps) |
| stabilityai/stable-diffusion-3-5-large | • 8B parameters; MMDiT with CLIP-L + OpenCLIP-G + T5-XXL • requires 18GB+ VRAM (GGUF ~12GB) • best prompt adherence in Stability lineup |
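As a rough way to relate the parameter counts above to hardware, the sketch below estimates the memory needed for model weights alone as parameters × bytes per parameter. The names `ModelSpec`, `weight_gib`, and `fits` are hypothetical helpers written for this illustration, not part of any library. The result is only a lower bound: it ignores activations, the VAE, any text encoders not included in the listed parameter count, and offloading or weight-streaming tricks, so real VRAM needs are typically higher.

```python
from dataclasses import dataclass

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0}


@dataclass(frozen=True)
class ModelSpec:
    repo_id: str
    params_billion: float  # parameter count as listed in the table above


# Only rows that list both a repository ID and a parameter count.
MODELS = [
    ModelSpec("runwayml/stable-diffusion-v1-5", 0.983),
    ModelSpec("stabilityai/stable-diffusion-xl-base-1.0", 3.5),
    ModelSpec("stabilityai/stable-diffusion-3-5-large", 8.0),
    ModelSpec("black-forest-labs/FLUX.1-dev", 12.0),
    ModelSpec("HiDream-ai/HiDream-I1-Full", 17.0),
    ModelSpec("black-forest-labs/FLUX.2-dev", 32.0),
]


def weight_gib(params_billion: float, precision: str = "fp16") -> float:
    """Estimate weight storage in GiB (weights only, a lower bound on VRAM)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 2**30


def fits(budget_gib: float, precision: str = "fp16") -> list[str]:
    """Return repo IDs whose weights alone fit within the given budget."""
    return [
        m.repo_id
        for m in MODELS
        if weight_gib(m.params_billion, precision) <= budget_gib
    ]


if __name__ == "__main__":
    for m in MODELS:
        print(f"{m.repo_id}: {weight_gib(m.params_billion):.1f} GiB at fp16")
    print("Weights fit in 24 GiB at fp8:", fits(24, "fp8"))
```

Under these assumptions, the 32B FLUX.2-dev weights come to roughly 29.8 GiB even at FP8, which is consistent with the table's note that a 24 GB RTX 4090 relies on weight streaming. The 8B SD 3.5 Large weights come to roughly 14.9 GiB at FP16, below the 18GB+ figure in the table, as expected for a weights-only estimate.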