Stable Diffusion Cheat Sheet

Updated 2026-05-28


Stable Diffusion is an open-source latent diffusion model family that generates images from text descriptions. The original SD 1.x/2.x series uses a U-Net backbone operating in a VAE-compressed latent space guided by CLIP text encoders; SD 3 and later shifted to a Multimodal Diffusion Transformer (MMDiT) architecture with triple text-encoder conditioning (CLIP-L, OpenCLIP-G, T5-XXL). Competing models like Flux.1 and the newer FLUX.2 (Black Forest Labs) apply a transformer-based flow-matching design, with FLUX.2 coupling a 32B rectified flow transformer to a vision-language model for multi-reference editing. Understanding generation parameters, from CFG scale to sampling schedulers, gives precise control over output, while extensions like ControlNet, LoRA, and IP-Adapter enable advanced customization without retraining the full model.

What This Cheat Sheet Covers

This topic spans 21 focused tables and 170 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.

Table 1: Core Model Versions
Table 2: Generation Modes
Table 3: Key Generation Parameters
Table 4: Sampling Methods (Schedulers)
Table 5: Prompt Engineering Techniques
Table 6: ControlNet Models
Table 7: Fine-Tuning Techniques
Table 8: Advanced Extensions
Table 9: VAE (Variational Autoencoder)
Table 10: UI Platforms
Table 11: Hardware Requirements
Table 12: Model File Formats
Table 13: Upscaling Methods
Table 14: Common Artifacts & Fixes
Table 15: Aspect Ratios & Resolutions
Table 16: Popular Community Models
Table 17: Prompt Modifiers by Category
Table 18: Negative Prompt Essentials
Table 19: CLIP Text Encoder
Table 20: Latent Diffusion Process
Table 21: Performance Optimization

Table 1: Core Model Versions

The diffusion model ecosystem evolved rapidly from SD 1.5's modest 983M parameters (a roughly 860M U-Net plus its CLIP text encoder) to transformer giants like FLUX.2 [dev] at 32B parameters; knowing each model's architecture, VRAM requirements, and license determines which is practical for your hardware and use case. Models are ordered roughly from most widely deployed to most specialized.

Model	Example	Description
SDXL 1.0	`stabilityai/stable-diffusion-xl-base-1.0`	• 1024×1024 native resolution • 3.5B parameters • dual text encoders (CLIP ViT-L + OpenCLIP ViT-bigG) • optional refiner model; largest extension ecosystem after SD 1.5
SD 1.5	`stable-diffusion-v1-5/stable-diffusion-v1-5` (mirror of the original `runwayml/stable-diffusion-v1-5`, which is no longer hosted)	• 512×512 base resolution • 983M parameters • largest LoRA/embedding/extension ecosystem overall • fastest on low VRAM
Flux.1 [dev]	`black-forest-labs/FLUX.1-dev`	• 12B parameter flow-matching transformer (Black Forest Labs) • guidance-distilled; excels at text rendering and photorealism • 20–50 steps; non-commercial license
Flux.1 [schnell]	`black-forest-labs/FLUX.1-schnell`	• Step-distilled Flux variant; 1–4 steps for rapid generation • Apache 2.0 license • slight quality tradeoff vs dev
FLUX.2 [dev]	`black-forest-labs/FLUX.2-dev`	• 32B parameter rectified flow transformer (Black Forest Labs, Nov 2025) • multi-reference support (up to 10 images), image generation + editing in one model • up to 4MP output; couples with Mistral-3 24B VLM for world knowledge • FP8 quantization runs on RTX 4090 via weight streaming; non-commercial license
Flux.1 Kontext [dev]	Flux Kontext dev (12B)	• In-context image editing model (Black Forest Labs, May 2025) • edits existing images via text instructions • maintains character/style consistency across edits
HiDream-I1	`HiDream-ai/HiDream-I1-Full`	• 17B parameter sparse Diffusion Transformer with dynamic MoE architecture (HiDream.ai, April 2025) • MIT license; SOTA on DPG-Bench and GenEval • four text encoders: OpenCLIP ViT-bigG + CLIP ViT-L + T5-XXL + Llama-3.1-8B • variants: Full (50 steps), Dev (28 steps), Fast (16 steps)
SD 3.5 Large	`stabilityai/stable-diffusion-3.5-large`	• 8B parameters; MMDiT with CLIP-L + OpenCLIP-G + T5-XXL • requires 18GB+ VRAM (GGUF ~12GB) • best prompt adherence in Stability lineup
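
The parameter counts in Table 1 can be turned into a rough memory estimate: weight size is approximately the parameter count times the bytes per parameter. The sketch below is only a back-of-the-envelope aid, not a measurement. The helper names (`weight_gb`, `fits_in_vram`) are hypothetical, the parameter counts are copied from the table, and the estimate ignores activations, the VAE, and any text encoders not included in the listed count, so real VRAM use will be higher.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts come from Table 1; the helpers are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0}

# Parameter counts in billions, as listed in Table 1.
PARAMS_B = {
    "SD 1.5": 0.983,
    "SDXL 1.0": 3.5,
    "SD 3.5 Large": 8.0,
    "Flux.1 [dev]": 12.0,
    "HiDream-I1": 17.0,
    "FLUX.2 [dev]": 32.0,
}


def weight_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_in_vram(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit; activations and other components are ignored."""
    return weight_gb(params_billion, precision) <= vram_gb


if __name__ == "__main__":
    for name, params in PARAMS_B.items():
        fp16 = weight_gb(params, "fp16")
        fp8 = weight_gb(params, "fp8")
        print(f"{name:14s} fp16 ~{fp16:5.1f} GB   fp8 ~{fp8:5.1f} GB")
```

Under these assumptions, FLUX.2 [dev] at FP8 needs about 32 GB for weights alone, which exceeds the 24 GB of an RTX 4090 and is consistent with the table's note that it runs there only via weight streaming. SD 3.5 Large at FP16 comes to about 16 GB, in line with the 18GB+ figure once other components are added.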
