Generative AI Cheat Sheet

Updated 2026-04-20

Generative AI refers to a class of artificial intelligence systems that create new content (text, images, audio, video, code, 3D assets, and more) by learning patterns from training data. Unlike discriminative models, which classify or predict from existing inputs, generative models learn the underlying distribution of the data and sample from it to produce novel outputs that resemble real examples. As of 2026, the field is defined by transformer-based large language models (GPT-5, Claude, Gemini), diffusion-based image and video generators (Flux, DALL-E, Stable Diffusion), agentic AI systems that plan and act autonomously, and hybrid architectures combining attention with state space models. A key insight: while these models appear to "understand" content, they fundamentally operate by producing statistically likely outputs (for language models, likely continuations of the preceding tokens), which shapes both their capabilities and their limitations in reasoning, factual accuracy, and consistency.

What This Cheat Sheet Covers

This cheat sheet spans 21 focused tables and 207 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Model Architectures
Table 2: Large Language Model Types
Table 3: Attention Mechanisms
Table 4: Image Generation Techniques
Table 5: Training Approaches
Table 6: Prompt Engineering Techniques
Table 7: Sampling and Decoding Methods
Table 8: Embeddings and Vector Representations
Table 9: Retrieval-Augmented Generation (RAG)
Table 10: Fine-Tuning and Optimization
Table 11: Evaluation Metrics
Table 12: AI Safety and Alignment
Table 13: Inference Optimization
Table 14: Multimodal Capabilities
Table 15: AI Agents and Tool Use
Table 16: Context and Memory Management
Table 17: Scaling Laws and Training Theory
Table 18: Common Use Cases and Applications
Table 19: Model Deployment Patterns
Table 20: Data Preparation and Formatting
Table 21: Advanced Training Techniques

Table 1: Core Model Architectures

Every generative system is built on one of a handful of structural blueprints, and knowing them is the foundation for everything else. Transformers and their attention mechanism dominate language, diffusion models and DiTs rule image and video, and state space models like Mamba challenge attention on long sequences — while older families like GANs, VAEs, and flow-based models each solved generation in their own way before being partly superseded. Recognizing which architecture a model uses tells you most of what to expect about its strengths, costs, and failure modes.

Architecture	Example	Description
Transformer	`attention_output = softmax(QK^T / sqrt(d_k))V`	• Foundation architecture using self-attention to process sequences in parallel • powers most modern LLMs and multimodal models
Diffusion Model	`x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise` `x_{t-1} = denoise(x_t, t)`	• Gradually adds noise to data during training, then learns to reverse the process for generation • dominant in image and video synthesis
Autoregressive Model	`P(x) = P(x1) * P(x2|x1) * P(x3|x1,x2) ...`	• Generates sequences one token at a time, conditioning each on all previous tokens • includes GPT family and most language models
Diffusion Transformer (DiT)	`latent_patches → transformer_blocks → denoise`	• Replaces U-Net backbone with vision transformer in diffusion models • powers Flux, Sora, and Stable Diffusion 3; scales better with compute
State Space Model (Mamba)	`h_t = A * h_{t-1} + B * x_t` `y_t = C * h_t`	• Uses selective state space updates instead of attention for linear-time sequence modeling • compute grows linearly with sequence length rather than quadratically as in standard self-attention, which makes long contexts cheaper to process
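
The formulas in the Example column can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any production model implements them: the function names are hypothetical, `alpha_t` is assumed to be a signal-retention factor in (0, 1], and the state space recurrence uses fixed scalar parameters, whereas Mamba makes its parameters depend on the input (the "selective" part).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Transformer row: softmax(QK^T / sqrt(d_k)) V, on lists of vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # One scaled dot-product score per key vector.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output vector is the attention-weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


def forward_noise(x0, alpha_t, rng):
    """Diffusion row: x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise.

    Assumes alpha_t is in (0, 1]; rng is a random.Random instance.
    """
    keep = math.sqrt(alpha_t)
    spread = math.sqrt(1.0 - alpha_t)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in x0]


def sequence_probability(conditionals):
    """Autoregressive row: P(x) as the product of per-token conditionals."""
    p = 1.0
    for c in conditionals:
        p *= c
    return p


def ssm_scan(xs, A, B, C):
    """State space row with fixed scalar A, B, C (non-selective assumption).

    h_t = A * h_{t-1} + B * x_t, y_t = C * h_t, starting from h = 0.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = A * h + B * x
        ys.append(C * h)
    return ys


if __name__ == "__main__":
    print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
    print(forward_noise([1.0, 2.0], 0.9, random.Random(0)))
    print(sequence_probability([0.5, 0.5, 0.5]))  # 0.125
    print(ssm_scan([1.0, 1.0, 1.0], A=0.5, B=1.0, C=2.0))  # [2.0, 3.0, 3.5]
```

The sketch also makes the cost difference visible: `attention` compares every query against every key, so its work grows with the square of the sequence length, while `ssm_scan` makes a single pass that carries only the running state `h`.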
