Generative AI refers to a class of artificial intelligence systems that create new content — text, images, audio, video, code, 3D assets, and more — by learning patterns from training data. Unlike discriminative models that classify or predict from existing inputs, generative models learn the underlying distribution of data to produce novel outputs that resemble real examples. As of 2026, the field is defined by transformer-based large language models (GPT-5, Claude, Gemini), diffusion-based image and video generators (Flux, DALL-E, Stable Diffusion), agentic AI systems that plan and act autonomously, and hybrid architectures combining attention with state space models. A key insight: while these models appear to "understand" content, they fundamentally operate by predicting likely continuations based on statistical patterns — which shapes both their capabilities and limitations in reasoning, factual accuracy, and consistency.
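To make "predicting likely continuations based on statistical patterns" concrete, here is a minimal sketch using a toy bigram model. It is an illustration only: the corpus, the helper names (`train_bigram`, `next_token_probs`, `generate`), and the bigram simplification are all assumptions made for this example, and real language models condition on far longer contexts with learned neural networks rather than raw counts.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def generate(counts, start, length, seed=0):
    """Sample a continuation one token at a time from the learned statistics."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        probs = next_token_probs(counts, out[-1])
        if not probs:  # no observed continuation for this token: stop early
            break
        toks, weights = zip(*probs.items())
        out.append(rng.choices(toks, weights=weights)[0])
    return out

# Toy corpus (hypothetical); "the" is followed by "cat" twice and "mat" once.
corpus = "the cat sat on the mat and the cat ran".split()
model = train_bigram(corpus)
print(next_token_probs(model, "the"))
print(generate(model, "the", 5))
```

The model has no notion of what a cat is; it only reproduces which tokens tended to follow which. That is the same basic limitation, at a much smaller scale, that the paragraph above attributes to large generative models.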
What This Cheat Sheet Covers
This topic spans 21 focused tables and 207 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Model Architectures
Every generative system is built on one of a handful of structural blueprints, and knowing them is the foundation for everything else. Transformers and their attention mechanism dominate language, diffusion models and diffusion transformers (DiTs) rule image and video, and state space models like Mamba challenge attention on long sequences, while older families like GANs, VAEs, and flow-based models each solved generation in their own way before being partly superseded. Recognizing which architecture a model uses tells you much of what to expect about its strengths, costs, and failure modes.
| Architecture | Example | Description |
|---|---|---|
| Transformer | `attention_output = softmax(QK^T / sqrt(d_k))V` | Foundation architecture using self-attention to process sequences in parallel; powers most modern LLMs and multimodal models |
| Diffusion model | `x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise`; `x_{t-1} = denoise(x_t, t)` | Gradually adds noise to data during training and learns to reverse the process, then generates by denoising step by step; dominant in image and video synthesis |
| Autoregressive model | `P(x) = P(x1) * P(x2\|x1) * P(x3\|x1,x2) ...` | Generates sequences one token at a time, conditioning each on all previous tokens; includes the GPT family and most language models |
| Diffusion Transformer (DiT) | `latent_patches → transformer_blocks → denoise` | Replaces the U-Net backbone with a vision transformer in diffusion models; powers Flux, Sora, and Stable Diffusion 3, and scales better with compute |
| State space model (Mamba) | `h_t = A * h_{t-1} + B * x_t`; `y_t = C * h_t` | Uses selective state space updates instead of attention for linear-time sequence modeling; can enable faster inference and longer context than attention-based transformers |
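Three of the formulas in the table can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`attention`, `add_noise`, `ssm_scan`) are hypothetical, `alpha_t` is treated as the fraction of signal retained at step `t` (often written as a cumulative product in the diffusion literature), and the state space recurrence uses fixed scalar `A`, `B`, `C`, whereas selective models such as Mamba make these parameters depend on the input.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (lists of floats).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def add_noise(x0, alpha_t, rng):
    """Forward diffusion: x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise."""
    return [math.sqrt(alpha_t) * x + math.sqrt(1 - alpha_t) * rng.gauss(0, 1) for x in x0]

def ssm_scan(xs, A, B, C, h0=0.0):
    """Scalar linear recurrence: h_t = A * h_{t-1} + B * x_t, y_t = C * h_t."""
    h, ys = h0, []
    for x in xs:
        h = A * h + B * x  # update the hidden state
        ys.append(C * h)   # read out the output
    return ys

# One query that matches the first key more closely than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))

# With alpha_t = 1.0 no noise is mixed in, so the input comes back unchanged.
print(add_noise([1.0, -1.0], alpha_t=1.0, rng=random.Random(0)))

# An impulse decays geometrically through the state: prints [2.0, 1.0, 0.5]
print(ssm_scan([1.0, 0.0, 0.0], A=0.5, B=1.0, C=2.0))
```

The recurrence in `ssm_scan` touches each input once, which is the source of the linear-time scaling noted in the table, while `attention` compares every query against every key and therefore grows quadratically with sequence length.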