Generative AI refers to a class of artificial intelligence systems that create new content (text, images, audio, video, code, 3D assets, and more) by learning patterns from training data. Unlike discriminative models, which classify or predict from existing inputs, generative models learn the underlying distribution of the data so they can produce novel outputs that resemble real examples. As of 2026, the field is defined by transformer-based large language models (GPT-5, Claude, Gemini), diffusion-based image and video generators (Flux, DALL-E, Stable Diffusion), agentic AI systems that plan and act autonomously, and hybrid architectures combining attention with state space models. A key insight: although these models appear to "understand" content, they fundamentally operate by reproducing statistical patterns, which for language models means predicting likely continuations. This shapes both their capabilities and their limitations in reasoning, factual accuracy, and consistency.
What This Cheat Sheet Covers
This topic spans 21 focused tables and 207 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Model Architectures
| Architecture | Example | Description |
|---|---|---|
| Transformer | `attention_output = softmax(QK^T / sqrt(d_k))V` | • Foundation architecture using self-attention to process sequences in parallel • powers most modern LLMs and multimodal models |
| Diffusion Model | `x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise`; `x_{t-1} = denoise(x_t, t)` | • Gradually adds noise to data during training, then learns to reverse the process for generation • dominant in image and video synthesis |
| Autoregressive Model | `P(x) = P(x1) * P(x2\|x1) * P(x3\|x1,x2) ...` | • Generates sequences one token at a time, conditioning each on all previous tokens • includes GPT family and most language models |
| Diffusion Transformer (DiT) | `latent_patches → transformer_blocks → denoise` | • Replaces U-Net backbone with vision transformer in diffusion models • powers Flux, Sora, and Stable Diffusion 3; scales better with compute |
| State Space Model (SSM) | `h_t = A * h_{t-1} + B * x_t`; `y_t = C * h_t` | • Uses selective state space updates instead of attention for linear-time sequence modeling • can offer faster inference and longer usable context than attention-based transformers |
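
The formulas in the Example column can be made concrete with a small sketch. The following is a minimal, standard-library-only illustration, not a production implementation. All function names are hypothetical; `alpha_t` is assumed to be the cumulative signal fraction in [0, 1]; `cond_prob` is an assumed callable standing in for a trained model; and the state space recurrence uses fixed scalar `A`, `B`, `C`, whereas selective state space models make these depend on the input.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V need the same row count.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


def add_noise(x0, alpha_t, rng=random):
    """Forward diffusion: x_t = sqrt(alpha_t)*x_0 + sqrt(1 - alpha_t)*noise.

    Assumption: alpha_t is the cumulative signal fraction in [0, 1].
    """
    signal = math.sqrt(alpha_t)
    sigma = math.sqrt(1.0 - alpha_t)
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


def sequence_probability(tokens, cond_prob):
    """Chain rule: P(x) = P(x1) * P(x2|x1) * P(x3|x1,x2) ...

    cond_prob(token, prefix) is a hypothetical callable returning
    P(token | prefix); a real model would supply it.
    """
    prob = 1.0
    for i, token in enumerate(tokens):
        prob *= cond_prob(token, tokens[:i])
    return prob


def ssm_scan(xs, A, B, C, h0=0.0):
    """Scalar recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.

    Simplification: A, B, C are fixed scalars here, not input-dependent.
    """
    h, ys = h0, []
    for x in xs:
        h = A * h + B * x   # update the hidden state
        ys.append(C * h)    # read out the observation
    return ys
```

For example, `ssm_scan([1.0, 0.0, 0.0], A=0.5, B=1.0, C=2.0)` returns `[2.0, 1.0, 0.5]`: a single input impulse decays geometrically through the hidden state. The loop costs one step per input, which is the linear-time behaviour the table describes, whereas `attention` compares every query against every key.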