Variational Autoencoders (VAEs), introduced by Kingma and Welling in 2013, are probabilistic generative models that learn to encode data into a continuous latent space and reconstruct it through a decoder. Unlike traditional autoencoders, VAEs regularize the latent space toward a prior distribution (typically a standard Gaussian), so new samples can be generated by drawing latent codes from that prior and decoding them. Beyond standalone use, VAEs serve as the encoder-decoder backbone of latent diffusion models, including Stable Diffusion, DALL-E 3, and video generation systems like Sora, compressing high-dimensional inputs into compact latent representations in which the generative model can operate far more efficiently. The key technical insight is the reparameterization trick, which makes stochastic sampling differentiable for end-to-end training, while the Evidence Lower Bound (ELBO) objective balances reconstruction quality against latent space regularization.
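
The balance between reconstruction quality and latent regularization can be illustrated with a small numerical sketch. It assumes a diagonal Gaussian approximate posterior, a standard normal prior, and a squared-error reconstruction term (equivalent to a unit-variance Gaussian decoder with constants dropped). The function names `kl_to_standard_normal` and `negative_elbo` and the toy values are hypothetical, chosen only for illustration.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims."""
    return -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=-1)

def negative_elbo(x, x_reconstructed, z_mean, z_log_var):
    """Per-example loss: reconstruction error plus the KL regularizer."""
    # Assumption: unit-variance Gaussian decoder, so the reconstruction term
    # reduces to half the squared error (additive constants dropped).
    reconstruction = 0.5 * np.sum((x - x_reconstructed) ** 2, axis=-1)
    return reconstruction + kl_to_standard_normal(z_mean, z_log_var)

# Toy batch of one example (made-up values, not real data).
x = np.array([[1.0, 0.0, 2.0]])
x_reconstructed = np.array([[1.0, 1.0, 2.0]])
z_mean = np.array([[0.0, 1.0]])
z_log_var = np.array([[0.0, 0.0]])

kl = kl_to_standard_normal(z_mean, z_log_var)
loss = negative_elbo(x, x_reconstructed, z_mean, z_log_var)
print("KL:", kl, "negative ELBO:", loss)
```

When the approximate posterior equals the prior (zero mean, zero log-variance), the KL term vanishes and only the reconstruction error remains; any shift in the mean or variance is penalized.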
What This Cheat Sheet Covers
This topic spans 13 focused tables and 107 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Architecture Components
A VAE consists of a paired encoder and decoder connected by a stochastic bottleneck: the encoder maps input data to a probability distribution over latent codes, and the decoder maps codes sampled from that distribution back to data space. Understanding the interplay between prior, approximate posterior, and decoder likelihood is essential before tackling any VAE variant.
| Component | Example | Description |
|---|---|---|
| Encoder | `z_mean, z_log_var = encoder(x)  # Maps input to latent params` | • Neural network that maps input $x$ to the parameters of a probability distribution (mean $\mu$ and log-variance $\log \sigma^2$) in the latent space • Typically uses CNN layers for images or MLP layers for tabular data |
| Decoder | `x_reconstructed = decoder(z)  # Maps latent code to output` | • Neural network that reconstructs the input from latent code $z$ • Mirrors the encoder architecture in reverse, often using transposed convolutions or upsample+conv for images |
| Latent Space | `z ~ N(mu, sigma^2)  # Gaussian distribution` | • Low-dimensional continuous representation where each dimension ideally captures a meaningful factor of variation • Enables smooth interpolation and generation of new samples |
| Prior Distribution | `p(z) = N(0, I)  # Standard Gaussian prior` | • Assumed distribution over latent variables before observing data • Typically standard normal $\mathcal{N}(0, I)$ to simplify KL divergence computation and enable random sampling at inference |
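
The four components in the table can be wired together in a few lines. The following is a minimal NumPy sketch, not a trainable implementation: the layer sizes, the single `tanh` hidden layer, and the helper `sample_latent` are assumptions made for illustration, and randomly initialized weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, latent_dim = 8, 16, 2  # hypothetical sizes

# Random weights stand in for trained parameters (illustration only).
W_enc = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_mean = rng.normal(scale=0.1, size=(hidden_dim, latent_dim))
W_log_var = rng.normal(scale=0.1, size=(hidden_dim, latent_dim))
W_dec_hidden = rng.normal(scale=0.1, size=(latent_dim, hidden_dim))
W_dec_out = rng.normal(scale=0.1, size=(hidden_dim, input_dim))

def encoder(x):
    """Map inputs to the mean and log-variance of the approximate posterior."""
    h = np.tanh(x @ W_enc)
    return h @ W_mean, h @ W_log_var

def sample_latent(z_mean, z_log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps

def decoder(z):
    """Map latent codes back to data space."""
    h = np.tanh(z @ W_dec_hidden)
    return h @ W_dec_out

# Reconstruction path: encode, sample from the posterior, decode.
x = rng.normal(size=(4, input_dim))
z_mean, z_log_var = encoder(x)
z = sample_latent(z_mean, z_log_var, rng)
x_reconstructed = decoder(z)

# Generation path: sample from the prior p(z) = N(0, I) and decode.
z_prior = rng.standard_normal((3, latent_dim))
x_generated = decoder(z_prior)
print(x_reconstructed.shape, x_generated.shape)
```

Predicting the log-variance rather than the variance keeps the encoder output unconstrained, since `np.exp(0.5 * z_log_var)` is always a positive standard deviation.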