Synthetic data generation is the process of creating artificial data that mirrors the statistical properties and patterns of real-world datasets without containing actual observations. Born from the intersection of generative modeling, privacy engineering, and machine learning, it has become a cornerstone technique for addressing data scarcity, privacy constraints, and class imbalance. What makes the field particularly powerful is that high-quality synthetic data can sometimes be more useful than real data in specific scenarios, such as rare event modeling or privacy-sensitive applications, provided it is generated and validated correctly. Keep this in mind: the goal is never perfect replication, but faithful statistical representation that preserves utility while minimizing privacy risk.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 108 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Generative Model Approaches
| Method | Example | Description |
|---|---|---|
| GAN (Generative Adversarial Network) | Generator vs. discriminator trained adversarially | • Two-network competition where the generator creates synthetic samples and the discriminator distinguishes real from fake • Most widely used for images and tabular data; prone to mode collapse |
| CTGAN | `from sdv.tabular import CTGAN; model = CTGAN(); model.fit(df)` | • GAN specifically designed for tabular data, with mode-specific normalization and a conditional vector • Handles mixed data types and imbalanced categorical columns better than vanilla GANs |
| VAE (Variational Autoencoder) | Encoder → latent $z \sim \mathcal{N}(\mu, \sigma^2)$ → Decoder | • Probabilistic encoder-decoder that learns compressed latent representations • Generates smooth interpolations but often produces blurrier outputs than GANs |
| TVAE | VAE variant for structured data | • VAE adapted for tabular data with mixed types • Faster training than CTGAN; better for smaller datasets |
| Diffusion Models | Forward diffusion adds noise → reverse learns denoising | • Iterative denoising process that generates data by reversing a gradual noise addition • State-of-the-art image quality; slower inference than GANs |
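
The forward (noising) half of the diffusion row can be illustrated with a minimal sketch. It assumes a DDPM-style formulation with a linear variance schedule, where a noised sample is drawn in closed form as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$. The function names, the schedule endpoints (`1e-4` to `0.02`), and the 1000-step horizon are illustrative assumptions, not values taken from this cheat sheet. The reverse (denoising) half requires a trained network and is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoints are assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot instead of adding noise step by step."""
    signal = math.sqrt(alpha_bars[t])        # how much of the original survives
    noise = math.sqrt(1.0 - alpha_bars[t])   # how much Gaussian noise is mixed in
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)                 # seeded for repeatability
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [1.0, -0.5, 2.0]                  # a toy "real" record
x_early = forward_diffuse(x0, 10, alpha_bars, rng)    # still close to x0
x_late = forward_diffuse(x0, 999, alpha_bars, rng)    # nearly pure noise
```

Because `alpha_bars` shrinks toward zero as `t` grows, late samples carry almost no signal from `x0`. A denoising model is trained to undo this corruption one step at a time, which is also why sampling is slower than a single GAN forward pass.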