Imitation Learning (IL) enables agents to learn policies by observing and mimicking expert behavior, making it a practical alternative to reinforcement learning when reward engineering is difficult or when expert demonstrations are abundant. Rather than requiring an explicit reward signal, IL methods extract patterns from expert state-action trajectories to train policies that replicate expert behavior. A key challenge in IL is distributional shift: small errors compound as the learned policy visits states unseen during training, causing it to diverge from expert trajectories. The field addresses this through interactive dataset aggregation (DAgger), adversarial methods (GAIL), and offline techniques that learn from fixed logged datasets without further environment interaction.
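To make distributional shift concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than part of any library: a one-dimensional "lane keeping" toy in which the state is an integer offset from the lane center, a hypothetical `expert_action` that steers one step back toward the center, and a tabular `behavioral_cloning` helper that picks the most common expert action per state. Because the expert never leaves the center during data collection, the cloned policy has no label for any other state.

```python
from collections import Counter, defaultdict

def expert_action(state):
    """Hypothetical expert: steer one step back toward the lane center (0)."""
    return (state < 0) - (state > 0)  # +1 left of center, -1 right of it, 0 at center

def rollout(policy, pushes, start=0):
    """Run a policy; pushes[t] is an external disturbance added at step t."""
    state, visited = start, []
    for push in pushes:
        visited.append(state)
        state = state + policy(state) + push
    return visited

def behavioral_cloning(demos, default_action=0):
    """Fit a tabular policy: the most common expert action for each seen state."""
    votes = defaultdict(Counter)
    for state, action in demos:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    # States absent from the demonstrations fall back to default_action.
    return lambda state: table.get(state, default_action)

# Demonstrations are collected without disturbances, so only state 0 is seen.
demo_states = rollout(expert_action, pushes=[0] * 5)
demos = [(s, expert_action(s)) for s in demo_states]
bc_policy = behavioral_cloning(demos)

# At test time, one push at step 1 moves the agent off the demonstrated states.
pushes = [0, 1, 0, 0, 0, 0]
expert_states = rollout(expert_action, pushes)  # expert recovers after one step
bc_states = rollout(bc_policy, pushes)          # cloned policy never recovers
```

In this toy, the expert returns to the center one step after the push, while the cloned policy stays off-center for the rest of the episode because it was never shown how to recover. With repeated pushes the offset would keep growing, which is the compounding behavior described above.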
What This Cheat Sheet Covers
This topic spans 12 focused tables and 69 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Imitation Learning Paradigms
The foundational families every IL practitioner reaches for first. They run a spectrum from the brutally simple—behavioral cloning, just supervised learning on expert state-action pairs—through interactive correction (DAgger), adversarial matching (GAIL), and reward recovery (IRL and RLHF), all the way to offline learning from fixed logged data. The thread connecting them is how each one fights the compounding-error problem that plain cloning ignores.
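As one illustration of how interactive correction counters compounding errors, the sketch below runs a simplified DAgger-style loop in plain Python. The setup is assumed for illustration only: a one-dimensional toy where the state is an integer offset from a lane center, a hypothetical `expert_action` that steers back toward the center, and a tabular `fit_policy` helper. It also omits the expert/learner mixing schedule of the full algorithm and always rolls out the learned policy after the initial expert data.

```python
from collections import Counter, defaultdict

def expert_action(state):
    """Hypothetical expert: steer one step back toward the lane center (0)."""
    return (state < 0) - (state > 0)

def rollout(policy, pushes, start=0):
    """Run a policy; pushes[t] is an external disturbance added at step t."""
    state, visited = start, []
    for push in pushes:
        visited.append(state)
        state = state + policy(state) + push
    return visited

def fit_policy(dataset, default_action=0):
    """Tabular supervised learning: most common expert label per state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda state: table.get(state, default_action)

def dagger(pushes, n_iters=3):
    """Simplified DAgger: roll out the learner, relabel with the expert, aggregate."""
    # Initial data: undisturbed expert rollouts (equivalent to plain cloning).
    calm = [0] * len(pushes)
    dataset = [(s, expert_action(s)) for s in rollout(expert_action, calm)]
    policy = fit_policy(dataset)
    for _ in range(n_iters):
        states = rollout(policy, pushes)                     # learner's own states
        dataset += [(s, expert_action(s)) for s in states]   # expert corrections
        policy = fit_policy(dataset)                         # retrain on aggregate
    return policy, dataset

pushes = [0, 1, 0, 0, 0, 0]  # a single push at step 1
policy, dataset = dagger(pushes)
final_states = rollout(policy, pushes)
```

In this toy, the first learner rollout drifts to offset 1 and stays there; the expert's labels for those visited states are added to the dataset, and the retrained policy then recovers to the center after the push. The key design choice is that training states come from the learner's own rollouts rather than the expert's.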
| Method | Example | Description |
|---|---|---|
| Behavioral Cloning (BC) | `policy = train_supervised(expert_demos)`<br>`action = policy(state)` | • Supervised learning on expert state-action pairs<br>• Simplest IL approach, but vulnerable to compounding errors from distributional shift |
| DAgger | `for iter in range(N):`<br>&nbsp;&nbsp;`rollouts = execute(policy)`<br>&nbsp;&nbsp;`labels = expert(rollouts)`<br>&nbsp;&nbsp;`policy.update(labels)` | • Iteratively collects data under the learned policy and queries the expert for corrections<br>• Mitigates distributional shift through online aggregation |
| GAIL | `D(s,a) = discriminator(state, action)`<br>`reward = -log(1 - D(s,a))`<br>`policy = RL(reward)` | • GAN-like framework where a discriminator distinguishes expert from learned trajectories<br>• The policy trains via RL to fool the discriminator |
| Inverse Reinforcement Learning (IRL) | `reward_fn = recover_reward(expert_demos)`<br>`policy = RL(reward_fn)` | • Infers the underlying reward function from demonstrations, then uses RL to optimize a policy<br>• Addresses the ambiguity of what the expert is optimizing |