Reinforcement learning (RL) is a machine learning paradigm in which agents learn to make sequential decisions by interacting with an environment and receiving feedback as rewards and penalties. Unlike supervised learning, RL agents discover good behaviors through trial-and-error exploration rather than labeled examples, which makes RL well suited to autonomous decision-making in robotics, game playing, resource optimization, and increasingly to aligning large language models with human preferences. The field balances a fundamental tension: agents must exploit actions already known to be rewarding while also exploring new possibilities to discover better strategies. How an algorithm handles this tradeoff shapes its character and effectiveness.
What This Cheat Sheet Covers
This topic spans 28 focused tables and 199 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core MDP Foundations
| Concept | Example | Description |
|---|---|---|
| Markov Decision Process (MDP) | $(S, A, P, R, \gamma)$ | Formal framework defining states, actions, transition probabilities, rewards, and discount factor; assumes the future depends only on the current state (Markov property), not on the history. |
| State | `position = [3, 4]`; `health = 85` | Current situation of the environment that the agent observes; can be fully observable or partially observable (POMDP). |
| Action | `move_left`; `accelerate(2.5)` | Choice the agent makes in a given state; can be discrete (finite set) or continuous (real-valued vector). |
| Reward | `+100` (goal reached); `-1` (time penalty) | Scalar feedback signal from the environment indicating the immediate desirability of a state-action pair; the agent's objective is to maximize cumulative reward. |
| Policy | $\pi(\text{left} \mid s) = 0.7$ | Mapping from states to actions; can be deterministic, $a = \pi(s)$, or stochastic, $a \sim \pi(\cdot \mid s)$; this is what the agent learns. |
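
The concepts in this table can be tied together with a small sketch. The two-state environment below is hypothetical: the state names, transition probabilities, reward values, and the helpers `sample_action`, `step`, `discounted_return`, and `run_episode` are all assumptions made for illustration, not part of any library. It shows an MDP tuple $(S, A, P, R, \gamma)$, a stochastic policy with $\pi(\text{left} \mid s) = 0.7$, and the discounted cumulative reward the agent tries to maximize.

```python
import random

# Hypothetical toy MDP (S, A, P, R, gamma); every name and number is illustrative.
STATES = ["start", "goal"]
ACTIONS = ["move_left", "move_right"]
GAMMA = 0.9  # discount factor

# P[(state, action)] -> list of (next_state, probability).
# The next state depends only on the current state and action (Markov property).
P = {
    ("start", "move_left"): [("start", 1.0)],
    ("start", "move_right"): [("goal", 0.8), ("start", 0.2)],
}

# Stochastic policy: pi(action | state). "goal" is terminal, so it needs no entry.
POLICY = {"start": {"move_left": 0.7, "move_right": 0.3}}


def reward(next_state):
    """Scalar feedback: +100 for reaching the goal, -1 time penalty otherwise."""
    return 100.0 if next_state == "goal" else -1.0


def sample_action(state, rng):
    """Draw a ~ pi(. | state)."""
    actions, probs = zip(*POLICY[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]


def step(state, action, rng):
    """Sample the next state from P and return (next_state, reward)."""
    next_states, probs = zip(*P[(state, action)])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, reward(next_state)


def discounted_return(rewards, gamma=GAMMA):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def run_episode(rng, max_steps=50):
    """Follow the policy from "start" until the goal or the step limit is reached."""
    state, rewards = "start", []
    for _ in range(max_steps):
        action = sample_action(state, rng)
        state, r = step(state, action, rng)
        rewards.append(r)
        if state == "goal":
            break
    return rewards


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    episode_rewards = run_episode(rng)
    print(len(episode_rewards), discounted_return(episode_rewards))
```

A deterministic policy would replace `sample_action` with a plain lookup such as `{"start": "move_right"}[state]`; the rest of the loop stays the same.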