Reinforcement Learning Cheat Sheet

Updated 2026-04-28

Reinforcement learning (RL) is a machine learning paradigm in which agents learn to make sequential decisions by interacting with an environment and receiving feedback as rewards and penalties. Unlike supervised learning, RL agents discover good behaviors through trial-and-error exploration rather than labeled examples, which suits RL to autonomous decision-making in robotics, game playing, resource optimization, and increasingly to aligning large language models with human preferences. The field balances a fundamental tension: agents must exploit actions already known to be rewarding while also exploring new ones to discover better strategies. How an algorithm handles this tradeoff largely shapes its character and effectiveness.

What This Cheat Sheet Covers

This cheat sheet spans 28 focused tables and 199 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.

Table 1: Core MDP Foundations
Table 2: Value Functions
Table 3: Exploration vs Exploitation Strategies
Table 4: Multi-Armed Bandit Algorithms
Table 5: Dynamic Programming Methods
Table 6: Monte Carlo Methods
Table 7: Temporal Difference Learning
Table 8: Deep Q-Networks (DQN) and Variants
Table 9: Policy Gradient Methods
Table 10: Actor-Critic Algorithms for Continuous Control
Table 11: Model-Based Reinforcement Learning
Table 12: Monte Carlo Tree Search (MCTS)
Table 13: Offline Reinforcement Learning
Table 14: Hierarchical Reinforcement Learning
Table 15: Multi-Agent Reinforcement Learning (MARL)
Table 16: Safe Reinforcement Learning
Table 17: Meta-Reinforcement Learning
Table 18: RL for LLMs — RLHF and Alignment
Table 19: Reward Shaping and Curriculum Learning
Table 20: Imitation Learning
Table 21: Function Approximation
Table 22: Partial Observability and POMDPs
Table 23: Reward Functions and Specification
Table 24: Convergence and Stability
Table 25: Policy Optimization Techniques
Table 26: Distributed and Scalable RL
Table 27: Evaluation Metrics and Benchmarks
Table 28: Common Pitfalls and Practical Tips

Table 1: Core MDP Foundations

Nearly every RL problem is formalized as a Markov Decision Process (MDP): the vocabulary of states, actions, rewards, policies, and the discount factor that turns "act well over time" into something math can optimize. Get comfortable with these symbols first; most algorithms later in this sheet are different ways of estimating or improving the pieces defined here.
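
As a hedged sketch of what "something math can optimize" usually means, the standard textbook objective is the expected discounted return. The symbols $G_t$, $r_{t+k+1}$, and $J(\pi)$ are notation assumed here for illustration; they are not defined elsewhere in this section.

$$G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad J(\pi) = \mathbb{E}_{\pi}\left[ G_0 \right], \qquad \gamma \in [0, 1]$$

For example, with $\gamma = 0.9$ and a three-step episode with rewards $-1, -1, +100$, the return from the first step is $G_0 = -1 + 0.9 \cdot (-1) + 0.9^2 \cdot 100 = 79.1$. For tasks that never terminate, $\gamma < 1$ keeps the sum finite when rewards are bounded.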

Concept	Example	Description
Markov Decision Process (MDP)	$(S, A, P, R, \gamma)$	• Formal framework defining states $S$, actions $A$, transition probabilities $P$, rewards $R$, and discount factor $\gamma$ • assumes the future depends only on the current state and action (Markov property), not on earlier history.
State $s$	`position = [3, 4]` `health = 85`	• Current situation of the environment the agent observes • can be fully observable or partially observable (POMDP).
Action $a$	`move_left` `accelerate(2.5)`	• Choice the agent makes in a given state • can be discrete (finite set) or continuous (real-valued vector).
Reward $r$	`+100` (goal reached) `-1` (time penalty)	• Scalar feedback signal from the environment indicating the immediate desirability of a state-action pair • the agent's objective is to maximize expected cumulative (typically discounted) reward.
Policy $\pi(a \mid s)$	$\pi(\text{left} \mid s) = 0.7$	• Mapping from states to actions or to distributions over actions • can be deterministic $a = \pi(s)$ or stochastic $a \sim \pi(\cdot \mid s)$ • what the agent learns.
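
The rows above can be tied together in a small Python sketch. Everything in it is an illustrative assumption rather than part of this sheet: the 3-cell corridor, the `slip` probability, the names `MDP`, `build_corridor`, `POLICY`, `sample_action`, `step`, `rollout`, and `discounted_return`, and the reward values (borrowed from the table's `+100` and `-1` examples). It stores the tuple $(S, A, P, R, \gamma)$ as plain dicts, samples $a \sim \pi(\cdot \mid s)$, and accumulates the discounted reward of one episode.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    """The tuple (S, A, P, R, gamma), stored as plain dicts (illustrative)."""
    states: tuple
    actions: tuple
    transitions: dict  # P: (s, a) -> list of (probability, next_state)
    rewards: dict      # R: (s, a, next_state) -> scalar reward
    gamma: float       # discount factor in [0, 1]
    terminal: frozenset = frozenset()


def build_corridor(slip=0.2, gamma=0.9):
    """Hypothetical corridor with cells 0, 1, 2; cell 2 is the goal."""
    states, actions, goal = (0, 1, 2), ("left", "right"), 2
    transitions, rewards = {}, {}
    for s in (0, 1):  # the goal is terminal, so it has no outgoing moves
        for a in actions:
            target = max(s - 1, 0) if a == "left" else s + 1
            if target == s:
                outcomes = [(1.0, s)]  # bumping into the left wall
            else:
                outcomes = [(1.0 - slip, target), (slip, s)]  # may slip and stay
            transitions[(s, a)] = outcomes
            for _, s_next in outcomes:
                # +100 for reaching the goal, -1 time penalty otherwise
                rewards[(s, a, s_next)] = 100.0 if s_next == goal else -1.0
    return MDP(states, actions, transitions, rewards, gamma, frozenset({goal}))


# Stochastic policy pi(a | s): one action distribution per non-terminal state.
POLICY = {
    0: {"left": 0.3, "right": 0.7},
    1: {"left": 0.3, "right": 0.7},
}


def sample_action(policy, state, rng):
    """Draw a ~ pi(. | s)."""
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]


def step(mdp, state, action, rng):
    """Sample s' from P(. | s, a) and look up the reward R(s, a, s')."""
    probs, next_states = zip(*mdp.transitions[(state, action)])
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, mdp.rewards[(state, action, next_state)]


def rollout(mdp, policy, start, rng, max_steps=100):
    """Follow the policy until a terminal state or the step limit."""
    state, rewards = start, []
    while state not in mdp.terminal and len(rewards) < max_steps:
        action = sample_action(policy, state, rng)
        state, reward = step(mdp, state, action, rng)
        rewards.append(reward)
    return rewards, state


def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k, accumulated backwards from the last reward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    mdp = build_corridor()
    rewards, final_state = rollout(mdp, POLICY, start=0, rng=random.Random(0))
    print(final_state, len(rewards), discounted_return(rewards, mdp.gamma))
```

Dict-based tables like these only suit tiny, fully enumerable problems; they are used here because they mirror the tuple notation directly. Larger problems typically expose only a sampling interface for transitions and rewards instead of explicit $P$ and $R$.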
