GRPO (Group Relative Policy Optimization) Cheat Sheet

Updated 2026-05-21

GRPO is a reinforcement learning algorithm for post-training large language models, introduced in the DeepSeekMath paper (2024) and prominently used to train the DeepSeek-R1 reasoning model. It replaces PPO's learned critic network with a group-based Monte Carlo baseline: the model generates multiple completions per prompt, scores them, and normalizes those scores relative to the group to compute advantages. The key insight is that a critic model is unnecessary when the baseline can be estimated directly from a group of parallel rollouts. Dropping the critic removes its weights, gradients, and optimizer state, which cuts training memory substantially (often cited as roughly half, since the critic is typically about as large as the policy) while maintaining stable policy gradients. Understanding GRPO requires keeping one mental model front of mind: every advantage is relative, not absolute. The algorithm never asks "is this response good?" but only "is this response better than the average for this prompt?"

What This Cheat Sheet Covers

This topic spans 14 focused tables and 91 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.

Table 1: Core Concepts and Terminology
Table 2: The GRPO Algorithm Step by Step
Table 3: GRPO Loss Function and Math
Table 4: Comparison with PPO, DPO, and REINFORCE++
Table 5: Group Sampling Configuration
Table 6: KL Divergence and Reference Policy
Table 7: Clip-Higher and Entropy Management
Table 8: Reward Function Design
Table 9: Overlong Reward Shaping and Truncation Handling
Table 10: Common Pitfalls and Failure Modes
Table 11: GRPO Variants and Improvements
Table 12: Implementations and Frameworks
Table 13: DeepSeek-R1 Training Pipeline
Table 14: Key Hyperparameters

Table 1: Core Concepts and Terminology

GRPO's vocabulary differs subtly from standard RL: "group" replaces "episode batch," "advantage whitening" replaces critic subtraction, and "verifiable reward" replaces reward-model scoring. Getting these definitions precise before reading the math prevents most common misunderstandings.

Concept	Example	Description
Group sampling	`G = 8` responses generated per prompt	• For each prompt $q$, the policy $\pi_{\theta_\text{old}}$ generates $G$ independent completions $\{o_1, \ldots, o_G\}$ • These form the "group" used for advantage estimation
Group-relative advantage	$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G}$	• Each completion's reward is z-score normalized within its group: subtract the group mean $\mu_G$ and divide by group std $\sigma_G$ • no critic or value function needed
Verifiable reward	Correct boxed math answer → `r=1`; wrong → `r=0`	A rule-based, deterministic reward function (e.g., regex match, compiler pass/fail) that eliminates reward-model training and reduces reward hacking compared to neural reward models.
Reference policy	`π_ref` = frozen pre-trained or SFT model	• A frozen checkpoint used exclusively for KL regularization • prevents the training policy from drifting too far from its starting distribution
KL penalty	$\beta D_\text{KL}(\pi_\theta \Vert \pi_\text{ref})$	• Added directly to the loss (not subtracted from the reward as in PPO) • coefficient $\beta$ controls how tightly the policy stays near the reference
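The definitions in Table 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`boxed_answer_reward`, `group_relative_advantages`, `kl_penalty_k3`) are hypothetical, the small `eps` added to the standard deviation is an assumed guard for groups in which every reward is identical, and the KL term uses the low-variance estimator $\frac{\pi_\text{ref}}{\pi_\theta} - \log\frac{\pi_\text{ref}}{\pi_\theta} - 1$ as one possible choice. Population standard deviation is used here; some implementations use the sample standard deviation instead.

```python
import math
import re
from statistics import mean, pstdev


def boxed_answer_reward(completion: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the last \\boxed{...} equals gold, else 0.0."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == gold.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Z-score each reward within its group: (r_i - mu_G) / (sigma_G + eps).

    eps is an assumption: it avoids division by zero when all rewards match,
    in which case every advantage is 0 and the group gives no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty_k3(logp_policy, logp_ref):
    """Per-token KL estimate exp(d) - d - 1 with d = log pi_ref - log pi_theta.

    The estimate is always >= 0 and equals 0 when both log-probs agree.
    """
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


if __name__ == "__main__":
    # One prompt, a group of G = 4 completions (toy strings, not model output).
    group = [
        "so the sum is \\boxed{4}",
        "I get \\boxed{5}",
        "no final answer given",
        "therefore \\boxed{ 4 }",
    ]
    rewards = [boxed_answer_reward(o, "4") for o in group]
    advantages = group_relative_advantages(rewards)
    print(rewards)     # [1.0, 0.0, 0.0, 1.0]
    print(advantages)  # approximately [+1, -1, -1, +1]
```

Note that the advantages depend only on how each completion compares with its own group: scaling or shifting every reward in the group by the same amount leaves them (almost) unchanged, which is the "relative, not absolute" property described in the introduction.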
