πŸ–¨οΈ Printing Instructions: Press Ctrl/Cmd + P and select "Save as PDF".
1

PPO, GRPO & DAPO

Policy Optimization from Trust Regions to Frontier Reasoning RL

2

Learning Objectives

3

From Last Time

4

Part 1: The Stability Crisis

5

Why Large Updates Kill RL

6

The Trust Region Idea

7

Part 2: The PPO Clipped Objective

8

From Policy Gradient to Surrogate Objective

9

The Probability Ratio

10

The Clipped Surrogate Objective

11

Interactive Demo: PPO Clipping

12

The Four Cases of PPO Clipping

| Advantage $\hat{A}_t$ | Ratio $r_t$ direction | What happens | Gradient |
|---|---|---|---|
| $\hat{A}_t > 0$ (good action) | $r_t \uparrow$ (more likely) | Encouraged, but clips at $1{+}\epsilon$ | Zero beyond $1{+}\epsilon$ |
| $\hat{A}_t > 0$ (good action) | $r_t \downarrow$ (less likely) | Wrong direction; gradient corrects without limit | Full (unclipped) |
| $\hat{A}_t < 0$ (bad action) | $r_t \downarrow$ (less likely) | Encouraged, but clips at $1{-}\epsilon$ | Zero beyond $1{-}\epsilon$ |
| $\hat{A}_t < 0$ (bad action) | $r_t \uparrow$ (more likely) | Wrong direction; gradient corrects without limit | Full (unclipped) |

The min selects the pessimistic bound: beneficial overshooting is prevented, while wrong-direction moves are always corrected.
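The four cases can be verified numerically. Below is a minimal NumPy sketch (the function name `clipped_surrogate` is illustrative, not from any particular library) that evaluates the objective at ratios inside and outside the clip range:

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# Case 1: good action (A > 0), ratio above 1+eps -> value plateaus at 1.2 * A
print(clipped_surrogate(1.5, adv=2.0))   # 2.4 (clipped: zero gradient in ratio)
# Case 2: good action but ratio dropped -> unclipped term is the min
print(clipped_surrogate(0.5, adv=2.0))   # 1.0 (full gradient corrects the move)
# Case 3: bad action (A < 0), ratio below 1-eps -> pessimistic clipped term wins
print(clipped_surrogate(0.5, adv=-2.0))  # -1.6 (clipped at 1-eps)
# Case 4: bad action but ratio rose -> unclipped, gradient corrects freely
print(clipped_surrogate(1.5, adv=-2.0))  # -3.0 (full gradient)
```

Note that in cases 3 and 4 the `min` picks the more negative value, which is exactly the pessimistic-bound behavior the table describes.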
13

Important Subtlety: PPO Is Only an Approximate Trust Region

14

Part 3: The Full PPO Algorithm

15

The Complete PPO Loss

16

PPO Training Loop

Initialize policy Ο€_ΞΈ, value function V_Ο•
Set ΞΈ_old ← ΞΈ

for iteration = 1, 2, ... do

  β”Œβ”€ STEP 1: ROLLOUT ─────────────────────────────────────┐
  β”‚ Freeze current policy as Ο€_old                        β”‚
  β”‚ Collect trajectories using Ο€_old                      β”‚
  β”‚ Store: s_t, a_t, r_t, log Ο€_old(a_t|s_t), V_Ο•(s_t)  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 2: COMPUTE ADVANTAGES (GAE) ───────────────────┐
  β”‚ for t = T-1 down to 0:                               β”‚
  β”‚   Ξ΄_t = r_t + Ξ³Β·V_Ο•(s_{t+1}) - V_Ο•(s_t)            β”‚
  β”‚   Γ‚_t = Ξ΄_t + (Ξ³Ξ»)Β·Γ‚_{t+1}                          β”‚
  β”‚ G_t = Γ‚_t + V_Ο•(s_t)            // return targets    β”‚
  β”‚ Normalize: Γ‚ ← (Γ‚ - ΞΌ) / (Οƒ + Ξ΅)                    β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 3: OPTIMIZE (K epochs over minibatches) ───────┐
  β”‚ for epoch = 1 to K:                                  β”‚
  β”‚   Shuffle buffer into minibatches                    β”‚
  β”‚   for each minibatch:                                β”‚
  β”‚     r_t(ΞΈ) = exp(log Ο€_ΞΈ(a_t|s_t) - log Ο€_old(...)) β”‚
  β”‚     L_clip = min(r_tΒ·Γ‚_t, clip(r_t,1-Ξ΅,1+Ξ΅)Β·Γ‚_t)   β”‚
  β”‚     L_value = (V_Ο•(s_t) - G_t)Β²                     β”‚
  β”‚     L_entropy = H[Ο€_ΞΈ(Β·|s_t)]                       β”‚
  β”‚     loss = -mean(L_clip) + c_vΒ·mean(L_value)         β”‚
  β”‚            - c_entΒ·mean(L_entropy)                   β”‚
  β”‚     gradient_step(ΞΈ, Ο•)                              β”‚
  β”‚                                                      β”‚
  β”‚   approx_kl = mean(log Ο€_old - log Ο€_ΞΈ)             β”‚
  β”‚   if approx_kl > kl_target: early-stop epochs       β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  ΞΈ_old ← ΞΈ, collect fresh trajectories
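Step 2 of the loop above (the backward GAE recursion plus return targets and normalization) can be sketched as a standalone function. This is a minimal version assuming `values` carries one bootstrap entry beyond the last reward:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_t = delta_t + gamma*lam*A_{t+1}. `values` has length T+1."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values[:-1]                      # G_t: value-regression targets
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)    # normalize, as in step 2
    return adv, returns

adv, ret = compute_gae(rewards=np.array([1.0, 0.0, 1.0]),
                       values=np.array([0.5, 0.4, 0.6, 0.0]))
```

The return targets are formed *before* normalization, since `G_t = Γ‚_t + V(s_t)` only holds for the raw advantages.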
17

Implementation Details That Matter

18

Part 4: PPO for Language Models (RLHF)

19

The LLM RL Setup: Four Models in Memory

20

Interactive Demo: PPO Training Loop

21

Part 5: GRPO β€” Eliminating the Value Model

22

The Problem with PPO's Critic

23

Group Relative Advantages

24

The GRPO Objective

25

Interactive Demo: GRPO

26

GRPO Training Loop

Initialize policy Ο€_ΞΈ, reference policy Ο€_ref ← Ο€_ΞΈ
Set ΞΈ_old ← ΞΈ
NO VALUE MODEL NEEDED

for iteration = 1, 2, ... do

  β”Œβ”€ STEP 1: SAMPLE GROUPS ──────────────────────────────┐
  β”‚ Sample prompts {q_1, ..., q_B}                       β”‚
  β”‚ For each prompt q_j:                                 β”‚
  β”‚   Generate G responses: {o_1,...,o_G} ~ Ο€_ΞΈ_old(Β·|q) β”‚
  β”‚   Score each: {r_1,...,r_G} via reward/verifier      β”‚
  β”‚   Store: responses, rewards, log Ο€_ΞΈ_old(o_{i,t}|.) β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 2: COMPUTE GROUP ADVANTAGES ──────────────────┐
  β”‚ For each prompt q_j:                                 β”‚
  β”‚   ΞΌ = mean(r_1,...,r_G),  Οƒ = std(r_1,...,r_G)       β”‚
  β”‚   Γ‚_i = (r_i - ΞΌ) / Οƒ   for each response           β”‚
  β”‚   (One scalar per response β€” all tokens share it)    β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 3: PPO-STYLE UPDATE (K epochs) ───────────────┐
  β”‚ for epoch = 1 to K:                                  β”‚
  β”‚   for each minibatch:                                β”‚
  β”‚     r_{i,t} = Ο€_ΞΈ(o_{i,t}|q,o_{i,<t})               β”‚
  β”‚             / Ο€_ΞΈ_old(o_{i,t}|q,o_{i,<t})            β”‚
  β”‚     L = min(rΒ·Γ‚, clip(r,1-Ξ΅,1+Ξ΅)Β·Γ‚)                 β”‚
  β”‚     KL = approx_kl(Ο€_ΞΈ, Ο€_ref)                      β”‚
  β”‚     loss = -mean(L / |o_i|) + Ξ²Β·mean(KL)             β”‚
  β”‚     gradient_step(ΞΈ)                                 β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  ΞΈ_old ← ΞΈ
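Step 2's group-relative advantage is just per-prompt reward standardization. A minimal sketch (the `1e-8` guard against zero-variance groups is an added assumption, not part of the GRPO formula above):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO advantage: standardize each reward within its group.
    `rewards` has shape (num_prompts, G); every token of response i shares A_i."""
    mu = rewards.mean(axis=1, keepdims=True)
    sigma = rewards.std(axis=1, keepdims=True)
    return (rewards - mu) / (sigma + eps)

# Two prompts, G = 4 responses each, binary verifier rewards.
r = np.array([[1.0, 0.0, 0.0, 1.0],    # mixed group: informative advantages
              [0.0, 0.0, 0.0, 0.0]])   # all-wrong group: every advantage is 0
A = group_advantages(r)
print(A[1])   # zero gradient from this group: wasted compute
```

The all-zero second row previews the "dead group" problem in the next section: when every response in a group gets the same reward, the group contributes no learning signal.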
27

Part 6: DAPO β€” Making GRPO Work at Scale

28

GRPO's Scaling Problems

29

DAPO's Four Fixes

30

DAPO Results

31

PPO vs. GRPO vs. DAPO

| | PPO | GRPO | DAPO |
|---|---|---|---|
| Value model | Yes (full LLM critic) | No | No |
| Models in memory | 4 (actor, critic, ref, reward) | 3 (actor, ref, reward/verifier) | 3 (actor, ref, reward/verifier) |
| Advantage | Per-token via GAE | Per-response: $(r_i - \mu)/\sigma$ | Same as GRPO, plus fixes |
| KL handling | Per-token reward shaping | Loss penalty term | Dropped for RLVR tasks |
| Clipping | Symmetric $\epsilon$ | Symmetric $\epsilon$ | Asymmetric: $\epsilon_{\text{low}} < \epsilon_{\text{high}}$ |
| Dead groups | N/A (single response) | Zero gradient, wasted compute | Dynamic sampling filters them |
| Length bias | N/A | Yes ($1/\lvert o_i \rvert$ normalization) | Fixed with token-level loss |
| Samples per prompt | 1 response | $G$ responses (8–64) | $G$ responses + oversampling |
| Best for | General alignment (RLHF) | Reasoning RL (RLVR) | Reasoning RL at scale |
| Used by | OpenAI, Anthropic, Google | DeepSeek-R1, Qwen | Open-source SOTA, frontier labs |

All three share the same PPO clipping mechanism; they differ in advantage computation, memory cost, and engineering robustness.
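One row of the table, DAPO's asymmetric "clip-higher" range, amounts to a one-line change to the clip bounds. A sketch, using the defaults reported in the DAPO paper ($\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.28$):

```python
import numpy as np

def dapo_clipped_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Asymmetric clipping: the upper bound 1+eps_high is looser than the
    lower bound 1-eps_low, so low-probability tokens can be boosted further
    before their gradient is zeroed out."""
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)

# A positive-advantage token at ratio 1.25 would be clipped by symmetric
# PPO/GRPO (eps = 0.2) but still receives gradient under clip-higher.
print(dapo_clipped_surrogate(1.25, adv=1.0))   # 1.25 (unclipped)
```

Loosening only the upper bound targets exploration: it lets rare tokens gain probability mass faster without letting already-likely tokens collapse the distribution.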
32

Part 7: The Unified View

33

All These Algorithms Are Variations of One Idea

34

RLVR: The Paradigm Shift

35

Summary

36

All Interactive Demos

37

Lecture Summary

38

Supplementary Resources