πŸ–¨οΈ Printing Instructions: Press Ctrl/Cmd + P and select "Save as PDF".
1

PPO, GRPO & DAPO

Policy Optimization from Trust Regions to Frontier Reasoning RL

2

Learning Objectives

3

From Last Time

4

Part 1: The Stability Crisis

5

Why Large Updates Kill RL

6

The Trust Region Idea

7

Part 2: The PPO Clipped Objective

8

From Policy Gradient to Surrogate Objective

9

The Probability Ratio

10

The Clipped Surrogate Objective

11

Interactive Demo: PPO Clipping

12

The Four Cases of PPO Clipping

| Advantage $\hat{A}_t$ | Ratio $r_t$ direction | What happens | Gradient |
|---|---|---|---|
| $\hat{A}_t > 0$ (good action) | $r_t \uparrow$ (more likely) | Encouraged, but clips at $1{+}\epsilon$ | Zero beyond $1{+}\epsilon$ |
| $\hat{A}_t > 0$ (good action) | $r_t \downarrow$ (less likely) | Wrong direction; gradient corrects without limit | Full (unclipped) |
| $\hat{A}_t < 0$ (bad action) | $r_t \downarrow$ (less likely) | Encouraged, but clips at $1{-}\epsilon$ | Zero beyond $1{-}\epsilon$ |
| $\hat{A}_t < 0$ (bad action) | $r_t \uparrow$ (more likely) | Wrong direction; gradient corrects without limit | Full (unclipped) |

The min selects the pessimistic bound: beneficial overshooting is prevented, while wrong-direction moves are always corrected.
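The four cases can be verified numerically. Below is a minimal NumPy sketch (the function name `clipped_surrogate` is illustrative, not from any particular library) that evaluates the objective at ratios inside and outside the clip range:

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

# Case 1: good action (A > 0), ratio above 1+eps -> value plateaus at 1.2 * A
print(clipped_surrogate(1.5, adv=2.0))   # 2.4 (clipped: zero gradient in ratio)
# Case 2: good action but ratio dropped -> unclipped term is the min
print(clipped_surrogate(0.5, adv=2.0))   # 1.0 (full gradient corrects the move)
# Case 3: bad action (A < 0), ratio below 1-eps -> pessimistic clipped term wins
print(clipped_surrogate(0.5, adv=-2.0))  # -1.6 (clipped at 1-eps)
# Case 4: bad action but ratio rose -> unclipped, gradient corrects freely
print(clipped_surrogate(1.5, adv=-2.0))  # -3.0 (full gradient)
```

Note that in cases 3 and 4 the `min` picks the more negative value, which is exactly the pessimistic-bound behavior the table describes.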
13

Important Subtlety: PPO Is Only an Approximate Trust Region

14

Part 3: The Full PPO Algorithm

15

The Complete PPO Loss

16

PPO Training Loop

Initialize policy Ο€_ΞΈ, value function V_Ο•
Set ΞΈ_old ← ΞΈ

for iteration = 1, 2, ... do

  β”Œβ”€ STEP 1: ROLLOUT ─────────────────────────────────────┐
  β”‚ Freeze current policy as Ο€_old                        β”‚
  β”‚ Collect trajectories using Ο€_old                      β”‚
  β”‚ Store: s_t, a_t, r_t, log Ο€_old(a_t|s_t), V_Ο•(s_t)  β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 2: COMPUTE ADVANTAGES (GAE) ───────────────────┐
  β”‚ for t = T-1 down to 0:                               β”‚
  β”‚   Ξ΄_t = r_t + Ξ³Β·V_Ο•(s_{t+1}) - V_Ο•(s_t)            β”‚
  β”‚   Γ‚_t = Ξ΄_t + (Ξ³Ξ»)Β·Γ‚_{t+1}                          β”‚
  β”‚ G_t = Γ‚_t + V_Ο•(s_t)            // return targets    β”‚
  β”‚ Normalize: Γ‚ ← (Γ‚ - ΞΌ) / (Οƒ + Ξ΅)                    β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 3: OPTIMIZE (K epochs over minibatches) ───────┐
  β”‚ for epoch = 1 to K:                                  β”‚
  β”‚   Shuffle buffer into minibatches                    β”‚
  β”‚   for each minibatch:                                β”‚
  β”‚     r_t(ΞΈ) = exp(log Ο€_ΞΈ(a_t|s_t) - log Ο€_old(...)) β”‚
  β”‚     L_clip = min(r_tΒ·Γ‚_t, clip(r_t,1-Ξ΅,1+Ξ΅)Β·Γ‚_t)   β”‚
  β”‚     L_value = (V_Ο•(s_t) - G_t)Β²                     β”‚
  β”‚     L_entropy = H[Ο€_ΞΈ(Β·|s_t)]                       β”‚
  β”‚     loss = -mean(L_clip) + c_vΒ·mean(L_value)         β”‚
  β”‚            - c_entΒ·mean(L_entropy)                   β”‚
  β”‚     gradient_step(ΞΈ, Ο•)                              β”‚
  β”‚                                                      β”‚
  β”‚   approx_kl = mean(log Ο€_old - log Ο€_ΞΈ)             β”‚
  β”‚   if approx_kl > kl_target: early-stop epochs       β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  ΞΈ_old ← ΞΈ, collect fresh trajectories
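Step 2 of the loop above (the backward GAE recursion plus return targets and normalization) can be sketched as a standalone function. This is a minimal version assuming `values` carries one bootstrap entry beyond the last reward:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_t = delta_t + gamma*lam*A_{t+1}. `values` has length T+1."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    returns = adv + values[:-1]                      # G_t: value-regression targets
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)    # normalize, as in step 2
    return adv, returns

adv, ret = compute_gae(rewards=np.array([1.0, 0.0, 1.0]),
                       values=np.array([0.5, 0.4, 0.6, 0.0]))
```

The return targets are formed *before* normalization, since `G_t = Γ‚_t + V(s_t)` only holds for the raw advantages.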
17

Implementation Details That Matter

18

Part 4: PPO for Language Models (RLHF)

19

The LLM RL Setup: Four Models in Memory

20

Interactive Demo: PPO Training Loop

21

Part 5: GRPO β€” Eliminating the Value Model

22

The Problem with PPO's Critic

23

Group Relative Advantages

24

The GRPO Objective

25

Interactive Demo: GRPO

26

GRPO Training Loop

Initialize policy Ο€_ΞΈ, reference policy Ο€_ref ← Ο€_ΞΈ
Set ΞΈ_old ← ΞΈ
NO VALUE MODEL NEEDED

for iteration = 1, 2, ... do

  β”Œβ”€ STEP 1: SAMPLE GROUPS ──────────────────────────────┐
  β”‚ Sample prompts {q_1, ..., q_B}                       β”‚
  β”‚ For each prompt q_j:                                 β”‚
  β”‚   Generate G responses: {o_1,...,o_G} ~ Ο€_ΞΈ_old(Β·|q) β”‚
  β”‚   Score each: {r_1,...,r_G} via reward/verifier      β”‚
  β”‚   Store: responses, rewards, log Ο€_ΞΈ_old(o_{i,t}|.) β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 2: COMPUTE GROUP ADVANTAGES ──────────────────┐
  β”‚ For each prompt q_j:                                 β”‚
  β”‚   ΞΌ = mean(r_1,...,r_G),  Οƒ = std(r_1,...,r_G)       β”‚
  β”‚   Γ‚_i = (r_i - ΞΌ) / Οƒ   for each response           β”‚
  β”‚   (One scalar per response β€” all tokens share it)    β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  β”Œβ”€ STEP 3: PPO-STYLE UPDATE (K epochs) ───────────────┐
  β”‚ for epoch = 1 to K:                                  β”‚
  β”‚   for each minibatch:                                β”‚
  β”‚     r_{i,t} = Ο€_ΞΈ(o_{i,t}|q,o_{i,<t})               β”‚
  β”‚             / Ο€_ΞΈ_old(o_{i,t}|q,o_{i,<t})            β”‚
  β”‚     L = min(rΒ·Γ‚, clip(r,1-Ξ΅,1+Ξ΅)Β·Γ‚)                 β”‚
  β”‚     KL = approx_kl(Ο€_ΞΈ, Ο€_ref)                      β”‚
  β”‚     loss = -mean(L / |o_i|) + Ξ²Β·mean(KL)             β”‚
  β”‚     gradient_step(ΞΈ)                                 β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  ΞΈ_old ← ΞΈ
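Step 2's group-relative advantage is just per-prompt reward standardization. A minimal sketch (the `1e-8` guard against zero-variance groups is an added assumption, not part of the GRPO formula above):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO advantage: standardize each reward within its group.
    `rewards` has shape (num_prompts, G); every token of response i shares A_i."""
    mu = rewards.mean(axis=1, keepdims=True)
    sigma = rewards.std(axis=1, keepdims=True)
    return (rewards - mu) / (sigma + eps)

# Two prompts, G = 4 responses each, binary verifier rewards.
r = np.array([[1.0, 0.0, 0.0, 1.0],    # mixed group: informative advantages
              [0.0, 0.0, 0.0, 0.0]])   # all-wrong group: every advantage is 0
A = group_advantages(r)
print(A[1])   # zero gradient from this group: wasted compute
```

The all-zero second row previews the "dead group" problem in the next section: when every response in a group gets the same reward, the group contributes no learning signal.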
27

Part 6: DAPO β€” Making GRPO Work at Scale

28

GRPO's Scaling Problems

29

DAPO's Four Fixes

30

DAPO Results

31

PPO vs. GRPO vs. DAPO

| | PPO | GRPO | DAPO |
|---|---|---|---|
| Value model | Yes (full LLM critic) | No | No |
| Models in memory | 4 (actor, critic, ref, reward) | 3 (actor, ref, reward/verifier) | 3 (actor, ref, reward/verifier) |
| Advantage | Per-token via GAE | Per-response: $(r_i - \mu)/\sigma$ | Same as GRPO, plus fixes |
| KL handling | Per-token reward shaping | Loss penalty term | Dropped for RLVR tasks |
| Clipping | Symmetric $\epsilon$ | Symmetric $\epsilon$ | Asymmetric: $\epsilon_{\text{low}} < \epsilon_{\text{high}}$ |
| Dead groups | N/A (single response) | Zero gradient, wasted compute | Dynamic sampling filters them |
| Length bias | N/A | Yes ($1/\lvert o_i \rvert$ normalization) | Fixed with token-level loss |
| Samples per prompt | 1 response | $G$ responses (8–64) | $G$ responses + oversampling |
| Best for | General alignment (RLHF) | Reasoning RL (RLVR) | Reasoning RL at scale |
| Used by | OpenAI, Anthropic, Google | DeepSeek-R1, Qwen | Open-source SOTA, frontier labs |

All three share the same PPO clipping mechanism; they differ in advantage computation, memory cost, and engineering robustness.
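One row of the table, DAPO's asymmetric "clip-higher" range, amounts to a one-line change to the clip bounds. A sketch, using the defaults reported in the DAPO paper ($\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.28$):

```python
import numpy as np

def dapo_clipped_surrogate(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Asymmetric clipping: the upper bound 1+eps_high is looser than the
    lower bound 1-eps_low, so low-probability tokens can be boosted further
    before their gradient is zeroed out."""
    clipped = np.clip(ratio, 1 - eps_low, 1 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)

# A positive-advantage token at ratio 1.25 would be clipped by symmetric
# PPO/GRPO (eps = 0.2) but still receives gradient under clip-higher.
print(dapo_clipped_surrogate(1.25, adv=1.0))   # 1.25 (unclipped)
```

Loosening only the upper bound targets exploration: it lets rare tokens gain probability mass faster without letting already-likely tokens collapse the distribution.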
32

Part 7: The Unified View

33

All These Algorithms Are Variations of One Idea

34

RLVR: The Paradigm Shift

35

Summary

36

All Interactive Demos

37

Lecture Summary

38

Supplementary Resources