| Advantage $\hat{A}_t$ | Ratio $r_t$ Direction | What Happens | Gradient |
|---|---|---|---|
| $\hat{A}_t > 0$ (good action) | $r_t \uparrow$ (more likely) | Encouraged, but clips at $1{+}\epsilon$ | Zero beyond $1{+}\epsilon$ |
| $\hat{A}_t > 0$ (good action) | $r_t \downarrow$ (less likely) | Wrong direction; gradient corrects without limit | Full (unclipped) |
| $\hat{A}_t < 0$ (bad action) | $r_t \downarrow$ (less likely) | Discouraged as desired, but clips at $1{-}\epsilon$ | Zero beyond $1{-}\epsilon$ |
| $\hat{A}_t < 0$ (bad action) | $r_t \uparrow$ (more likely) | Wrong direction; gradient corrects without limit | Full (unclipped) |
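The four cases in the table can be checked numerically. Here is a minimal, self-contained sketch (plain Python with finite differences; the function names are illustrative, not from any library) showing that the clipped surrogate has zero slope past the clip boundary in the "encouraged" direction but full slope in the "wrong" direction:

```python
def clip(x, lo, hi):
    """Clamp x into [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_surrogate(ratio, adv, eps=0.2):
    """PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    return min(ratio * adv, clip(ratio, 1 - eps, 1 + eps) * adv)

def grad_wrt_ratio(ratio, adv, eps=0.2, h=1e-6):
    """Finite-difference d(surrogate)/d(ratio), to expose the clip regions."""
    return (ppo_surrogate(ratio + h, adv, eps)
            - ppo_surrogate(ratio - h, adv, eps)) / (2 * h)

# Good action (A > 0):
print(grad_wrt_ratio(1.5, 1.0))    # beyond 1+eps: gradient is 0 (clipped)
print(grad_wrt_ratio(0.5, 1.0))    # wrong direction: full gradient (= A)

# Bad action (A < 0):
print(grad_wrt_ratio(0.5, -1.0))   # beyond 1-eps: gradient is 0 (clipped)
print(grad_wrt_ratio(1.5, -1.0))   # wrong direction: full gradient (= A)
```

The asymmetry is the point: clipping only removes the incentive to push *further* in the already-rewarded direction; it never blocks the gradient that pulls a drifted ratio back.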
Initialize policy π_θ, value function V_φ
Set θ_old ← θ
for iteration = 1, 2, ... do
┌─ STEP 1: ROLLOUT ──────────────────────────────────────
│ Freeze current policy as π_old
│ Collect trajectories using π_old
│ Store: s_t, a_t, r_t, log π_old(a_t|s_t), V_φ(s_t)
└────────────────────────────────────────────────────────
┌─ STEP 2: COMPUTE ADVANTAGES (GAE) ─────────────────────
│ for t = T-1 down to 0:
│   δ_t = r_t + γ·V_φ(s_{t+1}) - V_φ(s_t)
│   Â_t = δ_t + (γλ)·Â_{t+1}
│   G_t = Â_t + V_φ(s_t)   // return targets
│ Normalize: Â ← (Â - μ) / (σ + ε)
└────────────────────────────────────────────────────────
┌─ STEP 3: OPTIMIZE (K epochs over minibatches) ─────────
│ for epoch = 1 to K:
│   Shuffle buffer into minibatches
│   for each minibatch:
│     r_t(θ) = exp(log π_θ(a_t|s_t) - log π_old(...))
│     L_clip = min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t)
│     L_value = (V_φ(s_t) - G_t)²
│     L_entropy = H[π_θ(·|s_t)]
│     loss = -mean(L_clip) + c_v·mean(L_value)
│            - c_ent·mean(L_entropy)
│     gradient_step(θ, φ)
│
│   approx_kl = mean(log π_old - log π_θ)
│   if approx_kl > kl_target: early-stop epochs
└────────────────────────────────────────────────────────
θ_old ← θ, collect fresh trajectories

Initialize policy π_θ, reference policy π_ref ← π_θ
Set θ_old ← θ
NO VALUE MODEL NEEDED
for iteration = 1, 2, ... do
┌─ STEP 1: SAMPLE GROUPS ────────────────────────────────
│ Sample prompts {q_1, ..., q_B}
│ For each prompt q_j:
│   Generate G responses: {o_1, ..., o_G} ~ π_θ_old(·|q)
│   Score each: {r_1, ..., r_G} via reward/verifier
│   Store: responses, rewards, log π_θ_old(o_{i,t}|·)
└────────────────────────────────────────────────────────
┌─ STEP 2: COMPUTE GROUP ADVANTAGES ─────────────────────
│ For each prompt q_j:
│   μ = mean(r_1, ..., r_G), σ = std(r_1, ..., r_G)
│   Â_i = (r_i - μ) / σ   for each response
│   (One scalar per response; all tokens share it)
└────────────────────────────────────────────────────────
┌─ STEP 3: PPO-STYLE UPDATE (K epochs) ──────────────────
│ for epoch = 1 to K:
│   for each minibatch:
│     r_{i,t} = π_θ(o_{i,t}|q, o_{i,<t})
│             / π_θ_old(o_{i,t}|q, o_{i,<t})
│     L = min(r·Â, clip(r, 1-ε, 1+ε)·Â)
│     KL = approx_kl(π_θ, π_ref)
│     loss = -mean(L / |o_i|) + β·mean(KL)
│     gradient_step(θ)
└────────────────────────────────────────────────────────
θ_old ← θ

| | PPO | GRPO | DAPO |
|---|---|---|---|
| Value model | Yes (a full LLM critic) | No | No |
| Models in memory | 4 (actor, critic, ref, reward) | 3 (actor, ref, reward/verifier) | 3 (actor, ref, reward/verifier) |
| Advantage | Per-token via GAE | Per-response: $(r_i - \mu)/\sigma$ | Same as GRPO + fixes |
| KL handling | Per-token reward shaping | Loss penalty term | Dropped for RLVR tasks |
| Clipping | Symmetric $\epsilon$ | Symmetric $\epsilon$ | Asymmetric: $\epsilon_{\text{low}} < \epsilon_{\text{high}}$ |
| Dead groups | N/A (single response) | Zero gradient, wasted compute | Dynamic sampling filters them |
| Length bias | N/A | Yes ($1/|o_i|$ normalization) | Fixed with token-level loss |
| Samples per prompt | 1 response | $G$ responses (8–64) | $G$ responses + oversampling |
| Best for | General alignment (RLHF) | Reasoning RL (RLVR) | Reasoning RL at scale |
| Used by | OpenAI, Anthropic, Google | DeepSeek-R1, Qwen | Open-source SOTA, frontier labs |
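The "Advantage" and "Dead groups" rows can be made concrete with a small sketch (standard-library Python; both functions are illustrative, and the zero-σ guard for dead groups is an added assumption, since the plain GRPO formula would divide by zero there):

```python
import statistics

def gae(rewards, values, gamma=0.99, lam=0.95):
    """PPO-style per-token advantages via the GAE recursion:
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t);  A_t = delta_t + gamma*lam*A_{t+1}.
    `values` carries one extra entry for the bootstrap V(s_T)."""
    adv = [0.0] * len(rewards)
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv
        adv[t] = next_adv
    return adv

def group_advantages(rewards):
    """GRPO per-response advantage: normalize each reward against the
    mean/std of its own group of G samples."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        # Dead group (all rewards identical): every advantage is 0, so the
        # whole group contributes zero gradient. DAPO's dynamic sampling
        # filters such groups out before the update.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Per-token (PPO): with gamma = lam = 1 and a zero critic, A_t is the reward-to-go.
print(gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # → [2.0, 1.0]

# Per-response (GRPO): G = 4 verifier scores for one prompt (1 = correct).
print(group_advantages([1, 0, 0, 1]))  # → [1.0, -1.0, -1.0, 1.0]
print(group_advantages([0, 0, 0, 0]))  # dead group → [0.0, 0.0, 0.0, 0.0]
```

Note the structural difference the table summarizes: `gae` needs a learned critic (`values`) and produces one advantage per timestep, while `group_advantages` needs only the group's scalar rewards and broadcasts one advantage across every token of a response.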