Prompt & Responses
Example prompt: What is 24 × 17? (correct answer: 408). A group of candidate responses is sampled for this prompt and each is scored by the reward function.
DAPO Fixes
✦ Clip-Higher: decouples the clip range and raises the upper bound, so low-probability tokens can still gain probability mass (preserves exploration).
✦ Dynamic Sampling: discards and resamples prompts whose group rewards are all identical, since a zero-variance group yields zero advantage and no gradient.
✦ Token-Level Loss: averages the policy-gradient loss over all tokens in the batch rather than per response, so long responses are not down-weighted token by token.
✦ Overlong Shaping: replaces the hard penalty for responses truncated at the length limit with a soft, length-proportional one.
Each of these fixes modifies the optimization described below.
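Two of the fixes above are simple enough to sketch directly. This is a minimal illustration, not DAPO's exact implementation; the function names and the `max_len`/`buffer` parameters are my own choices.

```python
def keep_group(rewards):
    """Dynamic Sampling (sketch): keep a prompt only if its group rewards
    differ, i.e. the group has nonzero variance and thus a learning signal."""
    return max(rewards) != min(rewards)

def overlong_penalty(length, max_len=2048, buffer=512):
    """Overlong Shaping (sketch): no penalty below max_len - buffer,
    then a linear penalty ramping down to -1 at max_len."""
    start = max_len - buffer
    if length <= start:
        return 0.0
    return -(length - start) / buffer
```

For example, a group with rewards [1, 1, 1] (all correct) is dropped and resampled, while a truncated response near the length limit receives a graded rather than all-or-nothing penalty.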
Group Advantage Computation
For each prompt, a group of responses is sampled and scored; the group's reward statistics (mean μ, standard deviation σ, and min/max reward) determine each response's advantage:

Âi = (ri − μ) / σ → good responses get Âi > 0, bad ones get Âi < 0
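The group-normalized advantage above can be computed in a few lines; this is a minimal sketch, with the function name and `eps` stabilizer as my own assumptions.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward by its group's
    mean and (population) standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 sampled answers to "What is 24 x 17?",
# reward 1.0 if the answer is 408, else 0.0.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_advantages(rewards)
```

Here μ = 0.5 and σ = 0.5, so correct responses get advantage ≈ +1 and incorrect ones ≈ −1, matching the sign rule above.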
PPO vs GRPO Comparison
PPO (Classic)
✦ Per-token advantages via GAE
✦ Requires learned value model V(s)
✦ 4 models: actor, critic, ref, reward
✦ Fine-grained credit assignment
✦ Best for general RLHF alignment
GRPO (DeepSeek)
✦ Per-response advantages via group
✦ No value model needed
✦ 3 models: actor, ref, reward/verifier
✦ Simpler, less memory
✦ Best for reasoning RL (RLVR)
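Both algorithms share the same clipped surrogate objective; they differ only in where the advantage comes from. A per-token sketch of that shared term (function name and the decoupled `eps_low`/`eps_high` split are illustrative; setting `eps_high > eps_low` gives the Clip-Higher variant):

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage,
                      eps_low=0.2, eps_high=0.2):
    """Pessimistic clipped policy-gradient term used by PPO and GRPO.
    PPO feeds in per-token GAE advantages from a learned critic;
    GRPO broadcasts one group-normalized advantage to every token
    of the response, so no value model is needed."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage the gain is capped at `1 + eps_high` times the advantage; with a negative advantage the ratio is clipped at `1 - eps_low`, which is what makes the update pessimistic in both directions.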