GRPO: Group Relative Policy Optimization

Sample multiple responses per prompt → compute group-relative advantages → no critic needed


Prompt & Responses

Prompt: What is 24 × 17? · Group size G = 8
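
A minimal sketch of this sampling step, assuming a hypothetical `policy.generate` API; the verifier is just a rule check for the correct answer (24 × 17 = 408).

```python
def verify(response: str, answer: str = "408") -> float:
    """Binary verifier reward: 1.0 if the correct answer appears, else 0.0."""
    return 1.0 if answer in response else 0.0

def sample_group(policy, prompt: str, G: int = 8) -> tuple[list[str], list[float]]:
    """Draw G candidate responses for one prompt and score each with the verifier."""
    responses = [policy.generate(prompt, temperature=1.0) for _ in range(G)]
    return responses, [verify(r) for r in responses]
```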

DAPO Fixes

Clip-Higher: decouple the clip range (ε_low < ε_high) so low-probability tokens can still gain probability mass, countering entropy collapse.
Dynamic Sampling: drop and resample groups whose rewards are all identical (all-correct or all-wrong), since their advantages are zero and contribute no gradient.
Token-Level Loss: average the loss over all tokens in the batch rather than per response, so long responses are not down-weighted.
Overlong Shaping: replace the hard penalty on truncated responses with a soft, length-aware penalty, reducing reward noise.
Each fix targets a distinct failure mode of vanilla GRPO; the first two are sketched below.
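
A minimal sketch of Clip-Higher and Dynamic Sampling, assuming PyTorch tensors (per-token importance ratios and advantages for the former, a group's reward vector for the latter); the ε defaults follow the DAPO paper (ε_low = 0.2, ε_high = 0.28).

```python
import torch

def clip_higher_objective(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: a looser upper clip bound (1 + eps_high) lets
    low-probability tokens gain mass, preserving exploration."""
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    return torch.minimum(ratio * adv, clipped)

def keep_group(group_rewards: torch.Tensor) -> bool:
    """Dynamic Sampling: skip groups whose rewards are all identical
    (all-correct or all-wrong): sigma = 0, advantages vanish, no gradient."""
    return bool(group_rewards.std() > 0)
```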

Group Advantage Computation

From the group's G rewards, compute the group mean μ and standard deviation σ, then normalize each reward:

Âᵢ = (rᵢ − μ) / σ   →   good responses get Âᵢ > 0, bad ones get Âᵢ < 0
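
A worked instance of the formula in PyTorch, with an illustrative binary reward vector for G = 8 (the values are made up for the example):

```python
import torch

rewards = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])  # verifier rewards r_i
mu, sigma = rewards.mean(), rewards.std()                  # group statistics
advantages = (rewards - mu) / (sigma + 1e-8)               # Â_i = (r_i − μ) / σ
print(advantages)  # correct answers get ≈ +0.94, wrong ones ≈ −0.94
```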


PPO vs GRPO Comparison

PPO (Classic)

✦ Per-token advantages via GAE (sketched below)
✦ Requires learned value model V(s)
✦ 4 models: actor, critic, ref, reward
✦ Fine-grained credit assignment
✦ Best for general RLHF alignment
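
For contrast with GRPO's single per-response advantage, a compact sketch of PPO's GAE recursion, assuming 1-D tensors of per-token rewards and critic values V(s); γ and λ are the usual discount and GAE parameters.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: backward recursion over TD residuals."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0      # bootstrap 0 at episode end
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual δ_t
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```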

GRPO (DeepSeek)

✦ Per-response advantages via group
✦ No value model needed
✦ 3 models: actor, ref, reward/verifier
✦ Simpler, less memory
✦ Best for reasoning RL (RLVR)
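
To make the contrast concrete, here is a sketch of the GRPO surrogate for one group, under assumed shapes: (G, T) per-token log-probs from the current, old, and frozen reference policies, the (G,) group advantages from above, and a 0/1 response mask. The k3 KL estimator and β = 0.04 follow the DeepSeekMath setup; everything else is illustrative.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, adv, mask, eps=0.2, beta=0.04):
    """GRPO surrogate for one group; negated so SGD maximizes the objective."""
    ratio = torch.exp(logp - logp_old)            # per-token importance ratio
    a = adv.unsqueeze(-1)                         # broadcast (G,) -> (G, 1)
    surr = torch.minimum(ratio * a, torch.clamp(ratio, 1 - eps, 1 + eps) * a)
    # k3 estimator of KL(pi || pi_ref), applied as a per-token penalty
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1
    per_token = surr - beta * kl
    per_seq = (per_token * mask).sum(-1) / mask.sum(-1)   # per-response mean
    return -per_seq.mean()                                # group mean, negated
```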