GRPO: Group Relative Policy Optimization

Sample multiple responses per prompt → compute group-relative advantages → no critic needed


Prompt & Responses

Prompt: What is 24 × 17? · Group size G = 8
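
A minimal sketch of this sampling step, assuming a hypothetical `policy.generate` API; the verifier is just a rule check for the correct answer (24 × 17 = 408).

```python
def verify(response: str, answer: str = "408") -> float:
    """Binary verifier reward: 1.0 if the correct answer appears, else 0.0."""
    return 1.0 if answer in response else 0.0

def sample_group(policy, prompt: str, G: int = 8) -> tuple[list[str], list[float]]:
    """Draw G candidate responses for one prompt and score each with the verifier."""
    responses = [policy.generate(prompt, temperature=1.0) for _ in range(G)]
    return responses, [verify(r) for r in responses]
```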

DAPO Fixes

Clip-Higher: decouple the clip range (ε_low < ε_high) so low-probability tokens can still gain probability mass, countering entropy collapse.
Dynamic Sampling: drop and resample groups whose rewards are all identical (all-correct or all-wrong), since their advantages are zero and contribute no gradient.
Token-Level Loss: average the loss over all tokens in the batch rather than per response, so long responses are not down-weighted.
Overlong Shaping: replace the hard penalty on truncated responses with a soft, length-aware penalty, reducing reward noise.
Each fix targets a distinct failure mode of vanilla GRPO; the first two are sketched below.
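
A minimal sketch of Clip-Higher and Dynamic Sampling, assuming PyTorch tensors (per-token importance ratios and advantages for the former, a group's reward vector for the latter); the ε defaults follow the DAPO paper (ε_low = 0.2, ε_high = 0.28).

```python
import torch

def clip_higher_objective(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: a looser upper clip bound (1 + eps_high) lets
    low-probability tokens gain mass, preserving exploration."""
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    return torch.minimum(ratio * adv, clipped)

def keep_group(group_rewards: torch.Tensor) -> bool:
    """Dynamic Sampling: skip groups whose rewards are all identical
    (all-correct or all-wrong): sigma = 0, advantages vanish, no gradient."""
    return bool(group_rewards.std() > 0)
```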

Group Advantage Computation

From the group's G rewards, compute the group mean μ and standard deviation σ, then normalize each reward:

Âᵢ = (rᵢ − μ) / σ   →   good responses get Âᵢ > 0, bad ones get Âᵢ < 0
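
A worked instance of the formula in PyTorch, with an illustrative binary reward vector for G = 8 (the values are made up for the example):

```python
import torch

rewards = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])  # verifier rewards r_i
mu, sigma = rewards.mean(), rewards.std()                  # group statistics
advantages = (rewards - mu) / (sigma + 1e-8)               # Â_i = (r_i − μ) / σ
print(advantages)  # correct answers get ≈ +0.94, wrong ones ≈ −0.94
```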


PPO vs GRPO Comparison

PPO (Classic)

✦ Per-token advantages via GAE (sketched below)
✦ Requires learned value model V(s)
✦ 4 models: actor, critic, ref, reward
✦ Fine-grained credit assignment
✦ Best for general RLHF alignment
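
For contrast with GRPO's single per-response advantage, a compact sketch of PPO's GAE recursion, assuming 1-D tensors of per-token rewards and critic values V(s); γ and λ are the usual discount and GAE parameters.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: backward recursion over TD residuals."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0      # bootstrap 0 at episode end
        delta = rewards[t] + gamma * next_v - values[t]   # TD residual δ_t
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```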

GRPO (DeepSeek)

✦ Per-response advantages via group
✦ No value model needed
✦ 3 models: actor, ref, reward/verifier
✦ Simpler, less memory
✦ Best for reasoning RL (RLVR)
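
To make the contrast concrete, here is a sketch of the GRPO surrogate for one group, under assumed shapes: (G, T) per-token log-probs from the current, old, and frozen reference policies, the (G,) group advantages from above, and a 0/1 response mask. The k3 KL estimator and β = 0.04 follow the DeepSeekMath setup; everything else is illustrative.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, adv, mask, eps=0.2, beta=0.04):
    """GRPO surrogate for one group; negated so SGD maximizes the objective."""
    ratio = torch.exp(logp - logp_old)            # per-token importance ratio
    a = adv.unsqueeze(-1)                         # broadcast (G,) -> (G, 1)
    surr = torch.minimum(ratio * a, torch.clamp(ratio, 1 - eps, 1 + eps) * a)
    # k3 estimator of KL(pi || pi_ref), applied as a per-token penalty
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1
    per_token = surr - beta * kl
    per_seq = (per_token * mask).sum(-1) / mask.sum(-1)   # per-response mean
    return -per_seq.mean()                                # group mean, negated
```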