Click "Next Step" to sample a batch of trajectories and compare raw-return REINFORCE with the advantage (baseline) version
REINFORCE: Raw Returns G_t
Gradient weight: ∇log π · G_t. When every return is positive, every sampled action is reinforced!
With Baseline: Advantage A_t = G_t − V(s_t)
Gradient weight: ∇log π · A_t. Better-than-average actions ↑, worse-than-average actions ↓, lower-variance signal!
Positive advantage (better than avg)
Negative advantage (worse than avg)
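
The comparison above can be sketched numerically. This is a minimal illustration, not the demo's actual code: the four return values are made up, and the batch mean is assumed to stand in for the learned baseline V(s).

```python
# Hypothetical batch of returns G_t, one per sampled trajectory (all positive).
returns = [2.0, 5.0, 3.0, 6.0]

# Assumption: the batch mean stands in for the learned baseline V(s).
baseline = sum(returns) / len(returns)

# Plain REINFORCE weights each grad-log-prob term by the raw return,
# so here every weight is positive and every action is pushed up.
reinforce_weights = list(returns)

# With a baseline, each term is weighted by the advantage A_t = G_t - V(s),
# which is positive for above-average returns and negative for below-average ones.
advantages = [g - baseline for g in returns]

for g, a in zip(returns, advantages):
    direction = "up" if a > 0 else "down" if a < 0 else "unchanged"
    print(f"G_t = {g:.1f}  A_t = {a:+.1f}  action probability pushed {direction}")
```

With these made-up numbers the raw returns all push probabilities up, while the advantages split the batch into two increases and two decreases around the mean.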