Advantage Function & Variance Reduction

How subtracting a baseline transforms noisy returns into clean learning signals

REINFORCE: Raw Returns G_t

Gradient weight: ∇log π · G_t. If all returns are positive, all sampled actions are reinforced, just by different amounts!

With Baseline: Advantage A_t = G_t − V(s)

Gradient weight: ∇log π · A_t. Better-than-average actions are pushed up, worse-than-average actions are pushed down: a cleaner signal!
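
To make the sign change concrete, here is a minimal sketch in Python with NumPy. The return values are made up for illustration, and the batch mean stands in for V(s); a learned value function would replace it in practice.

```python
import numpy as np

# Hypothetical batch of returns G_t, one per sampled trajectory (illustrative values).
returns = np.array([12.0, 9.0, 15.0, 7.0, 11.0, 6.0])

# Assumption: the baseline is the batch mean return, standing in for V(s).
baseline = returns.mean()

# Advantage A_t = G_t - V(s): positive means better than average.
advantages = returns - baseline

# Raw returns are all positive, so every action would be pushed up.
print("all raw weights positive:", bool(np.all(returns > 0)))

# Advantages are centered on zero, so some actions are pushed up and some down.
num_up = int((advantages > 0).sum())
num_down = int((advantages < 0).sum())
print("advantages:", advantages)
print("pushed up:", num_up, "pushed down:", num_down)
```

With a batch-mean baseline the advantages sum to zero by construction, so the update redistributes probability between actions instead of raising all of them.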
Statistics shown: Mean Return (Baseline), Variance (Raw Returns), Variance (Advantages), Variance Reduction
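
How the demo computes these figures is not stated here. As one hedged way to reproduce the comparison, the sketch below measures the variance of the per-sample gradient estimate, since that is the quantity a baseline reduces (subtracting a single constant leaves the sample variance of the returns themselves unchanged). The two-armed bandit, its reward values, and the uniform softmax policy are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical two-armed bandit: uniform softmax policy, arm rewards near 10 and 11.
pi = np.array([0.5, 0.5])
actions = rng.integers(0, 2, size=n)
returns = np.array([10.0, 11.0])[actions] + rng.normal(0.0, 1.0, size=n)

# Score function for logit 0 of a softmax policy: 1[a == 0] - pi[0].
score = (actions == 0).astype(float) - pi[0]

# Assumption: batch-mean baseline, matching "Mean Return (Baseline)".
baseline = returns.mean()

# Per-sample gradient estimates, weighted by raw returns and by advantages.
grad_raw = score * returns
grad_adv = score * (returns - baseline)

var_raw = grad_raw.var()
var_adv = grad_adv.var()
reduction = 1.0 - var_adv / var_raw

print(f"mean gradient: raw={grad_raw.mean():.3f}, advantage={grad_adv.mean():.3f}")
print(f"variance: raw={var_raw:.3f}, advantage={var_adv:.3f}")
print(f"variance reduction: {100 * reduction:.1f}%")
```

Both estimators point in nearly the same direction on average, but the advantage-weighted one scatters far less around that average, which is why fewer samples are needed for a usable update.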
Legend: raw return G_t; positive advantage (better than average); negative advantage (worse than average); baseline V(s)