A toy demonstration of Direct Preference Optimization (DPO).
Instead of fitting a separate reward model, we optimize the policy directly on preference pairs using
r̂θ(x,y) = β log [πθ(y|x) / πref(y|x)].
The DPO loss for a preference pair (y_w, y_l) is −log σ(r̂θ(x,y_w) − r̂θ(x,y_l)), and its gradient is scaled by σ(r̂θ(x,y_l) − r̂θ(x,y_w)): the update is strongest when the model currently prefers the wrong response.
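A minimal sketch of the per-pair loss and gradient weight, assuming the four log-probs and β = 0.1 are hypothetical values chosen for illustration:

```python
import math

def dpo_pair_stats(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Implicit rewards, loss, and gradient weight for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_w / ref_l:   frozen reference log-probs (assumed given).
    """
    r_w = beta * (logp_w - ref_w)                      # r̂θ(x, y_w)
    r_l = beta * (logp_l - ref_l)                      # r̂θ(x, y_l)
    margin = r_w - r_l
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # −log σ(margin)
    weight = 1.0 / (1.0 + math.exp(margin))            # σ(−margin), scales the gradient
    return loss, weight

# Model currently prefers the wrong response (logp_l > logp_w):
# the gradient weight is large.
loss_bad, w_bad = dpo_pair_stats(logp_w=-5.0, logp_l=-2.0, ref_w=-4.0, ref_l=-4.0)

# Model already ranks the pair correctly: the weight shrinks.
loss_ok, w_ok = dpo_pair_stats(logp_w=-2.0, logp_l=-5.0, ref_w=-4.0, ref_l=-4.0)
```

Here w_bad exceeds w_ok, which is the "gradient is strongest on mistakes" behavior in numbers.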
In this toy setup, each whole-response log-probability is a single trainable parameter. That makes the DPO update easy to inspect numerically.
The reference log-probs are frozen. DPO learns by changing the policy’s relative log-prob advantage over the reference.
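The whole update can be run end to end in a few lines. This is a sketch under the toy assumptions above: each whole-response log-prob is one free scalar (no normalization), the reference values stay frozen, and the gradients of −log σ(r̂_w − r̂_l) are written out in closed form. All numbers (β, learning rate, initial log-probs) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy state: one trainable log-prob per whole response (hypothetical values).
policy = {"chosen": -4.0, "rejected": -4.0}   # trainable parameters
ref    = {"chosen": -4.0, "rejected": -4.0}   # frozen reference

beta, lr = 0.1, 1.0
for step in range(100):
    r_w = beta * (policy["chosen"]   - ref["chosen"])    # implicit reward, chosen
    r_l = beta * (policy["rejected"] - ref["rejected"])  # implicit reward, rejected
    s = sigmoid(r_l - r_w)  # gradient weight σ(r̂_l − r̂_w)
    # Closed-form gradient step on −log σ(r_w − r_l) w.r.t. the two scalars:
    # the chosen log-prob rises, the rejected one falls, both scaled by s.
    policy["chosen"]   += lr * beta * s
    policy["rejected"] -= lr * beta * s
```

After training, the policy's log-prob advantage over the reference is positive for the chosen response and negative for the rejected one, which is exactly the quantity DPO manipulates. Because the two gradients differ only in sign, the updates stay symmetric in this toy.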