DPO Demo

A toy demonstration of Direct Preference Optimization (DPO). Instead of fitting a separate reward model, we optimize the policy directly on preference pairs using the implicit reward r̂θ(x,y) = β log [πθ(y|x) / πref(y|x)]; the per-pair loss is −log σ(r̂θ(x,y_w) − r̂θ(x,y_l)), where y_w is the chosen and y_l the rejected response. The gradient is strongest when the model currently prefers the wrong response.
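The loss and its gradient weight can be sketched in a few lines. This is a minimal illustration with whole-response log-probs as plain scalars; the function name and the specific numbers are made up for the example.

```python
import math

def dpo_loss(beta, logp_chosen, logp_rejected, ref_chosen, ref_rejected):
    """Per-pair DPO loss with whole-response log-probs as scalars."""
    # Implicit rewards: beta times the policy's log-prob advantage over the reference.
    r_w = beta * (logp_chosen - ref_chosen)
    r_l = beta * (logp_rejected - ref_rejected)
    margin = r_w - r_l
    # loss = -log(sigmoid(margin)); grad_weight = sigmoid(-margin).
    loss = math.log1p(math.exp(-margin))
    grad_weight = 1.0 / (1.0 + math.exp(margin))
    return loss, grad_weight

# Here the policy prefers the rejected response (margin is negative),
# so the gradient weight exceeds 0.5 and the update is relatively strong.
loss, w = dpo_loss(beta=0.1, logp_chosen=-5.0, logp_rejected=-2.0,
                   ref_chosen=-4.0, ref_rejected=-4.0)
```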

Controls

In this toy setup, each whole-response log-probability is a single trainable parameter. That makes the DPO update easy to inspect numerically.

Chosen log πθ
Rejected log πθ

The reference log-probs are frozen. DPO learns by changing the policy’s relative log-prob advantage over the reference.

Training State

Step
Avg loss
Pair accuracy
Avg grad weight

Automatic Curriculum

Average DPO loss
Average gradient weight σ(−margin)
Pair accuracy

Selected Pair

All Preference Pairs