A toy demonstration of Direct Preference Optimization (DPO).
Instead of fitting a separate reward model, we optimize the policy directly on preference pairs using
r̂θ(x,y) = β log [πθ(y|x) / πref(y|x)].
The DPO loss for a preference pair (y_w, y_l) is −log σ(r̂θ(x,y_w) − r̂θ(x,y_l)), and its gradient is scaled by σ(r̂θ(x,y_l) − r̂θ(x,y_w)): the update is strongest when the model currently prefers the wrong response.
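A minimal sketch of the per-pair loss and gradient weight, assuming the four log-probs and β = 0.1 are hypothetical values chosen for illustration:

```python
import math

def dpo_pair_stats(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Implicit rewards, loss, and gradient weight for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred response.
    ref_w / ref_l:   frozen reference log-probs (assumed given).
    """
    r_w = beta * (logp_w - ref_w)                      # r̂θ(x, y_w)
    r_l = beta * (logp_l - ref_l)                      # r̂θ(x, y_l)
    margin = r_w - r_l
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))  # −log σ(margin)
    weight = 1.0 / (1.0 + math.exp(margin))            # σ(−margin), scales the gradient
    return loss, weight

# Model currently prefers the wrong response (logp_l > logp_w):
# the gradient weight is large.
loss_bad, w_bad = dpo_pair_stats(logp_w=-5.0, logp_l=-2.0, ref_w=-4.0, ref_l=-4.0)

# Model already ranks the pair correctly: the weight shrinks.
loss_ok, w_ok = dpo_pair_stats(logp_w=-2.0, logp_l=-5.0, ref_w=-4.0, ref_l=-4.0)
```

Here w_bad exceeds w_ok, which is the "gradient is strongest on mistakes" behavior in numbers.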
In this toy setup, each whole-response log-probability is a single trainable parameter. That makes the DPO update easy to inspect numerically.
The reference log-probs are frozen. DPO learns by changing the policy’s relative log-prob advantage over the reference.
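The whole update can be run end to end in a few lines. This is a sketch under the toy assumptions above: each whole-response log-prob is one free scalar (no normalization), the reference values stay frozen, and the gradients of −log σ(r̂_w − r̂_l) are written out in closed form. All numbers (β, learning rate, initial log-probs) are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy state: one trainable log-prob per whole response (hypothetical values).
policy = {"chosen": -4.0, "rejected": -4.0}   # trainable parameters
ref    = {"chosen": -4.0, "rejected": -4.0}   # frozen reference

beta, lr = 0.1, 1.0
for step in range(100):
    r_w = beta * (policy["chosen"]   - ref["chosen"])    # implicit reward, chosen
    r_l = beta * (policy["rejected"] - ref["rejected"])  # implicit reward, rejected
    s = sigmoid(r_l - r_w)  # gradient weight σ(r̂_l − r̂_w)
    # Closed-form gradient step on −log σ(r_w − r_l) w.r.t. the two scalars:
    # the chosen log-prob rises, the rejected one falls, both scaled by s.
    policy["chosen"]   += lr * beta * s
    policy["rejected"] -= lr * beta * s
```

After training, the policy's log-prob advantage over the reference is positive for the chosen response and negative for the rejected one, which is exactly the quantity DPO manipulates. Because the two gradients differ only in sign, the updates stay symmetric in this toy.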