Reward Model Training Demo

A toy Bradley–Terry reward model trained on preference pairs. The “reward model” here is a simple linear scalar head over response features, so you can see the math directly. The key idea is the same as in RLHF: learn a scalar r(x, y) so preferred responses score higher than dispreferred ones.
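A minimal sketch of that idea, assuming numpy and a made-up three-feature representation of each response (the feature layout, hidden rule, learning rate, and step count are all illustrative, not the demo's actual values). The head scores each response as r(y) = w · φ(y), and the Bradley–Terry loss −log σ(r(x, y_w) − r(x, y_l)) pushes the preferred response's score above the dispreferred one's:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical 3-dim response features, e.g. [correctness, safety, verbosity].
hidden = np.array([2.0, 1.5, 0.2])    # stand-in "human preference" rule
feats = rng.normal(size=(256, 2, 3))  # 256 pairs of candidate responses

# Order each pair so index 0 is the preferred response under the hidden rule.
scores = feats @ hidden
swap = scores[:, 1] > scores[:, 0]
feats[swap] = feats[swap][:, ::-1]

w = np.zeros(3)  # linear reward head: r(y) = w @ phi(y)
for step in range(200):
    margin = feats[:, 0] @ w - feats[:, 1] @ w  # r(y_w) - r(y_l), per pair
    # Bradley-Terry loss: mean of -log sigmoid(margin); its gradient in w is
    # -(1 - sigmoid(margin)) * (phi_w - phi_l), averaged over pairs.
    grad = -((1.0 - sigmoid(margin))[:, None] * (feats[:, 0] - feats[:, 1])).mean(axis=0)
    w -= 0.5 * grad

print("pair accuracy:", (margin > 0).mean())  # should approach 1.0
```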

Controls

Synthetic labels were generated from a hidden “human preference” rule that values correctness and safety more than mere verbosity.
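One way such a labeler might look (a sketch only: the feature names, the HIDDEN_W weights, and label_pair are hypothetical stand-ins for the demo's actual rule). Each response is scored by a hidden linear rule that weights correctness and safety heavily and verbosity lightly, and the preferred response is sampled with Bradley–Terry probability, so labels are noisy in the way human labels are:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hidden rule: correctness and safety dominate verbosity.
HIDDEN_W = {"correctness": 3.0, "safety": 2.0, "verbosity": 0.3}

def hidden_score(resp):
    return sum(HIDDEN_W[k] * resp[k] for k in HIDDEN_W)

def label_pair(resp_a, resp_b):
    """Return (preferred, dispreferred), sampled with Bradley-Terry probability."""
    diff = hidden_score(resp_a) - hidden_score(resp_b)
    p_a = 1.0 / (1.0 + np.exp(-diff))  # P(a preferred over b)
    return (resp_a, resp_b) if rng.random() < p_a else (resp_b, resp_a)

# Example: a short, correct, safe answer vs. a long but weaker one.
a = {"correctness": 0.9, "safety": 0.9, "verbosity": 0.2}
b = {"correctness": 0.6, "safety": 0.4, "verbosity": 0.9}
winner, loser = label_pair(a, b)
```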

Reward head weights

Note: in Bradley–Terry, only reward differences matter. Adding the same constant to every response's reward leaves every preference probability unchanged.
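The check is one line of standard Bradley–Terry algebra (not specific to this demo): the shared constant c cancels inside the difference.

```latex
P(y_w \succ y_l \mid x)
  = \sigma\bigl((r(x, y_w) + c) - (r(x, y_l) + c)\bigr)
  = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
```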

Training State

Live metrics: training step, average loss, pair accuracy, and average reward margin.

Training History

Charts of average loss, pair accuracy, and average reward margin over training steps.

Inspect a Preference Pair

All Preference Pairs