A toy Bradley–Terry reward model trained on preference pairs.
The “reward model” here is a simple linear scalar head over response features, so you can see the math directly.
The key idea is the same as in RLHF: learn a scalar reward r(x, y) so preferred responses score higher than dispreferred ones; under the Bradley–Terry model, P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)).
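The training loop above can be sketched end to end in NumPy. This is a minimal illustration, not the document's actual code: the feature dimension, learning rate, and the use of synthetic linearly separable pairs are all assumptions. It fits a linear scalar head by minimizing the Bradley–Terry negative log-likelihood, −log σ(r_w − r_l), with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: each response is a feature vector phi(y),
# and the reward is a linear scalar head r(y) = w . phi(y).
d = 4                                       # feature dimension (assumption)
w_true = np.array([2.0, 1.5, -0.5, 0.1])    # hidden "true" preference weights

# Synthetic preference pairs: whichever response scores higher
# under w_true is labeled as preferred.
n_pairs = 200
a = rng.normal(size=(n_pairs, d))
b = rng.normal(size=(n_pairs, d))
prefer_a = (a @ w_true) > (b @ w_true)
phi_w = np.where(prefer_a[:, None], a, b)   # preferred features
phi_l = np.where(prefer_a[:, None], b, a)   # dispreferred features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Bradley-Terry loss: -log sigma(r_w - r_l), averaged over pairs.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    margin = (phi_w - phi_l) @ w            # r(x, y_w) - r(x, y_l)
    grad = -((1 - sigmoid(margin))[:, None] * (phi_w - phi_l)).mean(axis=0)
    w -= lr * grad

# Pairwise ranking accuracy of the learned head on the training pairs.
acc = np.mean((phi_w @ w) > (phi_l @ w))
```

Because the synthetic labels are noiseless, the learned head should rank nearly all training pairs correctly; with noisy human labels the accuracy would plateau lower.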
Synthetic labels were generated from a hidden “human preference” rule that values correctness and safety more than mere verbosity.
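A hidden labeling rule like the one described might look like the following sketch. The feature names and weights here are hypothetical, chosen only to encode "correctness and safety matter more than verbosity":

```python
# Hypothetical hidden "human preference" rule (weights are assumptions):
# correctness and safety dominate verbosity.
def hidden_preference_score(correctness, safety, verbosity):
    return 3.0 * correctness + 2.0 * safety + 0.2 * verbosity

# Label one synthetic pair: the response with the higher hidden
# score becomes the "preferred" side of the preference pair.
resp_a = {"correctness": 1.0, "safety": 1.0, "verbosity": 0.1}  # correct, safe, terse
resp_b = {"correctness": 0.2, "safety": 0.5, "verbosity": 0.9}  # verbose but weak
score_a = hidden_preference_score(**resp_a)
score_b = hidden_preference_score(**resp_b)
label = "a" if score_a > score_b else "b"
```

The reward model never sees this rule; it only sees which response won, and must recover the underlying trade-off from the pairwise labels.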
Note: in Bradley–Terry, only reward differences matter. Adding the same constant to every response's reward leaves every preference probability unchanged, because the model depends only on r(x, y_w) − r(x, y_l).
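The shift invariance is easy to check numerically; the reward values and the shift constant below are arbitrary examples:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary rewards for a preferred/dispreferred pair.
r_w, r_l = 2.3, 0.7
c = 100.0  # the same constant added to every response's reward

p = sigmoid(r_w - r_l)                  # preference probability
p_shifted = sigmoid((r_w + c) - (r_l + c))  # after the constant shift
```

The two probabilities agree (up to floating-point rounding), which is why the learned reward scale has no absolute zero point.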