πŸ–¨οΈ Printing Instructions: Press Ctrl/Cmd + P and select "Save as PDF".
Reward Signals

Reward Models, DPO, Verifiable Rewards, and Process Rewards: The Specification Language of Post-Training

Learning Objectives

From Last Time

Part 1: The Reward Landscape

Three Sources of Reward

The Fundamental Tradeoff

Source | Signal Quality | Scalability | Generality | Cost
Verifiable rewards | Exact | Unlimited | Narrow (math, code, formal) | Near zero
LLM-as-judge / RLAIF | Good (depends on judge) | Very high | Broad | Low (API calls)
Learned reward model | Approximate | High (once trained) | Broad | Medium (train RM)
Human preferences | High but noisy | Low | Broad | High ($1–5/label)
DPO (implicit) | Approximate | Limited to dataset | Broad | Low (supervised)
The tradeoff is signal quality vs. scalability vs. generality. No single source covers all needs; frontier labs combine them.
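The "learned reward model" and "human preferences" rows above connect through the Bradley-Terry model covered in Part 2: a reward model is trained so that the probability of one response beating another is a sigmoid of their score difference. A minimal sketch of the per-pair training loss (plain scalars here; a real implementation batches model outputs and uses `F.logsigmoid`):

```python
import math

def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under Bradley-Terry, P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    so the per-pair loss is -log sigmoid(r_chosen - r_rejected).
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that scores the chosen response higher incurs low loss:
low = bradley_terry_nll(r_chosen=2.0, r_rejected=-1.0)
# A model that prefers the rejected response is penalized heavily:
high = bradley_terry_nll(r_chosen=-1.0, r_rejected=2.0)
```

Note that only the score *difference* matters: shifting both rewards by a constant leaves the loss unchanged, which is why raw RM scores are not comparable across models.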
Part 2: Reward Models from Preferences

The Bradley-Terry Preference Model

Training a Reward Model

Discriminative vs. Generative Reward Models

Interactive Demo: Reward Model Training

Part 3: DPO (Direct Preference Optimization)

The DPO Insight

DPO Derivation Step 1: The Optimal Policy

DPO Derivation Step 2: From Reward to Policy to Loss

Interactive Demo: DPO

Understanding the DPO Gradient

What DPO Computes in Practice

DPO Variants That Matter

Reward Model + RL vs. DPO: When to Use Which

 | Reward Model + PPO/GRPO | DPO (and variants)
Models in memory | 3–4 (actor, [critic], ref, RM) | 1–2 (actor, [ref])
Training loop | Online: generate → score → update → repeat | Supervised: (typically) fixed dataset of pairs
Reward signal | Learned scalar, applicable to any new response | Implicit, only defined on the preference data
Data efficiency | One RM across many RL iterations | Each pair used directly in the loss
Strength | Online exploration, arbitrary rewards, long-horizon | Simpler, more stable, fewer hyperparameters
Weakness | Complex, expensive, RM can be hacked | Offline by default, constrained to data distribution
Best for | Continuous improvement, complex behaviors | Quick alignment, smaller teams, limited compute
DPO is simpler; RM+RL is more powerful. Frontier labs typically use both at different stages.
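The "implicit" reward signal in the table is concrete: DPO defines the reward of a response as β times its log-probability ratio against the reference policy, then plugs those implicit rewards into the Bradley-Terry likelihood. A per-pair sketch, assuming scalar log-probabilities stand in for the summed token log-probs a real batched implementation would use:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss (toy scalar version).

    Implicit reward of a response: beta * (log pi_theta - log pi_ref).
    The loss is the Bradley-Terry NLL of the chosen response winning
    under these implicit rewards: -log sigmoid(margin).
    """
    implicit_w = beta * (logp_w - ref_logp_w)  # implicit reward, chosen
    implicit_l = beta * (logp_l - ref_logp_l)  # implicit reward, rejected
    margin = implicit_w - implicit_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

This makes the table's "1–2 models in memory" row visible: the only quantities needed are log-probs from the policy and the frozen reference, with no separate reward model and no sampling loop.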
Part 4: Verifiable Rewards and RLVR

The Simplest and Most Powerful Reward: Check the Answer

Why Verifiable Rewards Changed Everything

Part 5: Process Reward Models

Outcome vs. Process Rewards

Training Process Reward Models

Outcome vs. Process vs. Generative Process Reward Models

 | ORM | Discriminative PRM | Generative PRM (ThinkPRM)
Granularity | Per-response | Per-step | Per-step + explanation
Credit assignment | Sparse | Dense | Dense + interpretable
Training data | Final answer correctness | Step labels (human or MC) | ~1% of step labels + CoT fine-tuning
Domain transfer | Moderate | Fragile under domain shift | More robust (uses reasoning)
Compute at inference | Fixed | Fixed | Scalable (more CoT = better)
Best for | General alignment | Math reasoning, training | Math/code, test-time search
The field is shifting from discriminative to generative PRMs: cheaper to train, more robust, scalable at inference.
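The "sparse vs. dense" credit-assignment row can be made concrete with a toy scoring example. One common way to aggregate per-step PRM scores into a single solution score for ranking is the minimum step probability (a "weakest link" rule); the aggregation choice and the numbers below are illustrative, not a fixed standard:

```python
def orm_score(final_correct: bool) -> float:
    """An ORM only sees the outcome: 1.0 if the final answer checks out."""
    return 1.0 if final_correct else 0.0

def prm_score(step_scores: list[float]) -> float:
    """A discriminative PRM scores every step; aggregate with the minimum
    step probability so one bad step sinks the whole solution."""
    return min(step_scores)

# A solution that reaches the right answer through a flawed middle step:
steps = [0.95, 0.20, 0.90]  # hypothetical per-step correctness probabilities
orm = orm_score(final_correct=True)  # the flaw is invisible to the ORM
prm = prm_score(steps)               # the flaw dominates the PRM score
```

This is why PRMs matter for long reasoning chains and test-time search: a lucky wrong-reasoning/right-answer trace scores high under an ORM but low under a PRM.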
Part 6: Reward Hacking

Goodhart's Law: When a Measure Becomes a Target

Reward Hacking in Frontier Reasoning Models

Mitigations for Reward Hacking

Interactive Demo: Reward Hacking

Part 7: Scaling Reward Signals with LLM-as-Judge

LLM-as-Judge: The Dominant Source of Preference Data

Part 8: Putting It All Together

The Complete Reward Signal Toolkit

Reward Source | Quality | Scalability | Generality | Extra Models | Best For
Verifiable | Exact | Unlimited | Narrow | 0 | RLVR (GRPO/DAPO)
Discriminative RM | Approximate | High | Broad | +1 (RM) | Online RL (PPO/GRPO)
Generative RM | Good | Moderate | Broad | +1 (GenRM) | RL + interpretable scoring
DPO (implicit) | Approximate | Dataset-limited | Broad | 0 | Offline alignment
Process RM | Good | Moderate | Reasoning | +1 (PRM) | Long chains, test-time search
LLM-as-Judge | Good | Very high | Broad | 0 (API) | Preference data at scale
Frontier labs combine multiple sources across training stages. No single method covers all needs.
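The "Verifiable / Exact / 0 extra models" row is the one with no learned component at all: the reward is a program that checks the answer. A toy grader in the spirit of RLVR, assuming the gold answer is a single number and using a deliberately naive last-number extractor (real graders normalize formats, handle boxed answers, and run unit tests for code):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary reward for RLVR-style training: extract the last number in the
    response and compare it exactly to the reference answer string."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0  # no extractable answer counts as wrong
    return 1.0 if numbers[-1] == gold_answer else 0.0
```

Because the checker is exact and costs nothing per call, it scales to unlimited rollouts, which is exactly the tradeoff the table's first row summarizes; the price is that it only covers domains with checkable answers.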
How the Pieces Fit Together (Preview of Next Lecture)

Summary

All Interactive Demos

Lecture Summary

Supplementary Resources