| Source | Signal Quality | Scalability | Generality | Cost |
|---|---|---|---|---|
| Verifiable rewards | Exact | Unlimited | Narrow (math, code, formal) | Near zero |
| LLM-as-judge / RLAIF | Good (depends on judge) | Very high | Broad | Low (API calls) |
| Learned reward model | Approximate | High (once trained) | Broad | Medium (train RM) |
| Human preferences | High but noisy | Low | Broad | High ($1–5/label) |
| DPO (implicit) | Approximate | Limited to dataset | Broad | Low (supervised) |
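The "exact, near-zero-cost" character of verifiable rewards is easiest to see in code. A minimal sketch of a rule-based grader, assuming math-style responses that mark the final answer with `\boxed{...}` (the function name and convention are illustrative, not from any specific library):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Exact-match reward: 1.0 if the response's boxed answer equals
    the gold answer, else 0.0. No learned model is involved, so the
    signal is exact and essentially free to compute."""
    # Hypothetical convention: the final answer appears as \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

The trade-off in the table follows directly: the check is unlimited in scale but only applies where a gold answer (or test suite, or proof checker) exists.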
| | Reward Model + PPO/GRPO | DPO (and variants) |
|---|---|---|
| Models in memory | 3–4 (actor, [critic], ref, RM) | 1–2 (actor, [ref]) |
| Training loop | Online: generate → score → update → repeat | Supervised: (typically) fixed dataset of pairs |
| Reward signal | Learned scalar; applies to any new response | Implicit; only defined on the preference data |
| Data efficiency | One RM across many RL iterations | Each pair used directly in the loss |
| Strength | Online exploration, arbitrary rewards, long-horizon | Simpler, more stable, fewer hyperparameters |
| Weakness | Complex, expensive, RM can be hacked | Offline by default, constrained to data distribution |
| Best for | Continuous improvement, complex behaviors | Quick alignment, smaller teams, limited compute |
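The "implicit reward" row can be made concrete. A minimal sketch of the per-pair DPO loss, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model (variable names are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair. Only two models are needed
    (actor + frozen reference); the reward is implicit in the
    policy-vs-reference log-ratio rather than a separate RM."""
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin): shrinks as the policy separates the pair.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the loss only ever touches the pairs in the dataset, the signal is undefined for novel responses, which is exactly the "offline by default" weakness in the table.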
| | ORM | Discriminative PRM | Generative PRM (ThinkPRM) |
|---|---|---|---|
| Granularity | Per-response | Per-step | Per-step + explanation |
| Credit assignment | Sparse | Dense | Dense + interpretable |
| Training data | Final answer correctness | Step labels (human or MC) | ~1% of step labels + CoT fine-tuning |
| Domain transfer | Moderate | Fragile under domain shift | More robust (uses reasoning) |
| Compute at inference | Fixed | Fixed | Scalable (more CoT = better) |
| Best for | General alignment | Math reasoning, training | Math/code, test-time search |
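"Test-time search" with a PRM usually means scoring several candidate chains step by step and keeping the best one. A minimal best-of-N sketch, assuming per-step PRM scores are already available (the aggregation choices and function names are illustrative; `min` is a common pick because a chain is only as sound as its weakest step):

```python
def score_chain(step_scores: list[float], aggregate: str = "min") -> float:
    """Collapse per-step PRM scores into one chain-level score."""
    if aggregate == "min":
        return min(step_scores)
    if aggregate == "prod":
        out = 1.0
        for s in step_scores:
            out *= s
        return out
    raise ValueError(f"unknown aggregate: {aggregate}")

def best_of_n(candidates: dict[str, list[float]], aggregate: str = "min") -> str:
    """Return the id of the candidate chain with the highest
    aggregated score. `candidates` maps chain id -> per-step scores
    (hypothetical data; a real PRM would produce these)."""
    return max(candidates, key=lambda c: score_chain(candidates[c], aggregate))
```

This is where dense per-step credit pays off: an ORM would rate only the final answers and could not distinguish a lucky chain with one bad step from a uniformly sound one.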
| Reward Source | Quality | Scalability | Generality | Extra Models | Best For |
|---|---|---|---|---|---|
| Verifiable | Exact | Unlimited | Narrow | 0 | RLVR (GRPO/DAPO) |
| Discriminative RM | Approximate | High | Broad | +1 (RM) | Online RL (PPO/GRPO) |
| Generative RM | Good | Moderate | Broad | +1 (GenRM) | RL + interpretable scoring |
| DPO (implicit) | Approximate | Dataset-limited | Broad | 0 | Offline alignment |
| Process RM | Good | Moderate | Reasoning | +1 (PRM) | Long chains, test-time search |
| LLM-as-Judge | Good | Very high | Broad | 0 (API) | Preference data at scale |
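For the RLVR row, the verifiable reward typically feeds a GRPO-style advantage estimator, which is why no extra models are required: the group itself plays the role of the critic. A minimal sketch of GRPO's group-relative normalization, assuming one group of scalar rewards for responses to the same prompt:

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages in the GRPO style: normalize each
    response's reward by the group's mean and standard deviation.
    No learned value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # eps guards against a zero std when all rewards in the group tie.
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary verifiable reward, correct responses in a group get positive advantages and incorrect ones negative, and the policy gradient pushes probability mass accordingly.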