INPUT: Pre-trained base model M_0
┌─ STAGE 1: SFT ───────────────────────────────────────────┐
│ Data: curated (prompt, response) pairs                   │
│   - instruction following, conversation                  │
│   - chain-of-thought with <think>...</think> format      │
│   - tool use, code generation                            │
│ Method: standard cross-entropy fine-tuning               │
│ Scale: 10K–100K examples, 1–3 epochs                     │
│ Output: M_1 (SFT model – knows the FORMAT of reasoning)  │
└──────────────────────────────────────────────────────────┘
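The Stage 1 objective can be sketched numerically (a toy illustration, not any framework's API; `token_logprobs` and `prompt_len` are assumed inputs): cross-entropy is averaged over response tokens only, with prompt positions masked out of the loss.

```python
import math

def sft_loss(token_logprobs, prompt_len):
    """Masked cross-entropy for SFT: average negative log-likelihood over
    response tokens only; prompt tokens are excluded from the loss."""
    response_lps = token_logprobs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Toy sequence: 2 prompt tokens (masked) + 3 response tokens.
lps = [math.log(p) for p in (0.9, 0.8, 0.5, 0.25, 0.5)]
loss = sft_loss(lps, prompt_len=2)  # ≈ 0.924 nats per response token
```

Masking the prompt matters: without it the model spends capacity learning to regenerate prompts instead of responses.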
                             ↓
┌─ STAGE 2: REASONING RL ──────────────────────────────────┐
│ Data: math/code/logic problems with verifiable answers   │
│ Method: GRPO/DAPO + verifiable rewards                   │
│ For each problem:                                        │
│   Sample G=16–64 solutions from current policy           │
│   Verify each: correct → r=1, incorrect → r=0            │
│   Group-normalize: Â_i = (r_i - μ) / σ                   │
│   PPO-clip update on token-level ratios                  │
│ Scale: 100K+ problems × 1000s of RL steps                │
│ Output: M_2 (reasoning model – knows HOW to reason)      │
└──────────────────────────────────────────────────────────┘
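The inner loop of Stage 2 can be sketched as follows (a minimal illustration: `verify` stands in for a real math/code checker, and only the reward, advantage, and clip arithmetic is shown, not the full policy update):

```python
import statistics

def verify(answer: str, reference: str) -> float:
    """Binary verifiable reward: 1 if the final answer matches, else 0.
    (Stand-in for a real checker, e.g. exact match on boxed answers or
    running unit tests for generated code.)"""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalize rewards within one problem's G samples:
    Â_i = (r_i - μ) / σ."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Token-level PPO-clip term: min(ρ·Â, clip(ρ, 1-ε, 1+ε)·Â)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)

# One problem, G=4 sampled solutions, reference answer "42".
rewards = [verify(a, "42") for a in ["41", "42", "40", "7"]]
advs = grpo_advantages(rewards)  # only the correct sample gets Â > 0
```

Because the advantage is normalized within the group, a problem where every sample fails (or every sample succeeds) contributes no gradient signal, which is why curricula keep problems near the policy's current ability.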
                             ↓
┌─ STAGE 3: REJECTION SAMPLING + DISTILLATION ─────────────┐
│ Use M_2 to generate N solutions per problem              │
│ Keep only verified-correct solutions (optionally top-k)  │
│ Fine-tune M_1 (or M_2) on this curated dataset           │
│ Distills RL gains into clean supervised checkpoint       │
│ Output: M_3 (distilled reasoning model – more stable)    │
└──────────────────────────────────────────────────────────┘
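Stage 3's filtering step amounts to the following (a sketch; `generate` and `verify` are assumed callables, and real pipelines typically also deduplicate and rank the surviving solutions):

```python
import itertools

def rejection_sample(problems, generate, verify, n=8, top_k=None):
    """Stage 3 data curation: draw n candidate solutions per problem,
    keep only verified-correct ones (optionally capped at top_k), and
    return (prompt, solution) pairs for supervised fine-tuning."""
    dataset = []
    for prompt, reference in problems:
        correct = [s for s in (generate(prompt) for _ in range(n))
                   if verify(s, reference)]
        dataset.extend((prompt, s) for s in correct[:top_k])
    return dataset

# Toy run with a scripted "model" that cycles through four completions.
candidates = itertools.cycle(["42", "41", "42", "7"])
data = rejection_sample([("What is 6*7?", "42")],
                        generate=lambda p: next(candidates),
                        verify=lambda s, ref: s == ref,
                        n=4)
# keeps the two verified-correct "42" completions
```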
                             ↓
┌─ STAGE 4: GENERAL ALIGNMENT ─────────────────────────────┐
│ Data: preference pairs (from LLM-as-judge or humans)     │
│ Method: DPO or PPO + reward model                        │
│ Objectives: helpfulness, safety, instruction following   │
│ KL penalty anchored to M_3 to preserve reasoning         │
│ Output: M_4 (final aligned model)                        │
└──────────────────────────────────────────────────────────┘
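Stage 4's DPO variant has a closed-form per-pair loss; a sketch over summed response log-probs (symbol names are mine; the frozen reference policy is M_3, which is what implements the KL anchor):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
        -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                             - (log pi(y_l) - log pi_ref(y_l))])
    Inputs are summed log-probs of the chosen (y_w) and rejected (y_l)
    responses under the policy and the frozen reference model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# At initialization (policy == reference) the margin is 0 and the loss
# is log 2; it drops as the policy shifts mass toward chosen responses.
init = dpo_loss(-11.0, -11.0, -11.0, -11.0)
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```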
OUTPUT: M_4 → deployed model

| Method | How It Works | Scaling Type | Strength | Weakness |
|---|---|---|---|---|
| Best-of-N | Generate $N$ solutions, pick best | Parallel | Simple, fully parallelizable | No info sharing between samples |
| Long CoT | Extended reasoning with self-correction | Sequential | Dynamic, adaptive to difficulty | Model must have learned self-correction via RL |
| Tree search + PRM | Branch-and-bound over reasoning steps | Hybrid | Systematic exploration | Needs good PRM, complex orchestration |
| Budget forcing | Control thinking budget per problem | Sequential | Efficient compute allocation | Needs difficulty estimation |
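The Best-of-N row in the table reduces to a few lines (a toy sketch; `generate` and `score` are hypothetical stand-ins for sampling from the model and a verifier/reward-model score):

```python
import itertools

def best_of_n(prompt, generate, score, n=8):
    """Best-of-N: draw n candidate solutions (embarrassingly parallel),
    return the highest-scoring one. No information flows between samples."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Scripted toy model and a fixed score table standing in for a reward model.
answers = itertools.cycle(["x = 6", "x = 7", "no idea"])
scores = {"x = 6": 0.2, "x = 7": 0.9, "no idea": 0.0}
best = best_of_n("Solve 6x = 42", lambda p: next(answers), scores.get, n=3)
# → "x = 7"
```

With a binary verifier in place of `score`, this becomes rejection sampling applied at inference time.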

| Model | Reasoning RL | Agentic Training | Distillation | Key Innovation |
|---|---|---|---|---|
| DeepSeek-R1 | GRPO + verifiable rewards | Limited | Rejection sampling → smaller models | R1-Zero showed emergent CoT from pure RL |
| DeepSeek-V3.2 | Mixed RL (GRPO), >10% of pre-train compute | 1,800+ envs, 85K+ tasks, real tools | Specialist distillation into unified model | Single-stage mixed RL; thinking integrated into tool-use; IMO/IOI gold |
| GLM-5 | GRPO → Agentic RL → General RL | Sequential stage after reasoning | Cross-stage distillation | Async 'slime' infra, 100K chips |
| Qwen 3.x | GRPO + progressive curriculum | Supported | On-policy distillation | Thinking/non-thinking mode switching |
| OpenAI GPT-5.x | RL with verifiable rewards | Native tool calling | Unknown | Pioneered test-time compute at scale |