πŸ–¨οΈ Printing Instructions: Press Ctrl/Cmd + P and select "Save as PDF".
1

From Base Model to Frontier Reasoner

Multi-Stage Pipelines, Test-Time Compute, Distillation, and the Agentic Frontier


Learning Objectives


From Last Time


Part 1: The Multi-Stage Pipeline


Why Not Just Run RL Once?


The Canonical Multi-Stage Pipeline

INPUT: Pre-trained base model M_0

β”Œβ”€ STAGE 1: SFT ──────────────────────────────────────────┐
β”‚ Data: curated (prompt, response) pairs                   β”‚
β”‚   - instruction following, conversation                  β”‚
β”‚   - chain-of-thought with <think>...</think> format      β”‚
β”‚   - tool use, code generation                            β”‚
β”‚ Method: standard cross-entropy fine-tuning               β”‚
β”‚ Scale: 10K–100K examples, 1–3 epochs                     β”‚
β”‚ Output: M_1 (SFT model β€” knows the FORMAT of reasoning)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↓
β”Œβ”€ STAGE 2: REASONING RL ──────────────────────────────────┐
β”‚ Data: math/code/logic problems with verifiable answers   β”‚
β”‚ Method: GRPO/DAPO + verifiable rewards                   β”‚
β”‚   For each problem:                                      β”‚
β”‚     Sample G=16–64 solutions from current policy         β”‚
β”‚     Verify each: correct β†’ r=1, incorrect β†’ r=0          β”‚
β”‚     Group-normalize: Γ‚_i = (r_i - ΞΌ) / Οƒ                β”‚
β”‚     PPO-clip update on token-level ratios                β”‚
β”‚ Scale: 100K+ problems Γ— 1000s of RL steps                β”‚
β”‚ Output: M_2 (reasoning model β€” knows HOW to reason)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↓
β”Œβ”€ STAGE 3: REJECTION SAMPLING + DISTILLATION ────────────┐
β”‚ Use M_2 to generate N solutions per problem              β”‚
β”‚ Keep only verified-correct solutions (optionally top-k)  β”‚
β”‚ Fine-tune M_1 (or M_2) on this curated dataset           β”‚
β”‚ Distills RL gains into clean supervised checkpoint       β”‚
β”‚ Output: M_3 (distilled reasoning model β€” more stable)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↓
β”Œβ”€ STAGE 4: GENERAL ALIGNMENT ─────────────────────────────┐
β”‚ Data: preference pairs (from LLM-as-judge or humans)     β”‚
β”‚ Method: DPO or PPO + reward model                        β”‚
β”‚ Objectives: helpfulness, safety, instruction following   β”‚
β”‚ KL penalty anchored to M_3 to preserve reasoning         β”‚
β”‚ Output: M_4 (final aligned model)                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

OUTPUT: M_4 β€” deployed model
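The Stage 2 reward-to-advantage step can be sketched in a few lines. This is a minimal illustration with hypothetical `sample`/`verify` callables, not DeepSeek's actual implementation: draw G solutions per problem from the current policy, score each with a binary verifier, and group-normalize the rewards into advantages Γ‚_i = (r_i - ΞΌ) / Οƒ.

```python
from typing import Callable, List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Γ‚_i = (r_i - ΞΌ) / Οƒ over one group of G sampled solutions."""
    g = len(rewards)
    mu = sum(rewards) / g
    var = sum((r - mu) ** 2 for r in rewards) / g
    return [(r - mu) / (var ** 0.5 + eps) for r in rewards]

def grpo_advantage_step(sample: Callable[[], str],
                        verify: Callable[[str], bool],
                        group_size: int = 16) -> List[float]:
    """Sample a group, assign verifiable rewards (1/0), normalize."""
    solutions = [sample() for _ in range(group_size)]
    rewards = [1.0 if verify(s) else 0.0 for s in solutions]
    return group_advantages(rewards)
```

Note the degenerate case the normalization exposes: if every sample in a group is correct (or every one is wrong), Οƒ = 0 and all advantages collapse to zero, so the group contributes no gradient. This is why curricula keep problems near the policy's current ability.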

Why This Ordering Matters


Part 2: The DeepSeek-R1 Story


R1-Zero: Emergent Reasoning from Pure RL


R1-Zero β†’ R1: From Scientific Discovery to Deployable Model


Part 3: Distillation as a Core Pipeline Component


Why Distillation Matters in the RL Pipeline


Distillation vs. RL: Complementary, Not Competing


Part 4: Test-Time Compute Scaling


The Core Idea: Think Longer on Harder Problems


Four Methods for Test-Time Scaling


Test-Time Compute Methods Compared

| Method | How It Works | Scaling Type | Strength | Weakness |
|---|---|---|---|---|
| Best-of-N | Generate N solutions, pick best | Parallel | Simple, fully parallelizable | No info sharing between samples |
| Long CoT | Extended reasoning with self-correction | Sequential | Dynamic, adaptive to difficulty | Model must have learned self-correction via RL |
| Tree search + PRM | Branch-and-bound over reasoning steps | Hybrid | Systematic exploration | Needs good PRM, complex orchestration |
| Budget forcing | Control thinking budget per problem | Sequential | Efficient compute allocation | Needs difficulty estimation |

These methods can be combined, and all require RL-trained models that know how to use extra tokens productively.
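The simplest row of the table, Best-of-N, can be sketched directly. This is an illustrative fragment with hypothetical `generate`/`score` callables (not any specific model's API): draw N candidates independently, which can happen fully in parallel, and keep the highest-scoring one. With a binary verifier as the scorer, this reduces to "return any verified-correct sample."

```python
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    """Generate n candidates independently; return the top-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

The table's "no info sharing" weakness is visible here: each `generate()` call is independent, so a near-miss in one sample cannot steer the others, unlike sequential long-CoT self-correction.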

Test-Time Scaling for Agents


Part 5: Infrastructure for RL at Scale


The Bottleneck: Rollout Generation


The Actor–Verifier–Learner Architecture


Scale of Production RL


Part 6: Agentic RL β€” The Newest Frontier


From Reasoning to Agency


Agentic Reward Stacks


Key Differences from Reasoning RL


Agentic RL: Failure Modes and Guardrails


Part 7: The Complete Picture


Interactive Demo: Full Pipeline Visualization


Frontier Model Recipes (2025–2026)

| Model | Reasoning RL | Agentic Training | Distillation | Key Innovation |
|---|---|---|---|---|
| DeepSeek-R1 | GRPO + verifiable rewards | Limited | Rejection sampling β†’ smaller models | R1-Zero showed emergent CoT from pure RL |
| DeepSeek-V3.2 | Mixed RL (GRPO), >10% of pre-train compute | 1,800+ envs, 85K+ tasks, real tools | Specialist distillation into unified model | Single-stage mixed RL; thinking integrated into tool use; IMO/IOI gold |
| GLM-5 | GRPO β†’ Agentic RL β†’ General RL | Sequential stage after reasoning | Cross-stage distillation | Async "slime" infra, 100K chips |
| Qwen 3.x | GRPO + progressive curriculum | Supported | On-policy distillation | Thinking/non-thinking mode switching |
| OpenAI GPT-5.x | RL with verifiable rewards | Native tool calling | Unknown | Pioneered test-time compute at scale |

All use the same core recipe: policy optimization + verifiable rewards + a multi-stage or mixed pipeline. They differ in scale, infrastructure, and whether stages are sequential or unified.

The Unified Framework: From Base Model to Frontier Reasoner


What's Still Open


Summary


All Interactive Demos


Summary


RL Arc: The Complete Map


Supplementary Resources