πŸ–¨οΈ Printing Instructions: Press Ctrl/Cmd + P and select "Save as PDF".
1

From Base Model to Frontier Reasoner

Multi-Stage Pipelines, Test-Time Compute, Distillation, and the Agentic Frontier


Learning Objectives


From Last Time


Part 1: The Multi-Stage Pipeline


Why Not Just Run RL Once?


The Canonical Multi-Stage Pipeline

INPUT: Pre-trained base model M_0

β”Œβ”€ STAGE 1: SFT ──────────────────────────────────────────┐
β”‚ Data: curated (prompt, response) pairs                   β”‚
β”‚   - instruction following, conversation                  β”‚
β”‚   - chain-of-thought with <think>...</think> format      β”‚
β”‚   - tool use, code generation                            β”‚
β”‚ Method: standard cross-entropy fine-tuning               β”‚
β”‚ Scale: 10K–100K examples, 1–3 epochs                     β”‚
β”‚ Output: M_1 (SFT model β€” knows the FORMAT of reasoning)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↓
β”Œβ”€ STAGE 2: REASONING RL ──────────────────────────────────┐
β”‚ Data: math/code/logic problems with verifiable answers   β”‚
β”‚ Method: GRPO/DAPO + verifiable rewards                   β”‚
β”‚   For each problem:                                      β”‚
β”‚     Sample G=16–64 solutions from current policy         β”‚
β”‚     Verify each: correct β†’ r=1, incorrect β†’ r=0          β”‚
β”‚     Group-normalize: Γ‚_i = (r_i - ΞΌ) / Οƒ                β”‚
β”‚     PPO-clip update on token-level ratios                β”‚
β”‚ Scale: 100K+ problems Γ— 1000s of RL steps                β”‚
β”‚ Output: M_2 (reasoning model β€” knows HOW to reason)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↓
β”Œβ”€ STAGE 3: REJECTION SAMPLING + DISTILLATION ────────────┐
β”‚ Use M_2 to generate N solutions per problem              β”‚
β”‚ Keep only verified-correct solutions (optionally top-k)  β”‚
β”‚ Fine-tune M_1 (or M_2) on this curated dataset           β”‚
β”‚ Distills RL gains into clean supervised checkpoint       β”‚
β”‚ Output: M_3 (distilled reasoning model β€” more stable)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           ↓
β”Œβ”€ STAGE 4: GENERAL ALIGNMENT ─────────────────────────────┐
β”‚ Data: preference pairs (from LLM-as-judge or humans)     β”‚
β”‚ Method: DPO or PPO + reward model                        β”‚
β”‚ Objectives: helpfulness, safety, instruction following   β”‚
β”‚ KL penalty anchored to M_3 to preserve reasoning         β”‚
β”‚ Output: M_4 (final aligned model)                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

OUTPUT: M_4 β€” deployed model
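The Stage 2 reward-to-advantage step can be sketched in a few lines. This is a minimal illustration with hypothetical `sample`/`verify` callables, not DeepSeek's actual implementation: draw G solutions per problem from the current policy, score each with a binary verifier, and group-normalize the rewards into advantages Γ‚_i = (r_i - ΞΌ) / Οƒ.

```python
from typing import Callable, List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Γ‚_i = (r_i - ΞΌ) / Οƒ over one group of G sampled solutions."""
    g = len(rewards)
    mu = sum(rewards) / g
    var = sum((r - mu) ** 2 for r in rewards) / g
    return [(r - mu) / (var ** 0.5 + eps) for r in rewards]

def grpo_advantage_step(sample: Callable[[], str],
                        verify: Callable[[str], bool],
                        group_size: int = 16) -> List[float]:
    """Sample a group, assign verifiable rewards (1/0), normalize."""
    solutions = [sample() for _ in range(group_size)]
    rewards = [1.0 if verify(s) else 0.0 for s in solutions]
    return group_advantages(rewards)
```

Note the degenerate case the normalization exposes: if every sample in a group is correct (or every one is wrong), Οƒ = 0 and all advantages collapse to zero, so the group contributes no gradient. This is why curricula keep problems near the policy's current ability.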

Why This Ordering Matters


Part 2: The DeepSeek-R1 Story


R1-Zero: Emergent Reasoning from Pure RL


R1-Zero β†’ R1: From Scientific Discovery to Deployable Model


Part 3: Distillation as a Core Pipeline Component


Why Distillation Matters in the RL Pipeline


Distillation vs. RL: Complementary, Not Competing


Part 4: Test-Time Compute Scaling


The Core Idea: Think Longer on Harder Problems


Four Methods for Test-Time Scaling


Test-Time Compute Methods Compared

| Method | How It Works | Scaling Type | Strength | Weakness |
|---|---|---|---|---|
| Best-of-N | Generate N solutions, pick best | Parallel | Simple, fully parallelizable | No info sharing between samples |
| Long CoT | Extended reasoning with self-correction | Sequential | Dynamic, adaptive to difficulty | Model must have learned self-correction via RL |
| Tree search + PRM | Branch-and-bound over reasoning steps | Hybrid | Systematic exploration | Needs good PRM, complex orchestration |
| Budget forcing | Control thinking budget per problem | Sequential | Efficient compute allocation | Needs difficulty estimation |

These methods can be combined, and all require RL-trained models that know how to use extra tokens productively.
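The simplest row of the table, Best-of-N, can be sketched directly. This is an illustrative fragment with hypothetical `generate`/`score` callables (not any specific model's API): draw N candidates independently, which can happen fully in parallel, and keep the highest-scoring one. With a binary verifier as the scorer, this reduces to "return any verified-correct sample."

```python
from typing import Callable

def best_of_n(generate: Callable[[], str],
              score: Callable[[str], float],
              n: int = 8) -> str:
    """Generate n candidates independently; return the top-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

The table's "no info sharing" weakness is visible here: each `generate()` call is independent, so a near-miss in one sample cannot steer the others, unlike sequential long-CoT self-correction.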

Test-Time Scaling for Agents


Part 5: Infrastructure for RL at Scale


The Bottleneck: Rollout Generation


The Actor–Verifier–Learner Architecture


Scale of Production RL


Part 6: Agentic RL β€” The Newest Frontier


From Reasoning to Agency


Agentic Reward Stacks


Key Differences from Reasoning RL


Agentic RL: Failure Modes and Guardrails


Part 7: The Complete Picture


Interactive Demo: Full Pipeline Visualization


Frontier Model Recipes (2025–2026)

| Model | Reasoning RL | Agentic Training | Distillation | Key Innovation |
|---|---|---|---|---|
| DeepSeek-R1 | GRPO + verifiable rewards | Limited | Rejection sampling β†’ smaller models | R1-Zero showed emergent CoT from pure RL |
| DeepSeek-V3.2 | Mixed RL (GRPO), >10% of pre-train compute | 1,800+ envs, 85K+ tasks, real tools | Specialist distillation into unified model | Single-stage mixed RL; thinking integrated into tool use; IMO/IOI gold |
| GLM-5 | GRPO β†’ Agentic RL β†’ General RL | Sequential stage after reasoning | Cross-stage distillation | Async "slime" infra, 100K chips |
| Qwen 3.x | GRPO + progressive curriculum | Supported | On-policy distillation | Thinking/non-thinking mode switching |
| OpenAI GPT-5.x | RL with verifiable rewards | Native tool calling | Unknown | Pioneered test-time compute at scale |

All use the same core recipe: policy optimization + verifiable rewards + a multi-stage or mixed pipeline. They differ in scale, infrastructure, and whether stages are sequential or unified.

The Unified Framework: From Base Model to Frontier Reasoner


What's Still Open


Summary


All Interactive Demos


Summary


RL Arc: The Complete Map


Supplementary Resources