From Base Model to Frontier Reasoner

Interactive exploration of multi-stage pipelines, test-time compute, distillation, and agentic RL

The Canonical Multi-Stage Pipeline
The pipeline transforms a raw base model into a frontier reasoning system through progressive refinement.
Input

M₀ Base Model

Pre-trained on trillions of tokens
→
Stage 1

SFT

Cross-entropy on curated pairs
→
Stage 2

Reasoning RL

GRPO + verifiable rewards
→
Stage 3

Distillation

Rejection sampling + SFT
→
Stage 4

General Alignment

DPO / PPO + reward model

M₀ — Pre-trained Base Model

The starting point. Trained on trillions of tokens with next-token prediction. Knows language, facts, and some latent reasoning ability — but has no instruction-following capability.

Capabilities

  • Language fluency, factual knowledge
  • Latent pattern completion
  • Code understanding (if in corpus)

Limitations

  • No instruction following
  • No structured reasoning
  • No safety awareness
  • Cannot have a conversation

Stage 1 — Supervised Fine-Tuning (SFT)

Teaches the model the format of good behavior: how to follow instructions, use chain-of-thought, structure tool calls.

Data

  • 10K–100K curated (prompt, response) pairs
  • Instruction following, conversation
  • Chain-of-thought with <think>...</think>
  • Tool use templates, code generation

Method

  • Standard cross-entropy fine-tuning
  • 1–3 epochs, moderate learning rate
  • Loss only on response tokens

Output: M₁

  • Knows the FORMAT of reasoning
  • Can follow instructions
  • Structures CoT, uses tool syntax

Why Before RL?

  • RL needs structure to optimize over
  • Without SFT, RL has no format to refine
  • Format ≠ quality — SFT gives format, RL gives quality
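The "loss only on response tokens" rule can be sketched with a toy masked cross-entropy. This is a minimal sketch, assuming illustrative token ids and probabilities; the IGNORE label follows the common convention of marking prompt tokens so they contribute no loss.

```python
import math

# Toy SFT loss: cross-entropy averaged over response tokens only.
# Prompt tokens carry the IGNORE label and contribute nothing.
IGNORE = -100

def sft_loss(token_probs, labels):
    """Mean negative log-likelihood over non-masked positions."""
    terms = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE]
    return sum(terms) / len(terms)

# Three prompt tokens (masked) followed by three response tokens.
labels      = [IGNORE, IGNORE, IGNORE, 42, 7, 99]   # 42/7/99: toy token ids
token_probs = [0.10, 0.20, 0.30, 0.90, 0.80, 0.70]  # model prob of each label

loss = sft_loss(token_probs, labels)
```

Because the first three positions are masked, changing their probabilities leaves the loss unchanged: only the response tokens are supervised.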

Stage 2 — Reasoning RL

The core capability builder. Uses GRPO with verifiable rewards on math/code/logic problems to teach the model how to reason correctly.

Data

  • 100K+ problems with verifiable answers
  • Math, code, logic, science
  • Progressive difficulty curriculum

Method

  • GRPO/DAPO with G=16–64 rollouts
  • Correct → r=1, Incorrect → r=0
  • Group-normalize: Â = (r − μ) / σ
  • PPO-clip on token-level ratios

Scale

  • 1000s of RL steps
  • 1000s of GPUs
  • Days to weeks of training

Output: M₂

  • Knows HOW to reason (not just format)
  • Self-verification, backtracking emerge
  • Longer thinking on harder problems
  • But: noisy policy, formatting issues
L(θ) = -𝔼[Σᵢ min(ρᵢ(θ) · Âᵢ, clip(ρᵢ(θ), 1-ε, 1+ε) · Âᵢ)]
where ρᵢ(θ) = π_θ/π_old is the token-level importance ratio and Âᵢ = (rᵢ - μ_group) / σ_group
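A minimal sketch of this objective for one group of rollouts, collapsing the per-token ratio to a single scalar per rollout for brevity (the helper name `grpo_objective` and all values are illustrative):

```python
import math

# Sketch of the GRPO loss for one group of G rollouts.
# rewards: verifiable reward per rollout; ratios: importance ratio
# rho_i = pi_theta / pi_old (per token in practice, one scalar here).
def grpo_objective(rewards, ratios, eps=0.2):
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    sigma = sigma if sigma > 0 else 1.0          # guard: all-equal rewards
    adv = [(r - mu) / sigma for r in rewards]    # group-normalized advantages
    clipped = [max(min(rho, 1 + eps), 1 - eps) for rho in ratios]
    surr = [min(rho * a, c * a) for rho, a, c in zip(ratios, adv, clipped)]
    return -sum(surr) / len(surr)                # loss = negative objective

# One group: two correct, two incorrect rollouts, on-policy ratios of 1.
loss = grpo_objective([1, 0, 1, 0], [1.0, 1.0, 1.0, 1.0])
```

Note that the advantage needs no value network: the group mean and standard deviation are the baseline, which is what distinguishes GRPO from PPO.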

Stage 3 — Rejection Sampling + Distillation

Cleans up the noisy RL policy by generating many solutions, keeping only the best, and fine-tuning on this curated dataset.

Process

  • Use M₂ to generate N solutions per problem
  • Verify each solution (exact check)
  • Keep only verified-correct solutions
  • Optionally filter for quality/length
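The process above can be sketched as a small sample-then-filter loop. Here `generate` and `verify` are hypothetical stand-ins for sampling from the M₂ policy and for the exact answer check:

```python
import random

# Rejection sampling sketch: draw N candidates, keep verified-correct.
def rejection_sample(problem, generate, verify, n=8):
    candidates = [generate(problem) for _ in range(n)]
    return [c for c in candidates if verify(problem, c)]

# Toy usage: a "solution" is a number, correct iff it equals 2 * problem.
random.seed(0)
kept = rejection_sample(
    6,
    generate=lambda p: p * random.choice([1, 2, 3]),
    verify=lambda p, c: c == 2 * p,
)
```

The kept solutions become SFT data; incorrect candidates are simply discarded.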

Why Distill?

  • RL policies are noisy — they explore broadly
  • Distillation locks in good behaviors
  • Creates stable starting point for alignment
  • Clean supervised checkpoint

Variants

  • Rejection sampling: keep correct, SFT
  • On-policy distillation: student generates, teacher supervises token-by-token
  • Specialist distillation: merge multiple specialist models

Output: M₃

  • Stable, clean reasoning behavior
  • RL gains preserved in SFT-style checkpoint
  • Good base for alignment fine-tuning

Stage 4 — General Alignment

The final stage: fine adjustments for helpfulness, safety, and instruction following. KL penalty prevents forgetting reasoning capabilities.

Data

  • Preference pairs from humans / LLM judges
  • Helpfulness, safety, formatting
  • Multi-domain coverage

Method

  • DPO or PPO + reward model
  • KL penalty anchored to M₃
  • Low learning rate, short training
  • Replay of earlier-stage data
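As one concrete instance of the method, a sketch of the DPO loss on a single preference pair. All log-probabilities and the β value are made-up illustrative numbers; the frozen reference model plays the role of the M₃ anchor:

```python
import math

# DPO loss on one preference pair (sketch).
# pi_* are summed log-probs of the chosen/rejected responses under the
# policy being trained, ref_* under the frozen reference. beta scales
# the implicit KL anchor to the reference.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does:
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

The loss shrinks as the policy's preference margin over the reference grows; at zero margin it equals log 2.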

Why Last?

  • Reasoning is harder to learn than politeness
  • Alignment can partially undo reasoning gains
  • Fine adjustments are easier to preserve

Output: M₄ (Deployed)

  • Reasons correctly ✓
  • Follows instructions ✓
  • Is safe and helpful ✓
  • Ready for production deployment
Emerging alternative: DeepSeek-V3.2 merges reasoning, agentic tasks, and alignment into a single mixed RL stage — avoids catastrophic forgetting between stages but requires careful reward balancing.

💡 Why This Ordering?

What happens if we run RL before SFT?

The DeepSeek-R1 Story
R1-Zero proved that pure RL on a base model can produce emergent reasoning. R1 showed how to make it deployable.
🧠 R1-Zero — Early Training (Step ~100)
RL step ~100 · GRPO on base model · No SFT
📊 Emergent Behaviors
Key observation: In early RL training, the model produces short, unstructured responses. It hasn't yet discovered that thinking longer leads to better rewards.

R1-Zero → R1: The Full Journey

🧪 R1-Zero Discovery

  • ✓ Emergent chain-of-thought reasoning
  • ✓ Self-verification behavior
  • ✓ Backtracking on errors
  • ✓ Scales thinking with difficulty
  • ✗ Unreadable, mixed-language output
  • ✗ No instruction following
  • ✗ Format reward hacking
  • ✗ Length explosion

🚀 R1 Deployed

  • ✓ All R1-Zero reasoning capabilities
  • ✓ Clean, readable output format
  • ✓ Instruction following
  • ✓ Safety awareness
  • ✓ Multi-domain helpfulness
  • ✓ Distilled to smaller models
R1 pipeline: Cold-start SFT → Reasoning RL → Rejection Sampling → Alignment RL
Distillation as a Core Pipeline Component
RL discovers capabilities. Distillation makes them stable, efficient, and transferable to smaller models.

🔬 RL (Exploration)

  • ✓ Can discover entirely new capabilities
  • ✓ Pushes the frontier of what's possible
  • ✓ R1-Zero: emergent CoT from scratch
  • ✗ Expensive (millions of GPU-hours)
  • ✗ Noisy — explores broadly
  • ✗ Risk of reward hacking
  • ✗ Unstable training dynamics

🧬 Distillation (Transfer)

  • ✓ 10–100× cheaper than RL
  • ✓ Stable, clean training
  • ✓ Can compress to smaller models
  • ✓ Cleans up noisy RL behaviors
  • ✗ Cannot exceed the teacher
  • ✗ Requires strong teacher model
  • ✗ Transfers, doesn't discover
The frontier recipe: RL creates the best model → distillation makes it accessible/efficient. They are complementary, not competing.

Three Distillation Methods

Rejection Sampling

Standard

Generate N solutions with RL model → keep verified-correct ones → SFT on curated set. Simple and effective.

Specialist Distillation

DeepSeek-V3.2

Train separate specialist models per domain (math, code, agentic), then distill all into a single unified model.

On-Policy Distillation

Emerging

Student generates trajectories, teacher provides dense token-level supervision on those trajectories. 10–100× cheaper than RL.
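A sketch of that dense supervision signal, assuming toy next-token distributions over a 3-token vocabulary (the helper names and the KL(teacher || student) direction are illustrative choices, not a fixed recipe):

```python
import math

# On-policy distillation sketch: the student samples a trajectory; at
# every sampled position the teacher supplies a full next-token
# distribution, and the loss is the mean per-token KL(teacher || student).
def token_kl(teacher, student):
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def distill_loss(teacher_dists, student_dists):
    kls = [token_kl(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(kls) / len(kls)

# Two positions: student disagrees on the first, matches on the second.
teacher = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
student = [[0.5, 0.25, 0.25], [0.5, 0.5, 0.0]]
loss = distill_loss(teacher, student)
```

Unlike rejection sampling's one-bit reward per solution, every token here carries a full distribution's worth of signal, which is where the cost advantage over RL comes from.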

R1 Distillation Results: Distilled vs. RL-Trained

Distilled models (from R1-671B teacher) vs. RL-trained models of the same size on math benchmarks:

Result: Distillation from a strong teacher often beats RL from scratch at the same model size. The teacher's knowledge is transferred more efficiently than it can be rediscovered.
Test-Time Compute Scaling
Instead of making the model bigger, make it think longer. Trade inference compute for accuracy, scaling cost with problem difficulty.

1. Best-of-N

Parallel

Generate N independent solutions, score with verifier, return the best. Simple, embarrassingly parallel.

2. Long CoT

Sequential

Extended chain-of-thought with self-correction. Generate → evaluate → backtrack → retry. This is what o1/R1 do.

3. Tree Search + PRM

Hybrid

At each step, generate candidates → score with PRM → expand promising branches → prune bad ones.

4. Budget Forcing

Sequential

Explicitly control thinking tokens. Route easy→short, hard→long. Efficient compute allocation.

Best-of-N Simulator

See how generating more samples improves pass rate. Each "sample" is an independent solution attempt.
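Under the simplifying assumptions of an independent per-attempt success probability p and a perfect verifier that picks any correct sample, the simulator's curve follows pass@N = 1 − (1 − p)^N:

```python
# Best-of-N back-of-envelope: independent attempts, perfect verifier.
def pass_at_n(p, n):
    return 1 - (1 - p) ** n

# e.g. with an illustrative 30% single-attempt success rate:
rates = {n: round(pass_at_n(0.30, n), 3) for n in (1, 4, 16, 64)}
```

Gains are steep at first and flatten as N grows; with an imperfect verifier the curve saturates below 1.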

Infrastructure: Actor–Verifier–Learner
The bottleneck in RL for LLMs is rollout generation. Modern systems decompose training into asynchronous specialized components.
🎭

Actors

Generate rollouts using vLLM/SGLang. Inference-optimized. G=64 per prompt.

🌍

Environments

Code sandboxes, web browsers, APIs. Execute actions, return observations.

✅

Verifiers

Run tests, judges, constraint checks. Assemble reward signals.

📦

Replay Buffer

Store completed trajectories with rewards. Feed batches to learner.

🧠

Learner

GRPO/PPO gradient updates. Push new checkpoint to actors.

🎯

Task Sampler

Selects problems at frontier difficulty. Mix of success/failure.


Synchronous vs. Asynchronous RL

🔄 Synchronous

  • Generate → Train → Generate → Train
  • ~50% GPU utilization
  • Training GPUs idle during generation
  • Simpler to implement
  • No staleness issues

⚡ Asynchronous

  • Generate and train simultaneously
  • ~100% GPU utilization
  • 5× faster for agentic tasks
  • 1–2 updates staleness (OK with PPO clip)
  • Complex orchestration required

Scale of Production RL (2025–2026)

  • DeepSeek-R1 — 1000s of H800 GPUs
  • DeepSeek-V3.2 — >10% pre-train cost
  • GLM-5 — 100K Ascend chips
  • Qwen 3.x — GRPO + curriculum
Agentic RL — Multi-Turn Environment Interaction
Extending reasoning RL from single-turn problem solving to multi-step interaction with real environments: web browsing, code execution, tool use.

Agent Loop Architecture

👁
Observe
Environment state
→
🧠
Think
Plan next action
→
⚡
Act
Tool call / command
→
🌍
Environment
Execute & return
→
🎯
Reward
Composite score

📋 Agent Trajectory

Task: Find the population of Tokyo from a reliable source and convert to millions

Agentic Reward Stack

Agentic RL rarely uses a single scalar reward. It composes multiple reward components:

R(τ) = w_out · r_out + Σ_t w_proc · r_proc(h_t, a_t) − λ_cost · C(τ) − λ_safe · S(τ)
πŸ†

Outcome Reward

Did the task succeed? Binary or graded final assessment.

w = 1.0
📊

Process Reward

Were intermediate steps useful? Valid tool syntax, correct docs retrieved.

w = 0.3
💰

Cost Penalty

Tokens, latency, tool calls, retries. Prevents runaway verbosity.

λ = 0.1
🛑

Safety Penalty

Did the agent violate constraints or attempt unsafe actions?

λ = 0.5
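Putting the four components together, a sketch of the composite reward using the example weights from the cards above (the trajectory fields are illustrative, not a fixed schema):

```python
# Composite agentic reward, following R(τ) above with example weights.
def trajectory_reward(outcome, process_scores, cost, safety_violations,
                      w_out=1.0, w_proc=0.3, lam_cost=0.1, lam_safe=0.5):
    return (w_out * outcome                  # did the task succeed?
            + w_proc * sum(process_scores)   # useful intermediate steps
            - lam_cost * cost                # tokens / tool calls / retries
            - lam_safe * safety_violations)  # constraint violations

# Successful task, two useful steps out of three, modest cost, no violations:
r = trajectory_reward(outcome=1.0, process_scores=[1, 1, 0],
                      cost=2.0, safety_violations=0)
```

Tuning the λ terms is the balancing act mentioned above: too weak a cost penalty invites verbosity loops, too strong a safety penalty suppresses useful exploration.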

Key Differences from Reasoning RL

Episode Length

Reasoning: Single response
Agentic: 50–200+ turns with environment

Reward Timing

Reasoning: Immediate (check answer)
Agentic: Delayed, sparse (end of trajectory)

Rollout Cost

Reasoning: Just token generation
Agentic: Real environment execution

Credit Assignment

Reasoning: Group normalization (GRPO)
Agentic: Per-token GAE (PPO may be better)

Action Space

Reasoning: Tokens → answer
Agentic: Tokens → tool calls → env responses

Token Masking

Reasoning: All tokens are agent's
Agentic: Must mask env outputs from gradients
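The token-masking difference can be sketched as building a per-token loss mask over an interleaved trajectory (roles and tokens here are illustrative):

```python
# Only agent-emitted tokens receive gradient; tokens injected by the
# environment (tool output, observations) are masked out of the loss.
def loss_mask(trajectory):
    """Return 1 for agent tokens, 0 for environment tokens."""
    return [1 if role == "agent" else 0 for role, _token in trajectory]

trajectory = [("agent", "ls"), ("env", "file.txt"), ("agent", "cat file.txt")]
mask = loss_mask(trajectory)
```

Without this mask, the policy would be trained to "predict" environment outputs it never controlled, corrupting the gradient signal.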

⚠️ Failure Modes

🎭

Reward Hacking

Exploit the verifier instead of solving the task — modifying test cases, accessing answer keys, exploiting loopholes.

🔄

Verbosity / Retry Loops

Generate excessive tokens/tool calls because cost penalty is too weak.

🔓

Unsafe Tool Behavior

Agent learns high-reward but prohibited action sequences.

📉

Distribution Fragility

Works on training environments but breaks under small task variations.