From Base Model to Frontier Reasoner

Interactive exploration of multi-stage pipelines, test-time compute, distillation, and agentic RL

The Canonical Multi-Stage Pipeline
The pipeline transforms a raw base model into a frontier reasoning system through progressive refinement.
Input

M₀ Base Model

Pre-trained on trillions of tokens
→
Stage 1

SFT

Cross-entropy on curated pairs
→
Stage 2

Reasoning RL

GRPO + verifiable rewards
→
Stage 3

Distillation

Rejection sampling + SFT
→
Stage 4

General Alignment

DPO / PPO + reward model

M₀ — Pre-trained Base Model

The starting point. Trained on trillions of tokens with next-token prediction. Knows language, facts, and some latent reasoning ability — but has no instruction-following capability.

Capabilities

  • Language fluency, factual knowledge
  • Latent pattern completion
  • Code understanding (if in corpus)

Limitations

  • No instruction following
  • No structured reasoning
  • No safety awareness
  • Cannot have a conversation

Stage 1 — Supervised Fine-Tuning (SFT)

Teaches the model the format of good behavior: how to follow instructions, use chain-of-thought, structure tool calls.

Data

  • 10K–100K curated (prompt, response) pairs
  • Instruction following, conversation
  • Chain-of-thought with <think>...</think>
  • Tool use templates, code generation

Method

  • Standard cross-entropy fine-tuning
  • 1–3 epochs, moderate learning rate
  • Loss only on response tokens

Output: M₁

  • Knows the FORMAT of reasoning
  • Can follow instructions
  • Structures CoT, uses tool syntax

Why Before RL?

  • RL needs structure to optimize over
  • Without SFT, RL has no format to refine
  • Format ≠ quality — SFT gives format, RL gives quality
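The "loss only on response tokens" rule can be sketched with a toy masked cross-entropy. This is a minimal sketch, assuming illustrative token ids and probabilities; the IGNORE label follows the common convention of marking prompt tokens so they contribute no loss.

```python
import math

# Toy SFT loss: cross-entropy averaged over response tokens only.
# Prompt tokens carry the IGNORE label and contribute nothing.
IGNORE = -100

def sft_loss(token_probs, labels):
    """Mean negative log-likelihood over non-masked positions."""
    terms = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE]
    return sum(terms) / len(terms)

# Three prompt tokens (masked) followed by three response tokens.
labels      = [IGNORE, IGNORE, IGNORE, 42, 7, 99]   # 42/7/99: toy token ids
token_probs = [0.10, 0.20, 0.30, 0.90, 0.80, 0.70]  # model prob of each label

loss = sft_loss(token_probs, labels)
```

Because the first three positions are masked, changing their probabilities leaves the loss unchanged: only the response tokens are supervised.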

Stage 2 — Reasoning RL

The core capability builder. Uses GRPO with verifiable rewards on math/code/logic problems to teach the model how to reason correctly.

Data

  • 100K+ problems with verifiable answers
  • Math, code, logic, science
  • Progressive difficulty curriculum

Method

  • GRPO/DAPO with G=16–64 rollouts
  • Correct → r=1, Incorrect → r=0
  • Group-normalize: Â = (r − μ) / σ
  • PPO-clip on token-level ratios

Scale

  • 1000s of RL steps
  • 1000s of GPUs
  • Days to weeks of training

Output: M₂

  • Knows HOW to reason (not just format)
  • Self-verification, backtracking emerge
  • Longer thinking on harder problems
  • But: noisy policy, formatting issues
L(θ) = -𝔼[Σᵢ min(ρᵢ(θ) · Âᵢ, clip(ρᵢ(θ), 1-ε, 1+ε) · Âᵢ)]
where ρᵢ(θ) = π_θ/π_old is the token-level importance ratio and Âᵢ = (rᵢ - μ_group) / σ_group
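A minimal sketch of this objective for one group of rollouts, collapsing the per-token ratio to a single scalar per rollout for brevity (the helper name `grpo_objective` and all values are illustrative):

```python
import math

# Sketch of the GRPO loss for one group of G rollouts.
# rewards: verifiable reward per rollout; ratios: importance ratio
# rho_i = pi_theta / pi_old (per token in practice, one scalar here).
def grpo_objective(rewards, ratios, eps=0.2):
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    sigma = sigma if sigma > 0 else 1.0          # guard: all-equal rewards
    adv = [(r - mu) / sigma for r in rewards]    # group-normalized advantages
    clipped = [max(min(rho, 1 + eps), 1 - eps) for rho in ratios]
    surr = [min(rho * a, c * a) for rho, a, c in zip(ratios, adv, clipped)]
    return -sum(surr) / len(surr)                # loss = negative objective

# One group: two correct, two incorrect rollouts, on-policy ratios of 1.
loss = grpo_objective([1, 0, 1, 0], [1.0, 1.0, 1.0, 1.0])
```

Note that the advantage needs no value network: the group mean and standard deviation are the baseline, which is what distinguishes GRPO from PPO.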

Stage 3 — Rejection Sampling + Distillation

Cleans up the noisy RL policy by generating many solutions, keeping only the best, and fine-tuning on this curated dataset.

Process

  • Use M₂ to generate N solutions per problem
  • Verify each solution (exact check)
  • Keep only verified-correct solutions
  • Optionally filter for quality/length
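The process above can be sketched as a small sample-then-filter loop. Here `generate` and `verify` are hypothetical stand-ins for sampling from the M₂ policy and for the exact answer check:

```python
import random

# Rejection sampling sketch: draw N candidates, keep verified-correct.
def rejection_sample(problem, generate, verify, n=8):
    candidates = [generate(problem) for _ in range(n)]
    return [c for c in candidates if verify(problem, c)]

# Toy usage: a "solution" is a number, correct iff it equals 2 * problem.
random.seed(0)
kept = rejection_sample(
    6,
    generate=lambda p: p * random.choice([1, 2, 3]),
    verify=lambda p, c: c == 2 * p,
)
```

The kept solutions become SFT data; incorrect candidates are simply discarded.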

Why Distill?

  • RL policies are noisy — they explore broadly
  • Distillation locks in good behaviors
  • Creates stable starting point for alignment
  • Clean supervised checkpoint

Variants

  • Rejection sampling: keep correct, SFT
  • On-policy distillation: student generates, teacher supervises token-by-token
  • Specialist distillation: merge multiple specialist models

Output: M₃

  • Stable, clean reasoning behavior
  • RL gains preserved in SFT-style checkpoint
  • Good base for alignment fine-tuning

Stage 4 — General Alignment

The final stage: fine adjustments for helpfulness, safety, and instruction following. KL penalty prevents forgetting reasoning capabilities.

Data

  • Preference pairs from humans / LLM judges
  • Helpfulness, safety, formatting
  • Multi-domain coverage

Method

  • DPO or PPO + reward model
  • KL penalty anchored to M₃
  • Low learning rate, short training
  • Replay of earlier-stage data
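As one concrete instance of the method, a sketch of the DPO loss on a single preference pair. All log-probabilities and the β value are made-up illustrative numbers; the frozen reference model plays the role of the M₃ anchor:

```python
import math

# DPO loss on one preference pair (sketch).
# pi_* are summed log-probs of the chosen/rejected responses under the
# policy being trained, ref_* under the frozen reference. beta scales
# the implicit KL anchor to the reference.
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen response more than the reference does:
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-12.0, ref_rejected=-12.0)
```

The loss shrinks as the policy's preference margin over the reference grows; at zero margin it equals log 2.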

Why Last?

  • Reasoning is harder to learn than politeness
  • Alignment can partially undo reasoning gains
  • Fine adjustments are easier to preserve

Output: M₄ (Deployed)

  • Reasons correctly ✓
  • Follows instructions ✓
  • Is safe and helpful ✓
  • Ready for production deployment
Emerging alternative: DeepSeek-V3.2 merges reasoning, agentic tasks, and alignment into a single mixed RL stage — avoids catastrophic forgetting between stages but requires careful reward balancing.

💡 Why This Ordering?

What happens if we run RL before SFT?

The DeepSeek-R1 Story
R1-Zero proved that pure RL on a base model can produce emergent reasoning. R1 showed how to make it deployable.
🧠 R1-Zero — Early Training (Step ~100)
RL step ~100 · GRPO on base model · No SFT
📊 Emergent Behaviors
Key observation: In early RL training, the model produces short, unstructured responses. It hasn't yet discovered that thinking longer leads to better rewards.

R1-Zero → R1: The Full Journey

🧪 R1-Zero Discovery

  • ✓ Emergent chain-of-thought reasoning
  • ✓ Self-verification behavior
  • ✓ Backtracking on errors
  • ✓ Scales thinking with difficulty
  • ✗ Unreadable, mixed-language output
  • ✗ No instruction following
  • ✗ Format reward hacking
  • ✗ Length explosion

🚀 R1 Deployed

  • ✓ All R1-Zero reasoning capabilities
  • ✓ Clean, readable output format
  • ✓ Instruction following
  • ✓ Safety awareness
  • ✓ Multi-domain helpfulness
  • ✓ Distilled to smaller models
R1 pipeline: Cold-start SFT → Reasoning RL → Rejection Sampling → Alignment RL
Distillation as a Core Pipeline Component
RL discovers capabilities. Distillation makes them stable, efficient, and transferable to smaller models.

🔬 RL (Exploration)

  • ✓ Can discover entirely new capabilities
  • ✓ Pushes the frontier of what's possible
  • ✓ R1-Zero: emergent CoT from scratch
  • ✗ Expensive (millions of GPU-hours)
  • ✗ Noisy — explores broadly
  • ✗ Risk of reward hacking
  • ✗ Unstable training dynamics

🧬 Distillation (Transfer)

  • ✓ 10–100× cheaper than RL
  • ✓ Stable, clean training
  • ✓ Can compress to smaller models
  • ✓ Cleans up noisy RL behaviors
  • ✗ Cannot exceed the teacher
  • ✗ Requires strong teacher model
  • ✗ Transfers, doesn't discover
The frontier recipe: RL creates the best model → distillation makes it accessible/efficient. They are complementary, not competing.

Three Distillation Methods

Rejection Sampling

Standard

Generate N solutions with RL model → keep verified-correct ones → SFT on curated set. Simple and effective.

Specialist Distillation

DeepSeek-V3.2

Train separate specialist models per domain (math, code, agentic), then distill all into a single unified model.

On-Policy Distillation

Emerging

Student generates trajectories, teacher provides dense token-level supervision on those trajectories. 10–100× cheaper than RL.
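A sketch of that dense supervision signal, assuming toy next-token distributions over a 3-token vocabulary (the helper names and the KL(teacher || student) direction are illustrative choices, not a fixed recipe):

```python
import math

# On-policy distillation sketch: the student samples a trajectory; at
# every sampled position the teacher supplies a full next-token
# distribution, and the loss is the mean per-token KL(teacher || student).
def token_kl(teacher, student):
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

def distill_loss(teacher_dists, student_dists):
    kls = [token_kl(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(kls) / len(kls)

# Two positions: student disagrees on the first, matches on the second.
teacher = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
student = [[0.5, 0.25, 0.25], [0.5, 0.5, 0.0]]
loss = distill_loss(teacher, student)
```

Unlike rejection sampling's one-bit reward per solution, every token here carries a full distribution's worth of signal, which is where the cost advantage over RL comes from.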

R1 Distillation Results: Distilled vs. RL-Trained

Distilled models (from R1-671B teacher) vs. RL-trained models of the same size on math benchmarks:

Result: Distillation from a strong teacher often beats RL from scratch at the same model size. The teacher's knowledge is transferred more efficiently than it can be rediscovered.
Test-Time Compute Scaling
Instead of making the model bigger, make it think longer. Trade inference compute for accuracy, scaling cost with problem difficulty.

1. Best-of-N

Parallel

Generate N independent solutions, score with verifier, return the best. Simple, embarrassingly parallel.

2. Long CoT

Sequential

Extended chain-of-thought with self-correction. Generate → evaluate → backtrack → retry. This is what o1/R1 do.

3. Tree Search + PRM

Hybrid

At each step, generate candidates → score with PRM → expand promising branches → prune bad ones.

4. Budget Forcing

Sequential

Explicitly control thinking tokens. Route easy→short, hard→long. Efficient compute allocation.

Best-of-N Simulator

See how generating more samples improves pass rate. Each "sample" is an independent solution attempt.
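Under the simplifying assumptions of an independent per-attempt success probability p and a perfect verifier that picks any correct sample, the simulator's curve follows pass@N = 1 − (1 − p)^N:

```python
# Best-of-N back-of-envelope: independent attempts, perfect verifier.
def pass_at_n(p, n):
    return 1 - (1 - p) ** n

# e.g. with an illustrative 30% single-attempt success rate:
rates = {n: round(pass_at_n(0.30, n), 3) for n in (1, 4, 16, 64)}
```

Gains are steep at first and flatten as N grows; with an imperfect verifier the curve saturates below 1.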

Infrastructure: Actor–Verifier–Learner
The bottleneck in RL for LLMs is rollout generation. Modern systems decompose training into asynchronous specialized components.
🎭

Actors

Generate rollouts using vLLM/SGLang. Inference-optimized. G=64 per prompt.

🌍

Environments

Code sandboxes, web browsers, APIs. Execute actions, return observations.

✅

Verifiers

Run tests, judges, constraint checks. Assemble reward signals.

📦

Replay Buffer

Store completed trajectories with rewards. Feed batches to learner.

🧠

Learner

GRPO/PPO gradient updates. Push new checkpoint to actors.

🎯

Task Sampler

Selects problems at frontier difficulty. Mix of success/failure.


Synchronous vs. Asynchronous RL

🔄 Synchronous

  • Generate → Train → Generate → Train
  • ~50% GPU utilization
  • Training GPUs idle during generation
  • Simpler to implement
  • No staleness issues

⚡ Asynchronous

  • Generate and train simultaneously
  • ~100% GPU utilization
  • 5× faster for agentic tasks
  • 1–2 updates staleness (OK with PPO clip)
  • Complex orchestration required

Scale of Production RL (2025–2026)

  • DeepSeek-R1 — 1000s of H800 GPUs
  • DeepSeek-V3.2 — >10% pre-train cost
  • GLM-5 — 100K Ascend chips
  • Qwen 3.x — GRPO + curriculum
Agentic RL — Multi-Turn Environment Interaction
Extending reasoning RL from single-turn problem solving to multi-step interaction with real environments: web browsing, code execution, tool use.

Agent Loop Architecture

👁
Observe
Environment state
→
🧠
Think
Plan next action
→
⚡
Act
Tool call / command
→
🌍
Environment
Execute & return
→
🎯
Reward
Composite score

📋 Agent Trajectory

Task: Find the population of Tokyo from a reliable source and convert to millions

Agentic Reward Stack

Agentic RL rarely uses a single scalar reward. It composes multiple reward components:

R(τ) = w_out · r_out + Σ_t w_proc · r_proc(h_t, a_t) − λ_cost · C(τ) − λ_safe · S(τ)
πŸ†

Outcome Reward

Did the task succeed? Binary or graded final assessment.

w = 1.0
📊

Process Reward

Were intermediate steps useful? Valid tool syntax, correct docs retrieved.

w = 0.3
💰

Cost Penalty

Tokens, latency, tool calls, retries. Prevents runaway verbosity.

λ = 0.1
🛑

Safety Penalty

Did the agent violate constraints or attempt unsafe actions?

λ = 0.5
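Putting the four components together, a sketch of the composite reward using the example weights from the cards above (the trajectory fields are illustrative, not a fixed schema):

```python
# Composite agentic reward, following R(τ) above with example weights.
def trajectory_reward(outcome, process_scores, cost, safety_violations,
                      w_out=1.0, w_proc=0.3, lam_cost=0.1, lam_safe=0.5):
    return (w_out * outcome                  # did the task succeed?
            + w_proc * sum(process_scores)   # useful intermediate steps
            - lam_cost * cost                # tokens / tool calls / retries
            - lam_safe * safety_violations)  # constraint violations

# Successful task, two useful steps out of three, modest cost, no violations:
r = trajectory_reward(outcome=1.0, process_scores=[1, 1, 0],
                      cost=2.0, safety_violations=0)
```

Tuning the λ terms is the balancing act mentioned above: too weak a cost penalty invites verbosity loops, too strong a safety penalty suppresses useful exploration.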

Key Differences from Reasoning RL

Episode Length

Reasoning: Single response
Agentic: 50–200+ turns with environment

Reward Timing

Reasoning: Immediate (check answer)
Agentic: Delayed, sparse (end of trajectory)

Rollout Cost

Reasoning: Just token generation
Agentic: Real environment execution

Credit Assignment

Reasoning: Group normalization (GRPO)
Agentic: Per-token GAE (PPO may be better)

Action Space

Reasoning: Tokens → answer
Agentic: Tokens → tool calls → env responses

Token Masking

Reasoning: All tokens are agent's
Agentic: Must mask env outputs from gradients
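The token-masking difference can be sketched as building a per-token loss mask over an interleaved trajectory (roles and tokens here are illustrative):

```python
# Only agent-emitted tokens receive gradient; tokens injected by the
# environment (tool output, observations) are masked out of the loss.
def loss_mask(trajectory):
    """Return 1 for agent tokens, 0 for environment tokens."""
    return [1 if role == "agent" else 0 for role, _token in trajectory]

trajectory = [("agent", "ls"), ("env", "file.txt"), ("agent", "cat file.txt")]
mask = loss_mask(trajectory)
```

Without this mask, the policy would be trained to "predict" environment outputs it never controlled, corrupting the gradient signal.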

⚠️ Failure Modes

🎭

Reward Hacking

Exploit the verifier instead of solving the task — modifying test cases, accessing answer keys, exploiting loopholes.

🔄

Verbosity / Retry Loops

Generate excessive tokens/tool calls because cost penalty is too weak.

🔓

Unsafe Tool Behavior

Agent learns high-reward but prohibited action sequences.

📉

Distribution Fragility

Works on training environments but breaks under small task variations.