Interactive exploration of multi-stage pipelines, test-time compute, distillation, and agentic RL
The Canonical Multi-Stage Pipeline
Click each stage to see its details. The pipeline transforms a raw base model into a frontier reasoning system through progressive refinement.
Input
M₀ Base Model
Pre-trained on trillions of tokens
→
Stage 1
SFT
Cross-entropy on curated pairs
→
Stage 2
Reasoning RL
GRPO + verifiable rewards
→
Stage 3
Distillation
Rejection sampling + SFT
→
Stage 4
General Alignment
DPO / PPO + reward model
M₀: Pre-trained Base Model
The starting point. Trained on trillions of tokens with next-token prediction. Knows language, facts, and some latent reasoning ability, but has no instruction-following capability.
Capabilities
Language fluency, factual knowledge
Latent pattern completion
Code understanding (if in corpus)
Limitations
No instruction following
No structured reasoning
No safety awareness
Cannot have a conversation
Stage 1: Supervised Fine-Tuning (SFT)
Teaches the model the format of good behavior: how to follow instructions, use chain-of-thought, structure tool calls.
Data
10K–100K curated (prompt, response) pairs
Instruction following, conversation
Chain-of-thought with <think>...</think>
Tool use templates, code generation
Method
Standard cross-entropy fine-tuning
1–3 epochs, moderate learning rate
Loss only on response tokens
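The response-only loss can be sketched in a few lines. This is a minimal illustration of the masking rule, not any particular framework's API; `sft_loss` and the toy log-probs are made up:

```python
import math

def sft_loss(token_logprobs, response_mask):
    """Token-level cross-entropy averaged over response tokens only.
    Prompt tokens are masked out, so the model is trained to produce
    the response rather than to reproduce the prompt."""
    losses = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(losses) / len(losses)

# Toy sequence: 3 prompt tokens (masked) followed by 2 response tokens.
logps = [math.log(0.9), math.log(0.8), math.log(0.7), math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = sft_loss(logps, mask)  # mean of -log 0.5 and -log 0.25
```

In real trainers the same effect is usually achieved by setting prompt-token labels to an ignore index before computing cross-entropy.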
Output: M₁
Knows the FORMAT of reasoning
Can follow instructions
Structures CoT, uses tool syntax
Why Before RL?
RL needs structure to optimize over
Without SFT, RL has no format to refine
Format ≠ quality: SFT gives format, RL gives quality
Stage 2: Reasoning RL
The core capability builder. Uses GRPO with verifiable rewards on math/code/logic problems to teach the model how to reason correctly.
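GRPO's key trick is replacing a learned value network with group-relative advantages. A minimal sketch, assuming a group of rollouts per prompt with verifiable 0/1 rewards (the function name and values are illustrative):

```python
def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: for a group of G
    rollouts of the same prompt, normalize each rollout's reward by
    the group mean and standard deviation. No value network needed."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    std = std if std > 0 else 1.0  # all-identical group: zero advantage
    return [(r - mean) / std for r in rewards]

# Four rollouts of one math problem with a verifiable 0/1 reward:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts get positive advantage, incorrect ones negative; a group that is all-correct or all-wrong contributes no gradient, which is why task difficulty must sit near the model's frontier.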
Stage 4: General Alignment
The final stage: fine adjustments for helpfulness, safety, and instruction following. A KL penalty prevents the model from forgetting its reasoning capabilities.
Data
Preference pairs from humans / LLM judges
Helpfulness, safety, formatting
Multi-domain coverage
Method
DPO or PPO + reward model
KL penalty anchored to M₃
Low learning rate, short training
Replay of earlier-stage data
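The DPO objective used here can be written out directly for one preference pair. The log-prob values below are made up for illustration:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair. Arguments are sequence
    log-probs of the chosen and rejected responses under the policy
    and the frozen reference model; beta scales the implicit KL
    penalty that keeps the policy anchored to the reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Before any training the policy equals the reference, so the margin
# is zero and the loss starts at log 2:
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

As the policy raises the chosen response's log-prob relative to the reference (and lowers the rejected one's), the margin grows and the loss falls below log 2.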
Why Last?
Reasoning is harder to learn than politeness
Alignment can partially undo reasoning gains
Fine adjustments are easier to preserve
Output: M₄ (Deployed)
Reasons correctly ✓
Follows instructions ✓
Is safe and helpful ✓
Ready for production deployment
Emerging alternative: DeepSeek-V3.2 merges reasoning, agentic tasks, and alignment into a single mixed RL stage, which avoids catastrophic forgetting between stages but requires careful reward balancing.
Why This Ordering?
What happens if we run RL before SFT? Click to see:
This is exactly what R1-Zero did! Without SFT, the base model has no structured output format. RL can discover reasoning (as R1-Zero proved), but the outputs are messy: mixed languages, poor formatting, no instruction following. The model learns to reason but can't communicate its reasoning usably. SFT first gives the model the format; RL then improves the quality.
The DeepSeek-R1 Story
R1-Zero proved that pure RL on a base model can produce emergent reasoning. R1 showed how to make it deployable.
R1-Zero: Early Training (Step ~100)
RL Step ~100 · GRPO on base model · No SFT
Emergent Behaviors
Key observation: In early RL training, the model produces short, unstructured responses. It hasn't yet discovered that thinking longer leads to better rewards.
RL discovers capabilities. Distillation makes them stable, efficient, and transferable to smaller models.
RL (Exploration)
✓ Can discover entirely new capabilities
✓ Pushes the frontier of what's possible
✓ R1-Zero: emergent CoT from scratch
✗ Expensive (millions of GPU-hours)
✗ Noisy (explores broadly)
✗ Risk of reward hacking
✗ Unstable training dynamics
Distillation (Transfer)
✓ 10–100× cheaper than RL
✓ Stable, clean training
✓ Can compress to smaller models
✓ Cleans up noisy RL behaviors
✗ Cannot exceed the teacher
✗ Requires a strong teacher model
✗ Transfers, doesn't discover
The frontier recipe: RL creates the best model; distillation makes it accessible and efficient. They are complementary, not competing.
Three Distillation Methods
Rejection Sampling
Standard
Generate N solutions with the RL model → keep verified-correct ones → SFT on the curated set. Simple and effective.
Specialist Distillation
DeepSeek-V3.2
Train separate specialist models per domain (math, code, agentic), then distill all into a single unified model.
On-Policy Distillation
Emerging
Student generates trajectories; the teacher provides dense token-level supervision on those trajectories. 10–100× cheaper than RL.
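The first method, rejection sampling, can be sketched end to end. `build_sft_set`, the toy teacher, and the checker below are hypothetical stand-ins for the RL-trained teacher model and a programmatic verifier:

```python
import random

def build_sft_set(prompts, generate, verify, n=8, seed=0):
    """Rejection-sampling distillation: draw n candidate solutions per
    prompt from the teacher, keep only the verified-correct ones, and
    return them as (prompt, solution) pairs for SFT."""
    rng = random.Random(seed)
    kept = []
    for p in prompts:
        for _ in range(n):
            sol = generate(p, rng)
            if verify(p, sol):
                kept.append((p, sol))
    return kept

# Toy setup: addition problems, a teacher that is right ~2/3 of the time.
prompts = [(2, 3), (5, 7)]
teacher = lambda p, rng: p[0] + p[1] + rng.choice([0, 0, 1])
checker = lambda p, sol: sol == p[0] + p[1]
data = build_sft_set(prompts, teacher, checker)
```

Every pair in the resulting set is verified-correct, so the student's SFT signal is clean even though the teacher is noisy.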
R1 Distillation Results: Distilled vs. RL-Trained
Distilled models (from R1-671B teacher) vs. RL-trained models of the same size on math benchmarks:
Result: Distillation from a strong teacher often beats RL from scratch at the same model size. The teacher's knowledge is transferred more efficiently than it can be rediscovered.
Test-Time Compute Scaling
Instead of making the model bigger, make it think longer. Trade inference compute for accuracy, scaling cost with problem difficulty.
1. Best-of-N
Parallel
Generate N independent solutions, score with verifier, return the best. Simple, embarrassingly parallel.
2. Long CoT
Sequential
Extended chain-of-thought with self-correction. Generate → evaluate → backtrack → retry. This is what o1/R1 do.
3. Tree Search + PRM
Hybrid
At each step, generate candidates → score with PRM → expand promising branches → prune bad ones.
4. Budget Forcing
Sequential
Explicitly control thinking tokens. Route easy→short, hard→long. Efficient compute allocation.
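Method 3 (tree search with a PRM) reduces to a beam search over partial solutions. Everything here, including the bit-string toy problem standing in for reasoning steps, is illustrative:

```python
def prm_beam_search(root, expand, prm_score, width=2, depth=3):
    """Step-level tree search guided by a process reward model (PRM):
    expand each partial solution into candidate next steps, score
    every candidate with the PRM, keep the top `width`, and repeat.
    `expand` and `prm_score` are stand-ins for the policy's step
    generator and a trained PRM."""
    beam = [root]
    for _ in range(depth):
        candidates = [c for partial in beam for c in expand(partial)]
        candidates.sort(key=prm_score, reverse=True)
        beam = candidates[:width]  # prune everything else
    return beam[0]

# Toy problem: build a 3-bit string; the "PRM" rewards 1-bits.
best = prm_beam_search("", lambda s: [s + "0", s + "1"],
                       lambda s: s.count("1"))
```

The width/depth knobs are exactly where test-time compute is spent: a wider beam explores more alternatives per step at proportionally higher inference cost.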
Best-of-N Simulator
See how generating more samples improves pass rate. Each "sample" is an independent solution attempt.
Samples: N = 1 · Per-sample pass rate: 30%
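Under the simulator's assumptions (independent samples, a verifier that reliably picks out any correct one), the pass rate has a simple closed form:

```python
def best_of_n_pass_rate(p, n):
    """Chance that at least one of n independent samples is correct,
    assuming a perfect verifier selects any correct one: 1 - (1-p)^n."""
    return 1.0 - (1.0 - p) ** n

# With the simulator's 30% per-sample pass rate:
rates = {n: best_of_n_pass_rate(0.30, n) for n in (1, 4, 16, 64)}
```

Returns diminish geometrically: going from 1 to 4 samples helps far more than going from 16 to 64, and an imperfect verifier lowers all of these numbers.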
Infrastructure: Actor → Verifier → Learner
The bottleneck in RL for LLMs is rollout generation. Modern systems decompose training into asynchronous specialized components.
Click to start
Actors
Generate rollouts using vLLM/SGLang. Inference-optimized. G=64 per prompt.
Environments
Code sandboxes, web browsers, APIs. Execute actions, return observations.
Verifiers
Run tests, judges, constraint checks. Assemble reward signals.
Replay Buffer
Store completed trajectories with rewards. Feed batches to learner.
Learner
GRPO/PPO gradient updates. Push new checkpoint to actors.
Task Sampler
Selects problems at frontier difficulty. Mix of success/failure.
0
Rollouts Generated
0
Verified
0
Policy Updates
0%
GPU Utilization
Synchronous vs. Asynchronous RL
Synchronous
Generate → Train → Generate → Train
~50% GPU utilization
Training GPUs idle during generation
Simpler to implement
No staleness issues
Asynchronous
Generate and train simultaneously
~100% GPU utilization
5× faster for agentic tasks
1–2 updates of staleness (acceptable with PPO clipping)
Complex orchestration required
Scale of Production RL (2025–2026)
DeepSeek-R1
1000s H800 GPUs
DeepSeek-V3.2
>10% pre-train cost
GLM-5
100K Ascend chips
Qwen 3.x
GRPO + curriculum
Agentic RL: Multi-Turn Environment Interaction
Extending reasoning RL from single-turn problem solving to multi-step interaction with real environments: web browsing, code execution, tool use.
Agent Loop Architecture
Observe
Environment state
→
Think
Plan next action
→
Act
Tool call / command
→
Environment
Execute & return
→
Reward
Composite score
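The loop above can be sketched directly. `CountEnv` and the counting policy are toy stand-ins for a real tool environment and a reasoning model:

```python
class CountEnv:
    """Stand-in environment: the goal is to count up to 3."""
    def reset(self):
        self.n = 0
        return self.n
    def step(self, action):
        self.n = action
        done = self.n >= 3
        return self.n, (1.0 if done else 0.0), done

def run_agent(env, policy, max_turns=10):
    """Minimal observe-think-act loop. `policy` maps the running
    context to the next action; `env.step` executes it and returns
    (observation, reward, done)."""
    context = [env.reset()]           # observe the initial state
    total_reward = 0.0
    for _ in range(max_turns):
        action = policy(context)      # "think": plan the next action
        obs, reward, done = env.step(action)
        context.append(obs)           # feed the observation back in
        total_reward += reward
        if done:
            break
    return context, total_reward

trajectory, reward = run_agent(CountEnv(), lambda ctx: ctx[-1] + 1)
```

In agentic RL, the whole multi-turn trajectory (not a single completion) is the unit that receives a reward and a policy-gradient update.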
Agent Trajectory
Task: Find the population of Tokyo from a reliable source and convert to millions
Agentic Reward Stack
Agentic RL rarely uses a single scalar reward. It composes multiple reward components:
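A minimal sketch of such a composite reward; the component names and weights below are illustrative, not a standard recipe:

```python
def composite_reward(components, weights):
    """Weighted sum of reward components: e.g. verifier-checked task
    success, format compliance, a per-step cost, a safety penalty."""
    return sum(weights[k] * v for k, v in components.items())

# A trajectory that succeeded in 7 steps with clean formatting:
r = composite_reward(
    {"task_success": 1.0, "format_ok": 1.0, "steps": 7, "unsafe": 0.0},
    {"task_success": 1.0, "format_ok": 0.1, "steps": -0.01, "unsafe": -1.0},
)
```

Balancing these weights is the hard part: too large a step penalty teaches premature answers, too small a safety weight invites reward hacking.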