🧢 YARN vs πŸͺ’ RoPE

An interactive guide to understanding Rotary Position Embeddings and how YARN (Yet Another RoPE extensioN) extends them for longer contexts in models like GPT-OSS.

1. Why Do LLMs Need Position Information?

πŸ€” The Problem: Transformers Are Position-Blind

Imagine reading a sentence where all the words are thrown into a bag β€” you lose the meaning! Transformers process all tokens simultaneously, so without position info, "Dog bites man" = "Man bites dog".

πŸ“š
Analogy: A Library Without Shelf Numbers
Imagine a library where books have no shelf numbers. You can see every book, but you don't know which comes first. Position embeddings are like adding shelf numbers so the model knows the order of words (tokens).
Click the tokens to see what happens without position info:
Click a word arrangement above to see the difference.
2. What is RoPE? (Rotary Position Embedding)

πŸͺ’ RoPE = Encoding Position by Rotation

Instead of adding a number to each token, RoPE rotates each token's vector in a multi-dimensional space. The angle of rotation depends on the token's position.

πŸ•°οΈ
Analogy: A Clock with Many Hands
Think of a clock with many hands, each spinning at a different speed:
  • The second hand spins fast (high frequency) β€” it tells you fine-grained time
  • The hour hand spins slow (low frequency) β€” it tells you coarse time
  • Each "hand" is a dimension pair in the embedding
  • The angle each hand points to = the position of the token
rotation_angle(position, i) = position Γ— ΞΈ_i

where ΞΈ_i = base^(βˆ’2i/d)
base = 10000 (typically), d = embedding dimension, i = dimension-pair index
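The rotation above can be written in a few lines of plain Python. This is a minimal sketch for a single vector; real implementations rotate the query and key vectors inside attention, often with a different dimension-pairing convention:

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Rotate each consecutive dimension pair of x by position * theta_i (RoPE)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = base ** (-2 * i / d)       # higher i -> smaller theta -> slower "hand"
        angle = position * theta           # the clock-hand angle at this position
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * x1 - s * x2       # standard 2D rotation of the pair
        out[2 * i + 1] = s * x1 + c * x2
    return out

v = [1.0, 0.0, 1.0, 0.0]
rotated = rope_rotate(v, position=3)       # same vector, different angle per pair
```

Because rotation preserves vector length, RoPE changes only angles; the dot product between two rotated vectors then depends on the difference of their positions, which is why attention sees relative position.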

🎑 Interactive: See RoPE Rotation

Watch how a token's vector gets rotated based on its position. Each pair of dimensions rotates at a different speed.

Dim pair 0 (fastest)
Dim pair 1
Dim pair 2
Dim pair 3 (slowest)
πŸ”‘ Key Insight: Low-numbered dimension pairs rotate fast (high frequency) while high-numbered pairs rotate slow (low frequency). This is like having clocks ticking at different speeds β€” together they uniquely encode every position!
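The clock analogy can be made concrete. This sketch (d = 8 is an illustrative toy size) prints each pair's frequency and its wavelength, i.e. how many positions one full revolution takes:

```python
import math

d, base = 8, 10000.0                      # toy embedding size, for illustration only
for i in range(d // 2):
    theta = base ** (-2 * i / d)          # rotation speed of pair i
    wavelength = 2 * math.pi / theta      # positions per full revolution
    print(f"pair {i}: theta = {theta:.4f}, wavelength ~ {wavelength:.1f} positions")
```

The wavelengths grow geometrically: the fastest pair repeats every ~6 positions, while the slowest pair of this toy setup takes thousands of positions to complete a single turn.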
3. The Context Length Problem

πŸ“ What Happens When You Go Beyond Training Length?

A model trained on 4K tokens has only ever seen rotations within a certain range. If you suddenly feed it 32K tokens, the rotation angles become alien β€” the model has never seen them during training!

πŸ—ΊοΈ
Analogy: A Map That Doesn't Cover New Territory
Imagine you have a detailed map of your city (the 4K trained range). If someone asks you to navigate to a city 100 miles away, your map is useless. You need to either:
  • πŸ“ Interpolation (PI): Squish the new territory onto your existing map β†’ everything gets blurry
  • πŸ—ΊοΈ Extrapolation: Just extend the map β†’ but you've never been there, so it's unreliable
  • 🧢 YARN: Smartly blend both approaches based on what works for each "scale"
Trained range (safe)
Interpolated
Extrapolated (dangerous)
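Position Interpolation, the "squish" option, can be sketched as a one-line rescale of positions back into the trained range before the RoPE angles are computed (the 4K β†’ 16K lengths here are illustrative):

```python
def pi_position(position, train_len=4096, target_len=16384):
    """Position Interpolation: squeeze every position back into the trained range."""
    s = target_len / train_len            # extension factor, 4x here
    return position / s                   # uniform rescale -- everything gets "blurrier"

# A token at position 16000 is treated as if it sat at position 4000.
```

The squish is uniform across all dimensions, which is exactly the blurriness YARN avoids: nearby tokens that used to be a full rotation apart on the fast hands now sit a quarter-turn apart.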
4. How YARN Works β€” The Key Innovation

🧢 YARN = Smart Per-Dimension Scaling

YARN's core insight: not all dimensions are equal! When extending context length, different frequency dimensions need different treatment.

πŸ”΄
High Freq
Don't scale these! They already work fine.
🟑
Medium Freq
Partially scale with smooth ramp.
🟒
Low Freq
Fully scale (interpolate) these.
βš–οΈ
Attn Scale
Fix attention temperature.
πŸ‘† Click a step above to learn more!
Each colored band represents a different group of dimensions in the embedding.

🎨 Interactive: YARN's Dimension Partitioning

See how YARN divides dimensions into 3 regions and applies different scaling to each:

No scaling (high freq) β€” keep original
Partial scaling (medium) β€” smooth ramp
Full scaling (low freq) β€” fully interpolate
For each dimension i:

wavelength_i = 2Ο€ Γ— base^(2i/d)  (= 2Ο€ / ΞΈ_i)

If wavelength_i is much shorter than the trained context L (L / wavelength_i > Ξ²) β†’ NO scaling (high freq)
If wavelength_i approaches or exceeds L (L / wavelength_i < Ξ±) β†’ FULL scaling (low freq, divide ΞΈ_i by s)
Otherwise β†’ Smooth interpolation between the two
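The three-region rule can be sketched as a per-dimension ramp in the style of the YaRN paper's "NTK-by-parts" scheme. The defaults here (Ξ± = 1, Ξ² = 32, train_len = 4096) are illustrative values, not universal constants:

```python
import math

def yarn_theta(i, d, s, train_len=4096, base=10000.0, alpha=1.0, beta=32.0):
    """YaRN-style per-dimension frequency scaling (sketch)."""
    theta = base ** (-2 * i / d)               # original RoPE frequency
    wavelength = 2 * math.pi / theta
    r = train_len / wavelength                 # full revolutions within the trained window
    gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))   # ramp clipped to [0, 1]
    # gamma = 1 (high freq): keep theta as-is; gamma = 0 (low freq): divide by s
    return gamma * theta + (1 - gamma) * theta / s
```

Dimensions whose wavelength fits many times into the trained window (gamma = 1) keep their original frequency; dimensions whose wavelength exceeds the window (gamma = 0) are fully interpolated; the band in between blends the two.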

πŸ“Š Frequency Scaling Comparison: RoPE vs PI vs YARN

This shows what happens to each frequency dimension when extending from 4K to 16K tokens:

Original RoPE
PI (Position Interpolation) β€” scales ALL
YARN β€” smart selective scaling
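The comparison can also be sketched numerically: for a 4K β†’ 16K extension (s = 4), PI divides every frequency by s, while a YaRN-style ramp touches only the slow dimensions (d = 64 and the Ξ± = 1, Ξ² = 32 thresholds are illustrative):

```python
import math

d, base, s, train_len = 64, 10000.0, 4.0, 4096

for i in (0, 16, 31):                        # a high-, mid-, and low-frequency pair
    theta = base ** (-2 * i / d)             # original RoPE frequency
    pi_theta = theta / s                     # PI: every dimension divided by s
    r = train_len * theta / (2 * math.pi)    # full revolutions within the trained window
    gamma = min(1.0, max(0.0, (r - 1.0) / (32.0 - 1.0)))
    yarn_t = gamma * theta + (1 - gamma) * theta / s
    print(f"i={i:2d}  rope={theta:.2e}  pi={pi_theta:.2e}  yarn={yarn_t:.2e}")
```

The fast pair (i = 0) is untouched by YARN but quartered by PI; the slow pair (i = 31) is quartered by both; the middle pair lands somewhere in between.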
5. YARN's Attention Scaling Fix

🌑️ Why Attention Temperature Matters

When you extend context length, attention scores get distributed over more tokens. This dilutes the attention β€” like adding water to paint. YARN fixes this with a temperature correction.

πŸ•
Analogy: Splitting a Pizza
If you trained with 8 people sharing a pizza and now have 32 people, each person gets a tiny slice. YARN adjusts the pizza size (scales attention) so each person still gets a fair portion.
Attention Scale = 0.1 Γ— ln(s) + 1
where s = scale factor (target_length / original_length)
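A quick sketch of the correction. The logarithmic form means even large extensions change the scale only modestly:

```python
import math

def yarn_attn_scale(s):
    """YaRN attention temperature correction: 0.1 * ln(s) + 1."""
    return 0.1 * math.log(s) + 1.0

# s = 4  (4K -> 16K)  -> ~1.14
# s = 32 (4K -> 128K) -> ~1.35
```

Attention logits are multiplied by this factor, slightly sharpening the softmax so the signal is not diluted across the much larger set of tokens.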
6. Full Comparison: RoPE vs YARN
| Feature | RoPE | YARN |
| --- | --- | --- |
| Position encoding method | Rotation in 2D subspaces | Same rotation + smart scaling |
| Context extension | ❌ Poor extrapolation | βœ… Smooth extension |
| Frequency treatment | All dimensions treated the same | 3-band: high / medium / low |
| High-freq dimensions | No special handling | Left unchanged (preserve local patterns) |
| Low-freq dimensions | No special handling | Fully interpolated (extend range) |
| Attention temperature | Fixed | Dynamically scaled |
| Fine-tuning needed | N/A | Minimal (often < 400 steps) |
| Perplexity at long context | Explodes πŸ“ˆ | Stays low πŸ“‰ |
| Short-context quality | βœ… Excellent | βœ… Preserved |

πŸ“ˆ Perplexity vs Context Length (Simulated)

See how perplexity (lower = better) changes as we go beyond the training context length:

RoPE (no extension)
PI (Position Interpolation)
YARN
Training length boundary
7. Test Your Understanding! 🧠

Question 1: What does RoPE use to encode position?

Question 2: What is YARN's key insight?

Question 3: What does YARN do to high-frequency dimensions?

Question 4: Why does YARN adjust attention temperature?

8. Full Playground: Build Your Own YARN Config

βš™οΈ Configure & Visualize

Adjust all parameters and see the combined effect on the frequency spectrum:

πŸŽ‰ Summary

RoPE πŸͺ’ encodes position by rotating vectors β€” elegant but breaks beyond training length.


YARN 🧢 extends RoPE with three clever tricks:

  1. Split dimensions into high/medium/low frequency bands
  2. Scale selectively β€” only modify what needs modification
  3. Fix attention temperature β€” keep attention sharp at longer contexts

The result: extend context 4-32Γ— with minimal fine-tuning and little to no quality loss at the original context length!