🧢 YARN vs πŸͺ’ RoPE

An interactive guide to understanding Rotary Position Embeddings and how YARN (Yet Another RoPE extensioN) extends them for longer contexts in models like GPT-OSS.

1. Why Do LLMs Need Position Information?

πŸ€” The Problem: Transformers Are Position-Blind

Imagine reading a sentence where all the words are thrown into a bag β€” you lose the meaning! Transformers process all tokens simultaneously, so without position info, "Dog bites man" = "Man bites dog".

πŸ“š
Analogy: A Library Without Shelf Numbers
Imagine a library where books have no shelf numbers. You can see every book, but you don't know which comes first. Position embeddings are like adding shelf numbers so the model knows the order of words (tokens).
Click the tokens to see what happens without position info:
Click a word arrangement above to see the difference.
2. What is RoPE? (Rotary Position Embedding)

πŸͺ’ RoPE = Encoding Position by Rotation

Instead of adding a number to each token, RoPE rotates each token's vector in a multi-dimensional space. The angle of rotation depends on the token's position.

πŸ•°οΈ
Analogy: A Clock with Many Hands
Think of a clock with many hands, each spinning at a different speed:
  • The second hand spins fast (high frequency) β€” it tells you fine-grained time
  • The hour hand spins slow (low frequency) β€” it tells you coarse time
  • Each "hand" is a dimension pair in the embedding
  • The angle each hand points to = the position of the token
rotation_angle(position, i) = position Γ— ΞΈ_i

where ΞΈ_i = base^(βˆ’2i/d)
base = 10000 (typically), d = embedding dimension, i = dimension-pair index
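The rotation above can be written in a few lines of plain Python. This is a minimal sketch for a single vector; real implementations rotate the query and key vectors inside attention, often with a different dimension-pairing convention:

```python
import math

def rope_rotate(x, position, base=10000.0):
    """Rotate each consecutive dimension pair of x by position * theta_i (RoPE)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = base ** (-2 * i / d)       # higher i -> smaller theta -> slower "hand"
        angle = position * theta           # the clock-hand angle at this position
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * x1 - s * x2       # standard 2D rotation of the pair
        out[2 * i + 1] = s * x1 + c * x2
    return out

v = [1.0, 0.0, 1.0, 0.0]
rotated = rope_rotate(v, position=3)       # same vector, different angle per pair
```

Because rotation preserves vector length, RoPE changes only angles; the dot product between two rotated vectors then depends on the difference of their positions, which is why attention sees relative position.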

🎑 Interactive: See RoPE Rotation

Watch how a token's vector gets rotated based on its position. Each pair of dimensions rotates at a different speed.

Dim pair 0 (fastest)
Dim pair 1
Dim pair 2
Dim pair 3 (slowest)
πŸ”‘ Key Insight: Low-numbered dimension pairs rotate fast (high frequency) while high-numbered pairs rotate slow (low frequency). This is like having clocks ticking at different speeds β€” together they uniquely encode every position!
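The clock analogy can be made concrete. This sketch (d = 8 is an illustrative toy size) prints each pair's frequency and its wavelength, i.e. how many positions one full revolution takes:

```python
import math

d, base = 8, 10000.0                      # toy embedding size, for illustration only
for i in range(d // 2):
    theta = base ** (-2 * i / d)          # rotation speed of pair i
    wavelength = 2 * math.pi / theta      # positions per full revolution
    print(f"pair {i}: theta = {theta:.4f}, wavelength ~ {wavelength:.1f} positions")
```

The wavelengths grow geometrically: the fastest pair repeats every ~6 positions, while the slowest pair of this toy setup takes thousands of positions to complete a single turn.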
3. The Context Length Problem

πŸ“ What Happens When You Go Beyond Training Length?

A model trained on 4K tokens has only ever seen rotations within a certain range. If you suddenly feed it 32K tokens, the rotation angles become alien β€” the model has never seen them during training!

πŸ—ΊοΈ
Analogy: A Map That Doesn't Cover New Territory
Imagine you have a detailed map of your city (the 4K trained range). If someone asks you to navigate to a city 100 miles away, your map is useless. You need to either:
  • πŸ“ Interpolation (PI): Squish the new territory onto your existing map β†’ everything gets blurry
  • πŸ—ΊοΈ Extrapolation: Just extend the map β†’ but you've never been there, so it's unreliable
  • 🧢 YARN: Smartly blend both approaches based on what works for each "scale"
Trained range (safe)
Interpolated
Extrapolated (dangerous)
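Position Interpolation, the "squish" option, can be sketched as a one-line rescale of positions back into the trained range before the RoPE angles are computed (the 4K β†’ 16K lengths here are illustrative):

```python
def pi_position(position, train_len=4096, target_len=16384):
    """Position Interpolation: squeeze every position back into the trained range."""
    s = target_len / train_len            # extension factor, 4x here
    return position / s                   # uniform rescale -- everything gets "blurrier"

# A token at position 16000 is treated as if it sat at position 4000.
```

The squish is uniform across all dimensions, which is exactly the blurriness YARN avoids: nearby tokens that used to be a full rotation apart on the fast hands now sit a quarter-turn apart.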
4. How YARN Works β€” The Key Innovation

🧢 YARN = Smart Per-Dimension Scaling

YARN's core insight: not all dimensions are equal! When extending context length, different frequency dimensions need different treatment.

πŸ”΄
High Freq
Don't scale these! They already work fine.
🟑
Medium Freq
Partially scale with smooth ramp.
🟒
Low Freq
Fully scale (interpolate) these.
βš–οΈ
Attn Scale
Fix attention temperature.
πŸ‘† Click a step above to learn more!
Each colored band represents a different group of dimensions in the embedding.

🎨 Interactive: YARN's Dimension Partitioning

See how YARN divides dimensions into 3 regions and applies different scaling to each:

No scaling (high freq) β€” keep original
Partial scaling (medium) β€” smooth ramp
Full scaling (low freq) β€” fully interpolate
For each dimension i:

wavelength_i = 2Ο€ Γ— base^(2i/d)  (= 2Ο€ / ΞΈ_i)

If wavelength_i is much shorter than the trained context L (L / wavelength_i > Ξ²) β†’ NO scaling (high freq)
If wavelength_i approaches or exceeds L (L / wavelength_i < Ξ±) β†’ FULL scaling (low freq, divide ΞΈ_i by s)
Otherwise β†’ Smooth interpolation between the two
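The three-region rule can be sketched as a per-dimension ramp in the style of the YaRN paper's "NTK-by-parts" scheme. The defaults here (Ξ± = 1, Ξ² = 32, train_len = 4096) are illustrative values, not universal constants:

```python
import math

def yarn_theta(i, d, s, train_len=4096, base=10000.0, alpha=1.0, beta=32.0):
    """YaRN-style per-dimension frequency scaling (sketch)."""
    theta = base ** (-2 * i / d)               # original RoPE frequency
    wavelength = 2 * math.pi / theta
    r = train_len / wavelength                 # full revolutions within the trained window
    gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))   # ramp clipped to [0, 1]
    # gamma = 1 (high freq): keep theta as-is; gamma = 0 (low freq): divide by s
    return gamma * theta + (1 - gamma) * theta / s
```

Dimensions whose wavelength fits many times into the trained window (gamma = 1) keep their original frequency; dimensions whose wavelength exceeds the window (gamma = 0) are fully interpolated; the band in between blends the two.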

πŸ“Š Frequency Scaling Comparison: RoPE vs PI vs YARN

This shows what happens to each frequency dimension when extending from 4K to 16K tokens:

Original RoPE
PI (Position Interpolation) β€” scales ALL
YARN β€” smart selective scaling
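The comparison can also be sketched numerically: for a 4K β†’ 16K extension (s = 4), PI divides every frequency by s, while a YaRN-style ramp touches only the slow dimensions (d = 64 and the Ξ± = 1, Ξ² = 32 thresholds are illustrative):

```python
import math

d, base, s, train_len = 64, 10000.0, 4.0, 4096

for i in (0, 16, 31):                        # a high-, mid-, and low-frequency pair
    theta = base ** (-2 * i / d)             # original RoPE frequency
    pi_theta = theta / s                     # PI: every dimension divided by s
    r = train_len * theta / (2 * math.pi)    # full revolutions within the trained window
    gamma = min(1.0, max(0.0, (r - 1.0) / (32.0 - 1.0)))
    yarn_t = gamma * theta + (1 - gamma) * theta / s
    print(f"i={i:2d}  rope={theta:.2e}  pi={pi_theta:.2e}  yarn={yarn_t:.2e}")
```

The fast pair (i = 0) is untouched by YARN but quartered by PI; the slow pair (i = 31) is quartered by both; the middle pair lands somewhere in between.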
5. YARN's Attention Scaling Fix

🌑️ Why Attention Temperature Matters

When you extend context length, attention scores get distributed over more tokens. This dilutes the attention β€” like adding water to paint. YARN fixes this with a temperature correction.

πŸ•
Analogy: Splitting a Pizza
If you trained with 8 people sharing a pizza and now have 32 people, each person gets a tiny slice. YARN adjusts the pizza size (scales attention) so each person still gets a fair portion.
Attention Scale = 0.1 Γ— ln(s) + 1
where s = scale factor (target_length / original_length)
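A quick sketch of the correction. The logarithmic form means even large extensions change the scale only modestly:

```python
import math

def yarn_attn_scale(s):
    """YaRN attention temperature correction: 0.1 * ln(s) + 1."""
    return 0.1 * math.log(s) + 1.0

# s = 4  (4K -> 16K)  -> ~1.14
# s = 32 (4K -> 128K) -> ~1.35
```

Attention logits are multiplied by this factor, slightly sharpening the softmax so the signal is not diluted across the much larger set of tokens.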
6. Full Comparison: RoPE vs YARN
| Feature | RoPE | YARN |
| --- | --- | --- |
| Position encoding method | Rotation in 2D subspaces | Same rotation + smart scaling |
| Context extension | ❌ Poor extrapolation | βœ… Smooth extension |
| Frequency treatment | All dimensions treated the same | 3-band: high / medium / low |
| High-freq dimensions | No special handling | Left unchanged (preserve local patterns) |
| Low-freq dimensions | No special handling | Fully interpolated (extend range) |
| Attention temperature | Fixed | Dynamically scaled |
| Fine-tuning needed | N/A | Minimal (often < 400 steps) |
| Perplexity at long context | Explodes πŸ“ˆ | Stays low πŸ“‰ |
| Short-context quality | βœ… Excellent | βœ… Preserved |

πŸ“ˆ Perplexity vs Context Length (Simulated)

See how perplexity (lower = better) changes as we go beyond the training context length:

RoPE (no extension)
PI (Position Interpolation)
YARN
Training length boundary
7. Test Your Understanding! 🧠

Question 1: What does RoPE use to encode position?

Question 2: What is YARN's key insight?

Question 3: What does YARN do to high-frequency dimensions?

Question 4: Why does YARN adjust attention temperature?

8. Full Playground: Build Your Own YARN Config

βš™οΈ Configure & Visualize

Adjust all parameters and see the combined effect on the frequency spectrum:

πŸŽ‰ Summary

RoPE πŸͺ’ encodes position by rotating vectors β€” elegant but breaks beyond training length.


YARN 🧢 extends RoPE with three clever tricks:

  1. Split dimensions into high/medium/low frequency bands
  2. Scale selectively β€” only modify what needs modification
  3. Fix attention temperature β€” keep attention sharp at longer contexts

The result: extend context 4-32Γ— with minimal fine-tuning and little to no quality loss at the original context length!