An interactive guide to understanding Rotary Position Embeddings and how YARN (Yet Another RoPE extensioN) extends them for longer contexts in models like GPT-OSS.
Imagine reading a sentence where all the words are thrown into a bag: you lose the meaning! Transformers process all tokens simultaneously, so without position info, "Dog bites man" = "Man bites dog".
Instead of adding a separate position vector to each token's embedding, RoPE rotates each token's vector in a multi-dimensional space. The angle of rotation depends on the token's position.
Watch how a token's vector gets rotated based on its position. Each pair of dimensions rotates at a different speed.
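The rotation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: `base=10000` is the standard RoPE default, and the function name is ours.

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    """Rotate vector x (even length d) according to a token position.

    Dimensions are grouped into pairs; pair i rotates by angle
    position * base^(-2i/d), so early pairs spin fast, later pairs slowly.
    """
    d = x.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # the two components of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # standard 2D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

v = np.ones(8)
rotated = rope_rotate(v, position=10)
```

Because rotation preserves lengths, the dot product of two rotated vectors depends only on their *relative* positions, which is exactly what makes attention scores position-aware.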
A model trained on 4K tokens has only ever seen rotations within a certain range. If you suddenly feed it 32K tokens, the rotation angles become alien: the model has never seen them during training!
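A quick way to see the problem, using illustrative numbers (a 64-dim head with the standard base of 10000; nothing here comes from a specific model):

```python
# The slowest-rotating pair of a d=64 RoPE head barely moves within 4K tokens,
# so longer positions produce angles the model has never encountered.
base, d = 10000.0, 64
slow_freq = base ** (-62 / d)          # frequency of the last (slowest) pair
max_trained_angle = 4096 * slow_freq   # largest angle ever seen at 4K training
new_angle = 32768 * slow_freq          # angle at position 32K: 8x larger
```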
YARN's core insight: not all dimensions are equal! When extending context length, different frequency dimensions need different treatment.
See how YARN divides dimensions into 3 regions and applies different scaling to each:
This shows what happens to each frequency dimension when extending from 4K to 16K tokens:
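That three-band treatment can be sketched with the "NTK-by-parts" ramp idea from the YARN paper. The `alpha`/`beta` thresholds below (1 and 32 rotations) follow the paper's suggested defaults, but the helper name and exact shape here are our illustration:

```python
import numpy as np

def yarn_scaled_freqs(d=64, base=10000.0, scale=4.0, orig_ctx=4096,
                      alpha=1.0, beta=32.0):
    """Scale each RoPE frequency based on how many full rotations it
    completes within the original context window.

    - more than beta rotations (high freq): keep unchanged (local patterns).
    - fewer than alpha rotations (low freq): divide by `scale` (interpolate).
    - in between: blend linearly between the two.
    """
    freqs = base ** (-np.arange(0, d, 2) / d)
    rotations = orig_ctx * freqs / (2 * np.pi)   # turns completed in training
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return freqs * (ramp + (1.0 - ramp) / scale)

f = yarn_scaled_freqs()
# Fastest pairs come back untouched; slowest pairs come back divided by `scale`.
```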
When you extend context length, attention scores get distributed over more tokens. This dilutes the attention, like adding water to paint. YARN fixes this with a temperature correction.
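The correction itself is tiny. The YARN paper fits sqrt(1/t) = 0.1 · ln(s) + 1, where s is the context-extension factor, and the attention logits are multiplied by 1/t. A sketch (the function name is ours):

```python
import math

def yarn_attention_scale(s):
    """Factor applied to attention logits (q.k / sqrt(d)) for extension factor s.

    sqrt(1/t) = 0.1 * ln(s) + 1, so 1/t = (0.1 * ln(s) + 1) ** 2.
    Slightly above 1 for s > 1: it sharpens the softmax to undo dilution.
    """
    return (0.1 * math.log(s) + 1.0) ** 2

yarn_attention_scale(1.0)  # no extension: factor is exactly 1.0 (vanilla attention)
yarn_attention_scale(8.0)  # 4K -> 32K: modestly sharper logits
```

In practice, implementations often fold this factor into the precomputed RoPE cos/sin tables by scaling them by sqrt(1/t), so the attention code itself stays unchanged.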
| Feature | RoPE | YARN |
|---|---|---|
| Position Encoding Method | Rotation in 2D subspaces | Same rotation + smart scaling |
| Context Extension | ❌ Poor extrapolation | ✅ Smooth extension |
| Frequency Treatment | All dimensions same | 3-band: High/Med/Low |
| High-Freq Dims | No special handling | Left unchanged (preserve local patterns) |
| Low-Freq Dims | No special handling | Fully interpolated (extend range) |
| Attention Temperature | Fixed | Dynamically scaled |
| Fine-Tuning Needed | N/A | Minimal (often < 400 steps) |
| Perplexity at Long Context | Explodes 📈 | Stays low 📉 |
| Short Context Quality | ✅ Excellent | ✅ Preserved |
See how perplexity (lower = better) changes as we go beyond the training context length:
Adjust all parameters and see the combined effect on the frequency spectrum:
RoPE 🪢 encodes position by rotating vectors: elegant, but it breaks beyond the training length.
YARN 🧶 extends RoPE with three clever tricks:
- Leave high-frequency dimensions untouched to preserve local patterns.
- Fully interpolate low-frequency dimensions to extend the positional range.
- Scale the attention temperature to keep scores sharp over more tokens.
The result: extend context 4-32× with minimal fine-tuning and no quality loss!