Rotary Position Embeddings (RoPE)

Interactive exploration β€” rotation encodes position, dot products encode relative distance

πŸ”„ How Rotation Encodes Position

RoPE rotates each 2D pair of a vector by an angle proportional to position. The same vector at different positions points in different directions.

[Interactive demo: sliders set position m and frequency ΞΈ; the plot shows the original vector and the same vector rotated by mΒ·ΞΈ.]
Rotation by angle Ξ± = mΒ·ΞΈ:

x₁' = x₁·cos(Ξ±) βˆ’ xβ‚‚Β·sin(Ξ±)
xβ‚‚' = x₁·sin(Ξ±) + xβ‚‚Β·cos(Ξ±)

Equivalently: (x₁ + jxβ‚‚) Β· e^(jΞ±)
Key property: Rotation preserves vector magnitude β€” only the direction changes. This means position information is encoded without distorting the content representation. The angle mΒ·ΞΈ grows linearly with position.
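A minimal NumPy sketch of this rotation (the example values and variable names here are illustrative, not from the demo):

```python
import numpy as np

def rotate_pair(x1, x2, m, theta):
    """Rotate the 2D pair (x1, x2) by alpha = m * theta."""
    alpha = m * theta
    return (x1 * np.cos(alpha) - x2 * np.sin(alpha),
            x1 * np.sin(alpha) + x2 * np.cos(alpha))

x1, x2 = 0.8, 0.5       # illustrative content values
m, theta = 3, 0.40      # position and per-pair frequency
r1, r2 = rotate_pair(x1, x2, m, theta)

# Magnitude is preserved; only the direction changes.
print(np.hypot(x1, x2), np.hypot(r1, r2))  # both ~0.9434
```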

πŸ“Š Frequency Schedule: ΞΈα΅’ = base^(βˆ’2i/d)

Each dimension pair i gets a different rotation frequency. Low indices rotate fast (local patterns), high indices rotate slow (global patterns).

[Chart: frequency ΞΈα΅’ per pair index i (left axis) and wavelength 2Ο€/ΞΈα΅’ (right axis).]
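The schedule is a one-liner to compute. A sketch, assuming the conventional default base of 10000 from the original RoPE formulation and an illustrative head_dim of 8:

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """theta_i = base**(-2i/head_dim) for each dimension pair i."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

theta = rope_frequencies(8)
print(theta)              # pair 0 is fastest (theta = 1.0); later pairs slow down
print(2 * np.pi / theta)  # wavelengths grow geometrically with pair index
```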

🌊 Rotation Angles Across Positions

Heatmap showing rotation angle (mod 2Ο€) for each dimension pair at each position. High-frequency pairs cycle rapidly; low-frequency pairs change slowly.

[Heatmap: rotation angle mod 2Ο€ for each dimension pair across positions; color scale from 0 to 2Ο€.]
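The heatmap data is just an outer product of positions and frequencies, wrapped to [0, 2Ο€). A sketch, assuming 32 positions and head_dim = 8:

```python
import numpy as np

head_dim, num_pos, base = 8, 32, 10000.0
theta = base ** (-2.0 * np.arange(head_dim // 2) / head_dim)

# angles[m, i] = rotation angle of pair i at position m, wrapped to [0, 2*pi)
angles = np.mod(np.outer(np.arange(num_pos), theta), 2 * np.pi)
print(angles.shape)    # (32, 4)
print(angles[:4, 0])   # pair 0 cycles quickly
print(angles[:4, -1])  # the slowest pair barely moves
```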

🧩 All Dimension Pairs Rotating

A head_dim=8 vector split into 4 pairs. Each pair rotates at its own frequency β€” watch how position affects each pair differently.

[Animation: a position slider drives the four pairs, from Pair 0 (fastest) to Pair 3 (slowest), each rotating at its own frequency.]
Analogy to sinusoidal PE: RoPE's multi-frequency scheme is similar to a clock β€” the "seconds hand" (pair 0) rotates fast for fine-grained local position, while the "hour hand" (last pair) rotates slowly for coarse global position. Together, every position gets a unique combination of angles.
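A reference sketch of applying RoPE to a full head_dim = 8 vector, pairing consecutive (even, odd) dimensions as in the interleaved formulation (note that some implementations instead pair the first half of the vector with the second half):

```python
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Rotate each (even, odd) pair of x by m * theta_i."""
    d = x.shape[-1]
    alpha = m * base ** (-2.0 * np.arange(d // 2) / d)  # one angle per pair
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(alpha) - x2 * np.sin(alpha)
    out[1::2] = x1 * np.sin(alpha) + x2 * np.cos(alpha)
    return out

x = np.random.default_rng(0).standard_normal(8)  # head_dim = 8 -> 4 pairs
print(np.linalg.norm(x), np.linalg.norm(apply_rope(x, m=5)))  # norm unchanged
```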

πŸ“ Relative Position from Dot Products

The magic of RoPE: when Q at position m and K at position n are both rotated, their dot product depends only on (m βˆ’ n) β€” the relative distance.

[Interactive demo: sliders set m and n; the plot shows Q at angle mΒ·ΞΈ and K at angle nΒ·ΞΈ, with readouts for the relative angle (mβˆ’n)Β·ΞΈ and the dot product q_m Β· k_n.]
Try this: Set m=8, n=3 (distance 5). Then try m=12, n=7 (also distance 5). Notice the dot product is exactly the same β€” it depends only on relative distance, not absolute positions!
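The same check in code, reusing the apply_rope sketch from above: rotating Q by mΒ·ΞΈ and K by nΒ·ΞΈ leaves a dot product that depends only on m βˆ’ n.

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative distance (m - n = 5) at two different absolute positions:
print(apply_rope(q, 8) @ apply_rope(k, 3))   # m = 8,  n = 3
print(apply_rope(q, 12) @ apply_rope(k, 7))  # m = 12, n = 7 -> identical value
```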

πŸ“‰ Attention Decay with Distance

Averaged over random vectors used as both query and key, RoPE produces a natural decay in the dot product as relative distance increases: a soft locality bias emerges without being explicitly programmed.

Why decay happens: At large relative distances, the different frequency components go in and out of phase. The high-frequency pairs oscillate rapidly and cancel out on average, while the low-frequency pairs still contribute coherently; as distance grows, fewer and fewer pairs remain in phase. The net effect is a gentle, natural bias toward nearby tokens.
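A sketch of this measurement, again reusing apply_rope from above. Using the same random vector as both query and key means every frequency pair starts in phase at distance 0, so the average dot product peaks there and decays with distance:

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, trials = 64, 1000
distances = np.arange(0, 256, 16)

avg = []
for delta in distances:
    total = 0.0
    for _ in range(trials):
        v = rng.standard_normal(head_dim)
        total += v @ apply_rope(v, delta)  # apply_rope(v, 0) == v
    avg.append(total / trials)

print(np.round(avg, 1))  # peaks near head_dim at distance 0, decays as pairs dephase
```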