Rotary Position Embeddings (RoPE)

Interactive exploration β€” rotation encodes position, dot products encode relative distance

πŸ”„ How Rotation Encodes Position

RoPE rotates each 2D pair of a vector by an angle proportional to position. The same vector at different positions points in different directions.

[Interactive demo: sliders set position m and frequency ΞΈ; the plot shows the original vector and the same vector rotated by mΒ·ΞΈ.]
Rotation by angle Ξ± = mΒ·ΞΈ:

x₁' = x₁·cos(Ξ±) βˆ’ xβ‚‚Β·sin(Ξ±)
xβ‚‚' = x₁·sin(Ξ±) + xβ‚‚Β·cos(Ξ±)

Equivalently: (x₁ + jxβ‚‚) Β· e^(jΞ±)
Key property: Rotation preserves vector magnitude β€” only the direction changes. This means position information is encoded without distorting the content representation. The angle mΒ·ΞΈ grows linearly with position.
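A minimal NumPy sketch of this rotation (the example values and variable names here are illustrative, not from the demo):

```python
import numpy as np

def rotate_pair(x1, x2, m, theta):
    """Rotate the 2D pair (x1, x2) by alpha = m * theta."""
    alpha = m * theta
    return (x1 * np.cos(alpha) - x2 * np.sin(alpha),
            x1 * np.sin(alpha) + x2 * np.cos(alpha))

x1, x2 = 0.8, 0.5       # illustrative content values
m, theta = 3, 0.40      # position and per-pair frequency
r1, r2 = rotate_pair(x1, x2, m, theta)

# Magnitude is preserved; only the direction changes.
print(np.hypot(x1, x2), np.hypot(r1, r2))  # both ~0.9434
```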

πŸ“Š Frequency Schedule: ΞΈα΅’ = base^(βˆ’2i/d)

Each dimension pair i gets a different rotation frequency. Low indices rotate fast (local patterns), high indices rotate slow (global patterns).

[Chart: frequency ΞΈα΅’ per pair index i (left axis) and wavelength 2Ο€/ΞΈα΅’ (right axis).]
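The schedule is a one-liner to compute. A sketch, assuming the conventional default base of 10000 from the original RoPE formulation and an illustrative head_dim of 8:

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """theta_i = base**(-2i/head_dim) for each dimension pair i."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

theta = rope_frequencies(8)
print(theta)              # pair 0 is fastest (theta = 1.0); later pairs slow down
print(2 * np.pi / theta)  # wavelengths grow geometrically with pair index
```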

🌊 Rotation Angles Across Positions

Heatmap showing rotation angle (mod 2Ο€) for each dimension pair at each position. High-frequency pairs cycle rapidly; low-frequency pairs change slowly.

[Heatmap: rotation angle mod 2Ο€ for each dimension pair across positions; color scale from 0 to 2Ο€.]
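The heatmap data is just an outer product of positions and frequencies, wrapped to [0, 2Ο€). A sketch, assuming 32 positions and head_dim = 8:

```python
import numpy as np

head_dim, num_pos, base = 8, 32, 10000.0
theta = base ** (-2.0 * np.arange(head_dim // 2) / head_dim)

# angles[m, i] = rotation angle of pair i at position m, wrapped to [0, 2*pi)
angles = np.mod(np.outer(np.arange(num_pos), theta), 2 * np.pi)
print(angles.shape)    # (32, 4)
print(angles[:4, 0])   # pair 0 cycles quickly
print(angles[:4, -1])  # the slowest pair barely moves
```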

🧩 All Dimension Pairs Rotating

A head_dim=8 vector split into 4 pairs. Each pair rotates at its own frequency β€” watch how position affects each pair differently.

[Animation: a position slider drives the four pairs, from Pair 0 (fastest) to Pair 3 (slowest), each rotating at its own frequency.]
Analogy to sinusoidal PE: RoPE's multi-frequency scheme is similar to a clock β€” the "seconds hand" (pair 0) rotates fast for fine-grained local position, while the "hour hand" (last pair) rotates slowly for coarse global position. Together, every position gets a unique combination of angles.
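A reference sketch of applying RoPE to a full head_dim = 8 vector, pairing consecutive (even, odd) dimensions as in the interleaved formulation (note that some implementations instead pair the first half of the vector with the second half):

```python
import numpy as np

def apply_rope(x, m, base=10000.0):
    """Rotate each (even, odd) pair of x by m * theta_i."""
    d = x.shape[-1]
    alpha = m * base ** (-2.0 * np.arange(d // 2) / d)  # one angle per pair
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(alpha) - x2 * np.sin(alpha)
    out[1::2] = x1 * np.sin(alpha) + x2 * np.cos(alpha)
    return out

x = np.random.default_rng(0).standard_normal(8)  # head_dim = 8 -> 4 pairs
print(np.linalg.norm(x), np.linalg.norm(apply_rope(x, m=5)))  # norm unchanged
```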

πŸ“ Relative Position from Dot Products

The magic of RoPE: when Q at position m and K at position n are both rotated, their dot product depends only on (m βˆ’ n) β€” the relative distance.

[Interactive demo: sliders set m and n; the plot shows Q at angle mΒ·ΞΈ and K at angle nΒ·ΞΈ, with readouts for the relative angle (mβˆ’n)Β·ΞΈ and the dot product q_m Β· k_n.]
Try this: Set m=8, n=3 (distance 5). Then try m=12, n=7 (also distance 5). Notice the dot product is exactly the same β€” it depends only on relative distance, not absolute positions!
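The same check in code, reusing the apply_rope sketch from above: rotating Q by mΒ·ΞΈ and K by nΒ·ΞΈ leaves a dot product that depends only on m βˆ’ n.

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative distance (m - n = 5) at two different absolute positions:
print(apply_rope(q, 8) @ apply_rope(k, 3))   # m = 8,  n = 3
print(apply_rope(q, 12) @ apply_rope(k, 7))  # m = 12, n = 7 -> identical value
```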

πŸ“‰ Attention Decay with Distance

Averaged over random vectors used as both query and key, RoPE produces a natural decay in the dot product as relative distance increases: a soft locality bias emerges without being explicitly programmed.

Why decay happens: At large relative distances, the different frequency components go in and out of phase. The high-frequency pairs oscillate rapidly and cancel out on average, while the low-frequency pairs still contribute coherently; as distance grows, fewer and fewer pairs remain in phase. The net effect is a gentle, natural bias toward nearby tokens.
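A sketch of this measurement, again reusing apply_rope from above. Using the same random vector as both query and key means every frequency pair starts in phase at distance 0, so the average dot product peaks there and decays with distance:

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, trials = 64, 1000
distances = np.arange(0, 256, 16)

avg = []
for delta in distances:
    total = 0.0
    for _ in range(trials):
        v = rng.standard_normal(head_dim)
        total += v @ apply_rope(v, delta)  # apply_rope(v, 0) == v
    avg.append(total / trials)

print(np.round(avg, 1))  # peaks near head_dim at distance 0, decays as pairs dephase
```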