Interactive exploration: rotation encodes position, dot products encode relative distance
RoPE rotates each 2D pair of a vector by an angle proportional to position. The same vector at different positions points in different directions.
Each dimension pair i gets a different rotation frequency: low pair indices rotate quickly (capturing local patterns); high indices rotate slowly (capturing global patterns).
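A minimal NumPy sketch of this per-pair rotation, assuming the frequency schedule from the original RoPE formulation (θ_i = base^(−2i/d) with base 10000); the function name `rope_rotate` is chosen here for illustration:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2D pair of x by pos * theta_i, where
    theta_i = base**(-2i/d) gives each pair its own frequency."""
    d = x.shape[-1]
    # One frequency per 2D pair: low pair indices -> fast rotation.
    theta = base ** (-np.arange(0, d, 2) / d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin  # standard 2D rotation
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

v = np.ones(8)              # head_dim = 8 -> 4 pairs
r0 = rope_rotate(v, pos=0)  # position 0: all angles are zero
r5 = rope_rotate(v, pos=5)  # same vector, different direction
print(np.allclose(r0, v))            # True: unrotated at position 0
print(np.linalg.norm(r5))            # sqrt(8): rotations preserve norm
```

Because each pair is only rotated, the vector's norm never changes; position is encoded purely in direction.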
Heatmap showing rotation angle (mod 2π) for each dimension pair at each position. High-frequency pairs cycle rapidly; low-frequency pairs change slowly.
A head_dim=8 vector split into 4 pairs. Each pair rotates at its own frequency: watch how position affects each pair differently.
The magic of RoPE: when Q at position m and K at position n are both rotated, their dot product depends only on (m − n), the relative distance.
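This relative-position property can be checked numerically: shifting both positions by the same offset leaves the dot product unchanged. A small sketch under the same assumed frequency schedule (`rope_rotate` is an illustrative helper, not part of any library):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each 2D pair of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    a = pos * theta
    c, s = np.cos(a), np.sin(a)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Dot product at positions (3, 10) vs. (3+7, 10+7): same offset m - n.
d1 = rope_rotate(q, 3) @ rope_rotate(k, 10)
d2 = rope_rotate(q, 3 + 7) @ rope_rotate(k, 10 + 7)
print(np.isclose(d1, d2))   # True: only m - n matters
```

Algebraically this holds because the rotations compose: Rₘᵀ Rₙ = Rₙ₋ₘ, so the absolute positions cancel.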
When Q = K (same content) with equal-norm pairs, the RoPE-rotated dot product is proportional to the average of cos((m − n)θ_i) across frequency components. This creates a natural decay with distance: a soft locality bias emerges without being explicitly programmed.
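The decay can be computed directly. For an all-ones vector (so every pair has equal norm), the self dot product at offset delta reduces to a sum of cosines over the frequency pairs; a sketch under the same assumed base-10000 schedule:

```python
import numpy as np

def rope_self_score(delta, d=64, base=10000.0):
    """Dot product of a RoPE-rotated all-ones vector with itself
    shifted by `delta` positions: sum of cos(delta * theta_i) over
    pairs (each pair of ones contributes norm 2)."""
    theta = base ** (-np.arange(0, d, 2) / d)
    return 2 * np.cos(delta * theta).sum()

scores = [rope_self_score(delta) for delta in range(0, 64, 8)]
# The score peaks at delta = 0 (all cosines equal 1) and falls off
# with distance, though the high-frequency terms make it oscillate.
print(scores[0] == max(scores))   # True
```

The oscillation is why the bias is only "soft": individual frequencies re-align at some offsets, but averaged across many frequencies the score is highest at zero offset.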
The full attention matrix for a sequence of tokens. Brighter = higher attention score. Notice the diagonal dominance: tokens attend most to themselves and their neighbors.