Position information is not inherent to the transformer architecture. Self-attention is permutation-equivariant: shuffling the input tokens and applying the same shuffle to the output gives the same result. Without positional encoding, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the." Every positional encoding method solves this problem, but RoPE (Rotary Position Embedding, Su et al., 2021) solves it with a mathematically elegant property: the attention score between two tokens depends only on their relative position, not their absolute positions. This property is encoded directly into the query and key representations through complex-number rotations.
This post derives RoPE from scratch. Every step is explicit. The derivation starts from the desired property (relative position dependence), constructs the solution using complex numbers, proves it satisfies the requirement, and analyzes the consequences for different frequency bands. No steps are skipped.
The Problem: Relative Position Dependence
1.1 What We Want
Let $q_m$ be the query vector at position $m$ and $k_n$ be the key vector at position $n$. The attention score between them is:

$$\text{score}(m, n) = q_m \cdot k_n$$

We want a positional encoding function $f$ that transforms queries and keys such that:

$$\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)$$

for some function $g$. The dot product between the encoded query and key depends on the content vectors $q$ and $k$ and the relative position $m - n$, but not on $m$ or $n$ individually.
1.2 Why Relative Position Matters
Absolute positional encodings (like the sinusoidal encodings from Vaswani et al., 2017 or learned position embeddings from GPT-2) add a position-dependent vector to the input:

$$x_m' = x_m + p_m, \qquad x_n' = x_n + p_n$$

The attention score (ignoring the projection matrices for clarity) becomes:

$$x_m' \cdot x_n' = x_m \cdot x_n + x_m \cdot p_n + p_m \cdot x_n + p_m \cdot p_n$$

The last three terms contain absolute position information ($p_m$, $p_n$), not just relative ($m - n$). This means the model must independently learn that position 100 attending to position 95 is the same relationship as position 200 attending to position 195. With RoPE, this is guaranteed by construction.
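To make the cross terms concrete, here is a small self-contained sketch (illustrative values; a simplified additive encoding with no projection matrices, so the numbers are not from any real model):

```python
import math
import random

def sinusoidal(pos, d):
    """Standard sin/cos absolute position encoding (Vaswani et al., 2017)."""
    enc = []
    for i in range(0, d, 2):
        freq = 1.0 / 10000 ** (i / d)
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
d = 64
q = [random.gauss(0, 1) for _ in range(d)]  # content vectors
k = [random.gauss(0, 1) for _ in range(d)]

def abs_pe_score(m, n):
    # Attention score with additive absolute encodings: (q + p_m) . (k + p_n)
    qm = [a + b for a, b in zip(q, sinusoidal(m, d))]
    kn = [a + b for a, b in zip(k, sinusoidal(n, d))]
    return dot(qm, kn)

# Same relative offset (5), different absolute positions: the scores differ,
# because the cross terms depend on m and n individually.
print(abs_pe_score(100, 95), abs_pe_score(200, 195))
```

The two scores differ even though the relative offset is identical; with RoPE they are equal by construction.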
1.3 Why Not Just Use Relative Position Bias
ALiBi (Press et al., 2022) and T5-style relative position bias add a bias term directly to the attention scores:

$$\text{score}(m, n) = q_m \cdot k_n + b(m - n)$$

where $b$ is a learned or fixed function of the relative position. This works but has a limitation: the bias is content-independent. The same position bias is added regardless of what $q_m$ and $k_n$ represent. RoPE encodes position into the representations themselves, allowing the content and position to interact through the dot product.
The Complex Number Framework
2.1 Pairing Dimensions
RoPE operates on pairs of dimensions. For a $d$-dimensional vector, we group the dimensions into $d/2$ pairs: $(x_0, x_1), (x_2, x_3), \ldots, (x_{d-2}, x_{d-1})$.

Each pair is treated as a complex number:

$$z_i = x_{2i} + j\, x_{2i+1}, \qquad i = 0, 1, \ldots, d/2 - 1$$

where $j$ is the imaginary unit (using $j$ instead of $i$ to avoid confusion with the index).

A $d$-dimensional real vector becomes a $d/2$-dimensional complex vector:

$$(x_0, x_1, \ldots, x_{d-1}) \mapsto (z_0, z_1, \ldots, z_{d/2-1})$$
2.2 Rotation in the Complex Plane
Multiplying a complex number $z$ by $e^{j\theta}$ rotates it by angle $\theta$ in the complex plane:

$$z\, e^{j\theta} = |z|\, e^{j(\arg z + \theta)}$$

In matrix form, this rotation is:

$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

The key property of rotation: it preserves the magnitude of the complex number: $|z\, e^{j\theta}| = |z|$.
2.3 The Dot Product Under Rotation
For two complex numbers $z = a_1 + j b_1$ and $w = a_2 + j b_2$, the real part of $z \bar{w}$ (where $\bar{w}$ is the complex conjugate of $w$) gives the standard dot product of the corresponding 2D real vectors:

$$\operatorname{Re}(z \bar{w}) = \operatorname{Re}\big((a_1 + j b_1)(a_2 - j b_2)\big) = a_1 a_2 + b_1 b_2$$

This is exactly the dot product of $(a_1, b_1)$ and $(a_2, b_2)$.

For the full $d$-dimensional dot product:

$$x \cdot y = \sum_{i=0}^{d/2-1} \operatorname{Re}(z_i \bar{w}_i)$$

where $z_i$ and $w_i$ are the complex pair representations of $x$ and $y$.
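This identity is easy to check numerically with Python's built-in complex type (the sample values are arbitrary):

```python
import random

random.seed(1)
a1, b1, a2, b2 = [random.gauss(0, 1) for _ in range(4)]

z = complex(a1, b1)
w = complex(a2, b2)

# Re(z * conj(w)) should equal the 2D dot product a1*a2 + b1*b2
lhs = (z * w.conjugate()).real
rhs = a1 * a2 + b1 * b2
print(abs(lhs - rhs))  # ~0 (float rounding only)
```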
Constructing RoPE
3.1 The Rotation
RoPE rotates each dimension pair by an angle proportional to the position $m$:

$$f(z_i, m) = z_i\, e^{j m \theta_i}$$

where $\theta_i$ is a dimension-specific frequency. The rotation angle for dimension pair $i$ at position $m$ is $m \theta_i$.

In the real-valued representation, this means applying a 2x2 rotation matrix to each dimension pair:

$$\begin{pmatrix} x_{2i}' \\ x_{2i+1}' \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

For the full $d$-dimensional vector, RoPE applies $d/2$ independent 2D rotations, one per dimension pair. In block-diagonal matrix form:

$$R_m = \begin{pmatrix} R_{m,0} & & \\ & \ddots & \\ & & R_{m,\,d/2-1} \end{pmatrix}$$

where each block is:

$$R_{m,i} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$

The encoded query is $\hat{q} = R_m q$ and the encoded key is $\hat{k} = R_n k$.
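The 2x2 rotation matrix and the complex multiplication are the same operation; a minimal sketch comparing the two on one dimension pair (example values are arbitrary):

```python
import cmath
import math

theta, m = 0.1, 7    # frequency and position (arbitrary example values)
x1, x2 = 0.8, -0.5   # one dimension pair
angle = m * theta

# Matrix form: 2x2 rotation applied to (x1, x2)
r1 = x1 * math.cos(angle) - x2 * math.sin(angle)
r2 = x1 * math.sin(angle) + x2 * math.cos(angle)

# Complex form: multiply z = x1 + j*x2 by e^(j*angle)
z = complex(x1, x2) * cmath.exp(1j * angle)

print(abs(z.real - r1), abs(z.imag - r2))  # both ~0
```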
3.2 Defining the Frequencies
The frequency for dimension pair $i$ is:

$$\theta_i = \text{base}^{-2i/d}, \qquad i = 0, 1, \ldots, d/2 - 1$$

where base is typically 10000. Explicitly:

$$\theta_i = \frac{1}{\text{base}^{2i/d}}$$
For $d = 128$ (the head dimension in Llama 3):
- $\theta_0 = 10000^{0} = 1$ radian per position
- $\theta_{16} = 10000^{-32/128} = 0.1$
- $\theta_{32} = 10000^{-64/128} = 0.01$
- $\theta_{48} = 10000^{-96/128} = 0.001$
- $\theta_{63} = 10000^{-126/128} \approx 1.15 \times 10^{-4}$
[Figure: RoPE frequencies by dimension pair, theta x 1000 scale (d=128, base=10000)]
The Proof: Relative Position Dependence
4.1 Statement
We prove that:

$$\langle R_m q, R_n k \rangle = g(q, k, m - n)$$

That is, the dot product depends on $m$ and $n$ only through their difference $m - n$.
4.2 Proof for a Single Dimension Pair
Consider dimension pair $i$. The encoded query and key complex representations are:

$$\hat{q}_i = q_i\, e^{j m \theta_i}, \qquad \hat{k}_i = k_i\, e^{j n \theta_i}$$

The contribution to the dot product from this dimension pair is:

$$\operatorname{Re}\big(\hat{q}_i\, \overline{\hat{k}_i}\big)$$

Expand:

$$\operatorname{Re}\big(q_i e^{j m \theta_i}\, \overline{k_i e^{j n \theta_i}}\big)$$

The complex conjugate distributes over multiplication:

$$\overline{k_i e^{j n \theta_i}} = \bar{k}_i\, e^{-j n \theta_i}$$

Combine the exponentials:

$$\operatorname{Re}\big(q_i \bar{k}_i\, e^{j (m - n) \theta_i}\big)$$

This depends on $m$ and $n$ only through $m - n$. The proof is complete for a single dimension pair.
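The single-pair identity can be checked numerically (sample values are arbitrary; `pair_score` is an illustrative helper, not a library function):

```python
import cmath

theta = 0.3
q = complex(1.2, -0.7)  # q_i, one complex dimension pair of the query
k = complex(0.4, 0.9)   # k_i, the matching pair of the key

def pair_score(m, n):
    """Re(q_rot * conj(k_rot)) for positions m and n."""
    q_rot = q * cmath.exp(1j * m * theta)
    k_rot = k * cmath.exp(1j * n * theta)
    return (q_rot * k_rot.conjugate()).real

# Same relative offset (m - n = 5) at different absolute positions
print(pair_score(10, 5), pair_score(100, 95))  # equal up to rounding
```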
4.3 Proof for the Full Dot Product
The full dot product is the sum over all dimension pairs:

$$\langle R_m q, R_n k \rangle = \sum_{i=0}^{d/2-1} \operatorname{Re}\big(q_i \bar{k}_i\, e^{j (m - n) \theta_i}\big)$$

Each term depends on $m - n$, so the sum depends on $m - n$. Defining $\Delta = m - n$:

$$g(q, k, \Delta) = \sum_{i=0}^{d/2-1} \operatorname{Re}\big(q_i \bar{k}_i\, e^{j \Delta \theta_i}\big)$$

This can be written equivalently as:

$$\langle R_m q, R_n k \rangle = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$$

where the last step uses $R_m^\top R_n = R_{n-m}$ (rotation matrices compose by adding angles: $R_a^\top R_b = R_{b-a}$).

The identity $R_m^\top R_n = R_{n-m}$ holds because rotation by angle $n\theta_i$ followed by rotation by $-m\theta_i$ (the transpose of a rotation is its inverse) equals rotation by $(n-m)\theta_i$. In the complex formulation: $e^{-j m \theta_i} e^{j n \theta_i} = e^{j (n - m) \theta_i}$. This single algebraic property is the entire reason RoPE works.
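The composition identity $R_a^\top R_b = R_{b-a}$ can be checked on one 2x2 block with plain Python (no dependencies; the helper names are illustrative):

```python
import math

def rot(angle):
    """2x2 rotation matrix for the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

theta = 0.2
m, n = 9, 4
lhs = matmul(transpose(rot(m * theta)), rot(n * theta))  # R_m^T R_n
rhs = rot((n - m) * theta)                               # R_{n-m}
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(err)  # ~0 (float rounding only)
```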
4.4 Expanded Real-Valued Form
For a single dimension pair with query pair $(q_1, q_2)$ and key pair $(k_1, k_2)$ at relative position $\Delta = m - n$:

$$\text{score}_i = \operatorname{Re}\big((q_1 + j q_2)(k_1 - j k_2)\, e^{j \Delta \theta_i}\big)$$

Let $\varphi = \Delta \theta_i$. Call this $A \cos\varphi + B \sin\varphi$, where:

$$A = q_1 k_1 + q_2 k_2, \qquad B = q_1 k_2 - q_2 k_1$$

Then:

$$\text{score}_i = (q_1 k_1 + q_2 k_2) \cos(\Delta \theta_i) + (q_1 k_2 - q_2 k_1) \sin(\Delta \theta_i)$$

This reveals how content and position interact: the attention score is a weighted combination of the content dot product and cross product, with weights determined by the relative position through $\cos(\Delta \theta_i)$ and $\sin(\Delta \theta_i)$.

When $\Delta = 0$ (same position): $\cos\varphi = 1$, $\sin\varphi = 0$, so the score is just $q_1 k_1 + q_2 k_2$, the pure content dot product. As $\Delta$ increases, the $\sin$ term introduces a position-dependent rotation of the score.
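A quick numerical check that the expanded real form matches the complex computation (sample values are arbitrary):

```python
import cmath
import math

q1, q2, k1, k2 = 0.3, -1.1, 0.7, 0.2
theta, delta = 0.05, 12
phi = delta * theta

# Complex form: Re((q1 + j*q2) * conj(k1 + j*k2) * e^(j*phi))
complex_score = (complex(q1, q2) * complex(k1, k2).conjugate()
                 * cmath.exp(1j * phi)).real

# Expanded real form: A*cos(phi) + B*sin(phi)
A = q1 * k1 + q2 * k2  # content dot product
B = q1 * k2 - q2 * k1  # content cross product
real_score = A * math.cos(phi) + B * math.sin(phi)

print(abs(complex_score - real_score))  # ~0
```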
Frequency Bands and Their Interpretation
5.1 Wavelengths
Each frequency $\theta_i$ corresponds to a wavelength $\lambda_i$ (the number of positions for a full rotation):

$$\lambda_i = \frac{2\pi}{\theta_i}$$

For $d = 128$ and base = 10000:
- $\lambda_0 = 2\pi \approx 6.3$ positions (rotates very fast)
- $\lambda_{16} = 2\pi / 0.1 \approx 62.8$ positions
- $\lambda_{32} = 2\pi / 0.01 \approx 628$ positions
- $\lambda_{48} = 2\pi / 0.001 \approx 6{,}283$ positions
- $\lambda_{63} \approx 54{,}410$ positions (rotates very slowly)
RoPE Wavelengths by Dimension Pair (d=128, base=10000)
| Dim Pair (i) | Frequency (theta) | Wavelength (positions) | Encodes |
|---|---|---|---|
| 0 (fastest) | 1.0 rad/pos | 6.3 | Sub-word adjacency |
| 8 | 0.316 rad/pos | 19.9 | Short phrases |
| 16 | 0.1 rad/pos | 62.8 | Sentence-level |
| 32 | 0.01 rad/pos | 628 | Paragraph-level |
| 48 | 0.001 rad/pos | 6,283 | Section-level |
| 63 (slowest) | 1.15e-4 rad/pos | 54,410 | Document-level |
5.2 The Multi-Scale Representation
The geometric spacing of frequencies creates a multi-scale representation of position:
- Fast dimensions ($i$ near 0): Rotate quickly. They cycle through many full rotations within a few hundred positions. These dimensions encode fine-grained local position information: is this token 1, 2, or 3 positions away?
- Slow dimensions ($i$ near $d/2$): Rotate slowly. They barely change over thousands of positions. These dimensions encode coarse global position information: is this token in the first half or second half of the document?
This is directly analogous to the binary representation of integers. The least significant bit (analogous to the fastest dimension) flips every number. The most significant bit (analogous to the slowest dimension) flips after half the range. Together, all bits uniquely encode every integer.
5.3 Attention Score Decay with Distance
For a single dimension pair $i$, the attention contribution as a function of relative distance $\Delta$ is:

$$a_i(\Delta) = A_i \cos(\Delta \theta_i) + B_i \sin(\Delta \theta_i)$$

where $A_i$ is the content dot product and $B_i$ the content cross product of pair $i$ (the $A$ and $B$ of the expanded form above). Each such term is a single cosine in disguise: $a_i(\Delta) = C_i \cos(\Delta \theta_i - \phi_i)$ with amplitude $C_i = \sqrt{A_i^2 + B_i^2}$ and phase $\phi_i = \operatorname{atan2}(B_i, A_i)$.

The full attention score is:

$$\langle R_m q, R_n k \rangle = \sum_{i=0}^{d/2-1} C_i \cos(\Delta \theta_i - \phi_i)$$
This is a sum of cosines at different frequencies. For random and vectors, the amplitudes are roughly equal across dimensions, and the phases are random. The sum of many cosines with random phases but geometric frequencies produces a function that:
- Peaks near $\Delta = 0$ (the cosines are maximally aligned at small $\Delta$)
- Decays as $|\Delta|$ increases (the cosines become misaligned)
- Has a characteristic decay length determined by the average frequency
This creates a natural locality bias: nearby tokens have higher attention scores than distant ones, purely from the geometry of the rotation encoding. The model does not need to learn this bias; it emerges from the encoding.
The Base Frequency Parameter
6.1 What Base Controls
The base parameter (default 10000) determines the range of frequencies:
- Smallest frequency (slowest rotation): $\theta_{d/2-1} = \text{base}^{-(d-2)/d} \approx 1/\text{base}$
- Largest frequency (fastest rotation): $\theta_0 = 1$
- Longest wavelength: $\lambda_{\max} = 2\pi / \theta_{d/2-1} \approx 2\pi \cdot \text{base}$
- Shortest wavelength: $\lambda_{\min} = 2\pi \approx 6.28$ positions
The base determines the maximum context length at which position can still be resolved. Beyond $\lambda_{\max}$ positions, the slowest-rotating dimension has completed a full rotation and the positional encoding starts to alias (positions $m$ and $m + \lambda_{\max}$ have the same encoding in the slowest dimension).
For base = 10000 and $d = 128$: $\lambda_{\max} = 2\pi \cdot 10000^{126/128} \approx 54{,}410$, close to the rule-of-thumb $2\pi \cdot \text{base} \approx 63\text{K}$. This is sufficient for contexts up to roughly 54K tokens before the slowest dimension aliases.
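The aliasing effect can be illustrated directly on the slowest dimension pair (d=128, base=10000; the positions are chosen arbitrarily):

```python
import cmath
import math

d, base = 128, 10000
theta_slow = base ** (-2 * 63 / d)         # slowest frequency, ~1.15e-4
wavelength = 2 * math.pi / theta_slow      # ~54,410 positions

m = 1000
# Rotation factors at position m and at m + one full wavelength (rounded)
r1 = cmath.exp(1j * m * theta_slow)
r2 = cmath.exp(1j * (m + round(wavelength)) * theta_slow)

# Nearly identical: the slowest dimension cannot tell these positions apart
print(wavelength, abs(r1 - r2))
```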
6.2 Increasing the Base for Longer Contexts
To support longer contexts, increase the base. This slows down all rotations, extending the range before aliasing occurs (the figures below use the $\lambda_{\max} \approx 2\pi \cdot \text{base}$ approximation):
- base = 10000: max context 63K (Llama 2)
- base = 500000: max context 3.14M (Llama 3 uses 500000)
- base = 1000000: max context 6.28M (some models use this)
The tradeoff: a larger base means all frequencies are lower, which reduces the position resolution at short distances. The fastest dimension still has $\theta_0 = 1$ (this does not change with base), but the intermediate dimensions all rotate more slowly. This compresses the frequency spectrum toward zero, potentially degrading fine-grained position discrimination.
In practice, the resolution loss from increasing the base is small because:
- The fastest dimensions (small $i$) are nearly unchanged
- The model has many dimension pairs ($d/2 = 64$ for Llama 3) to encode position
- The geometric spacing ensures good coverage even when compressed
import math

def compute_rope_frequencies(d, base=10000):
    """Compute RoPE rotation frequencies for each dimension pair."""
    freqs = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        wavelength = 2 * math.pi / theta
        freqs.append((i, theta, wavelength))
    return freqs

# Compare base=10000 vs base=500000
print("Base = 10000:")
for i, theta, wl in compute_rope_frequencies(128, base=10000):
    if i in [0, 16, 32, 48, 63]:
        print(f"  dim pair {i:2d}: theta={theta:.6f}, "
              f"wavelength={wl:.3g} positions")
print("\nBase = 500000 (Llama 3):")
for i, theta, wl in compute_rope_frequencies(128, base=500000):
    if i in [0, 16, 32, 48, 63]:
        print(f"  dim pair {i:2d}: theta={theta:.6f}, "
              f"wavelength={wl:.3g} positions")
Output:
Base = 10000:
  dim pair  0: theta=1.000000, wavelength=6.28 positions
  dim pair 16: theta=0.100000, wavelength=62.8 positions
  dim pair 32: theta=0.010000, wavelength=628 positions
  dim pair 48: theta=0.001000, wavelength=6.28e+03 positions
  dim pair 63: theta=0.000115, wavelength=5.44e+04 positions
Base = 500000 (Llama 3):
  dim pair  0: theta=1.000000, wavelength=6.28 positions
  dim pair 16: theta=0.037606, wavelength=167 positions
  dim pair 32: theta=0.001414, wavelength=4.44e+03 positions
  dim pair 48: theta=0.000053, wavelength=1.18e+05 positions
  dim pair 63: theta=0.000002, wavelength=2.56e+06 positions
6.3 NTK-Aware Scaling
NTK-aware scaling, popularized in the open-source long-context community as a refinement of position interpolation (Chen et al., 2023), increases the base when the context exceeds the training length; CodeLlama similarly extends context by raising the base:

$$\text{base}' = \text{base} \cdot \left( \frac{\alpha \, L_{\text{target}}}{L_{\text{train}}} \right)^{d/(d-2)}$$

where $\alpha$ is a scaling factor (typically 1-4), $L_{\text{target}}$ is the target context length, and $L_{\text{train}}$ is the training context length.
The idea: rather than simply interpolating positions (which compresses all frequencies equally), NTK-aware scaling increases the base, which primarily extends the low-frequency components. High-frequency components (which encode local patterns) are largely unchanged. This preserves local position resolution while extending global range.
def ntk_aware_rope_frequencies(d, base=10000, target_len=131072,
train_len=8192, alpha=2.0):
"""NTK-aware RoPE frequency scaling for context extension."""
scale = alpha * target_len / train_len
new_base = base * scale ** (d / (d - 2))
freqs = []
for i in range(d // 2):
theta = new_base ** (-2.0 * i / d)
wavelength = 2 * math.pi / theta
freqs.append((i, theta, wavelength))
print(f"Original base: {base}")
print(f"Scale factor: {scale:.2f}")
print(f"New base: {new_base:.0f}")
return freqs
freqs = ntk_aware_rope_frequencies(
d=128, base=10000, target_len=131072, train_len=8192, alpha=2.0
)
RoPE vs Other Position Encodings
7.1 Comparison
Position Encoding Methods Compared
| Method | Relative Position | Extrapolation | Parameters | Used In |
|---|---|---|---|---|
| Sinusoidal (Vaswani) | No (absolute) | Poor | 0 | Original Transformer |
| Learned absolute | No (absolute) | None | O(L*d) | GPT-2, BERT |
| T5 relative bias | Yes | Moderate | O(n_heads * L) | T5, Flan-T5 |
| ALiBi | Yes (linear decay) | Good | 0 | BLOOM, MPT |
| RoPE | Yes (rotation) | Good (with base scaling) | 0 | Llama, Mistral, Qwen, Gemma |
7.2 RoPEβs Advantages
- Zero additional parameters: RoPE is a deterministic function of position and dimension. No learned parameters.
- Relative position by construction: Proven above. Not an approximation.
- Flexible context extension: Changing the base extends the context without retraining (with some fine-tuning to adapt).
- Efficient computation: Applied element-wise with precomputed sin/cos tables. No matrix multiplications.
- Compatibility with KV-cache: RoPE is applied to Q and K before caching. Cached K vectors already have position encoded. No need to recompute position encoding when extending the KV-cache.
7.3 RoPEβs Limitations
- Even dimension requirement: Requires the head dimension $d$ to be even (dimension pairing). All modern models satisfy this.
- No absolute position: RoPE encodes only relative position. Tasks that truly need absolute position (e.g., βwhat is the 5th word?β) must learn it from the relative patterns.
- Base frequency tuning: The base parameter must be set appropriately for the target context length. Too small and the encoding aliases. Too large and local resolution degrades.
Implementation: Complete RoPE in PyTorch
8.1 Precomputing the Rotation Table
import torch
import math
def precompute_freqs_cis(dim, max_seq_len, base=10000.0):
"""
Precompute the complex exponentials for RoPE.
Returns a tensor of shape (max_seq_len, dim//2) containing
complex numbers e^(j * m * theta_i) for each position m
and dimension pair i.
"""
# Frequencies for each dimension pair
freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
# freqs shape: (dim//2,)
# Position indices
t = torch.arange(max_seq_len, dtype=torch.float32)
# t shape: (max_seq_len,)
# Outer product: angle for each (position, dimension pair)
angles = torch.outer(t, freqs)
# angles shape: (max_seq_len, dim//2)
# Complex exponentials
freqs_cis = torch.polar(torch.ones_like(angles), angles)
# freqs_cis shape: (max_seq_len, dim//2), dtype=complex64
# Each entry is e^(j * m * theta_i)
return freqs_cis
8.2 Applying RoPE to Queries and Keys
def apply_rotary_emb(xq, xk, freqs_cis):
"""
Apply rotary position embeddings to query and key tensors.
Args:
xq: (B, S, H, D) query tensor
xk: (B, S, H, D) key tensor
freqs_cis: (S, D//2) complex rotation factors
Returns:
xq_out: (B, S, H, D) rotated queries
xk_out: (B, S, H, D) rotated keys
"""
# Reshape to pairs of dimensions and view as complex
# (B, S, H, D) -> (B, S, H, D//2, 2) -> complex (B, S, H, D//2)
xq_complex = torch.view_as_complex(
xq.float().reshape(*xq.shape[:-1], -1, 2)
)
xk_complex = torch.view_as_complex(
xk.float().reshape(*xk.shape[:-1], -1, 2)
)
# Reshape freqs_cis for broadcasting: (S, D//2) -> (1, S, 1, D//2)
freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)
# Apply rotation: multiply by complex exponential
xq_rotated = xq_complex * freqs_cis
xk_rotated = xk_complex * freqs_cis
# Convert back to real: (B, S, H, D//2) complex -> (B, S, H, D)
xq_out = torch.view_as_real(xq_rotated).flatten(-2)
xk_out = torch.view_as_real(xk_rotated).flatten(-2)
return xq_out.type_as(xq), xk_out.type_as(xk)
8.3 Alternative: Sin/Cos Implementation (No Complex Numbers)
Some frameworks avoid complex number support. Here is the equivalent using sin and cos directly:
def precompute_rope_cache(dim, max_seq_len, base=10000.0):
"""
Precompute cos and sin tables for RoPE.
Returns:
cos_cache: (max_seq_len, dim//2) cosine values
sin_cache: (max_seq_len, dim//2) sine values
"""
freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
t = torch.arange(max_seq_len, dtype=torch.float32)
angles = torch.outer(t, freqs) # (max_seq_len, dim//2)
cos_cache = torch.cos(angles)
sin_cache = torch.sin(angles)
return cos_cache, sin_cache
def apply_rope_real(x, cos_cache, sin_cache):
"""
Apply RoPE using real-valued sin/cos rotation.
Args:
x: (B, S, H, D) input tensor (query or key)
cos_cache: (S, D//2) precomputed cosines
sin_cache: (S, D//2) precomputed sines
The rotation for each pair (x1, x2):
x1' = x1 * cos - x2 * sin
x2' = x1 * sin + x2 * cos
"""
B, S, H, D = x.shape
half_d = D // 2
# Split into even and odd dimensions
x1 = x[..., :half_d] # (B, S, H, D//2) - first of each pair
x2 = x[..., half_d:] # (B, S, H, D//2) - second of each pair
# Reshape cos/sin for broadcasting: (S, D//2) -> (1, S, 1, D//2)
cos_vals = cos_cache[:S].unsqueeze(0).unsqueeze(2)
sin_vals = sin_cache[:S].unsqueeze(0).unsqueeze(2)
# Apply 2D rotation to each pair
x1_rot = x1 * cos_vals - x2 * sin_vals
x2_rot = x1 * sin_vals + x2 * cos_vals
# Concatenate back
return torch.cat([x1_rot, x2_rot], dim=-1)
There are two conventions for which dimensions form pairs. The original RoPE paper pairs adjacent dimensions: $(x_0, x_1), (x_2, x_3), \ldots$. Some implementations (including the Llama reference) pair the first half with the second half: $(x_0, x_{d/2}), (x_1, x_{d/2+1}), \ldots$. The math is identical; only the permutation of dimensions differs. The sin/cos implementation above uses the first-half/second-half convention. The complex implementation pairs adjacent dimensions. Make sure your implementation matches the model checkpoint you are loading.
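The equivalence of the two conventions up to a permutation can be sketched in plain Python (the function names are illustrative; a d=8 toy vector):

```python
import math

def rope_interleaved(x, pos, base=10000.0):
    """Adjacent-pair convention: pair i lives at indices (2i, 2i+1)."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s
        out[2 * i + 1] = x1 * s + x2 * c
    return out

def rope_half_split(x, pos, base=10000.0):
    """Half-split convention: pair i lives at indices (i, i + d/2)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out

def to_half_split(x):
    """The fixed permutation relating the two layouts: evens, then odds."""
    return x[0::2] + x[1::2]

x = [0.1 * (i + 1) for i in range(8)]
a = rope_interleaved(x, pos=5)
b = rope_half_split(to_half_split(x), pos=5)
# Same numbers, different layout
print(max(abs(u - v) for u, v in zip(to_half_split(a), b)))  # ~0
```

Loading a half-split checkpoint into an interleaved implementation (or vice versa) without this permutation silently corrupts the position encoding.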
8.4 Complete Attention with RoPE
import torch.nn as nn

class RoPEAttention(nn.Module):
"""Multi-head attention with Rotary Position Embedding."""
def __init__(self, d_model, n_heads, max_seq_len=8192, base=10000.0):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.W_q = nn.Linear(d_model, d_model, bias=False)
self.W_k = nn.Linear(d_model, d_model, bias=False)
self.W_v = nn.Linear(d_model, d_model, bias=False)
self.W_o = nn.Linear(d_model, d_model, bias=False)
# Precompute RoPE frequencies
self.register_buffer(
'freqs_cis',
precompute_freqs_cis(self.d_k, max_seq_len, base),
persistent=False
)
def forward(self, x, start_pos=0, mask=None):
"""
Args:
x: (B, S, D) input
start_pos: starting position for KV-cache scenarios
mask: optional attention mask
"""
B, S, D = x.shape
H = self.n_heads
dk = self.d_k
# Project to Q, K, V
q = self.W_q(x).view(B, S, H, dk)
k = self.W_k(x).view(B, S, H, dk)
v = self.W_v(x).view(B, S, H, dk)
# Apply RoPE to Q and K (not V)
freqs = self.freqs_cis[start_pos:start_pos + S]
q, k = apply_rotary_emb(q, k, freqs)
# Standard attention computation
q = q.transpose(1, 2) # (B, H, S, dk)
k = k.transpose(1, 2)
v = v.transpose(1, 2)
scale = 1.0 / math.sqrt(dk)
scores = torch.matmul(q, k.transpose(-2, -1)) * scale
if mask is not None:
scores = scores + mask # mask contains -inf for blocked positions
weights = torch.softmax(scores, dim=-1)
output = torch.matmul(weights, v)
output = output.transpose(1, 2).contiguous().view(B, S, D)
return self.W_o(output)
# Test: verify relative position property
torch.manual_seed(42)
d_model = 256
n_heads = 4
d_k = d_model // n_heads
# Create random q and k vectors
q = torch.randn(1, 1, n_heads, d_k)
k = torch.randn(1, 1, n_heads, d_k)
freqs = precompute_freqs_cis(d_k, 1024, base=10000.0)
# Compute dot product at positions (m=100, n=90) -> delta=10
q_100 = apply_rotary_emb(q, k, freqs[100:101])[0]
k_90 = apply_rotary_emb(q, k, freqs[90:91])[1]
score_100_90 = (q_100 * k_90).sum()  # scalar: sum over heads and dims
# Compute dot product at positions (m=500, n=490) -> delta=10
q_500 = apply_rotary_emb(q, k, freqs[500:501])[0]
k_490 = apply_rotary_emb(q, k, freqs[490:491])[1]
score_500_490 = (q_500 * k_490).sum()  # scalar: sum over heads and dims
# These should be identical (same relative position)
print(f"Score at (100, 90): {score_100_90.item():.6f}")
print(f"Score at (500, 490): {score_500_490.item():.6f}")
print(f"Difference: {abs(score_100_90.item() - score_500_490.item()):.2e}")
# Difference should be ~0 (floating point only)
8.5 RoPE with KV-Cache for Inference
During autoregressive inference, we cache the K and V vectors. Since RoPE is applied to K before caching, the cached keys already have their position encoded. When generating the token at position $t$, we only need to compute RoPE for the new query at position $t$ and the new key at position $t$:
class RoPEAttentionWithCache(nn.Module):
"""RoPE attention with KV-cache for efficient inference."""
def __init__(self, d_model, n_heads, max_seq_len=8192, base=10000.0):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.W_q = nn.Linear(d_model, d_model, bias=False)
self.W_k = nn.Linear(d_model, d_model, bias=False)
self.W_v = nn.Linear(d_model, d_model, bias=False)
self.W_o = nn.Linear(d_model, d_model, bias=False)
self.register_buffer(
'freqs_cis',
precompute_freqs_cis(self.d_k, max_seq_len, base),
persistent=False
)
# KV cache (initialized on first call)
self.k_cache = None
self.v_cache = None
def forward(self, x, start_pos):
"""
Args:
x: (B, S, D) input. S=full_seq during prefill, S=1 during decode.
start_pos: position of first token in x.
"""
B, S, D = x.shape
H = self.n_heads
dk = self.d_k
q = self.W_q(x).view(B, S, H, dk)
k = self.W_k(x).view(B, S, H, dk)
v = self.W_v(x).view(B, S, H, dk)
# Apply RoPE to new Q and K
freqs = self.freqs_cis[start_pos:start_pos + S]
q, k = apply_rotary_emb(q, k, freqs)
# Update KV cache
if self.k_cache is None:
self.k_cache = k
self.v_cache = v
else:
self.k_cache = torch.cat([self.k_cache, k], dim=1)
self.v_cache = torch.cat([self.v_cache, v], dim=1)
# Attention: new queries attend to all cached keys
q = q.transpose(1, 2) # (B, H, S, dk)
k_all = self.k_cache.transpose(1, 2) # (B, H, T, dk)
v_all = self.v_cache.transpose(1, 2) # (B, H, T, dk)
scale = 1.0 / math.sqrt(dk)
scores = torch.matmul(q, k_all.transpose(-2, -1)) * scale
# Causal mask: new tokens can attend to all previous + self
T = k_all.shape[2]
causal_mask = torch.triu(
torch.full((S, T), float('-inf'), device=x.device),
diagonal=T - S + 1
)
scores = scores + causal_mask
weights = torch.softmax(scores, dim=-1)
output = torch.matmul(weights, v_all)
output = output.transpose(1, 2).contiguous().view(B, S, D)
return self.W_o(output)
def reset_cache(self):
self.k_cache = None
self.v_cache = None
Numerical Verification
9.1 Verifying the Relative Position Property
def verify_rope_relative_property(d=64, n_tests=1000, base=10000.0):
"""
Verify that RoPE attention scores depend only on relative position.
For random q, k and positions (m1, n1), (m2, n2) where
m1 - n1 = m2 - n2, the attention scores should be identical.
"""
freqs = precompute_freqs_cis(d, 10000, base)
max_error = 0.0
for _ in range(n_tests):
q = torch.randn(d)
k = torch.randn(d)
# Random positions with same relative distance
delta = torch.randint(0, 100, (1,)).item()
m1 = torch.randint(0, 5000, (1,)).item()
n1 = m1 - delta
m2 = torch.randint(0, 5000, (1,)).item()
n2 = m2 - delta
if n1 < 0 or n2 < 0:
continue
# Apply RoPE
q_r = q.reshape(1, 1, 1, d)
k_r = k.reshape(1, 1, 1, d)
q1, _ = apply_rotary_emb(q_r, k_r, freqs[m1:m1+1])
_, k1 = apply_rotary_emb(q_r, k_r, freqs[n1:n1+1])
score1 = (q1 * k1).sum().item()
q2, _ = apply_rotary_emb(q_r, k_r, freqs[m2:m2+1])
_, k2 = apply_rotary_emb(q_r, k_r, freqs[n2:n2+1])
score2 = (q2 * k2).sum().item()
error = abs(score1 - score2)
max_error = max(max_error, error)
print(f"Max error over {n_tests} tests: {max_error:.2e}")
print(f"Property verified: {max_error < 1e-4}")
verify_rope_relative_property()
# Expected: Max error ~1e-6 to 1e-5 (floating point only)
9.2 Verifying Attention Score Decay
def measure_attention_decay(d=128, base=10000.0, max_delta=2000):
"""
Measure how RoPE attention scores decay with relative distance.
Uses random q and k, averaged over many samples.
"""
freqs = precompute_freqs_cis(d, max_delta + 100, base)
n_samples = 500
scores_by_delta = {}
for delta in [0, 1, 2, 5, 10, 50, 100, 500, 1000, 2000]:
total_score = 0.0
for _ in range(n_samples):
q = torch.randn(d)
k = torch.randn(d)
q_r = q.reshape(1, 1, 1, d)
k_r = k.reshape(1, 1, 1, d)
m = delta + 50
n = 50
q_rot, _ = apply_rotary_emb(q_r, k_r, freqs[m:m+1])
_, k_rot = apply_rotary_emb(q_r, k_r, freqs[n:n+1])
score = (q_rot * k_rot).sum().item()
total_score += abs(score)
avg_score = total_score / n_samples
scores_by_delta[delta] = avg_score
print(f"delta={delta:5d}: avg |score| = {avg_score:.4f}")
measure_attention_decay()
Summary
RoPE encodes position by rotating query and key vectors in the complex plane, with rotation angles proportional to position and frequencies that vary geometrically across dimensions. The construction guarantees that attention scores depend only on relative position β this is proven, not approximated. The multi-scale frequency structure creates dimension pairs ranging from sub-word locality (fast rotation) to document-level structure (slow rotation). The base frequency parameter controls the maximum context length, and can be scaled to extend beyond training length.
The complete derivation chain:
- Requirement: $\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)$
- Representation: pair dimensions as complex numbers $z_i = x_{2i} + j\, x_{2i+1}$
- Solution: rotate by position-dependent angle, $f(z_i, m) = z_i\, e^{j m \theta_i}$
- Proof: $\operatorname{Re}\big(q_i \bar{k}_i\, e^{j(m-n)\theta_i}\big)$ depends only on $m - n$
- Frequencies: $\theta_i = \text{base}^{-2i/d}$ gives geometric spacing from local to global
- Base: controls maximum resolvable context length ($\lambda_{\max} \approx 2\pi \cdot \text{base}$)
- Implementation: precompute sin/cos tables, apply element-wise rotation to Q and K
RoPE has become the dominant position encoding for decoder-only LLMs. Llama, Mistral, Qwen, Gemma, DeepSeek, and most open-weight models use RoPE. Its zero-parameter design, mathematical guarantee of relative position dependence, and compatibility with KV-caching and context extension make it the standard choice.