Part of Series: Transformer Anatomy (28 of 36)

Position information is not inherent to the transformer architecture. Self-attention is permutation-equivariant: shuffling the input tokens and applying the same shuffle to the output gives the same result. Without positional encoding, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the." Every positional encoding method solves this problem, but RoPE (Rotary Position Embedding, Su et al., 2021) solves it with a mathematically elegant property: the attention score between two tokens depends only on their relative position, not their absolute positions. This property is encoded directly into the query and key representations through complex-number rotations.

This post derives RoPE from scratch. Every step is explicit. The derivation starts from the desired property (relative position dependence), constructs the solution using complex numbers, proves it satisfies the requirement, and analyzes the consequences for different frequency bands. No steps are skipped.

1. The Problem: Relative Position Dependence

1.1 What We Want

Let $q_m$ be the query vector at position $m$ and $k_n$ be the key vector at position $n$. The attention score between them is:

a_{mn} = q_m \cdot k_n = \sum_{i=1}^{d} q_{m,i} \cdot k_{n,i}

We want a positional encoding function $f$ that transforms queries and keys such that:

f(q, m) \cdot f(k, n) = g(q, k, m - n)

for some function $g$. The dot product between the encoded query and key depends on the content vectors $q$ and $k$ and the relative position $m - n$, but not on $m$ or $n$ individually.

1.2 Why Relative Position Matters

Absolute positional encodings (like the sinusoidal encodings of Vaswani et al., 2017, or the learned position embeddings of GPT-2) add a position-dependent vector to the input:

x_m' = x_m + p_m

The attention score becomes:

a_{mn} = (x_m + p_m)^T W_Q^T W_K (x_n + p_n)

= x_m^T W_Q^T W_K x_n + x_m^T W_Q^T W_K p_n + p_m^T W_Q^T W_K x_n + p_m^T W_Q^T W_K p_n

The last three terms contain absolute position information ($p_m$, $p_n$), not just relative ($m - n$). This means the model must independently learn that position 100 attending to position 95 is the same relationship as position 200 attending to position 195. With RoPE, this is guaranteed by construction.

1.3 Why Not Just Use Relative Position Bias

ALiBi (Press et al., 2022) and T5-style relative position bias add a bias term directly to the attention scores:

a_{mn} = q_m \cdot k_n + b(m - n)

where $b$ is a learned or fixed function of the relative position. This works, but has a limitation: the bias is content-independent. The same position bias is added regardless of what $q$ and $k$ represent. RoPE encodes position into the representations themselves, allowing content and position to interact through the dot product.

2. The Complex Number Framework

2.1 Pairing Dimensions

RoPE operates on pairs of dimensions. For a $d$-dimensional vector, we group the dimensions into $d/2$ pairs: $(x_1, x_2), (x_3, x_4), \ldots, (x_{d-1}, x_d)$.

Each pair is treated as a complex number:

z_i = x_{2i-1} + j \cdot x_{2i}, \quad i = 1, 2, \ldots, d/2

where $j$ is the imaginary unit (written $j$ instead of $i$ to avoid confusion with the index).

A $d$-dimensional real vector becomes a $d/2$-dimensional complex vector:

\mathbf{z} = (z_1, z_2, \ldots, z_{d/2}) \in \mathbb{C}^{d/2}

2.2 Rotation in the Complex Plane

Multiplying a complex number $z$ by $e^{j\theta}$ rotates it by angle $\theta$ in the complex plane:

z \cdot e^{j\theta} = (x + jy)(\cos\theta + j\sin\theta) = (x\cos\theta - y\sin\theta) + j(x\sin\theta + y\cos\theta)

In matrix form, this rotation is:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}

The key property of rotation is that it preserves magnitude: $|z \cdot e^{j\theta}| = |z| \cdot |e^{j\theta}| = |z| \cdot 1 = |z|$.

2.3 The Dot Product Under Rotation

For two complex numbers $z_q$ and $z_k$, the real part of $z_q \cdot \overline{z_k}$ (where $\overline{z_k}$ is the complex conjugate) gives the standard dot product of the corresponding 2D real vectors:

\text{Re}(z_q \cdot \overline{z_k}) = \text{Re}((q_1 + jq_2)(k_1 - jk_2))

= \text{Re}((q_1 k_1 + q_2 k_2) + j(q_2 k_1 - q_1 k_2))

= q_1 k_1 + q_2 k_2

This is exactly the dot product of $(q_1, q_2)$ and $(k_1, k_2)$.

For the full $d$-dimensional dot product:

q \cdot k = \sum_{i=1}^{d/2} \text{Re}(z_{q,i} \cdot \overline{z_{k,i}})
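This identity is easy to verify numerically. A minimal PyTorch sketch (the dimension d=8 and the seed are arbitrary choices, not from the post):

```python
import torch

torch.manual_seed(0)
d = 8
q = torch.randn(d)
k = torch.randn(d)

# View adjacent dimension pairs as complex numbers
zq = torch.view_as_complex(q.reshape(d // 2, 2))
zk = torch.view_as_complex(k.reshape(d // 2, 2))

# Sum of Re(z_q * conj(z_k)) over pairs recovers the real dot product
lhs = (zq * zk.conj()).real.sum()
rhs = q @ k
assert torch.allclose(lhs, rhs, atol=1e-6)
```

`torch.view_as_complex` is the same reinterpretation trick the Llama-style implementation in section 8 relies on.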

3. Constructing RoPE

3.1 The Rotation

RoPE rotates each dimension pair $i$ by an angle proportional to the position $m$:

f(z_{q,i}, m) = z_{q,i} \cdot e^{j m \theta_i}

where $\theta_i$ is a dimension-specific frequency. The rotation angle for dimension pair $i$ at position $m$ is $m \theta_i$.

In the real-valued representation, this means applying a 2x2 rotation matrix to each dimension pair:

f\left(\begin{pmatrix} q_{2i-1} \\ q_{2i} \end{pmatrix}, m\right) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i-1} \\ q_{2i} \end{pmatrix}

For the full $d$-dimensional vector, RoPE applies $d/2$ independent 2D rotations, one per dimension pair. In block-diagonal matrix form:

R_m = \begin{pmatrix} R_m^{(1)} & & & \\ & R_m^{(2)} & & \\ & & \ddots & \\ & & & R_m^{(d/2)} \end{pmatrix}

where each block is:

R_m^{(i)} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}

The encoded query is $\tilde{q}_m = R_m q$ and the encoded key is $\tilde{k}_n = R_n k$.
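The block-diagonal matrix form and the complex form can be checked against each other. A small sketch (d=4, position m=3, and the two frequencies are arbitrary values for illustration):

```python
import torch

torch.manual_seed(0)
d, m = 4, 3.0
theta = torch.tensor([1.0, 0.1])  # one frequency per dimension pair (arbitrary)
q = torch.randn(d)

# Build the block-diagonal rotation matrix R_m
R = torch.zeros(d, d)
for i, t in enumerate(theta):
    c, s = torch.cos(m * t), torch.sin(m * t)
    R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = torch.stack(
        [torch.stack([c, -s]), torch.stack([s, c])])

# Complex route: multiply each pair by e^{j * m * theta_i}
z = torch.view_as_complex(q.reshape(d // 2, 2))
q_via_complex = torch.view_as_real(
    z * torch.polar(torch.ones(d // 2), m * theta)).flatten()

assert torch.allclose(R @ q, q_via_complex, atol=1e-6)
```

In practice nobody materializes $R_m$; the matrix is shown only to make the block structure concrete, and real implementations use the elementwise form.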

3.2 Defining the Frequencies

The frequency for dimension pair $i$ is:

\theta_i = \text{base}^{-2i/d}

where base is typically 10000 and $i = 0, 1, \ldots, d/2 - 1$ (zero-indexed here, to match the values below and standard implementations). Explicitly:

\theta_i = 10000^{-2i/d} = \frac{1}{10000^{2i/d}}

For $d = 128$ (the head dimension in Llama 3):

  • $i = 0$: $\theta_0 = 10000^{-0/128} = 1.0$ radians per position
  • $i = 1$: $\theta_1 = 10000^{-2/128} = 10000^{-1/64} \approx 0.866$
  • $i = 16$: $\theta_{16} = 10000^{-32/128} = 10000^{-1/4} = 10^{-1} = 0.1$
  • $i = 32$: $\theta_{32} = 10000^{-64/128} = 10000^{-1/2} = 0.01$
  • $i = 63$: $\theta_{63} = 10000^{-126/128} \approx 1.15 \times 10^{-4}$

[Chart: RoPE frequencies by dimension pair (d=128, base=10000), plotted on a log scale from theta = 1.0 rad/pos at i=0 (fastest) down to theta ≈ 1.15e-4 at i=63 (slowest).]

4. The Proof: Relative Position Dependence

4.1 Statement

We prove that:

\tilde{q}_m \cdot \tilde{k}_n = (R_m q) \cdot (R_n k) = g(q, k, m - n)

That is, the dot product depends on $m$ and $n$ only through their difference $m - n$.

4.2 Proof for a Single Dimension Pair

Consider dimension pair $i$. The encoded query and key in complex form are:

\tilde{z}_{q,i} = z_{q,i} \cdot e^{jm\theta_i}, \qquad \tilde{z}_{k,i} = z_{k,i} \cdot e^{jn\theta_i}

The contribution to the dot product from this dimension pair is:

\text{Re}(\tilde{z}_{q,i} \cdot \overline{\tilde{z}_{k,i}})

Expand:

= \text{Re}\left(z_{q,i} \cdot e^{jm\theta_i} \cdot \overline{z_{k,i} \cdot e^{jn\theta_i}}\right)

The complex conjugate distributes over multiplication:

= \text{Re}\left(z_{q,i} \cdot e^{jm\theta_i} \cdot \overline{z_{k,i}} \cdot e^{-jn\theta_i}\right)

Combine the exponentials:

= \text{Re}\left(z_{q,i} \cdot \overline{z_{k,i}} \cdot e^{j(m-n)\theta_i}\right)

This depends on $m$ and $n$ only through $m - n$. The proof is complete for a single dimension pair.

4.3 Proof for the Full Dot Product

The full dot product is the sum over all dimension pairs:

\tilde{q}_m \cdot \tilde{k}_n = \sum_{i=1}^{d/2} \text{Re}\left(z_{q,i} \cdot \overline{z_{k,i}} \cdot e^{j(m-n)\theta_i}\right)

Each term depends on $m - n$, so the sum depends on $m - n$. Defining $\Delta = m - n$:

\tilde{q}_m \cdot \tilde{k}_n = \sum_{i=1}^{d/2} \text{Re}\left(z_{q,i} \cdot \overline{z_{k,i}} \cdot e^{j\Delta\theta_i}\right)

This can be written equivalently as:

\tilde{q}_m \cdot \tilde{k}_n = (R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_{n-m} k = q^T R_{-(m-n)} k

where the last step uses $R_m^T R_n = R_{n-m}$ (rotation matrices compose by adding angles: $R_\alpha^T R_\beta = R_{\beta - \alpha}$).

ℹ️ The Key Property

$R_m^T R_n = R_{n-m}$. The transpose of a rotation by $m\theta$ is a rotation by $-m\theta$ (the inverse rotation), and rotations compose by adding angles, so $R_m^T R_n = R_{-m} R_n = R_{n-m}$. In the complex formulation: $e^{-jm\theta} \cdot e^{jn\theta} = e^{j(n-m)\theta}$. This single algebraic property is the entire reason RoPE works.
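A quick numeric check of this property, with arbitrary angles standing in for $m\theta$ and $n\theta$:

```python
import torch

def rot(a):
    """2x2 rotation matrix for angle a."""
    c = torch.cos(torch.tensor(a))
    s = torch.sin(torch.tensor(a))
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

m_angle, n_angle = 0.7, 0.2  # m*theta and n*theta for some dimension pair
lhs = rot(m_angle).T @ rot(n_angle)
rhs = rot(n_angle - m_angle)
assert torch.allclose(lhs, rhs, atol=1e-6)
```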

4.4 Expanded Real-Valued Form

For a single dimension pair $(q_1, q_2)$ and $(k_1, k_2)$ at relative position $\Delta = m - n$:

\text{Re}(z_q \cdot \overline{z_k} \cdot e^{j\Delta\theta})

Let $z_q \cdot \overline{z_k} = (q_1 k_1 + q_2 k_2) + j(q_2 k_1 - q_1 k_2)$. Call this $A + jB$ where:

A = q_1 k_1 + q_2 k_2 \quad (\text{the content dot product})
B = q_2 k_1 - q_1 k_2 \quad (\text{the content cross product})

Then:

\text{Re}((A + jB) \cdot e^{j\Delta\theta}) = A\cos(\Delta\theta) - B\sin(\Delta\theta)

= (q_1 k_1 + q_2 k_2)\cos(\Delta\theta) - (q_2 k_1 - q_1 k_2)\sin(\Delta\theta)

This reveals how content and position interact: the attention score is a weighted combination of the content dot product and cross product, with weights determined by the relative position through $\cos(\Delta\theta)$ and $\sin(\Delta\theta)$.

When $\Delta = 0$ (same position): $\cos(0) = 1$ and $\sin(0) = 0$, so the score is just $A = q_1 k_1 + q_2 k_2$, the pure content dot product. As $|\Delta|$ increases, the $\sin$ term introduces a position-dependent rotation of the score.
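A short check that the closed form matches rotating both pairs directly and taking the dot product (all values here are arbitrary):

```python
import math

q1, q2 = 0.8, -0.3
k1, k2 = 0.5, 1.2
m, n, theta = 7, 4, 0.25
delta = m - n

def rotate(x1, x2, angle):
    """Apply the 2x2 rotation to one dimension pair."""
    return (x1 * math.cos(angle) - x2 * math.sin(angle),
            x1 * math.sin(angle) + x2 * math.cos(angle))

# Direct route: rotate query pair by m*theta, key pair by n*theta, dot them
qr = rotate(q1, q2, m * theta)
kr = rotate(k1, k2, n * theta)
direct = qr[0] * kr[0] + qr[1] * kr[1]

# Closed form: A*cos(delta*theta) - B*sin(delta*theta)
A = q1 * k1 + q2 * k2
B = q2 * k1 - q1 * k2
closed = A * math.cos(delta * theta) - B * math.sin(delta * theta)

assert abs(direct - closed) < 1e-12
```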

5. Frequency Bands and Their Interpretation

5.1 Wavelengths

Each frequency $\theta_i$ corresponds to a wavelength (the number of positions for a full $2\pi$ rotation):

\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \cdot \text{base}^{2i/d}

For $d = 128$ and base = 10000:

  • $i = 0$: $\lambda_0 = 2\pi \approx 6.28$ positions (rotates very fast)
  • $i = 16$: $\lambda_{16} = 2\pi \cdot 10 \approx 62.8$ positions
  • $i = 32$: $\lambda_{32} = 2\pi \cdot 100 \approx 628$ positions
  • $i = 48$: $\lambda_{48} = 2\pi \cdot 1000 \approx 6{,}283$ positions
  • $i = 63$: $\lambda_{63} = 2\pi \cdot 10000^{126/128} \approx 54{,}410$ positions
RoPE Wavelengths by Dimension Pair (d=128, base=10000)

Dim Pair (i)    Frequency (theta)    Wavelength (positions)    Encodes
0 (fastest)     1.0 rad/pos          6.3                       Sub-word adjacency
8               0.316 rad/pos        19.9                      Short phrases
16              0.1 rad/pos          62.8                      Sentence-level
32              0.01 rad/pos         628                       Paragraph-level
48              0.001 rad/pos        6,283                     Section-level
63 (slowest)    1.15e-4 rad/pos      54,410                    Document-level

Note: Wavelength = 2*pi / theta. A full rotation takes one wavelength of positions.

5.2 The Multi-Scale Representation

The geometric spacing of frequencies creates a multi-scale representation of position:

  • Fast dimensions ($i$ near 0): Rotate quickly, cycling through many full rotations within a few hundred positions. These dimensions encode fine-grained local position information: is this token 1, 2, or 3 positions away?
  • Slow dimensions ($i$ near $d/2$): Rotate slowly, barely changing over thousands of positions. These dimensions encode coarse global position information: is this token in the first half or second half of the document?

This is directly analogous to the binary representation of integers. The least significant bit (analogous to the fastest dimension) flips every number. The most significant bit (analogous to the slowest dimension) flips after half the range. Together, all bits uniquely encode every integer.

5.3 Attention Score Decay with Distance

For a single dimension pair $i$, the attention contribution as a function of relative distance $\Delta$ is:

s_i(\Delta) = A_i \cos(\Delta \theta_i) - B_i \sin(\Delta \theta_i) = C_i \cos(\Delta \theta_i + \phi_i)

where $C_i = \sqrt{A_i^2 + B_i^2}$ and $\phi_i = \arctan(B_i / A_i)$.

The full attention score is:

s(\Delta) = \sum_{i=1}^{d/2} C_i \cos(\Delta \theta_i + \phi_i)

This is a sum of cosines at different frequencies. For random $q$ and $k$ vectors, the amplitudes $C_i$ are roughly equal across dimensions, and the phases $\phi_i$ are random. When the content is aligned (the phases are near zero), the sum of cosines with geometrically spaced frequencies produces a function that:

  1. Peaks at $\Delta = 0$ (all cosines equal 1 there)
  2. Decays as $|\Delta|$ increases (the cosines drift out of phase)
  3. Has a characteristic decay length determined by the average frequency

This creates a natural locality bias: nearby tokens receive higher attention scores than distant ones, purely from the geometry of the rotation encoding. The model does not need to learn this bias; it emerges from the encoding.
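One way to see the bias concretely is to set k = q, so every phase $\phi_i$ is zero and the score reduces to $\sum_i |z_i|^2 \cos(\Delta\theta_i)$, which is maximal at $\Delta = 0$ and falls off with distance. A minimal sketch (d=128, base=10000 as above; the sample count is arbitrary):

```python
import torch

d, base = 128, 10000.0
theta = 1.0 / base ** (torch.arange(0, d, 2).float() / d)  # (d/2,)

torch.manual_seed(0)
z = torch.view_as_complex(torch.randn(2000, d // 2, 2))  # random content

def mean_score(delta):
    # k = q case: score(delta) = sum_i |z_i|^2 cos(delta * theta_i)
    phase = torch.polar(torch.ones_like(theta), delta * theta)
    return (z * z.conj() * phase).real.sum(dim=-1).mean().item()

for delta in [0, 1, 4, 16, 64, 256]:
    print(f"delta={delta:4d}  mean score = {mean_score(delta):7.1f}")
```

Because every cosine equals 1 at $\Delta = 0$, the score there is exactly $\sum_i |z_i|^2$ (about $d$ on average); no other distance can score higher for the same content.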

6. The Base Frequency Parameter

6.1 What Base Controls

The base parameter (default 10000) determines the range of frequencies:

\theta_i = \text{base}^{-2i/d}

  • Smallest frequency (slowest rotation): $\theta_{\min} = \text{base}^{-1+2/d} \approx \text{base}^{-1}$
  • Largest frequency (fastest rotation): $\theta_{\max} = \text{base}^0 = 1$
  • Longest wavelength: $\lambda_{\max} \approx 2\pi \cdot \text{base}$
  • Shortest wavelength: $\lambda_{\min} = 2\pi$

The base determines the maximum context length at which position can still be resolved. Beyond roughly $\lambda_{\max} = 2\pi \cdot \text{base}$ positions, the slowest-rotating dimension has completed a full rotation and the positional encoding starts to alias (positions $m$ and $m + \lambda_{\max}$ have the same encoding in the slowest dimension).

For base = 10000: $\lambda_{\max} \approx 62{,}832$ under the $\text{base}^{-1}$ approximation; the exact slowest wavelength for $d = 128$ is $2\pi \cdot 10000^{126/128} \approx 54{,}410$. This is sufficient for contexts up to roughly 54K tokens before the slowest dimension completes a full rotation.

6.2 Increasing the Base for Longer Contexts

To support longer contexts, increase the base. This slows down all rotations, extending the range before aliasing occurs:

  • base = 10000: max context ≈ 63K (Llama 2)
  • base = 500000: max context ≈ 3.14M (Llama 3 uses 500000)
  • base = 1000000: max context ≈ 6.28M (some models use this)

The tradeoff: a larger base means all frequencies are lower, which reduces position resolution at short distances. The fastest dimension still has $\theta_0 = 1$ (this does not change with base), but the intermediate dimensions all rotate more slowly. This compresses the frequency spectrum toward zero, potentially degrading fine-grained position discrimination.

In practice, the resolution loss from increasing the base is small because:

  1. The fastest dimensions ($\theta_0 = 1$) are unchanged
  2. The model has many dimension pairs ($d/2 = 64$ for Llama 3) to encode position
  3. The geometric spacing ensures good coverage even when compressed

The following script compares the frequency tables for the two bases:
import math

def compute_rope_frequencies(d, base=10000):
    """Compute RoPE rotation frequencies for each dimension pair."""
    freqs = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        wavelength = 2 * math.pi / theta
        freqs.append((i, theta, wavelength))
    return freqs

# Compare base=10000 vs base=500000
print("Base = 10000:")
for i, theta, wl in compute_rope_frequencies(128, base=10000):
    if i in [0, 16, 32, 48, 63]:
        print(f"  dim pair {i:2d}: theta={theta:.6f}, "
              f"wavelength={wl:.0f} positions")

print("\nBase = 500000 (Llama 3):")
for i, theta, wl in compute_rope_frequencies(128, base=500000):
    if i in [0, 16, 32, 48, 63]:
        print(f"  dim pair {i:2d}: theta={theta:.6f}, "
              f"wavelength={wl:.0f} positions")

Output:

Base = 10000:
  dim pair  0: theta=1.000000, wavelength=6 positions
  dim pair 16: theta=0.100000, wavelength=63 positions
  dim pair 32: theta=0.010000, wavelength=628 positions
  dim pair 48: theta=0.001000, wavelength=6283 positions
  dim pair 63: theta=0.000115, wavelength=54410 positions

Base = 500000 (Llama 3):
  dim pair  0: theta=1.000000, wavelength=6 positions
  dim pair 16: theta=0.037606, wavelength=167 positions
  dim pair 32: theta=0.001414, wavelength=4443 positions
  dim pair 48: theta=0.000053, wavelength=118143 positions
  dim pair 63: theta=0.000002, wavelength=2559196 positions

6.3 NTK-Aware Scaling

NTK-aware scaling (proposed in the open-source community in 2023, building on the position-interpolation work of Chen et al., 2023) scales the base when the target context exceeds the training length:

\text{base}_{\text{scaled}} = \text{base} \cdot \left(\frac{\alpha \cdot L_{\text{target}}}{L_{\text{train}}}\right)^{d/(d-2)}

where $\alpha$ is a scaling factor (typically 1-4), $L_{\text{target}}$ is the target context length, and $L_{\text{train}}$ is the training context length.

The idea: rather than simply interpolating positions (which compresses all frequencies equally), NTK-aware scaling increases the base, which primarily stretches the low-frequency components. High-frequency components (which encode local patterns) are largely unchanged. This preserves local position resolution while extending global range.

def ntk_aware_rope_frequencies(d, base=10000, target_len=131072,
                                 train_len=8192, alpha=2.0):
    """NTK-aware RoPE frequency scaling for context extension."""
    scale = alpha * target_len / train_len
    new_base = base * scale ** (d / (d - 2))

    freqs = []
    for i in range(d // 2):
        theta = new_base ** (-2.0 * i / d)
        wavelength = 2 * math.pi / theta
        freqs.append((i, theta, wavelength))

    print(f"Original base: {base}")
    print(f"Scale factor: {scale:.2f}")
    print(f"New base: {new_base:.0f}")
    return freqs

freqs = ntk_aware_rope_frequencies(
    d=128, base=10000, target_len=131072, train_len=8192, alpha=2.0
)

7. RoPE vs Other Position Encodings

7.1 Comparison

Position Encoding Methods Compared

Method                  Relative Position     Extrapolation               Parameters        Used In
Sinusoidal (Vaswani)    No (absolute)         Poor                        0                 Original Transformer
Learned absolute        No (absolute)         None                        O(L*d)            GPT-2, BERT
T5 relative bias        Yes                   Moderate                    O(n_heads * L)    T5, Flan-T5
ALiBi                   Yes (linear decay)    Good                        0                 BLOOM, MPT
RoPE                    Yes (rotation)        Good (with base scaling)    0                 Llama, Mistral, Qwen, Gemma

Note: L = max sequence length, d = model dimension. RoPE dominates modern decoder-only LLMs.

7.2 RoPE's Advantages

  1. Zero additional parameters: RoPE is a deterministic function of position and dimension. No learned parameters.
  2. Relative position by construction: Proven above. Not an approximation.
  3. Flexible context extension: Changing the base extends the context without retraining from scratch (usually with a modest amount of fine-tuning to adapt).
  4. Efficient computation: Applied element-wise with precomputed sin/cos tables. No matrix multiplications.
  5. Compatibility with the KV-cache: RoPE is applied to Q and K before caching, so cached K vectors already have position encoded. There is no need to recompute position encoding when extending the KV-cache.

7.3 RoPE's Limitations

  1. Even dimension requirement: Requires $d$ to be even (dimension pairing). All modern models satisfy this.
  2. No absolute position: RoPE encodes only relative position. Tasks that truly need absolute position (e.g., "what is the 5th word?") must learn it from relative patterns.
  3. Base frequency tuning: The base parameter must be set appropriately for the target context length. Too small and the encoding aliases; too large and local resolution degrades.

8. Implementation: Complete RoPE in PyTorch

8.1 Precomputing the Rotation Table

import torch
import math

def precompute_freqs_cis(dim, max_seq_len, base=10000.0):
    """
    Precompute the complex exponentials for RoPE.

    Returns a tensor of shape (max_seq_len, dim//2) containing
    complex numbers e^(j * m * theta_i) for each position m
    and dimension pair i.
    """
    # Frequencies for each dimension pair
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # freqs shape: (dim//2,)

    # Position indices
    t = torch.arange(max_seq_len, dtype=torch.float32)
    # t shape: (max_seq_len,)

    # Outer product: angle for each (position, dimension pair)
    angles = torch.outer(t, freqs)
    # angles shape: (max_seq_len, dim//2)

    # Complex exponentials
    freqs_cis = torch.polar(torch.ones_like(angles), angles)
    # freqs_cis shape: (max_seq_len, dim//2), dtype=complex64
    # Each entry is e^(j * m * theta_i)

    return freqs_cis

8.2 Applying RoPE to Queries and Keys

def apply_rotary_emb(xq, xk, freqs_cis):
    """
    Apply rotary position embeddings to query and key tensors.

    Args:
        xq: (B, S, H, D) query tensor
        xk: (B, S, H, D) key tensor
        freqs_cis: (S, D//2) complex rotation factors

    Returns:
        xq_out: (B, S, H, D) rotated queries
        xk_out: (B, S, H, D) rotated keys
    """
    # Reshape to pairs of dimensions and view as complex
    # (B, S, H, D) -> (B, S, H, D//2, 2) -> complex (B, S, H, D//2)
    xq_complex = torch.view_as_complex(
        xq.float().reshape(*xq.shape[:-1], -1, 2)
    )
    xk_complex = torch.view_as_complex(
        xk.float().reshape(*xk.shape[:-1], -1, 2)
    )

    # Reshape freqs_cis for broadcasting: (S, D//2) -> (1, S, 1, D//2)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)

    # Apply rotation: multiply by complex exponential
    xq_rotated = xq_complex * freqs_cis
    xk_rotated = xk_complex * freqs_cis

    # Convert back to real: (B, S, H, D//2) complex -> (B, S, H, D)
    xq_out = torch.view_as_real(xq_rotated).flatten(-2)
    xk_out = torch.view_as_real(xk_rotated).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)

8.3 Alternative: Sin/Cos Implementation (No Complex Numbers)

Some frameworks avoid complex number support. Here is the equivalent using sin and cos directly:

def precompute_rope_cache(dim, max_seq_len, base=10000.0):
    """
    Precompute cos and sin tables for RoPE.

    Returns:
        cos_cache: (max_seq_len, dim//2) cosine values
        sin_cache: (max_seq_len, dim//2) sine values
    """
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    angles = torch.outer(t, freqs)  # (max_seq_len, dim//2)

    cos_cache = torch.cos(angles)
    sin_cache = torch.sin(angles)
    return cos_cache, sin_cache

def apply_rope_real(x, cos_cache, sin_cache):
    """
    Apply RoPE using real-valued sin/cos rotation.

    Args:
        x: (B, S, H, D) input tensor (query or key)
        cos_cache: (S, D//2) precomputed cosines
        sin_cache: (S, D//2) precomputed sines

    The rotation for each pair (x1, x2):
        x1' = x1 * cos - x2 * sin
        x2' = x1 * sin + x2 * cos
    """
    B, S, H, D = x.shape
    half_d = D // 2

    # Split into even and odd dimensions
    x1 = x[..., :half_d]   # (B, S, H, D//2) - first of each pair
    x2 = x[..., half_d:]   # (B, S, H, D//2) - second of each pair

    # Reshape cos/sin for broadcasting: (S, D//2) -> (1, S, 1, D//2)
    cos_vals = cos_cache[:S].unsqueeze(0).unsqueeze(2)
    sin_vals = sin_cache[:S].unsqueeze(0).unsqueeze(2)

    # Apply 2D rotation to each pair
    x1_rot = x1 * cos_vals - x2 * sin_vals
    x2_rot = x1 * sin_vals + x2 * cos_vals

    # Concatenate back
    return torch.cat([x1_rot, x2_rot], dim=-1)
⚠️ Dimension Ordering Convention

There are two conventions for which dimensions form pairs. The original RoPE paper pairs adjacent dimensions: $(x_0, x_1), (x_2, x_3), \ldots$. Some implementations (including the Llama reference code) pair the first half with the second half: $(x_0, x_{d/2}), (x_1, x_{d/2+1}), \ldots$. The math is identical; only the permutation of dimensions differs. The sin/cos implementation above uses the first-half/second-half convention, while the complex implementation pairs adjacent dimensions. Make sure your implementation matches the model checkpoint you are loading.
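The equivalence of the two conventions is easy to demonstrate: the half-split rotation equals the adjacent-pair rotation applied to a permuted input. A self-contained sketch (d=8, position and base arbitrary):

```python
import torch

torch.manual_seed(0)
d, m, base = 8, 5.0, 10000.0
theta = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
cos, sin = torch.cos(m * theta), torch.sin(m * theta)
x = torch.randn(d)

def rotate_adjacent(v):
    """Adjacent-pair convention: (v0, v1), (v2, v3), ... share theta_i."""
    p = v.reshape(d // 2, 2)
    return torch.stack([p[:, 0] * cos - p[:, 1] * sin,
                        p[:, 0] * sin + p[:, 1] * cos], dim=-1).flatten()

def rotate_halves(v):
    """Half-split convention: (v_i, v_{i+d/2}) share theta_i."""
    a, b = v[:d // 2], v[d // 2:]
    return torch.cat([a * cos - b * sin, a * sin + b * cos])

# The two outputs differ only by a fixed permutation of dimensions
perm = torch.stack([torch.arange(d // 2),
                    torch.arange(d // 2) + d // 2], dim=1).flatten()
assert torch.allclose(rotate_halves(x)[perm], rotate_adjacent(x[perm]))
```

Because the same fixed permutation is applied to both queries and keys, the attention scores come out identical under either convention; only the weight layout of the checkpoint differs.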

8.4 Complete Attention with RoPE

import torch.nn as nn

class RoPEAttention(nn.Module):
    """Multi-head attention with Rotary Position Embedding."""

    def __init__(self, d_model, n_heads, max_seq_len=8192, base=10000.0):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        # Precompute RoPE frequencies
        self.register_buffer(
            'freqs_cis',
            precompute_freqs_cis(self.d_k, max_seq_len, base),
            persistent=False
        )

    def forward(self, x, start_pos=0, mask=None):
        """
        Args:
            x: (B, S, D) input
            start_pos: starting position for KV-cache scenarios
            mask: optional attention mask
        """
        B, S, D = x.shape
        H = self.n_heads
        dk = self.d_k

        # Project to Q, K, V
        q = self.W_q(x).view(B, S, H, dk)
        k = self.W_k(x).view(B, S, H, dk)
        v = self.W_v(x).view(B, S, H, dk)

        # Apply RoPE to Q and K (not V)
        freqs = self.freqs_cis[start_pos:start_pos + S]
        q, k = apply_rotary_emb(q, k, freqs)

        # Standard attention computation
        q = q.transpose(1, 2)  # (B, H, S, dk)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        scale = 1.0 / math.sqrt(dk)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        if mask is not None:
            scores = scores + mask  # mask contains -inf for blocked positions

        weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(weights, v)

        output = output.transpose(1, 2).contiguous().view(B, S, D)
        return self.W_o(output)

# Test: verify relative position property
torch.manual_seed(42)
d_k = 64

# Single head, single token, so each score is a scalar
q = torch.randn(1, 1, 1, d_k)
k = torch.randn(1, 1, 1, d_k)

freqs = precompute_freqs_cis(d_k, 1024, base=10000.0)

# Compute dot product at positions (m=100, n=90) -> delta=10
q_100 = apply_rotary_emb(q, k, freqs[100:101])[0]
k_90 = apply_rotary_emb(q, k, freqs[90:91])[1]
score_100_90 = (q_100 * k_90).sum()

# Compute dot product at positions (m=500, n=490) -> delta=10
q_500 = apply_rotary_emb(q, k, freqs[500:501])[0]
k_490 = apply_rotary_emb(q, k, freqs[490:491])[1]
score_500_490 = (q_500 * k_490).sum()

# These should be identical (same relative position)
print(f"Score at (100, 90):  {score_100_90.item():.6f}")
print(f"Score at (500, 490): {score_500_490.item():.6f}")
print(f"Difference: {abs(score_100_90.item() - score_500_490.item()):.2e}")
# Difference should be ~0 (floating point only)

8.5 RoPE with KV-Cache for Inference

During autoregressive inference, we cache the K and V vectors. Since RoPE is applied to K before caching, the cached keys already have their position encoded. When generating the token at position $t$, we only need to compute RoPE for the new query and the new key at that position:

class RoPEAttentionWithCache(nn.Module):
    """RoPE attention with KV-cache for efficient inference."""

    def __init__(self, d_model, n_heads, max_seq_len=8192, base=10000.0):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        self.register_buffer(
            'freqs_cis',
            precompute_freqs_cis(self.d_k, max_seq_len, base),
            persistent=False
        )

        # KV cache (initialized on first call)
        self.k_cache = None
        self.v_cache = None

    def forward(self, x, start_pos):
        """
        Args:
            x: (B, S, D) input. S=full_seq during prefill, S=1 during decode.
            start_pos: position of first token in x.
        """
        B, S, D = x.shape
        H = self.n_heads
        dk = self.d_k

        q = self.W_q(x).view(B, S, H, dk)
        k = self.W_k(x).view(B, S, H, dk)
        v = self.W_v(x).view(B, S, H, dk)

        # Apply RoPE to new Q and K
        freqs = self.freqs_cis[start_pos:start_pos + S]
        q, k = apply_rotary_emb(q, k, freqs)

        # Update KV cache (concat for clarity; production code preallocates a fixed-size buffer)
        if self.k_cache is None:
            self.k_cache = k
            self.v_cache = v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)

        # Attention: new queries attend to all cached keys
        q = q.transpose(1, 2)                    # (B, H, S, dk)
        k_all = self.k_cache.transpose(1, 2)     # (B, H, T, dk)
        v_all = self.v_cache.transpose(1, 2)     # (B, H, T, dk)

        scale = 1.0 / math.sqrt(dk)
        scores = torch.matmul(q, k_all.transpose(-2, -1)) * scale

        # Causal mask: new tokens can attend to all previous + self
        T = k_all.shape[2]
        causal_mask = torch.triu(
            torch.full((S, T), float('-inf'), device=x.device),
            diagonal=T - S + 1
        )
        scores = scores + causal_mask

        weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(weights, v_all)

        output = output.transpose(1, 2).contiguous().view(B, S, D)
        return self.W_o(output)

    def reset_cache(self):
        self.k_cache = None
        self.v_cache = None
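
The mask construction in `forward` is the subtle part: `diagonal=T - S + 1` covers both the prefill case (S equals T) and the decode case (S equals 1). A standalone sketch of just that formula, independent of the class above:

```python
import torch

def decode_step_mask(S, T):
    """Causal mask for S new queries attending to T total keys (cache + new).

    Query i sits at absolute position (T - S) + i, so it may attend to
    key positions 0 .. (T - S) + i; every later key gets -inf.
    """
    return torch.triu(
        torch.full((S, T), float('-inf')),
        diagonal=T - S + 1,
    )

# Prefill: 4 new tokens, empty cache -> the usual lower-triangular causal mask
print(decode_step_mask(4, 4))

# Decode: 1 new token with 4 cached keys -> a row of zeros (attend to everything)
print(decode_step_mask(1, 5))
```

During decode the mask row is all zeros, which is why many implementations skip the mask addition entirely when S is 1.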

9 Numerical Verification

9.1 Verifying the Relative Position Property

def verify_rope_relative_property(d=64, n_tests=1000, base=10000.0):
    """
    Verify that RoPE attention scores depend only on relative position.

    For random q, k and positions (m1, n1), (m2, n2) where
    m1 - n1 = m2 - n2, the attention scores should be identical.
    """
    freqs = precompute_freqs_cis(d, 10000, base)
    max_error = 0.0

    for _ in range(n_tests):
        q = torch.randn(d)
        k = torch.randn(d)

        # Random positions with same relative distance
        delta = torch.randint(0, 100, (1,)).item()
        m1 = torch.randint(0, 5000, (1,)).item()
        n1 = m1 - delta
        m2 = torch.randint(0, 5000, (1,)).item()
        n2 = m2 - delta

        if n1 < 0 or n2 < 0:
            continue

        # Apply RoPE
        q_r = q.reshape(1, 1, 1, d)
        k_r = k.reshape(1, 1, 1, d)

        q1, _ = apply_rotary_emb(q_r, k_r, freqs[m1:m1+1])
        _, k1 = apply_rotary_emb(q_r, k_r, freqs[n1:n1+1])
        score1 = (q1 * k1).sum().item()

        q2, _ = apply_rotary_emb(q_r, k_r, freqs[m2:m2+1])
        _, k2 = apply_rotary_emb(q_r, k_r, freqs[n2:n2+1])
        score2 = (q2 * k2).sum().item()

        error = abs(score1 - score2)
        max_error = max(max_error, error)

    print(f"Max error over {n_tests} tests: {max_error:.2e}")
    print(f"Property verified: {max_error < 1e-4}")

verify_rope_relative_property()
# Expected: Max error ~1e-6 to 1e-5 (floating point only)

9.2 Verifying Attention Score Decay

def measure_attention_decay(d=128, base=10000.0, max_delta=2000):
    """
    Measure how RoPE attention scores decay with relative distance.
    Uses random q and k, averaged over many samples.
    """
    freqs = precompute_freqs_cis(d, max_delta + 100, base)
    n_samples = 500
    scores_by_delta = {}

    for delta in [0, 1, 2, 5, 10, 50, 100, 500, 1000, 2000]:
        total_score = 0.0
        for _ in range(n_samples):
            q = torch.randn(d)
            k = torch.randn(d)

            q_r = q.reshape(1, 1, 1, d)
            k_r = k.reshape(1, 1, 1, d)

            m = delta + 50
            n = 50

            q_rot, _ = apply_rotary_emb(q_r, k_r, freqs[m:m+1])
            _, k_rot = apply_rotary_emb(q_r, k_r, freqs[n:n+1])
            score = (q_rot * k_rot).sum().item()
            total_score += abs(score)

        avg_score = total_score / n_samples
        scores_by_delta[delta] = avg_score
        print(f"delta={delta:5d}: avg |score| = {avg_score:.4f}")

    return scores_by_delta

measure_attention_decay()

Summary

RoPE encodes position by rotating query and key vectors in the complex plane, with rotation angles proportional to position and frequencies that vary geometrically across dimensions. The construction guarantees that attention scores depend only on relative position; this is proven, not approximated. The multi-scale frequency structure creates dimension pairs ranging from sub-word locality (fast rotation) to document-level structure (slow rotation). The base parameter controls the maximum resolvable context length and can be scaled to extend context beyond the training length.

The complete derivation chain:

  1. Requirement: $f(q, m) \cdot f(k, n) = g(q, k, m-n)$
  2. Representation: pair dimensions as complex numbers $z = x_1 + jx_2$
  3. Solution: rotate by a position-dependent angle: $z \cdot e^{jm\theta}$
  4. Proof: $\text{Re}(z_q e^{jm\theta} \cdot \overline{z_k e^{jn\theta}}) = \text{Re}(z_q \overline{z_k} \cdot e^{j(m-n)\theta})$
  5. Frequencies: $\theta_i = \text{base}^{-2i/d}$ gives geometric spacing from local to global
  6. Base: controls maximum resolvable context length ($\lambda_{\max} = 2\pi \cdot \text{base}$)
  7. Implementation: precompute sin/cos tables, apply element-wise rotation to Q and K
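
Steps 5 and 6 are easy to check numerically. A quick sketch using a head dimension of 128 and base 10000 (common defaults):

```python
import math

d, base = 128, 10000.0

# Step 5: theta_i = base**(-2i/d), one rotation angle per dimension pair
thetas = [base ** (-2 * i / d) for i in range(d // 2)]

# Geometric spacing: every adjacent ratio equals base**(-2/d)
ratio = base ** (-2 / d)
assert all(math.isclose(b / a, ratio) for a, b in zip(thetas, thetas[1:]))

# Step 6: wavelength of pair i is 2*pi / theta_i tokens per full rotation
print(f"fastest pair: {2 * math.pi / thetas[0]:.2f} tokens")   # 2*pi ~ 6.28
print(f"slowest pair: {2 * math.pi / thetas[-1]:.0f} tokens")  # ~54,000
```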

RoPE has become the dominant position encoding for decoder-only LLMs. Llama, Mistral, Qwen, Gemma, DeepSeek, and most open-weight models use RoPE. Its zero-parameter design, mathematical guarantee of relative position dependence, and compatibility with KV-caching and context extension make it the standard choice.
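
The context-extension point can be made concrete: raising the base stretches every wavelength, which is the basic lever behind base-scaling schemes. A sketch of the arithmetic only (the base values are illustrative, not any particular model's recipe):

```python
import math

def slowest_wavelength(d, base):
    """Wavelength (in tokens) of the slowest-rotating dimension pair,
    i = d/2 - 1, where theta_i = base**(-2i/d)."""
    theta_min = base ** (-(d - 2) / d)
    return 2 * math.pi / theta_min

# A larger base means the slowest pair needs more positions to complete
# one rotation, so the same angle range spans a longer context.
print(f"base 10000:  {slowest_wavelength(128, 10000.0):,.0f} tokens")
print(f"base 500000: {slowest_wavelength(128, 500000.0):,.0f} tokens")
```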
