Part of Series: Transformer Anatomy (28 of 36)

Position information is not inherent to the transformer architecture. Self-attention is permutation-equivariant: shuffling the input tokens and applying the same shuffle to the output gives the same result. Without positional encoding, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the." Every positional encoding method solves this problem, but RoPE (Rotary Position Embedding, Su et al., 2021) solves it with a mathematically elegant property: the attention score between two tokens depends only on their relative position, not their absolute positions. This property is encoded directly into the query and key representations through complex-number rotations.

This post derives RoPE from scratch. Every step is explicit. The derivation starts from the desired property (relative position dependence), constructs the solution using complex numbers, proves it satisfies the requirement, and analyzes the consequences for different frequency bands. No steps are skipped.

1. The Problem: Relative Position Dependence

1.1 What We Want

Let $q_m$ be the query vector at position $m$ and $k_n$ be the key vector at position $n$. The attention score between them is:

a_{mn} = q_m \cdot k_n = \sum_{i=1}^{d} q_{m,i} \cdot k_{n,i}

We want a positional encoding function $f$ that transforms queries and keys such that:

f(q, m) \cdot f(k, n) = g(q, k, m - n)

for some function $g$. The dot product between the encoded query and key depends on the content vectors $q$ and $k$ and the relative position $m - n$, but not on $m$ or $n$ individually.

1.2 Why Relative Position Matters

Absolute positional encodings (like the sinusoidal encodings of Vaswani et al., 2017, or the learned position embeddings of GPT-2) add a position-dependent vector to the input:

x_m' = x_m + p_m

The attention score becomes:

a_{mn} = (x_m + p_m)^T W_Q^T W_K (x_n + p_n)

= x_m^T W_Q^T W_K x_n + x_m^T W_Q^T W_K p_n + p_m^T W_Q^T W_K x_n + p_m^T W_Q^T W_K p_n

The last three terms contain absolute position information ($p_m$, $p_n$), not just relative ($m - n$). This means the model must independently learn that position 100 attending to position 95 is the same relationship as position 200 attending to position 195. With RoPE, this is guaranteed by construction.

1.3 Why Not Just Use Relative Position Bias

ALiBi (Press et al., 2022) and T5-style relative position bias add a bias term directly to the attention scores:

a_{mn} = q_m \cdot k_n + b(m - n)

where $b$ is a learned or fixed function of the relative position. This works, but has a limitation: the bias is content-independent. The same position bias is added regardless of what $q$ and $k$ represent. RoPE encodes position into the representations themselves, allowing content and position to interact through the dot product.

2. The Complex Number Framework

2.1 Pairing Dimensions

RoPE operates on pairs of dimensions. For a $d$-dimensional vector, we group the dimensions into $d/2$ pairs: $(x_1, x_2), (x_3, x_4), \ldots, (x_{d-1}, x_d)$.

Each pair is treated as a complex number:

z_i = x_{2i-1} + j \cdot x_{2i}, \quad i = 1, 2, \ldots, d/2

where $j$ is the imaginary unit (written $j$ instead of $i$ to avoid confusion with the index).

A $d$-dimensional real vector becomes a $d/2$-dimensional complex vector:

\mathbf{z} = (z_1, z_2, \ldots, z_{d/2}) \in \mathbb{C}^{d/2}

2.2 Rotation in the Complex Plane

Multiplying a complex number $z$ by $e^{j\theta}$ rotates it by angle $\theta$ in the complex plane:

z \cdot e^{j\theta} = (x + jy)(\cos\theta + j\sin\theta) = (x\cos\theta - y\sin\theta) + j(x\sin\theta + y\cos\theta)

In matrix form, this rotation is:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}

The key property of rotation is that it preserves magnitude: $|z \cdot e^{j\theta}| = |z| \cdot |e^{j\theta}| = |z| \cdot 1 = |z|$.

2.3 The Dot Product Under Rotation

For two complex numbers $z_q$ and $z_k$, the real part of $z_q \cdot \overline{z_k}$ (where $\overline{z_k}$ is the complex conjugate) gives the standard dot product of the corresponding 2D real vectors:

\text{Re}(z_q \cdot \overline{z_k}) = \text{Re}((q_1 + jq_2)(k_1 - jk_2))

= \text{Re}((q_1 k_1 + q_2 k_2) + j(q_2 k_1 - q_1 k_2))

= q_1 k_1 + q_2 k_2

This is exactly the dot product of $(q_1, q_2)$ and $(k_1, k_2)$.

For the full $d$-dimensional dot product:

q \cdot k = \sum_{i=1}^{d/2} \text{Re}(z_{q,i} \cdot \overline{z_{k,i}})
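This identity is easy to verify numerically. A minimal PyTorch sketch (the dimension d=8 and the seed are arbitrary choices, not from the post):

```python
import torch

torch.manual_seed(0)
d = 8
q = torch.randn(d)
k = torch.randn(d)

# View adjacent dimension pairs as complex numbers
zq = torch.view_as_complex(q.reshape(d // 2, 2))
zk = torch.view_as_complex(k.reshape(d // 2, 2))

# Sum of Re(z_q * conj(z_k)) over pairs recovers the real dot product
lhs = (zq * zk.conj()).real.sum()
rhs = q @ k
assert torch.allclose(lhs, rhs, atol=1e-6)
```

`torch.view_as_complex` is the same reinterpretation trick the Llama-style implementation in section 8 relies on.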

3. Constructing RoPE

3.1 The Rotation

RoPE rotates each dimension pair $i$ by an angle proportional to the position $m$:

f(z_{q,i}, m) = z_{q,i} \cdot e^{j m \theta_i}

where $\theta_i$ is a dimension-specific frequency. The rotation angle for dimension pair $i$ at position $m$ is $m \theta_i$.

In the real-valued representation, this means applying a 2x2 rotation matrix to each dimension pair:

f\left(\begin{pmatrix} q_{2i-1} \\ q_{2i} \end{pmatrix}, m\right) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i-1} \\ q_{2i} \end{pmatrix}

For the full $d$-dimensional vector, RoPE applies $d/2$ independent 2D rotations, one per dimension pair. In block-diagonal matrix form:

R_m = \begin{pmatrix} R_m^{(1)} & & & \\ & R_m^{(2)} & & \\ & & \ddots & \\ & & & R_m^{(d/2)} \end{pmatrix}

where each block is:

R_m^{(i)} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}

The encoded query is $\tilde{q}_m = R_m q$ and the encoded key is $\tilde{k}_n = R_n k$.
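The block-diagonal matrix form and the complex form can be checked against each other. A small sketch (d=4, position m=3, and the two frequencies are arbitrary values for illustration):

```python
import torch

torch.manual_seed(0)
d, m = 4, 3.0
theta = torch.tensor([1.0, 0.1])  # one frequency per dimension pair (arbitrary)
q = torch.randn(d)

# Build the block-diagonal rotation matrix R_m
R = torch.zeros(d, d)
for i, t in enumerate(theta):
    c, s = torch.cos(m * t), torch.sin(m * t)
    R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = torch.stack(
        [torch.stack([c, -s]), torch.stack([s, c])])

# Complex route: multiply each pair by e^{j * m * theta_i}
z = torch.view_as_complex(q.reshape(d // 2, 2))
q_via_complex = torch.view_as_real(
    z * torch.polar(torch.ones(d // 2), m * theta)).flatten()

assert torch.allclose(R @ q, q_via_complex, atol=1e-6)
```

In practice nobody materializes $R_m$; the matrix is shown only to make the block structure concrete, and real implementations use the elementwise form.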

3.2 Defining the Frequencies

The frequency for dimension pair $i$ is:

\theta_i = \text{base}^{-2i/d}

where base is typically 10000 and $i = 0, 1, \ldots, d/2 - 1$ (zero-indexed here, to match the values below and standard implementations). Explicitly:

\theta_i = 10000^{-2i/d} = \frac{1}{10000^{2i/d}}

For $d = 128$ (the head dimension in Llama 3):

  • $i = 0$: $\theta_0 = 10000^{-0/128} = 1.0$ radians per position
  • $i = 1$: $\theta_1 = 10000^{-2/128} = 10000^{-1/64} \approx 0.866$
  • $i = 16$: $\theta_{16} = 10000^{-32/128} = 10000^{-1/4} = 10^{-1} = 0.1$
  • $i = 32$: $\theta_{32} = 10000^{-64/128} = 10000^{-1/2} = 0.01$
  • $i = 63$: $\theta_{63} = 10000^{-126/128} \approx 1.15 \times 10^{-4}$

[Chart: RoPE frequencies by dimension pair (d=128, base=10000), plotted on a log scale from theta = 1.0 rad/pos at i=0 (fastest) down to theta ≈ 1.15e-4 at i=63 (slowest).]

4. The Proof: Relative Position Dependence

4.1 Statement

We prove that:

\tilde{q}_m \cdot \tilde{k}_n = (R_m q) \cdot (R_n k) = g(q, k, m - n)

That is, the dot product depends on $m$ and $n$ only through their difference $m - n$.

4.2 Proof for a Single Dimension Pair

Consider dimension pair $i$. The encoded query and key in complex form are:

\tilde{z}_{q,i} = z_{q,i} \cdot e^{jm\theta_i}, \qquad \tilde{z}_{k,i} = z_{k,i} \cdot e^{jn\theta_i}

The contribution to the dot product from this dimension pair is:

\text{Re}(\tilde{z}_{q,i} \cdot \overline{\tilde{z}_{k,i}})

Expand:

= \text{Re}\left(z_{q,i} \cdot e^{jm\theta_i} \cdot \overline{z_{k,i} \cdot e^{jn\theta_i}}\right)

The complex conjugate distributes over multiplication:

= \text{Re}\left(z_{q,i} \cdot e^{jm\theta_i} \cdot \overline{z_{k,i}} \cdot e^{-jn\theta_i}\right)

Combine the exponentials:

= \text{Re}\left(z_{q,i} \cdot \overline{z_{k,i}} \cdot e^{j(m-n)\theta_i}\right)

This depends on $m$ and $n$ only through $m - n$. The proof is complete for a single dimension pair.

4.3 Proof for the Full Dot Product

The full dot product is the sum over all dimension pairs:

\tilde{q}_m \cdot \tilde{k}_n = \sum_{i=1}^{d/2} \text{Re}\left(z_{q,i} \cdot \overline{z_{k,i}} \cdot e^{j(m-n)\theta_i}\right)

Each term depends on $m - n$, so the sum depends on $m - n$. Defining $\Delta = m - n$:

\tilde{q}_m \cdot \tilde{k}_n = \sum_{i=1}^{d/2} \text{Re}\left(z_{q,i} \cdot \overline{z_{k,i}} \cdot e^{j\Delta\theta_i}\right)

This can be written equivalently as:

\tilde{q}_m \cdot \tilde{k}_n = (R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_{n-m} k = q^T R_{-(m-n)} k

where the last step uses $R_m^T R_n = R_{n-m}$ (rotation matrices compose by adding angles: $R_\alpha^T R_\beta = R_{\beta - \alpha}$).

ℹ️ The Key Property

$R_m^T R_n = R_{n-m}$. The transpose of a rotation by $m\theta$ is a rotation by $-m\theta$ (the inverse rotation), and rotations compose by adding angles, so $R_m^T R_n = R_{-m} R_n = R_{n-m}$. In the complex formulation: $e^{-jm\theta} \cdot e^{jn\theta} = e^{j(n-m)\theta}$. This single algebraic property is the entire reason RoPE works.
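A quick numeric check of this property, with arbitrary angles standing in for $m\theta$ and $n\theta$:

```python
import torch

def rot(a):
    """2x2 rotation matrix for angle a."""
    c = torch.cos(torch.tensor(a))
    s = torch.sin(torch.tensor(a))
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

m_angle, n_angle = 0.7, 0.2  # m*theta and n*theta for some dimension pair
lhs = rot(m_angle).T @ rot(n_angle)
rhs = rot(n_angle - m_angle)
assert torch.allclose(lhs, rhs, atol=1e-6)
```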

4.4 Expanded Real-Valued Form

For a single dimension pair $(q_1, q_2)$ and $(k_1, k_2)$ at relative position $\Delta = m - n$:

\text{Re}(z_q \cdot \overline{z_k} \cdot e^{j\Delta\theta})

Let $z_q \cdot \overline{z_k} = (q_1 k_1 + q_2 k_2) + j(q_2 k_1 - q_1 k_2)$. Call this $A + jB$ where:

A = q_1 k_1 + q_2 k_2 \quad (\text{the content dot product})
B = q_2 k_1 - q_1 k_2 \quad (\text{the content cross product})

Then:

\text{Re}((A + jB) \cdot e^{j\Delta\theta}) = A\cos(\Delta\theta) - B\sin(\Delta\theta)

= (q_1 k_1 + q_2 k_2)\cos(\Delta\theta) - (q_2 k_1 - q_1 k_2)\sin(\Delta\theta)

This reveals how content and position interact: the attention score is a weighted combination of the content dot product and cross product, with weights determined by the relative position through $\cos(\Delta\theta)$ and $\sin(\Delta\theta)$.

When $\Delta = 0$ (same position): $\cos(0) = 1$ and $\sin(0) = 0$, so the score is just $A = q_1 k_1 + q_2 k_2$, the pure content dot product. As $|\Delta|$ increases, the $\sin$ term introduces a position-dependent rotation of the score.
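A short check that the closed form matches rotating both pairs directly and taking the dot product (all values here are arbitrary):

```python
import math

q1, q2 = 0.8, -0.3
k1, k2 = 0.5, 1.2
m, n, theta = 7, 4, 0.25
delta = m - n

def rotate(x1, x2, angle):
    """Apply the 2x2 rotation to one dimension pair."""
    return (x1 * math.cos(angle) - x2 * math.sin(angle),
            x1 * math.sin(angle) + x2 * math.cos(angle))

# Direct route: rotate query pair by m*theta, key pair by n*theta, dot them
qr = rotate(q1, q2, m * theta)
kr = rotate(k1, k2, n * theta)
direct = qr[0] * kr[0] + qr[1] * kr[1]

# Closed form: A*cos(delta*theta) - B*sin(delta*theta)
A = q1 * k1 + q2 * k2
B = q2 * k1 - q1 * k2
closed = A * math.cos(delta * theta) - B * math.sin(delta * theta)

assert abs(direct - closed) < 1e-12
```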

5. Frequency Bands and Their Interpretation

5.1 Wavelengths

Each frequency $\theta_i$ corresponds to a wavelength (the number of positions for a full $2\pi$ rotation):

\lambda_i = \frac{2\pi}{\theta_i} = 2\pi \cdot \text{base}^{2i/d}

For $d = 128$ and base = 10000:

  • $i = 0$: $\lambda_0 = 2\pi \approx 6.28$ positions (rotates very fast)
  • $i = 16$: $\lambda_{16} = 2\pi \cdot 10 \approx 62.8$ positions
  • $i = 32$: $\lambda_{32} = 2\pi \cdot 100 \approx 628$ positions
  • $i = 48$: $\lambda_{48} = 2\pi \cdot 1000 \approx 6{,}283$ positions
  • $i = 63$: $\lambda_{63} = 2\pi \cdot 10000^{126/128} \approx 54{,}410$ positions
RoPE Wavelengths by Dimension Pair (d=128, base=10000)

Dim Pair (i)    Frequency (theta)    Wavelength (positions)    Encodes
0 (fastest)     1.0 rad/pos          6.3                       Sub-word adjacency
8               0.316 rad/pos        19.9                      Short phrases
16              0.1 rad/pos          62.8                      Sentence-level
32              0.01 rad/pos         628                       Paragraph-level
48              0.001 rad/pos        6,283                     Section-level
63 (slowest)    1.15e-4 rad/pos      54,410                    Document-level

Note: Wavelength = 2*pi / theta. A full rotation takes one wavelength of positions.

5.2 The Multi-Scale Representation

The geometric spacing of frequencies creates a multi-scale representation of position:

  • Fast dimensions ($i$ near 0): Rotate quickly, cycling through many full rotations within a few hundred positions. These dimensions encode fine-grained local position information: is this token 1, 2, or 3 positions away?
  • Slow dimensions ($i$ near $d/2$): Rotate slowly, barely changing over thousands of positions. These dimensions encode coarse global position information: is this token in the first half or second half of the document?

This is directly analogous to the binary representation of integers. The least significant bit (analogous to the fastest dimension) flips every number. The most significant bit (analogous to the slowest dimension) flips after half the range. Together, all bits uniquely encode every integer.

5.3 Attention Score Decay with Distance

For a single dimension pair $i$, the attention contribution as a function of relative distance $\Delta$ is:

s_i(\Delta) = A_i \cos(\Delta \theta_i) - B_i \sin(\Delta \theta_i) = C_i \cos(\Delta \theta_i + \phi_i)

where $C_i = \sqrt{A_i^2 + B_i^2}$ and $\phi_i = \arctan(B_i / A_i)$.

The full attention score is:

s(\Delta) = \sum_{i=1}^{d/2} C_i \cos(\Delta \theta_i + \phi_i)

This is a sum of cosines at different frequencies. For random $q$ and $k$ vectors, the amplitudes $C_i$ are roughly equal across dimensions, and the phases $\phi_i$ are random. When the content is aligned (the phases are near zero), the sum of cosines with geometrically spaced frequencies produces a function that:

  1. Peaks at $\Delta = 0$ (all cosines equal 1 there)
  2. Decays as $|\Delta|$ increases (the cosines drift out of phase)
  3. Has a characteristic decay length determined by the average frequency

This creates a natural locality bias: nearby tokens receive higher attention scores than distant ones, purely from the geometry of the rotation encoding. The model does not need to learn this bias; it emerges from the encoding.
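One way to see the bias concretely is to set k = q, so every phase $\phi_i$ is zero and the score reduces to $\sum_i |z_i|^2 \cos(\Delta\theta_i)$, which is maximal at $\Delta = 0$ and falls off with distance. A minimal sketch (d=128, base=10000 as above; the sample count is arbitrary):

```python
import torch

d, base = 128, 10000.0
theta = 1.0 / base ** (torch.arange(0, d, 2).float() / d)  # (d/2,)

torch.manual_seed(0)
z = torch.view_as_complex(torch.randn(2000, d // 2, 2))  # random content

def mean_score(delta):
    # k = q case: score(delta) = sum_i |z_i|^2 cos(delta * theta_i)
    phase = torch.polar(torch.ones_like(theta), delta * theta)
    return (z * z.conj() * phase).real.sum(dim=-1).mean().item()

for delta in [0, 1, 4, 16, 64, 256]:
    print(f"delta={delta:4d}  mean score = {mean_score(delta):7.1f}")
```

Because every cosine equals 1 at $\Delta = 0$, the score there is exactly $\sum_i |z_i|^2$ (about $d$ on average); no other distance can score higher for the same content.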

6. The Base Frequency Parameter

6.1 What Base Controls

The base parameter (default 10000) determines the range of frequencies:

\theta_i = \text{base}^{-2i/d}

  • Smallest frequency (slowest rotation): $\theta_{\min} = \text{base}^{-1+2/d} \approx \text{base}^{-1}$
  • Largest frequency (fastest rotation): $\theta_{\max} = \text{base}^0 = 1$
  • Longest wavelength: $\lambda_{\max} \approx 2\pi \cdot \text{base}$
  • Shortest wavelength: $\lambda_{\min} = 2\pi$

The base determines the maximum context length at which position can still be resolved. Beyond roughly $\lambda_{\max} = 2\pi \cdot \text{base}$ positions, the slowest-rotating dimension has completed a full rotation and the positional encoding starts to alias (positions $m$ and $m + \lambda_{\max}$ have the same encoding in the slowest dimension).

For base = 10000: $\lambda_{\max} \approx 62{,}832$ under the $\text{base}^{-1}$ approximation; the exact slowest wavelength for $d = 128$ is $2\pi \cdot 10000^{126/128} \approx 54{,}410$. This is sufficient for contexts up to roughly 54K tokens before the slowest dimension completes a full rotation.

6.2 Increasing the Base for Longer Contexts

To support longer contexts, increase the base. This slows down all rotations, extending the range before aliasing occurs:

  • base = 10000: max context ≈ 63K (Llama 2)
  • base = 500000: max context ≈ 3.14M (Llama 3 uses 500000)
  • base = 1000000: max context ≈ 6.28M (some models use this)

The tradeoff: a larger base means all frequencies are lower, which reduces position resolution at short distances. The fastest dimension still has $\theta_0 = 1$ (this does not change with base), but the intermediate dimensions all rotate more slowly. This compresses the frequency spectrum toward zero, potentially degrading fine-grained position discrimination.

In practice, the resolution loss from increasing the base is small because:

  1. The fastest dimensions ($\theta_0 = 1$) are unchanged
  2. The model has many dimension pairs ($d/2 = 64$ for Llama 3) to encode position
  3. The geometric spacing ensures good coverage even when compressed

The following script compares the frequency tables for the two bases:
import math

def compute_rope_frequencies(d, base=10000):
    """Compute RoPE rotation frequencies for each dimension pair."""
    freqs = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        wavelength = 2 * math.pi / theta
        freqs.append((i, theta, wavelength))
    return freqs

# Compare base=10000 vs base=500000
print("Base = 10000:")
for i, theta, wl in compute_rope_frequencies(128, base=10000):
    if i in [0, 16, 32, 48, 63]:
        print(f"  dim pair {i:2d}: theta={theta:.6f}, "
              f"wavelength={wl:.0f} positions")

print("\nBase = 500000 (Llama 3):")
for i, theta, wl in compute_rope_frequencies(128, base=500000):
    if i in [0, 16, 32, 48, 63]:
        print(f"  dim pair {i:2d}: theta={theta:.6f}, "
              f"wavelength={wl:.0f} positions")

Output:

Base = 10000:
  dim pair  0: theta=1.000000, wavelength=6 positions
  dim pair 16: theta=0.100000, wavelength=63 positions
  dim pair 32: theta=0.010000, wavelength=628 positions
  dim pair 48: theta=0.001000, wavelength=6283 positions
  dim pair 63: theta=0.000115, wavelength=54410 positions

Base = 500000 (Llama 3):
  dim pair  0: theta=1.000000, wavelength=6 positions
  dim pair 16: theta=0.037606, wavelength=167 positions
  dim pair 32: theta=0.001414, wavelength=4443 positions
  dim pair 48: theta=0.000053, wavelength=118143 positions
  dim pair 63: theta=0.000002, wavelength=2559196 positions

6.3 NTK-Aware Scaling

NTK-aware scaling (proposed in the open-source community in 2023, building on the position-interpolation work of Chen et al., 2023) scales the base when the target context exceeds the training length:

\text{base}_{\text{scaled}} = \text{base} \cdot \left(\frac{\alpha \cdot L_{\text{target}}}{L_{\text{train}}}\right)^{d/(d-2)}

where $\alpha$ is a scaling factor (typically 1-4), $L_{\text{target}}$ is the target context length, and $L_{\text{train}}$ is the training context length.

The idea: rather than simply interpolating positions (which compresses all frequencies equally), NTK-aware scaling increases the base, which primarily stretches the low-frequency components. High-frequency components (which encode local patterns) are largely unchanged. This preserves local position resolution while extending global range.

def ntk_aware_rope_frequencies(d, base=10000, target_len=131072,
                                 train_len=8192, alpha=2.0):
    """NTK-aware RoPE frequency scaling for context extension."""
    scale = alpha * target_len / train_len
    new_base = base * scale ** (d / (d - 2))

    freqs = []
    for i in range(d // 2):
        theta = new_base ** (-2.0 * i / d)
        wavelength = 2 * math.pi / theta
        freqs.append((i, theta, wavelength))

    print(f"Original base: {base}")
    print(f"Scale factor: {scale:.2f}")
    print(f"New base: {new_base:.0f}")
    return freqs

freqs = ntk_aware_rope_frequencies(
    d=128, base=10000, target_len=131072, train_len=8192, alpha=2.0
)

7. RoPE vs Other Position Encodings

7.1 Comparison

Position Encoding Methods Compared

Method                  Relative Position     Extrapolation               Parameters        Used In
Sinusoidal (Vaswani)    No (absolute)         Poor                        0                 Original Transformer
Learned absolute        No (absolute)         None                        O(L*d)            GPT-2, BERT
T5 relative bias        Yes                   Moderate                    O(n_heads * L)    T5, Flan-T5
ALiBi                   Yes (linear decay)    Good                        0                 BLOOM, MPT
RoPE                    Yes (rotation)        Good (with base scaling)    0                 Llama, Mistral, Qwen, Gemma

Note: L = max sequence length, d = model dimension. RoPE dominates modern decoder-only LLMs.

7.2 RoPE's Advantages

  1. Zero additional parameters: RoPE is a deterministic function of position and dimension. No learned parameters.
  2. Relative position by construction: Proven above. Not an approximation.
  3. Flexible context extension: Changing the base extends the context without retraining from scratch (usually with a modest amount of fine-tuning to adapt).
  4. Efficient computation: Applied element-wise with precomputed sin/cos tables. No matrix multiplications.
  5. Compatibility with the KV-cache: RoPE is applied to Q and K before caching, so cached K vectors already have position encoded. There is no need to recompute position encoding when extending the KV-cache.

7.3 RoPE's Limitations

  1. Even dimension requirement: Requires $d$ to be even (dimension pairing). All modern models satisfy this.
  2. No absolute position: RoPE encodes only relative position. Tasks that truly need absolute position (e.g., "what is the 5th word?") must learn it from relative patterns.
  3. Base frequency tuning: The base parameter must be set appropriately for the target context length. Too small and the encoding aliases; too large and local resolution degrades.

8. Implementation: Complete RoPE in PyTorch

8.1 Precomputing the Rotation Table

import torch
import math

def precompute_freqs_cis(dim, max_seq_len, base=10000.0):
    """
    Precompute the complex exponentials for RoPE.

    Returns a tensor of shape (max_seq_len, dim//2) containing
    complex numbers e^(j * m * theta_i) for each position m
    and dimension pair i.
    """
    # Frequencies for each dimension pair
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # freqs shape: (dim//2,)

    # Position indices
    t = torch.arange(max_seq_len, dtype=torch.float32)
    # t shape: (max_seq_len,)

    # Outer product: angle for each (position, dimension pair)
    angles = torch.outer(t, freqs)
    # angles shape: (max_seq_len, dim//2)

    # Complex exponentials
    freqs_cis = torch.polar(torch.ones_like(angles), angles)
    # freqs_cis shape: (max_seq_len, dim//2), dtype=complex64
    # Each entry is e^(j * m * theta_i)

    return freqs_cis

8.2 Applying RoPE to Queries and Keys

def apply_rotary_emb(xq, xk, freqs_cis):
    """
    Apply rotary position embeddings to query and key tensors.

    Args:
        xq: (B, S, H, D) query tensor
        xk: (B, S, H, D) key tensor
        freqs_cis: (S, D//2) complex rotation factors

    Returns:
        xq_out: (B, S, H, D) rotated queries
        xk_out: (B, S, H, D) rotated keys
    """
    # Reshape to pairs of dimensions and view as complex
    # (B, S, H, D) -> (B, S, H, D//2, 2) -> complex (B, S, H, D//2)
    xq_complex = torch.view_as_complex(
        xq.float().reshape(*xq.shape[:-1], -1, 2)
    )
    xk_complex = torch.view_as_complex(
        xk.float().reshape(*xk.shape[:-1], -1, 2)
    )

    # Reshape freqs_cis for broadcasting: (S, D//2) -> (1, S, 1, D//2)
    freqs_cis = freqs_cis.unsqueeze(0).unsqueeze(2)

    # Apply rotation: multiply by complex exponential
    xq_rotated = xq_complex * freqs_cis
    xk_rotated = xk_complex * freqs_cis

    # Convert back to real: (B, S, H, D//2) complex -> (B, S, H, D)
    xq_out = torch.view_as_real(xq_rotated).flatten(-2)
    xk_out = torch.view_as_real(xk_rotated).flatten(-2)

    return xq_out.type_as(xq), xk_out.type_as(xk)

8.3 Alternative: Sin/Cos Implementation (No Complex Numbers)

Some frameworks avoid complex number support. Here is the equivalent using sin and cos directly:

def precompute_rope_cache(dim, max_seq_len, base=10000.0):
    """
    Precompute cos and sin tables for RoPE.

    Returns:
        cos_cache: (max_seq_len, dim//2) cosine values
        sin_cache: (max_seq_len, dim//2) sine values
    """
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    angles = torch.outer(t, freqs)  # (max_seq_len, dim//2)

    cos_cache = torch.cos(angles)
    sin_cache = torch.sin(angles)
    return cos_cache, sin_cache

def apply_rope_real(x, cos_cache, sin_cache):
    """
    Apply RoPE using real-valued sin/cos rotation.

    Args:
        x: (B, S, H, D) input tensor (query or key)
        cos_cache: (S, D//2) precomputed cosines
        sin_cache: (S, D//2) precomputed sines

    The rotation for each pair (x1, x2):
        x1' = x1 * cos - x2 * sin
        x2' = x1 * sin + x2 * cos
    """
    B, S, H, D = x.shape
    half_d = D // 2

    # Split into even and odd dimensions
    x1 = x[..., :half_d]   # (B, S, H, D//2) - first of each pair
    x2 = x[..., half_d:]   # (B, S, H, D//2) - second of each pair

    # Reshape cos/sin for broadcasting: (S, D//2) -> (1, S, 1, D//2)
    cos_vals = cos_cache[:S].unsqueeze(0).unsqueeze(2)
    sin_vals = sin_cache[:S].unsqueeze(0).unsqueeze(2)

    # Apply 2D rotation to each pair
    x1_rot = x1 * cos_vals - x2 * sin_vals
    x2_rot = x1 * sin_vals + x2 * cos_vals

    # Concatenate back
    return torch.cat([x1_rot, x2_rot], dim=-1)
⚠️ Dimension Ordering Convention

There are two conventions for which dimensions form pairs. The original RoPE paper pairs adjacent dimensions: $(x_0, x_1), (x_2, x_3), \ldots$. Some implementations (including the Llama reference code) pair the first half with the second half: $(x_0, x_{d/2}), (x_1, x_{d/2+1}), \ldots$. The math is identical; only the permutation of dimensions differs. The sin/cos implementation above uses the first-half/second-half convention, while the complex implementation pairs adjacent dimensions. Make sure your implementation matches the model checkpoint you are loading.
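The equivalence of the two conventions is easy to demonstrate: the half-split rotation equals the adjacent-pair rotation applied to a permuted input. A self-contained sketch (d=8, position and base arbitrary):

```python
import torch

torch.manual_seed(0)
d, m, base = 8, 5.0, 10000.0
theta = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
cos, sin = torch.cos(m * theta), torch.sin(m * theta)
x = torch.randn(d)

def rotate_adjacent(v):
    """Adjacent-pair convention: (v0, v1), (v2, v3), ... share theta_i."""
    p = v.reshape(d // 2, 2)
    return torch.stack([p[:, 0] * cos - p[:, 1] * sin,
                        p[:, 0] * sin + p[:, 1] * cos], dim=-1).flatten()

def rotate_halves(v):
    """Half-split convention: (v_i, v_{i+d/2}) share theta_i."""
    a, b = v[:d // 2], v[d // 2:]
    return torch.cat([a * cos - b * sin, a * sin + b * cos])

# The two outputs differ only by a fixed permutation of dimensions
perm = torch.stack([torch.arange(d // 2),
                    torch.arange(d // 2) + d // 2], dim=1).flatten()
assert torch.allclose(rotate_halves(x)[perm], rotate_adjacent(x[perm]))
```

Because the same fixed permutation is applied to both queries and keys, the attention scores come out identical under either convention; only the weight layout of the checkpoint differs.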

8.4 Complete Attention with RoPE

import torch.nn as nn

class RoPEAttention(nn.Module):
    """Multi-head attention with Rotary Position Embedding."""

    def __init__(self, d_model, n_heads, max_seq_len=8192, base=10000.0):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        # Precompute RoPE frequencies
        self.register_buffer(
            'freqs_cis',
            precompute_freqs_cis(self.d_k, max_seq_len, base),
            persistent=False
        )

    def forward(self, x, start_pos=0, mask=None):
        """
        Args:
            x: (B, S, D) input
            start_pos: starting position for KV-cache scenarios
            mask: optional attention mask
        """
        B, S, D = x.shape
        H = self.n_heads
        dk = self.d_k

        # Project to Q, K, V
        q = self.W_q(x).view(B, S, H, dk)
        k = self.W_k(x).view(B, S, H, dk)
        v = self.W_v(x).view(B, S, H, dk)

        # Apply RoPE to Q and K (not V)
        freqs = self.freqs_cis[start_pos:start_pos + S]
        q, k = apply_rotary_emb(q, k, freqs)

        # Standard attention computation
        q = q.transpose(1, 2)  # (B, H, S, dk)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        scale = 1.0 / math.sqrt(dk)
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        if mask is not None:
            scores = scores + mask  # mask contains -inf for blocked positions

        weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(weights, v)

        output = output.transpose(1, 2).contiguous().view(B, S, D)
        return self.W_o(output)

# Test: verify relative position property
torch.manual_seed(42)
d_k = 64

# Single head, single token, so each score is a scalar
q = torch.randn(1, 1, 1, d_k)
k = torch.randn(1, 1, 1, d_k)

freqs = precompute_freqs_cis(d_k, 1024, base=10000.0)

# Compute dot product at positions (m=100, n=90) -> delta=10
q_100 = apply_rotary_emb(q, k, freqs[100:101])[0]
k_90 = apply_rotary_emb(q, k, freqs[90:91])[1]
score_100_90 = (q_100 * k_90).sum()

# Compute dot product at positions (m=500, n=490) -> delta=10
q_500 = apply_rotary_emb(q, k, freqs[500:501])[0]
k_490 = apply_rotary_emb(q, k, freqs[490:491])[1]
score_500_490 = (q_500 * k_490).sum()

# These should be identical (same relative position)
print(f"Score at (100, 90):  {score_100_90.item():.6f}")
print(f"Score at (500, 490): {score_500_490.item():.6f}")
print(f"Difference: {abs(score_100_90.item() - score_500_490.item()):.2e}")
# Difference should be ~0 (floating point only)

8.5 RoPE with KV-Cache for Inference

During autoregressive inference, we cache the K and V vectors. Since RoPE is applied to K before caching, the cached keys already have their position encoded. When generating the token at position $t$, we only need to compute RoPE for the new query and the new key at that position:

class RoPEAttentionWithCache(nn.Module):
    """RoPE attention with KV-cache for efficient inference."""

    def __init__(self, d_model, n_heads, max_seq_len=8192, base=10000.0):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        self.register_buffer(
            'freqs_cis',
            precompute_freqs_cis(self.d_k, max_seq_len, base),
            persistent=False
        )

        # KV cache (initialized on first call)
        self.k_cache = None
        self.v_cache = None

    def forward(self, x, start_pos):
        """
        Args:
            x: (B, S, D) input. S=full_seq during prefill, S=1 during decode.
            start_pos: position of first token in x.
        """
        B, S, D = x.shape
        H = self.n_heads
        dk = self.d_k

        q = self.W_q(x).view(B, S, H, dk)
        k = self.W_k(x).view(B, S, H, dk)
        v = self.W_v(x).view(B, S, H, dk)

        # Apply RoPE to new Q and K
        freqs = self.freqs_cis[start_pos:start_pos + S]
        q, k = apply_rotary_emb(q, k, freqs)

        # Update KV cache (concat for clarity; production code preallocates a fixed-size buffer)
        if self.k_cache is None:
            self.k_cache = k
            self.v_cache = v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)

        # Attention: new queries attend to all cached keys
        q = q.transpose(1, 2)                    # (B, H, S, dk)
        k_all = self.k_cache.transpose(1, 2)     # (B, H, T, dk)
        v_all = self.v_cache.transpose(1, 2)     # (B, H, T, dk)

        scale = 1.0 / math.sqrt(dk)
        scores = torch.matmul(q, k_all.transpose(-2, -1)) * scale

        # Causal mask: new tokens can attend to all previous + self
        T = k_all.shape[2]
        causal_mask = torch.triu(
            torch.full((S, T), float('-inf'), device=x.device),
            diagonal=T - S + 1
        )
        scores = scores + causal_mask

        weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(weights, v_all)

        output = output.transpose(1, 2).contiguous().view(B, S, D)
        return self.W_o(output)

    def reset_cache(self):
        self.k_cache = None
        self.v_cache = None
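
The mask construction in `forward` is the subtle part: `diagonal=T - S + 1` covers both the prefill case (S equals T) and the decode case (S equals 1). A standalone sketch of just that formula, independent of the class above:

```python
import torch

def decode_step_mask(S, T):
    """Causal mask for S new queries attending to T total keys (cache + new).

    Query i sits at absolute position (T - S) + i, so it may attend to
    key positions 0 .. (T - S) + i; every later key gets -inf.
    """
    return torch.triu(
        torch.full((S, T), float('-inf')),
        diagonal=T - S + 1,
    )

# Prefill: 4 new tokens, empty cache -> the usual lower-triangular causal mask
print(decode_step_mask(4, 4))

# Decode: 1 new token with 4 cached keys -> a row of zeros (attend to everything)
print(decode_step_mask(1, 5))
```

During decode the mask row is all zeros, which is why many implementations skip the mask addition entirely when S is 1.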

9 Numerical Verification

9.1 Verifying the Relative Position Property

def verify_rope_relative_property(d=64, n_tests=1000, base=10000.0):
    """
    Verify that RoPE attention scores depend only on relative position.

    For random q, k and positions (m1, n1), (m2, n2) where
    m1 - n1 = m2 - n2, the attention scores should be identical.
    """
    freqs = precompute_freqs_cis(d, 10000, base)
    max_error = 0.0

    for _ in range(n_tests):
        q = torch.randn(d)
        k = torch.randn(d)

        # Random positions with same relative distance
        delta = torch.randint(0, 100, (1,)).item()
        m1 = torch.randint(0, 5000, (1,)).item()
        n1 = m1 - delta
        m2 = torch.randint(0, 5000, (1,)).item()
        n2 = m2 - delta

        if n1 < 0 or n2 < 0:
            continue

        # Apply RoPE
        q_r = q.reshape(1, 1, 1, d)
        k_r = k.reshape(1, 1, 1, d)

        q1, _ = apply_rotary_emb(q_r, k_r, freqs[m1:m1+1])
        _, k1 = apply_rotary_emb(q_r, k_r, freqs[n1:n1+1])
        score1 = (q1 * k1).sum().item()

        q2, _ = apply_rotary_emb(q_r, k_r, freqs[m2:m2+1])
        _, k2 = apply_rotary_emb(q_r, k_r, freqs[n2:n2+1])
        score2 = (q2 * k2).sum().item()

        error = abs(score1 - score2)
        max_error = max(max_error, error)

    print(f"Max error over {n_tests} tests: {max_error:.2e}")
    print(f"Property verified: {max_error < 1e-4}")

verify_rope_relative_property()
# Expected: Max error ~1e-6 to 1e-5 (floating point only)

9.2 Verifying Attention Score Decay

def measure_attention_decay(d=128, base=10000.0, max_delta=2000):
    """
    Measure how RoPE attention scores decay with relative distance.
    Uses random q and k, averaged over many samples.
    """
    freqs = precompute_freqs_cis(d, max_delta + 100, base)
    n_samples = 500
    scores_by_delta = {}

    for delta in [0, 1, 2, 5, 10, 50, 100, 500, 1000, 2000]:
        total_score = 0.0
        for _ in range(n_samples):
            q = torch.randn(d)
            k = torch.randn(d)

            q_r = q.reshape(1, 1, 1, d)
            k_r = k.reshape(1, 1, 1, d)

            m = delta + 50
            n = 50

            q_rot, _ = apply_rotary_emb(q_r, k_r, freqs[m:m+1])
            _, k_rot = apply_rotary_emb(q_r, k_r, freqs[n:n+1])
            score = (q_rot * k_rot).sum().item()
            total_score += abs(score)

        avg_score = total_score / n_samples
        scores_by_delta[delta] = avg_score
        print(f"delta={delta:5d}: avg |score| = {avg_score:.4f}")

    return scores_by_delta

measure_attention_decay()

Summary

RoPE encodes position by rotating query and key vectors in the complex plane, with rotation angles proportional to position and frequencies that vary geometrically across dimensions. The construction guarantees that attention scores depend only on relative position; this is proven, not approximated. The multi-scale frequency structure creates dimension pairs ranging from sub-word locality (fast rotation) to document-level structure (slow rotation). The base parameter controls the maximum resolvable context length and can be scaled to extend context beyond the training length.

The complete derivation chain:

  1. Requirement: $f(q, m) \cdot f(k, n) = g(q, k, m-n)$
  2. Representation: pair dimensions as complex numbers $z = x_1 + jx_2$
  3. Solution: rotate by a position-dependent angle: $z \cdot e^{jm\theta}$
  4. Proof: $\text{Re}(z_q e^{jm\theta} \cdot \overline{z_k e^{jn\theta}}) = \text{Re}(z_q \overline{z_k} \cdot e^{j(m-n)\theta})$
  5. Frequencies: $\theta_i = \text{base}^{-2i/d}$ gives geometric spacing from local to global
  6. Base: controls maximum resolvable context length ($\lambda_{\max} = 2\pi \cdot \text{base}$)
  7. Implementation: precompute sin/cos tables, apply element-wise rotation to Q and K
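
Steps 5 and 6 are easy to check numerically. A quick sketch using a head dimension of 128 and base 10000 (common defaults):

```python
import math

d, base = 128, 10000.0

# Step 5: theta_i = base**(-2i/d), one rotation angle per dimension pair
thetas = [base ** (-2 * i / d) for i in range(d // 2)]

# Geometric spacing: every adjacent ratio equals base**(-2/d)
ratio = base ** (-2 / d)
assert all(math.isclose(b / a, ratio) for a, b in zip(thetas, thetas[1:]))

# Step 6: wavelength of pair i is 2*pi / theta_i tokens per full rotation
print(f"fastest pair: {2 * math.pi / thetas[0]:.2f} tokens")   # 2*pi ~ 6.28
print(f"slowest pair: {2 * math.pi / thetas[-1]:.0f} tokens")  # ~54,000
```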

RoPE has become the dominant position encoding for decoder-only LLMs. Llama, Mistral, Qwen, Gemma, DeepSeek, and most open-weight models use RoPE. Its zero-parameter design, mathematical guarantee of relative position dependence, and compatibility with KV-caching and context extension make it the standard choice.
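
The context-extension point can be made concrete: raising the base stretches every wavelength, which is the basic lever behind base-scaling schemes. A sketch of the arithmetic only (the base values are illustrative, not any particular model's recipe):

```python
import math

def slowest_wavelength(d, base):
    """Wavelength (in tokens) of the slowest-rotating dimension pair,
    i = d/2 - 1, where theta_i = base**(-2i/d)."""
    theta_min = base ** (-(d - 2) / d)
    return 2 * math.pi / theta_min

# A larger base means the slowest pair needs more positions to complete
# one rotation, so the same angle range spans a longer context.
print(f"base 10000:  {slowest_wavelength(128, 10000.0):,.0f} tokens")
print(f"base 500000: {slowest_wavelength(128, 500000.0):,.0f} tokens")
```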
