Part 8 of 23 in the Transformer Anatomy series

By this point in the series you know every sublayer in a transformer block: attention computes token interactions, feed-forward networks apply pointwise nonlinear transformations, and normalization keeps activations well-behaved. Each piece is well-understood in isolation. But understanding the pieces does not explain how you can stack 80, 96, or even 128 of them on top of each other and still get meaningful gradients all the way back to the embedding layer.

The answer is the residual connection — the simple addition $x + f(x)$ that wraps every sublayer. It is the single most important structural decision in the transformer architecture. Without it, deep transformers are untrainable. With it, we can build models with over a hundred layers that converge reliably. This post explains why.

We will start with the fundamental problem that depth creates, trace the insight from ResNet through its application in transformers, derive the gradient flow mathematics rigorously, and then explore the modern “residual stream” perspective that has reshaped how researchers at Anthropic and elsewhere think about what transformers are actually doing.


1. The Depth Problem

Why Deep Networks Are Hard to Train

The promise of deep networks is compositional representation: each layer builds increasingly abstract features on top of the previous layer’s output. Layer 1 detects edges, layer 5 detects textures, layer 20 detects objects — or in the language domain, layer 1 captures token identity, layer 20 captures syntactic roles, layer 60 captures complex reasoning patterns. Depth is what gives neural networks their extraordinary representational power.

But depth comes at a brutal cost during training. The fundamental issue is that backpropagation computes gradients by applying the chain rule through every layer in sequence. For a network with $L$ layers, each computing $x_{l+1} = f_l(x_l)$, the gradient of the loss with respect to the input is:

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{l=1}^{L} \frac{\partial f_l}{\partial x_{l-1}}$$

This product of $L$ Jacobian matrices is where everything goes wrong.

Vanishing Gradients

If each layer’s Jacobian has spectral norm slightly less than 1 — say 0.95 — the gradient magnitude after $L$ layers is roughly $0.95^L$. For $L = 20$, that is $0.95^{20} \approx 0.36$ — already losing two-thirds of the signal. For $L = 50$, it is $0.95^{50} \approx 0.077$ — the gradient reaching the early layers is less than 8% of the gradient at the output. For $L = 100$, it is $0.95^{100} \approx 0.006$. The early layers receive essentially zero gradient and stop learning.

This is the vanishing gradient problem, and it is exponential in depth. It does not matter how good your optimizer is or how large your learning rate is — if the gradient signal has been multiplied by $10^{-3}$ by the time it reaches layer 1, the early layers are frozen.

Exploding Gradients

The opposite problem is equally destructive. If each Jacobian has spectral norm slightly greater than 1 — say 1.05 — the product grows as $1.05^L$. At $L = 100$, that is $1.05^{100} \approx 131$. Gradients grow by two orders of magnitude, causing weight updates that are far too large, which destabilizes training and can push the loss to infinity.

🚨 The Narrow Stability Window

For a deep network without residual connections to train stably, each layer’s Jacobian must have spectral norm very close to exactly 1. At $L = 100$ layers, even a 2% deviation per layer compounds to roughly 7x growth ($1.02^{100} \approx 7.2$) or 0.13x shrinkage ($0.98^{100} \approx 0.13$). Maintaining this balance across all layers, all training steps, and all data points is practically impossible with standard initialization and optimization.
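The compounding arithmetic in this section is easy to reproduce; a quick sanity check:

```python
# Reproduce the compounding-gradient arithmetic: a per-layer Jacobian
# norm r compounds to r**L over L layers.
def compounded(r: float, depth: int) -> float:
    """Gradient scale surviving after `depth` layers of per-layer scale `r`."""
    return r ** depth

print(compounded(0.95, 20))   # ~0.36  (vanishing: two-thirds of the signal gone)
print(compounded(0.95, 100))  # ~0.006 (early layers effectively frozen)
print(compounded(1.05, 100))  # ~131   (exploding)
print(compounded(1.02, 100))  # ~7.2   (narrow window: +2% per layer)
print(compounded(0.98, 100))  # ~0.13  (narrow window: -2% per layer)
```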

The Shattering Gradient Problem

Balduzzi et al. (2017) identified a subtler issue beyond simple magnitude: gradient directions become increasingly chaotic in deep networks without skip connections. They showed that the gradient of a deep feed-forward network with respect to its input resembles white noise as depth increases. Specifically, the cosine similarity between the gradient at one input and the gradient at a nearby input decays exponentially with depth.

This means that even if you manage to control the gradient magnitude (through careful initialization, gradient clipping, or other tricks), the gradient direction at the early layers is essentially random — it carries no useful information about how to update the weights. The gradients have “shattered.”

The Empirical Wall

These three problems — vanishing magnitude, exploding magnitude, and shattering direction — create a hard practical limit. Before residual connections, networks deeper than approximately 20 layers were extremely difficult to train. The landmark VGG network (2014) used 19 layers and was considered very deep for its time. Attempts to go deeper produced networks that converged to worse solutions than their shallower counterparts, not because they lacked capacity, but because optimization completely failed.

Training Loss After 90 Epochs (CIFAR-10, No Residuals)

  • 10 layers: 0.12 loss
  • 20 layers: 0.18 loss
  • 30 layers: 0.42 loss
  • 56 layers: 0.85 loss
  • 110 layers: 1.45 loss

The trend is clear: adding depth hurts without residual connections. A 56-layer plain network performs worse than its 20-layer counterpart. This is not overfitting — it is optimization failure. The network cannot learn.


2. ResNet’s Insight: Learning Modifications, Not Transformations

The Key Idea

In December 2015, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun published “Deep Residual Learning for Image Recognition,” a paper that fundamentally changed deep learning. The idea was disarmingly simple.

Instead of learning a desired mapping $H(x)$ directly, restructure the layer to learn the residual:

$$H(x) = x + g(x)$$

where $g(x)$ is what the layer actually computes (the “residual function”), and $x$ is passed through unchanged via a skip connection. The output is the input plus a learned modification.

This single change made it possible to train networks with 152 layers — and eventually over 1,000 layers — where plain networks failed completely at 56.

Why This Works: The Identity Highway

The critical insight is about what happens when $g(x) \approx 0$. If a layer’s contribution is small — either because the layer has not learned anything useful yet, or because the task does not require that layer’s transformation — the identity mapping $x \mapsto x$ passes the input through unchanged. The layer is effectively “transparent.”

In a plain network, learning the identity function $f(x) = x$ is actually difficult. The weights must be carefully coordinated to produce an identity mapping, which is a non-trivial optimization target in a nonlinear network. But in a residual network, the “do nothing” behavior is the default. The layer only needs to learn $g(x) = 0$, which is trivially achievable with zero-initialized or small weights.

Σ Theorem: Residual Learning Principle

If the optimal mapping for a given layer is close to the identity function, it is easier to learn the residual $g(x) = H(x) - x \approx 0$ than to learn $H(x) \approx x$ directly. Residual connections reframe each layer’s task from “compute the full output” to “compute a small correction to the input.”

The Mental Model: A Stack of Refinements

Think of a residual network not as a pipeline of complete transformations, but as a process of iterative refinement. The input enters with some initial representation. Each layer examines that representation and says: “Here is a small adjustment.” The representation accumulates these adjustments as it flows through the network.

This is a fundamentally different computational paradigm. In a plain network, layer 50’s output may bear little resemblance to layer 1’s output — the representation is replaced at every step. In a residual network, layer 50’s output is layer 1’s output plus 50 accumulated corrections. The original information is never destroyed; it persists through the entire network unless a layer actively subtracts it.

This persistence of information is what makes deep training possible. Even if layers 10 through 40 produce negligible corrections (because their gradients are too small for meaningful learning), the information from layers 1 through 9 passes through to layers 41 and beyond. The network can still function as a 10-layer network while gradually incorporating the intermediate layers as training progresses.

The Transformer Application

The transformer architecture applies residual connections at every sublayer. Each transformer block computes:

$$x = x + \text{Attention}(x)$$
$$x = x + \text{FFN}(x)$$

(With normalization in the appropriate places, which we will address shortly.) A 96-layer transformer has 192 residual connections — one for each attention sublayer and one for each FFN sublayer. Every single sublayer is wrapped in a skip path.

This is not optional. Every major transformer architecture — GPT, Llama, PaLM, Gemini, Claude — uses residual connections at every sublayer. No one has successfully trained a deep transformer without them.
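As a concrete sketch of this wiring (illustrative, not any particular model’s implementation), here is one block with the attention and FFN bodies stubbed out as placeholder functions; the point is that every sublayer output is added to the stream, never substituted for it:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last dimension to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attention, ffn):
    # Pre-norm residual wiring: each sublayer's output is ADDED to the
    # stream; the input itself is never replaced.
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

# "Do nothing" is the default: with sublayers that output zero, the block
# is exactly the identity, no matter how many blocks are stacked.
x = np.random.default_rng(0).standard_normal((4, 8))
zero = lambda h: np.zeros_like(h)
y = x
for _ in range(100):
    y = transformer_block(y, attention=zero, ffn=zero)
assert np.allclose(y, x)
```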


3. Gradient Flow Analysis: The Mathematics of Why Residuals Work

The Fundamental Equation

Let us derive the gradient flow through a residual network rigorously. Consider $L$ residual layers:

$$x_{l+1} = x_l + g_l(x_l) \quad \text{for } l = 0, 1, \ldots, L-1$$

We want the gradient of the loss $\mathcal{L}$ with respect to $x_0$. By the chain rule:

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{l=0}^{L-1} \frac{\partial x_{l+1}}{\partial x_l}$$

Now compute each factor:

$$\frac{\partial x_{l+1}}{\partial x_l} = \frac{\partial}{\partial x_l}\bigl(x_l + g_l(x_l)\bigr) = I + \frac{\partial g_l}{\partial x_l}$$

where $I$ is the identity matrix. So the full gradient is:

$$\frac{\partial \mathcal{L}}{\partial x_0} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{l=0}^{L-1}\left(I + \frac{\partial g_l}{\partial x_l}\right)$$

This is the key equation. Compare it to the plain network version $\prod_{l} \frac{\partial f_l}{\partial x_l}$ — the only difference is the $I +$ inside each factor.

Expanding the Product: $2^L$ Implicit Paths

When you expand the product $\prod_{l=0}^{L-1}(I + J_l)$, where $J_l = \frac{\partial g_l}{\partial x_l}$, you get a sum over all $2^L$ subsets of layers:

$$\prod_{l=0}^{L-1}(I + J_l) = I + \sum_{l} J_l + \sum_{l_1 < l_2} J_{l_1} J_{l_2} + \cdots + \prod_{l=0}^{L-1} J_l$$

Σ Theorem: Exponential Path Decomposition

A residual network with $L$ layers creates $2^L$ implicit gradient paths, one for each subset $S \subseteq \{0, 1, \ldots, L-1\}$. The gradient through subset $S$ is $\prod_{l \in S} J_l$, while all layers not in $S$ contribute only the identity (pass-through). The total gradient is the sum of the gradients through all $2^L$ paths.

This is the profound insight. In a plain network, there is exactly one path from the output to the input, and it passes through every layer. If any single layer has a small Jacobian, the entire gradient vanishes. In a residual network, there are $2^L$ paths, and even if most of them carry negligible gradient, the short paths (those that skip many layers) remain viable.
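The expansion above can be verified numerically for a small $L$; a sketch with random Jacobians:

```python
import numpy as np
from itertools import combinations

# Verify the 2^L path expansion for L = 3: the product over (I + J_l)
# equals the sum, over all 8 subsets S of {0, 1, 2}, of the product of
# the J_l with l in S (taken in increasing layer order).
rng = np.random.default_rng(0)
d, L = 4, 3
Js = [rng.standard_normal((d, d)) for _ in range(L)]

product = np.eye(d)
for J in Js:
    product = product @ (np.eye(d) + J)

expansion = np.zeros((d, d))
for r in range(L + 1):
    for subset in combinations(range(L), r):   # all 2^L subsets
        term = np.eye(d)
        for l in subset:
            term = term @ Js[l]                # path through the layers in S
        expansion += term

assert np.allclose(product, expansion)         # 8 paths, summed, match exactly
```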

The Direct Path

The most important single path is the one corresponding to $S = \emptyset$ — the path that skips all layers and passes only through the identity connections:

$$\frac{\partial \mathcal{L}}{\partial x_0}\bigg|_{S = \emptyset} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot I = \frac{\partial \mathcal{L}}{\partial x_L}$$

This is the “gradient highway.” Regardless of what happens in the $g_l$ functions — regardless of how small their Jacobians are, regardless of whether they exhibit vanishing or exploding behavior — this direct path carries the output gradient back to the input unchanged. It cannot vanish. It cannot explode (unless the loss gradient itself does). It is an unconditional guarantee that every layer in the network receives at least the raw output gradient.

Effective Path Length Distribution

Veit, Wilber, and Belongie (2016) studied the effective path lengths empirically and found a striking result: during training, the gradient signal is dominated by paths of moderate length. In a 110-layer ResNet, the paths that contribute meaningful gradient are those of length roughly 10–30. The very short paths (length 0–5) carry gradient per path but are few in number, and the very long paths (length 80+) have vanishing contributions because they pass through many contractive Jacobians. The bulk of the learning happens through paths of intermediate length.

This means that a 110-layer ResNet effectively behaves like an ensemble of shallower networks of various depths. It is not a single 110-layer computation; it is an exponential collection of computations of different depths, all sharing parameters, all contributing to the final output.

ℹ️ The Ensemble Interpretation

A residual network with $L$ layers behaves as an implicit ensemble of $2^L$ networks of different depths. During training, the gradient naturally concentrates on paths of moderate length. Deleting a single layer from a trained ResNet causes only a small performance drop — unlike a plain network, where removing any layer is catastrophic.

Quantitative Gradient Health

Let us put concrete numbers on this. Suppose each layer’s Jacobian $J_l$ has spectral norm 0.8 (a significant vanishing-gradient factor). In a plain network with $L = 80$ layers:

$$\|J_{\text{plain}}\| \leq 0.8^{80} \approx 1.5 \times 10^{-8}$$

The gradient is effectively zero. Now consider the residual network. The direct path alone contributes gradient of norm $\|\frac{\partial \mathcal{L}}{\partial x_L}\|$. The length-1 paths contribute $\sum_l J_l$ — there are 80 of them, each with norm around 0.8. The length-2 paths contribute $\sum_{l_1 < l_2} J_{l_1} J_{l_2}$ — there are $\binom{80}{2} = 3{,}160$ of them, each with norm around $0.64$. Even though individual long paths have vanishing gradient, the number of paths at each length grows combinatorially, which partially compensates for the per-path decay.

Gradient Norm at Layer 0 (80-Layer Network, Jacobian Norm = 0.8)

  • Plain network ($0.8^{80} \approx 1.5 \times 10^{-8}$): ~0 relative
  • Residual, direct path only (identity passthrough): 1 relative
  • Residual, all paths (sum of $2^{80}$ paths): 1.8 relative

The residual network maintains healthy gradient magnitude even when the per-layer Jacobians are strongly contractive. This is the mathematical reason deep transformers work.
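A small numerical experiment makes the gap vivid. This is an illustrative setup of my own (random Jacobians rescaled to spectral norm 0.8), not a trained model:

```python
import numpy as np

# Compare gradient propagation through 80 contractive layers with and
# without the identity term. Each J_l is a random matrix rescaled so its
# spectral norm (largest singular value) is exactly 0.8.
rng = np.random.default_rng(0)
d, L, target = 16, 80, 0.8

plain = np.eye(d)
residual = np.eye(d)
for _ in range(L):
    J = rng.standard_normal((d, d))
    J *= target / np.linalg.norm(J, 2)        # set spectral norm to 0.8
    plain = plain @ J                          # plain chain-rule product
    residual = residual @ (np.eye(d) + J)      # residual chain-rule product

print(np.linalg.norm(plain, 2))     # at most 0.8**80 ~ 1.5e-8: vanished
print(np.linalg.norm(residual, 2))  # many orders of magnitude larger
```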


4. The Transformer Residual Stream

A New Mental Model

In 2021, Nelson Elhage, Neel Nanda, Catherine Olsson, and other researchers at Anthropic published “A Mathematical Framework for Transformer Circuits,” which introduced a powerful reframing of how to think about residual connections in transformers. Rather than viewing them as a gradient-flow trick, they proposed thinking of the residual connection as the primary object — the residual stream.

The idea is this: think of the $d_{\text{model}}$-dimensional vector at each token position as a communication channel. This vector persists through the entire network. It enters as the token embedding, flows through every layer, and exits at the unembedding layer to produce logits. The residual stream is the transformer’s state.

Each sublayer — every attention head, every FFN — is not a “stage” that the data passes through. Instead, each sublayer is an independent operator that reads from and writes to the residual stream. The stream itself is the persistent state; the sublayers are operations that modify it.

Reading and Writing

Under this mental model, each attention head performs three operations:

  1. Read: The head projects the residual stream through its $W_Q$, $W_K$, and $W_V$ matrices to extract queries, keys, and values. This is a read operation — the head selects what information from the stream it wants to process.

  2. Compute: The head performs the attention computation (dot products, softmax, weighted sum) on its extracted information. This computation is entirely internal to the head.

  3. Write: The head projects its output through $W_O$ and adds the result back to the residual stream. This is a write operation — the head deposits its computed information into the stream for subsequent layers to use.

The FFN operates similarly: it reads from the stream via its input projection, computes through its hidden layer and nonlinearity, and writes back via its output projection.

💡 The Stream Is the Model

In the residual stream view, the “true” state of the transformer is the $d_{\text{model}}$-dimensional vector at each position. Attention heads and FFN layers are peripheral devices that read from and write to this shared bus. The residual connection is not a trick to help with gradient flow — it is the architecture. Everything else is a reader/writer attached to it.

Composition Through the Stream

This perspective makes layer interactions much clearer. Consider how an “induction head” works (a circuit that performs in-context learning). It requires two attention heads:

  1. Head A (in an earlier layer) detects that token $B$ follows token $A$ somewhere in the context. It writes information about this $A \to B$ pattern into the residual stream.

  2. Head B (in a later layer) reads the pattern information that Head A wrote. When it sees token $A$ appear again, it uses the stored pattern to predict that $B$ should follow.

These two heads never communicate directly. Head A writes to the residual stream; Head B reads from the residual stream. The stream is the shared memory that enables composition across layers. Without the residual connection, this information would need to survive transformation by every intermediate layer — with residuals, it is simply added to the stream and persists until it is read.

The Linear Algebraic View

Mathematically, the output of a transformer with $L$ blocks (and 2 sublayers per block) can be written as:

$$x_{\text{final}} = x_0 + \sum_{l=1}^{2L} h_l(x_0, x_1, \ldots, x_{l-1})$$

where $x_0$ is the token embedding and each $h_l$ is the output of a sublayer (either an attention sublayer or an FFN). The final representation is the embedding plus the sum of all sublayer contributions.

This additive structure has a crucial property: it is linear in the sublayer outputs. Even though each $h_l$ is a nonlinear function of previous states, their contributions to the final output combine linearly. This means:

  • The contribution of head 3 in layer 5 and the contribution of the FFN in layer 12 do not interact multiplicatively — they simply add.
  • You can analyze each sublayer’s contribution independently by examining what vector it writes to the stream.
  • The residual stream at any point is a sum of all previous contributions, making it amenable to linear-algebraic analysis (projections, decompositions, etc.).

This linearity is what makes mechanistic interpretability possible. If sublayer outputs were composed multiplicatively (as in a plain network), understanding any individual component would require understanding its interaction with every other component.
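A sketch of what that decomposition looks like in code, with toy sublayers standing in for attention heads and FFNs: record each sublayer’s write during a forward pass, and the final state is exactly the embedding plus the sum of the writes.

```python
import numpy as np

# The residual stream is the embedding plus the sum of every sublayer's
# write. Toy tanh sublayers stand in for attention heads and FFNs.
rng = np.random.default_rng(0)
d, n_sublayers = 8, 6
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_sublayers)]

x0 = rng.standard_normal(d)       # "token embedding"
x, writes = x0, []
for W in weights:
    w = np.tanh(x @ W)            # nonlinear function of the current stream
    writes.append(w)              # record what this sublayer wrote
    x = x + w                     # the write is purely additive

# Linear decomposition: final state = embedding + sum of recorded writes.
assert np.allclose(x, x0 + np.sum(writes, axis=0))
```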

The Stream as a Bandwidth-Limited Bus

The residual stream has exactly $d_{\text{model}}$ dimensions. Every attention head and every FFN in the entire network must communicate through this fixed-width channel. In GPT-3, $d_{\text{model}} = 12{,}288$. There are 96 layers with 96 attention heads each (9,216 total heads) plus 96 FFN layers — over 9,300 sublayers sharing 12,288 dimensions.

This creates a severe bandwidth constraint. If each head needed exclusive use of some dimensions, we would need far more dimensions than $d_{\text{model}}$ provides. The network must learn to share dimensions efficiently, which leads directly to the superposition phenomenon we discuss in Section 7.


5. Scaling Factors: Keeping the Residual Stream Bounded

The Variance Growth Problem

Each sublayer adds its output to the residual stream. If sublayer outputs have nonzero variance, the variance of the stream grows with depth:

$$\text{Var}(x_L) = \text{Var}(x_0) + \sum_{l=1}^{L} \text{Var}(h_l)$$

(assuming the contributions are approximately uncorrelated, which is a reasonable first-order approximation early in training). If each sublayer contributes variance $\sigma^2$, the stream variance after $L$ sublayers is:

$$\text{Var}(x_L) \approx \text{Var}(x_0) + L \cdot \sigma^2$$

For a 96-layer transformer with 192 sublayers, if each sublayer contributes variance 0.01, the total accumulated variance is $192 \times 0.01 = 1.92$. If the initial embedding variance is 1.0, the final variance is roughly 2.92 — about 3x the initial value. The activation magnitudes have grown by a factor of $\sqrt{3} \approx 1.7$.

This might seem manageable, but it compounds with the actual sublayer output magnitudes and can cause numerical issues, especially in FP16/BF16 where the dynamic range is limited.
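The additive variance growth is easy to check empirically under the same independence assumption:

```python
import numpy as np

# Empirical check of additive variance growth: a unit-variance "stream"
# receives 192 independent writes of variance 0.01 each, so the final
# variance should be about 1 + 192 * 0.01 = 2.92.
rng = np.random.default_rng(0)
n = 500_000                                  # samples, for a tight estimate

stream = rng.standard_normal(n)              # Var ~ 1.0 (the embedding)
for _ in range(192):                         # one write per sublayer
    stream += 0.1 * rng.standard_normal(n)   # each write has Var 0.01

print(round(stream.var(), 2))                # ~2.92
```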

GPT-2’s Scaling Approach

GPT-2 introduced a simple fix: scale the output projection of each sublayer by $\frac{1}{\sqrt{2N}}$, where $N$ is the number of layers. Specifically, the weights of the final linear projection in each attention sublayer and each FFN sublayer are initialized with standard deviation scaled down by this factor.

The factor $2N$ accounts for the fact that there are $2N$ sublayers (attention + FFN per layer) contributing to the residual stream. The square root comes from the relationship between weight scale and output variance. By making each contribution $\frac{1}{\sqrt{2N}}$ smaller, the accumulated variance after all sublayers returns to approximately the same magnitude as the initial embedding variance.
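A minimal sketch of this initialization rule, assuming a base standard deviation of 0.02 (GPT-2’s usual init scale); the function name and shapes are mine, not GPT-2’s code:

```python
import math
import numpy as np

# GPT-2-style residual-projection init: shrink the output projection's
# initialization std by 1/sqrt(2N), where N is the layer count, so the
# 2N sublayer writes accumulate to roughly the embedding's variance.
def residual_proj_init(d_in, d_out, n_layers, base_std=0.02, seed=0):
    std = base_std / math.sqrt(2 * n_layers)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=(d_in, d_out))

W = residual_proj_init(768, 768, n_layers=12)   # GPT-2 small: N = 12
print(round(W.std(), 5))                         # ~0.02/sqrt(24) ~ 0.00408
```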

Residual Stream Norm Growth (96-Layer Model)

  • No scaling (unbounded growth): 14.2 relative norm
  • $1/\sqrt{N}$ scaling (moderate growth): 3.1 relative norm
  • $1/\sqrt{2N}$, GPT-2 (well-controlled): 1.8 relative norm
  • DeepSeek $\alpha$ (near-constant): 1.2 relative norm

DeepSeek’s Alpha Scaling

DeepSeek-V2 and subsequent DeepSeek models use a more aggressive scaling scheme. They introduce a per-layer scaling factor $\alpha$ that is computed as a function of the layer depth:

$$x_{l+1} = x_l + \alpha_l \cdot h_l(x_l)$$

where $\alpha_l$ decreases with depth. The later layers — which write onto a stream that has already accumulated many contributions — are scaled down more aggressively. This achieves tighter control over the stream norm than a uniform scaling factor.

muP: Maximal Update Parameterization

Yang and Hu (2021) developed a principled theoretical framework called maximal update parameterization (muP) that derives the correct scaling factors from first principles. The core idea is to parameterize the network such that the optimal hyperparameters (learning rate, initialization scale, etc.) are independent of model width.

In the context of residual connections, muP prescribes specific scaling factors for the output projections of attention and FFN sublayers that depend on the model width $d_{\text{model}}$ and the number of layers. The key result is that with muP, you can tune hyperparameters on a small model and directly transfer them to a much larger model — the residual stream dynamics are preserved across scales.

Why Scaling Matters for Training Stability

Without proper scaling, training a 96-layer transformer in BF16 can produce activations that overflow the representable range (max value 3.39e38 in BF16). Even before overflow, large activation magnitudes reduce the effective precision of the representation, because the floating-point grid becomes coarser at larger magnitudes. Scaling the residual contributions keeps activations in a numerically healthy range throughout training.

The Interaction with Normalization

Scaling factors interact closely with layer normalization. In Post-Norm (the original transformer), the normalization is applied after the residual addition:

$$x_{l+1} = \text{LayerNorm}(x_l + h_l(x_l))$$

This normalizes the stream after each sublayer, which controls variance growth but also disrupts the clean residual path. The gradient must flow through the LayerNorm Jacobian, which can amplify or attenuate it.

In Pre-Norm (used by GPT-2, Llama, and most modern LLMs), normalization is applied inside the sublayer before the residual addition:

$$x_{l+1} = x_l + h_l(\text{LayerNorm}(x_l))$$

This preserves the clean identity path in the residual connection. The gradient through the skip path is exactly $I$ — no normalization Jacobian to worry about. The sublayer still operates on normalized inputs, so its internal computation is well-conditioned. But the stream itself is not normalized, which is why the scaling factors described above become necessary.
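The difference in the identity path is visible in a few lines, with the sublayer stubbed to zero for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def post_norm(x, sublayer):
    return layer_norm(x + sublayer(x))   # norm sits ON the residual path

def pre_norm(x, sublayer):
    return x + sublayer(layer_norm(x))   # identity path left untouched

# With a do-nothing sublayer, pre-norm passes x through exactly;
# post-norm still rescales the stream.
x = np.random.default_rng(0).standard_normal((4, 8)) * 3.0 + 1.0
zero = lambda h: np.zeros_like(h)
assert np.allclose(pre_norm(x, zero), x)
assert not np.allclose(post_norm(x, zero), x)
```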


6. ReZero Initialization: Starting from Identity

The Problem with Random Initialization

At the start of training, each sublayer’s weights are randomly initialized. The attention weights produce random attention patterns; the FFN weights produce random transformations. The output of each sublayer is, essentially, random noise.

In a standard residual network, this means the residual stream starts as the token embedding plus $L$ noise terms. After 96 layers with 192 sublayers, the token embedding is buried under a mountain of random contributions. The network must simultaneously learn to suppress this noise and learn useful representations — a difficult optimization problem.

The ReZero Solution

Bachlechner et al. (2020) proposed a simple modification: initialize each sublayer’s contribution to exactly zero:

$$x_{l+1} = x_l + \alpha_l \cdot g_l(x_l)$$

where $\alpha_l$ is a learnable scalar initialized to 0. At the start of training, every $\alpha_l = 0$, so the network computes the identity function regardless of depth:

$$x_L = x_0 + \sum_{l} 0 \cdot g_l(x_l) = x_0$$

The gradient with respect to αl\alpha_l is:

$$\frac{\partial \mathcal{L}}{\partial \alpha_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot g_l(x_l)$$

This gradient tells each $\alpha_l$ how much the sublayer’s output would help reduce the loss. Layers that produce useful contributions develop nonzero $\alpha_l$ values; layers whose outputs are unhelpful maintain $\alpha_l \approx 0$.
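A minimal sketch of the ReZero wiring; in a real framework `alpha` would be a trainable parameter, and the tanh sublayer is just a stand-in:

```python
import numpy as np

class ReZeroSublayer:
    """Residual sublayer gated by a scalar alpha, initialized to zero."""
    def __init__(self, fn):
        self.fn = fn
        self.alpha = 0.0          # trainable in practice; starts at 0

    def __call__(self, x):
        return x + self.alpha * self.fn(x)

# At initialization, a 100-sublayer stack is exactly the identity:
rng = np.random.default_rng(0)
stack = [ReZeroSublayer(np.tanh) for _ in range(100)]
x = rng.standard_normal((4, 8))
y = x
for layer in stack:
    y = layer(y)
assert np.allclose(y, x)

# "Turning on" one layer adds only that layer's scaled correction:
stack[0].alpha = 0.5
assert np.allclose(stack[0](x), x + 0.5 * np.tanh(x))
```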

Σ Theorem: ReZero Convergence Property

With ReZero initialization, the network starts as an exact identity function. During the early phase of training, layers gradually “turn on” as their scaling factors $\alpha_l$ grow from zero. This creates a natural curriculum: the network first learns as a shallow model (few active layers), then progressively deepens as more layers activate. The optimization landscape at initialization is smooth and well-conditioned, because the Jacobian of the entire network is exactly $I$.

Practical Adoption

While ReZero demonstrated the principle clearly, most production LLMs use a related but less extreme approach. Rather than initializing $\alpha_l = 0$ explicitly, they initialize the output projection weights of each sublayer to be very small (near zero) so that each sublayer starts with near-zero contribution. The GPT-2 scaling of $\frac{1}{\sqrt{2N}}$ achieves a similar effect: for a 96-layer model, the scaling factor is $\frac{1}{\sqrt{192}} \approx 0.072$, so each sublayer’s initial contribution is roughly 7% of what it would be without scaling.

The core idea — start close to identity, let the network gradually deepen itself — is now embedded in standard practice, even when the ReZero formulation is not used explicitly.


7. The Superposition Perspective

More Features Than Dimensions

In 2022, Nelson Elhage and colleagues at Anthropic published “Toy Models of Superposition,” which revealed a startling property of the residual stream: it encodes far more features than it has dimensions.

The classical assumption was that each feature in a neural network corresponds to one neuron (or one dimension of the residual stream). Under this assumption, a $d_{\text{model}} = 4{,}096$ stream could represent at most 4,096 independent features. But this turns out to be wildly wrong.

Features as Directions

The key finding is that features are not axis-aligned. Instead, each feature is represented as a direction in the high-dimensional residual stream space. A feature is “active” when the residual stream vector has a large component in that feature’s direction.

In a $d$-dimensional space, there are far more “almost orthogonal” directions than there are dimensions. In $\mathbb{R}^{4096}$, you can pack vastly more unit vectors than dimensions while keeping every pairwise cosine similarity small (on the order of 0.1), and the number of such vectors grows exponentially as the interference threshold is relaxed. Each of these near-orthogonal directions can encode a distinct feature, at the cost of small interference between features.
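Even unoptimized random directions illustrate the effect (a deliberately constructed packing does better still); a quick check with plain random unit vectors:

```python
import numpy as np

# In high dimension, random unit vectors are nearly orthogonal: pairwise
# cosines concentrate around 1/sqrt(d) ~ 0.016 for d = 4096. Optimized
# packings push interference lower for far more vectors.
rng = np.random.default_rng(0)
d, n = 4096, 400
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize each row

cos = np.abs(V @ V.T)                            # pairwise |cosine|
np.fill_diagonal(cos, 0.0)                       # ignore self-similarity
print(round(cos.max(), 3))   # worst-case interference among 400 directions
```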

The Superposition Tradeoff

The network faces a tradeoff between two objectives:

  1. Feature capacity: Represent as many features as possible to capture the complexity of language.
  2. Feature interference: Minimize cross-talk between features, which degrades the network’s ability to read out individual features cleanly.

Anthropic’s toy models showed that networks naturally learn to pack features in superposition — representing more features than dimensions allow — when features are sparse (most features are inactive for any given input). If a feature is only active 1% of the time, the interference it causes when active is a small price to pay for having it available at all.

ℹ️ Sparsity Enables Superposition

Superposition works because language features are sparse. The feature “this token is part of a French sentence” is irrelevant for 95%+ of English text. The feature “this token represents a chemical formula” is irrelevant for most inputs. Because most features are inactive at any given time, the network can pack thousands of features into hundreds of dimensions with minimal practical interference.

Implications for the Residual Stream

This perspective transforms how we understand the residual stream. It is not a set of 4,096 independent channels carrying 4,096 features. It is a compressed, overcomplete representation carrying potentially millions of features encoded as directions in a high-dimensional space.

Each attention head and each FFN reads and writes in this overcomplete space. When an attention head projects the residual stream through its $W_Q$ matrix, it is not selecting specific dimensions — it is selecting specific directions, potentially activating features stored in superposition. When it writes back through $W_O$, it deposits its results as a vector in the stream, which other components must decompose to extract useful information.

The Residual Connection’s Role in Superposition

The residual connection is essential to superposition. Because sublayer outputs are added to the stream (rather than replacing it), features written by early layers persist in the stream for later layers to read. If the architecture used $x_{l+1} = f_l(x_l)$ instead of $x_{l+1} = x_l + g_l(x_l)$, each layer’s nonlinear transformation could destroy the delicate geometric structure of superposed features.

The additive structure preserves the angular relationships between feature directions. If layer 5 writes a feature in direction $v$ and layer 20 wants to read it, the feature’s direction is still $v$ in the residual stream at layer 20 — it has not been rotated or distorted by the intervening layers. The intervening layers have added their own contributions, but they have not transformed the existing content.

This is why mechanistic interpretability works at all. The linearity of the residual connection means that features maintain their identity as they flow through the network, even as new features are added on top of them.
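A small simulation makes the contrast concrete (the layer indices, per-layer write scale, and the random-rotation stand-in for a “replacing” layer are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512
v = rng.standard_normal(d)
v /= np.linalg.norm(v)            # feature direction written by an early layer

x = v.copy()                      # "layer 5" writes the feature into the stream
for _ in range(15):               # layers 6..20 ADD their own small contributions
    x = x + 0.2 * rng.standard_normal(d) / np.sqrt(d)

additive = v @ x                  # readout at "layer 20": still close to 1

# Contrast: if each layer REPLACED the stream (x_{l+1} = f_l(x_l)), even a
# benign random rotation per layer destroys the feature's direction.
y = v.copy()
for _ in range(15):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal map
    y = Q @ y
replaced = v @ y                  # readout collapses toward 0

print(f"additive readout = {additive:.3f}, replace-style readout = {replaced:.3f}")
```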

From Theory to Practice: Sparse Autoencoders

The practical consequence of superposition is that individual neurons (dimensions) of the residual stream are not interpretable — each dimension is a mixture of many superposed features. To extract individual features, Anthropic and other groups have developed sparse autoencoders (SAEs) that decompose residual stream activations into a much larger set of sparse, interpretable features.

An SAE trained on GPT-4-scale residual stream activations might extract 100,000+ features from a 4,096-dimensional stream — a 25x overcomplete dictionary. Each feature activates sparsely and corresponds to an interpretable concept: “the current topic is sports,” “this sentence contains a negation,” “the model is uncertain about the next token,” and so on.

This line of research is only possible because of the residual stream’s additive structure. The clean linear superposition of features, maintained by residual connections, is what makes decomposition tractable.
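A minimal sketch of the SAE shape (toy sizes and random untrained weights; real SAEs are trained on stored residual-stream activations to minimize exactly this reconstruction-plus-sparsity loss):

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_sae = 64, 512        # toy sizes; real SAEs map e.g. 4096 -> 100k+ features

# SAE parameters (random here; in practice trained to minimize the loss below).
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae(x, l1_coeff=1e-3):
    """Decompose a residual-stream vector into (ideally sparse) feature activations."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)      # ReLU encoder -> nonnegative codes
    x_hat = f @ W_dec + b_dec                    # reconstruct from the feature dictionary
    # Reconstruction error plus an L1 penalty; training drives f toward sparsity.
    loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).sum()
    return f, x_hat, loss

x = rng.standard_normal(d_model)                 # a stand-in residual-stream activation
f, x_hat, loss = sae(x)
print(f"{int((f > 0).sum())} / {d_sae} features active, loss = {loss:.3f}")
```

With random weights the codes are not yet sparse; it is the L1 term during training that pushes each input to activate only a handful of the overcomplete dictionary’s features.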


8. Practical Implications for LLM Engineering

You Cannot Remove Residual Connections

This may seem obvious, but it is worth stating explicitly: if you remove the residual connection from even a single sublayer in a trained transformer, the model’s performance collapses catastrophically. The model has learned to produce small refinements that accumulate through the stream. Without the skip path, a sublayer’s output must carry the entire representation, which it was never trained to do.

Experiments on Llama-family models show that removing a single residual connection increases perplexity by 10x to 100x, depending on the layer. The model effectively produces incoherent text.

You Cannot Easily Modify Them

Because the residual stream is the backbone of the entire architecture, modifications to the residual path have outsized effects. Changing the skip connection from addition to, say, a gated combination $x_{l+1} = \alpha x_l + (1 - \alpha) g_l(x_l)$ alters every gradient path in the network simultaneously. Even small changes to the residual structure require careful analysis of their effect on gradient flow, activation scales, and feature superposition.

This is one reason why the transformer architecture has been remarkably stable. Researchers have modified attention patterns (MHA, MQA, GQA, MLA), positional encodings (sinusoidal, RoPE, ALiBi), normalization schemes (Post-Norm, Pre-Norm, RMSNorm), and activation functions (ReLU, GELU, SwiGLU) — but the residual connection has remained untouched since 2017. It is the one component nobody has improved upon, because it is already optimal in a precise sense: the identity is the only parameter-free linear map that passes gradients through with both magnitude and direction unchanged.

Σ Theorem: Optimality of the Identity Skip Connection

Among all linear skip connections $x_{l+1} = A x_l + g_l(x_l)$, the identity $A = I$ is the unique choice that simultaneously (1) preserves gradient magnitude through the skip path, (2) preserves the direction of the gradient, and (3) introduces no additional parameters. Any other choice of $A$ either attenuates gradients, amplifies them, rotates them, or requires learning — all of which degrade training stability at depth.
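The magnitude part of this is simple compounding: a skip matrix $A = aI$ with $a$ even slightly away from 1 scales the skip-path gradient by $a^{\text{depth}}$ across the stack. A two-line check:

```python
# Gradient scaling through the skip path alone, compounded over 100 layers,
# for a hypothetical skip connection A = a * I.
depth = 100
for a in (0.98, 1.00, 1.02):
    print(f"a = {a:.2f}: skip-path gradient scaled by {a ** depth:.3e}")
```

Even a 2% deviation per layer shrinks or inflates the skip-path gradient by roughly an order of magnitude over 100 layers; only $a = 1$ leaves it untouched.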

They Determine Maximum Effective Depth

Even with residual connections, there is a practical limit to depth. As the network gets deeper, the accumulated sublayer contributions can overwhelm the original embedding signal. The ratio of “useful signal” to “accumulated noise” in the residual stream decreases with depth, even though individual sublayer contributions are scaled down.

Empirically, the relationship between model quality and depth follows a curve of diminishing returns. Going from 24 to 48 layers typically provides a meaningful improvement. Going from 48 to 96 layers provides a smaller improvement. Going from 96 to 192 layers provides minimal additional benefit for most tasks, while doubling the computational cost. This is one reason why modern LLMs have largely settled in the 80 to 128 layer range: beyond that, widening the model (increasing $d_{\text{model}}$) gives better returns than adding more layers.

Benchmark Score vs. Model Depth (Matched Parameter Count)

| Depth | Score | Note |
|---|---|---|
| 24 layers (wide) | 72 | |
| 48 layers | 78 | |
| 80 layers | 82 | |
| 96 layers | 83 | |
| 128 layers | 83.5 | diminishing returns |
| 192 layers (narrow) | 82 | depth hurts at fixed params |

The Pre-Norm Advantage

The choice of normalization placement has a direct impact on residual path cleanliness. In Post-Norm:

$x_{l+1} = \text{LayerNorm}(x_l + h_l(x_l))$

The gradient through the skip path is $\frac{\partial \text{LayerNorm}}{\partial x} \cdot I$ — it must pass through the LayerNorm Jacobian. This Jacobian can have eigenvalues that deviate significantly from 1, especially when the input distribution is skewed or has outlier dimensions.

In Pre-Norm:

$x_{l+1} = x_l + h_l(\text{LayerNorm}(x_l))$

The gradient through the skip path is simply $I$. The normalization only affects the gradient through the sublayer branch, not the identity branch. This is why Pre-Norm enables stable training at greater depths: the gradient highway is completely unobstructed.
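A quick numerical check of the Post-Norm skip-path Jacobian, assuming a plain LayerNorm with no learned scale or shift (a simplification of the real layer):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def jacobian(fn, x, h=1e-6):
    """Finite-difference Jacobian of fn at x."""
    base = fn(x)
    J = np.zeros((x.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h
        J[:, i] = (fn(xp) - base) / h
    return J

rng = np.random.default_rng(4)
x = rng.standard_normal(16)
x[0] += 8.0    # an outlier dimension, as often seen in real LLM activations

# Post-Norm skip path: gradients must pass through the LayerNorm Jacobian,
# whose singular values sit well away from 1 when the input has outliers.
svals = np.linalg.svd(jacobian(layernorm, x), compute_uv=False)
print(f"Post-Norm skip path: singular values in [{svals.min():.3f}, {svals.max():.3f}]")

# Pre-Norm skip path: the Jacobian is exactly the identity (all singular values 1).
```

The outlier inflates the input’s standard deviation, so the LayerNorm Jacobian shrinks gradients in most directions and annihilates them in two (the mean and variance directions), whereas the Pre-Norm skip path passes them through untouched.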

⚠️ Post-Norm Is Not Dead

Despite Pre-Norm’s gradient flow advantages, some recent architectures (including some configurations of PaLM and Gemini) experiment with modified Post-Norm schemes that add additional scaling or normalization tricks to recover training stability. Post-Norm can produce better final quality in some settings because it normalizes the accumulated stream, preventing the representation drift that Pre-Norm allows. The tradeoff is real, and the optimal choice depends on the specific depth, width, and training recipe.

Residual Connections and Model Surgery

Fine-tuning, pruning, and model merging all interact with the residual stream. LoRA (Low-Rank Adaptation) works by adding a small learned update to the sublayer weights, which changes what each sublayer writes to the stream. Because the stream is additive, LoRA’s modifications compose cleanly: the stream is the original contributions plus the LoRA deltas.
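In matrix terms (toy sizes; `W`, `A`, `B` here are illustrative stand-ins for a real frozen projection and its trained LoRA factors):

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 64, 4                             # model width and LoRA rank (toy sizes)

W = rng.standard_normal((d, d)) * 0.05   # frozen pretrained projection
B = rng.standard_normal((d, r)) * 0.05   # LoRA factors: the learned delta is B @ A
A = rng.standard_normal((r, d)) * 0.05

x = rng.standard_normal(d)               # a residual-stream vector

base_write = W @ x                       # what the sublayer wrote before fine-tuning
lora_write = W @ x + B @ (A @ x)         # after LoRA: original write plus low-rank delta

delta = lora_write - base_write          # exactly (B @ A) @ x, rank at most r
print(f"|delta| / |base| = {np.linalg.norm(delta) / np.linalg.norm(base_write):.3f}")
```

Because the stream is additive, this delta simply rides on top of the original contribution; nothing elsewhere in the network has to be re-derived for the composition to make sense.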

Layer pruning — removing entire transformer blocks — is feasible precisely because of the residual connection’s ensemble property. Removing a layer from a residual network removes one set of contributions from the stream but leaves all other contributions intact. The network degrades gracefully rather than catastrophically (though for a pruned model to recover, it typically needs a brief period of fine-tuning to adjust the remaining layers’ contributions).

Model merging (averaging the weights of two fine-tuned models) works because the residual stream structure ensures that the merged model produces outputs that are roughly the average of the two source models’ contributions. Without residual connections, weight averaging would not produce meaningful behavior because the multiplicative interactions between layers would create unpredictable interference.


Conclusion: The Architecture’s Backbone

The residual connection is the transformer’s most important structural element. It solves the gradient flow problem that makes deep networks trainable. It creates the residual stream that serves as the architecture’s shared communication bus. It enables the superposition of features that gives transformers their extraordinary representational capacity. And it provides the additive structure that makes interpretability, fine-tuning, and model surgery possible.

Every other component in the transformer — attention, feed-forward networks, normalization, positional encoding — has been modified, replaced, or rearchitected multiple times since 2017. The residual connection remains exactly as it was: $x + f(x)$. It has survived unchanged because it is, in a precise mathematical sense, the optimal solution to the problem it solves. There is no simpler structure that enables deep training. There is no more complex structure that improves upon it.

When you look at a 100-layer transformer, you are not looking at a 100-stage pipeline. You are looking at a $d_{\text{model}}$-dimensional communication channel with 200 devices attached to it, each reading and writing small updates. The channel is the model. Everything else is peripheral.