Part of Series: Transformer Anatomy, 9 of 23

  1. The Transformer Attention Mechanism: From First Principles to Performance Reality
  2. Tokenization and BPE: How LLMs See Text — From Characters to Subwords
  3. Embedding Layers: The Geometry of Meaning in LLMs
  4. Position Encoding in Transformers: From Sinusoidal to RoPE, ALiBi, and Long-Context Scaling
  5. Softmax Numerics: Log-Sum-Exp, Temperature, and Why Numerical Stability Matters
  6. Attention Variants Compared: MHA, MQA, GQA, and MLA
  7. Normalization in Transformers: LayerNorm, RMSNorm, and the Training Stability Story
  8. Residual Connections and Skip Paths: Why Transformers Can Be 100 Layers Deep
  9. The Feed-Forward Network: SwiGLU, Gating, and the FFN-as-Memory Hypothesis
  10. Mixture of Experts: Why Conditional Computation Is the Path to Trillion-Parameter Models
  11. The Output Head: Unembedding, Weight Tying, and Vocabulary Projection
  12. Cross-Entropy Loss: How the Loss Function Shapes What an LLM Learns
  13. Encoder vs Decoder: Why Decoder-Only Won
  14. DeepSeek V3: How 671B Parameters Trained for the Cost of a 70B Dense Model
  15. Building a Transformer From Scratch: Putting Every Component Together
  16. Gradient Flow and Backpropagation Through Transformers: What Happens During the Backward Pass
  17. Weight Initialization: Xavier, Kaiming, and Why mu-P Changes Everything for Large Models
  18. Training Loop Anatomy: Forward Pass, Loss Computation, Backward Pass, Optimizer Step
  19. Learning Rate Schedules: Warmup, Cosine Decay, and Why WSD Changes Everything
  20. Activation Functions Deep Dive: ReLU, GELU, SiLU, and Why Each Matters for Transformers
  21. Attention Masking: Causal, Bidirectional, Sliding Window, Block Sparse, and Custom Patterns
  22. Knowledge Distillation: Training Small Models to Match Large Ones
  23. Model Merging: Weight Averaging, TIES, DARE, and Evolutionary Search

The previous eight posts in this series have focused almost exclusively on the attention mechanism — how queries meet keys, how positional information is encoded, how the KV cache is compressed. Attention gets the spotlight because it is the architectural novelty that defines the transformer. But attention accounts for only about one-third of a transformer’s parameters. The other two-thirds live in the feed-forward network (FFN), also called the MLP block, which sits after every attention layer in the residual stream.

This imbalance between attention’s fame and the FFN’s parameter mass is not a coincidence. The attention mechanism routes information between token positions: it decides which tokens should talk to which other tokens. The FFN, by contrast, processes each token position independently. It reads a single vector from the residual stream, transforms it through a nonlinear function, and writes the result back. No cross-token communication occurs. If attention is the postal service that delivers mail between addresses, the FFN is the factory at each address that processes what arrives.

This post covers everything you need to understand about the FFN block: its architecture and tensor shapes, the expansion ratio and why it dominates parameter count, the evolution of activation functions from ReLU to GELU to SwiGLU, the gating mechanism and why multiplicative interactions matter, the remarkable FFN-as-key-value-memory hypothesis, knowledge neurons and model editing, the connection to Mixture of Experts, and finally a detailed performance analysis showing where the FFN sits on the roofline.


1. The FFN Block Architecture

Two Linear Layers with a Nonlinearity

The standard feed-forward network in a transformer is a simple two-layer MLP applied independently to each token position. Given an input vector $x \in \mathbb{R}^{d_\text{model}}$ from the residual stream, the FFN computes:

$$\text{FFN}(x) = W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$ is the up-projection, $W_2 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ is the down-projection, $\sigma$ is a nonlinear activation function, and $b_1$, $b_2$ are optional bias terms (most modern architectures omit them).

The tensor shapes through the FFN, for a batched input, are:

  1. Input: $[B, S, d_\text{model}]$ — batch size $B$, sequence length $S$, model dimension $d_\text{model}$.
  2. After up-projection: $[B, S, d_\text{ff}]$ — the hidden dimension $d_\text{ff}$ is typically much larger than $d_\text{model}$.
  3. After activation: $[B, S, d_\text{ff}]$ — same shape, but values have been nonlinearly transformed.
  4. After down-projection: $[B, S, d_\text{model}]$ — back to the original dimension, ready to be added to the residual stream.
import torch
import torch.nn as nn


class FeedForwardNetwork(nn.Module):
    """
    Standard two-layer FFN as in the original Transformer.
    Applied independently to each token position.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # Up-projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # Down-projection
        self.act = nn.ReLU()  # Original Transformer used ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        hidden = self.act(self.w1(x))  # [batch, seq_len, d_ff]
        output = self.w2(hidden)        # [batch, seq_len, d_model]
        return output

Why Two Layers?

The first linear layer projects UP from $d_\text{model}$ to $d_\text{ff}$, expanding the representation into a higher-dimensional space. The nonlinearity acts in this higher-dimensional space, creating complex nonlinear feature combinations that a single linear layer could not express. The second linear layer projects DOWN back to $d_\text{model}$, compressing the result into a form that fits back into the residual stream.

This expand-then-compress pattern is the fundamental trick. A single linear transformation $W \in \mathbb{R}^{d_\text{model} \times d_\text{model}}$ can only compute linear functions of the input. By expanding to a higher dimension, applying a pointwise nonlinearity, and then compressing back down, the FFN can approximate a much richer class of functions. The wider the intermediate dimension $d_\text{ff}$, the more expressive the FFN becomes — at the cost of more parameters and more compute.
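To make this concrete, here is a toy sketch with hand-picked weights (nothing here comes from a trained model): a two-neuron hidden layer with ReLU computes the absolute-value function, which no single linear layer can represent.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def tiny_ffn(x: float) -> float:
    """Expand a scalar to two hidden features, apply ReLU, compress back.
    W1 = [[1], [-1]] (up-projection), W2 = [[1, 1]] (down-projection)."""
    hidden = [relu(1.0 * x), relu(-1.0 * x)]   # expand + nonlinearity
    return 1.0 * hidden[0] + 1.0 * hidden[1]   # compress

# relu(x) + relu(-x) = |x|: nonlinear, so no single matrix W can compute it
print(tiny_ffn(-3.0))  # 3.0
print(tiny_ffn(2.0))   # 2.0
```

The same expand-nonlinearity-compress structure, scaled up to thousands of hidden features, is what gives the real FFN its expressive power.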

ℹ️ Per-Token Independence

The FFN processes each token position completely independently. Token at position 0 and token at position 512 go through exactly the same weights, but there is no information flow between them inside the FFN. All cross-position communication happens in the attention layers. This separation of concerns — attention for routing, FFN for processing — is a core architectural principle of the transformer.

The mental model to carry forward: the FFN is a per-token MLP that reads from the residual stream, expands the representation into a wide layer of nonlinear features, compresses the result, and writes it back. It is where the transformer “thinks” about each token individually, using the context that attention has already mixed into that token’s representation.


2. The Expansion Ratio

The Standard 4x Rule

The original transformer (Vaswani et al., 2017) set $d_\text{ff} = 4 \times d_\text{model}$. For a model with $d_\text{model} = 512$, the FFN hidden dimension was $d_\text{ff} = 2048$. This 4x expansion ratio became one of the most widely adopted hyperparameters in deep learning, carried forward through GPT-2, GPT-3, and many other architectures.

Why 4x? The original paper does not provide a deep theoretical justification. Empirically, 4x provides a good balance between expressiveness and parameter efficiency. Smaller ratios (2x, 3x) reduce capacity and hurt quality. Larger ratios (8x, 16x) provide diminishing returns while linearly increasing parameters and compute. The 4x ratio has proven robust across a wide range of model sizes, from 100M to 175B parameters.

SwiGLU Changes the Math

When gated activations like SwiGLU replaced ReLU (which we will cover in detail in Sections 3 and 4), the expansion ratio changed. A gated FFN uses THREE weight matrices instead of two: $W_1$ (gate projection), $W_3$ (up-projection), and $W_2$ (down-projection). To keep the total parameter count equal to a standard FFN with $d_\text{ff} = 4 \times d_\text{model}$, the hidden dimension must be reduced.

A standard FFN has $2 \times d_\text{model} \times d_\text{ff}$ parameters (two matrices). A gated FFN has $3 \times d_\text{model} \times d_\text{ff}^{\prime}$ parameters (three matrices). Setting these equal:

$$2 \times d_\text{model} \times 4 d_\text{model} = 3 \times d_\text{model} \times d_\text{ff}^{\prime}$$

$$d_\text{ff}^{\prime} = \frac{8}{3} \times d_\text{model} \approx 2.67 \times d_\text{model}$$

In practice, $d_\text{ff}$ is rounded to a multiple of 256 (for GPU efficiency). For Llama 2 70B with $d_\text{model} = 8192$, the theoretical value is $\frac{8}{3} \times 8192 \approx 21845$, but the actual value used is $d_\text{ff} = 28672$ — larger than the parameter-matched value, because the Llama designers chose to allocate extra capacity to the FFN.
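The parameter-matched sizing can be sketched in a few lines of Python. The rounding scheme below (`multiple_of=256`, round up) is illustrative; real codebases vary in the exact rule they apply.

```python
def matched_hidden_dim(d_model: int, multiple_of: int = 256) -> int:
    """Parameter-matched SwiGLU hidden size: 2/3 of the standard 4x expansion,
    rounded up to a hardware-friendly multiple. Illustrative rounding rule."""
    d_ff = (2 * (4 * d_model)) // 3  # (2/3) * 4 * d_model = 8/3 * d_model
    return ((d_ff + multiple_of - 1) // multiple_of) * multiple_of

print(matched_hidden_dim(8192))  # 21845 rounded up to 22016
# Llama 2 70B actually uses d_ff = 28672, larger than the parameter-matched value.
```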

Parameter Count Dominance

Let us work through the numbers for Llama 2 70B to see why the FFN dominates the parameter count.

Per-layer FFN parameters (SwiGLU with $d_\text{model} = 8192$, $d_\text{ff} = 28672$):

$$3 \times d_\text{model} \times d_\text{ff} = 3 \times 8192 \times 28672 = 704{,}643{,}072 \approx 704\text{M}$$

Per-layer attention parameters ($n_\text{heads} = 64$, $n_\text{kv\_heads} = 8$, $d_\text{head} = 128$):

$$W_Q + W_K + W_V + W_O = 8192 \times 8192 + 8192 \times 1024 + 8192 \times 1024 + 8192 \times 8192 = 150{,}994{,}944 \approx 151\text{M}$$

The FFN has roughly 4.7x more parameters than the attention layer. Across 80 layers:

$$\text{FFN total} = 704\text{M} \times 80 = 56.4\text{B}$$

$$\text{Attention total} = 151\text{M} \times 80 = 12.1\text{B}$$

The FFN accounts for approximately 82% of the model’s layer parameters, with the remaining 18% in attention. Adding the embedding and output layers does not significantly change this ratio.
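These counts are easy to check with a few lines of arithmetic, using the Llama 2 70B shapes quoted above:

```python
d_model, d_ff, n_layers = 8192, 28672, 80
d_head, n_kv_heads = 128, 8

ffn = 3 * d_model * d_ff  # W1, W3, W2 (SwiGLU)
kv = n_kv_heads * d_head  # 1024: K/V projection width under GQA-8
attn = 2 * d_model * d_model + 2 * d_model * kv  # Wq, Wo full-width; Wk, Wv narrow

print(f"FFN per layer:  {ffn:,}")       # 704,643,072
print(f"Attn per layer: {attn:,}")      # 150,994,944
print(round(100 * ffn / (ffn + attn)))  # 82 (% of layer parameters in the FFN)
```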

📊 Parameter Breakdown: Llama 2 70B

  Component           Per Layer   Total (80 layers)   Share
  FFN (SwiGLU)        704M        56.4B               82%
  Attention (GQA-8)   151M        12.1B               18%
  Embeddings + Head   --          ~1.1B               --
  Total               855M        ~69.6B              100%

  Note: d_model=8192, d_ff=28672, n_heads=64, n_kv_heads=8, d_head=128

This is the core insight: when you are looking at a 70B parameter model, roughly 56 billion of those parameters live in the FFN blocks. Understanding what these parameters do — and how knowledge is stored in them — is essential.


3. Activation Functions: The Evolution

The activation function $\sigma$ in the FFN determines how the network introduces nonlinearity. The choice of activation has evolved significantly over the transformer era, driven by empirical findings about training dynamics and gradient flow.

ReLU: The Original Choice

The original transformer used ReLU (Rectified Linear Unit):

$$\text{ReLU}(x) = \max(0, x)$$

ReLU is elegant in its simplicity. For positive inputs, it is the identity function. For negative inputs, it outputs zero. This creates sparse activations: at any given input, roughly half the neurons in the hidden layer have zero output. Sparsity is computationally attractive because multiplying by zero is free, and it has been argued to encourage more interpretable representations.

However, ReLU has a fundamental problem for deep networks: the dying ReLU phenomenon. For any neuron $i$ where $(W_1 x + b_1)_i < 0$, the gradient through ReLU is exactly zero. If a neuron drifts into a regime where its pre-activation is consistently negative across the training data, it receives zero gradient and can never recover. The neuron is “dead” — it contributes nothing to the network’s output and wastes parameters. In deep transformer stacks with many layers of FFNs, this problem compounds.

GELU: The Smooth Alternative

Hendrycks and Gimpel (2016) proposed the Gaussian Error Linear Unit:

$$\text{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian. In practice, this is approximated as:

$$\text{GELU}(x) \approx 0.5 \cdot x \cdot \left(1 + \tanh\left[\sqrt{2/\pi}\left(x + 0.044715 x^3\right)\right]\right)$$

GELU behaves like a smooth version of ReLU. For large positive values, $\Phi(x) \approx 1$ and GELU $\approx x$ (like ReLU). For large negative values, $\Phi(x) \approx 0$ and GELU $\approx 0$ (like ReLU). But the transition is smooth: near zero, GELU has non-zero gradients on both sides. This means neurons never fully “die” — they always receive at least a small gradient signal, allowing them to recover from unfavorable weight configurations.

GELU became the default activation in BERT and GPT-2, and its success was one of the early signals that smoother activations help transformer optimization. The improvement is not dramatic on any single benchmark — typically 0.1-0.5% — but it is consistent across tasks and scales, and it makes training more stable.

SiLU/Swish: The Simplification

The SiLU (Sigmoid Linear Unit), also known as Swish, was proposed by Ramachandran et al. (2017):

$$\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$

where $\sigma$ is the logistic sigmoid function. SiLU is conceptually simpler than GELU (sigmoid is easier to reason about than the Gaussian CDF) and behaves almost identically in practice. For large positive $x$, sigmoid approaches 1 and SiLU $\approx x$. For large negative $x$, sigmoid approaches 0 and SiLU $\approx 0$. Near zero, SiLU is smooth with a slight non-monotonicity: it dips slightly below zero around $x \approx -1.28$.

The quality difference between GELU and SiLU is negligible in practice. SiLU became the activation of choice for the Llama family of models, primarily because it integrates cleanly into the gated architecture described next.
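The formulas above are easy to compare numerically. A small pure-Python sketch, computing exact GELU via the Gaussian CDF (`math.erf`) alongside the tanh approximation and SiLU:

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard Gaussian CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """The tanh approximation used in many implementations."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# The tanh approximation tracks exact GELU to within ~1e-3 on typical inputs
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    assert abs(gelu_exact(x) - gelu_tanh(x)) < 1e-3

# SiLU's slight dip below zero, with its minimum near x = -1.28
print(round(silu(-1.28), 3))  # -0.278
```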

💡 Activation Function Selection in Practice

For standalone FFNs (without gating), GELU and SiLU/Swish perform nearly identically. The choice is largely one of convention: BERT and GPT-2 use GELU, Llama uses SiLU. If you are implementing a gated FFN (SwiGLU), use SiLU — that is the “Swi” in SwiGLU. If you are using a standard two-matrix FFN, GELU is the safe default.

The Common Theme

The smooth activations (GELU and SiLU/Swish) share a key property that ReLU lacks: non-zero gradients almost everywhere. This matters for optimization in deep networks. When you stack 80 or 128 layers, each with an FFN, the gradient must flow backward through all those activations during training. Zero gradients in ReLU create dead zones that block gradient flow. Smooth activations keep the gradient signal alive, enabling more stable training at scale.

Activation Function Comparison (Approximate Behavior)

  ReLU (sparse, dying neurons): 94 relative quality
  GELU (smooth, BERT/GPT default): 99 relative quality
  SiLU/Swish (smooth, Llama default): 99 relative quality
  SwiGLU (gated, state of the art): 100 relative quality

4. Gated FFNs: SwiGLU, GeGLU, ReGLU

The Gating Mechanism

The most significant architectural change to the FFN since the original transformer is the introduction of gating. Instead of applying the activation function to a single up-projection, a gated FFN computes two parallel projections and multiplies them together element-wise:

$$\text{SwiGLU}(x) = W_2 \cdot \left(\text{SiLU}(W_1 x) \odot (W_3 x)\right)$$

Here, $W_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$ is the gate projection, $W_3 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$ is the up-projection, and $W_2 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ is the down-projection. The $\odot$ symbol denotes element-wise (Hadamard) multiplication.

The name decodes as follows: Swi (Swish/SiLU activation) + GLU (Gated Linear Unit). Replace Swish with GELU and you get GeGLU. Replace it with ReLU and you get ReGLU.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU_FFN(nn.Module):
    """
    Gated FFN with SwiGLU activation.
    Used by Llama 2, Llama 3, Mistral, and most modern LLMs.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # Gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # Up-projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # Down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        gate = F.silu(self.w1(x))  # [batch, seq_len, d_ff]
        up = self.w3(x)            # [batch, seq_len, d_ff]
        hidden = gate * up         # Element-wise gating
        output = self.w2(hidden)   # [batch, seq_len, d_model]
        return output

Why Gating Helps

The standard FFN computes $W_2 \cdot \sigma(W_1 x)$: the activation function acts as a fixed filter on the up-projected features. The network must learn $W_1$ such that the features worth keeping have positive pre-activations and the features worth suppressing have negative pre-activations. This is a single, somewhat rigid mechanism.

The gated FFN introduces a multiplicative interaction between two different projections of the input. The term $\text{SiLU}(W_1 x)$ computes a soft binary mask: which features should be active (values near 1) and which should be suppressed (values near 0). The term $W_3 x$ computes the actual feature values. The element-wise product $\text{SiLU}(W_1 x) \odot W_3 x$ means the network can independently learn what to compute (via $W_3$) and whether to compute it (via $W_1$).

This separation is more powerful than it may appear. In a standard FFN, the same matrix $W_1$ must simultaneously determine both the feature value and the gating decision. In a gated FFN, these two roles are decoupled into separate learnable transformations. The gate can learn to detect arbitrary patterns in the input — specific token types, specific positions in a sentence, specific semantic features — and selectively enable or disable the corresponding features from the up-projection.

Multiplicative interactions also create higher-order feature combinations. When $\text{SiLU}(W_1 x) \odot W_3 x$ is computed, the effective output is a product of two linear functions of the input, which is a second-order polynomial in $x$ (modulo the SiLU nonlinearity). This gives the network access to interaction terms that a standard FFN can only approximate through the nonlinearity.

The Shazeer (2020) Paper

Noam Shazeer’s “GLU Variants Improve Transformer” (2020) systematically tested all combinations of gating with different activation functions. The paper is remarkably concise — essentially a large ablation study. The key findings:

  1. All gated variants outperform all non-gated variants at equal parameter count. This is the single most important result: gating itself, regardless of the specific activation function, consistently helps.
  2. SwiGLU and GeGLU tie for best performance, with SwiGLU having a slight edge on most benchmarks.
  3. ReGLU (gated ReLU) outperforms standard GELU without gating, demonstrating that gating is more important than the choice of activation function.
  4. The improvements are consistent across model sizes (from 100M to 1B parameters tested) and across tasks (language modeling perplexity, downstream benchmarks).
📊

GLU Variants Comparison (Shazeer 2020)

  FFN Variant    Activation    Gated?   Perplexity (lower = better)
  Standard FFN   ReLU          No       3.89
  Standard FFN   GELU          No       3.80
  ReGLU          ReLU          Yes      3.76
  GeGLU          GELU          Yes      3.72
  SwiGLU         SiLU/Swish    Yes      3.71

  Note: Perplexity on C4 validation set, parameter-matched at ~200M. Lower is better.

The Parameter Tradeoff

The price of gating is an extra weight matrix. A standard FFN with hidden dimension $d_\text{ff}$ has $2 \times d_\text{model} \times d_\text{ff}$ parameters. A gated FFN has $3 \times d_\text{model} \times d_\text{ff}^{\prime}$ parameters. To maintain the same total parameter count, the hidden dimension must shrink from $d_\text{ff}$ to $d_\text{ff}^{\prime} = \frac{2}{3} d_\text{ff}$.

For the standard 4x expansion ratio, $d_\text{ff} = 4 d_\text{model}$, so $d_\text{ff}^{\prime} = \frac{8}{3} d_\text{model} \approx 2.67 d_\text{model}$. The hidden dimension is narrower, but the gate’s multiplicative interaction more than compensates. In Shazeer’s experiments, the gated FFN with the narrower hidden dimension consistently outperformed the standard FFN with the wider hidden dimension at exactly the same parameter count.

SwiGLU Adoption

As of 2025, SwiGLU has become the near-universal default for transformer FFNs. Llama 2/3, Mistral, Mixtral, Gemma, Qwen, DeepSeek, and most other frontier models use SwiGLU. If you are building a new transformer from scratch, SwiGLU is the obvious choice — the quality improvement is free once you account for the parameter rebalancing.


5. The FFN-as-Key-Value-Memory Hypothesis

The Core Insight

In 2021, Geva, Schuster, Berant, and Levy published “Transformer Feed-Forward Layers Are Key-Value Memories,” a paper that fundamentally reframed how we think about the FFN. Their insight is both simple and profound: the FFN’s two weight matrices behave like the key and value stores of a soft associative memory.

Consider the standard FFN (without gating, for clarity):

$$\text{FFN}(x) = W_2 \cdot \sigma(W_1 \cdot x)$$

Write $W_1$ row-by-row. The $i$-th row of $W_1$, call it $k_i \in \mathbb{R}^{d_\text{model}}$, computes a dot product with the input $x$:

$$a_i = \sigma(k_i \cdot x)$$

This dot product measures how well the input $x$ matches the pattern encoded in $k_i$. If the match is strong (high dot product), the activation $a_i$ is large. If the match is weak, $a_i$ is small or zero (for ReLU).

Now write $W_2$ column-by-column. The $i$-th column of $W_2$, call it $v_i \in \mathbb{R}^{d_\text{model}}$, is the “value” associated with the $i$-th key. The output of the FFN is:

$$\text{FFN}(x) = \sum_{i=1}^{d_\text{ff}} a_i \cdot v_i$$

This is a weighted sum over value vectors, where the weights are determined by how well the input matches each key vector. This is precisely the mechanism of a soft key-value lookup:

  1. Keys ($W_1$ rows): patterns the FFN has learned to detect in the input.
  2. Values ($W_2$ columns): information the FFN retrieves when a key matches.
  3. Activation ($\sigma$): the soft matching function that determines retrieval strength.

The FFN is essentially a learned associative memory with $d_\text{ff}$ entries. Each entry consists of a key pattern (what to look for) and a value vector (what to retrieve). When the input matches a key, the corresponding value is retrieved and added to the residual stream.
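A toy version of this lookup, with two hand-written memory entries (the keys and values here are invented for illustration, not taken from any model):

```python
def relu(z: float) -> float:
    return max(0.0, z)

def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

def ffn_as_memory(x, keys, values):
    """FFN(x) = sum_i relu(k_i . x) * v_i: a soft key-value lookup.
    keys play the role of W1 rows; values play the role of W2 columns."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        a = relu(dot(k, x))                          # match strength for this key
        out = [o + a * vj for o, vj in zip(out, v)]  # retrieve value, scaled by match
    return out

keys   = [[1.0, 0.0], [0.0, 1.0]]   # two detectable input patterns
values = [[0.0, 5.0], [7.0, 0.0]]   # what each entry writes to the residual stream

print(ffn_as_memory([0.9, 0.0], keys, values))  # [0.0, 4.5]: only entry 0 fires
```

An input that matches the first key retrieves only the first value, scaled by the match strength; an orthogonal input retrieves only the second.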

ℹ️ FFN as Soft Lookup Table

Think of the FFN as a lookup table with $d_\text{ff}$ entries. Each entry has a key (a row of $W_1$) and a value (a column of $W_2$). The input $x$ is the query. The FFN computes a soft match against all keys simultaneously, then returns a weighted combination of the corresponding values. For a model with $d_\text{ff} = 28672$, that is 28,672 entries in the lookup table — per layer.

Evidence from Probing Experiments

Geva et al. provided compelling evidence for this interpretation through a series of probing experiments on GPT-2 and other models:

Key pattern analysis. They examined what input patterns cause specific rows of $W_1$ to activate strongly. They found that individual keys correspond to highly interpretable features:

  • Certain keys activate specifically for tokens that represent countries.
  • Other keys activate for tokens in the context of temporal expressions (“in 1997”, “last year”).
  • Some keys activate for tokens following particular syntactic patterns (e.g., the subject of a passive construction).

Value vector analysis. They then examined the corresponding columns of $W_2$ — the values retrieved when those keys match. When a “country” key fires, the corresponding value promotes the token for that country’s capital in the output vocabulary distribution. When a “temporal” key fires, the corresponding value promotes time-related tokens.

The retrieval is compositional. Because the output is a weighted sum of many values, the FFN does not just retrieve a single fact — it composes multiple partial retrievals. If the input simultaneously matches a “European country” key and a “capital city” key, the values from both contribute to the output, collectively pushing the distribution toward the correct capital.

Implications

This reframing has several profound implications:

  1. Knowledge storage is explicit. Factual knowledge (“Paris is the capital of France”) is not stored as some diffuse, inscrutable pattern across the network. It is stored as specific key-value pairs in specific FFN layers. The key encodes the input context that triggers the retrieval, and the value encodes the output to produce.

  2. The FFN is a memory bank. Each FFN layer contains $d_\text{ff}$ memory slots. Across 80 layers with $d_\text{ff} = 28672$, the model has roughly 2.3 million memory slots. This is the model’s “knowledge capacity” — the number of distinct facts or patterns it can store.

  3. Different layers store different types of knowledge. Early layers tend to store syntactic patterns. Middle layers store semantic associations and factual knowledge. Late layers store output-formatting patterns. This layerwise specialization mirrors findings from probing classifiers applied to attention.

FFN as Key-Value Memory (Conceptual)

Each FFN layer is a soft associative memory with $d_\text{ff}$ entries:

  Key 1 (W1 row 1): detects 'France' in context; activation 0.92 (matched)
  Value 1 (W2 col 1): promotes 'Paris' in output; weight 0.92 (retrieved)
  Key 2 (W1 row 2): detects temporal expressions; activation 0.03 (not matched)
  Value 2 (W2 col 2): promotes year tokens; weight 0.03 (weakly retrieved)
  Key 3 (W1 row 3): detects 'capital of' pattern; activation 0.87 (matched)
  Value 3 (W2 col 3): promotes city names; weight 0.87 (retrieved)

6. Knowledge Neurons

Localizing Facts in the Network

If the FFN stores knowledge as key-value pairs, can we identify the specific neurons that store a specific fact? Dai et al. (2022) addressed this question in “Knowledge Neurons in Pretrained Transformers,” introducing a method to locate and manipulate the neurons responsible for individual factual associations.

Their approach works as follows. Given a factual query like “The capital of France is ___”, they:

  1. Run the query through the model and record the activation of every neuron in every FFN layer (all $d_\text{ff}$ activations at every layer).
  2. For each neuron, suppress its activation (set it to zero) and measure how much the model’s probability of the correct answer (“Paris”) decreases.
  3. Neurons whose suppression causes a large decrease in the correct answer’s probability are identified as “knowledge neurons” for that fact.
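In spirit, the procedure looks like the toy sketch below. The “model” is a single three-neuron FFN wired by hand so that neuron 1 carries the fact; the real method attributes over a full LLM (Dai et al. use integrated gradients), so this only illustrates the suppress-and-measure loop, with all weights invented.

```python
import math

W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # key patterns (rows of W1)
W2 = [0.1, 2.0, 0.1]                        # each neuron's push toward the answer

def answer_prob(x, suppress=None) -> float:
    """Toy probability of the 'correct answer', with one neuron optionally zeroed."""
    acts = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if suppress is not None:
        acts[suppress] = 0.0                # knock out one FFN neuron
    logit = sum(a * w for a, w in zip(acts, W2))
    return 1.0 / (1.0 + math.exp(-logit))   # squash to a pseudo-probability

x = [0.2, 1.0]                              # a query that mostly activates neuron 1
base = answer_prob(x)
drops = [base - answer_prob(x, suppress=i) for i in range(3)]
print(max(range(3), key=lambda i: drops[i]))  # 1: the 'knowledge neuron'
```

Suppressing neuron 1 causes by far the largest drop in the answer probability, which is exactly the signal the attribution procedure looks for.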

Key Findings

Dai et al. found several striking results:

Knowledge is relatively localized. For a given fact, typically 5-20 neurons (out of millions) are responsible for the majority of the model’s ability to produce the correct answer. Suppressing these neurons can reduce the probability of the correct answer by 50-90%.

Knowledge neurons cluster in middle layers. For GPT-2 and BERT-based models, the knowledge neurons for factual associations are concentrated in the middle third of the network (roughly layers 8-16 in a 24-layer model). Early layers handle syntactic processing and late layers handle output formatting, so the middle layers are where “facts” live.

Different facts use different neurons. The neurons storing “the capital of France is Paris” are almost entirely disjoint from those storing “the capital of Japan is Tokyo.” The FFN has specialized different memory slots for different facts, exactly as the key-value memory hypothesis predicts.

Suppressing knowledge neurons erases facts. If you zero out the 10-20 knowledge neurons for “the capital of France is Paris,” the model can no longer answer that question correctly. But it can still answer “the capital of Japan is Tokyo” — the suppression is targeted. This is a form of surgical model editing: modifying specific facts without affecting the rest of the model’s knowledge.

Implications for Model Editing

The knowledge neuron framework opens the door to targeted model editing. If you want a model to “forget” a specific fact — perhaps for privacy reasons, or to correct outdated information — you can:

  1. Identify the knowledge neurons for that fact.
  2. Suppress or modify them (either by zeroing weights or by fine-tuning them to produce a different output).

This is far more efficient than retraining the entire model. Several subsequent papers (ROME, MEMIT, and others) have built on this insight to develop increasingly sophisticated model editing techniques.

⚠️ Knowledge Is Distributed, Not Perfectly Localized

While the knowledge neuron framework is compelling, it is important not to oversimplify. Knowledge is distributed across many neurons and multiple layers. Suppressing the top 20 neurons for a fact reduces the correct answer’s probability significantly, but rarely eliminates it entirely. The model has redundant storage — the same fact may be partially encoded in multiple locations. Editing one set of neurons can leave residual knowledge elsewhere that manifests in unexpected ways. Practical model editing remains an active research area.

The ROME and MEMIT Methods

Meng et al. (2022) extended the knowledge neuron work with ROME (Rank-One Model Editing), which applies a rank-one update to a single FFN layer to insert, delete, or update a specific fact. The key insight is that modifying a single value vector in $W_2$ can change the output associated with a specific key pattern without affecting other key-value pairs in the same layer.

MEMIT (Mass-Editing Memory in a Transformer) scaled this to edit thousands of facts simultaneously by distributing modifications across multiple FFN layers. These methods formalize the FFN-as-memory hypothesis into a practical engineering tool: the FFN is not just metaphorically a memory — it can be read and written like one.
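As a toy illustration of the idea (a hand-picked two-entry memory, not the actual ROME algorithm, which solves a least-squares problem over real activations): overwriting one value vector redirects what one key retrieves while leaving the other entry untouched.

```python
def retrieve(x, keys, values):
    """Soft key-value lookup: sum_i relu(k_i . x) * v_i (toy associative memory)."""
    out = [0.0, 0.0]
    for k, v in zip(keys, values):
        a = max(0.0, sum(p * q for p, q in zip(k, x)))
        out = [o + a * vj for o, vj in zip(out, v)]
    return out

keys   = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 1.0], [1.0, 0.0]]

before = retrieve([0.0, 1.0], keys, values)  # entry 1's retrieval, pre-edit
values[0] = [1.0, 1.0]                       # 'edit the fact' stored at entry 0

print(retrieve([1.0, 0.0], keys, values))    # [1.0, 1.0]: key 0 retrieves the new value
print(retrieve([0.0, 1.0], keys, values) == before)  # True: entry 1 is unaffected
```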


7. The Mixture of Experts Connection

From One Memory Bank to Many

If a single FFN layer is a key-value memory with $d_\text{ff}$ entries, a natural question arises: what if we want more entries without proportionally increasing the compute cost? The answer is Mixture of Experts (MoE), which replaces the single FFN with $N$ parallel “expert” FFNs and a learned router that decides which expert(s) to use for each token.

$$\text{MoE-FFN}(x) = \sum_{i=1}^{N} g_i(x) \cdot \text{FFN}_i(x)$$

where $g_i(x)$ is the routing weight for expert $i$, computed by a small router network. In practice, the router selects only the top-$k$ experts (typically $k = 1$ or $k = 2$), so most experts are not activated for any given token. This is what makes MoE efficient: the total parameter count is $N$ times that of a dense FFN, but the per-token compute cost is only a $k/N$ fraction of activating every expert.
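A minimal top-$k$ router sketch, taking the softmax over only the selected logits (one common scheme; real routers add load-balancing losses and capacity limits, and the weights here are toy values):

```python
import math

def top_k_gates(x, router_w, k=2):
    """Return one gate weight per expert; only the top-k are nonzero."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_w]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = {i: math.exp(logits[i]) for i in top}  # softmax over selected experts only
    z = sum(exps.values())
    return [exps[i] / z if i in top else 0.0 for i in range(len(logits))]

router_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]  # 4 toy experts
gates = top_k_gates([2.0, 1.0], router_w, k=2)
print(gates)  # two nonzero weights (experts 0 and 3), summing to 1
```

The output of the MoE layer is then the gate-weighted sum over just the selected experts' FFN outputs, so the other $N - k$ experts cost nothing for this token.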

Specialization Through the Lens of Memory

Under the FFN-as-memory hypothesis, MoE has a clean interpretation: instead of one memory bank with $d_\text{ff}$ entries, you have $N$ specialized memory banks, each with $d_\text{ff}$ entries, for a total of $N \times d_\text{ff}$ entries. The router learns which memory bank is relevant for each input.

Empirically, experts do specialize. In trained MoE models:

  • Some experts specialize in punctuation and formatting.
  • Others specialize in named entities and factual knowledge.
  • Others specialize in code or mathematical expressions.
  • Some handle common syntactic patterns, while others handle rare constructions.

This specialization emerges naturally from the routing mechanism. The router learns to send each token to the expert whose key patterns best match the input, which means each expert can devote its full $d_\text{ff}$ capacity to a subset of the input distribution. This is far more parameter-efficient than a single dense FFN that must allocate memory slots across the entire input distribution.

Scale Examples

Mixtral 8x7B has 8 expert FFNs per layer, with 2 activated per token. Total parameters: ~47B. Active parameters per token: ~13B. This gives the model the knowledge capacity of a 47B-parameter model at the compute cost of a 13B-parameter model.

DeepSeek-V2 has 160 experts per layer with 6 activated per token. Total parameters: 236B. Active parameters per token: ~21B. The knowledge capacity is enormous — 160 expert memory banks per layer — but the per-token cost stays manageable.

💡 Part 10 Preview: MoE in Detail

The next post in this series covers Mixture of Experts in full depth: the routing algorithms (top-k, expert choice, soft routing), the load balancing problem and auxiliary losses, the capacity factor, the interaction between MoE and tensor parallelism, and the performance characteristics of sparse computation on modern GPUs.

The progression from dense FFN to MoE is a direct consequence of the FFN-as-memory framing. Once you view the FFN as a memory bank, the natural scaling strategy is to add more banks (experts) rather than making one bank wider (increasing $d_\text{ff}$). More banks with a router is more parameter-efficient than a wider bank, because the router avoids wasting compute on irrelevant memory entries.


8. FFN Performance Analysis

Compute Cost

Each linear layer in the FFN is a matrix multiplication. For a standard (non-gated) FFN with input dimension $d_\text{model}$ and hidden dimension $d_\text{ff}$, the compute cost per token is:

$$\text{FLOPs}_\text{FFN} = 2 \times d_\text{model} \times d_\text{ff} \times 2 = 4 \times d_\text{model} \times d_\text{ff}$$

The factor of 2 per matmul accounts for the multiply and accumulate operations (each element of the output requires $d$ multiplies and $d$ adds, where $d$ is the input dimension). The final factor of 2 counts both the up-projection and down-projection.

For a gated FFN (SwiGLU), there are three matrices instead of two:

$$\text{FLOPs}_\text{SwiGLU} = 2 \times d_\text{model} \times d_\text{ff} \times 3 = 6 \times d_\text{model} \times d_\text{ff}$$

For Llama 2 70B ($d_\text{model} = 8192$, $d_\text{ff} = 28672$):

$$\text{FLOPs per token per layer} = 6 \times 8192 \times 28672 = 1{,}409{,}286{,}144 \approx 1.41 \text{ GFLOPs}$$

For a batch of $B$ sequences and $S$ sequence positions:

$$\text{FLOPs per layer} = 6 \times B \times S \times d_\text{model} \times d_\text{ff}$$

Across all 80 layers, a single forward pass through the FFN blocks alone costs $80 \times 1.41 \approx 113$ GFLOPs per token. This dwarfs the attention computation at short sequence lengths.

📊 FFN Compute Cost Per Token Per Layer

| Model | d_model | d_ff | FFN Type | GFLOPs/token/layer |
|---|---|---|---|---|
| GPT-2 (1.5B) | 1600 | 6400 | Standard | 0.041 |
| Llama 2 7B | 4096 | 11008 | SwiGLU | 0.271 |
| Llama 2 13B | 5120 | 13824 | SwiGLU | 0.425 |
| Llama 2 70B | 8192 | 28672 | SwiGLU | 1.41 |
| Llama 3 405B | 16384 | 53248 | SwiGLU | 5.24 |

Note: FLOPs = 6 × d_model × d_ff for SwiGLU, 4 × d_model × d_ff for standard FFN.
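The per-layer figures in the table can be reproduced with a short script:

```python
def ffn_flops_per_token(d_model, d_ff, gated=True):
    # 2 * d_model * d_ff FLOPs per matmul (multiply + add);
    # SwiGLU has 3 matmuls, a standard FFN has 2.
    return 2 * d_model * d_ff * (3 if gated else 2)

for name, dm, dff, gated in [
    ("GPT-2 (1.5B)", 1600, 6400, False),
    ("Llama 2 7B",   4096, 11008, True),
    ("Llama 2 70B",  8192, 28672, True),
]:
    print(f"{name}: {ffn_flops_per_token(dm, dff, gated) / 1e9:.3f} GFLOPs/token/layer")
```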

Memory Traffic

During inference, the dominant cost of the FFN is not compute but memory bandwidth. The three weight matrices ($W_1$, $W_2$, $W_3$ for SwiGLU) must be loaded from HBM for every batch of tokens. The total weight data per FFN layer is:

$$\text{Bytes}_\text{FFN} = 3 \times d_\text{model} \times d_\text{ff} \times d_\text{type}$$

For Llama 2 70B in BF16 ($d_\text{type} = 2$):

$$\text{Bytes}_\text{FFN} = 3 \times 8192 \times 28672 \times 2 = 1{,}409{,}286{,}144 \text{ bytes} \approx 1.41 \text{ GB}$$

On an H100 with 3.35 TB/s of HBM bandwidth, loading a single FFN layer takes:

$$t_\text{load} = \frac{1.41 \text{ GB}}{3.35 \text{ TB/s}} \approx 0.42 \text{ ms}$$

At batch size 1 (single-token decode), this is the entire cost. The compute (1.41 GFLOPs) takes roughly 0.0014 ms on an H100 at 990 TFLOPS — about 300x faster than loading the weights. The FFN is massively memory-bandwidth-bound at small batch sizes, exactly like the attention mechanism during decode.
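The decode-time gap can be checked with a few lines, using the H100 figures assumed in the text (3.35 TB/s HBM bandwidth, 990 TFLOPS BF16 peak):

```python
d_model, d_ff = 8192, 28672         # Llama 2 70B FFN dimensions
bytes_per_param = 2                 # BF16
hbm_bw = 3.35e12                    # H100 HBM bandwidth, bytes/s
peak_flops = 990e12                 # H100 BF16 peak, FLOP/s

weight_bytes = 3 * d_model * d_ff * bytes_per_param
flops = 6 * d_model * d_ff          # one token through a SwiGLU FFN

t_load = weight_bytes / hbm_bw      # time to stream the weights from HBM
t_math = flops / peak_flops         # time to do the arithmetic

print(f"load {t_load * 1e3:.2f} ms vs compute {t_math * 1e3:.4f} ms "
      f"({t_load / t_math:.0f}x gap)")
```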

Arithmetic Intensity and the Roofline

The arithmetic intensity (AI) of the FFN is the ratio of compute to memory traffic:

$$\text{AI} = \frac{\text{FLOPs}}{\text{Bytes}} = \frac{6 \times B \times d_\text{model} \times d_\text{ff}}{3 \times d_\text{model} \times d_\text{ff} \times d_\text{type}} = \frac{2B}{d_\text{type}} = B \text{ (for BF16)}$$

This is a remarkably clean result: the arithmetic intensity of the FFN is simply the batch size (in BF16). This means:

  • At batch=1: AI = 1 FLOP/byte. Deeply memory-bound. The H100 has a compute-to-bandwidth ratio of approximately 295 FLOPs/byte (990 TFLOPS / 3.35 TB/s), so the GPU is at roughly 0.3% compute utilization.
  • At batch=32: AI = 32. Still memory-bound, but utilization improves to about 11%.
  • At batch=295: AI = 295. The FFN reaches the roofline — the balance point where compute and memory bandwidth are equally saturated.
  • At batch=1024: AI = 1024. Compute-bound. Now the GPU is fully utilized and memory bandwidth is not the bottleneck.
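These utilization figures follow directly from the roofline model; a quick sketch:

```python
def ffn_utilization(batch_size, ridge=295.0):
    # In BF16 the FFN's arithmetic intensity equals the batch size, so the
    # achieved fraction of peak FLOPs is min(AI / ridge, 1) under the
    # roofline model (ridge = 990 TFLOPS / 3.35 TB/s for the H100).
    return min(batch_size / ridge, 1.0)

for b in (1, 8, 32, 128, 295, 1024):
    print(f"batch={b:5d}: {100 * ffn_utilization(b):5.1f}% of peak FLOPs")
```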

FFN Compute Utilization vs. Batch Size (H100)

| Batch | Regime | Utilization |
|---|---|---|
| 1 | Memory-bound | 0.3% |
| 8 | Memory-bound | 2.7% |
| 32 | Memory-bound | 10.8% |
| 128 | Approaching balance | 43.4% |
| 295 | Roofline (balanced) | 100% |
| 1024 | Compute-bound | 100% |

The Decode Bottleneck

During autoregressive decoding (generating one token at a time), the effective batch size for the FFN is the number of sequences being generated in parallel. At batch=1, the H100 spends 0.42 ms loading 1.41 GB of FFN weights to perform a computation that takes 0.0014 ms. Over 99.6% of the time is spent waiting for memory.

This is identical to the attention decode bottleneck described in Part 7 (speculative decoding). In fact, during decode, the FFN and the attention mechanism are both memory-bandwidth-bound, and their costs are additive. For Llama 2 70B at batch=1:

  • FFN weight loading per layer: 1.41 GB, 0.42 ms
  • Attention weight loading per layer: ~0.30 GB, 0.09 ms
  • KV cache reading per layer: varies with context length

The FFN accounts for roughly 80% of the weight-loading cost per layer, matching its 82% share of the parameters. This is not a coincidence — when you are memory-bandwidth-bound, the cost is proportional to the number of bytes loaded, which is proportional to the number of parameters.

Per-Layer Weight Loading: Llama 2 70B (BF16)

Decode phase, batch=1. All components are memory-bandwidth-bound.

| Component | Size | Notes |
|---|---|---|
| FFN W1 (gate) | 448 MB | d_model × d_ff = 8192 × 28672 × 2 bytes |
| FFN W3 (up) | 448 MB | same dimensions as W1 |
| FFN W2 (down) | 448 MB | d_ff × d_model = 28672 × 8192 × 2 bytes |
| Attention (Wq, Wk, Wv, Wo) | 286 MB | GQA-8: smaller K,V projections |

Prefill vs. Decode: A Different Story

During the prefill phase (processing the input prompt), all tokens are processed simultaneously. The effective batch size is $B \times S$ (batch size times prompt length). For a batch of 8 requests with 2048-token prompts, the effective batch size is 16,384 — far above the roofline threshold. In this regime, the FFN is compute-bound, and the GPU achieves near-peak utilization.

This is why prefill throughput scales almost linearly with GPU FLOPs, while decode throughput scales with memory bandwidth. Optimization strategies differ accordingly:

  • Prefill: maximize FLOPs (use larger GPUs, optimize matmul kernels, use FP8 for higher throughput).
  • Decode: maximize memory bandwidth (use more GPUs to parallelize weight loading, quantize weights to reduce bytes, increase batch size to amortize loading cost).
📊 FFN Performance Regime by Phase

| Phase | Effective Batch | Arithmetic Intensity | Bottleneck | Optimization |
|---|---|---|---|---|
| Decode (batch=1) | 1 | 1 FLOP/byte | Memory BW | Weight quantization, batching |
| Decode (batch=32) | 32 | 32 FLOP/byte | Memory BW | Batching, TP parallelism |
| Decode (batch=256) | 256 | 256 FLOP/byte | Near balanced | Balanced optimization |
| Prefill (B=8, S=2048) | 16,384 | 16,384 FLOP/byte | Compute | FP8, kernel optimization |

Note: H100 roofline crossover at ~295 FLOP/byte (990 TFLOPS / 3.35 TB/s). BF16 arithmetic.

9. Putting It All Together

The FFN in the Residual Stream

To understand the FFN’s role in the full transformer, recall the residual stream architecture. Each transformer layer computes:

$$x' = x + \text{Attention}(\text{LN}(x))$$
$$x'' = x' + \text{FFN}(\text{LN}(x'))$$

The attention layer reads from the residual stream, computes cross-token interactions, and writes back. Then the FFN reads from the updated residual stream, processes each token independently through its key-value memory, and writes back. The residual connections ensure that information from earlier layers is preserved — the FFN adds to the existing representation rather than replacing it.

This means the FFN at layer $\ell$ operates on a representation that already incorporates the outputs of all previous attention and FFN layers. By the middle layers, each token’s representation in the residual stream contains rich contextual information: it knows not just what word it is, but what words surround it, what syntactic role it plays, what the topic of the passage is. This context-enriched representation is what the FFN’s keys match against, which is why the FFN can detect high-level patterns like “this token appears in the context of a question about European geography” and retrieve the appropriate factual information.
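The pre-norm residual block can be sketched in numpy. This is a toy sketch: uniform token mixing stands in for real attention, and RMSNorm is shown without its learned scale, but the residual structure matches the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_ff = 4, 16, 64

W_mix = rng.standard_normal((d_model, d_model)) * 0.1  # toy "attention" projection
W1 = rng.standard_normal((d_ff, d_model)) * 0.1        # gate projection
W3 = rng.standard_normal((d_ff, d_model)) * 0.1        # up projection
W2 = rng.standard_normal((d_model, d_ff)) * 0.1        # down projection

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def toy_attention(x):   # placeholder for real attention: mixes across tokens
    return np.full((seq, seq), 1.0 / seq) @ (x @ W_mix.T)

def swiglu_ffn(x):      # per-token: SiLU(x W1) * (x W3), then down-project
    a = x @ W1.T
    return (a / (1 + np.exp(-a)) * (x @ W3.T)) @ W2.T

def block(x):
    x = x + toy_attention(rms_norm(x))   # x'  = x  + Attention(LN(x))
    x = x + swiglu_ffn(rms_norm(x))      # x'' = x' + FFN(LN(x'))
    return x

y = block(rng.standard_normal((seq, d_model)))
```

Note that each sublayer adds its output to the stream: the input representation is never overwritten, only incremented.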

Design Decisions for Modern FFNs

If you are designing or fine-tuning a transformer today, the FFN decisions are largely settled:

  1. Activation function: SwiGLU. There is no compelling reason to use anything else for a new model.
  2. Expansion ratio: $d_\text{ff} \approx \frac{8}{3} d_\text{model}$, rounded to a multiple of 256 for GPU efficiency. Many practitioners use a slightly larger ratio for additional capacity.
  3. Biases: No biases ($b_1 = b_2 = 0$). Removing biases saves a small number of parameters and does not measurably affect quality. All Llama models omit biases.
  4. Normalization: RMSNorm (not LayerNorm) applied before the FFN, as part of the pre-norm residual architecture. Pre-norm is more stable than post-norm for deep networks.
  5. Parallelism: For large models, the FFN weights are partitioned across GPUs using tensor parallelism. $W_1$ and $W_3$ are column-partitioned, $W_2$ is row-partitioned. Each GPU computes a shard of the hidden dimension, and an all-reduce synchronizes the output.
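The sharding in point 5 can be verified numerically. In this toy sketch each loop iteration plays the role of one GPU and the final sum plays the role of the all-reduce; with weights stored as (output, input) matrices, the "column partition" of $W_1$/$W_3$ appears as row slices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_gpus = 32, 128, 4
shard = d_ff // n_gpus

W1 = rng.standard_normal((d_ff, d_model)) * 0.1   # gate projection
W3 = rng.standard_normal((d_ff, d_model)) * 0.1   # up projection
W2 = rng.standard_normal((d_model, d_ff)) * 0.1   # down projection
x = rng.standard_normal(d_model)

def silu(a):
    return a / (1 + np.exp(-a))

# Reference: the full SwiGLU FFN on one device.
full = W2 @ (silu(W1 @ x) * (W3 @ x))

# Tensor parallel: each "GPU" holds one slice of the hidden dimension —
# matching slices of W1/W3 and the corresponding columns of W2.
partials = []
for g in range(n_gpus):
    rows = slice(g * shard, (g + 1) * shard)
    h_g = silu(W1[rows] @ x) * (W3[rows] @ x)   # local hidden shard, no comms
    partials.append(W2[:, rows] @ h_g)          # local partial output
tp_out = sum(partials)                          # the all-reduce (a sum)
```

Because the gating nonlinearity is elementwise, each shard of the hidden dimension is computed entirely locally; the only communication is the final summed output.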

The Capacity Question

How many facts can a transformer’s FFN blocks store? This is an open and fascinating question. A rough upper bound comes from the key-value memory interpretation: each FFN layer has $d_\text{ff}$ memory slots, and there are $L$ layers, giving $L \times d_\text{ff}$ total slots.

For Llama 2 70B: $80 \times 28672 = 2{,}293{,}760$ slots. But this is a loose upper bound. Each “fact” may require multiple neurons (Dai et al. found 5 to 20 neurons per fact). Knowledge is redundantly stored across layers. And the same neurons may participate in multiple facts through superposition — a topic of active research in mechanistic interpretability.

A practical estimate: large language models appear to store on the order of millions to tens of millions of distinct factual associations, which is consistent with their impressive but imperfect recall of world knowledge. The FFN is the primary substrate for this knowledge, and understanding its structure is the key to understanding what the model knows and how to modify it.


Conclusion

The feed-forward network is the workhorse of the transformer. It accounts for two-thirds of the parameters, the majority of the compute during prefill, and the majority of the memory bandwidth consumption during decode. It is where factual knowledge is stored, organized as a soft key-value memory with tens of thousands of entries per layer.

The evolution from ReLU to GELU to SwiGLU reflects a broader trend in deep learning: smooth, gated nonlinearities outperform simple thresholding functions, especially in deep networks where gradient flow is critical. SwiGLU’s gating mechanism is not just a minor improvement — it fundamentally changes the FFN’s computational structure by allowing the network to independently control what to compute and whether to compute it.

The FFN-as-memory hypothesis, supported by the knowledge neuron literature and practical model editing techniques like ROME and MEMIT, gives us a concrete framework for understanding what lives inside those 56 billion parameters. Knowledge is not a diffuse, inscrutable pattern spread across the model — it is stored as identifiable key-value pairs in specific FFN layers, concentrated in the middle of the network.

And the path from dense FFNs to Mixture of Experts is a natural one: if the FFN is a memory bank, scale it by adding more banks. MoE replaces a single FFN with many specialized experts, each handling a subset of the input distribution. This is the subject of the next post in this series.

For the systems engineer, the key takeaway is that the FFN’s performance characteristics are simple and predictable. Arithmetic intensity scales linearly with batch size. At batch=1 decode, the FFN is memory-bandwidth-bound with under 1% compute utilization. At prefill-scale batch sizes, it is compute-bound with near-peak utilization. Every optimization strategy — quantization, batching, tensor parallelism, speculative decoding — must account for which regime the FFN is operating in.