The previous eight posts in this series have focused almost exclusively on the attention mechanism — how queries meet keys, how positional information is encoded, how the KV cache is compressed. Attention gets the spotlight because it is the architectural novelty that defines the transformer. But attention accounts for only about one-third of a transformer’s parameters. The other two-thirds live in the feed-forward network (FFN), also called the MLP block, which sits after every attention layer in the residual stream.
This imbalance between attention’s fame and the FFN’s parameter mass is not a coincidence. The attention mechanism routes information between token positions: it decides which tokens should talk to which other tokens. The FFN, by contrast, processes each token position independently. It reads a single vector from the residual stream, transforms it through a nonlinear function, and writes the result back. No cross-token communication occurs. If attention is the postal service that delivers mail between addresses, the FFN is the factory at each address that processes what arrives.
This post covers everything you need to understand about the FFN block: its architecture and tensor shapes, the expansion ratio and why it dominates parameter count, the evolution of activation functions from ReLU to GELU to SwiGLU, the gating mechanism and why multiplicative interactions matter, the remarkable FFN-as-key-value-memory hypothesis, knowledge neurons and model editing, the connection to Mixture of Experts, and finally a detailed performance analysis showing where the FFN sits on the roofline.
1. The FFN Block Architecture
Two Linear Layers with a Nonlinearity
The standard feed-forward network in a transformer is a simple two-layer MLP applied independently to each token position. Given an input vector $x \in \mathbb{R}^{d_\text{model}}$ from the residual stream, the FFN computes:

$$\mathrm{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$ is the up-projection, $W_2 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ is the down-projection, $\sigma$ is a nonlinear activation function, and $b_1$, $b_2$ are optional bias terms (most modern architectures omit them).
The tensor shapes through the FFN, for a batched input, are:
- Input: $[B, S, d_\text{model}]$ — batch size $B$, sequence length $S$, model dimension $d_\text{model}$.
- After up-projection: $[B, S, d_\text{ff}]$ — the hidden dimension $d_\text{ff}$ is typically much larger than $d_\text{model}$.
- After activation: $[B, S, d_\text{ff}]$ — same shape, but values have been nonlinearly transformed.
- After down-projection: $[B, S, d_\text{model}]$ — back to the original dimension, ready to be added to the residual stream.
```python
import torch
import torch.nn as nn

class FeedForwardNetwork(nn.Module):
    """
    Standard two-layer FFN as in the original Transformer.
    Applied independently to each token position.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # Up-projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # Down-projection
        self.act = nn.ReLU()  # Original Transformer used ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        hidden = self.act(self.w1(x))  # [batch, seq_len, d_ff]
        output = self.w2(hidden)       # [batch, seq_len, d_model]
        return output
```
Why Two Layers?
The first linear layer projects UP from $d_\text{model}$ to $d_\text{ff}$, expanding the representation into a higher-dimensional space. The nonlinearity acts in this higher-dimensional space, creating complex nonlinear feature combinations that a single linear layer could not express. The second linear layer projects DOWN back to $d_\text{model}$, compressing the result into a form that fits back into the residual stream.
This expand-then-compress pattern is the fundamental trick. A single linear transformation can only compute linear functions of the input. By expanding to a higher dimension, applying a pointwise nonlinearity, and then compressing back down, the FFN can approximate a much richer class of functions. The wider the intermediate dimension $d_\text{ff}$, the more expressive the FFN becomes — at the cost of more parameters and more compute.
The FFN processes each token position completely independently. Token at position 0 and token at position 512 go through exactly the same weights, but there is no information flow between them inside the FFN. All cross-position communication happens in the attention layers. This separation of concerns — attention for routing, FFN for processing — is a core architectural principle of the transformer.
The mental model to carry forward: the FFN is a per-token MLP that reads from the residual stream, transforms the representation through a wide bank of nonlinear features, and writes back. It is where the transformer “thinks” about each token individually, using the context that attention has already mixed into that token’s representation.
2. The Expansion Ratio
The Standard 4x Rule
The original transformer (Vaswani et al., 2017) set $d_\text{ff} = 4 \cdot d_\text{model}$. For a model with $d_\text{model} = 512$, the FFN hidden dimension was $d_\text{ff} = 2048$. This 4x expansion ratio became one of the most widely adopted hyperparameters in deep learning, carried forward through GPT-2, GPT-3, and many other architectures.
Why 4x? The original paper does not provide a deep theoretical justification. Empirically, 4x provides a good balance between expressiveness and parameter efficiency. Smaller ratios (2x, 3x) reduce capacity and hurt quality. Larger ratios (8x, 16x) provide diminishing returns while linearly increasing parameters and compute. The 4x ratio has proven robust across a wide range of model sizes, from 100M to 175B parameters.
SwiGLU Changes the Math
When gated activations like SwiGLU replaced ReLU (which we will cover in detail in Sections 3 and 4), the expansion ratio changed. A gated FFN uses THREE weight matrices instead of two: $W_1$ (gate projection), $W_3$ (up-projection), and $W_2$ (down-projection). To keep the total parameter count equal to a standard FFN with $d_\text{ff} = 4 \cdot d_\text{model}$, the hidden dimension must be reduced.
A standard FFN has $2 \cdot d_\text{model} \cdot d_\text{ff}$ parameters (two matrices). A gated FFN with hidden dimension $d_\text{ff}'$ has $3 \cdot d_\text{model} \cdot d_\text{ff}'$ parameters (three matrices). Setting these equal:

$$3 \cdot d_\text{model} \cdot d_\text{ff}' = 2 \cdot d_\text{model} \cdot (4 \cdot d_\text{model}) \quad\Rightarrow\quad d_\text{ff}' = \tfrac{8}{3} \cdot d_\text{model} \approx 2.67 \cdot d_\text{model}$$

In practice, $d_\text{ff}$ is rounded to a multiple of 256 (for GPU efficiency). For Llama 2 70B with $d_\text{model} = 8192$, the theoretical value is $\tfrac{8}{3} \times 8192 \approx 21{,}845$, but the actual value used is $d_\text{ff} = 28{,}672$ — larger than the parameter-matched value, because the Llama designers chose to allocate extra capacity to the FFN.
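The parameter-matching arithmetic can be sketched in a few lines (a hypothetical helper written for this post, not from any particular codebase):

```python
def matched_gated_dff(d_model: int, ratio: int = 4, multiple: int = 256) -> int:
    """Hidden dimension for a 3-matrix gated FFN with the same parameter
    count as a 2-matrix FFN of width ratio * d_model, rounded for GPU tiling."""
    # 2 * d_model * d_ff  ==  3 * d_model * d_ff'   =>   d_ff' = (2/3) * d_ff
    d_ff = 2 * ratio * d_model / 3
    return multiple * round(d_ff / multiple)

print(matched_gated_dff(8192))  # ~21845 before rounding -> 21760
print(matched_gated_dff(4096))  # 11008 — exactly Llama 2 7B's published d_ff
```

Note that the parameter-matched value for Llama 2 7B (11008) is exactly what the model uses, while the 70B model deliberately overshoots it.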
Parameter Count Dominance
Let us work through the numbers for Llama 2 70B to see why the FFN dominates the parameter count.
Per-layer FFN parameters (SwiGLU with $d_\text{model} = 8192$, $d_\text{ff} = 28{,}672$):

$$3 \times 8192 \times 28{,}672 \approx 704\text{M}$$

Per-layer attention parameters (GQA with 64 query heads, 8 KV heads, head dimension 128):

$$\underbrace{2 \times 8192^2}_{W_Q,\ W_O} + \underbrace{2 \times 8192 \times 1024}_{W_K,\ W_V} \approx 151\text{M}$$

The FFN has roughly 4.7x more parameters than the attention layer. Across 80 layers:

$$704\text{M} \times 80 \approx 56.4\text{B (FFN)} \qquad 151\text{M} \times 80 \approx 12.1\text{B (attention)}$$
The FFN accounts for approximately 82% of the model’s layer parameters, with the remaining 18% in attention. Adding the embedding and output layers does not significantly change this ratio.
Parameter Breakdown: Llama 2 70B
| Component | Per Layer | Total (80 layers) | Share |
|---|---|---|---|
| FFN (SwiGLU) | 704M | 56.4B | 82% |
| Attention (GQA-8) | 151M | 12.1B | 18% |
| Embeddings + Head | -- | ~1.1B | -- |
| Total | 855M | ~69.6B | 100% |
This is the core insight: when you are looking at a 70B parameter model, roughly 56 billion of those parameters live in the FFN blocks. Understanding what these parameters do — and how knowledge is stored in them — is essential.
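These counts can be checked directly; the head configuration below is Llama 2 70B's published GQA setup:

```python
d_model, d_ff, n_layers = 8192, 28672, 80
n_kv_heads, head_dim = 8, 128                  # GQA-8

ffn_per_layer = 3 * d_model * d_ff             # gate, up, down projections
attn_per_layer = (2 * d_model * d_model        # W_Q and W_O (full width)
                  + 2 * d_model * n_kv_heads * head_dim)  # W_K, W_V (grouped)

print(f"FFN:  {ffn_per_layer / 1e6:.0f}M/layer, "
      f"{ffn_per_layer * n_layers / 1e9:.1f}B total")
print(f"Attn: {attn_per_layer / 1e6:.0f}M/layer, "
      f"{attn_per_layer * n_layers / 1e9:.1f}B total")
print(f"FFN/Attn ratio: {ffn_per_layer / attn_per_layer:.1f}x")
```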
3. Activation Functions: The Evolution
The activation function in the FFN determines how the network introduces nonlinearity. The choice of activation has evolved significantly over the transformer era, driven by empirical findings about training dynamics and gradient flow.
ReLU: The Original Choice
The original transformer used ReLU (Rectified Linear Unit):

$$\mathrm{ReLU}(x) = \max(0, x)$$
ReLU is elegant in its simplicity. For positive inputs, it is the identity function. For negative inputs, it outputs zero. This creates sparse activations: at any given input, roughly half the neurons in the hidden layer have zero output. Sparsity is computationally attractive because multiplying by zero is free, and it has been argued to encourage more interpretable representations.
However, ReLU has a fundamental problem for deep networks: the dying ReLU phenomenon. For any neuron where W_1 x + b_1 < 0, the gradient through ReLU is exactly zero. If a neuron drifts into a regime where its pre-activation is consistently negative across the training data, it receives zero gradient and can never recover. The neuron is “dead” — it contributes nothing to the network’s output and wastes parameters. In deep transformer stacks with many layers of FFNs, this problem compounds.
GELU: The Smooth Alternative
Hendrycks and Gimpel (2016) proposed the Gaussian Error Linear Unit:

$$\mathrm{GELU}(x) = x \cdot \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian. In practice, this is approximated as:

$$\mathrm{GELU}(x) \approx 0.5\,x \left(1 + \tanh\!\left[\sqrt{2/\pi}\left(x + 0.044715\,x^3\right)\right]\right)$$

GELU behaves like a smooth version of ReLU. For large positive values, $\Phi(x) \to 1$ and $\mathrm{GELU}(x) \to x$ (like ReLU). For large negative values, $\Phi(x) \to 0$ and $\mathrm{GELU}(x) \to 0$ (like ReLU). But the transition is smooth: near zero, GELU has non-zero gradients on both sides. This means neurons never fully “die” — they always receive at least a small gradient signal, allowing them to recover from unfavorable weight configurations.
GELU became the default activation in BERT and GPT-2, and its success was one of the early signals that smoother activations help transformer optimization. The improvement is not dramatic on any single benchmark — typically 0.1—0.5% — but it is consistent across tasks and scales, and it makes training more stable.
SiLU/Swish: The Simplification
The SiLU (Sigmoid Linear Unit), also known as Swish, was proposed by Ramachandran et al. (2017):

$$\mathrm{SiLU}(x) = x \cdot \sigma(x)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid function. SiLU is conceptually simpler than GELU (sigmoid is easier to reason about than the Gaussian CDF) and behaves almost identically in practice. For large positive $x$, the sigmoid approaches 1 and $\mathrm{SiLU}(x) \to x$. For large negative $x$, the sigmoid approaches 0 and $\mathrm{SiLU}(x) \to 0$. Near zero, SiLU is smooth with a slight non-monotonicity: it dips slightly below zero, reaching a minimum of about $-0.28$ near $x \approx -1.28$.
The quality difference between GELU and SiLU is negligible in practice. SiLU became the activation of choice for the Llama family of models, primarily because it integrates cleanly into the gated architecture described next.
For standalone FFNs (without gating), GELU and SiLU/Swish perform nearly identically. The choice is largely one of convention: BERT and GPT-2 use GELU, Llama uses SiLU. If you are implementing a gated FFN (SwiGLU), use SiLU — that is the “Swi” in SwiGLU. If you are using a standard two-matrix FFN, GELU is the safe default.
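The near-identical behavior is easy to see from the definitions themselves (exact GELU via the Gaussian CDF, SiLU via the logistic sigmoid):

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x: float) -> float:
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

# Both track ReLU at the extremes and differ only slightly near zero.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}")
```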
The Common Theme
All three smooth activations (GELU, SiLU, Swish) share a key property that ReLU lacks: they are smooth functions with non-zero gradients almost everywhere. This matters for optimization in deep networks. When you stack 80 or 128 layers, each with an FFN, the gradient must flow backward through all those activations during training. Zero gradients in ReLU create dead zones that block gradient flow. Smooth activations keep the gradient signal alive, enabling more stable training at scale.
Activation Function Comparison (Approximate Behavior)
4. Gated FFNs: SwiGLU, GeGLU, ReGLU
The Gating Mechanism
The most significant architectural change to the FFN since the original transformer is the introduction of gating. Instead of applying the activation function to a single up-projection, a gated FFN computes two parallel projections and multiplies them together element-wise:

$$\mathrm{SwiGLU}(x) = W_2 \left( \mathrm{SiLU}(W_1 x) \odot W_3 x \right)$$

Here, $W_1$ is the gate projection, $W_3$ is the up-projection, and $W_2$ is the down-projection. The symbol $\odot$ denotes element-wise (Hadamard) multiplication.
The name decodes as follows: Swi (Swish/SiLU activation) + GLU (Gated Linear Unit). Replace Swish with GELU and you get GeGLU. Replace it with ReLU and you get ReGLU.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU_FFN(nn.Module):
    """
    Gated FFN with SwiGLU activation.
    Used by Llama 2, Llama 3, Mistral, and most modern LLMs.
    """
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # Gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # Up-projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # Down-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        gate = F.silu(self.w1(x))  # [batch, seq_len, d_ff]
        up = self.w3(x)            # [batch, seq_len, d_ff]
        hidden = gate * up         # Element-wise gating
        output = self.w2(hidden)   # [batch, seq_len, d_model]
        return output
```
Why Gating Helps
The standard FFN computes $W_2\,\sigma(W_1 x)$: the activation function acts as a fixed filter on the up-projected features. The network must learn $W_1$ such that the features worth keeping have positive pre-activations and the features worth suppressing have negative pre-activations. This is a single, somewhat rigid mechanism.
The gated FFN introduces a multiplicative interaction between two different projections of the input. The $\mathrm{SiLU}(W_1 x)$ term computes a soft binary mask: which features should be active (values near 1) and which should be suppressed (values near 0). The $W_3 x$ term computes the actual feature values. The element-wise product means the network can independently learn what to compute (via $W_3$) and whether to compute it (via $W_1$).
This separation is more powerful than it may appear. In a standard FFN, the same matrix $W_1$ must simultaneously determine both the feature value and the gating decision. In a gated FFN, these two roles are decoupled into separate learnable transformations. The gate can learn to detect arbitrary patterns in the input — specific token types, specific positions in a sentence, specific semantic features — and selectively enable or disable the corresponding features from the up-projection.
Multiplicative interactions also create higher-order feature combinations. When $\mathrm{SiLU}(W_1 x) \odot W_3 x$ is computed, the effective output is a product of two linear functions of the input, which is a second-order polynomial in $x$ (modulo the SiLU nonlinearity). This gives the network access to interaction terms that a standard FFN can only approximate through the nonlinearity.
The Shazeer (2020) Paper
Noam Shazeer’s “GLU Variants Improve Transformer” (2020) systematically tested all combinations of gating with different activation functions. The paper is remarkably concise — essentially a large ablation study. The key findings:
- All gated variants outperform all non-gated variants at equal parameter count. This is the single most important result: gating itself, regardless of the specific activation function, consistently helps.
- SwiGLU and GeGLU tie for best performance, with SwiGLU having a slight edge on most benchmarks.
- ReGLU (gated ReLU) outperforms standard GELU without gating, demonstrating that gating is more important than the choice of activation function.
- The improvements are consistent across model sizes (from 100M to 1B parameters tested) and across tasks (language modeling perplexity, downstream benchmarks).
GLU Variants Comparison (Shazeer 2020)
| FFN Variant | Activation | Gated? | Perplexity (lower = better) |
|---|---|---|---|
| Standard FFN | ReLU | No | 3.89 |
| Standard FFN | GELU | No | 3.80 |
| ReGLU | ReLU | Yes | 3.76 |
| GeGLU | GELU | Yes | 3.72 |
| SwiGLU | SiLU/Swish | Yes | 3.71 |
The Parameter Tradeoff
The price of gating is an extra weight matrix. A standard FFN with hidden dimension $d_\text{ff}$ has $2 \cdot d_\text{model} \cdot d_\text{ff}$ parameters. A gated FFN has $3 \cdot d_\text{model} \cdot d_\text{ff}$ parameters. To maintain the same total parameter count, the hidden dimension must shrink from $d_\text{ff}$ to $\tfrac{2}{3} d_\text{ff}$.
For the standard 4x expansion ratio, $d_\text{ff} = 4 \cdot d_\text{model}$, so the parameter-matched gated width is $\tfrac{8}{3} \cdot d_\text{model} \approx 2.67 \cdot d_\text{model}$. The hidden dimension is narrower, but the gate’s multiplicative interaction more than compensates. In Shazeer’s experiments, the gated FFN with the narrower hidden dimension consistently outperformed the standard FFN with the wider hidden dimension at exactly the same parameter count.
As of 2025, SwiGLU has become the near-universal default for transformer FFNs. Llama 2/3, Mistral, Mixtral, Gemma, Qwen, DeepSeek, and most other frontier models use SwiGLU. If you are building a new transformer from scratch, SwiGLU is the obvious choice — the quality improvement is free once you account for the parameter rebalancing.
5. The FFN-as-Key-Value-Memory Hypothesis
The Core Insight
In 2021, Geva, Schuster, Berant, and Levy published “Transformer Feed-Forward Layers Are Key-Value Memories,” a paper that fundamentally reframed how we think about the FFN. Their insight is both simple and profound: the FFN’s two weight matrices behave like the key and value stores of a soft associative memory.
Consider the standard FFN (without gating, for clarity):

$$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x)$$

Write $W_1$ row-by-row. The $i$-th row of $W_1$, call it $k_i$, computes a dot product with the input $x$:

$$a_i = \sigma(k_i \cdot x)$$

This dot product measures how well the input matches the pattern encoded in $k_i$. If the match is strong (high dot product), the activation $a_i$ is large. If the match is weak, $a_i$ is small or zero (for ReLU).
Now write $W_2$ column-by-column. The $i$-th column of $W_2$, call it $v_i$, is the “value” associated with the $i$-th key. The output of the FFN is:

$$\mathrm{FFN}(x) = \sum_{i=1}^{d_\text{ff}} a_i\, v_i = \sum_{i=1}^{d_\text{ff}} \sigma(k_i \cdot x)\, v_i$$

This is a weighted sum over value vectors, where the weights are determined by how well the input matches each key vector. This is precisely the mechanism of a soft key-value lookup:
- Keys (rows of $W_1$): patterns the FFN has learned to detect in the input.
- Values (columns of $W_2$): information the FFN retrieves when a key matches.
- Activations ($a_i = \sigma(k_i \cdot x)$): the soft matching function that determines retrieval strength.
The FFN is essentially a learned associative memory with $d_\text{ff}$ entries. Each entry consists of a key pattern (what to look for) and a value vector (what to retrieve). When the input matches a key, the corresponding value is retrieved and added to the residual stream.
Think of the FFN as a lookup table with $d_\text{ff}$ entries. Each entry has a key (a row of $W_1$) and a value (a column of $W_2$). The input is the query. The FFN computes a soft match against all keys simultaneously, then returns a weighted combination of the corresponding values. For a model with $d_\text{ff} = 28{,}672$, that is 28,672 entries in the lookup table — per layer.
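The key-value reading is not just an analogy but an algebraic identity, which is easy to verify on random weights (toy dimensions, ReLU for clarity):

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 16, 64
W1 = torch.randn(d_ff, d_model)   # rows of W1 are the keys
W2 = torch.randn(d_model, d_ff)   # columns of W2 are the values
x = torch.randn(d_model)          # the query from the residual stream

# Standard FFN forward pass
out = W2 @ torch.relu(W1 @ x)

# Same computation as a soft lookup: match x against every key,
# then sum the value columns weighted by match strength.
strengths = torch.relu(W1 @ x)                         # [d_ff]
lookup = sum(a * W2[:, i] for i, a in enumerate(strengths))

print(torch.allclose(out, lookup, atol=1e-3))  # True
```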
Evidence from Probing Experiments
Geva et al. provided compelling evidence for this interpretation through a series of probing experiments on GPT-2 and other models:
Key pattern analysis. They examined what input patterns cause specific rows of $W_1$ to activate strongly. They found that individual keys correspond to highly interpretable features:
- Certain keys activate specifically for tokens that represent countries.
- Other keys activate for tokens in the context of temporal expressions (“in 1997”, “last year”).
- Some keys activate for tokens following particular syntactic patterns (e.g., the subject of a passive construction).
Value vector analysis. They then examined the corresponding columns of $W_2$ — the values retrieved when those keys match. When a “country” key fires, the corresponding value promotes the token for that country’s capital in the output vocabulary distribution. When a “temporal” key fires, the corresponding value promotes time-related tokens.
The retrieval is compositional. Because the output is a weighted sum of many values, the FFN does not just retrieve a single fact — it composes multiple partial retrievals. If the input simultaneously matches a “European country” key and a “capital city” key, the values from both contribute to the output, collectively pushing the distribution toward the correct capital.
Implications
This reframing has several profound implications:
1. Knowledge storage is explicit. Factual knowledge (“Paris is the capital of France”) is not stored as some diffuse, inscrutable pattern across the network. It is stored as specific key-value pairs in specific FFN layers. The key encodes the input context that triggers the retrieval, and the value encodes the output to produce.
2. The FFN is a memory bank. Each FFN layer contains $d_\text{ff}$ memory slots. Across 80 layers with $d_\text{ff} = 28{,}672$, the model has roughly 2.3 million memory slots. This is the model’s “knowledge capacity” — the number of distinct facts or patterns it can store.
3. Different layers store different types of knowledge. Early layers tend to store syntactic patterns. Middle layers store semantic associations and factual knowledge. Late layers store output-formatting patterns. This layerwise specialization mirrors findings from probing classifiers applied to attention.
FFN as Key-Value Memory (Conceptual)
Each FFN layer is a soft associative memory with d_ff entries
6. Knowledge Neurons
Localizing Facts in the Network
If the FFN stores knowledge as key-value pairs, can we identify the specific neurons that store a specific fact? Dai et al. (2022) addressed this question in “Knowledge Neurons in Pretrained Transformers,” introducing a method to locate and manipulate the neurons responsible for individual factual associations.
Their approach works as follows. Given a factual query like “The capital of France is ___”, they:
- Run the query through the model and record the activation of every neuron in every FFN layer (all activations at every layer).
- For each neuron, suppress its activation (set it to zero) and measure how much the model’s probability of the correct answer (“Paris”) decreases.
- Neurons whose suppression causes a large decrease in the correct answer’s probability are identified as “knowledge neurons” for that fact.
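The suppression loop can be sketched on a toy FFN. In the real method the target is the correct answer's logit and the loop runs over every FFN layer of a pretrained model; everything here is illustrative:

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 8, 32
W1, W2 = torch.randn(d_ff, d_model), torch.randn(d_model, d_ff)
x = torch.randn(d_model)
target = 0                          # stand-in for the correct answer's logit

hidden = torch.relu(W1 @ x)
baseline = (W2 @ hidden)[target]

# Zero out each hidden unit in turn and score the drop in the target output.
drops = []
for i in range(d_ff):
    h = hidden.clone()
    h[i] = 0.0
    drops.append((baseline - (W2 @ h)[target]).item())

# The units whose removal hurts most are the knowledge-neuron candidates.
top5 = sorted(range(d_ff), key=lambda i: drops[i], reverse=True)[:5]
print(top5)
```

(Dai et al. actually use an integrated-gradients attribution rather than pure ablation, but the ablation view above captures the intuition.)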
Key Findings
Dai et al. found several striking results:
Knowledge is relatively localized. For a given fact, typically 5—20 neurons (out of millions) are responsible for the majority of the model’s ability to produce the correct answer. Suppressing these neurons can reduce the probability of the correct answer by 50—90%.
Knowledge neurons cluster in middle layers. For GPT-2 and BERT-based models, the knowledge neurons for factual associations are concentrated in the middle third of the network (roughly layers 8—16 in a 24-layer model). Early layers handle syntactic processing and late layers handle output formatting, so the middle layers are where “facts” live.
Different facts use different neurons. The neurons storing “the capital of France is Paris” are almost entirely disjoint from those storing “the capital of Japan is Tokyo.” The FFN has specialized different memory slots for different facts, exactly as the key-value memory hypothesis predicts.
Suppressing knowledge neurons erases facts. If you zero out the 10—20 knowledge neurons for “the capital of France is Paris,” the model can no longer answer that question correctly. But it can still answer “the capital of Japan is Tokyo” — the suppression is targeted. This is a form of surgical model editing: modifying specific facts without affecting the rest of the model’s knowledge.
Implications for Model Editing
The knowledge neuron framework opens the door to targeted model editing. If you want a model to “forget” a specific fact — perhaps for privacy reasons, or to correct outdated information — you can:
- Identify the knowledge neurons for that fact.
- Suppress or modify them (either by zeroing weights or by fine-tuning them to produce a different output).
This is far more efficient than retraining the entire model. Several subsequent papers (ROME, MEMIT, and others) have built on this insight to develop increasingly sophisticated model editing techniques.
While the knowledge neuron framework is compelling, it is important not to oversimplify. Knowledge is distributed across many neurons and multiple layers. Suppressing the top 20 neurons for a fact reduces the correct answer’s probability significantly, but rarely eliminates it entirely. The model has redundant storage — the same fact may be partially encoded in multiple locations. Editing one set of neurons can leave residual knowledge elsewhere that manifests in unexpected ways. Practical model editing remains an active research area.
The ROME and MEMIT Methods
Meng et al. (2022) extended the knowledge neuron work with ROME (Rank-One Model Editing), which modifies a single FFN layer to insert, delete, or update a specific fact. The key insight is that a rank-one update to $W_2$ — changing the value vector retrieved by a specific key pattern — can alter the associated output without affecting other key-value pairs in the same layer.
MEMIT (Mass-Editing Memory in a Transformer) scaled this to edit thousands of facts simultaneously by distributing modifications across multiple FFN layers. These methods formalize the FFN-as-memory hypothesis into a practical engineering tool: the FFN is not just metaphorically a memory — it can be read and written like one.
7. The Mixture of Experts Connection
From One Memory Bank to Many
If a single FFN layer is a key-value memory with $d_\text{ff}$ entries, a natural question arises: what if we want more entries without proportionally increasing the compute cost? The answer is Mixture of Experts (MoE), which replaces the single FFN with $E$ parallel “expert” FFNs and a learned router that decides which expert(s) to use for each token.

$$\mathrm{MoE}(x) = \sum_{e=1}^{E} g_e(x)\, \mathrm{FFN}_e(x)$$

where $g_e(x)$ is the routing weight for expert $e$, computed by a small router network. In practice, the router selects only the top-$k$ experts (typically $k = 1$ or $k = 2$), so most experts are not activated for any given token. This is what makes MoE efficient: the total FFN parameter count is $E$ times that of a dense FFN, but the per-token compute cost is only about $k$ times that of a single expert.
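A minimal top-k router makes the mechanism concrete. This is a didactic sketch with a per-token loop; real implementations batch tokens by expert:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k MoE layer: a linear router picks k experts per token,
    and their outputs are mixed with renormalized router weights."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model]
        logits = self.router(x)                     # [tokens, n_experts]
        weights, idx = logits.topk(self.k, dim=-1)  # keep only top-k experts
        weights = F.softmax(weights, dim=-1)        # renormalize over the k
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

moe = TopKMoE(d_model=16, d_ff=32, n_experts=8, k=2)
y = moe(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 16])
```

Only 2 of the 8 expert FFNs run for each token, which is the source of the parameter-count/compute-cost gap described above.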
Specialization Through the Lens of Memory
Under the FFN-as-memory hypothesis, MoE has a clean interpretation: instead of one memory bank with $d_\text{ff}$ entries, you have $E$ specialized memory banks, each with $d_\text{ff}$ entries, for a total of $E \cdot d_\text{ff}$ entries. The router learns which memory bank is relevant for each input.
Empirically, experts do specialize. In trained MoE models:
- Some experts specialize in punctuation and formatting.
- Others specialize in named entities and factual knowledge.
- Others specialize in code or mathematical expressions.
- Some handle common syntactic patterns, while others handle rare constructions.
This specialization emerges naturally from the routing mechanism. The router learns to send each token to the expert whose key patterns best match the input, which means each expert can devote its full capacity to a subset of the input distribution. This is far more parameter-efficient than a single dense FFN that must allocate memory slots across the entire input distribution.
Scale Examples
Mixtral 8x7B has 8 expert FFNs per layer, with 2 activated per token. Total parameters: ~47B. Active parameters per token: ~13B. This gives the model the knowledge capacity of a 47B-parameter model at the compute cost of a 13B-parameter model.
DeepSeek-V2 has 160 experts per layer with 6 activated per token. Total parameters: 236B. Active parameters per token: ~21B. The knowledge capacity is enormous — over 160 times the memory entries of a single FFN — but the per-token cost is manageable.
The next post in this series covers Mixture of Experts in full depth: the routing algorithms (top-k, expert choice, soft routing), the load balancing problem and auxiliary losses, the capacity factor, the interaction between MoE and tensor parallelism, and the performance characteristics of sparse computation on modern GPUs.
The progression from dense FFN to MoE is a direct consequence of the FFN-as-memory framing. Once you view the FFN as a memory bank, the natural scaling strategy is to add more banks (experts) rather than making one bank wider (larger $d_\text{ff}$). More banks with a router is more parameter-efficient than a wider bank, because the router avoids wasting compute on irrelevant memory entries.
8. FFN Performance Analysis
Compute Cost
Each linear layer in the FFN is a matrix multiplication. For a standard (non-gated) FFN with input dimension $d_\text{model}$ and hidden dimension $d_\text{ff}$, the compute cost per token is:

$$\mathrm{FLOPs}_{\text{standard}} = 2 \cdot d_\text{model} \cdot d_\text{ff} \cdot 2 = 4 \cdot d_\text{model} \cdot d_\text{ff}$$

The factor of 2 per matmul accounts for the multiply and accumulate operations (each output element requires $d_\text{model}$ multiplies and $d_\text{model}$ adds). The final factor of 2 counts both the up-projection and down-projection.
For a gated FFN (SwiGLU), there are three matrices instead of two:

$$\mathrm{FLOPs}_{\text{gated}} = 3 \cdot 2 \cdot d_\text{model} \cdot d_\text{ff} = 6 \cdot d_\text{model} \cdot d_\text{ff}$$

For Llama 2 70B ($d_\text{model} = 8192$, $d_\text{ff} = 28{,}672$):

$$6 \times 8192 \times 28{,}672 \approx 1.41 \text{ GFLOPs per token per layer}$$

For a batch of $B$ sequences and $S$ sequence positions:

$$\mathrm{FLOPs}_{\text{total}} = B \cdot S \cdot 6 \cdot d_\text{model} \cdot d_\text{ff}$$

Across all 80 layers, a single forward pass through the FFN blocks alone costs $80 \times 1.41 \approx 113$ GFLOPs per token. This dwarfs the attention computation at short sequence lengths.
FFN Compute Cost Per Token Per Layer
| Model | d_model | d_ff | FFN Type | GFLOPs/token/layer |
|---|---|---|---|---|
| GPT-2 (1.5B) | 1600 | 6400 | Standard | 0.041 |
| Llama 2 7B | 4096 | 11008 | SwiGLU | 0.271 |
| Llama 2 13B | 5120 | 13824 | SwiGLU | 0.425 |
| Llama 2 70B | 8192 | 28672 | SwiGLU | 1.41 |
| Llama 3 405B | 16384 | 53248 | SwiGLU | 5.24 |
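The per-layer figures follow directly from the matmul count; note that at these sizes the natural unit is GFLOPs per token per layer (the helper below is written for this post):

```python
def ffn_flops_per_token(d_model: int, d_ff: int, gated: bool = True) -> float:
    """FLOPs per token per layer: 2 * d_model * d_ff per matmul,
    with 3 matmuls for a gated FFN and 2 for a standard one."""
    n_matmuls = 3 if gated else 2
    return 2.0 * n_matmuls * d_model * d_ff

# GFLOPs per token per layer
print(ffn_flops_per_token(1600, 6400, gated=False) / 1e9)  # GPT-2:       ~0.041
print(ffn_flops_per_token(4096, 11008) / 1e9)              # Llama 2 7B:  ~0.271
print(ffn_flops_per_token(8192, 28672) / 1e9)              # Llama 2 70B: ~1.41
```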
Memory Traffic
During inference, the dominant cost of the FFN is not compute but memory bandwidth. The three weight matrices ($W_1$, $W_3$, $W_2$ for SwiGLU) must be loaded from HBM for every batch of tokens. The total weight data per FFN layer is:

$$\text{bytes per layer} = 3 \cdot d_\text{model} \cdot d_\text{ff} \cdot \text{(bytes per parameter)}$$

For Llama 2 70B in BF16 (2 bytes per parameter):

$$3 \times 8192 \times 28{,}672 \times 2 \text{ bytes} \approx 1.41 \text{ GB per layer}$$

On an H100 with 3.35 TB/s of HBM bandwidth, loading a single FFN layer takes:

$$\frac{1.41 \text{ GB}}{3.35 \text{ TB/s}} \approx 0.42 \text{ ms}$$

At batch size 1 (single-token decode), this is the entire cost. The compute (1.41 GFLOPs) takes roughly 0.0014 ms on an H100 at 990 TFLOPS — roughly 300x faster than loading the weights. The FFN is massively memory-bandwidth-bound at small batch sizes, exactly like the attention mechanism during decode.
Arithmetic Intensity and the Roofline
The arithmetic intensity (AI) of the FFN is the ratio of compute to memory traffic. For $B$ tokens per forward pass, with the BF16 weights loaded once:

$$\mathrm{AI} = \frac{B \cdot 6 \cdot d_\text{model} \cdot d_\text{ff}}{2 \cdot 3 \cdot d_\text{model} \cdot d_\text{ff}} = B \ \text{FLOPs/byte}$$
This is a remarkably clean result: the arithmetic intensity of the FFN is simply the batch size (in BF16). This means:
- At batch=1: AI = 1 FLOP/byte. Deeply memory-bound. The H100 has a compute-to-bandwidth ratio of approximately 295 FLOPs/byte (990 TFLOPS / 3.35 TB/s), so the GPU is at roughly 0.3% compute utilization.
- At batch=32: AI = 32. Still memory-bound, but utilization improves to about 11%.
- At batch=295: AI = 295. The FFN reaches the roofline — the balance point where compute and memory bandwidth are equally saturated.
- At batch=1024: AI = 1024. Compute-bound. Now the GPU is fully utilized and memory bandwidth is not the bottleneck.
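Since the arithmetic intensity equals the batch size here, utilization is just batch size divided by the GPU's ridge point (H100 figures as quoted in the text):

```python
def ffn_utilization(batch: int, peak_tflops: float = 990.0,
                    hbm_tbs: float = 3.35) -> float:
    """Fraction of peak compute reachable when FFN weight loading is the
    only memory traffic (BF16, so AI == batch size in FLOP/byte)."""
    ridge = peak_tflops / hbm_tbs        # ~295 FLOP/byte on an H100
    return min(1.0, batch / ridge)

for b in (1, 32, 295, 1024):
    print(f"batch={b:5d}  utilization={ffn_utilization(b):6.1%}")
```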
FFN Compute Utilization vs. Batch Size (H100)
The Decode Bottleneck
During autoregressive decoding (generating one token at a time), the effective batch size for the FFN is the number of sequences being generated in parallel. At batch=1, the H100 spends 0.42 ms loading 1.41 GB of FFN weights to perform a computation that takes 0.0014 ms. Over 99.6% of the time is spent waiting for memory.
This is identical to the attention decode bottleneck described in Part 7 (speculative decoding). In fact, during decode, the FFN and the attention mechanism are both memory-bandwidth-bound, and their costs are additive. For Llama 2 70B at batch=1:
- FFN weight loading per layer: 1.41 GB, 0.42 ms
- Attention weight loading per layer: ~0.30 GB, 0.09 ms
- KV cache reading per layer: varies with context length
The FFN accounts for roughly 80% of the weight-loading cost per layer, matching its 82% share of the parameters. This is not a coincidence — when you are memory-bandwidth-bound, the cost is proportional to the number of bytes loaded, which is proportional to the number of parameters.
Per-Layer Weight Loading: Llama 2 70B (BF16)
Decode phase, batch=1. All components are memory-bandwidth-bound.
[Figure: per-layer weight bytes at batch=1 — the three FFN matrices at 448 MB each, attention weights at ~286 MB.]
Prefill vs. Decode: A Different Story
During the prefill phase (processing the input prompt), all tokens are processed simultaneously. The effective batch size is $B \times S$ (batch size times prompt length). For a batch of 8 requests with 2048-token prompts, the effective batch size is 16,384 — far above the roofline threshold. In this regime, the FFN is compute-bound, and the GPU achieves near-peak utilization.
This is why prefill throughput scales almost linearly with GPU FLOPs, while decode throughput scales with memory bandwidth. Optimization strategies differ accordingly:
- Prefill: maximize FLOPs (use larger GPUs, optimize matmul kernels, use FP8 for higher throughput).
- Decode: maximize memory bandwidth (use more GPUs to parallelize weight loading, quantize weights to reduce bytes, increase batch size to amortize loading cost).
FFN Performance Regime by Phase
| Phase | Effective Batch | Arithmetic Intensity | Bottleneck | Optimization |
|---|---|---|---|---|
| Decode (batch=1) | 1 | 1 FLOP/byte | Memory BW | Weight quantization, batching |
| Decode (batch=32) | 32 | 32 FLOP/byte | Memory BW | Batching, TP parallelism |
| Decode (batch=256) | 256 | 256 FLOP/byte | Near balanced | Balanced optimization |
| Prefill (B=8, S=2048) | 16,384 | 16,384 FLOP/byte | Compute | FP8, kernel optimization |
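The arithmetic-intensity column falls out of the matmul shapes: a weight matrix of N parameters stored in BF16 costs 2N bytes to load and 2·b·N FLOPs to apply at batch b, so intensity ≈ b FLOP/byte regardless of N. A minimal sketch, assuming an H100-class roofline (989 TFLOP/s BF16 over 3.35 TB/s — both are assumed hardware figures):

```python
def ffn_intensity(batch: int) -> float:
    """Arithmetic intensity (FLOP/byte) of a BF16 weight matmul at a given batch."""
    n_params = 1               # cancels out: FLOPs = 2*b*N, bytes = 2*N
    flops = 2 * batch * n_params
    bytes_moved = 2 * n_params
    return flops / bytes_moved

# Assumed H100-class roofline ridge point: peak FLOP/s over memory bandwidth.
ridge = 989e12 / 3.35e12       # ~295 FLOP/byte

for b in (1, 32, 256, 16384):
    regime = "compute-bound" if ffn_intensity(b) > ridge else "memory-bound"
    print(f"batch={b:>5}: {ffn_intensity(b):>7.0f} FLOP/byte -> {regime}")
```

Note that batch=256 sits just below the assumed ridge point, which is why the table calls it "near balanced": small changes in hardware specs or batch size flip the regime.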
9. Putting It All Together
The FFN in the Residual Stream
To understand the FFN’s role in the full transformer, recall the residual stream architecture. Each transformer layer computes:

x ← x + Attn(RMSNorm(x))
x ← x + FFN(RMSNorm(x))
The attention layer reads from the residual stream, computes cross-token interactions, and writes back. Then the FFN reads from the updated residual stream, processes each token independently through its key-value memory, and writes back. The residual connections ensure that information from earlier layers is preserved — the FFN adds to the existing representation rather than replacing it.
This means the FFN at layer ℓ operates on a representation that already incorporates the outputs of all previous attention and FFN layers. By the middle layers, each token’s representation in the residual stream contains rich contextual information: it knows not just what word it is, but what words surround it, what syntactic role it plays, what the topic of the passage is. This context-enriched representation is what the FFN’s keys match against, which is why the FFN can detect high-level patterns like “this token appears in the context of a question about European geography” and retrieve the appropriate factual information.
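The read-transform-write pattern can be sketched as a pre-norm layer in NumPy. Everything here is illustrative — the toy shapes, the random weights, and the zero stand-in for the attention sublayer are all placeholder assumptions — but the residual-stream structure is the point: the FFN adds to x rather than replacing it, and operates on each position independently.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale, for brevity.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Gated FFN: silu(x @ W_gate) * (x @ W_up), projected back down.
    gate = x @ w_gate
    return (gate / (1 + np.exp(-gate)) * (x @ w_up)) @ w_down

d_model, d_ff, seq = 16, 43, 4     # d_ff ~ (8/3) * d_model, toy sizes
rng = np.random.default_rng(0)
w_gate, w_up = rng.normal(size=(2, d_model, d_ff)) * 0.02
w_down = rng.normal(size=(d_ff, d_model)) * 0.02

x = rng.normal(size=(seq, d_model))          # the residual stream
x = x + np.zeros_like(x)                     # x += Attn(RMSNorm(x)), stand-in
x = x + swiglu_ffn(rms_norm(x), w_gate, w_up, w_down)  # x += FFN(RMSNorm(x))
print(x.shape)                               # residual stream keeps its shape
```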
Design Decisions for Modern FFNs
If you are designing or fine-tuning a transformer today, the FFN decisions are largely settled:
- Activation function: SwiGLU. There is no compelling reason to use anything else for a new model.
- Expansion ratio: d_ff ≈ (8/3) · d_model for SwiGLU (so the three matrices match the parameter count of a classic 4× two-matrix FFN), rounded to a multiple of 256 for GPU efficiency. Many practitioners use a slightly larger ratio for additional capacity.
- Biases: No biases (b₁ = b₂ = 0). Removing biases saves a small number of parameters and does not measurably affect quality. All Llama models omit biases.
- Normalization: RMSNorm (not LayerNorm) applied before the FFN, as part of the pre-norm residual architecture. Pre-norm is more stable than post-norm for deep networks.
- Parallelism: For large models, the FFN weights are partitioned across GPUs using tensor parallelism. The gate- and up-projections are column-partitioned; the down-projection is row-partitioned. Each GPU computes a shard of the hidden dimension, and an all-reduce synchronizes the output.
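The partitioning scheme can be checked numerically: with the up-projection split by columns and the down-projection split by rows, each "GPU" computes a partial output over its shard of the hidden dimension, and summing the partials (the all-reduce) reproduces the unsharded result exactly. A toy NumPy sketch, with ReLU standing in for the gated activation (the elementwise property that makes the sharding exact holds for SwiGLU too, since gate and up shard identically):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff, tp = 8, 12, 4    # tp = tensor-parallel degree; toy sizes

w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(3, d_model))

def act(h):
    return np.maximum(h, 0.0)   # elementwise, so it commutes with column sharding

# Unsharded reference.
ref = act(x @ w_up) @ w_down

# Sharded: column-split w_up, row-split w_down, then sum partials (all-reduce).
shard = d_ff // tp
partials = [
    act(x @ w_up[:, i*shard:(i+1)*shard]) @ w_down[i*shard:(i+1)*shard, :]
    for i in range(tp)
]
out = np.sum(partials, axis=0)  # the all-reduce
print(np.allclose(out, ref))    # True
```

This is why tensor parallelism needs only one all-reduce per FFN: the nonlinearity is applied entirely within each shard, so no communication happens between the two matmuls.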
The Capacity Question
How many facts can a transformer’s FFN blocks store? This is an open and fascinating question. A rough upper bound comes from the key-value memory interpretation: each FFN layer has d_ff memory slots, and there are L layers, giving L · d_ff total slots.
For Llama 2 70B: 80 × 28,672 ≈ 2.3 million slots. But this is a loose upper bound. Each “fact” may require multiple neurons (Dai et al. found 5–20 neurons per fact). Knowledge is redundantly stored across layers. And the same neurons may participate in multiple facts through superposition — a topic of active research in mechanistic interpretability.
A practical estimate: large language models appear to store on the order of millions to tens of millions of distinct factual associations, which is consistent with their impressive but imperfect recall of world knowledge. The FFN is the primary substrate for this knowledge, and understanding its structure is the key to understanding what the model knows and how to modify it.
Conclusion
The feed-forward network is the workhorse of the transformer. It accounts for two-thirds of the parameters, the majority of the compute during prefill, and the majority of the memory bandwidth consumption during decode. It is where factual knowledge is stored, organized as a soft key-value memory with tens of thousands of entries per layer.
The evolution from ReLU to GELU to SwiGLU reflects a broader trend in deep learning: smooth, gated nonlinearities outperform simple thresholding functions, especially in deep networks where gradient flow is critical. SwiGLU’s gating mechanism is not just a minor improvement — it fundamentally changes the FFN’s computational structure by allowing the network to independently control what to compute and whether to compute it.
The FFN-as-memory hypothesis, supported by the knowledge neuron literature and practical model editing techniques like ROME and MEMIT, gives us a concrete framework for understanding what lives inside those 56 billion parameters. Knowledge is not a diffuse, inscrutable pattern spread across the model — it is stored as identifiable key-value pairs in specific FFN layers, concentrated in the middle of the network.
And the path from dense FFNs to Mixture of Experts is a natural one: if the FFN is a memory bank, scale it by adding more banks. MoE replaces a single FFN with many specialized experts, each handling a subset of the input distribution. This is the subject of the next post in this series.
For the systems engineer, the key takeaway is that the FFN’s performance characteristics are simple and predictable. Arithmetic intensity scales linearly with batch size. At batch=1 decode, the FFN is memory-bandwidth-bound with under 1% compute utilization. At prefill-scale batch sizes, it is compute-bound with near-peak utilization. Every optimization strategy — quantization, batching, tensor parallelism, speculative decoding — must account for which regime the FFN is operating in.