You have now read 14 posts covering every component of a modern transformer in isolation. This capstone post does what none of the individual posts could: it assembles every piece into a single, annotated forward pass through a production-scale LLM. By the end, you will understand exactly what happens to a string of text as it enters a Llama-3-class model and emerges as a probability distribution over the next token — every matrix multiply, every normalization, every memory access.
We will use concrete numbers throughout: a Llama 3 70B-class model with d_model = 8192, 64 query heads, 8 KV heads (GQA), d_ff = 28,672 (SwiGLU), a 128,256-token vocabulary, and 80 layers.
The Complete Forward Pass
Step 1: Tokenization (Series Part 2)
Input: a raw string like "The quick brown fox".
The tokenizer (BPE, Part 2) converts this to a sequence of integer IDs:
token_ids = tokenizer.encode("The quick brown fox")
# Result: [791, 4996, 14198, 39935] (4 tokens)
# Shape: [seq_len] = [4]
For a batch of sequences, we pad to the longest and get shape [B, S]. The tokenizer is CPU-bound and typically takes microseconds — negligible compared to the model forward pass.
Step 2: Token Embedding (Series Part 3)
The embedding table maps each token ID to a dense vector:
# E shape: [128256, 8192] (1.05B parameters, 2.1 GB in FP16)
# Input: [B, S] of integer IDs
# Output: [B, S, 8192]
h = embedding_table[token_ids] # Simple lookup, no computation
In Llama 3, the same matrix E is reused as the output projection (unembedding). This saves 2.1 GB of parameters. The embedding lookup at the start and the linear projection at the end share the same weights.
Step 3: The Transformer Block Loop (80 layers)
Each of the 80 layers applies the same pattern: Norm → Attention → Residual → Norm → FFN → Residual.
for layer in range(80):
    # === Attention sublayer ===
    h_norm = rms_norm(h)                     # [B, S, 8192] -> [B, S, 8192]
    attn_out = gqa_attention(h_norm, layer)  # [B, S, 8192] -> [B, S, 8192]
    h = h + attn_out                         # Residual connection

    # === FFN sublayer ===
    h_norm = rms_norm(h)                     # [B, S, 8192] -> [B, S, 8192]
    ffn_out = swiglu_ffn(h_norm, layer)      # [B, S, 8192] -> [B, S, 8192]
    h = h + ffn_out                          # Residual connection
Let’s trace each operation in detail.
3a: RMSNorm (Series Part 7)
# Input: [B, S, 8192]
# Params: gamma [8192] (8192 parameters per norm, 2 norms per layer)
# Output: [B, S, 8192]
# Cost: O(B * S * d) -- negligible compared to attention/FFN
No mean subtraction (unlike LayerNorm), no bias term. Just RMS scaling with a learned per-dimension gain gamma. Each layer has two RMSNorm instances (one before attention, one before FFN) = 16,384 parameters per layer. Across 80 layers: 1.3M parameters total — negligible.
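The operation above fits in a few lines. Here is a minimal NumPy sketch (the eps value and FP32 upcast are conventional choices, not pulled from the Llama 3 source):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: scale by reciprocal RMS. No mean subtraction, no bias."""
    # Compute in FP32 even if activations arrive in FP16 (see the
    # mixed-precision bug in the implementation checklist later).
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 ** 2, axis=-1, keepdims=True) + eps)
    return (x32 / rms) * gamma

x = np.random.randn(2, 4, 8192)   # [B, S, d_model]
gamma = np.ones(8192)             # learned per-dimension gain, init to 1
out = rms_norm(x, gamma)
print(out.shape)                  # (2, 4, 8192)
```

With gamma at its initialization of all ones, every output vector has RMS ≈ 1 along the model dimension.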
3b: GQA Attention (Series Parts 1, 4, 5, 6)
This is where most of the complexity lives. Llama 3 uses Grouped-Query Attention with 64 query heads and 8 KV heads (each query group of 8 heads shares one KV head):
# Projections
Q = h_norm @ W_Q # [B,S,8192] @ [8192, 8192] -> [B,S,8192] = [B,S,64,128]
K = h_norm @ W_K # [B,S,8192] @ [8192, 1024] -> [B,S,1024] = [B,S,8,128]
V = h_norm @ W_V # [B,S,8192] @ [8192, 1024] -> [B,S,1024] = [B,S,8,128]
# Apply RoPE to Q and K (Series Part 4)
Q, K = apply_rope(Q, K, positions)
# Attention with GQA: each of 64 Q heads maps to one of 8 KV heads
# Effective: 8 groups of 8 query heads sharing 1 KV head
# Score: Q_head @ K_group.T / sqrt(128), with causal mask
# Output: softmax(scores) @ V_group
O = attn_heads @ W_O # concatenated head outputs: [B,S,8192] @ [8192, 8192] -> [B,S,8192]
Attention parameters per layer:
- W_Q: 8192 × 8192 = 67.1M
- W_K: 8192 × 1024 = 8.4M (reduced by GQA 8x)
- W_V: 8192 × 1024 = 8.4M (reduced by GQA 8x)
- W_O: 8192 × 8192 = 67.1M
- Total attention per layer: 151M parameters
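The GQA computation sketched in pseudocode above can be written concretely. This is a NumPy toy — materializing the full score matrix, unlike the FlashAttention kernels used in production — and the function name and shapes are illustrative:

```python
import numpy as np

def gqa_attention(Q, K, V):
    """Grouped-query attention: 64 query heads, 8 KV heads; each KV head
    serves a group of 64/8 = 8 query heads. Includes the causal mask."""
    B, S, Hq, Dh = Q.shape            # e.g. [B, S, 64, 128]
    Hkv = K.shape[2]                  # 8 KV heads
    reps = Hq // Hkv                  # 8 query heads per KV head
    # Broadcast each KV head across its query group
    K = np.repeat(K, reps, axis=2)    # [B, S, 64, 128]
    V = np.repeat(V, reps, axis=2)
    scores = np.einsum('bqhd,bkhd->bhqk', Q, K) / np.sqrt(Dh)
    mask = np.triu(np.ones((S, S), dtype=bool), k=1)  # future positions
    scores = np.where(mask, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    out = np.einsum('bhqk,bkhd->bqhd', w, V)
    return out.reshape(B, S, Hq * Dh)                 # [B, S, 8192]

Q = np.random.randn(1, 3, 64, 128)
K = np.random.randn(1, 3, 8, 128)
V = np.random.randn(1, 3, 8, 128)
print(gqa_attention(Q, K, V).shape)   # (1, 3, 8192)
```

Note that `np.repeat` here spends memory to keep the code obvious; real kernels index into the shared KV heads instead of copying them.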
RoPE (Part 4) rotates Q and K by position-dependent angles. This encodes relative position without any learned parameters — just trigonometric functions applied to pairs of dimensions.
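The rotation itself is short enough to show in full. A NumPy sketch, assuming the interleaved even/odd dimension pairing (the base defaults to Llama 3's 500K; classic RoPE uses 10,000):

```python
import numpy as np

def apply_rope(x, positions, base=500000.0):
    """Rotate dimension pairs of x [B, S, H, Dh] by position-dependent
    angles. No learned parameters, just trigonometry."""
    Dh = x.shape[-1]
    inv_freq = base ** (-np.arange(0, Dh, 2) / Dh)   # [Dh/2] frequencies
    angles = positions[:, None] * inv_freq[None, :]  # [S, Dh/2]
    cos = np.cos(angles)[None, :, None, :]           # broadcast over B, H
    sin = np.sin(angles)[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even/odd pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.randn(1, 4, 2, 8)                      # toy [B, S, H, Dh]
assert np.allclose(apply_rope(x, np.zeros(4)), x)    # position 0: identity
```

Because each pair undergoes a pure rotation, vector norms are preserved — RoPE changes direction, not magnitude.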
Online softmax (Part 5) computes the attention weights without materializing the full matrix when using FlashAttention. The log-sum-exp trick prevents numerical overflow.
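The max-subtraction at the heart of that trick is worth seeing on its own (this is the stable softmax identity, not the full streaming online-softmax recurrence):

```python
import numpy as np

def stable_softmax(scores):
    """Subtract the row max before exponentiating so exp() never overflows.
    Mathematically identical to naive softmax: the max cancels out."""
    m = scores.max(axis=-1, keepdims=True)
    e = np.exp(scores - m)
    return e / e.sum(axis=-1, keepdims=True)

row = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow
print(stable_softmax(row))                # finite, sums to 1
```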
KV Cache (covered in Inference Timeline): During autoregressive generation, K and V for previous tokens are cached. Per token per layer: 2 × 1024 × 2 = 4,096 bytes (FP16). Across 80 layers: 327 KB per token.
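The arithmetic is worth verifying, since it drives the memory audit later. A quick sanity check in Python (the batch/sequence extrapolation uses GiB; decimal vs. binary units account for small differences from the figures quoted elsewhere):

```python
# KV cache arithmetic for Llama 3 70B in FP16 (2 bytes per value)
n_layers, n_kv_heads, head_dim = 80, 8, 128
kv_dim = n_kv_heads * head_dim                    # 1024
per_token_per_layer = 2 * kv_dim * 2              # K and V, FP16 -> 4096 B
per_token = n_layers * per_token_per_layer        # 327,680 B ~= 327 KB
batch, seq = 32, 4096
total_gib = batch * seq * per_token / 2**30
print(per_token, round(total_gib, 1))             # 327680 40.0
```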
3c: Residual Connection (Series Part 8)
h = h + attn_out # Element-wise addition, [B, S, 8192]
Simple addition. This is the “residual stream” — the persistent vector that carries information through the network. Without it, gradients would vanish through 80 layers.
3d: SwiGLU FFN (Series Part 9)
# Three weight matrices per FFN layer:
gate = silu(h_norm @ W_1) # [B,S,8192] @ [8192, 28672] -> [B,S,28672]
up = h_norm @ W_3 # [B,S,8192] @ [8192, 28672] -> [B,S,28672]
hidden = gate * up # Element-wise multiply [B,S,28672]
out = hidden @ W_2 # [B,S,28672] @ [28672, 8192] -> [B,S,8192]
FFN parameters per layer:
- W_1: 8192 × 28,672 = 234.9M
- W_3: 8192 × 28,672 = 234.9M
- W_2: 28,672 × 8192 = 234.9M
- Total FFN per layer: 704.6M parameters
The FFN is 82% of each layer’s parameters. This is where factual knowledge is stored (Part 9: the FFN-as-key-value-memory hypothesis).
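The three-matrix pattern above is easy to get wrong, so here it is as a runnable NumPy sketch (tiny toy dimensions; the real model uses 8192 → 28,672 → 8192):

```python
import numpy as np

def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(h, W1, W2, W3):
    """SwiGLU FFN: a silu-activated gate path elementwise-multiplies the
    linear up path, then W2 projects back down to d_model."""
    return (silu(h @ W1) * (h @ W3)) @ W2

d, d_ff = 8, 32                          # toy stand-ins for 8192 / 28672
h = np.random.randn(2, 3, d)
W1, W3 = np.random.randn(d, d_ff), np.random.randn(d, d_ff)
W2 = np.random.randn(d_ff, d)
print(swiglu_ffn(h, W1, W2, W3).shape)   # (2, 3, 8)
```

The gate path is the only nonlinearity; the up path W3 stays linear so the product can selectively pass or suppress each hidden dimension.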
Step 4: Final Normalization
h = rms_norm(h) # [B, S, 8192] -> [B, S, 8192]
One final RMSNorm before the output projection.
Step 5: Output Projection / Unembedding (Series Part 11)
# With weight tying: reuse the embedding matrix
logits = h @ E.T # [B, S, 8192] @ [8192, 128256] -> [B, S, 128256]
This produces raw logits over the entire 128K vocabulary for each position. For autoregressive generation, we only need the logits at the last position: logits[:, -1, :] of shape [B, 128256].
Step 6: Sampling (Inference Timeline Series)
# Temperature scaling + top-p sampling
probs = softmax(logits[:, -1, :] / temperature) # [B, 128256]
next_token = top_p_sample(probs, p=0.95) # [B]
The softmax (Part 5) converts logits to probabilities. Top-p sampling (covered in the Inference Timeline) selects from the smallest set of tokens covering 95% of probability mass.
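One way top-p selection can be implemented for a single sequence — a sketch, not the exact algorithm from any particular serving stack (the function name and tie-handling details are this post's own):

```python
import numpy as np

def top_p_sample(probs, p=0.95, rng=None):
    """Nucleus sampling: sample from the smallest set of tokens whose
    cumulative probability reaches p, renormalized within that set."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]              # token ids, descending prob
    sorted_p = probs[order]
    cutoff = int(np.searchsorted(np.cumsum(sorted_p), p)) + 1
    cutoff = min(cutoff, probs.size)             # guard against float drift
    keep = sorted_p[:cutoff] / sorted_p[:cutoff].sum()
    return int(order[rng.choice(cutoff, p=keep)])

probs = np.array([0.90, 0.05, 0.03, 0.02])
print(top_p_sample(probs, p=0.5))                # always 0: nucleus = {token 0}
```

With p=0.5 the nucleus is a single token, so sampling is deterministic; raising p widens the set and restores diversity.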
Parameter Count Audit
Llama 3 70B Parameter Breakdown
| Component | Params per Layer | Total (80 layers) | % of Model |
|---|---|---|---|
| Token Embedding (E) | N/A (shared) | 1.05B | 1.5% |
| RMSNorm (2 per layer) | 16.4K | 1.3M | 0.002% |
| Attention (Q+K+V+O) | 151M | 12.1B | 17.3% |
| FFN (W1+W2+W3 SwiGLU) | 704.6M | 56.4B | 80.8% |
| Output Head (tied with E) | 0 (tied) | 0 | 0% |
| TOTAL | 856M/layer | ~69.5B | 100% |
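You can re-derive every row of this table from the architecture numbers alone. A short audit script (matches the per-layer figures; the grand total lands at ~69.5B with these exact dimensions):

```python
# Re-derive the Llama 3 70B parameter audit from first principles
d, d_ff, vocab, n_layers = 8192, 28672, 128256, 80
kv_dim = 8 * 128                            # 8 KV heads x head_dim 128
attn = 2 * d * d + 2 * d * kv_dim           # W_Q, W_O full; W_K, W_V via GQA
ffn = 3 * d * d_ff                          # W_1, W_2, W_3
norms = 2 * d                               # two RMSNorm gammas per layer
per_layer = attn + ffn + norms
embed = vocab * d                           # tied with the output head
total = n_layers * per_layer + embed + d    # + final RMSNorm
print(f"{attn/1e6:.0f}M attn, {ffn/1e6:.1f}M ffn, {total/1e9:.1f}B total")
# 151M attn, 704.6M ffn, 69.5B total
```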
The FFN dominates at 80.8% of all parameters. Attention is only 17.3%. This is why MoE (Part 10) replaces the FFN with multiple expert FFNs — it’s the component with the most parameters to distribute.
Where Parameters Live in Llama 3 70B (% of parameters)
Memory Audit
Memory Requirements for Llama 3 70B
| Component | FP16 | INT8 | INT4 |
|---|---|---|---|
| Model Weights | 140 GB | 70 GB | 35 GB |
| KV Cache (B=32, S=4096) | 40.8 GB | 20.4 GB | 10.2 GB |
| Activations (per layer peak) | ~2 GB | ~2 GB | ~2 GB |
| Total Inference Memory | ~183 GB | ~92 GB | ~47 GB |
| Fits on 1x H100 80GB? | No | Yes (tight) | Yes (comfortable) |
At batch=32 and sequence length 4096, the KV cache is 40.8 GB — almost as much as the INT4 model weights. At 8192 tokens, it doubles to 81.6 GB. KV cache, not model weights, is the binding memory constraint for long-context serving. This is why KV cache quantization (Inference Timeline Part 2), PagedAttention (Part 5), and prefix caching (Part 8) are critical.
The Design Space: What Llama 3 Architects Chose (and Why)
Every component in the model represents a design decision. Here’s what the Llama 3 team chose at each point, and why:
Llama 3 70B Architecture Decisions
| Decision | Choice | Alternatives | Why This Choice |
|---|---|---|---|
| Attention | GQA-8 | MHA, MQA, MLA | GQA-8 balances quality and KV cache size. MQA too aggressive, MHA too expensive. |
| FFN | SwiGLU (8/3 ratio) | ReLU, GELU MLP | SwiGLU gives ~1% perplexity improvement. The 8/3 expansion keeps the three-matrix FFN at roughly the parameter count of a 4x two-matrix MLP. |
| Normalization | RMSNorm (Pre-Norm) | LayerNorm, Post-Norm | RMSNorm is 10-15% faster, Pre-Norm stabilizes deep training. |
| Position | RoPE (base 500K) | Learned, ALiBi | RoPE enables context extension. High base for 128K context training. |
| Vocab Size | 128,256 | 32K, 50K, 100K | 128K improves multilingual compression 15-20%. Embedding cost is 1.5% of model. |
| Architecture | Dense | MoE | Meta can afford the training compute. Dense is simpler to serve. |
| Weight Tying | Yes | Separate head | Saves 1.05B params (2.1 GB). No quality cost for large models. |
| Embedding Dim | 8192 | 4096, 12288 | Scaling law optimal for 70B parameters. |
Architecture design is not about finding the “best” component — it’s about finding the right tradeoff for your constraints. DeepSeek V3 chose MoE + MLA because they optimized for training efficiency. Meta chose dense + GQA because they optimized for serving simplicity. Both are correct for their context.
Where Innovation Happens
After 14 posts of analysis, we can map the transformer design space into settled components and active frontiers:
Settled (unlikely to change fundamentally):
- Residual connections (Part 8) — the identity skip path is non-negotiable
- Pre-Norm placement (Part 7) — Post-Norm is dead for new architectures
- Weight tying (Part 11) — free memory savings, no quality cost
- BPE tokenization (Part 2) — may evolve to byte-level but BPE principle stays
- Cross-entropy loss (Part 12) — pre-training objective is stable
Active frontiers (changing rapidly):
- Attention mechanism (Parts 1, 6) — MLA may replace GQA, linear attention and SSMs (Mamba) challenge the attention paradigm entirely
- Position encoding (Part 4) — NoPE, dynamic encodings, context extension beyond 1M tokens
- Expert routing (Part 10) — loss-free balancing, fine-grained experts, expert parallelism optimization
- FFN variants (Part 9) — MoE replaces dense FFN at scale, knowledge neuron editing
- Training objectives (Part 12) — multi-token prediction, DPO/RLHF post-training, inference-time compute scaling
Implementation Checklist
If you’re implementing a transformer from this series, here are the most common bugs and their fixes:
Common Implementation Bugs
| Bug | Symptom | Fix | Series Reference |
|---|---|---|---|
| Missing causal mask | Model sees future tokens, perfect loss on training | Apply causal mask before softmax: set future positions to -inf | Part 1 |
| Wrong RoPE frequencies | Model fails at long sequences | Use theta = 10000^(-2i/d) with correct dim pairing | Part 4 |
| Softmax overflow | NaN in attention weights | Subtract max before exp (log-sum-exp trick) | Part 5 |
| Post-Norm instead of Pre-Norm | Training diverges at 60+ layers | Normalize BEFORE sublayer, not after | Part 7 |
| Missing residual scaling | Activation magnitude explodes in deep models | Scale output projection by 1/sqrt(2L) | Part 8 |
| SwiGLU wrong expansion | Parameter count mismatch | Use d_ff = 8/3 * d_model (not 4x) to compensate for 3 matrices | Part 9 |
| FP16 norm computation | Gradient instability in mixed precision | Keep RMSNorm in FP32 even when forward pass is FP16/FP8 | Part 7 |
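The top rows of this table — causal masking and softmax stability — can be checked together in a few lines. A minimal NumPy demonstration of both fixes:

```python
import numpy as np

# Fixes for the first and third bugs above: causal mask + max subtraction
S = 5
scores = np.random.randn(S, S)
mask = np.triu(np.ones((S, S), dtype=bool), k=1)  # True = future position
scores = np.where(mask, -np.inf, scores)          # -inf -> zero weight
scores -= scores.max(axis=-1, keepdims=True)      # log-sum-exp stabilization
w = np.exp(scores)
w /= w.sum(axis=-1, keepdims=True)
print(np.allclose(np.triu(w, k=1), 0.0))          # True: no future leakage
```

If the mask were applied after softmax instead of before, the remaining weights would not sum to 1 — masking the raw scores with -inf is what makes the renormalization automatic.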
What You Now Know
If you’ve read all 15 parts of this series, you understand:
- How text becomes tokens and why subword tokenization exists (Part 2)
- How tokens become vectors through embedding lookup (Part 3)
- How position is encoded via RoPE rotation (Part 4)
- How attention works — the QKV mechanism, multi-head decomposition, and O(n^2) cost (Part 1)
- How softmax creates a distribution and why numerical stability matters (Part 5)
- Why GQA/MLA reduce memory and how to choose between attention variants (Part 6)
- Why normalization enables depth and why RMSNorm replaced LayerNorm (Part 7)
- Why residual connections are essential and the residual stream mental model (Part 8)
- How the FFN stores knowledge and why SwiGLU’s gating mechanism helps (Part 9)
- How MoE scales parameters without proportional compute cost (Part 10)
- How the output head produces probabilities and why weight tying works (Part 11)
- How cross-entropy shapes learning and why models hallucinate (Part 12)
- Why decoder-only won over encoder and encoder-decoder architectures (Part 13)
- How DeepSeek V3 pushes the frontier with MLA, fine-grained MoE, and FP8 training (Part 14)
You have the foundation to read any LLM paper, understand any serving system, and identify where the next breakthrough might come from. The transformer is not magic — it’s engineering. Every component has a reason, every design choice has a tradeoff, and every tradeoff creates an opportunity for innovation.
This series covered the architecture. The companion Inference Optimization Timeline series covers what happens when you try to make these models fast and cheap in production — from KV cache management to speculative decoding to disaggregated serving. The two series together give you the complete picture: how transformers work and how to deploy them at scale.