A single Llama-7B forward pass spends 32ms total across 32 transformer layers, but the time is not evenly distributed: 46% goes to the FFN, 28% to attention, 18% to projections, and under 5% to normalization and residuals. For long contexts (seq > 2048), attention’s O(N²) complexity makes it the bottleneck. For short contexts (seq < 512), the FFN dominates because its 3 weight matrices in SwiGLU perform more FLOPs than attention. Understanding where time goes — and how it shifts with workload — is essential for optimization.

Component-Level Performance Breakdown


Time per Transformer Layer by Component (Llama-7B, A100, FP16, batch=1, seq=512)

| Component | FLOPs | Time (µs) | Share | Bound |
|---|---|---|---|---|
| QKV projection | 3 × 2·B·L·d² | 180 | 18% | Compute |
| Attention (QK^T + softmax + AV) | 2 × 2·B·L²·d_h·H | 220 | 22% | Compute/Memory |
| Output projection | 2·B·L·d² | 65 | 6% | Compute |
| FFN gate + up (SwiGLU) | 2 × 2·B·L·d·d_ff | 310 | 31% | Compute |
| FFN down | 2·B·L·d_ff·d | 155 | 15% | Compute |
| RMSNorm (×2) | 2·B·L·d | 25 | 2.5% | Memory BW |
| Residual add (×2) | 2·B·L·d | 15 | 1.5% | Memory BW |
| Other (rotary embeddings, etc.) | -- | 30 | 3% | -- |

Note: d = 4096, d_ff = 11008, H = 32, d_h = 128. Total ≈ 1000 µs per layer; 32 layers ≈ 32 ms per forward pass.
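These formulas are easy to sanity-check in a few lines. The sketch below recomputes each component's FLOPs for the table's configuration (the dictionary keys are our own labels, not any profiler's API). Note that FLOP shares and time shares diverge sharply: attention proper is only ~2% of the FLOPs but ~22% of the time, which is exactly why the table marks it Compute/Memory bound.

```python
# Per-layer FLOP counts for Llama-7B, using the formulas from the table.
# Illustrative arithmetic only; names are ours, not a profiler API.

B, L = 1, 512            # batch size, sequence length
d, d_ff = 4096, 11008    # model dim, FFN hidden dim
H, d_h = 32, 128         # attention heads, head dim (H * d_h == d)

flops = {
    "qkv_proj":    3 * 2 * B * L * d * d,        # three d x d GEMMs
    "attention":   2 * 2 * B * L * L * d_h * H,  # QK^T scores + AV
    "out_proj":    2 * B * L * d * d,
    "ffn_gate_up": 2 * 2 * B * L * d * d_ff,     # SwiGLU gate + up
    "ffn_down":    2 * B * L * d_ff * d,
}

total = sum(flops.values())
for name, f in flops.items():
    print(f"{name:12s} {f / 1e9:6.1f} GFLOP  {100 * f / total:5.1f}%")
```

Run at these shapes, the two FFN matmuls are ~65% of FLOPs while the attention matmuls are ~2%: FLOP share tracks parameter share, not time share.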

The FFN dominates (46% of layer time) because LLaMA uses SwiGLU, which has 3 weight matrices in the FFN instead of the original transformer’s 2. Attention is second at 28% (22% for the attention computation itself plus 6% for the output projection). Normalization and residual connections are negligible (under 5%).

Compute Distribution Within a Transformer Layer

(% of layer time)

- FFN (up + gate + down): 46% (largest single component)
- Attention (QKV + scores + output): 46%
- Normalization + residuals: 5%
- Other (RoPE, etc.): 3%

How the Bottleneck Shifts

The dominant bottleneck changes with operating conditions:


Bottleneck by Operating Regime

| Regime | Dominant cost | Bound | Optimization |
|---|---|---|---|
| Prefill (large batch, long seq) | Attention (O(N²)) | Compute | FlashAttention, tensor cores |
| Decode (batch = 1) | Weight loading for all projections | Memory BW | Quantization, batching |
| Decode (batch = 32) | FFN matmul + weight loading | Mixed | Balance batch size with latency target |
| Very long context (32K+) | KV cache loading | Memory BW + capacity | GQA, KV quantization, sliding window |
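The decode rows can be made concrete with arithmetic intensity. For a weight matrix applied to a batch of b tokens in FP16, the matmul performs about 2·b FLOPs per 2 bytes of weights read, i.e. roughly b FLOPs per byte, regardless of matrix shape. An A100 needs on the order of 150 FLOPs per byte (peak FP16 throughput divided by HBM bandwidth) to stay compute-bound, so batch-1 decode is deep in memory-bandwidth territory. A back-of-envelope sketch (the helper is our own, not a library function):

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for a matmul
# applied to `batch` rows. Back-of-envelope only: activation and output
# traffic are ignored, which is reasonable when weights dominate.

def arithmetic_intensity(batch: int, bytes_per_weight: int = 2) -> float:
    """A (batch, d_in) x (d_in, d_out) matmul does 2*batch*d_in*d_out FLOPs
    and reads d_in*d_out*bytes_per_weight bytes of weights, so the ratio
    is 2*batch / bytes_per_weight, independent of d_in and d_out."""
    return 2 * batch / bytes_per_weight

# An A100 needs roughly 150 FLOPs/byte at FP16 to be compute-bound.
for b in (1, 32, 256):
    print(f"batch={b:4d}: {arithmetic_intensity(b):6.0f} FLOPs/byte")
```

This is the quantitative reason batching helps decode: each extra row in the batch reuses the same weight bytes for more FLOPs.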
The N^2 Crossover

At short sequences (L < 1024), the FFN dominates because its cost is O(L·d·d_ff) while attention's is O(L²·d). At long sequences (L > 4096), attention's quadratic scaling overtakes the FFN. The crossover point depends on the d_ff/d ratio; for LLaMA (d_ff ≈ 2.7d), a per-head estimate puts it around L ≈ 2.7 × d_h ≈ 346 tokens. In practice, with multi-head parallelism, attention dominates above ~2K tokens.
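Counting raw FLOPs alone puts the crossover much higher: the score and AV matmuls cost 4·L²·d FLOPs against 6·L·d·d_ff for the three SwiGLU matmuls, which balance at L = 1.5·d_ff ≈ 16.5K tokens for these shapes. The wall-clock crossover lands far earlier because attention runs at much lower hardware efficiency than the dense FFN GEMMs. A sketch of the pure-FLOP comparison:

```python
# Attention vs FFN FLOPs as sequence length grows (Llama-7B shapes).
# Pure FLOP counting; the observed *time* crossover is much earlier
# because attention is partly memory-bound.

d, d_ff = 4096, 11008

def attn_flops(L: int) -> int:
    return 4 * L * L * d      # QK^T scores + AV, summed over all heads

def ffn_flops(L: int) -> int:
    return 6 * L * d * d_ff   # SwiGLU: gate, up, down GEMMs

for L in (512, 2048, 8192, 32768):
    ratio = attn_flops(L) / ffn_flops(L)
    print(f"L={L:6d}: attention/FFN FLOP ratio = {ratio:.2f}")
```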

Memory Breakdown


Memory Usage by Component (Llama-7B, FP16)

| Component | Parameters | Memory (FP16) | Share of model |
|---|---|---|---|
| Embedding (input + output) | 2 × 32000 × 4096 | 0.5 GB | 3.9% |
| QKV projections (per layer) | 3 × 4096² | 96 MB × 32 = 3.0 GB | 23.9% |
| Output projection (per layer) | 4096² | 32 MB × 32 = 1.0 GB | 8.0% |
| FFN (gate + up + down, per layer) | 3 × 4096 × 11008 | 258 MB × 32 = 8.1 GB | 64.2% |
| RMSNorm (per layer) | 2 × 4096 | 16 KB × 32 = 0.5 MB | <0.1% |
| Total model | 6.7B parameters | ~13.5 GB | 100% |

Note: FFN dominates the parameter count (~64%) because of the 2.7× expansion ratio in SwiGLU.
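The parameter counts can be re-derived from the listed shapes; the variable names below are ours, and the arithmetic needs no framework. The FFN's share of parameters works out to roughly two thirds of the total.

```python
# Recompute Llama-7B parameter counts from the shapes in the table.
vocab, d, d_ff, layers = 32000, 4096, 11008, 32

embed = 2 * vocab * d    # input + output embedding matrices
qkv   = 3 * d * d        # per-layer Q, K, V projections
out   = d * d            # per-layer output projection
ffn   = 3 * d * d_ff     # per-layer gate + up + down
norms = 2 * d            # two RMSNorm weight vectors per layer

per_layer = qkv + out + ffn + norms
total = embed + layers * per_layer

print(f"total params: {total / 1e9:.2f} B")         # ~6.74 B
print(f"FP16 size:    {total * 2 / 1e9:.1f} GB")    # ~13.5 GB
print(f"FFN share:    {layers * ffn / total:.1%}")  # ~64%
```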

FFN weights are ~64% of the model. This is why weight quantization has such a large impact: quantizing the FFN weights from FP16 to INT4 saves ~6 GB of the ~13.5 GB total.

Scaling Analysis: How Components Scale with Model Size


Component Scaling Across Model Sizes

| Model | Layers | d_model | Attention share | FFN share | Norm share |
|---|---|---|---|---|---|
| Llama-1B | 22 | 2048 | 28% | 68% | 4% |
| Llama-7B | 32 | 4096 | 28% | 67% | 5% |
| Llama-13B | 40 | 5120 | 29% | 66% | 5% |
| Llama-70B | 80 | 8192 | 30% | 65% | 5% |

Note: Attention share increases slightly with model size because head count grows faster than the d_ff/d ratio.

The ratios are remarkably stable across model sizes. FFN consistently dominates at ~65-68%, attention at ~28-30%, and normalization is always negligible. This means optimization strategies that work for 7B generally transfer to 70B.

Optimization Impact by Component

Optimization Impact by Transformer Component

| Optimization | Effect | Speedup potential |
|---|---|---|
| FFN quantization (FP16 → INT4) | Biggest win (~64% of weights) | ~3× |
| FlashAttention (prefill) | Eliminates attention memory traffic | ~2.5× |
| GQA (reduce KV heads) | 4-8× KV cache reduction | ~2× |
| Fuse norm + residual | Small but free | ~1.15× |
| Fuse QKV projection | 3 GEMMs → 1 | ~1.1× |
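If these figures are read as per-component speedups, Amdahl's law converts them into end-to-end bounds using the layer-time shares from the first table. A minimal sketch (the helper name is our own, not a library function):

```python
# Amdahl's law: end-to-end speedup when a fraction `share` of layer
# time is accelerated by `component_speedup`. Shares come from the
# per-layer time breakdown earlier in this section.

def overall_speedup(share: float, component_speedup: float) -> float:
    return 1.0 / ((1.0 - share) + share / component_speedup)

# FFN is 46% of layer time: a 3x faster FFN yields ~1.44x overall.
print(round(overall_speedup(0.46, 3.0), 2))   # 1.44
# The attention block is 46%: 2.5x faster yields ~1.38x overall.
print(round(overall_speedup(0.46, 2.5), 2))   # 1.38
```

Combining independent optimizations multiplies these factors only approximately, since accelerating one component grows the relative share of everything else.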

Conclusion

The transformer’s performance profile is dominated by two components: the FFN (46-68% of compute and parameters) and attention (28-30%). The FFN is the primary target for weight quantization (~64% of model memory). Attention is the primary target for algorithmic optimization (FlashAttention for prefill, GQA for the KV cache). Normalization and residual connections are under 5% of both compute and memory, so optimize them last: fuse if convenient, but don’t obsess. These ratios are stable across model sizes, so optimization strategies transfer well from small to large models.