A single Llama-7B forward pass spends 32ms total across 32 transformer layers, but the time is not evenly distributed: 46% goes to the FFN, 28% to attention, 18% to projections, and under 5% to normalization and residuals. For long contexts (seq > 2048), attention’s O(N²) complexity makes it the bottleneck. For short contexts (seq < 512), the FFN dominates because its 3 weight matrices in SwiGLU perform more FLOPs than attention. Understanding where time goes — and how it shifts with workload — is essential for optimization.
Component-Level Performance Breakdown
Time per Transformer Layer by Component (Llama-7B, A100, FP16, batch=1, seq=512)
| Component | FLOPs | Time (us) | Share | Bound |
|---|---|---|---|---|
| QKV Projection | 3 x 2 x B x L x d^2 | 180 | 18% | Compute |
| Attention (QK^T + Softmax + AV) | 2 x 2 x B x L^2 x d_h x H | 220 | 22% | Compute/Memory |
| Output Projection | 2 x B x L x d^2 | 65 | 6% | Compute |
| FFN Gate + Up (SwiGLU) | 2 x 2 x B x L x d x d_ff | 310 | 31% | Compute |
| FFN Down | 2 x B x L x d_ff x d | 155 | 15% | Compute |
| RMSNorm (x2) | 2 x B x L x d | 25 | 2.5% | Memory BW |
| Residual Add (x2) | 2 x B x L x d | 15 | 1.5% | Memory BW |
| Other (rotary emb, etc.) | -- | 30 | 3% | -- |
The FFN dominates (46% of compute) because LLaMA uses SwiGLU, which has 3 weight matrices in the FFN instead of the original transformer’s 2. Attention is second at 28%. Normalization and residual connections are negligible (under 5%).
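These shares can be cross-checked by counting FLOPs directly from the formulas in the table. A minimal sketch with Llama-7B shapes; note these are raw FLOP shares, which differ somewhat from the measured time shares above because the quadratic attention ops run at lower hardware utilization than the large GEMMs:

```python
# Per-layer FLOP counts for Llama-7B (d=4096, d_ff=11008, H=32 heads),
# using the same formulas as the table above.
B, L = 1, 512                  # batch size, sequence length
d, d_ff, H = 4096, 11008, 32
d_h = d // H                   # head dimension (128)

flops = {
    "qkv_proj":    3 * 2 * B * L * d * d,        # three d x d projections
    "attention":   2 * 2 * B * L * L * d_h * H,  # QK^T and AV matmuls
    "out_proj":    2 * B * L * d * d,
    "ffn_gate_up": 2 * 2 * B * L * d * d_ff,     # SwiGLU gate + up
    "ffn_down":    2 * B * L * d_ff * d,
}
total = sum(flops.values())
for name, f in flops.items():
    print(f"{name:12s} {f / 1e9:7.1f} GFLOPs  {100 * f / total:5.1f}%")
```

At seq=512 the quadratic attention term is only ~2% of raw FLOPs, yet it takes 22% of wall-clock time, which is exactly the utilization gap the table's "Compute/Memory" label hints at.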
Compute Distribution Within a Transformer Layer

[Figure: per-component share of layer time (% of layer time)]

How the Bottleneck Shifts
The dominant bottleneck changes with operating conditions:
Bottleneck by Operating Regime
| Regime | Dominant Cost | Bound | Optimization |
|---|---|---|---|
| Prefill (large batch, long seq) | Attention (O(N^2)) | Compute | FlashAttention, tensor cores |
| Decode (batch=1) | Weight loading for all projections | Memory BW | Quantization, batching |
| Decode (batch=32) | FFN matmul + weight loading | Mixed | Balance batch size with latency target |
| Very long context (32K+) | KV cache loading | Memory BW + capacity | GQA, KV quantization, sliding window |
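The batch=1 decode row is easy to sanity-check: every generated token must stream essentially all model weights from HBM, so bandwidth, not compute, sets the latency floor. A rough estimate (the ~2 TB/s A100 bandwidth figure is an assumed round number):

```python
# Latency floor for batch=1 decode: time-per-token is bounded below by
# (weight bytes) / (memory bandwidth), independent of compute throughput.
weight_bytes = 6.7e9 * 2     # 6.7B params in FP16
hbm_bw = 2.0e12              # bytes/s, assumed round A100-class HBM bandwidth
floor_ms = weight_bytes / hbm_bw * 1e3
print(f"decode floor: {floor_ms:.1f} ms/token, "
      f"max {1e3 / floor_ms:.0f} tok/s")
```

This is why quantization and batching appear as the optimizations for this regime: INT4 weights cut the bytes streamed per token by 4x, and batching amortizes the same weight traffic over many tokens.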
At short sequences (L under 1024), the FFN dominates because its cost scales as O(L x d x d_ff) while attention's score/value matmuls scale as O(L^2 x d). At long sequences, attention's quadratic term overtakes the FFN. Equating the two per-layer costs (4 x L^2 x d for the score/value matmuls vs. 6 x L x d x d_ff for SwiGLU's three matrices) puts the raw-FLOP crossover at L = 1.5 x d_ff, roughly 16K tokens for LLaMA-7B (d_ff = 2.7 x d). In practice attention becomes the wall-clock bottleneck far earlier, above roughly 2K tokens, because the quadratic ops (softmax and the score matmuls) run at much lower hardware utilization than the FFN's large GEMMs.
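One way to locate the crossover in raw FLOP terms is to equate the per-layer cost of the quadratic attention matmuls with that of the three SwiGLU matrices (wall-clock crossover comes much earlier because the quadratic ops run at lower utilization):

```python
# Sequence length where attention's O(L^2) score/value matmul FLOPs
# equal the SwiGLU FFN's O(L) FLOPs, per layer (Llama-7B shapes).
d, d_ff = 4096, 11008

def attn_flops(L):
    return 4 * L * L * d        # QK^T + AV, summed over all heads

def ffn_flops(L):
    return 6 * L * d * d_ff     # gate, up, down: 3 matmuls

L_star = 1.5 * d_ff             # solve 4*L^2*d = 6*L*d*d_ff for L
print(L_star)                   # 16512.0
assert attn_flops(L_star) == ffn_flops(L_star)
```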
Memory Breakdown
Memory Usage by Component (Llama-7B, FP16)
| Component | Parameters | Memory | Share of Model |
|---|---|---|---|
| Embedding (input + output) | 2 x 32000 x 4096 | 0.5 GB | 3.9% |
| QKV projections (per layer) | 3 x 4096^2 | 96 MB x 32 = 3.0 GB | 23.9% |
| Output projection (per layer) | 4096^2 | 32 MB x 32 = 1.0 GB | 8.0% |
| FFN (gate+up+down, per layer) | 3 x 4096 x 11008 | 258 MB x 32 = 8.1 GB | 64.2% |
| RMSNorm (per layer) | 2 x 4096 | 16 KB x 32 = 0.5 MB | <0.1% |
| Total model | 6.7B parameters | ~13.5 GB | 100% |
FFN weights are ~64% of the model. This is why weight quantization has such a large impact: quantizing FFN weights alone from FP16 to INT4 saves ~6 GB of the ~13.5 GB total.
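A sketch of the parameter arithmetic behind the table and the INT4 savings (counts derived from the shapes above; the few percent of overhead from group-wise quantization scales and zero-points is ignored):

```python
# Parameter counts for Llama-7B and the memory saved by quantizing
# only the FFN weights from FP16 (2 bytes/param) to INT4 (0.5 bytes).
n_layers, d, d_ff, vocab = 32, 4096, 11008, 32000

params = {
    "embeddings": 2 * vocab * d,          # input + output embedding
    "qkv":        n_layers * 3 * d * d,
    "out_proj":   n_layers * d * d,
    "ffn":        n_layers * 3 * d * d_ff,
    "rmsnorm":    n_layers * 2 * d,
}
total_gb = sum(params.values()) * 2 / 2**30
ffn_fp16 = params["ffn"] * 2 / 2**30
ffn_int4 = params["ffn"] * 0.5 / 2**30
print(f"model: {total_gb:.1f} GB FP16; "
      f"FFN {ffn_fp16:.1f} GB -> {ffn_int4:.1f} GB INT4 "
      f"(saves {ffn_fp16 - ffn_int4:.1f} GB)")
```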
Scaling Analysis: How Components Scale with Model Size
Component Scaling Across Model Sizes
| Model | Layers | d_model | Attention Share | FFN Share | Norm Share |
|---|---|---|---|---|---|
| Llama-1B | 22 | 2048 | 28% | 68% | 4% |
| Llama-7B | 32 | 4096 | 28% | 67% | 5% |
| Llama-13B | 40 | 5120 | 29% | 66% | 5% |
| Llama-70B | 80 | 8192 | 30% | 65% | 5% |
The ratios are remarkably stable across model sizes. (In this table, attention includes its QKV and output projections, which is why the FFN share reads higher than the 46% component-level figure above.) FFN consistently dominates at ~65-68%, attention sits at ~28-30%, and normalization is always negligible. This means optimization strategies that work at 7B generally transfer to 70B.
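The stability falls out of the arithmetic: per token and per layer, the attention-side projections cost about 8·d^2 FLOPs and the SwiGLU FFN about 6·d_ff·d, so the FFN share depends only on the ratio r = d_ff/d, which LLaMA holds near 2.7 at every scale. A quick check (ignoring the O(L^2) score term, which is small at short sequences):

```python
# FFN share of per-layer matmul FLOPs as a function of r = d_ff/d only:
# attention projections ~8*d^2, SwiGLU FFN ~6*r*d^2, so share = 6r/(6r+8).
r = 11008 / 4096                     # d_ff/d for Llama-7B, ~2.69
ffn_share = 6 * r / (6 * r + 8)      # independent of d and of depth
print(f"FFN share of matmul FLOPs: {100 * ffn_share:.0f}%")  # ~67%
```

The result matches the ~67% FFN share in the scaling table, and explains why it barely moves from 1B to 70B: the depth and width cancel out, leaving only r.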
Optimization Impact by Component
[Figure: Optimization Impact by Transformer Component (x speedup potential)]

Conclusion
The transformer’s performance profile is dominated by two components: the FFN (46-68% of compute, ~64% of parameters) and attention (28-30% of compute). The FFN is the primary target for weight quantization, since it holds most of the model memory. Attention is the primary target for algorithmic optimization (FlashAttention for prefill, GQA for the KV cache). Normalization and residual connections account for under 5% of both compute and memory, so optimize them last (fuse if convenient, but don't obsess). These ratios are stable across model sizes, so optimization strategies transfer well from small to large models.