A single Llama-7B forward pass spends 32ms total across 32 transformer layers, but the time is not evenly distributed: 46% goes to the FFN, 28% to attention, 18% to projections, and under 5% to normalization and residuals. For long contexts (seq > 2048), attention’s O(N²) complexity makes it the bottleneck. For short contexts (seq < 512), the FFN dominates because its 3 weight matrices in SwiGLU perform more FLOPs than attention. Understanding where time goes — and how it shifts with workload — is essential for optimization.

Component-Level Performance Breakdown


Time per Transformer Layer by Component (Llama-7B, A100, FP16, batch=1, seq=512)

| Component | FLOPs | Time (µs) | Share | Bound |
|---|---|---|---|---|
| QKV projection | 3 × 2·B·L·d² | 180 | 18% | Compute |
| Attention (QK^T + softmax + AV) | 2 × 2·B·L²·d_h·H | 220 | 22% | Compute/Memory |
| Output projection | 2·B·L·d² | 65 | 6% | Compute |
| FFN gate + up (SwiGLU) | 2 × 2·B·L·d·d_ff | 310 | 31% | Compute |
| FFN down | 2·B·L·d_ff·d | 155 | 15% | Compute |
| RMSNorm (×2) | 2·B·L·d | 25 | 2.5% | Memory BW |
| Residual add (×2) | 2·B·L·d | 15 | 1.5% | Memory BW |
| Other (rotary embeddings, etc.) | -- | 30 | 3% | -- |

Note: d = 4096, d_ff = 11008, H = 32, d_h = 128. Total ≈ 1000 µs per layer; 32 layers ≈ 32 ms per forward pass.
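These formulas are easy to sanity-check in a few lines. The sketch below recomputes each component's FLOPs for the table's configuration (the dictionary keys are our own labels, not any profiler's API). Note that FLOP shares and time shares diverge sharply: attention proper is only ~2% of the FLOPs but ~22% of the time, which is exactly why the table marks it Compute/Memory bound.

```python
# Per-layer FLOP counts for Llama-7B, using the formulas from the table.
# Illustrative arithmetic only; names are ours, not a profiler API.

B, L = 1, 512            # batch size, sequence length
d, d_ff = 4096, 11008    # model dim, FFN hidden dim
H, d_h = 32, 128         # attention heads, head dim (H * d_h == d)

flops = {
    "qkv_proj":    3 * 2 * B * L * d * d,        # three d x d GEMMs
    "attention":   2 * 2 * B * L * L * d_h * H,  # QK^T scores + AV
    "out_proj":    2 * B * L * d * d,
    "ffn_gate_up": 2 * 2 * B * L * d * d_ff,     # SwiGLU gate + up
    "ffn_down":    2 * B * L * d_ff * d,
}

total = sum(flops.values())
for name, f in flops.items():
    print(f"{name:12s} {f / 1e9:6.1f} GFLOP  {100 * f / total:5.1f}%")
```

Run at these shapes, the two FFN matmuls are ~65% of FLOPs while the attention matmuls are ~2%: FLOP share tracks parameter share, not time share.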

The FFN dominates (46% of layer time) because LLaMA uses SwiGLU, which has 3 weight matrices in the FFN instead of the original transformer’s 2. Attention is second at 28% (22% for the attention computation itself plus 6% for the output projection). Normalization and residual connections are negligible (under 5%).

Compute Distribution Within a Transformer Layer

(% of layer time)

- FFN (up + gate + down): 46% (largest single component)
- Attention (QKV + scores + output): 46%
- Normalization + residuals: 5%
- Other (RoPE, etc.): 3%

How the Bottleneck Shifts

The dominant bottleneck changes with operating conditions:


Bottleneck by Operating Regime

| Regime | Dominant cost | Bound | Optimization |
|---|---|---|---|
| Prefill (large batch, long seq) | Attention (O(N²)) | Compute | FlashAttention, tensor cores |
| Decode (batch = 1) | Weight loading for all projections | Memory BW | Quantization, batching |
| Decode (batch = 32) | FFN matmul + weight loading | Mixed | Balance batch size with latency target |
| Very long context (32K+) | KV cache loading | Memory BW + capacity | GQA, KV quantization, sliding window |
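The decode rows can be made concrete with arithmetic intensity. For a weight matrix applied to a batch of b tokens in FP16, the matmul performs about 2·b FLOPs per 2 bytes of weights read, i.e. roughly b FLOPs per byte, regardless of matrix shape. An A100 needs on the order of 150 FLOPs per byte (peak FP16 throughput divided by HBM bandwidth) to stay compute-bound, so batch-1 decode is deep in memory-bandwidth territory. A back-of-envelope sketch (the helper is our own, not a library function):

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for a matmul
# applied to `batch` rows. Back-of-envelope only: activation and output
# traffic are ignored, which is reasonable when weights dominate.

def arithmetic_intensity(batch: int, bytes_per_weight: int = 2) -> float:
    """A (batch, d_in) x (d_in, d_out) matmul does 2*batch*d_in*d_out FLOPs
    and reads d_in*d_out*bytes_per_weight bytes of weights, so the ratio
    is 2*batch / bytes_per_weight, independent of d_in and d_out."""
    return 2 * batch / bytes_per_weight

# An A100 needs roughly 150 FLOPs/byte at FP16 to be compute-bound.
for b in (1, 32, 256):
    print(f"batch={b:4d}: {arithmetic_intensity(b):6.0f} FLOPs/byte")
```

This is the quantitative reason batching helps decode: each extra row in the batch reuses the same weight bytes for more FLOPs.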
The N^2 Crossover

At short sequences (L < 1024), the FFN dominates because its cost is O(L·d·d_ff) while attention's is O(L²·d). At long sequences (L > 4096), attention's quadratic scaling overtakes the FFN. The crossover point depends on the d_ff/d ratio; for LLaMA (d_ff ≈ 2.7d), a per-head estimate puts it around L ≈ 2.7 × d_h ≈ 346 tokens. In practice, with multi-head parallelism, attention dominates above ~2K tokens.
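Counting raw FLOPs alone puts the crossover much higher: the score and AV matmuls cost 4·L²·d FLOPs against 6·L·d·d_ff for the three SwiGLU matmuls, which balance at L = 1.5·d_ff ≈ 16.5K tokens for these shapes. The wall-clock crossover lands far earlier because attention runs at much lower hardware efficiency than the dense FFN GEMMs. A sketch of the pure-FLOP comparison:

```python
# Attention vs FFN FLOPs as sequence length grows (Llama-7B shapes).
# Pure FLOP counting; the observed *time* crossover is much earlier
# because attention is partly memory-bound.

d, d_ff = 4096, 11008

def attn_flops(L: int) -> int:
    return 4 * L * L * d      # QK^T scores + AV, summed over all heads

def ffn_flops(L: int) -> int:
    return 6 * L * d * d_ff   # SwiGLU: gate, up, down GEMMs

for L in (512, 2048, 8192, 32768):
    ratio = attn_flops(L) / ffn_flops(L)
    print(f"L={L:6d}: attention/FFN FLOP ratio = {ratio:.2f}")
```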

Memory Breakdown


Memory Usage by Component (Llama-7B, FP16)

| Component | Parameters | Memory (FP16) | Share of model |
|---|---|---|---|
| Embedding (input + output) | 2 × 32000 × 4096 | 0.5 GB | 3.9% |
| QKV projections (per layer) | 3 × 4096² | 96 MB × 32 = 3.0 GB | 23.9% |
| Output projection (per layer) | 4096² | 32 MB × 32 = 1.0 GB | 8.0% |
| FFN (gate + up + down, per layer) | 3 × 4096 × 11008 | 258 MB × 32 = 8.1 GB | 64.2% |
| RMSNorm (per layer) | 2 × 4096 | 16 KB × 32 = 0.5 MB | <0.1% |
| Total model | 6.7B parameters | ~13.5 GB | 100% |

Note: FFN dominates the parameter count (~64%) because of the 2.7× expansion ratio in SwiGLU.
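The parameter counts can be re-derived from the listed shapes; the variable names below are ours, and the arithmetic needs no framework. The FFN's share of parameters works out to roughly two thirds of the total.

```python
# Recompute Llama-7B parameter counts from the shapes in the table.
vocab, d, d_ff, layers = 32000, 4096, 11008, 32

embed = 2 * vocab * d    # input + output embedding matrices
qkv   = 3 * d * d        # per-layer Q, K, V projections
out   = d * d            # per-layer output projection
ffn   = 3 * d * d_ff     # per-layer gate + up + down
norms = 2 * d            # two RMSNorm weight vectors per layer

per_layer = qkv + out + ffn + norms
total = embed + layers * per_layer

print(f"total params: {total / 1e9:.2f} B")         # ~6.74 B
print(f"FP16 size:    {total * 2 / 1e9:.1f} GB")    # ~13.5 GB
print(f"FFN share:    {layers * ffn / total:.1%}")  # ~64%
```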

FFN weights are ~64% of the model. This is why weight quantization has such a large impact: quantizing the FFN weights from FP16 to INT4 saves ~6 GB of the ~13.5 GB total.

Scaling Analysis: How Components Scale with Model Size


Component Scaling Across Model Sizes

| Model | Layers | d_model | Attention share | FFN share | Norm share |
|---|---|---|---|---|---|
| Llama-1B | 22 | 2048 | 28% | 68% | 4% |
| Llama-7B | 32 | 4096 | 28% | 67% | 5% |
| Llama-13B | 40 | 5120 | 29% | 66% | 5% |
| Llama-70B | 80 | 8192 | 30% | 65% | 5% |

Note: Attention share increases slightly with model size because head count grows faster than the d_ff/d ratio.

The ratios are remarkably stable across model sizes. FFN consistently dominates at ~65-68%, attention at ~28-30%, and normalization is always negligible. This means optimization strategies that work for 7B generally transfer to 70B.

Optimization Impact by Component

Optimization Impact by Transformer Component

| Optimization | Effect | Speedup potential |
|---|---|---|
| FFN quantization (FP16 → INT4) | Biggest win (~64% of weights) | ~3× |
| FlashAttention (prefill) | Eliminates attention memory traffic | ~2.5× |
| GQA (reduce KV heads) | 4-8× KV cache reduction | ~2× |
| Fuse norm + residual | Small but free | ~1.15× |
| Fuse QKV projection | 3 GEMMs → 1 | ~1.1× |
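If these figures are read as per-component speedups, Amdahl's law converts them into end-to-end bounds using the layer-time shares from the first table. A minimal sketch (the helper name is our own, not a library function):

```python
# Amdahl's law: end-to-end speedup when a fraction `share` of layer
# time is accelerated by `component_speedup`. Shares come from the
# per-layer time breakdown earlier in this section.

def overall_speedup(share: float, component_speedup: float) -> float:
    return 1.0 / ((1.0 - share) + share / component_speedup)

# FFN is 46% of layer time: a 3x faster FFN yields ~1.44x overall.
print(round(overall_speedup(0.46, 3.0), 2))   # 1.44
# The attention block is 46%: 2.5x faster yields ~1.38x overall.
print(round(overall_speedup(0.46, 2.5), 2))   # 1.38
```

Combining independent optimizations multiplies these factors only approximately, since accelerating one component grows the relative share of everything else.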

Conclusion

The transformer’s performance profile is dominated by two components: the FFN (46-68% of compute and parameters) and attention (28-30%). The FFN is the primary target for weight quantization (~64% of model memory). Attention is the primary target for algorithmic optimization (FlashAttention for prefill, GQA for the KV cache). Normalization and residual connections are under 5% of both compute and memory, so optimize them last: fuse if convenient, but don’t obsess. These ratios are stable across model sizes, so optimization strategies transfer well from small to large models.