At sequence length 4096, attention consumes 82% of transformer inference time on a V100. Double the sequence to 8192 and attention time quadruples while everything else stays flat: the O(N^2) complexity is not theoretical, it dominates production workloads. Before FlashAttention cut attention time by 2-4x, we had to profile standard attention to understand where the cycles actually go: is it the QK^T matmul, the softmax, or the memory bandwidth between them?

The Attention Computation Pipeline

Standard scaled dot-product attention involves five sequential operations. Profiling each individually reveals a clear picture:


Attention Operation Breakdown (seq_len=1024, d_model=768, single head)

| Operation | Time (ms) | Share | Complexity | Bound |
| --- | --- | --- | --- | --- |
| QK^T matmul | 8.2 | 45% | O(N^2 d) | Compute |
| Softmax | 6.8 | 37% | O(N^2) | Memory bandwidth |
| Attention x V | 2.4 | 13% | O(N^2 d) | Compute |
| Scaling + masking | 0.8 | 5% | O(N^2) | Memory bandwidth |

Note: Profiled on V100 with PyTorch 1.12, FP32. Total: 18.2 ms per head.

Two operations dominate: QK^T (45%) and softmax (37%). The QK^T matmul is compute-bound; it is a large matrix multiplication that tensor cores handle well. But softmax is memory-bandwidth-bound: it requires reading the entire N x N score matrix, computing element-wise exponentials, normalizing, and writing back. This read-compute-write pattern is the fundamental bottleneck.
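As a reference for what is being profiled, here is a minimal NumPy sketch of the five stages (matmul, scaling, masking, softmax, weighted sum). The function name and shapes are illustrative; this is not the profiling harness used for the table above.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention for one head. Q, K, V: (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T                                       # 1. QK^T matmul, O(N^2 d)
    scores = scores / np.sqrt(d)                           # 2. scaling, O(N^2)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)              # 3. masking, O(N^2)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stabilization
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # 4. softmax, O(N^2)
    return weights @ V                                     # 5. attention x V, O(N^2 d)

rng = np.random.default_rng(0)
N, d = 64, 32
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
out = attention(Q, K, V)
```

The intermediate `scores`/`weights` arrays are the N x N matrices whose memory traffic is analyzed below.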

Quadratic Scaling in Practice

The O(N^2) scaling is not just theoretical; it hits hard in practice. Doubling sequence length does not double attention time, it quadruples it.
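The quadrupling can be checked with the FLOP counts used throughout this section (2N^2 d for each of the two matmuls, roughly 5N^2 for softmax). Since every term is quadratic in N, doubling N scales each term by exactly four:

```python
# Attention FLOP count per the formulas in this section: QK^T and
# attention x V cost 2N^2 d each, softmax ~5N^2.
def attention_flops(N: int, d: int) -> int:
    return 2 * N * N * d + 5 * N * N + 2 * N * N * d

# Doubling N quadruples every term, hence the latency curve below.
ratios = [attention_flops(2 * N, 768) / attention_flops(N, 768)
          for N in (512, 1024, 2048)]
print(ratios)  # every ratio is exactly 4.0
```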

Attention Latency vs Sequence Length (Single Layer, 32 Heads)

| Sequence length | Latency (ms) |
| --- | --- |
| N=128 | 2 (negligible) |
| N=512 | 18 |
| N=1024 | 72 |
| N=2048 | 290 |
| N=4096 | 1,160 (4x from 2048) |

At sequence length 4096, a single attention layer takes over a second on V100. For a 32-layer model, that is 37 seconds just for attention, clearly untenable for interactive inference.

Memory Access Analysis

The memory access pattern reveals why attention is bandwidth-bound at scale. For each attention head with sequence length N and head dimension d:


Memory Traffic Per Attention Head (N=4096, d=128, FP16)

| Step | Reads | Writes | Traffic |
| --- | --- | --- | --- |
| Load Q, K | 2 x (N x d) x 2 B = 2.1 MB | n/a | 2.1 MB |
| Store QK^T scores | n/a | N^2 x 2 B = 33.5 MB | 33.5 MB |
| Load scores for softmax | N^2 x 2 B = 33.5 MB | n/a | 33.5 MB |
| Store softmax output | n/a | N^2 x 2 B = 33.5 MB | 33.5 MB |
| Load weights + V | (N^2 + N x d) x 2 B = 34.6 MB | n/a | 34.6 MB |
| Store output | n/a | N x d x 2 B = 1.0 MB | 1.0 MB |

Note: Total: ~138 MB per head. The N x N intermediate score matrix S accounts for 97% of traffic.

The N x N score matrix is the culprit. It is written, read, and written again (store QK^T, load for softmax, store the softmax result) before being read one final time for the attention x V matmul. For 32 heads at N=4096, total HBM traffic exceeds 4 GB per layer.
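The table's accounting can be reproduced in a few lines. This is a sketch assuming FP16 (2 bytes per element) and the same step split as above:

```python
# Memory traffic per attention head, N=4096, d=128, FP16 (2 bytes/element).
N, d, BYTES = 4096, 128, 2
nd = N * d * BYTES        # one N x d matrix: ~1.05 MB
nn = N * N * BYTES        # the N x N score matrix: ~33.6 MB

traffic = (
    2 * nd     # load Q and K
    + nn       # store QK^T scores
    + nn       # load scores for softmax
    + nn       # store softmax output
    + nn + nd  # load attention weights and V
    + nd       # store output
)
score_traffic = 4 * nn    # the four HBM crossings of the N x N matrix
print(round(traffic / 1e6), round(100 * score_traffic / traffic))
```

Running this gives ~138 MB total, of which 97% is the score matrix, matching the table.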

⚡ The Key Insight

The N x N attention score matrix is a massive intermediate that exists only to be immediately consumed. It does not need to live in HBM at all: if we can compute and consume it in tiles that fit in SRAM, we eliminate 97% of memory traffic. This is exactly what FlashAttention does.
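To make the insight concrete, here is a toy NumPy sketch of the tiling trick for a single query: K and V are processed in blocks (stand-ins for SRAM tiles), and a running max plus a running denominator (the online-softmax recurrence) mean the full score row never materializes. The block size and single-query scope are simplifications; real FlashAttention also tiles over queries and runs as a fused GPU kernel.

```python
import numpy as np

def flash_attention_row(q, K, V, block=64):
    """Exact attention output for one query q (d,), tiling K/V (N, d) over N."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores seen so far
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[-1], dtype=np.float64) # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) / np.sqrt(d)  # scores for this tile only
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale old accumulator/denominator
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
N, d = 256, 32
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
q = rng.standard_normal(d)

# Reference: materialize all N scores and softmax them at once.
s = (K @ q) / np.sqrt(d)
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ V
out = flash_attention_row(q, K, V)
```

Note that this is exact, not an approximation: the rescaling by `correction` makes the tiled result bitwise-close to the reference softmax.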

Arithmetic Intensity

Arithmetic intensity (FLOPs per byte of memory traffic) determines whether an operation is compute-bound or memory-bound on a given GPU:


Arithmetic Intensity of Attention Operations

| Operation | FLOPs | Memory Traffic (bytes) | AI (FLOP/byte) | Bound on V100 |
| --- | --- | --- | --- | --- |
| QK^T (standard) | 2N^2 d | 4Nd + 2N^2 | ~2.8 | Memory |
| Softmax | 5N^2 | 4N^2 | 1.25 | Memory |
| Attention x V | 2N^2 d | 2N^2 + 4Nd | ~2.8 | Memory |
| Fused attention (Flash) | 4N^2 d + 5N^2 | 4Nd + 2Nd | ~89 | Compute |

Note: V100's compute/memory crossover point is ~125 FLOP/byte (900 GB/s HBM, 112 TFLOPS FP16).

Standard attention operations have arithmetic intensity of 1-3 FLOP/byte, deep in memory-bound territory on modern GPUs. Fusing everything into a single kernel with tiled SRAM access pushes arithmetic intensity to ~89 FLOP/byte, close to the V100's ~125 FLOP/byte crossover and far past the regime where HBM traffic dominates.
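The note's crossover figure is just peak FLOP rate divided by peak bandwidth, and the softmax row follows directly from its FLOP and byte counts. A quick check, using the peak numbers quoted in the table note:

```python
# Roofline arithmetic for the V100 figures quoted above.
PEAK_FLOPS = 112e12   # V100 FP16 tensor-core peak, FLOP/s
PEAK_BW = 900e9       # V100 HBM2 bandwidth, bytes/s

# An op is compute-bound only above this arithmetic intensity.
crossover = PEAK_FLOPS / PEAK_BW   # comes out just under the ~125 quoted

# Softmax: 5N^2 FLOPs over 4N^2 bytes (FP16 read + write of the scores);
# the N^2 factors cancel, so AI is a constant 1.25 regardless of N.
softmax_ai = 5 / 4

print(round(crossover), softmax_ai)
```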

Optimization Approaches

The analysis points to three complementary optimization strategies:

Tiled/Fused Attention (FlashAttention): Eliminate HBM traffic for the N x N intermediate by processing in SRAM-sized tiles. This is the most impactful optimization: it does not approximate, it computes the exact same result with 33x less memory traffic.

Sparse Attention Patterns: Reduce the effective N^2 computation by only attending to a subset of tokens. Local attention (window size W) reduces complexity from O(N^2) to O(NW). The trade-off is model quality: sparse patterns lose long-range dependencies.

Low-Rank Approximation: If the attention weights are approximately low-rank (rank r much smaller than both d and N), you can compute Q_low K_low^T in O(N^2 r) instead of O(N^2 d). This trades approximation error for speed, and works best when attention patterns are smooth.
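As an illustration of the sparse strategy, here is a NumPy sketch of local (sliding-window) attention. The window convention and the per-query loop are illustrative; a production kernel would compute the banded scores in batched form.

```python
import numpy as np

def local_attention(Q, K, V, W=128):
    """Each query i attends only to keys in [i - W//2, i + W//2]. Q, K, V: (N, d)."""
    N, d = Q.shape
    out = np.empty_like(Q)
    for i in range(N):
        lo, hi = max(0, i - W // 2), min(N, i + W // 2 + 1)
        s = (K[lo:hi] @ Q[i]) / np.sqrt(d)   # at most W+1 scores per query, not N
        p = np.exp(s - s.max())
        out[i] = (p / p.sum()) @ V[lo:hi]    # O(W d) work per query -> O(N W d) total
    return out

rng = np.random.default_rng(1)
N, d = 64, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = local_attention(Q, K, V, W=128)  # here W >= N, so the window covers everything
```

With W smaller than N the band is a strict subset of the full score matrix, which is where both the O(NW) savings and the loss of long-range dependencies come from.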

Optimization Strategy Comparison (seq_len=1024, d=768)

| Strategy | Latency (ms) | Speedup |
| --- | --- | --- |
| Standard attention | 582 | 1x (baseline) |
| FlashAttention (exact) | 78 | 7.5x |
| Local attention (W=128) | 45 | 13x, but approximate |
| Low-rank (r=64) | 62 | 9.4x, but approximate |

Conclusion

Profiling attention reveals a clear hierarchy of bottlenecks. The QK^T and softmax operations consume 82% of time, and the root cause is the NxN intermediate matrix that requires 97% of HBM traffic. The optimization path is clear: eliminate that intermediate (FlashAttention), reduce its size (sparse attention), or compress it (low-rank). Profile first, optimize the actual bottleneck, measure the improvement.