At sequence length 4096, attention consumes 82% of transformer inference time on a V100. Double the sequence to 8192 and attention time quadruples while everything else stays flat. The O(N^2) complexity is not theoretical; it dominates production workloads. Before FlashAttention cut attention time by 2-4x, we had to profile standard attention to understand where the cycles actually go: is it the QK^T matmul, the softmax, or the memory bandwidth between them?
## The Attention Computation Pipeline
Standard scaled dot-product attention involves five sequential operations. Profiling each individually reveals a clear picture:
Attention Operation Breakdown (seq_len=1024, d_model=768, single head)
| Operation | Time (ms) | Share | Complexity | Bound |
|---|---|---|---|---|
| QK^T matmul | 8.2 | 45% | O(N^2 d) | Compute |
| Softmax | 6.8 | 37% | O(N^2) | Memory bandwidth |
| Attention x V | 2.4 | 13% | O(N^2 d) | Compute |
| Scaling + masking | 0.8 | 5% | O(N^2) | Memory bandwidth |
Two operations dominate: QK^T (45%) and softmax (37%). The QK^T matmul is compute-bound: it is a large matrix multiplication that tensor cores handle well. But softmax is memory-bandwidth-bound: it requires reading the entire NxN score matrix, computing element-wise exponentials, normalizing, and writing back. This read-compute-write pattern is the fundamental bottleneck.
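To make the five-step pipeline concrete, here is a minimal pure-Python sketch of standard scaled dot-product attention. It is for illustration only (tiny sizes, naive loops); all function names are my own, and real kernels run these steps as GPU matmuls and fused element-wise passes. The comments mark which step each stage corresponds to in the breakdown above.

```python
import math

def matmul(A, B):
    """Naive matrix multiply: A is m x k, B is k x n."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def attention(Q, K, V, causal=True):
    N, d = len(Q), len(Q[0])
    scores = matmul(Q, transpose(K))        # 1. QK^T: O(N^2 d)
    scale = 1.0 / math.sqrt(d)
    for i in range(N):                      # 2-3. scale + causal mask: O(N^2)
        for j in range(N):
            scores[i][j] = scores[i][j] * scale if (not causal or j <= i) else float("-inf")
    probs = []
    for row in scores:                      # 4. softmax: O(N^2); reads and
        m = max(row)                        #    writes the full N x N matrix
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        probs.append([e / s for e in exps])
    return matmul(probs, V)                 # 5. attention x V: O(N^2 d)

Q = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, Q, Q)  # each output row is a convex mix of V's rows
```

Note that step 4 cannot start until the whole score row exists, and step 5 cannot start until softmax finishes: this serialization through the N x N intermediate is what the profile is measuring.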
## Quadratic Scaling in Practice
The O(N^2) scaling is not just theoretical; it hits hard in practice. Doubling sequence length does not double attention time; it quadruples it.
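A quick sanity check of the scaling argument, using the standard FLOP count for the two attention matmuls (2 N^2 d each for QK^T and attention x V; the function name is my own):

```python
def attention_flops(N, d):
    # QK^T matmul plus attention-weights x V matmul
    return 2 * N * N * d + 2 * N * N * d

# Doubling N quadruples attention FLOPs, while per-token MLP cost only doubles.
ratio = attention_flops(8192, 128) / attention_flops(4096, 128)
```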
Attention Latency vs Sequence Length (Single Layer, 32 Heads, latency in ms)

At sequence length 4096, a single attention layer takes over a second on V100. For a 32-layer model, that is 37 seconds just for attention, clearly untenable for interactive inference.
## Memory Access Analysis
The memory access pattern reveals why attention is bandwidth-bound at scale. For each attention head with sequence length N and head dimension d:
Memory Traffic Per Attention Head (N=4096, d=128, FP16)
| Step | Reads | Writes | Traffic |
|---|---|---|---|
| Load Q, K | 2 x N x d x 2 B = 2.1 MB | n/a | 2.1 MB |
| Store QK^T scores | n/a | N^2 x 2 B = 33.5 MB | 33.5 MB |
| Load scores for softmax | N^2 x 2 B = 33.5 MB | n/a | 33.5 MB |
| Store softmax output | n/a | N^2 x 2 B = 33.5 MB | 33.5 MB |
| Load weights + V | N^2 x 2 B + N x d x 2 B = 34.6 MB | n/a | 34.6 MB |
| Store output | n/a | N x d x 2 B = 1.0 MB | 1.0 MB |
The NxN score matrix is the culprit. It is written (QK^T output), read (softmax input), and written again (softmax output) before it is finally read for the last matmul. For 32 heads at N=4096, total HBM traffic exceeds 4 GB per layer.
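The table's numbers can be reproduced directly. This sketch sums the HBM bytes moved per head for standard (unfused) attention at N=4096, d=128 in FP16 (2 bytes per element); the variable names are mine:

```python
N, d, B = 4096, 128, 2  # B = bytes per FP16 element

load_qk      = 2 * N * d * B           # read Q and K
store_scores = N * N * B               # write QK^T scores
load_scores  = N * N * B               # read scores for softmax
store_probs  = N * N * B               # write softmax output
load_probs_v = N * N * B + N * d * B   # read weights + V for final matmul
store_out    = N * d * B               # write output

per_head = (load_qk + store_scores + load_scores + store_probs
            + load_probs_v + store_out)
total_32_heads = 32 * per_head
# per_head is ~138 MB; across 32 heads that is ~4.4 GB per layer,
# matching the >4 GB claim above
```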
The NxN attention score matrix is a massive intermediate that exists only to be immediately consumed. It does not need to live in HBM at all: if we can compute and consume it in tiles that fit in SRAM, we eliminate 97% of memory traffic. This is exactly what FlashAttention does.
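The algorithmic trick that makes tiling possible is the online softmax: a row of scores can be processed in chunks while keeping only a running max and running sum, so the full row never needs to be materialized. This is an illustrative sketch of that recurrence on a single row (function names are mine); FlashAttention applies the same rescaling per tile and also folds the V matmul into the loop.

```python
import math

def softmax_online(scores, tile=4):
    """Streaming softmax: one tile at a time, constant extra state."""
    m, s = float("-inf"), 0.0
    for start in range(0, len(scores), tile):
        chunk = scores[start:start + tile]
        m_new = max(m, max(chunk))
        # Rescale the running sum to the new max, then add this tile's terms.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    # Final normalization pass; FlashAttention folds this into the output.
    return [math.exp(x - m) / s for x in scores]

def softmax_full(scores):
    mx = max(scores)
    e = [math.exp(x - mx) for x in scores]
    t = sum(e)
    return [v / t for v in e]

row = [0.5, -1.2, 0.3, 0.0, 2.1, -0.7, 3.3, 0.9, -2.5]  # max arrives mid-stream
tiled = softmax_online(row, tile=4)
full = softmax_full(row)
# tiled and full agree to floating-point precision
```

The key property is exactness: the rescaling by exp(m - m_new) makes the tiled result identical to the full softmax, not an approximation.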
## Arithmetic Intensity
Arithmetic intensity (FLOPs per byte of memory traffic) determines whether an operation is compute-bound or memory-bound on a given GPU:
Arithmetic Intensity of Attention Operations
| Operation | FLOPs | Memory Traffic | AI (FLOP/byte) | Bound on V100 |
|---|---|---|---|---|
| QK^T (standard) | 2 N^2 d | 4Nd + 2 N^2 | ~2.8 | Memory |
| Softmax | 5 N^2 | 4 N^2 | 1.25 | Memory |
| Attention x V | 2 N^2 d | 2 N^2 + 4Nd | ~2.8 | Memory |
| Fused attention (Flash) | 4 N^2 d + 5 N^2 | 4Nd + 2Nd | ~89 | Compute |
Standard attention operations have arithmetic intensity of 1-3 FLOP/byte, deep in memory-bound territory on modern GPUs. Fusing everything into a single kernel with tiled SRAM access pushes arithmetic intensity to ~89 FLOP/byte, crossing into compute-bound territory.
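The compute/memory boundary can be checked roofline-style: an operation is memory-bound when its arithmetic intensity falls below the GPU's FLOP/byte ridge point. A minimal sketch, using approximate published V100 figures (~125 TFLOPS FP16 tensor-core throughput, ~900 GB/s HBM2 bandwidth) and the softmax FLOP/byte counts from the table:

```python
V100_FLOPS = 125e12   # approximate FP16 tensor-core peak
V100_BW = 900e9       # approximate HBM2 bandwidth
ridge = V100_FLOPS / V100_BW  # ~139 FLOP/byte: the compute/memory boundary

def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

N = 4096
softmax_ai = arithmetic_intensity(5 * N * N, 4 * N * N)  # 1.25 FLOP/byte
memory_bound = softmax_ai < ridge  # softmax sits two orders below the ridge
```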
## Optimization Approaches
The analysis points to three complementary optimization strategies:
Tiled/Fused Attention (FlashAttention): Eliminate HBM traffic for the NxN intermediate by processing in SRAM-sized tiles. This is the most impactful optimization: it does not approximate; it computes the exact same result with 33x less memory traffic.
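The 33x figure follows directly from the traffic analysis above: a fused kernel touches only Q, K, V, and the output in HBM, while the N x N intermediate stays in SRAM. A quick check at N=4096, d=128, FP16:

```python
N, d, B = 4096, 128, 2  # B = bytes per FP16 element

standard = (2 * N * d * B            # load Q, K
            + 3 * N * N * B          # store scores, load for softmax, store probs
            + N * N * B + N * d * B  # load probs + V for final matmul
            + N * d * B)             # store output

fused = 4 * N * d * B                # load Q, K, V and store output only

reduction = standard / fused  # 33x for these sizes
```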
Sparse Attention Patterns: Reduce the effective N^2 computation by only attending to a subset of tokens. Local attention (window size W) reduces complexity from O(N^2) to O(NW). The trade-off is model quality: sparse patterns lose long-range dependencies.
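The savings are easy to count. This sketch (function name mine) compares the number of (query, key) pairs under full causal attention versus a causal local window of W past tokens:

```python
def attended_pairs(N, W=None):
    if W is None:                     # full causal attention: key j <= query i
        return sum(i + 1 for i in range(N))
    return sum(min(i + 1, W) for i in range(N))  # window of W most recent keys

N, W = 4096, 256
full_pairs = attended_pairs(N)       # ~N^2 / 2
local_pairs = attended_pairs(N, W)   # ~N * W
# roughly an N/(2W) ~ 8x cut in pairs here, at the cost of dropping
# any dependency longer than W tokens
```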
Low-Rank Approximation: If attention weights are approximately low-rank (rank r much less than N), you can compute Q_low K_low^T in O(N^2 r) instead of O(N^2 d). This trades approximation error for speed, and works best when attention patterns are smooth.
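A back-of-the-envelope FLOP comparison for the low-rank variant: projecting Q and K from dimension d down to rank r shrinks the score matmul from 2 N^2 d to 2 N^2 r, plus a projection cost of 4 N d r that is linear in N and quickly amortized. The function names here are mine:

```python
def standard_score_flops(N, d):
    return 2 * N * N * d                    # full QK^T

def lowrank_score_flops(N, d, r):
    return 4 * N * d * r + 2 * N * N * r    # project Q, K, then Q_low K_low^T

N, d, r = 4096, 768, 64
speedup = standard_score_flops(N, d) / lowrank_score_flops(N, d, r)
# roughly 8-9x fewer FLOPs at rank 64, before accounting for approximation error
```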
Optimization Strategy Comparison (seq_len=1024, d=768, latency in ms)

## Conclusion
Profiling attention reveals a clear hierarchy of bottlenecks. The QK^T and softmax operations consume 82% of time, and the root cause is the NxN intermediate matrix that accounts for 97% of HBM traffic. The optimization path is clear: eliminate that intermediate (FlashAttention), reduce its size (sparse attention), or compress it (low-rank). Profile first, optimize the actual bottleneck, measure the improvement.