At sequence length 4096, attention consumes 82% of transformer inference time on a V100. Double the sequence to 8192 and attention time quadruples while everything else stays flat: the O(N^2) complexity is not theoretical, it dominates production workloads. Before FlashAttention cut attention time by 2-4x, we had to profile standard attention to understand where the cycles actually go: is it the QK^T matmul, the softmax, or the memory bandwidth between them?

The Attention Computation Pipeline

Standard scaled dot-product attention involves five sequential operations. Profiling each individually reveals a clear picture:


Attention Operation Breakdown (seq_len=1024, d_model=768, single head)

| Operation | Time (ms) | Share | Complexity | Bound |
| --- | --- | --- | --- | --- |
| QK^T matmul | 8.2 | 45% | O(N^2 d) | Compute |
| Softmax | 6.8 | 37% | O(N^2) | Memory bandwidth |
| Attention x V | 2.4 | 13% | O(N^2 d) | Compute |
| Scaling + masking | 0.8 | 5% | O(N^2) | Memory bandwidth |

Note: Profiled on V100 with PyTorch 1.12, FP32. Total: 18.2 ms per head.

Two operations dominate: QK^T (45%) and softmax (37%). The QK^T matmul is compute-bound; it is a large matrix multiplication that tensor cores handle well. But softmax is memory-bandwidth-bound: it requires reading the entire N x N score matrix, computing element-wise exponentials, normalizing, and writing back. This read-compute-write pattern is the fundamental bottleneck.
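As a reference for what is being profiled, here is a minimal NumPy sketch of the five stages (matmul, scaling, masking, softmax, weighted sum). The function name and shapes are illustrative; this is not the profiling harness used for the table above.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention for one head. Q, K, V: (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T                                       # 1. QK^T matmul, O(N^2 d)
    scores = scores / np.sqrt(d)                           # 2. scaling, O(N^2)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)              # 3. masking, O(N^2)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stabilization
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # 4. softmax, O(N^2)
    return weights @ V                                     # 5. attention x V, O(N^2 d)

rng = np.random.default_rng(0)
N, d = 64, 32
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
out = attention(Q, K, V)
```

The intermediate `scores`/`weights` arrays are the N x N matrices whose memory traffic is analyzed below.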

Quadratic Scaling in Practice

The O(N^2) scaling is not just theoretical; it hits hard in practice. Doubling sequence length does not double attention time, it quadruples it.
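The quadrupling can be checked with the FLOP counts used throughout this section (2N^2 d for each of the two matmuls, roughly 5N^2 for softmax). Since every term is quadratic in N, doubling N scales each term by exactly four:

```python
# Attention FLOP count per the formulas in this section: QK^T and
# attention x V cost 2N^2 d each, softmax ~5N^2.
def attention_flops(N: int, d: int) -> int:
    return 2 * N * N * d + 5 * N * N + 2 * N * N * d

# Doubling N quadruples every term, hence the latency curve below.
ratios = [attention_flops(2 * N, 768) / attention_flops(N, 768)
          for N in (512, 1024, 2048)]
print(ratios)  # every ratio is exactly 4.0
```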

Attention Latency vs Sequence Length (Single Layer, 32 Heads)

| Sequence length | Latency (ms) |
| --- | --- |
| N=128 | 2 (negligible) |
| N=512 | 18 |
| N=1024 | 72 |
| N=2048 | 290 |
| N=4096 | 1,160 (4x from 2048) |

At sequence length 4096, a single attention layer takes over a second on V100. For a 32-layer model, that is 37 seconds just for attention, clearly untenable for interactive inference.

Memory Access Analysis

The memory access pattern reveals why attention is bandwidth-bound at scale. For each attention head with sequence length N and head dimension d:


Memory Traffic Per Attention Head (N=4096, d=128, FP16)

| Step | Reads | Writes | Traffic |
| --- | --- | --- | --- |
| Load Q, K | 2 x (N x d) x 2 B = 2.1 MB | n/a | 2.1 MB |
| Store QK^T scores | n/a | N^2 x 2 B = 33.5 MB | 33.5 MB |
| Load scores for softmax | N^2 x 2 B = 33.5 MB | n/a | 33.5 MB |
| Store softmax output | n/a | N^2 x 2 B = 33.5 MB | 33.5 MB |
| Load weights + V | (N^2 + N x d) x 2 B = 34.6 MB | n/a | 34.6 MB |
| Store output | n/a | N x d x 2 B = 1.0 MB | 1.0 MB |

Note: Total: ~138 MB per head. The N x N intermediate score matrix S accounts for 97% of traffic.

The N x N score matrix is the culprit. It is written, read, and written again (store QK^T, load for softmax, store the softmax result) before being read one final time for the attention x V matmul. For 32 heads at N=4096, total HBM traffic exceeds 4 GB per layer.
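The table's accounting can be reproduced in a few lines. This is a sketch assuming FP16 (2 bytes per element) and the same step split as above:

```python
# Memory traffic per attention head, N=4096, d=128, FP16 (2 bytes/element).
N, d, BYTES = 4096, 128, 2
nd = N * d * BYTES        # one N x d matrix: ~1.05 MB
nn = N * N * BYTES        # the N x N score matrix: ~33.6 MB

traffic = (
    2 * nd     # load Q and K
    + nn       # store QK^T scores
    + nn       # load scores for softmax
    + nn       # store softmax output
    + nn + nd  # load attention weights and V
    + nd       # store output
)
score_traffic = 4 * nn    # the four HBM crossings of the N x N matrix
print(round(traffic / 1e6), round(100 * score_traffic / traffic))
```

Running this gives ~138 MB total, of which 97% is the score matrix, matching the table.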

⚡ The Key Insight

The N x N attention score matrix is a massive intermediate that exists only to be immediately consumed. It does not need to live in HBM at all: if we can compute and consume it in tiles that fit in SRAM, we eliminate 97% of memory traffic. This is exactly what FlashAttention does.
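To make the insight concrete, here is a toy NumPy sketch of the tiling trick for a single query: K and V are processed in blocks (stand-ins for SRAM tiles), and a running max plus a running denominator (the online-softmax recurrence) mean the full score row never materializes. The block size and single-query scope are simplifications; real FlashAttention also tiles over queries and runs as a fused GPU kernel.

```python
import numpy as np

def flash_attention_row(q, K, V, block=64):
    """Exact attention output for one query q (d,), tiling K/V (N, d) over N."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores seen so far
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[-1], dtype=np.float64) # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) / np.sqrt(d)  # scores for this tile only
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale old accumulator/denominator
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
N, d = 256, 32
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
q = rng.standard_normal(d)

# Reference: materialize all N scores and softmax them at once.
s = (K @ q) / np.sqrt(d)
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum() @ V
out = flash_attention_row(q, K, V)
```

Note that this is exact, not an approximation: the rescaling by `correction` makes the tiled result bitwise-close to the reference softmax.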

Arithmetic Intensity

Arithmetic intensity (FLOPs per byte of memory traffic) determines whether an operation is compute-bound or memory-bound on a given GPU:


Arithmetic Intensity of Attention Operations

| Operation | FLOPs | Memory Traffic (bytes) | AI (FLOP/byte) | Bound on V100 |
| --- | --- | --- | --- | --- |
| QK^T (standard) | 2N^2 d | 4Nd + 2N^2 | ~2.8 | Memory |
| Softmax | 5N^2 | 4N^2 | 1.25 | Memory |
| Attention x V | 2N^2 d | 2N^2 + 4Nd | ~2.8 | Memory |
| Fused attention (Flash) | 4N^2 d + 5N^2 | 4Nd + 2Nd | ~89 | Compute |

Note: V100's compute/memory crossover point is ~125 FLOP/byte (900 GB/s HBM, 112 TFLOPS FP16).

Standard attention operations have arithmetic intensity of 1-3 FLOP/byte, deep in memory-bound territory on modern GPUs. Fusing everything into a single kernel with tiled SRAM access pushes arithmetic intensity to ~89 FLOP/byte, close to the V100's ~125 FLOP/byte crossover and far past the regime where HBM traffic dominates.
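The note's crossover figure is just peak FLOP rate divided by peak bandwidth, and the softmax row follows directly from its FLOP and byte counts. A quick check, using the peak numbers quoted in the table note:

```python
# Roofline arithmetic for the V100 figures quoted above.
PEAK_FLOPS = 112e12   # V100 FP16 tensor-core peak, FLOP/s
PEAK_BW = 900e9       # V100 HBM2 bandwidth, bytes/s

# An op is compute-bound only above this arithmetic intensity.
crossover = PEAK_FLOPS / PEAK_BW   # comes out just under the ~125 quoted

# Softmax: 5N^2 FLOPs over 4N^2 bytes (FP16 read + write of the scores);
# the N^2 factors cancel, so AI is a constant 1.25 regardless of N.
softmax_ai = 5 / 4

print(round(crossover), softmax_ai)
```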

Optimization Approaches

The analysis points to three complementary optimization strategies:

Tiled/Fused Attention (FlashAttention): Eliminate HBM traffic for the N x N intermediate by processing in SRAM-sized tiles. This is the most impactful optimization: it does not approximate, it computes the exact same result with 33x less memory traffic.

Sparse Attention Patterns: Reduce the effective N^2 computation by only attending to a subset of tokens. Local attention (window size W) reduces complexity from O(N^2) to O(NW). The trade-off is model quality: sparse patterns lose long-range dependencies.

Low-Rank Approximation: If the attention weights are approximately low-rank (rank r much smaller than both d and N), you can compute Q_low K_low^T in O(N^2 r) instead of O(N^2 d). This trades approximation error for speed, and works best when attention patterns are smooth.
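As an illustration of the sparse strategy, here is a NumPy sketch of local (sliding-window) attention. The window convention and the per-query loop are illustrative; a production kernel would compute the banded scores in batched form.

```python
import numpy as np

def local_attention(Q, K, V, W=128):
    """Each query i attends only to keys in [i - W//2, i + W//2]. Q, K, V: (N, d)."""
    N, d = Q.shape
    out = np.empty_like(Q)
    for i in range(N):
        lo, hi = max(0, i - W // 2), min(N, i + W // 2 + 1)
        s = (K[lo:hi] @ Q[i]) / np.sqrt(d)   # at most W+1 scores per query, not N
        p = np.exp(s - s.max())
        out[i] = (p / p.sum()) @ V[lo:hi]    # O(W d) work per query -> O(N W d) total
    return out

rng = np.random.default_rng(1)
N, d = 64, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = local_attention(Q, K, V, W=128)  # here W >= N, so the window covers everything
```

With W smaller than N the band is a strict subset of the full score matrix, which is where both the O(NW) savings and the loss of long-range dependencies come from.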

Optimization Strategy Comparison (seq_len=1024, d=768)

| Strategy | Latency (ms) | Speedup |
| --- | --- | --- |
| Standard attention | 582 | 1x (baseline) |
| FlashAttention (exact) | 78 | 7.5x |
| Local attention (W=128) | 45 | 13x, but approximate |
| Low-rank (r=64) | 62 | 9.4x, but approximate |

Conclusion

Profiling attention reveals a clear hierarchy of bottlenecks. The QK^T and softmax operations consume 82% of time, and the root cause is the NxN intermediate matrix that requires 97% of HBM traffic. The optimization path is clear: eliminate that intermediate (FlashAttention), reduce its size (sparse attention), or compress it (low-rank). Profile first, optimize the actual bottleneck, measure the improvement.