An Intel Xeon 8280 delivers 1.2 TFLOPS of peak compute but only about 210 GB/s of achieved memory bandwidth, a ratio of roughly 5.7 FLOP/byte. Any operation with arithmetic intensity below that (element-wise operations, most activations, normalization layers) is memory-bound and achieves only a few percent of peak compute. An A100 GPU raises that crossover to 153 FLOP/byte (312 TFLOPS, 2,039 GB/s), so even more operations fall on the memory-bound side; what the GPU offers instead is nearly 10x the absolute bandwidth. Even on an A100, most typical transformer operations remain memory-bound. Understanding this compute-memory tradeoff is the first step to optimization: you cannot fix what you do not measure.
## The Bandwidth Landscape

**Memory Bandwidth Across Platforms (2019-2024)**
| Platform | Memory Type | Theoretical BW | Achieved BW | Efficiency |
|---|---|---|---|---|
| Intel Xeon 8280 (2 socket) | DDR4-2933 | 281 GB/s | ~210 GB/s | 75% |
| AMD EPYC 7742 (2 socket) | DDR4-3200 | 410 GB/s | ~340 GB/s | 83% |
| NVIDIA V100 | HBM2 | 900 GB/s | ~820 GB/s | 91% |
| NVIDIA A100 | HBM2e | 2,039 GB/s | ~1,800 GB/s | 88% |
| NVIDIA H100 (SXM) | HBM3 | 3,350 GB/s | ~2,900 GB/s | 87% |
| Intel Gaudi2 | HBM2e | 2,450 GB/s | ~2,200 GB/s | 90% |
| Apple M2 Ultra | LPDDR5 | 800 GB/s | ~680 GB/s | 85% |
The GPU advantage is clear: HBM delivers 5-15x more bandwidth than DDR. This is a large part of why GPUs dominate AI workloads: for the many memory-bound operations, raw compute is irrelevant and memory bandwidth sets the throughput ceiling.
## The Roofline Model
The roofline model tells you whether an operation is compute-bound or memory-bound based on its arithmetic intensity (AI): the ratio of FLOPs to bytes transferred.
AI = FLOPs / Bytes
The crossover point where compute and memory bandwidth are balanced:
AI_crossover = Peak_FLOPS / Peak_BW
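These two formulas can be turned into a small roofline calculator. This is a sketch using the peak FP16 FLOPS and bandwidth figures quoted in this article; the helper names are my own.

```python
def crossover_ai(peak_tflops, bw_gb_s):
    """AI (FLOP/byte) above which an operation becomes compute-bound."""
    return peak_tflops * 1e12 / (bw_gb_s * 1e9)

def attainable_tflops(ai, peak_tflops, bw_gb_s):
    """Roofline: throughput is capped by peak compute or by AI x bandwidth."""
    # AI [FLOP/byte] * BW [GB/s] = GFLOP/s, divide by 1000 for TFLOPS
    return min(peak_tflops, ai * bw_gb_s / 1000)

print(f"V100: {crossover_ai(125, 900):.1f} FLOP/byte")    # ~139
print(f"A100: {crossover_ai(312, 2039):.1f} FLOP/byte")   # ~153
print(f"H100: {crossover_ai(990, 3350):.1f} FLOP/byte")   # ~295

# An element-wise op (AI ~0.25) on A100 tops out far below peak:
print(attainable_tflops(0.25, 312, 2039))  # ~0.51 TFLOPS, well under 1% of peak
```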
**Roofline Crossover Points**
| Platform | Peak FP16 FLOPS | Peak BW | Crossover AI | Interpretation |
|---|---|---|---|---|
| V100 | 125 TFLOPS | 900 GB/s | 139 FLOP/byte | Most ops are memory-bound |
| A100 | 312 TFLOPS | 2,039 GB/s | 153 FLOP/byte | Even higher AI needed |
| H100 | 990 TFLOPS | 3,350 GB/s | 295 FLOP/byte | Almost everything is memory-bound |
Each GPU generation increases compute faster than bandwidth. V100 needs AI > 139 to be compute-bound; H100 needs AI over 295. This means more and more operations are memory-bound on newer hardware — making bandwidth optimization increasingly important.
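To see how arithmetic intensity varies in practice, compare a dense FP16 GEMM against a decode-style matrix-vector product. The sizes here are illustrative, not from the article.

```python
def gemm_ai(m, n, k, bytes_per_elem=2):
    """Arithmetic intensity of C[m,n] = A[m,k] @ B[k,n] (FP16 = 2 bytes)."""
    flops = 2 * m * n * k                             # multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_elem * (m*k + k*n + m*n)  # read A and B, write C
    return flops / bytes_moved

print(gemm_ai(4096, 4096, 4096))  # ~1365: compute-bound even on H100
print(gemm_ai(1, 4096, 4096))     # ~1.0: decode-style GEMV, deeply memory-bound
```

The square GEMM sits far above every crossover point in the table, while the batch-1 matrix-vector product sits two orders of magnitude below it; this is the gap the rest of the article is about.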
## Measuring Actual Bandwidth

### STREAM Benchmark
The gold standard for measuring achievable memory bandwidth:
```shell
# Build and run STREAM (adjust array size to far exceed your cache)
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
OMP_NUM_THREADS=48 ./stream
```
The four STREAM operations:
- Copy: `a[i] = b[i]` (pure read + write bandwidth)
- Scale: `a[i] = q * b[i]` (adds one multiply)
- Add: `a[i] = b[i] + c[i]` (two reads, one write)
- Triad: `a[i] = b[i] + q * c[i]` (the standard benchmark; 2 FLOPs per 24 bytes of FP64 traffic, so AI ≈ 0.08 FLOP/byte)
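For a quick sanity check without compiling STREAM, a triad-style measurement can be approximated from Python with NumPy. This is a rough sketch: it understates what tuned multithreaded C achieves, and the `q * c` temporary adds traffic the estimate ignores.

```python
import time
import numpy as np

N = 20_000_000                     # ~160 MB per FP64 array, far beyond cache
q = 3.0
b = np.random.rand(N)
c = np.random.rand(N)
a = np.empty_like(b)

start = time.perf_counter()
np.add(b, q * c, out=a)            # triad: a[i] = b[i] + q * c[i]
elapsed = time.perf_counter() - start

# Count the 3 official arrays (2 reads + 1 write) at 8 bytes each,
# ignoring the extra temporary produced by q * c
gb_moved = 3 * 8 * N / 1e9
print(f"~{gb_moved / elapsed:.1f} GB/s")
```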
### GPU Bandwidth
```shell
# NVIDIA's bandwidth test
/usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
# Or measure a custom kernel's DRAM traffic with nvprof/ncu
ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum ./my_kernel
```
## Where AI Workloads Fall on the Roofline

**Arithmetic Intensity of Common AI Operations (FLOP/byte)**

Most LLM operations are memory-bound, especially during the decode phase. Only large batched prefill and big GEMM operations cross into compute-bound territory. This is the fundamental reason why LLM serving focuses so heavily on memory optimization.
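The decode-phase claim can be checked with a back-of-envelope estimate: during one decode step every weight is read once, and each weight contributes roughly 2 x batch FLOPs. The 7B parameter count and FP16 weights here are assumptions, and KV-cache traffic is ignored.

```python
def decode_ai(batch_size, n_params=7e9, bytes_per_weight=2):
    """Approximate AI of one LLM decode step: all weights are streamed
    once per step, each doing ~2 * batch multiply-accumulate FLOPs."""
    flops = 2 * n_params * batch_size
    bytes_moved = n_params * bytes_per_weight   # ignores KV-cache reads
    return flops / bytes_moved

print(decode_ai(1))    # ~1 FLOP/byte: far below any GPU's crossover
print(decode_ai(64))   # ~64: better, but still memory-bound on A100/H100
```

Note that AI grows linearly with batch size, which is exactly why batching appears first in the optimization table below.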
## Bandwidth Optimization Techniques
| Technique | Applicable When | Bandwidth Improvement | Mechanism |
|---|---|---|---|
| Quantization (FP16->INT4) | Weight loading | 2-4x | Fewer bytes to read |
| Batching | Decode phase | Up to 64x | Amortize weight reads |
| Operator fusion | Multi-op sequences | 1.5-3x | Eliminate intermediate writes |
| Data layout (SoA) | Strided access | 2-10x | Better coalescing/prefetch |
| Tiling (SMEM) | Reusable data | 3-15x | Serve from fast SRAM instead |
| Compression (sparse) | Sparse models | 1.5-4x | Skip zero values |
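The quantization and batching rows can be quantified with a lower-bound estimate: a decode step can be no faster than the time needed to stream every weight from HBM once. The 7B model and the ~1,800 GB/s achieved A100 bandwidth are assumptions taken from the tables above.

```python
def decode_step_ms(n_params=7e9, bytes_per_weight=2, achieved_bw_gb_s=1800):
    """Lower bound on one decode step: time to stream all weights from HBM."""
    return n_params * bytes_per_weight / (achieved_bw_gb_s * 1e9) * 1000

fp16 = decode_step_ms(bytes_per_weight=2)    # ~7.8 ms/token
int4 = decode_step_ms(bytes_per_weight=0.5)  # ~1.9 ms/token: 4x fewer bytes
print(f"FP16: {fp16:.1f} ms/token, INT4: {int4:.1f} ms/token")
```

Batching does not change this per-step time, but it divides it across all sequences in the batch, which is the "amortize weight reads" mechanism in the table.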
Before optimizing any kernel, calculate its minimum memory time: (bytes_read + bytes_written) divided by your GPU's achieved bandwidth (~90% of theoretical for HBM). If the kernel's runtime is far longer than this figure, the kernel is not yet bandwidth-limited; look for other bottlenecks. If the two are close, you are hitting the hardware limit and need algorithmic changes (quantization, fusion, tiling).
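That check can be sketched as a small helper, feeding in byte counts like those reported by the ncu metrics shown earlier. The kernel numbers here are hypothetical.

```python
def bandwidth_fraction(bytes_read, bytes_written, runtime_s, achieved_bw_gb_s=1800):
    """Fraction of achieved HBM bandwidth a kernel actually used.
    Near 1.0 means bandwidth-limited; well below means look elsewhere."""
    min_memory_time = (bytes_read + bytes_written) / (achieved_bw_gb_s * 1e9)
    return min_memory_time / runtime_s

# Hypothetical kernel: moves 1.5 GB total in 1 ms on an A100 (~1800 GB/s achieved)
print(f"{bandwidth_fraction(1.0e9, 0.5e9, 1e-3):.2f}")  # ~0.83: close to the limit
```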
## Multi-Level Bandwidth Optimization
The most effective optimizations exploit the bandwidth gap between memory levels:
**Bandwidth Reduction Through Memory Hierarchy Exploitation (A100, GB/s needed from HBM)**

Each level of optimization reduces the demand on HBM. The combination of caching, tiling, and fusion can reduce HBM traffic by 10x, often the difference between a kernel being bandwidth-limited and performing at acceptable throughput.
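The fusion effect is easy to quantify for a chain of element-wise ops: unfused, every op round-trips its tensor through HBM; fused, only the first read and last write touch HBM while intermediates stay in registers or SRAM. The tensor size here is illustrative.

```python
def elementwise_chain_traffic_gb(n_elems, n_ops, fused, bytes_per_elem=2):
    """HBM traffic for n_ops chained element-wise ops over n_elems FP16 values.
    Unfused: each op reads its input from and writes its output to HBM.
    Fused: one read of the input, one write of the final result."""
    passes = 2 if fused else 2 * n_ops
    return passes * n_elems * bytes_per_elem / 1e9

n = 64 * 4096 * 4096                 # e.g. a batch of large activation tensors
print(elementwise_chain_traffic_gb(n, 4, fused=False))  # ~17.2 GB
print(elementwise_chain_traffic_gb(n, 4, fused=True))   # ~4.3 GB: 4x less
```

The saving scales with chain length: fusing n_ops element-wise ops cuts HBM traffic by a factor of n_ops, matching the 1.5-3x range in the table for typical short chains.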
## Conclusion
Memory bandwidth is the primary bottleneck for most AI workloads on modern hardware. GPU HBM delivers 5-15x more bandwidth than CPU DDR, which is why GPUs dominate AI — but even HBM bandwidth is insufficient for operations like LLM decode at batch=1. The roofline model provides the analytical framework: calculate your operation’s arithmetic intensity, compare to the platform’s crossover point, and optimize accordingly. For memory-bound operations (most of LLM serving), the optimization hierarchy is: batch more, quantize weights, fuse operators, tile into SRAM.