An Intel Xeon 8280 delivers 1.2 TFLOPS of peak compute but only about 210 GB/s of achieved memory bandwidth: a crossover ratio of roughly 5.7 FLOP/byte. Any operation with arithmetic intensity below that — element-wise operations, most activations, normalization layers — is memory-bound and achieves under 4% of peak compute. An A100 GPU raises both numbers dramatically (312 TFLOPS, 2,039 GB/s), pushing the crossover to 153 FLOP/byte: memory-bound operations run far faster on HBM, but an even higher arithmetic intensity is now needed to become compute-bound. Indeed, even on A100, roughly 70% of typical transformer operations remain memory-bound. Understanding this compute-memory tradeoff is the first step to optimization: you cannot fix what you do not measure.

The Bandwidth Landscape

Memory Bandwidth Across Platforms (2019-2024)

Platform                    Memory Type   Theoretical BW   Achieved BW    Efficiency
Intel Xeon 8280 (2 socket)  DDR4-2933     281 GB/s         ~210 GB/s      75%
AMD EPYC 7742 (2 socket)    DDR4-3200     410 GB/s         ~340 GB/s      83%
NVIDIA V100                 HBM2          900 GB/s         ~820 GB/s      91%
NVIDIA A100                 HBM2e         2,039 GB/s       ~1,800 GB/s    88%
NVIDIA H100 (SXM)           HBM3          3,350 GB/s       ~2,900 GB/s    87%
Intel Gaudi2                HBM2e         2,450 GB/s       ~2,200 GB/s    90%
Apple M2 Ultra              LPDDR5        800 GB/s         ~680 GB/s      85%
Note: Achieved BW measured with STREAM triad benchmark (or equivalent). HBM consistently achieves 85-91% of theoretical.

The GPU advantage is clear: HBM delivers roughly 5-12x the bandwidth of contemporary DDR systems. This is why GPUs dominate AI workloads: not raw compute, where CPUs have narrowed the gap, but memory bandwidth.

The Roofline Model

The roofline model tells you whether an operation is compute-bound or memory-bound based on its arithmetic intensity (AI): the ratio of FLOPs to bytes transferred.

AI = FLOPs / Bytes

The crossover point where compute and memory bandwidth are balanced:

AI_crossover = Peak_FLOPS / Peak_BW
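These two formulas translate directly into code. A minimal sketch, using the A100 figures quoted in this article (swap in your own hardware's peak numbers):

```python
# Roofline sketch: where is the crossover, and what throughput can an
# operation with a given arithmetic intensity attain?

def crossover_ai(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and bandwidth balance."""
    return peak_flops / peak_bw

def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, ai * peak_bw)

A100_PEAK_FLOPS = 312e12   # FP16 Tensor Core peak
A100_PEAK_BW = 2039e9      # HBM2e theoretical bandwidth

print(crossover_ai(A100_PEAK_FLOPS, A100_PEAK_BW))            # ~153 FLOP/byte
print(attainable_flops(0.33, A100_PEAK_FLOPS, A100_PEAK_BW))  # bandwidth-limited: ~0.67 TFLOPS
```

An operation whose AI sits below the crossover gains nothing from more compute; only more bandwidth, or fewer bytes, can help it.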

Roofline Crossover Points

Platform   Peak FP16 FLOPS   Peak BW      Crossover AI     Interpretation
V100       125 TFLOPS        900 GB/s     139 FLOP/byte    Most ops are memory-bound
A100       312 TFLOPS        2,039 GB/s   153 FLOP/byte    Even higher AI needed
H100       990 TFLOPS        3,350 GB/s   295 FLOP/byte    Almost everything is memory-bound
Note: As compute grows faster than bandwidth (each generation), the crossover AI increases -- more operations become memory-bound.
The Trend Is Clear

Each GPU generation increases compute faster than bandwidth. V100 needs AI above 139 to be compute-bound; H100 needs AI above 295. The result is that more and more operations are memory-bound on newer hardware, making bandwidth optimization increasingly important.

Measuring Actual Bandwidth

STREAM Benchmark

The gold standard for measuring achievable memory bandwidth:

# Build and run STREAM. STREAM_ARRAY_SIZE must be several times larger
# than the last-level cache (100M doubles = ~800 MB per array).
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
OMP_NUM_THREADS=48 ./stream   # set to the number of physical cores

The four STREAM operations:

  • Copy: a[i] = b[i] — pure read + write bandwidth
  • Scale: a[i] = q * b[i] — adds one multiply
  • Add: a[i] = b[i] + c[i] — two reads, one write
  • Triad: a[i] = b[i] + q * c[i] — the standard benchmark (2 FLOPs per 24 bytes, AI ≈ 0.08 FLOP/byte in FP64)
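For a quick sanity check without building stream.c, a copy-bandwidth estimate can be sketched in NumPy. NumPy adds interpreter and dispatch overhead, so treat the result as a floor, not a STREAM replacement:

```python
import time
import numpy as np

# Copy kernel: dst[i] = src[i]. Arrays must be far larger than the
# last-level cache, or you measure cache bandwidth rather than DRAM.
N = 20_000_000                  # ~160 MB per float64 array
src = np.random.rand(N)
dst = np.empty_like(src)

start = time.perf_counter()
np.copyto(dst, src)
elapsed = time.perf_counter() - start

bytes_moved = 2 * N * 8         # one read plus one write per element
gbps = bytes_moved / elapsed / 1e9
print(f"Copy bandwidth: {gbps:.1f} GB/s")
```

Write-allocate traffic and hardware counters are ignored here; stream.c remains the reference tool.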

GPU Bandwidth

# NVIDIA's bandwidth test
/usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
# Or use custom kernel with nvprof/ncu
ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum ./my_kernel
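The two ncu counters above convert directly into achieved bandwidth once you also have the kernel duration. A sketch with placeholder numbers (substitute the values ncu reports for your kernel):

```python
# Achieved bandwidth = total DRAM traffic / kernel runtime.
def achieved_bw_gbs(dram_bytes_read: int, dram_bytes_write: int,
                    runtime_s: float) -> float:
    return (dram_bytes_read + dram_bytes_write) / runtime_s / 1e9

# Placeholder example: 1.2 GB of total traffic in 800 microseconds
print(achieved_bw_gbs(900_000_000, 300_000_000, 800e-6))  # ~1500 GB/s
```

Compare the result against the achieved-bandwidth column in the platform table above; a large gap usually indicates access-pattern problems rather than a hard bandwidth limit.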

Where AI Workloads Fall on the Roofline

Arithmetic Intensity of Common AI Operations

Operation                     AI (FLOP/byte)   Regime
LLM decode (batch=1)          0.5              Deep in memory-bound territory
LLM decode (batch=32)         16               Still memory-bound
Standard attention (seq=4K)   3                Memory-bound
FlashAttention (seq=4K)       89               Near crossover
LLM prefill (batch=32)        250              Compute-bound
GEMM (4096x4096)              340              Compute-bound

Most LLM operations are memory-bound, especially during the decode phase. Only large batched prefill and big GEMM operations cross into compute-bound territory. This is the fundamental reason why LLM serving focuses so heavily on memory optimization.
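The decode figures above follow from a standard back-of-envelope model: each weight is read once per step but performs one multiply-add per token in the batch. A sketch, assuming FP32 weights, 2 FLOPs per weight per token, and ignoring activation and KV-cache traffic:

```python
def decode_ai(batch: int, bytes_per_weight: float = 4.0) -> float:
    """Arithmetic intensity of LLM decode: each weight is read once
    and shared across the whole batch."""
    flops_per_weight = 2 * batch     # one multiply-add per token
    return flops_per_weight / bytes_per_weight

print(decode_ai(1))                         # 0.5 FLOP/byte, matching the figures above
print(decode_ai(32))                        # 16.0: better, but far below A100's ~153 crossover
print(decode_ai(1, bytes_per_weight=0.5))   # INT4 weights: 4.0, an 8x jump in AI
```

The same model shows why batching and quantization are the first two levers: one multiplies the numerator, the other shrinks the denominator.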

Bandwidth Optimization Techniques

Technique                  Applicable When      Bandwidth Improvement   Mechanism
Quantization (FP16->INT4)  Weight loading       2-4x                    Fewer bytes to read
Batching                   Decode phase         Up to 64x               Amortize weight reads
Operator fusion            Multi-op sequences   1.5-3x                  Eliminate intermediate writes
Data layout (SoA)          Strided access       2-10x                   Better coalescing/prefetch
Tiling (SMEM)              Reusable data        3-15x                   Serve from fast SRAM instead
Compression (sparse)       Sparse models        1.5-4x                  Skip zero values
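For a memory-bound kernel, runtime tracks bytes moved, so the quantization row can be sanity-checked with an Amdahl-style one-liner. A sketch; `weight_fraction` is a hypothetical parameter for the share of total traffic that is weights, not a measured value:

```python
def quant_speedup(bits_before: int, bits_after: int,
                  weight_fraction: float = 1.0) -> float:
    """Speedup of a memory-bound kernel when weights shrink from
    bits_before to bits_after; non-weight traffic is unchanged."""
    reduced = weight_fraction * bits_after / bits_before + (1 - weight_fraction)
    return 1 / reduced

print(quant_speedup(16, 4))                       # 4.0x if all traffic is weights
print(quant_speedup(16, 4, weight_fraction=0.8))  # 2.5x with 20% activation traffic
```

This is where the table's 2-4x range comes from: the more non-weight traffic a kernel moves, the further the realized speedup falls below the raw 4x compression ratio.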
💡 The Bandwidth Budget

Before optimizing any kernel, calculate its memory-traffic floor: bytes_read + bytes_written, divided by your GPU's achieved bandwidth (~90% of theoretical for HBM). That quotient is the minimum runtime that bandwidth alone permits. If your kernel's measured runtime far exceeds this floor, it is not yet bandwidth-limited: look for other bottlenecks. If runtime is close to the floor, you're hitting the hardware limit and need algorithmic changes (quantization, fusion, tiling).
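The budget check can be sketched directly. The traffic numbers below are placeholders, and 1.8 TB/s is the A100 achieved-bandwidth figure from the table above:

```python
def bandwidth_floor_s(bytes_read: int, bytes_written: int,
                      achieved_bw: float) -> float:
    """Minimum runtime that memory traffic alone permits."""
    return (bytes_read + bytes_written) / achieved_bw

def is_bandwidth_limited(runtime_s: float, floor_s: float,
                         tol: float = 1.2) -> bool:
    """Within ~20% of the memory floor counts as at the hardware limit."""
    return runtime_s <= tol * floor_s

floor = bandwidth_floor_s(900_000_000, 300_000_000, 1.8e12)
print(f"memory floor: {floor * 1e6:.0f} us")   # ~667 us for 1.2 GB of traffic
print(is_bandwidth_limited(700e-6, floor))     # near the floor: True
print(is_bandwidth_limited(2000e-6, floor))    # 3x the floor: look elsewhere
```

The 20% tolerance is an illustrative threshold, not a standard; pick one that matches your profiler's noise.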

Multi-Level Bandwidth Optimization

The most effective optimizations exploit the bandwidth gap between memory levels:

Bandwidth Reduction Through Memory Hierarchy Exploitation (A100)

Optimization stage       HBM demand     Status
Naive (all from HBM)     2,000 GB/s     Exceeds HBM BW
+ L2 cache hits (40%)    1,200 GB/s
+ SMEM tiling            400 GB/s
+ Operator fusion        200 GB/s       Within HBM budget

Each level of optimization reduces the demand on HBM. The combination of caching, tiling, and fusion can reduce HBM traffic by 10x — often the difference between a kernel being bandwidth-limited and performing at acceptable throughput.
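The staircase above composes multiplicatively, which a short sketch makes explicit. The hit rate and reuse factors are derived from the figure's illustrative numbers, not from measurements:

```python
def hbm_demand_gbs(naive_gbs: float, l2_hit_rate: float,
                   smem_reuse: float, fusion_factor: float) -> float:
    after_l2 = naive_gbs * (1 - l2_hit_rate)   # L2 hits never reach HBM
    after_smem = after_l2 / smem_reuse         # tiles re-served from SRAM
    return after_smem / fusion_factor          # fusion skips intermediate writes

print(hbm_demand_gbs(2000, l2_hit_rate=0.40, smem_reuse=3.0, fusion_factor=2.0))
# 2000 -> 1200 -> 400 -> 200 GB/s: the 10x reduction described above
```

Because the factors multiply, improving any single level also multiplies the total; this is why combined optimizations reach reductions no single technique can.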

Conclusion

Memory bandwidth is the primary bottleneck for most AI workloads on modern hardware. GPU HBM delivers roughly 5-12x more bandwidth than CPU DDR, which is why GPUs dominate AI; yet even HBM bandwidth is insufficient for operations like LLM decode at batch=1. The roofline model provides the analytical framework: calculate your operation's arithmetic intensity, compare it to the platform's crossover point, and optimize accordingly. For memory-bound operations (most of LLM serving), the optimization hierarchy is: batch more, quantize weights, fuse operators, and tile into SRAM.