An Intel Xeon 8280 delivers 1.2 TFLOPS of peak compute but only about 210 GB/s of achieved memory bandwidth: a crossover ratio of roughly 5.7 FLOP/byte. Any operation with arithmetic intensity below that — element-wise operations, most activations, normalization layers — is memory-bound and achieves under 4% of peak compute. An A100 GPU raises both numbers dramatically (312 TFLOPS, 2,039 GB/s), pushing the crossover to 153 FLOP/byte: memory-bound operations run far faster on HBM, but an even higher arithmetic intensity is now needed to become compute-bound. Indeed, even on A100, roughly 70% of typical transformer operations remain memory-bound. Understanding this compute-memory tradeoff is the first step to optimization: you cannot fix what you do not measure.

The Bandwidth Landscape

Memory Bandwidth Across Platforms (2019-2024)

Platform                    Memory Type   Theoretical BW   Achieved BW    Efficiency
Intel Xeon 8280 (2 socket)  DDR4-2933     281 GB/s         ~210 GB/s      75%
AMD EPYC 7742 (2 socket)    DDR4-3200     410 GB/s         ~340 GB/s      83%
NVIDIA V100                 HBM2          900 GB/s         ~820 GB/s      91%
NVIDIA A100                 HBM2e         2,039 GB/s       ~1,800 GB/s    88%
NVIDIA H100 (SXM)           HBM3          3,350 GB/s       ~2,900 GB/s    87%
Intel Gaudi2                HBM2e         2,450 GB/s       ~2,200 GB/s    90%
Apple M2 Ultra              LPDDR5        800 GB/s         ~680 GB/s      85%
Note: Achieved BW measured with STREAM triad benchmark (or equivalent). HBM consistently achieves 85-91% of theoretical.

The GPU advantage is clear: HBM delivers roughly 5-12x the bandwidth of contemporary DDR systems. This is why GPUs dominate AI workloads: not raw compute, where CPUs have narrowed the gap, but memory bandwidth.

The Roofline Model

The roofline model tells you whether an operation is compute-bound or memory-bound based on its arithmetic intensity (AI): the ratio of FLOPs to bytes transferred.

AI = FLOPs / Bytes

The crossover point where compute and memory bandwidth are balanced:

AI_crossover = Peak_FLOPS / Peak_BW
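These two formulas translate directly into code. A minimal sketch, using the A100 figures quoted in this article (swap in your own hardware's peak numbers):

```python
# Roofline sketch: where is the crossover, and what throughput can an
# operation with a given arithmetic intensity attain?

def crossover_ai(peak_flops: float, peak_bw: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and bandwidth balance."""
    return peak_flops / peak_bw

def attainable_flops(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, ai * peak_bw)

A100_PEAK_FLOPS = 312e12   # FP16 Tensor Core peak
A100_PEAK_BW = 2039e9      # HBM2e theoretical bandwidth

print(crossover_ai(A100_PEAK_FLOPS, A100_PEAK_BW))            # ~153 FLOP/byte
print(attainable_flops(0.33, A100_PEAK_FLOPS, A100_PEAK_BW))  # bandwidth-limited: ~0.67 TFLOPS
```

An operation whose AI sits below the crossover gains nothing from more compute; only more bandwidth, or fewer bytes, can help it.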

Roofline Crossover Points

Platform   Peak FP16 FLOPS   Peak BW      Crossover AI     Interpretation
V100       125 TFLOPS        900 GB/s     139 FLOP/byte    Most ops are memory-bound
A100       312 TFLOPS        2,039 GB/s   153 FLOP/byte    Even higher AI needed
H100       990 TFLOPS        3,350 GB/s   295 FLOP/byte    Almost everything is memory-bound
Note: As compute grows faster than bandwidth (each generation), the crossover AI increases -- more operations become memory-bound.
The Trend Is Clear

Each GPU generation increases compute faster than bandwidth. V100 needs AI above 139 to be compute-bound; H100 needs AI above 295. The result is that more and more operations are memory-bound on newer hardware, making bandwidth optimization increasingly important.

Measuring Actual Bandwidth

STREAM Benchmark

The gold standard for measuring achievable memory bandwidth:

# Build and run STREAM. STREAM_ARRAY_SIZE must be several times larger
# than the last-level cache (100M doubles = ~800 MB per array).
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 stream.c -o stream
OMP_NUM_THREADS=48 ./stream   # set to the number of physical cores

The four STREAM operations:

  • Copy: a[i] = b[i] — pure read + write bandwidth
  • Scale: a[i] = q * b[i] — adds one multiply
  • Add: a[i] = b[i] + c[i] — two reads, one write
  • Triad: a[i] = b[i] + q * c[i] — the standard benchmark (2 FLOPs per 24 bytes, AI ≈ 0.08 FLOP/byte in FP64)
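For a quick sanity check without building stream.c, a copy-bandwidth estimate can be sketched in NumPy. NumPy adds interpreter and dispatch overhead, so treat the result as a floor, not a STREAM replacement:

```python
import time
import numpy as np

# Copy kernel: dst[i] = src[i]. Arrays must be far larger than the
# last-level cache, or you measure cache bandwidth rather than DRAM.
N = 20_000_000                  # ~160 MB per float64 array
src = np.random.rand(N)
dst = np.empty_like(src)

start = time.perf_counter()
np.copyto(dst, src)
elapsed = time.perf_counter() - start

bytes_moved = 2 * N * 8         # one read plus one write per element
gbps = bytes_moved / elapsed / 1e9
print(f"Copy bandwidth: {gbps:.1f} GB/s")
```

Write-allocate traffic and hardware counters are ignored here; stream.c remains the reference tool.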

GPU Bandwidth

# NVIDIA's bandwidth test
/usr/local/cuda/samples/1_Utilities/bandwidthTest/bandwidthTest
# Or use custom kernel with nvprof/ncu
ncu --metrics dram__bytes_read.sum,dram__bytes_write.sum ./my_kernel
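The two ncu counters above convert directly into achieved bandwidth once you also have the kernel duration. A sketch with placeholder numbers (substitute the values ncu reports for your kernel):

```python
# Achieved bandwidth = total DRAM traffic / kernel runtime.
def achieved_bw_gbs(dram_bytes_read: int, dram_bytes_write: int,
                    runtime_s: float) -> float:
    return (dram_bytes_read + dram_bytes_write) / runtime_s / 1e9

# Placeholder example: 1.2 GB of total traffic in 800 microseconds
print(achieved_bw_gbs(900_000_000, 300_000_000, 800e-6))  # ~1500 GB/s
```

Compare the result against the achieved-bandwidth column in the platform table above; a large gap usually indicates access-pattern problems rather than a hard bandwidth limit.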

Where AI Workloads Fall on the Roofline

Arithmetic Intensity of Common AI Operations

Operation                     AI (FLOP/byte)   Regime
LLM decode (batch=1)          0.5              Deep in memory-bound territory
LLM decode (batch=32)         16               Still memory-bound
Standard attention (seq=4K)   3                Memory-bound
FlashAttention (seq=4K)       89               Near crossover
LLM prefill (batch=32)        250              Compute-bound
GEMM (4096x4096)              340              Compute-bound

Most LLM operations are memory-bound, especially during the decode phase. Only large batched prefill and big GEMM operations cross into compute-bound territory. This is the fundamental reason why LLM serving focuses so heavily on memory optimization.
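The decode figures above follow from a standard back-of-envelope model: each weight is read once per step but performs one multiply-add per token in the batch. A sketch, assuming FP32 weights, 2 FLOPs per weight per token, and ignoring activation and KV-cache traffic:

```python
def decode_ai(batch: int, bytes_per_weight: float = 4.0) -> float:
    """Arithmetic intensity of LLM decode: each weight is read once
    and shared across the whole batch."""
    flops_per_weight = 2 * batch     # one multiply-add per token
    return flops_per_weight / bytes_per_weight

print(decode_ai(1))                         # 0.5 FLOP/byte, matching the figures above
print(decode_ai(32))                        # 16.0: better, but far below A100's ~153 crossover
print(decode_ai(1, bytes_per_weight=0.5))   # INT4 weights: 4.0, an 8x jump in AI
```

The same model shows why batching and quantization are the first two levers: one multiplies the numerator, the other shrinks the denominator.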

Bandwidth Optimization Techniques

Technique                  Applicable When      Bandwidth Improvement   Mechanism
Quantization (FP16->INT4)  Weight loading       2-4x                    Fewer bytes to read
Batching                   Decode phase         Up to 64x               Amortize weight reads
Operator fusion            Multi-op sequences   1.5-3x                  Eliminate intermediate writes
Data layout (SoA)          Strided access       2-10x                   Better coalescing/prefetch
Tiling (SMEM)              Reusable data        3-15x                   Serve from fast SRAM instead
Compression (sparse)       Sparse models        1.5-4x                  Skip zero values
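For a memory-bound kernel, runtime tracks bytes moved, so the quantization row can be sanity-checked with an Amdahl-style one-liner. A sketch; `weight_fraction` is a hypothetical parameter for the share of total traffic that is weights, not a measured value:

```python
def quant_speedup(bits_before: int, bits_after: int,
                  weight_fraction: float = 1.0) -> float:
    """Speedup of a memory-bound kernel when weights shrink from
    bits_before to bits_after; non-weight traffic is unchanged."""
    reduced = weight_fraction * bits_after / bits_before + (1 - weight_fraction)
    return 1 / reduced

print(quant_speedup(16, 4))                       # 4.0x if all traffic is weights
print(quant_speedup(16, 4, weight_fraction=0.8))  # 2.5x with 20% activation traffic
```

This is where the table's 2-4x range comes from: the more non-weight traffic a kernel moves, the further the realized speedup falls below the raw 4x compression ratio.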
💡 The Bandwidth Budget

Before optimizing any kernel, calculate its memory-traffic floor: bytes_read + bytes_written, divided by your GPU's achieved bandwidth (~90% of theoretical for HBM). That quotient is the minimum runtime that bandwidth alone permits. If your kernel's measured runtime far exceeds this floor, it is not yet bandwidth-limited: look for other bottlenecks. If runtime is close to the floor, you're hitting the hardware limit and need algorithmic changes (quantization, fusion, tiling).
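The budget check can be sketched directly. The traffic numbers below are placeholders, and 1.8 TB/s is the A100 achieved-bandwidth figure from the table above:

```python
def bandwidth_floor_s(bytes_read: int, bytes_written: int,
                      achieved_bw: float) -> float:
    """Minimum runtime that memory traffic alone permits."""
    return (bytes_read + bytes_written) / achieved_bw

def is_bandwidth_limited(runtime_s: float, floor_s: float,
                         tol: float = 1.2) -> bool:
    """Within ~20% of the memory floor counts as at the hardware limit."""
    return runtime_s <= tol * floor_s

floor = bandwidth_floor_s(900_000_000, 300_000_000, 1.8e12)
print(f"memory floor: {floor * 1e6:.0f} us")   # ~667 us for 1.2 GB of traffic
print(is_bandwidth_limited(700e-6, floor))     # near the floor: True
print(is_bandwidth_limited(2000e-6, floor))    # 3x the floor: look elsewhere
```

The 20% tolerance is an illustrative threshold, not a standard; pick one that matches your profiler's noise.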

Multi-Level Bandwidth Optimization

The most effective optimizations exploit the bandwidth gap between memory levels:

Bandwidth Reduction Through Memory Hierarchy Exploitation (A100)

Optimization stage       HBM demand     Status
Naive (all from HBM)     2,000 GB/s     Exceeds HBM BW
+ L2 cache hits (40%)    1,200 GB/s
+ SMEM tiling            400 GB/s
+ Operator fusion        200 GB/s       Within HBM budget

Each level of optimization reduces the demand on HBM. The combination of caching, tiling, and fusion can reduce HBM traffic by 10x — often the difference between a kernel being bandwidth-limited and performing at acceptable throughput.
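The staircase above composes multiplicatively, which a short sketch makes explicit. The hit rate and reuse factors are derived from the figure's illustrative numbers, not from measurements:

```python
def hbm_demand_gbs(naive_gbs: float, l2_hit_rate: float,
                   smem_reuse: float, fusion_factor: float) -> float:
    after_l2 = naive_gbs * (1 - l2_hit_rate)   # L2 hits never reach HBM
    after_smem = after_l2 / smem_reuse         # tiles re-served from SRAM
    return after_smem / fusion_factor          # fusion skips intermediate writes

print(hbm_demand_gbs(2000, l2_hit_rate=0.40, smem_reuse=3.0, fusion_factor=2.0))
# 2000 -> 1200 -> 400 -> 200 GB/s: the 10x reduction described above
```

Because the factors multiply, improving any single level also multiplies the total; this is why combined optimizations reach reductions no single technique can.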

Conclusion

Memory bandwidth is the primary bottleneck for most AI workloads on modern hardware. GPU HBM delivers roughly 5-12x more bandwidth than CPU DDR, which is why GPUs dominate AI; yet even HBM bandwidth is insufficient for operations like LLM decode at batch=1. The roofline model provides the analytical framework: calculate your operation's arithmetic intensity, compare it to the platform's crossover point, and optimize accordingly. For memory-bound operations (most of LLM serving), the optimization hierarchy is: batch more, quantize weights, fuse operators, and tile into SRAM.