LLM inference throughput during the decode phase is determined by a single number: memory bandwidth. Not compute FLOPS, not cache size, not interconnect speed. Memory bandwidth. A 70B parameter model at FP16 requires reading 140 GB of weights from memory for every single output token. On an H100 with 3,350 GB/s of HBM3 bandwidth, that takes approximately 42 milliseconds — yielding about 24 tokens per second regardless of the H100’s 990 TFLOPS of tensor compute sitting largely idle. Double the bandwidth and you double the token rate.
This is why High Bandwidth Memory (HBM) is the defining technology of modern AI accelerators. This post covers the physical technology behind HBM, the four generations deployed in NVIDIA data center GPUs, the engineering tradeoffs in capacity vs. bandwidth vs. power vs. cost, and how to measure actual bandwidth utilization in practice.
HBM Architecture Fundamentals
Why Not GDDR?
Consumer GPUs use GDDR (Graphics Double Data Rate) memory, which places DRAM chips around the GPU die on the PCB surface. An RTX 4090 has 24 GB of GDDR6X across twelve chips, connected via 384 traces on the PCB. Each trace runs at 21 Gbps, yielding: 384 pins × 21 Gbps / 8 = 1,008 GB/s of total bandwidth.
This works for consumer GPUs, but scaling further hits physical limits: more traces require a larger PCB, longer traces increase signal degradation, and higher frequencies increase power and heat. GDDR7 pushes to 36 Gbps per pin but the PCB trace count is the fundamental bottleneck.
The HBM Stacking Approach
HBM solves this by going vertical. Instead of spreading DRAM chips across a wide PCB, HBM stacks multiple DRAM dies on top of each other, connected by through-silicon vias (TSVs) — microscopic copper pillars drilled through the silicon itself. Each HBM stack sits on a silicon interposer right next to the GPU die, connected via micro-bumps at a pitch of 40-55 micrometers.
Key physical parameters of an HBM stack:
- TSV diameter: 5-10 micrometers
- TSV pitch: 40-55 micrometers
- TSVs per stack: Thousands (each stack has a 1024-bit interface)
- Stack height: 4, 8, or 12 DRAM dies
- Interposer: 2.5D silicon interposer (CoWoS packaging from TSMC)
The 1024-bit interface per stack is the key differentiator. A single HBM stack has the same bus width as the entire GDDR6X interface of an RTX 4090 (384 pins). With multiple stacks, total bus width reaches 4096-6144 bits — 10-16x wider than GDDR.
HBM uses 2.5D packaging: the DRAM stacks are placed next to the GPU die on a shared silicon interposer, not directly on top of the GPU. The interposer contains the wiring between GPU and HBM stacks. True 3D stacking (placing DRAM directly on the compute die) is being researched but faces thermal challenges — the DRAM would be heated by the GPU below it.
HBM Generations: Specifications
| Spec | HBM2 | HBM2e | HBM3 | HBM3e |
|---|---|---|---|---|
| JEDEC Standard | JESD235B | JESD235B (extended) | JESD238 | JESD238A |
| Pin data rate | 2.0 Gbps | 3.6 Gbps | 6.4 Gbps | 9.6 Gbps |
| Interface width per stack | 1024-bit | 1024-bit | 1024-bit | 1024-bit |
| BW per stack | 256 GB/s | 461 GB/s | 819 GB/s | 1,229 GB/s |
| Max die stack height | 8-hi | 8-hi | 8-hi (12-hi avail.) | 12-hi |
| Capacity per stack (8-hi) | 8 GB | 16 GB | 16 GB | 24 GB |
| Capacity per stack (12-hi) | N/A | N/A | 24 GB | 36 GB |
| Stacks on V100/A100/H100/B200 | 4 | 5 | 5 (6 for H200) | 8 |
| Total GPU BW | 900 GB/s (V100) | 2,039 GB/s (A100) | 3,350 GB/s (H100) | 8,000 GB/s (B200) |
| Total GPU capacity | 32 GB | 80 GB | 80 GB (H100) | 192 GB (B200) |
| Power per stack | ~4.5W | ~5.5W | ~7.5W | ~10W |
| Year deployed (NVIDIA) | 2017 | 2020 | 2022 | 2024 |
HBM2 (V100: 900 GB/s)
The V100 uses 4 stacks of HBM2 at 1.75 Gbps per pin (below the 2.0 Gbps JEDEC maximum), yielding: 4 stacks × 1024 bits × 1.75 Gbps / 8 = 896 GB/s.
NVIDIA rounds this to 900 GB/s. Total capacity is 4 stacks x 8 GB = 32 GB (the 16 GB variant uses 4-hi stacks).
At 900 GB/s and 125 TFLOPS FP16, the V100 has a compute-to-bandwidth ratio of approximately 139 FLOP/byte. This was reasonable for the training workloads of 2017-2019 (BERT, GPT-2), where batch sizes were large enough to keep tensor cores busy. But it established the pattern: memory bandwidth is the constraint that determines what models you can serve.
HBM2e (A100: 2,039 GB/s)
The A100 uses 5 stacks of HBM2e at 3.2 Gbps per pin: 5 stacks × 1024 bits × 3.2 Gbps / 8 = 2,048 GB/s.
NVIDIA reports 2,039 GB/s (likely accounting for the ECC bandwidth overhead on the 80 GB variant). The 40 GB A100 variant uses the same 5 stacks but with 8 GB per stack instead of 16 GB.
The jump from 900 to 2,039 GB/s (2.26x) tracked the FP16 compute increase (125 to 312 TFLOPS, 2.5x). HBM2e achieved this by increasing the per-pin data rate from 1.75 to 3.2 Gbps — a clock speed improvement requiring better signaling and tighter manufacturing tolerances on the TSVs.
The 80 GB capacity was the real game-changer for inference. For the first time, a 40B parameter model (80 GB at FP16) could fit on a single GPU. Previously, multi-GPU setups were required for anything above 16B parameters.
HBM3 (H100: 3,350 GB/s)
HBM3 introduced a major architectural change: independent channels. Previous HBM generations had a monolithic 1024-bit interface per stack. HBM3 splits this into two independent 512-bit channels per stack, allowing each channel to operate at different addresses simultaneously. This improves effective bandwidth for workloads with irregular access patterns (like KV cache reads during attention).
The H100 uses 5 stacks of HBM3 at 5.23 Gbps per pin: 5 stacks × 1024 bits × 5.23 Gbps / 8 = 3,347 GB/s.
Capacity remains at 80 GB (5 x 16 GB stacks). The pin rate increase from 3.2 to 5.23 Gbps (1.63x) delivered a 1.64x bandwidth improvement.
HBM Bandwidth per Stack (GB/s): 256 (HBM2) → 461 (HBM2e) → 819 (HBM3) → 1,229 (HBM3e)
HBM3e (H200: 4,800 GB/s, B200: 8,000 GB/s)
HBM3e pushes pin rates to 8.0-9.6 Gbps and introduces 12-hi stacking (12 DRAM dies per stack). This is where capacity and bandwidth both jump significantly.
H200: 6 stacks of HBM3e at 8.0 Gbps per pin (24 GB stacks): 6 stacks × 1024 bits × 8.0 Gbps / 8 = 6,144 GB/s theoretical.
NVIDIA reports 4,800 GB/s — the difference reflects that the H200 uses the same memory controller design as the H100 (built for 5 stacks of HBM3) and does not run all 6 stacks at the maximum rate. Raw capacity: 6 × 24 GB = 144 GB; NVIDIA reports 141 GB usable, consistent with ECC overhead or reserved capacity.
B200: 8 stacks of HBM3e (4 per die) at approximately 7.8 Gbps per pin, below the 9.2-9.6 Gbps the dies are rated for: 8 stacks × 1024 bits × 7.8 Gbps / 8 ≈ 8,000 GB/s.
NVIDIA reports 8,000 GB/s. The dual-die design gives Blackwell twice the memory controller count of Hopper, allowing full utilization of 8 stacks. Capacity: 8 x 24 GB = 192 GB.
Why Bandwidth Determines Decode Throughput
The Decode Bottleneck
During autoregressive decoding, each token generation requires:
- Read all model weights (for a single GEMV per layer)
- Read the KV cache for attention
- Compute attention scores, FFN output, logits
- Write one new KV cache entry per layer
- Sample the next token
Step 1 dominates. For a 70B parameter model at FP16 (2 bytes per param), the weight read per token is: 70 × 10⁹ params × 2 bytes = 140 GB, which takes 140 GB / 3,350 GB/s = 41.8 ms.
The compute is 2 × 70 × 10⁹ = 140 GFLOP (one multiply, one add per parameter). On an H100 with 990 TFLOPS FP16: 140 GFLOP / 990 TFLOP/s = 0.14 ms.
The GPU spends 0.14 ms computing and 41.8 ms reading weights. Compute utilization: 0.3%. This is why HBM bandwidth, not TFLOPS, determines decode throughput.
The arithmetic intensity of a single-batch decode step is 2N FLOP / (b × N bytes) = 2/b FLOP/byte, where b is bytes per parameter. At FP16 (b = 2), intensity is 1 FLOP/byte. The H100’s ridge point is 296 FLOP/byte (990 TFLOPS / 3,350 GB/s, FP16). The decode operation is roughly 300x below the ridge point. No amount of compute optimization helps — only bandwidth or reduced weight size (quantization) matters.
Batch Size: Trading Latency for Throughput
At batch size B, the same weight read serves B tokens: arithmetic intensity rises to 2B/b FLOP/byte while the weight-read time stays fixed.
To become compute-bound on H100 FP16: B × (2/b) ≥ 296 FLOP/byte, so B ≥ 296 at b = 2. In practice, KV cache memory limits batch size well before this point. For a 70B model with 4K context, the KV cache per sequence is on the order of 10 GB at FP16 with full multi-head attention — a handful of sequences exhausts the 80 GB of HBM (and the weights alone consume 140 GB, so this model needs multi-GPU serving anyway).
Decode Throughput vs Batch Size (70B FP8, Single H100)
| Batch Size | Weight Read Time | Compute Time | Bottleneck | Tokens/s | Per-user Latency |
|---|---|---|---|---|---|
| 1 | 20.9 ms | 0.07 ms | Memory | 48 | 20.9 ms/tok |
| 8 | 20.9 ms | 0.56 ms | Memory | 383 | 20.9 ms/tok |
| 32 | 20.9 ms | 2.24 ms | Memory | 1,531 | 20.9 ms/tok |
| 128 | 20.9 ms | 8.96 ms | Memory | 6,124 | 20.9 ms/tok |
| 512 | 20.9 ms | 35.8 ms | Compute | 14,300 | 35.8 ms/tok |
The Capacity Cliff
Bandwidth only matters if the model fits in memory. If weights must be split across GPUs, NVLink bandwidth replaces HBM bandwidth as the bottleneck for the inter-GPU portion of each layer.
A 405B parameter model (Llama 3) at FP16 requires 810 GB. On H100s (80 GB each), this needs a minimum of 11 GPUs. In practice, tensor parallelism across 8 GPUs in a single node is standard, with pipeline parallelism across nodes. Each tensor-parallel step requires an all-reduce over NVLink (900 GB/s bidirectional per GPU), which is 3.7x slower than local HBM bandwidth.
On B200s (192 GB each), the same model needs only 5 GPUs at FP16, or fits on 2 GPUs at FP4 (approximately 203 GB). Fewer GPUs means less communication overhead.
Minimum GPUs for a 405B model: 11× H100 at FP16, 6× H100 at FP8, 5× B200 at FP16, 3× B200 at FP8, 2× B200 at FP4.
Power Consumption of HBM
HBM power is not negligible. Each HBM stack consumes 4.5-10W depending on generation and data rate. On a B200 with 8 stacks: 8 × ~10 W = ~80 W.
That is 8% of the B200’s 1,000W TDP dedicated to memory alone. The power breakdown matters because HBM power scales with bandwidth utilization — when the memory is being read at full bandwidth (during decode), it draws peak power.
HBM Power Budget per GPU
| GPU | Stacks | Per-stack Power | Total HBM Power | GPU TDP | HBM % of TDP |
|---|---|---|---|---|---|
| V100 | 4 | ~4.5W | ~18W | 300W | 6% |
| A100 | 5 | ~5.5W | ~28W | 400W | 7% |
| H100 | 5 | ~7.5W | ~38W | 700W | 5% |
| H200 | 6 | ~9W | ~54W | 700W | 8% |
| B200 | 8 | ~10W | ~80W | 1,000W | 8% |
Energy per Bit
A useful metric for comparing HBM generations is the energy per bit transferred: pJ/bit = 1,000 × stack power (W) / stack data rate (Gbps). For HBM2: 1,000 × 4.5 / 1,792 = 2.51 pJ/bit.
HBM Energy Efficiency (pJ/bit)
| HBM Gen | Stack Power | Stack BW | pJ/bit | Improvement vs Prior |
|---|---|---|---|---|
| HBM2 | 4.5W | 1,792 Gbps | 2.51 | baseline |
| HBM2e | 5.5W | 3,277 Gbps | 1.68 | 1.49x better |
| HBM3 | 7.5W | 5,356 Gbps | 1.40 | 1.20x better |
| HBM3e | 10W | 8,192 Gbps | 1.22 | 1.15x better |
Energy efficiency improvements are slowing down — each generation delivers roughly 15-50% better pJ/bit. This means power envelopes constrain future bandwidth scaling. A hypothetical HBM4 at 12-16 Gbps per pin with 16-hi stacking would push per-stack power above 15W, and total HBM power above 120W for an 8-stack configuration.
Cost of HBM
HBM is the single most expensive component in a data center GPU. As of 2024-2025, approximate per-GB costs are:
Approximate HBM Cost (2024-2025)
| HBM Gen | Cost per GB | Total Cost per GPU | GPU ASP | HBM % of GPU Cost |
|---|---|---|---|---|
| HBM2 (V100 era) | ~$15-20 | ~$480-640 (32 GB) | ~$8,000 | ~6-8% |
| HBM2e (A100 era) | ~$15-20 | ~$1,200-1,600 (80 GB) | ~$12,000 | ~10-13% |
| HBM3 (H100 era) | ~$25-35 | ~$2,000-2,800 (80 GB) | ~$30,000 | ~7-9% |
| HBM3e (B200 era) | ~$30-40 | ~$5,760-7,680 (192 GB) | ~$40,000 | ~14-19% |
HBM cost is driven by two factors: manufacturing yield (stacking 12 dies requires all 12 to be functional) and supply constraints (SK Hynix, Samsung, and Micron cannot build new HBM fabs fast enough to meet AI demand). The TSV process adds approximately $2-3 per GB to the base DRAM cost, and 12-hi stacking has lower yield than 8-hi due to the cumulative probability of defects.
As of early 2025, HBM3e supply is entirely allocated to NVIDIA and AMD through 2026. SK Hynix produces approximately 60-70% of HBM supply, with Samsung and Micron splitting the remainder. Capacity expansion takes 18-24 months. This supply constraint, more than any technical factor, determines GPU pricing and availability.
Measuring Actual vs Theoretical Bandwidth
CUDA Bandwidth Test
The simplest bandwidth measurement uses cudaMemcpy:
```cuda
#include <cuda_runtime.h>
#include <cstdio>

void measure_bandwidth(size_t size_mb) {
    size_t size = size_mb * 1024 * 1024;
    void *d_src, *d_dst;
    cudaMalloc(&d_src, size);
    cudaMalloc(&d_dst, size);
    cudaMemset(d_src, 1, size);

    // Warm up
    cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();

    // Measure
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int iterations = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iterations; i++) {
        cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double bandwidth = (2.0 * size * iterations) / (ms / 1000.0) / 1e9;
    // Factor of 2: each copy performs one read and one write through HBM
    printf("Size: %zu MB, Time: %.2f ms, Bandwidth: %.1f GB/s\n",
           size_mb, ms / iterations, bandwidth);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main() {
    printf("HBM Bandwidth Measurement (device-to-device copy)\n");
    printf("=================================================\n");
    measure_bandwidth(1);
    measure_bandwidth(16);
    measure_bandwidth(256);
    measure_bandwidth(1024);
    measure_bandwidth(4096);
    return 0;
}
```
Kernel-Level Bandwidth Measurement
A more accurate measurement uses a custom kernel that forces specific access patterns:
```cuda
__global__ void read_bandwidth_kernel(const float4 *input, float4 *output, size_t n) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    float4 sum = make_float4(0, 0, 0, 0);
    for (size_t i = idx; i < n; i += stride) {
        float4 val = input[i];
        sum.x += val.x;
        sum.y += val.y;
        sum.z += val.z;
        sum.w += val.w;
    }
    if (idx == 0) output[0] = sum;  // Prevent the compiler from eliding the reads
}

__global__ void write_bandwidth_kernel(float4 *output, size_t n) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)blockDim.x * gridDim.x;
    float4 val = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    for (size_t i = idx; i < n; i += stride) {
        output[i] = val;
    }
}

void measure_read_write_separately(size_t size_mb) {
    size_t n_float4 = (size_mb * 1024 * 1024) / sizeof(float4);
    size_t size = n_float4 * sizeof(float4);
    float4 *d_buf, *d_out;
    cudaMalloc(&d_buf, size);
    cudaMalloc(&d_out, sizeof(float4));
    cudaMemset(d_buf, 0, size);

    int blocks = 256;
    int threads = 256;

    // Measure read bandwidth
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 20; i++)
        read_bandwidth_kernel<<<blocks, threads>>>(d_buf, d_out, n_float4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double read_bw = (double)size * 20 / (ms / 1000.0) / 1e9;

    // Measure write bandwidth
    cudaEventRecord(start);
    for (int i = 0; i < 20; i++)
        write_bandwidth_kernel<<<blocks, threads>>>(d_buf, n_float4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    double write_bw = (double)size * 20 / (ms / 1000.0) / 1e9;

    printf("Size: %zu MB | Read: %.1f GB/s | Write: %.1f GB/s\n",
           size_mb, read_bw, write_bw);

    cudaFree(d_buf);
    cudaFree(d_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```
Expected Results
Typical Measured vs Theoretical HBM Bandwidth
| GPU | Theoretical BW | Measured (large copy) | Utilization % | Measured (kernel read) |
|---|---|---|---|---|
| V100 | 900 GB/s | 830-850 GB/s | 92-94% | 780-820 GB/s |
| A100 | 2,039 GB/s | 1,900-1,950 GB/s | 93-96% | 1,800-1,880 GB/s |
| H100 | 3,350 GB/s | 3,100-3,200 GB/s | 93-96% | 2,900-3,100 GB/s |
| H200 | 4,800 GB/s | 4,400-4,600 GB/s | 92-96% | 4,100-4,400 GB/s |
The gap between theoretical and measured bandwidth comes from:
- ECC overhead: The A100 80GB and H100 enable ECC by default, consuming approximately 6% of raw bandwidth for error correction codes
- Memory controller overhead: Address translation, refresh cycles, and bank conflicts consume 2-4% of bandwidth
- Non-ideal access patterns: Any deviation from perfectly coalesced, sequential, 128-byte-aligned access degrades throughput
Using nvidia-smi for Runtime Monitoring
```bash
# Real-time HBM bandwidth utilization
nvidia-smi dmon -s u -d 1
# Output columns: gpu sm mem enc dec
# 'mem' is the percentage of time the memory controller was busy over the sample

# For memory capacity alongside utilization:
nvidia-smi --query-gpu=memory.used,memory.total,memory.free,utilization.memory \
    --format=csv -l 1
```
Using Nsight Compute for Per-Kernel Analysis
```bash
# Measure HBM throughput for a specific kernel
ncu --metrics \
dram__bytes_read.sum,\
dram__bytes_write.sum,\
dram__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__sectors_read.sum,\
dram__sectors_write.sum \
./my_binary

# dram__throughput shows what percentage of peak HBM bandwidth the kernel achieves
# Target: above 80% for memory-bound kernels
# Below 60% indicates access-pattern problems (non-coalesced access, bank conflicts)
```
A well-optimized memory-bound kernel should achieve 80-90% of theoretical HBM bandwidth. If you measure below 70%, check for: (1) non-coalesced global memory access, (2) small transfer sizes that do not saturate the memory bus, (3) excessive L2 cache thrashing from random access patterns, (4) memory-mapped regions that cross page boundaries.
HBM vs Capacity: The Tradeoff
NVIDIA’s product lineup reveals the capacity-vs-bandwidth tradeoff explicitly:
- H100: 80 GB at 3,350 GB/s — optimized for compute-heavy workloads
- H200: 141 GB at 4,800 GB/s — same die, more and faster memory
- B200: 192 GB at 8,000 GB/s — dual die with maximum memory
For inference serving, the choice between these GPUs depends on model size. A rough per-token decode model is: t_token ≈ weight bytes / (N_gpus × BW_HBM) + t_comm, where t_comm is the per-token all-reduce cost over NVLink (zero for a single GPU).
The communication overhead term makes single-GPU serving dramatically more efficient. An H200 serving a 70B model at FP8 (70 GB, fits in 141 GB) will outperform two H100s serving the same model (70 GB split across 2x 80 GB) despite lower total bandwidth (4,800 vs 6,700 GB/s) because there is zero NVLink communication overhead.
Future: HBM4 and Beyond
HBM4 (expected 2025-2026) will introduce several changes:
- Logic base die: The bottom die of the stack will be a logic die manufactured on an advanced process node (potentially 5nm or better), containing ECC engines, built-in PHYs, and possibly compute-near-memory logic
- Pin rates: 12-16 Gbps per pin
- 2048-bit interface per stack: Doubling the bus width per stack
- Projected per-stack bandwidth: 1.5-2 TB/s
- 16-hi stacking: 16 DRAM dies per stack (if yield allows)
With 8 stacks of HBM4 at 2 TB/s each, a single GPU could reach 16 TB/s of memory bandwidth. At that rate, a 70B FP4 model (35 GB) would decode at: 35 GB / 16,000 GB/s = 2.2 ms per token, roughly 450 tokens per second.
That is far beyond human reading speed (roughly 5 words per second) — for a 70B model, single-stream decode latency would cease to be a user-perceptible bottleneck at all.
The HBM bandwidth trajectory shows no signs of slowing. Each generation delivers 1.5-2.4x more bandwidth per stack, and package-level innovations (more stacks, wider interfaces) compound the improvement. For LLM inference, this is the single most impactful hardware trend.