Part 2 of 30 in the series "GPU Hardware & AI Accelerators"

LLM inference throughput during the decode phase is determined by a single number: memory bandwidth. Not compute FLOPS, not cache size, not interconnect speed. Memory bandwidth. A 70B parameter model at FP16 requires reading 140 GB of weights from memory for every single output token. On an H100 with 3,350 GB/s of HBM3 bandwidth, that takes approximately 42 milliseconds — yielding about 24 tokens per second regardless of the H100’s 990 TFLOPS of tensor compute sitting largely idle. Double the bandwidth and you double the token rate.
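This back-of-envelope arithmetic is worth keeping as a one-liner; a minimal sketch (the function name is mine):

```python
def decode_tokens_per_sec(params_billions, bytes_per_param, hbm_gb_per_s):
    """Batch-1 decode ceiling: every output token re-reads all weights."""
    weight_gb_per_token = params_billions * bytes_per_param
    return hbm_gb_per_s / weight_gb_per_token

# 70B params at FP16 (2 bytes each) on an H100 with 3,350 GB/s of HBM3
print(f"{decode_tokens_per_sec(70, 2, 3350):.1f} tokens/s")  # ~23.9
```

Doubling bandwidth or halving bytes per parameter (FP8) doubles the ceiling, which is the premise of this entire post.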

This is why High Bandwidth Memory (HBM) is the defining technology of modern AI accelerators. This post covers the physical technology behind HBM, the four generations deployed in NVIDIA data center GPUs, the engineering tradeoffs in capacity vs. bandwidth vs. power vs. cost, and how to measure actual bandwidth utilization in practice.

HBM Architecture Fundamentals

Why Not GDDR?

Consumer GPUs use GDDR (Graphics Double Data Rate) memory, which places DRAM chips around the GPU die on the PCB surface. An RTX 4090 has 24 GB of GDDR6X across twelve chips, connected via 384 traces on the PCB. Each trace runs at 21 Gbps, yielding:

384 × 21 Gbps / 8 = 1,008 GB/s

This works for consumer GPUs, but scaling further hits physical limits: more traces require a larger PCB, longer traces increase signal degradation, and higher frequencies increase power and heat. GDDR7 pushes to 36 Gbps per pin but the PCB trace count is the fundamental bottleneck.

The HBM Stacking Approach

HBM solves this by going vertical. Instead of spreading DRAM chips across a wide PCB, HBM stacks multiple DRAM dies on top of each other, connected by through-silicon vias (TSVs) — microscopic copper pillars drilled through the silicon itself. Each HBM stack sits on a silicon interposer right next to the GPU die, connected via micro-bumps at a pitch of 40-55 micrometers.

Key physical parameters of an HBM stack:

  • TSV diameter: 5-10 micrometers
  • TSV pitch: 40-55 micrometers
  • TSVs per stack: Thousands (each stack has a 1024-bit interface)
  • Stack height: 4, 8, or 12 DRAM dies
  • Interposer: 2.5D silicon interposer (CoWoS packaging from TSMC)

The 1024-bit interface per stack is the key differentiator. A single HBM stack has nearly three times the bus width of an RTX 4090's entire 384-bit GDDR6X interface. With multiple stacks, total bus width reaches 4096-6144 bits — 10-16x wider than GDDR.
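GDDR and HBM bandwidth come from the same formula, bus width times per-pin data rate, which makes the width advantage easy to quantify (helper name is illustrative):

```python
def bus_bandwidth_gb_s(width_bits, gbps_per_pin):
    # bits/s across the whole bus, divided by 8 bits per byte, in GB/s
    return width_bits * gbps_per_pin / 8

print(bus_bandwidth_gb_s(384, 21))          # RTX 4090 GDDR6X: 1008.0
print(bus_bandwidth_gb_s(1024, 2.0))        # one HBM2 stack at spec: 256.0
print(bus_bandwidth_gb_s(5 * 1024, 5.23))   # H100, 5 HBM3 stacks: ~3347
```

GDDR scales by raising the per-pin rate on a fixed narrow bus; HBM scales by widening the bus, which only TSV stacking and an interposer make physically possible.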

ℹ️ 2.5D vs 3D Packaging

HBM uses 2.5D packaging: the DRAM stacks are placed next to the GPU die on a shared silicon interposer, not directly on top of the GPU. The interposer contains the wiring between GPU and HBM stacks. True 3D stacking (placing DRAM directly on the compute die) is being researched but faces thermal challenges — the DRAM would be heated by the GPU below it.

HBM Generations: Specifications

📊

HBM Generation Specifications

| Spec | HBM2 | HBM2e | HBM3 | HBM3e |
|---|---|---|---|---|
| JEDEC standard | JESD235B | JESD235B (extended) | JESD238 | JESD238A |
| Pin data rate | 2.0 Gbps | 3.6 Gbps | 6.4 Gbps | 9.6 Gbps |
| Interface width per stack | 1024-bit | 1024-bit | 1024-bit | 1024-bit |
| BW per stack | 256 GB/s | 461 GB/s | 819 GB/s | 1,229 GB/s |
| Max die stack height | 8-hi | 8-hi | 8-hi (12-hi avail.) | 12-hi |
| Capacity per stack (8-hi) | 8 GB | 16 GB | 16 GB | 24 GB |
| Capacity per stack (12-hi) | N/A | N/A | 24 GB | 36 GB |
| Stacks per GPU | 4 (V100) | 5 (A100) | 5 (H100) | 8 (B200), 6 (H200) |
| Total GPU BW | 900 GB/s (V100) | 2,039 GB/s (A100) | 3,350 GB/s (H100) | 8,000 GB/s (B200) |
| Total GPU capacity | 32 GB (V100) | 80 GB (A100) | 80 GB (H100) | 192 GB (B200) |
| Power per stack | ~4.5W | ~5.5W | ~7.5W | ~10W |
| Year deployed (NVIDIA) | 2017 | 2020 | 2022 | 2024 |

Note: Pin data rates are peak spec; actual deployed rates vary. The B200 is dual-die with 4 stacks per die. The H200 uses 6 HBM3e stacks for 141 GB and 4,800 GB/s.
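The per-stack bandwidth row is pure arithmetic: the fixed 1024-bit interface times the pin rate. A quick check against the table:

```python
# JEDEC peak pin rates per generation (Gbps), from the table above
PIN_RATE = {"HBM2": 2.0, "HBM2e": 3.6, "HBM3": 6.4, "HBM3e": 9.6}

for gen, gbps in PIN_RATE.items():
    per_stack = 1024 * gbps / 8  # every generation keeps a 1024-bit interface
    print(f"{gen}: {per_stack:,.0f} GB/s per stack")
# matches the 256 / 461 / 819 / 1,229 GB/s row
```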

HBM2 (V100: 900 GB/s)

The V100 uses 4 stacks of HBM2 at 1.75 Gbps per pin (below the 2.0 Gbps JEDEC maximum), yielding:

4 stacks × 1024 bits × 1.75 Gbps / 8 = 896 GB/s

NVIDIA rounds this to 900 GB/s. Total capacity is 4 stacks x 8 GB = 32 GB (the 16 GB variant uses 4-hi stacks).

At 900 GB/s and 125 TFLOPS FP16, the V100 has a compute-to-bandwidth ratio of approximately 139 FLOP/byte. This was reasonable for the training workloads of 2017-2019 (BERT, GPT-2), where batch sizes were large enough to keep tensor cores busy. But it established the pattern: memory bandwidth is the constraint that determines what models you can serve.

HBM2e (A100: 2,039 GB/s)

The A100 uses 5 stacks of HBM2e at 3.2 Gbps per pin:

5 × 1024 × 3.2 / 8 = 2,048 GB/s

NVIDIA reports 2,039 GB/s (likely accounting for the ECC bandwidth overhead on the 80 GB variant). The 40 GB A100 variant uses the same 5 stacks but with 8 GB per stack instead of 16 GB.

The jump from 900 to 2,039 GB/s (2.26x) tracked the FP16 compute increase (125 to 312 TFLOPS, 2.5x). HBM2e achieved this by increasing the per-pin data rate from 1.75 to 3.2 Gbps — a clock speed improvement requiring better signaling and tighter manufacturing tolerances on the TSVs.

The 80 GB capacity was the real game-changer for inference. For the first time, a 40B parameter model (80 GB at FP16) could fit on a single GPU. Previously, multi-GPU setups were required for anything above 16B parameters.

HBM3 (H100: 3,350 GB/s)

HBM3 introduced a major architectural change: independent channels. Previous HBM generations had a monolithic 1024-bit interface per stack. HBM3 splits this into two independent 512-bit channels per stack, allowing each channel to operate at different addresses simultaneously. This improves effective bandwidth for workloads with irregular access patterns (like KV cache reads during attention).

The H100 uses 5 stacks of HBM3 at 5.23 Gbps per pin:

5 × 1024 × 5.23 / 8 ≈ 3,350 GB/s

Capacity remains at 80 GB (5 x 16 GB stacks). The pin rate increase from 3.2 to 5.23 Gbps (1.63x) delivered a 1.64x bandwidth improvement.

HBM Bandwidth per Stack (deployed configurations)

| Generation (GPU) | Pin rate | Per-stack BW | vs HBM2 |
|---|---|---|---|
| HBM2 (V100) | 1.75 Gbps | 225 GB/s | baseline |
| HBM2e (A100) | 3.2 Gbps | 408 GB/s | +81% |
| HBM3 (H100) | 5.23 Gbps | 670 GB/s | +198% |
| HBM3e (H200) | 8.0 Gbps | 800 GB/s | +256% |
| HBM3e (B200) | 9.2 Gbps | 1,000 GB/s | +344% |

HBM3e (H200: 4,800 GB/s, B200: 8,000 GB/s)

HBM3e pushes pin rates to 8.0-9.6 Gbps and introduces 12-hi stacking (12 DRAM dies per stack). This is where capacity and bandwidth both jump significantly.

H200: 6 stacks of HBM3e at 8.0 Gbps per pin, 12-hi stacking:

6 × 1024 × 8.0 / 8 = 6,144 GB/s (theoretical)

NVIDIA reports 4,800 GB/s — the difference reflects that the H200 uses the same memory controller as the H100 (designed for 5 stacks of HBM3) and cannot run all 6 stacks at the full pin rate. Capacity: 6 × 24 GB = 144 GB, of which NVIDIA reports 141 GB usable — consistent with reserved capacity or ECC overhead.

B200: 8 stacks of HBM3e (4 per die) at approximately 9.2 Gbps per pin:

8 × 1024 × 9.2 / 8 ≈ 9,421 GB/s (theoretical)

NVIDIA reports 8,000 GB/s. The dual-die design gives Blackwell twice the memory controller count of Hopper, allowing full utilization of 8 stacks. Capacity: 8 x 24 GB = 192 GB.

Why Bandwidth Determines Decode Throughput

The Decode Bottleneck

During autoregressive decoding, each token generation requires:

  1. Read all model weights (for a single GEMV per layer)
  2. Read the KV cache for attention
  3. Compute attention scores, FFN output, logits
  4. Write one new KV cache entry per layer
  5. Sample the next token

Step 1 dominates. For a 70B parameter model at FP16 (2 bytes per param), the weight read per token is:

70 × 10^9 params × 2 bytes = 140 GB per token

The compute is 2 × 70 × 10^9 = 140 GFLOP per token (one multiply, one add per parameter). On an H100 with 990 TFLOPS FP16:

Compute time = (140 × 10^9 FLOP) / (990 × 10^12 FLOP/s) = 0.14 ms

Memory time = (140 × 10^9 bytes) / (3,350 × 10^9 bytes/s) = 41.8 ms

The GPU spends 0.14 ms computing and 41.8 ms reading weights. Compute utilization: 0.3%. This is why HBM bandwidth, not TFLOPS, determines decode throughput.

The Arithmetic Intensity of Decode

The arithmetic intensity of a single-batch decode step is 2 FLOP / B bytes = 2/B FLOP/byte, where B is bytes per parameter (two FLOPs per parameter: one multiply, one add). At FP16 (B = 2), intensity is 1 FLOP/byte. The H100's ridge point is 296 FLOP/byte (FP16), so decode sits roughly 300x below it. No amount of compute optimization helps — only bandwidth or reduced weight size (quantization) matters.

Batch Size: Trading Latency for Throughput

At batch size b, the same weight read serves b tokens:

Arithmetic intensity = 2b / B FLOP/byte

To become compute-bound on H100 FP16: 2b/2 ≥ 296, so b ≥ 296. In practice, KV cache memory limits batch size well before this point. For a 70B-class model with 4K context, the KV cache per sequence is roughly 10 GB at FP16 with full multi-head attention (an order of magnitude less with grouped-query attention) — a handful of sequences fills 80 GB of HBM (and the model's 140 GB of weights need multi-GPU anyway).
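The ridge-point comparison can be written down directly; a sketch using the H100 figures from this section (function name is mine):

```python
def ridge_point_flop_per_byte(tflops, gb_per_s):
    # Arithmetic intensity at which a GPU stops being memory-bound
    return (tflops * 1e12) / (gb_per_s * 1e9)

h100_ridge = ridge_point_flop_per_byte(990, 3350)   # ~295.5 FLOP/byte
decode_intensity = 2 / 2      # 2 FLOP per parameter / 2 bytes at FP16
batch_crossover = h100_ridge * 2 / 2  # solve 2b/B >= ridge for b, B = 2
print(f"ridge: {h100_ridge:.0f} FLOP/byte, batch-1 decode: {decode_intensity:.1f}")
print(f"compute-bound at batch >= {batch_crossover:.0f}")
```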

📊

Decode Throughput vs Batch Size (70B FP8, Single H100)

| Batch Size | Weight Read Time | Compute Time | Bottleneck | Tokens/s | Per-user Latency |
|---|---|---|---|---|---|
| 1 | 20.9 ms | 0.07 ms | Memory | 48 | 20.9 ms/tok |
| 8 | 20.9 ms | 0.56 ms | Memory | 383 | 20.9 ms/tok |
| 32 | 20.9 ms | 2.24 ms | Memory | 1,531 | 20.9 ms/tok |
| 128 | 20.9 ms | 8.96 ms | Memory | 6,124 | 20.9 ms/tok |
| 512 | 20.9 ms | 35.8 ms | Compute | 14,300 | 35.8 ms/tok |

Note: FP8 halves weight size to 70 GB. Weight read time is 70 GB / 3,350 GB/s = 20.9 ms regardless of batch size. Compute time scales linearly with batch. At batch 512, compute time exceeds memory time.
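The rows above follow from a two-term model: a fixed weight-read time plus a compute time that scales with batch. A sketch (the ~2,000 dense FP8 TFLOPS figure is my assumption, close to H100 spec):

```python
HBM_GB_S = 3350          # H100 HBM3 bandwidth
WEIGHTS_GB = 70          # 70B params at FP8 (1 byte each)
FP8_TFLOPS = 2000        # assumed dense FP8 throughput

mem_ms = WEIGHTS_GB / HBM_GB_S * 1e3   # ~20.9 ms, independent of batch
for batch in (1, 8, 32, 128, 512):
    compute_ms = batch * 2 * 70e9 / (FP8_TFLOPS * 1e12) * 1e3
    step_ms = max(mem_ms, compute_ms)  # whichever term dominates sets latency
    bound = "compute" if compute_ms > mem_ms else "memory"
    print(f"batch {batch:3d}: {batch / step_ms * 1e3:6,.0f} tok/s ({bound}-bound)")
```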

The Capacity Cliff

Bandwidth only matters if the model fits in memory. If weights must be split across GPUs, NVLink bandwidth replaces HBM bandwidth as the bottleneck for the inter-GPU portion of each layer.

A 405B parameter model (Llama 3) at FP16 requires 810 GB. On H100s (80 GB each), this needs a minimum of 11 GPUs. In practice, tensor parallelism across 8 GPUs in a single node is standard, with pipeline parallelism across nodes. Each tensor-parallel step requires an all-reduce over NVLink (900 GB/s bidirectional per GPU), which is 3.7x slower than local HBM bandwidth.

On B200s (192 GB each), the same model needs only 5 GPUs at FP16, or fits on 2 GPUs at FP4 (203 GB total, ~101 GB per GPU). Fewer GPUs means less communication overhead.

Minimum GPUs for a 405B Model (various precisions)

| Configuration | Model Size / GPU Capacity | GPUs Required |
|---|---|---|
| H100 FP16 | 810 GB / 80 GB | 11 |
| H100 FP8 | 405 GB / 80 GB | 6 |
| H200 FP8 | 405 GB / 141 GB | 3 |
| B200 FP8 | 405 GB / 192 GB | 3 |
| B200 FP4 | 203 GB / 192 GB | 2 |

Power Consumption of HBM

HBM power is not negligible. Each HBM stack consumes 4.5-10W depending on generation and data rate. On a B200 with 8 stacks:

8 × 10 W = 80 W

That is 8% of the B200’s 1,000W TDP dedicated to memory alone. The power breakdown matters because HBM power scales with bandwidth utilization — when the memory is being read at full bandwidth (during decode), it draws peak power.

📊

HBM Power Budget per GPU

| GPU | Stacks | Per-stack Power | Total HBM Power | GPU TDP | HBM % of TDP |
|---|---|---|---|---|---|
| V100 | 4 | ~4.5W | ~18W | 300W | 6% |
| A100 | 5 | ~5.5W | ~28W | 400W | 7% |
| H100 | 5 | ~7.5W | ~38W | 700W | 5% |
| H200 | 6 | ~9W | ~54W | 700W | 8% |
| B200 | 8 | ~10W | ~80W | 1,000W | 8% |

Note: Power figures are approximate and vary with utilization. TDP is the maximum board power limit.

Energy per Byte

A useful metric for comparing HBM generations is energy per byte transferred:

pJ/bit = stack power (mW) / stack bandwidth (Gbps)

📊

HBM Energy Efficiency (pJ/bit)

| HBM Gen | Stack Power | Stack BW | pJ/bit | Improvement vs Prior |
|---|---|---|---|---|
| HBM2 | 4.5W | 1,792 Gbps | 2.51 | baseline |
| HBM2e | 5.5W | 3,277 Gbps | 1.68 | 1.49x better |
| HBM3 | 7.5W | 5,356 Gbps | 1.40 | 1.20x better |
| HBM3e | 10W | 8,192 Gbps | 1.22 | 1.15x better |

Note: Energy efficiency improves each generation but at a decreasing rate. Process shrinks and circuit-level optimizations drive these gains.
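The pJ/bit column is a direct unit conversion (mW per Gbps equals pJ per bit), reproducible from the deployed per-stack figures:

```python
def pj_per_bit(stack_power_w, stack_bw_gbps):
    # mW / Gbps = 1e-3 W / 1e9 bit/s = 1e-12 J/bit = pJ/bit
    return stack_power_w * 1e3 / stack_bw_gbps

GENERATIONS = {"HBM2": (4.5, 1792), "HBM2e": (5.5, 3277),
               "HBM3": (7.5, 5356), "HBM3e": (10.0, 8192)}
for gen, (watts, gbps) in GENERATIONS.items():
    print(f"{gen}: {pj_per_bit(watts, gbps):.2f} pJ/bit")
```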

Energy efficiency improvements are slowing down — each generation delivers roughly 15-50% better pJ/bit. This means power envelopes constrain future bandwidth scaling. A hypothetical HBM4 at 12-16 Gbps per pin with 16-hi stacking would push per-stack power above 15W, and total HBM power above 120W for an 8-stack configuration.

Cost of HBM

HBM is the single most expensive component in a data center GPU. As of 2024-2025, approximate per-GB costs are:

📊

Approximate HBM Cost (2024-2025)

| HBM Gen | Cost per GB | Total Cost per GPU | GPU ASP | HBM % of GPU Cost |
|---|---|---|---|---|
| HBM2 (V100 era) | ~$15-20 | ~$480-640 (32 GB) | ~$8,000 | ~6-8% |
| HBM2e (A100 era) | ~$15-20 | ~$1,200-1,600 (80 GB) | ~$12,000 | ~10-13% |
| HBM3 (H100 era) | ~$25-35 | ~$2,000-2,800 (80 GB) | ~$30,000 | ~7-9% |
| HBM3e (B200 era) | ~$30-40 | ~$5,760-7,680 (192 GB) | ~$40,000 | ~14-19% |

Note: Costs are approximate industry estimates. GPU ASP includes the full module, not just the die. Supply constraints in 2024-2025 have pushed HBM3e prices to the high end.

HBM cost is driven by two factors: manufacturing yield (stacking 12 dies requires all 12 to be functional) and supply constraints (SK Hynix, Samsung, and Micron cannot build new HBM fabs fast enough to meet AI demand). The TSV process adds approximately $2-3 per GB to the base DRAM cost, and 12-hi stacking has lower yield than 8-hi due to the cumulative probability of defects.

⚠️ Supply Economics

As of early 2025, HBM3e supply is entirely allocated to NVIDIA and AMD through 2026. SK Hynix produces approximately 60-70% of HBM supply, with Samsung and Micron splitting the remainder. Capacity expansion takes 18-24 months. This supply constraint, more than any technical factor, determines GPU pricing and availability.

Measuring Actual vs Theoretical Bandwidth

CUDA Bandwidth Test

The simplest bandwidth measurement uses cudaMemcpy:

#include <cuda_runtime.h>
#include <cstdio>

void measure_bandwidth(size_t size_mb) {
    size_t size = size_mb * 1024 * 1024;
    void *d_src, *d_dst;
    cudaMalloc(&d_src, size);
    cudaMalloc(&d_dst, size);
    cudaMemset(d_src, 1, size);

    // Warm up
    cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
    cudaDeviceSynchronize();

    // Measure
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int iterations = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iterations; i++) {
        cudaMemcpy(d_dst, d_src, size, cudaMemcpyDeviceToDevice);
    }
    cudaEventRecord(stop);
    cudaDeviceSynchronize();

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double bandwidth = (2.0 * size * iterations) / (ms / 1000.0) / 1e9;
    // Factor of 2: one read + one write

    printf("Size: %zu MB, Time: %.2f ms, Bandwidth: %.1f GB/s\n",
           size_mb, ms / iterations, bandwidth);

    cudaFree(d_src);
    cudaFree(d_dst);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main() {
    printf("HBM Bandwidth Measurement (device-to-device copy)\n");
    printf("================================================\n");
    measure_bandwidth(1);
    measure_bandwidth(16);
    measure_bandwidth(256);
    measure_bandwidth(1024);
    measure_bandwidth(4096);
    return 0;
}

Kernel-Level Bandwidth Measurement

A more accurate measurement uses a custom kernel that forces specific access patterns:

__global__ void read_bandwidth_kernel(const float4 *input, float4 *output, size_t n) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = blockDim.x * gridDim.x;

    float4 sum = make_float4(0, 0, 0, 0);
    for (size_t i = idx; i < n; i += stride) {
        float4 val = input[i];
        sum.x += val.x;
        sum.y += val.y;
        sum.z += val.z;
        sum.w += val.w;
    }
    // Data-dependent store (never taken for zeroed input) so the compiler
    // cannot prove each thread's loads dead and eliminate them
    if (sum.x + sum.y + sum.z + sum.w < 0.0f) output[0] = sum;
}

__global__ void write_bandwidth_kernel(float4 *output, size_t n) {
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = blockDim.x * gridDim.x;

    float4 val = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    for (size_t i = idx; i < n; i += stride) {
        output[i] = val;
    }
}

void measure_read_write_separately(size_t size_mb) {
    size_t n_float4 = (size_mb * 1024 * 1024) / sizeof(float4);
    size_t size = n_float4 * sizeof(float4);

    float4 *d_buf, *d_out;
    cudaMalloc(&d_buf, size);
    cudaMalloc(&d_out, sizeof(float4));
    cudaMemset(d_buf, 0, size);

    int blocks = 256;
    int threads = 256;

    // Measure read bandwidth
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 20; i++)
        read_bandwidth_kernel<<<blocks, threads>>>(d_buf, d_out, n_float4);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double read_bw = (double)size * 20 / (ms / 1000.0) / 1e9;

    // Measure write bandwidth
    cudaEventRecord(start);
    for (int i = 0; i < 20; i++)
        write_bandwidth_kernel<<<blocks, threads>>>(d_buf, n_float4);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();

    cudaEventElapsedTime(&ms, start, stop);
    double write_bw = (double)size * 20 / (ms / 1000.0) / 1e9;

    printf("Size: %zu MB | Read: %.1f GB/s | Write: %.1f GB/s\n",
           size_mb, read_bw, write_bw);

    cudaFree(d_buf);
    cudaFree(d_out);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

Expected Results

📊

Typical Measured vs Theoretical HBM Bandwidth

| GPU | Theoretical BW | Measured (large copy) | Utilization % | Measured (kernel read) |
|---|---|---|---|---|
| V100 | 900 GB/s | 830-850 GB/s | 92-94% | 780-820 GB/s |
| A100 | 2,039 GB/s | 1,900-1,950 GB/s | 93-96% | 1,800-1,880 GB/s |
| H100 | 3,350 GB/s | 3,100-3,200 GB/s | 93-96% | 2,900-3,100 GB/s |
| H200 | 4,800 GB/s | 4,400-4,600 GB/s | 92-96% | 4,100-4,400 GB/s |

Note: Large sequential copies achieve 93-96% utilization. Kernel-level reads achieve 88-95% depending on access pattern quality. Non-coalesced access can drop to 30-50%.

The gap between theoretical and measured bandwidth comes from:

  1. ECC overhead: The A100 80GB and H100 enable ECC by default, consuming approximately 6% of raw bandwidth for error correction codes
  2. Memory controller overhead: Address translation, refresh cycles, and bank conflicts consume 2-4% of bandwidth
  3. Non-ideal access patterns: Any deviation from perfectly coalesced, sequential, 128-byte-aligned access degrades throughput

Using nvidia-smi for Runtime Monitoring

# Real-time HBM bandwidth utilization
nvidia-smi dmon -d 1 -s u

# Output columns include: gpu sm mem enc dec (all percentages)
# The "mem" column is the percent of time the memory controller was busy
# over the sample interval (a bandwidth proxy, not capacity used)

# For more detail:
nvidia-smi --query-gpu=memory.used,memory.total,memory.free,utilization.memory \
  --format=csv -l 1

Using Nsight Compute for Per-Kernel Analysis

# Measure HBM throughput for a specific kernel
ncu --metrics \
  dram__bytes_read.sum,\
  dram__bytes_write.sum,\
  dram__throughput.avg.pct_of_peak_sustained_elapsed,\
  dram__sectors_read.sum,\
  dram__sectors_write.sum \
  ./my_binary

# dram__throughput shows what percentage of peak HBM bandwidth the kernel achieves
# Target: above 80% for memory-bound kernels
# Below 60% indicates access pattern problems (non-coalesced, bank conflicts)

ℹ️ Bandwidth Utilization Rule of Thumb

A well-optimized memory-bound kernel should achieve 80-90% of theoretical HBM bandwidth. If you measure below 70%, check for: (1) non-coalesced global memory access, (2) small transfer sizes that do not saturate the memory bus, (3) excessive L2 cache thrashing from random access patterns, (4) memory-mapped regions that cross page boundaries.

HBM vs Capacity: The Tradeoff

NVIDIA’s product lineup reveals the capacity-vs-bandwidth tradeoff explicitly:

  • H100: 80 GB at 3,350 GB/s — optimized for compute-heavy workloads
  • H200: 141 GB at 4,800 GB/s — same die, more and faster memory
  • B200: 192 GB at 8,000 GB/s — dual die with maximum memory

For inference serving, the choice between these GPUs depends on model size:

If the model fits on 1 GPU: throughput ∝ bandwidth

If the model spans N GPUs: throughput ∝ bandwidth / (1 + comm_overhead / N)

The communication overhead term makes single-GPU serving dramatically more efficient. An H200 serving a 70B model at FP8 (70 GB, fits in 141 GB) will outperform two H100s serving the same model (70 GB split across 2x 80 GB) despite lower total bandwidth (4,800 vs 6,700 GB/s) because there is zero NVLink communication overhead.
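A toy version of that comparison, with an invented per-token all-reduce cost (the 8 ms figure is purely illustrative; real overhead depends on layer count, hidden size, and NVLink efficiency):

```python
def decode_ms_per_token(weights_gb, hbm_gb_s, n_gpus, allreduce_ms=0.0):
    # Sharded weights: each GPU streams its slice in parallel; multi-GPU
    # setups then pay tensor-parallel communication on top
    read_ms = weights_gb / n_gpus / hbm_gb_s * 1e3
    return read_ms + (allreduce_ms if n_gpus > 1 else 0.0)

# 70B at FP8 = 70 GB of weights
h200_single = decode_ms_per_token(70, 4800, 1)
h100_pair = decode_ms_per_token(70, 3350, 2, allreduce_ms=8.0)  # assumed
print(f"1x H200: {h200_single:.1f} ms/token, 2x H100: {h100_pair:.1f} ms/token")
```

With any per-token communication cost above roughly 4 ms, the single H200 wins despite lower aggregate bandwidth, which is the point of the paragraph above.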

Future: HBM4 and Beyond

HBM4 (expected 2025-2026) will introduce several changes:

  • Logic base die: The bottom die of the stack will be a logic die manufactured on an advanced process node (potentially 5nm or better), containing ECC engines, built-in PHYs, and possibly compute-near-memory logic
  • Pin rates: 12-16 Gbps per pin
  • 2048-bit interface per stack: Doubling the bus width per stack
  • Projected per-stack bandwidth: 1.5-2 TB/s
  • 16-hi stacking: 16 DRAM dies per stack (if yield allows)

With 8 stacks of HBM4 at 2 TB/s each, a single GPU could reach 16 TB/s of memory bandwidth. At that rate, a 70B FP4 model (35 GB) would decode at:

16,000 / (2 × 35) ≈ 229 tokens/s (batch = 1)

That approaches human reading speed for a 70B model — a meaningful perceptual threshold for interactive AI.

The HBM bandwidth trajectory shows no signs of slowing. Each generation delivers 1.5-2.4x more bandwidth per stack, and package-level innovations (more stacks, wider interfaces) compound the improvement. For LLM inference, this is the single most impactful hardware trend.