Part of series: GPU Hardware & AI Accelerators (9 of 30)

PCIe Gen5 x16 delivers 64 GB/s of raw bandwidth in each direction (about 63 GB/s after encoding overhead). HBM3 on an H100 delivers 3350 GB/s: roughly 53:1 on paper, and more than 120:1 against the ~27 GB/s typically achieved in practice. Any data that must cross the PCIe bus between CPU and GPU therefore moves at around 1% of the rate the GPU can consume it. For compute-bound workloads where data stays on the GPU, PCIe is irrelevant. For workloads that transfer data across PCIe (model loading, KV cache offloading, multi-GPU communication without NVLink, or prefill-decode disaggregation), PCIe is the bottleneck.
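The gap is even wider against real transfer rates; a quick sketch using the achieved bandwidth measured later in this post:

```python
# HBM vs achieved PCIe: how much faster is on-package memory?
H100_HBM3_GBS = 3350.0    # GB/s, H100 SXM HBM3
PCIE_ACHIEVED_GBS = 27.0  # GB/s, typical measured Gen5 x16 (pinned memory)

ratio = H100_HBM3_GBS / PCIE_ACHIEVED_GBS
print(f"HBM3 is {ratio:.0f}x faster than achieved PCIe Gen5 x16")  # ~124x
```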

This post documents the PCIe bandwidth pipeline, measures real-world transfer rates (which fall well short of the theoretical peak), identifies the scenarios where PCIe limits inference throughput, and implements bandwidth optimization techniques.

PCIe Bandwidth Fundamentals

Theoretical vs Achieved Bandwidth


PCIe Bandwidth by Generation (x16 lanes)

| Generation  | Per-Lane Rate | x16 Raw BW | Encoding Overhead | Effective BW (one dir) |
|-------------|---------------|------------|-------------------|------------------------|
| Gen3        | 8 GT/s        | 16 GB/s    | 128b/130b         | 15.75 GB/s             |
| Gen4        | 16 GT/s       | 32 GB/s    | 128b/130b         | 31.5 GB/s              |
| Gen5        | 32 GT/s       | 64 GB/s    | 128b/130b         | 63 GB/s                |
| Gen6 (2025) | 64 GT/s       | 128 GB/s   | 242B/256B FLIT    | 121 GB/s               |

Note: PCIe Gen5 x16 theoretical one-direction bandwidth is 63 GB/s after encoding overhead. Achieved bandwidth is typically 25-28 GB/s in each direction (H2D and D2H) due to protocol overhead and CPU-side bottlenecks.
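The "Effective BW" column follows directly from the encoding scheme; a quick check of the table:

```python
def effective_bw(raw_gbs, payload_bits, total_bits):
    """Effective bandwidth after line-encoding overhead."""
    return raw_gbs * payload_bits / total_bits

# x16 raw bandwidth = per-lane GT/s * 16 lanes / 8 bits per byte
print(effective_bw(16.0, 128, 130))   # Gen3: ~15.75 GB/s
print(effective_bw(32.0, 128, 130))   # Gen4: ~31.5 GB/s
print(effective_bw(64.0, 128, 130))   # Gen5: ~63.0 GB/s
print(effective_bw(128.0, 242, 256))  # Gen6 FLIT: 121.0 GB/s
```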

Why Achieved Bandwidth Is Below Theoretical

def pcie_overhead_analysis():
    """Estimate PCIe bandwidth lost to protocol and platform overheads."""
    raw_gen5_x16 = 64.0  # GB/s one direction, before encoding

    overheads = {
        "128b/130b encoding": 2 / 130,  # ~1.5%
        "Transaction Layer Packet (TLP) headers": 0.04,  # ~4%
        "Data Link Layer (DLL) overhead": 0.02,  # ~2%
        "Flow control credits": 0.01,  # ~1%
        "CPU-side DMA engine limitations": 0.05,  # ~5%
        "IOMMU translation overhead": 0.03,  # ~3%
        "Interrupt/polling latency": 0.02,  # ~2%
    }

    total_overhead = sum(overheads.values())
    estimated = raw_gen5_x16 * (1 - total_overhead)

    print(f"Raw PCIe Gen5 x16: {raw_gen5_x16:.1f} GB/s")
    print(f"Overhead breakdown:")
    for name, pct in overheads.items():
        print(f"  {name}: {pct*100:.1f}%")
    print(f"Total modeled overhead: {total_overhead*100:.1f}%")
    print(f"Estimated achieved: {estimated:.1f} GB/s")
    # The remaining gap down to the 25-28 GB/s typically measured is
    # platform-specific: root complex, DMA engine count, NUMA placement.
    print(f"Typical measured: 25-28 GB/s (platform limits explain the rest)")

pcie_overhead_analysis()

Measuring PCIe Bandwidth

Pinned vs Pageable Memory

The single most important optimization for PCIe transfers is using pinned (page-locked) memory instead of pageable memory. Pinned memory enables Direct Memory Access (DMA): the GPU DMA engine transfers data directly without CPU involvement. Pageable memory requires the CPU to first copy data into a pinned staging buffer, then DMA it to the GPU.
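A back-of-envelope serial model explains most of the pageable penalty. Assuming the host-side staging memcpy runs at roughly 18 GB/s (an illustrative figure, not a measurement), the two stages combine harmonically and land near the ~11 GB/s pageable bandwidth measured below:

```python
def pageable_effective_bw(memcpy_bw_gbs, pcie_bw_gbs):
    """Serial model: host copies into a staging buffer, then DMA over PCIe.
    Effective bandwidth is the harmonic combination of the two stages."""
    return 1.0 / (1.0 / memcpy_bw_gbs + 1.0 / pcie_bw_gbs)

# Assumed staging memcpy rate; the real value is platform-dependent
print(f"{pageable_effective_bw(18.0, 27.0):.1f} GB/s")  # ~10.8 GB/s
```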

import torch
import time

def measure_pcie_bandwidth(sizes_mb=None, num_iters=20):
    """Measure PCIe bandwidth with pinned and pageable memory."""
    if sizes_mb is None:
        sizes_mb = [1, 10, 100, 500, 1000, 2000, 4000]

    device = 'cuda'

    for size_mb in sizes_mb:
        num_elements = int(size_mb * 1024 * 1024 / 4)  # FP32

        # --- Pageable memory (default) ---
        h_pageable = torch.randn(num_elements, dtype=torch.float32)
        d_tensor = torch.empty(num_elements, device=device,
                                dtype=torch.float32)

        # Warmup
        d_tensor.copy_(h_pageable)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(num_iters):
            d_tensor.copy_(h_pageable)
            torch.cuda.synchronize()
        pageable_time = (time.perf_counter() - start) / num_iters
        pageable_bw = (size_mb / 1024) / pageable_time

        # --- Pinned memory ---
        h_pinned = torch.randn(num_elements, dtype=torch.float32).pin_memory()

        # Warmup
        d_tensor.copy_(h_pinned)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(num_iters):
            d_tensor.copy_(h_pinned)
            torch.cuda.synchronize()
        pinned_time = (time.perf_counter() - start) / num_iters
        pinned_bw = (size_mb / 1024) / pinned_time

        speedup = pinned_bw / pageable_bw
        print(f"Size: {size_mb:>5d} MB | "
              f"Pageable: {pageable_bw:.1f} GB/s | "
              f"Pinned: {pinned_bw:.1f} GB/s | "
              f"Speedup: {speedup:.2f}x")

measure_pcie_bandwidth()

PCIe H2D Transfer Bandwidth: Pinned vs Pageable (PCIe Gen5 x16)

| Transfer Size | Pageable (GB/s) | Pinned (GB/s) | Speedup |
|---------------|-----------------|---------------|---------|
| 1 MB          | 4.2             | 12.5          | 3.0x    |
| 10 MB         | 8.1             | 22.4          | 2.8x    |
| 100 MB        | 10.3            | 26.8          | 2.6x    |
| 500 MB        | 10.8            | 27.2          | 2.5x    |
| 1 GB          | 10.9            | 27.5          | 2.5x    |
| 4 GB          | 11.0            | 27.8          | 2.5x    |
Note: Pinned memory provides 2.5-3x higher bandwidth for large transfers. Small transfers (1 MB) are dominated by latency overhead in both cases.

Async Transfers and Overlap

def async_transfer_demonstration():
    """Demonstrate overlapping PCIe transfers with GPU compute."""
    device = 'cuda'
    size_mb = 500
    num_elements = int(size_mb * 1024 * 1024 / 4)

    # Create two streams
    transfer_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    # Pinned host memory (required for async transfers)
    h_data = torch.randn(num_elements, dtype=torch.float32).pin_memory()
    d_data = torch.empty(num_elements, device=device, dtype=torch.float32)
    d_compute = torch.randn(4096, 4096, device=device, dtype=torch.float32)

    # --- Sequential: transfer then compute ---
    torch.cuda.synchronize()
    start = time.perf_counter()

    d_data.copy_(h_data)  # Transfer on default stream
    torch.cuda.synchronize()
    result = torch.matmul(d_compute, d_compute)  # Compute on default
    torch.cuda.synchronize()

    sequential_time = time.perf_counter() - start

    # --- Overlapped: transfer and compute on separate streams ---
    torch.cuda.synchronize()
    start = time.perf_counter()

    with torch.cuda.stream(transfer_stream):
        d_data.copy_(h_data, non_blocking=True)

    with torch.cuda.stream(compute_stream):
        result = torch.matmul(d_compute, d_compute)

    torch.cuda.synchronize()
    overlapped_time = time.perf_counter() - start

    print(f"Sequential:  {sequential_time*1000:.1f} ms")
    print(f"Overlapped:  {overlapped_time*1000:.1f} ms")
    print(f"Speedup: {sequential_time/overlapped_time:.2f}x")

async_transfer_demonstration()
โš ๏ธ non_blocking=True Requires Pinned Memory

tensor.copy_(source, non_blocking=True) only performs an asynchronous transfer if the source is in pinned memory. With pageable memory, non_blocking=True is silently ignored and the transfer is synchronous (the CPU blocks until the transfer completes). Always verify that your host tensors are pinned when using async transfers.

When PCIe Is the Bottleneck

Scenario 1: Model Loading

def model_loading_time(model_params_B, dtype_bytes=2,
                        pcie_bw_GBs=27):
    """Time to load a model from CPU to GPU over PCIe."""
    model_bytes = model_params_B * 1e9 * dtype_bytes
    model_gb = model_bytes / 1e9

    load_time = model_gb / pcie_bw_GBs

    print(f"Model: {model_params_B}B params ({dtype_bytes} bytes/param)")
    print(f"Size: {model_gb:.1f} GB")
    print(f"PCIe Gen5 transfer: {load_time:.1f} seconds")
    print(f"PCIe Gen4 transfer: {model_gb / 13.5:.1f} seconds")

    # Compare to loading from NVMe SSD
    nvme_bw = 7.0  # GB/s (Gen4 NVMe)
    print(f"NVMe SSD read: {model_gb / nvme_bw:.1f} seconds "
          f"(SSD is the bottleneck, not PCIe)")

model_loading_time(7, 2)    # 7B FP16
print()
model_loading_time(70, 2)   # 70B FP16
print()
model_loading_time(70, 1)   # 70B FP8
print()
model_loading_time(405, 1)  # 405B FP8

Scenario 2: KV Cache Offloading

When GPU memory is full, some systems offload KV cache to CPU memory and bring it back when needed.

def kv_cache_offload_analysis(model_params_B=70, num_layers=80,
                                kv_heads=8, head_dim=128,
                                seq_len=4096, batch_size=1):
    """Analyze KV cache offload latency."""
    # KV cache per layer: 2 * batch * heads * seq * dim * 2 bytes (FP16)
    kv_per_layer = 2 * batch_size * kv_heads * seq_len * head_dim * 2
    kv_total = kv_per_layer * num_layers

    print(f"KV cache per layer: {kv_per_layer / 1e6:.1f} MB")
    print(f"KV cache total: {kv_total / 1e6:.1f} MB")
    print()

    # Time to swap one layer's KV cache over PCIe
    pcie_bw = 27e9  # 27 GB/s
    swap_time_per_layer = kv_per_layer / pcie_bw * 1000  # ms

    # Time for GPU to process one layer (decode, batch=1)
    # Approximate: weight_bytes / HBM_BW
    weight_per_layer = model_params_B * 1e9 * 2 / num_layers
    hbm_bw = 3350e9
    compute_time_per_layer = weight_per_layer / hbm_bw * 1000  # ms

    print(f"KV swap time per layer: {swap_time_per_layer:.2f} ms "
          f"(PCIe: 27 GB/s)")
    print(f"GPU compute per layer:  {compute_time_per_layer:.2f} ms "
          f"(HBM: 3350 GB/s)")
    print(f"Swap overhead: {swap_time_per_layer/compute_time_per_layer*100:.0f}%")
    print()

    # Can we overlap swap with compute?
    if swap_time_per_layer < compute_time_per_layer:
        print("Swap can be fully hidden behind compute (prefetch next layer)")
    else:
        overhead = swap_time_per_layer - compute_time_per_layer
        print(f"Swap CANNOT be hidden: {overhead:.2f} ms exposed per layer")
        print(f"Total overhead for {num_layers} layers: "
              f"{overhead * num_layers:.0f} ms")

kv_cache_offload_analysis()
print()
kv_cache_offload_analysis(batch_size=32)  # Larger batch = larger KV

Scenario 3: Multi-GPU Communication Without NVLink

def multi_gpu_communication_analysis():
    """Compare PCIe vs NVLink for multi-GPU inference."""
    # Tensor parallelism requires all-reduce after each layer
    # All-reduce message size: hidden_dim * batch * seq * dtype_bytes
    # A ring all-reduce moves 2*(N-1)/N times the message size per GPU;
    # that factor is applied separately below.

    configs = [
        ("7B, TP=2", 4096, 1, 1, 2),
        ("70B, TP=8", 8192, 1, 1, 8),
        ("70B, TP=8, batch=32", 8192, 32, 1, 8),
    ]

    for name, hidden, batch, seq, tp_degree in configs:
        allreduce_bytes = hidden * batch * seq * 2  # FP16 message size

        pcie_bw = 27  # GB/s (per direction, PCIe Gen5)
        nvlink_bw = 450  # GB/s (per direction, H100 NVLink)

        # All-reduce time (ring algorithm): 2 * (N-1)/N * data / BW
        factor = 2 * (tp_degree - 1) / tp_degree

        pcie_time = factor * allreduce_bytes / (pcie_bw * 1e9) * 1e6  # us
        nvlink_time = factor * allreduce_bytes / (nvlink_bw * 1e9) * 1e6  # us

        print(f"{name}:")
        print(f"  All-reduce size: {allreduce_bytes/1024:.1f} KB")
        print(f"  PCIe Gen5:  {pcie_time:.1f} us")
        print(f"  NVLink:     {nvlink_time:.1f} us")
        print(f"  NVLink speedup: {pcie_time/nvlink_time:.1f}x")
        print()

multi_gpu_communication_analysis()

All-Reduce Latency: PCIe vs NVLink (70B, TP=8, batch=1)

| Interconnect  | Latency                |
|---------------|------------------------|
| PCIe Gen4     | 2.1 us                 |
| PCIe Gen5     | 1.1 us                 |
| NVLink (H100) | 0.07 us (16x faster)   |
⚡ PCIe Multi-GPU Is Viable for Decode

For single-token decode (batch=1), the all-reduce message is small (16 KB for hidden=8192 FP16). Even over PCIe, this takes ~1 us, negligible compared to the per-layer compute time (~50 us per GPU at TP=8). PCIe-connected GPUs work well for decode. For prefill (large batch/sequence), the all-reduce volume scales linearly with token count and PCIe becomes a significant bottleneck.
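The decode/prefill asymmetry above follows directly from the message size. A sketch assuming hidden=8192, FP16 activations, a ring all-reduce at TP=8, and 27 GB/s achieved PCIe:

```python
def allreduce_us(tokens, hidden=8192, tp=8, bw_gbs=27.0):
    """Per-layer ring all-reduce time for FP16 activations over PCIe."""
    msg_bytes = hidden * tokens * 2              # FP16 activation bytes
    wire_bytes = 2 * (tp - 1) / tp * msg_bytes   # ring all-reduce traffic
    return wire_bytes / (bw_gbs * 1e9) * 1e6     # microseconds

# Decode (1 token) vs prefill (thousands of tokens)
for tokens in (1, 32, 1024, 8192):
    print(f"{tokens:>5} tokens: {allreduce_us(tokens):9.1f} us per layer")
```

At one token the all-reduce costs about 1 us per layer; at 8192 prefill tokens it grows to roughly 8.7 ms per layer, which can no longer hide behind compute.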

GPUDirect RDMA: Bypassing the CPU

Architecture

GPUDirect RDMA enables network adapters (InfiniBand, RoCE) to read/write GPU memory directly without going through CPU memory. The data path changes from:

Without GPUDirect: Network adapter -> CPU memory -> PCIe -> GPU memory
With GPUDirect:    Network adapter -> PCIe -> GPU memory (direct)

def gpudirect_analysis():
    """Compare data paths with and without GPUDirect RDMA."""
    payload_gb = 1.0  # 1 GB transfer

    # Without GPUDirect: two copies
    # 1. Network -> CPU memory (network BW limited)
    # 2. CPU memory -> GPU memory (PCIe BW limited)
    network_bw = 50  # 400 Gbps = 50 GB/s InfiniBand
    pcie_bw = 27     # GB/s

    time_no_gd = payload_gb / network_bw + payload_gb / pcie_bw
    # Plus CPU overhead for the copy

    # With GPUDirect: one copy
    # Network -> GPU memory (limited by min of network and PCIe BW)
    gd_bw = min(network_bw, pcie_bw)
    time_gd = payload_gb / gd_bw

    print(f"Without GPUDirect: {time_no_gd*1000:.1f} ms "
          f"(network + PCIe copy)")
    print(f"With GPUDirect:    {time_gd*1000:.1f} ms "
          f"(direct, limited by PCIe)")
    print(f"Speedup: {time_no_gd/time_gd:.2f}x")
    print()
    print("GPUDirect eliminates:")
    print("  - One memory copy (CPU-GPU)")
    print("  - CPU overhead for managing the copy")
    print("  - CPU memory bandwidth consumption")

gpudirect_analysis()

GPUDirect Storage

GPUDirect Storage extends the same concept to NVMe SSDs: load model weights directly from SSD to GPU memory without staging in CPU memory.

def gpudirect_storage_analysis():
    """Analyze model loading with GPUDirect Storage."""
    model_gb = 140  # 70B FP16

    # Traditional: SSD -> CPU -> GPU
    nvme_bw = 7.0   # GB/s (one NVMe Gen4)
    pcie_bw = 27.0   # GB/s

    traditional_time = model_gb / nvme_bw + model_gb / pcie_bw
    # Sequential: SSD read, then PCIe transfer
    # (Can overlap with multiple SSDs but typically SSD is bottleneck)

    # GPUDirect Storage: SSD -> GPU (bypass CPU)
    # Limited by SSD read speed, not PCIe (for single SSD)
    # With RAID/multiple SSDs: can approach PCIe limit
    gds_bw_single_ssd = 6.5  # Slightly below raw NVMe due to protocol
    gds_bw_4_ssds = 24.0     # 4 NVMe SSDs in parallel

    gds_time_1 = model_gb / gds_bw_single_ssd
    gds_time_4 = model_gb / gds_bw_4_ssds

    print(f"Model size: {model_gb} GB")
    print(f"Traditional (SSD -> CPU -> GPU): {traditional_time:.1f} s")
    print(f"GDS 1 SSD:  {gds_time_1:.1f} s")
    print(f"GDS 4 SSDs: {gds_time_4:.1f} s")
    print(f"Speedup (4 SSDs): {traditional_time/gds_time_4:.2f}x")

gpudirect_storage_analysis()

PCIe Bandwidth Optimization Techniques

Technique 1: Transfer Size Optimization

def transfer_size_optimization():
    """Show how transfer size affects achieved PCIe bandwidth."""
    # Small transfers are dominated by latency overhead
    # PCIe transaction latency: ~1 us (initiation + acknowledgment)
    # At 27 GB/s, 1 us transfers 27 KB

    sizes_kb = [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536]
    pcie_bw_peak = 27.0  # GB/s
    pcie_latency_us = 1.5  # Round-trip latency

    print(f"{'Size (KB)':<12} {'Transfer Time':<15} "
          f"{'Achieved BW':<15} {'Efficiency'}")
    for size_kb in sizes_kb:
        size_gb = size_kb / 1024 / 1024
        ideal_time_us = size_gb / pcie_bw_peak * 1e6
        actual_time_us = ideal_time_us + pcie_latency_us
        achieved_bw = size_gb / (actual_time_us / 1e6)
        efficiency = achieved_bw / pcie_bw_peak

        print(f"{size_kb:<12} {actual_time_us:>8.2f} us     "
              f"{achieved_bw:>8.2f} GB/s    "
              f"{efficiency*100:>6.1f}%")

transfer_size_optimization()

Technique 2: Chunked Transfers with Pipeline

def chunked_pipeline_transfer(total_gb=10, chunk_mb=256,
                               pcie_bw_GBs=27):
    """Pipeline model loading in chunks to overlap transfer with initialization."""
    total_mb = total_gb * 1024
    num_chunks = int(total_mb / chunk_mb)
    chunk_gb = chunk_mb / 1024

    transfer_time_per_chunk = chunk_gb / pcie_bw_GBs * 1000  # ms
    init_time_per_chunk = transfer_time_per_chunk * 0.3  # 30% of transfer

    # Sequential
    sequential_ms = num_chunks * (transfer_time_per_chunk +
                                   init_time_per_chunk)

    # Pipelined: overlap transfer[i+1] with init[i]
    # Total time = transfer_time * num_chunks + init_time (last chunk)
    pipelined_ms = (transfer_time_per_chunk * num_chunks +
                     init_time_per_chunk)

    print(f"Total data: {total_gb} GB in {num_chunks} chunks")
    print(f"Sequential:  {sequential_ms/1000:.1f} s")
    print(f"Pipelined:   {pipelined_ms/1000:.1f} s")
    print(f"Speedup: {sequential_ms/pipelined_ms:.2f}x")

chunked_pipeline_transfer()

PCIe Optimization Techniques Impact

| Technique                      | Baseline BW (GB/s) | Optimized BW (GB/s) | Improvement    |
|--------------------------------|--------------------|---------------------|----------------|
| Pageable -> Pinned memory      | 11                 | 27                  | 2.5x           |
| Synchronous -> Async + overlap | 27 (serial)        | 27 + compute        | Hides transfer |
| Small chunks -> Large chunks   | 5-15               | 27                  | 2-5x           |
| GPUDirect RDMA (network)       | 11 (via CPU)       | 27 (direct)         | 2.5x           |
| GPUDirect Storage (SSD)        | 7 (via CPU)        | 24 (4x SSD)         | 3.4x           |
Note: Pinned memory is the single most impactful optimization. Async transfers provide the second largest benefit by hiding transfer latency behind compute.

Summary

PCIe Gen5 x16 provides about 27 GB/s of achieved unidirectional bandwidth, roughly 124x less than the 3350 GB/s of HBM3 on the same GPU. For steady-state inference where model weights and KV cache reside on GPU, PCIe is irrelevant. PCIe becomes the bottleneck in three scenarios: model loading (a 140 GB model takes 5+ seconds over PCIe), KV cache offloading (each offloaded layer adds 0.5-2 ms of latency), and multi-GPU inference without NVLink (all-reduce data crosses PCIe for every layer).

The optimization hierarchy: pinned memory (2.5x over pageable), async transfers with compute overlap (hides transfer time), large transfer sizes (avoids per-transaction overhead), and GPUDirect (bypasses the CPU for network and storage transfers). For multi-GPU, NVLink provides 16x the bandwidth of PCIe; any serious multi-GPU inference deployment uses NVLink-connected GPUs (HGX, DGX systems).
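As a closing sketch, this hierarchy can be condensed into a rough triage helper; the utilization thresholds are judgment calls based on this post's numbers, not universal constants:

```python
def pcie_bottleneck_check(traffic_gbs, pcie_achieved_gbs=27.0):
    """Rough triage: does sustained CPU-GPU traffic exceed what PCIe delivers?
    Thresholds are illustrative, not universal."""
    utilization = traffic_gbs / pcie_achieved_gbs
    if utilization < 0.1:
        return "PCIe irrelevant"
    if utilization < 0.7:
        return "PCIe measurable; overlap transfers with compute"
    return "PCIe bottleneck; need NVLink, GPUDirect, or less traffic"

print(pcie_bottleneck_check(1.0))   # steady-state decode, weights resident
print(pcie_bottleneck_check(15.0))  # heavy KV cache offload traffic
print(pcie_bottleneck_check(40.0))  # demand exceeds the link entirely
```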