PCIe Gen5 x16 delivers 128 GB/s of bidirectional bandwidth (64 GB/s in each direction). HBM3 on an H100 delivers 3,350 GB/s. Even against the theoretical per-direction peak the ratio is about 52:1, and against the ~27 GB/s typically achieved in practice it is roughly 124:1. Any data that must cross the PCIe bus between CPU and GPU therefore arrives at 1-2% of the speed the GPU can process it. For compute-bound workloads where data stays on the GPU, PCIe is irrelevant. For workloads that transfer data across PCIe (model loading, KV cache offloading, multi-GPU communication without NVLink, or prefill-decode disaggregation), PCIe is the bottleneck.
This post documents the PCIe bandwidth pipeline, measures real-world transfer rates (which land well below the theoretical peak), identifies the scenarios where PCIe limits inference throughput, and implements bandwidth optimization techniques.
PCIe Bandwidth Fundamentals
Theoretical vs Achieved Bandwidth
PCIe Bandwidth by Generation (x16 lanes)
| Generation | Per-Lane Rate | x16 Raw BW | Encoding Overhead | Effective BW (one dir) |
|---|---|---|---|---|
| Gen3 | 8 GT/s | 16 GB/s | 128b/130b | 15.75 GB/s |
| Gen4 | 16 GT/s | 32 GB/s | 128b/130b | 31.5 GB/s |
| Gen5 | 32 GT/s | 64 GB/s | 128b/130b | 63 GB/s |
| Gen6 (2025) | 64 GT/s | 128 GB/s | 242b/256b | 121 GB/s |
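The effective-bandwidth column follows directly from lane rate x lane count x line-code efficiency. A quick sanity check of the table (a sketch; one PCIe transfer carries one bit per lane at the signaling rate):

```python
def effective_bw_gbps(rate_gt_s: float, lanes: int,
                      payload_bits: int, total_bits: int) -> float:
    """Effective one-direction bandwidth in GB/s for a PCIe link.

    rate_gt_s: per-lane signaling rate in GT/s (1 transfer = 1 bit).
    payload_bits/total_bits: line-code ratio (e.g. 128/130, 242/256).
    """
    bits_per_sec = rate_gt_s * 1e9 * lanes * payload_bits / total_bits
    return bits_per_sec / 8 / 1e9  # bits -> bytes -> GB/s

# Reproduce the table rows (x16 lanes)
for gen, rate, p, t in [("Gen3", 8, 128, 130), ("Gen4", 16, 128, 130),
                        ("Gen5", 32, 128, 130), ("Gen6", 64, 242, 256)]:
    print(f"{gen}: {effective_bw_gbps(rate, 16, p, t):.2f} GB/s")
```

The jump from 128b/130b to the FLIT-based 242b/256b encoding is why Gen6 loses proportionally less to encoding than earlier generations.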
Why Achieved Bandwidth Is Below Theoretical
```python
def pcie_overhead_analysis():
    """Analyze PCIe bandwidth overhead sources."""
    raw_gen5_x16 = 64.0  # GB/s one direction, before encoding
    overheads = {
        "128b/130b encoding": 2 / 130,                   # ~1.5%
        "Transaction Layer Packet (TLP) headers": 0.04,  # ~4%
        "Data Link Layer (DLL) overhead": 0.02,          # ~2%
        "Flow control credits": 0.01,                    # ~1%
        "CPU-side DMA engine limitations": 0.05,         # ~5%
        "IOMMU translation overhead": 0.03,              # ~3%
        "Interrupt/polling latency": 0.02,               # ~2%
    }
    total_overhead = sum(overheads.values())
    achieved = raw_gen5_x16 * (1 - total_overhead)
    print(f"Raw PCIe Gen5 x16: {raw_gen5_x16:.1f} GB/s")
    print("Overhead breakdown:")
    for name, pct in overheads.items():
        print(f"  {name}: {pct*100:.1f}%")
    print(f"Total overhead: {total_overhead*100:.1f}%")
    print(f"Estimated achieved: {achieved:.1f} GB/s")
    # The simple additive model above still overestimates: host memory
    # paths, chipset topology, and DMA engine limits push real systems
    # below what protocol overheads alone would suggest.
    print("Typical measured: 25-28 GB/s")

pcie_overhead_analysis()
```
Measuring PCIe Bandwidth
Pinned vs Pageable Memory
The single most important optimization for PCIe transfers is using pinned (page-locked) memory instead of pageable memory. Pinned memory enables Direct Memory Access (DMA): the GPU's DMA engine transfers data directly without CPU involvement. Pageable memory requires the CPU to first copy data from the page cache into a pinned staging buffer, then DMA it to the GPU.
```python
import torch
import time

def measure_pcie_bandwidth(sizes_mb=None, num_iters=20):
    """Measure PCIe bandwidth with pinned and pageable memory."""
    if sizes_mb is None:
        sizes_mb = [1, 10, 100, 500, 1000, 2000, 4000]
    device = 'cuda'
    for size_mb in sizes_mb:
        num_elements = int(size_mb * 1024 * 1024 / 4)  # FP32
        # --- Pageable memory (default) ---
        h_pageable = torch.randn(num_elements, dtype=torch.float32)
        d_tensor = torch.empty(num_elements, device=device,
                               dtype=torch.float32)
        # Warmup
        d_tensor.copy_(h_pageable)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(num_iters):
            d_tensor.copy_(h_pageable)
        torch.cuda.synchronize()
        pageable_time = (time.perf_counter() - start) / num_iters
        pageable_bw = (size_mb / 1024) / pageable_time
        # --- Pinned memory ---
        h_pinned = torch.randn(num_elements, dtype=torch.float32).pin_memory()
        # Warmup
        d_tensor.copy_(h_pinned)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(num_iters):
            d_tensor.copy_(h_pinned)
        torch.cuda.synchronize()
        pinned_time = (time.perf_counter() - start) / num_iters
        pinned_bw = (size_mb / 1024) / pinned_time
        speedup = pinned_bw / pageable_bw
        print(f"Size: {size_mb:>5d} MB | "
              f"Pageable: {pageable_bw:.1f} GB/s | "
              f"Pinned: {pinned_bw:.1f} GB/s | "
              f"Speedup: {speedup:.2f}x")

measure_pcie_bandwidth()
```
PCIe H2D Transfer Bandwidth: Pinned vs Pageable (PCIe Gen5 x16)
| Transfer Size | Pageable (GB/s) | Pinned (GB/s) | Speedup |
|---|---|---|---|
| 1 MB | 4.2 | 12.5 | 3.0x |
| 10 MB | 8.1 | 22.4 | 2.8x |
| 100 MB | 10.3 | 26.8 | 2.6x |
| 500 MB | 10.8 | 27.2 | 2.5x |
| 1 GB | 10.9 | 27.5 | 2.5x |
| 4 GB | 11.0 | 27.8 | 2.5x |
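The pageable numbers plateau around 11 GB/s because the transfer is really two serialized copies: a host memcpy into a pinned staging buffer, then the PCIe DMA. Modeling those as two stages in series reproduces the plateau (the ~20 GB/s staging-copy rate is an assumed figure for a single-threaded memcpy, not a measurement from this system):

```python
def staged_transfer_bw(memcpy_bw_gbs: float, pcie_bw_gbs: float) -> float:
    """Effective bandwidth when every byte passes through two serialized
    stages: CPU memcpy to a pinned staging buffer, then PCIe DMA."""
    # Time per GB is the sum of both stage times, so the bandwidths
    # combine harmonically.
    return 1.0 / (1.0 / memcpy_bw_gbs + 1.0 / pcie_bw_gbs)

print(f"{staged_transfer_bw(20.0, 27.0):.1f} GB/s")  # ~11.5 GB/s plateau
```

This is also why pinning helps so much: it removes the staging stage entirely, leaving only the DMA.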
Async Transfers and Overlap
```python
import torch
import time

def async_transfer_demonstration():
    """Demonstrate overlapping PCIe transfers with GPU compute."""
    device = 'cuda'
    size_mb = 500
    num_elements = int(size_mb * 1024 * 1024 / 4)
    # Create two streams
    transfer_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    # Pinned host memory (required for async transfers)
    h_data = torch.randn(num_elements, dtype=torch.float32).pin_memory()
    d_data = torch.empty(num_elements, device=device, dtype=torch.float32)
    d_compute = torch.randn(4096, 4096, device=device, dtype=torch.float32)
    # --- Sequential: transfer then compute ---
    torch.cuda.synchronize()
    start = time.perf_counter()
    d_data.copy_(h_data)  # Transfer on default stream
    torch.cuda.synchronize()
    result = torch.matmul(d_compute, d_compute)  # Compute on default stream
    torch.cuda.synchronize()
    sequential_time = time.perf_counter() - start
    # --- Overlapped: transfer and compute on separate streams ---
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.cuda.stream(transfer_stream):
        d_data.copy_(h_data, non_blocking=True)
    with torch.cuda.stream(compute_stream):
        result = torch.matmul(d_compute, d_compute)
    torch.cuda.synchronize()
    overlapped_time = time.perf_counter() - start
    print(f"Sequential: {sequential_time*1000:.1f} ms")
    print(f"Overlapped: {overlapped_time*1000:.1f} ms")
    print(f"Speedup: {sequential_time/overlapped_time:.2f}x")

async_transfer_demonstration()
```
`tensor.copy_(source, non_blocking=True)` only performs an asynchronous transfer when the source is in pinned memory. With pageable memory, `non_blocking=True` is silently ignored and the transfer is synchronous (the CPU blocks until the transfer completes). Always verify that your host tensors are pinned when relying on async transfers.
When PCIe Is the Bottleneck
Scenario 1: Model Loading
```python
def model_loading_time(model_params_B, dtype_bytes=2, pcie_bw_GBs=27):
    """Time to load a model from CPU to GPU over PCIe."""
    model_bytes = model_params_B * 1e9 * dtype_bytes
    model_gb = model_bytes / 1e9
    load_time = model_gb / pcie_bw_GBs
    print(f"Model: {model_params_B}B params ({dtype_bytes} bytes/param)")
    print(f"Size: {model_gb:.1f} GB")
    print(f"PCIe Gen5 transfer: {load_time:.1f} seconds")
    print(f"PCIe Gen4 transfer: {model_gb / 13.5:.1f} seconds")  # ~13.5 GB/s achieved
    # Compare to loading from NVMe SSD
    nvme_bw = 7.0  # GB/s (Gen4 NVMe)
    print(f"NVMe SSD read: {model_gb / nvme_bw:.1f} seconds "
          f"(SSD is the bottleneck, not PCIe)")

model_loading_time(7, 2)    # 7B FP16
print()
model_loading_time(70, 2)   # 70B FP16
print()
model_loading_time(70, 1)   # 70B FP8
print()
model_loading_time(405, 1)  # 405B FP8
```
Scenario 2: KV Cache Offloading
When GPU memory is full, some systems offload KV cache to CPU memory and bring it back when needed.
```python
def kv_cache_offload_analysis(model_params_B=70, num_layers=80,
                              kv_heads=8, head_dim=128,
                              seq_len=4096, batch_size=1):
    """Analyze KV cache offload latency."""
    # KV cache per layer: 2 (K and V) * batch * heads * seq * dim * 2 bytes (FP16)
    kv_per_layer = 2 * batch_size * kv_heads * seq_len * head_dim * 2
    kv_total = kv_per_layer * num_layers
    print(f"KV cache per layer: {kv_per_layer / 1e6:.1f} MB")
    print(f"KV cache total: {kv_total / 1e6:.1f} MB")
    print()
    # Time to swap one layer's KV cache over PCIe
    pcie_bw = 27e9  # 27 GB/s
    swap_time_per_layer = kv_per_layer / pcie_bw * 1000  # ms
    # Time for GPU to process one layer (decode, batch=1)
    # Approximate: weight_bytes / HBM_BW
    weight_per_layer = model_params_B * 1e9 * 2 / num_layers
    hbm_bw = 3350e9
    compute_time_per_layer = weight_per_layer / hbm_bw * 1000  # ms
    print(f"KV swap time per layer: {swap_time_per_layer:.2f} ms "
          f"(PCIe: 27 GB/s)")
    print(f"GPU compute per layer: {compute_time_per_layer:.2f} ms "
          f"(HBM: 3350 GB/s)")
    print(f"Swap overhead: {swap_time_per_layer/compute_time_per_layer*100:.0f}%")
    print()
    # Can we overlap swap with compute?
    if swap_time_per_layer < compute_time_per_layer:
        print("Swap can be fully hidden behind compute (prefetch next layer)")
    else:
        overhead = swap_time_per_layer - compute_time_per_layer
        print(f"Swap CANNOT be hidden: {overhead:.2f} ms exposed per layer")
        print(f"Total overhead for {num_layers} layers: "
              f"{overhead * num_layers:.0f} ms")

kv_cache_offload_analysis()
print()
kv_cache_offload_analysis(batch_size=32)  # Larger batch = larger KV
```
Scenario 3: Multi-GPU Without NVLink
```python
def multi_gpu_communication_analysis():
    """Compare PCIe vs NVLink for multi-GPU inference."""
    # Tensor parallelism requires an all-reduce after each layer.
    # Data being reduced: hidden_dim * batch * seq * dtype_bytes.
    # The ring algorithm moves 2*(N-1)/N times that volume per GPU
    # (reduce-scatter + all-gather), accounted for in `factor` below.
    configs = [
        ("7B, TP=2", 4096, 1, 1, 2),
        ("70B, TP=8", 8192, 1, 1, 8),
        ("70B, TP=8, batch=32", 8192, 32, 1, 8),
    ]
    for name, hidden, batch, seq, tp_degree in configs:
        allreduce_bytes = hidden * batch * seq * 2  # FP16
        pcie_bw = 27     # GB/s (per direction, PCIe Gen5, achieved)
        nvlink_bw = 450  # GB/s (per direction, H100 NVLink)
        # All-reduce time (ring algorithm): 2 * (N-1)/N * data / BW
        factor = 2 * (tp_degree - 1) / tp_degree
        pcie_time = factor * allreduce_bytes / (pcie_bw * 1e9) * 1e6      # us
        nvlink_time = factor * allreduce_bytes / (nvlink_bw * 1e9) * 1e6  # us
        print(f"{name}:")
        print(f"  All-reduce size: {allreduce_bytes/1024:.1f} KB")
        print(f"  PCIe Gen5: {pcie_time:.1f} us")
        print(f"  NVLink: {nvlink_time:.1f} us")
        print(f"  NVLink speedup: {pcie_time/nvlink_time:.1f}x")
        print()

multi_gpu_communication_analysis()
```
All-Reduce Latency: PCIe vs NVLink (70B, TP=8, batch=1)

For single-token decode (batch=1), the all-reduce payload is small: 16 KB for hidden=8192 in FP16. Even over PCIe this takes about 1 us, negligible next to the per-layer compute time (~50 us), so PCIe-connected GPUs work well for decode. For prefill (large batch and sequence), the all-reduce volume scales linearly with token count and PCIe becomes a significant bottleneck.
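The decode/prefill crossover is easy to model: all-reduce time scales with the number of tokens per step, and decode moves one token while prefill moves thousands. A sketch using the same ring-all-reduce cost model as above (27 GB/s PCIe and 450 GB/s NVLink are the per-direction figures assumed throughout this post):

```python
def ring_allreduce_us(hidden: int, tokens: int, tp: int,
                      bw_gbs: float, dtype_bytes: int = 2) -> float:
    """Ring all-reduce time (us) for one layer's activations."""
    data = hidden * tokens * dtype_bytes  # bytes being reduced
    moved = 2 * (tp - 1) / tp * data      # reduce-scatter + all-gather
    return moved / (bw_gbs * 1e9) * 1e6

hidden, tp = 8192, 8
for label, tokens in [("decode (1 token)", 1),
                      ("prefill (4096 tokens)", 4096)]:
    pcie = ring_allreduce_us(hidden, tokens, tp, 27)
    nvl = ring_allreduce_us(hidden, tokens, tp, 450)
    print(f"{label}: PCIe {pcie:.1f} us, NVLink {nvl:.1f} us")
```

At 4096 prefill tokens the PCIe all-reduce is in the milliseconds per layer, which is why disaggregated designs keep prefill on NVLink-connected GPUs.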
GPUDirect RDMA: Bypassing the CPU
Architecture
GPUDirect RDMA enables network adapters (InfiniBand, RoCE) to read/write GPU memory directly without going through CPU memory. The data path changes from:
```
Without GPUDirect: Network adapter -> CPU memory -> PCIe -> GPU memory
With GPUDirect:    Network adapter -> PCIe -> GPU memory (direct)
```
```python
def gpudirect_analysis():
    """Compare data paths with and without GPUDirect RDMA."""
    payload_gb = 1.0  # 1 GB transfer
    # Without GPUDirect: two copies
    #   1. Network -> CPU memory (network BW limited)
    #   2. CPU memory -> GPU memory (PCIe BW limited)
    network_bw = 50  # GB/s (400 Gbps InfiniBand)
    pcie_bw = 27     # GB/s
    time_no_gd = payload_gb / network_bw + payload_gb / pcie_bw
    # (Plus CPU overhead for managing the staging copy, not modeled here.)
    # With GPUDirect: one copy
    #   Network -> GPU memory (limited by min of network and PCIe BW)
    gd_bw = min(network_bw, pcie_bw)
    time_gd = payload_gb / gd_bw
    print(f"Without GPUDirect: {time_no_gd*1000:.1f} ms "
          f"(network + PCIe copy)")
    print(f"With GPUDirect: {time_gd*1000:.1f} ms "
          f"(direct, limited by PCIe)")
    print(f"Speedup: {time_no_gd/time_gd:.2f}x")
    print()
    print("GPUDirect eliminates:")
    print("  - One memory copy (CPU-GPU)")
    print("  - CPU overhead for managing the copy")
    print("  - CPU memory bandwidth consumption")

gpudirect_analysis()
```
GPUDirect Storage
GPUDirect Storage extends the same concept to NVMe SSDs: load model weights directly from SSD to GPU memory without staging in CPU memory.
```python
def gpudirect_storage_analysis():
    """Analyze model loading with GPUDirect Storage."""
    model_gb = 140  # 70B FP16
    # Traditional: SSD -> CPU -> GPU, two serialized steps
    # (can overlap with multiple SSDs, but the SSD is typically the bottleneck)
    nvme_bw = 7.0   # GB/s (one NVMe Gen4)
    pcie_bw = 27.0  # GB/s
    traditional_time = model_gb / nvme_bw + model_gb / pcie_bw
    # GPUDirect Storage: SSD -> GPU (bypass CPU)
    # Limited by SSD read speed for a single SSD;
    # with RAID/multiple SSDs it can approach the PCIe limit
    gds_bw_single_ssd = 6.5  # GB/s, slightly below raw NVMe due to protocol
    gds_bw_4_ssds = 24.0     # GB/s, 4 NVMe SSDs in parallel
    gds_time_1 = model_gb / gds_bw_single_ssd
    gds_time_4 = model_gb / gds_bw_4_ssds
    print(f"Model size: {model_gb} GB")
    print(f"Traditional (SSD -> CPU -> GPU): {traditional_time:.1f} s")
    print(f"GDS 1 SSD: {gds_time_1:.1f} s")
    print(f"GDS 4 SSDs: {gds_time_4:.1f} s")
    print(f"Speedup (4 SSDs): {traditional_time/gds_time_4:.2f}x")

gpudirect_storage_analysis()
```
PCIe Bandwidth Optimization Techniques
Technique 1: Transfer Size Optimization
```python
def transfer_size_optimization():
    """Show how transfer size affects achieved PCIe bandwidth."""
    # Small transfers are dominated by fixed latency overhead
    # (initiation + acknowledgment, ~1-2 us round trip).
    # At 27 GB/s, 1.5 us is long enough to transfer ~40 KB.
    sizes_kb = [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536]
    pcie_bw_peak = 27.0    # GB/s
    pcie_latency_us = 1.5  # Round-trip latency
    print(f"{'Size (KB)':<12} {'Transfer Time':<15} "
          f"{'Achieved BW':<15} {'Efficiency'}")
    for size_kb in sizes_kb:
        size_gb = size_kb / 1024 / 1024
        ideal_time_us = size_gb / pcie_bw_peak * 1e6
        actual_time_us = ideal_time_us + pcie_latency_us
        achieved_bw = size_gb / (actual_time_us / 1e6)
        efficiency = achieved_bw / pcie_bw_peak
        print(f"{size_kb:<12} {actual_time_us:>8.2f} us   "
              f"{achieved_bw:>8.2f} GB/s   "
              f"{efficiency*100:>6.1f}%")

transfer_size_optimization()
```
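A useful rule of thumb falls out of this model: efficiency hits exactly 50% when the ideal transfer time equals the fixed latency, i.e. at a transfer size of bandwidth x latency. A sketch with the same assumed figures:

```python
def half_power_size_kb(bw_gbs: float, latency_us: float) -> float:
    """Transfer size at which fixed latency halves the achieved bandwidth."""
    return bw_gbs * 1e9 * latency_us * 1e-6 / 1024  # bytes -> KB

print(f"{half_power_size_kb(27.0, 1.5):.1f} KB")
```

Transfers well below this size (here roughly 40 KB) waste most of the link on latency; batching many small tensors into one contiguous copy is usually the fix.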
Technique 2: Chunked Transfers with Pipeline
```python
def chunked_pipeline_transfer(total_gb=10, chunk_mb=256, pcie_bw_GBs=27):
    """Pipeline model loading in chunks to overlap transfer with
    initialization."""
    total_mb = total_gb * 1024
    num_chunks = int(total_mb / chunk_mb)
    chunk_gb = chunk_mb / 1024
    transfer_time_per_chunk = chunk_gb / pcie_bw_GBs * 1000  # ms
    init_time_per_chunk = transfer_time_per_chunk * 0.3  # 30% of transfer
    # Sequential: each chunk is transferred, then initialized
    sequential_ms = num_chunks * (transfer_time_per_chunk +
                                  init_time_per_chunk)
    # Pipelined: overlap transfer[i+1] with init[i]
    # Total time = transfer_time * num_chunks + init_time (last chunk)
    pipelined_ms = (transfer_time_per_chunk * num_chunks +
                    init_time_per_chunk)
    print(f"Total data: {total_gb} GB in {num_chunks} chunks")
    print(f"Sequential: {sequential_ms/1000:.1f} s")
    print(f"Pipelined: {pipelined_ms/1000:.1f} s")
    print(f"Speedup: {sequential_ms/pipelined_ms:.2f}x")

chunked_pipeline_transfer()
```
PCIe Optimization Techniques Impact
| Technique | Baseline BW (GB/s) | Optimized BW (GB/s) | Improvement |
|---|---|---|---|
| Pageable -> Pinned memory | 11 | 27 | 2.5x |
| Synchronous -> Async + overlap | 27 (serial) | 27 + compute | Hide transfer |
| Small chunks -> Large chunks | 5-15 | 27 | 2-5x |
| GPUDirect RDMA (network) | 11 (via CPU) | 27 (direct) | 2.5x |
| GPUDirect Storage (SSD) | 7 (via CPU) | 24 (4x SSD) | 3.4x |
Summary
PCIe Gen5 x16 provides about 27 GB/s of achieved unidirectional bandwidth in practice, roughly 124x less than HBM3 on the same GPU. For steady-state inference where model weights and KV cache reside on the GPU, PCIe is irrelevant. PCIe becomes the bottleneck in three scenarios: model loading (a 140 GB model takes 5+ seconds over PCIe), KV cache offloading (each offloaded layer adds 0.5-2 ms of latency), and multi-GPU inference without NVLink (all-reduce data crosses PCIe for every layer).
The optimization hierarchy: pinned memory (2.5x over pageable), async transfers with compute overlap (hides transfer time), large transfer sizes (amortizes per-transaction overhead), and GPUDirect (bypasses the CPU for network and storage transfers). For multi-GPU, NVLink provides roughly 16x the achieved bandwidth of PCIe; any serious multi-GPU inference deployment uses NVLink-connected GPUs (HGX, DGX systems).
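Putting the hierarchy together for the model-loading case: a sketch estimating end-to-end load time for a 140 GB model under each successive optimization, using the representative bandwidth figures from this post (the ~11 GB/s pageable figure assumes the staged CPU copy serializes with the DMA):

```python
def load_time_s(model_gb: float, bw_gbs: float) -> float:
    """End-to-end load time given an effective pipeline bandwidth."""
    return model_gb / bw_gbs

model_gb = 140  # 70B FP16
paths = [
    # (description, effective GB/s; serialized stages combine harmonically)
    ("SSD -> pageable CPU -> GPU", 1 / (1/7.0 + 1/11.0)),
    ("SSD -> pinned CPU -> GPU",   1 / (1/7.0 + 1/27.0)),
    ("GPUDirect Storage, 4x NVMe", 24.0),
]
for name, bw in paths:
    print(f"{name}: {load_time_s(model_gb, bw):.1f} s")
```

Each rung of the hierarchy roughly halves the load time; the final step, GDS over striped NVMe, gets within a factor of ~4 of the raw PCIe limit.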