Part 17 of 30 in the series "GPU Hardware & AI Accelerators"

The B200 is NVIDIA's first GPU to ship as two dies in a single package that present to software as one device. Each die contains 104 billion transistors on TSMC 4NP, for a total of 208 billion transistors — 2.6x the H100's 80 billion. The two dies are connected by a 10 TB/s on-package interconnect (NV-HBI), making them appear as a single monolithic GPU to software. The B200 delivers 4,500 TFLOPS of FP8 compute (2.3x the H100), 9,000 TFLOPS of FP4 (a new precision tier), 192 GB of HBM3e at 8 TB/s bandwidth (2.4x the H100), and 1,800 GB/s of NVLink 5.0 connectivity per GPU.

This is not an incremental upgrade. The B200 addresses the three bottlenecks that limited H100 performance on frontier LLM workloads: compute throughput (now 2-4x depending on precision), memory bandwidth (2.4x), and interconnect bandwidth (2x). This post dissects each architectural change, the engineering tradeoffs behind the dual-die design, and the performance implications for training and inference.

Dual-Die Architecture

Why Two Dies

A monolithic 208B-transistor die would require approximately 1,200 mm² — exceeding TSMC’s maximum reticle size (~858 mm²). Even if manufacturable, the yield would be catastrophic: at realistic defect densities, fewer than 10% of such dies would be functional.

The solution: two smaller dies (~530 mm² each), connected by a high-bandwidth on-package interconnect.
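The yield argument can be made concrete with the standard Poisson yield model, Y = exp(−A·D0). The defect density below (0.2 defects/cm²) is an illustrative assumption, not a published TSMC figure:

```python
import math

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: fraction of defect-free dies, Y = exp(-A * D0)."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)

D0 = 0.2  # assumed defect density in defects/cm^2 (illustrative)

mono = poisson_yield(1200, D0)     # hypothetical monolithic 208B-transistor die
chiplet = poisson_yield(530, D0)   # one B200 die

print(f"monolithic ~1,200 mm^2 die: {mono:.1%} yield")
print(f"single ~530 mm^2 die:       {chiplet:.1%} yield")
```

Under these assumptions the monolithic die lands below 10% yield while each ~530 mm² die clears 30% — and because dies are tested before packaging (known-good-die assembly), package yield tracks the per-die figure rather than the product of the two.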

📊

B200 Die Comparison

| Specification | H100 (Monolithic) | B200 (Per Die) | B200 (Total, 2 Dies) |
|---|---|---|---|
| Transistors | 80 billion | 104 billion | 208 billion |
| Die area | 814 mm² | ~530 mm² | ~1,060 mm² |
| Process | TSMC 4N | TSMC 4NP | TSMC 4NP |
| SMs | 132 | 96 | 192 |
| FP32 CUDA cores | 16,896 | 12,288 | 24,576 |
| Tensor cores | 528 | 384 | 768 |
| L2 cache | 50 MB | ~48 MB | ~96 MB |
| HBM stacks | 6 (HBM3) | 4 (HBM3e) | 8 (HBM3e) |
| HBM capacity | 80 GB | 96 GB | 192 GB |
| HBM bandwidth | 3,350 GB/s | 4,000 GB/s | 8,000 GB/s |
| TDP | 700 W | ~500 W | 1,000 W |
Note: Each B200 die is smaller than H100 (530 vs 814 mm²) but contains more transistors due to the 4NP process shrink. The dual-die total provides 2.6x transistors at 1.3x total area.

The On-Package Interconnect

The two B200 dies are connected by a silicon interposer or organic substrate providing 10 TB/s bidirectional bandwidth:

// On-package interconnect specifications:
// Bandwidth: 10 TB/s bidirectional (5 TB/s per direction)
// Latency: ~10-20 ns (on-package, short traces)
// Width: estimated 4096 data lanes at ~2.5 Gbps each
//
// For comparison:
// HBM3e per-stack bandwidth: ~1 TB/s (8 stacks total)
// NVLink 5.0 per-GPU: 1.8 TB/s (off-package, between GPUs)
// On-package link: 10 TB/s (5.6x faster than NVLink off-package)
//
// This means cross-die communication within the B200 is:
// - 5.6x faster than GPU-to-GPU NVLink
// - Nearly invisible to performance
// - Software sees one unified GPU, not two separate GPUs

Unified GPU Abstraction

The two dies appear as a single GPU to CUDA:

# nvidia-smi shows one B200, not two
nvidia-smi -L
# GPU 0: NVIDIA B200 (UUID: GPU-xxxx)
# (single entry, single device)

# CUDA reports one device with combined resources
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// prop.multiProcessorCount = 192  (96 per die, combined)
// prop.totalGlobalMem = 192 GB    (96 per die, combined)
// Kernel launch uses all 192 SMs transparently
// The block scheduler distributes blocks across both dies
// Cross-die memory access is handled by hardware
my_kernel<<<grid, block>>>(data, n);
// Thread blocks on Die 0 access Die 0's HBM at 4 TB/s
// Thread blocks on Die 0 access Die 1's HBM via interconnect at ~5 TB/s
// Thread blocks on Die 1 access Die 1's HBM at 4 TB/s
⚠️ Cross-Die Access Is Not Free

While the on-package interconnect is fast (10 TB/s), it adds latency (~10-20 ns) compared to local HBM access. Kernels that access data primarily on one die’s HBM will run faster than kernels that access data uniformly across both dies’ HBM. The CUDA runtime and driver use NUMA-aware allocation to keep each die’s data local. For most kernels, this is handled automatically, but extremely latency-sensitive code may need explicit placement hints.

Fifth-Generation Tensor Cores

FP4 Support

The most significant compute addition in Blackwell is native FP4 (4-bit floating point) tensor core support:

// FP4 format (E2M1):
// 1 sign bit, 2 exponent bits, 1 mantissa bit
// Range: ±0.5 to ±6.0 (very limited, 16 distinct values)
// Typically used with per-group scaling factors
//
// FP4 tensor core operation:
// A[m,k] in FP4 × B[k,n] in FP4 + C[m,n] in FP32 → D[m,n] in FP32
// Each FP4 value requires a scaling factor (FP8 or FP16) per group
// Group size: typically 32-128 elements

// Per-SM FP4 throughput:
// 4 tensor cores per SM, each computes:
//   m16n8k64 in FP4: 16 × 8 × 64 × 2 = 16,384 FP4 ops per cycle per TC
// Total per SM: 4 × 16,384 = 65,536 FP4 ops per cycle
// Per GPU (192 SMs): 192 × 65,536 = 12,582,912 FP4 ops per cycle
// At ~1.5 GHz boost: ~18.9 PFLOPS (sparsity) or ~9.0 PFLOPS (dense)
📊

B200 Tensor Core Throughput by Precision

| Precision | TFLOPS (Dense) | TFLOPS (Sparse) | vs H100 Dense | Primary Use Case |
|---|---|---|---|---|
| FP64 Tensor | 90 | N/A | 2.3x (H100: 40) | HPC, scientific computing |
| FP32 (TF32) | 2,250 | 4,500 | 2.3x (H100: 990) | Training accumulation |
| FP16 / BF16 | 2,250 | 4,500 | 2.3x (H100: 990) | Training, inference |
| FP8 (E4M3/E5M2) | 4,500 | 9,000 | 2.3x (H100: 1,979) | Inference, fine-tuning |
| FP4 (E2M1) | 9,000 | 18,000 | NEW | Inference (quantized models) |
| INT8 | 4,500 | 9,000 | 2.3x (H100: 1,979) | Integer quantized inference |

Note: B200 delivers roughly 2.3x H100 throughput at every shared precision tier, plus the new FP4 tier at 2x the FP8 rate. Sparse figures assume 2:4 structured sparsity.
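The per-SM arithmetic above can be checked numerically. The 1.5 GHz boost clock and the m16n8k64 MMA shape are assumptions carried over from the text; the marketed 9,000 TFLOPS dense figure implies a slightly lower sustained clock than this back-of-envelope model:

```python
def fp4_throughput_pflops(sms: int = 192, tensor_cores_per_sm: int = 4,
                          clock_ghz: float = 1.5) -> tuple[float, float]:
    """Peak FP4 tensor throughput from first principles: (sparse, dense)."""
    ops_per_tc_cycle = 16 * 8 * 64 * 2  # m16n8k64 MMA, 2 ops per multiply-add
    ops_per_gpu_cycle = sms * tensor_cores_per_sm * ops_per_tc_cycle
    sparse = ops_per_gpu_cycle * clock_ghz * 1e9 / 1e15  # 2:4 sparsity rate
    return sparse, sparse / 2  # dense is half the sparse rate

sparse, dense = fp4_throughput_pflops()
print(f"sparse: {sparse:.1f} PFLOPS, dense: {dense:.1f} PFLOPS")
```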

The FP4 Quality Question

FP4 has only 16 distinct codes; within a group, every weight must map to one of 16 scaled values. Model quality therefore depends entirely on the quantization scheme:

// FP4 quantization with group scaling:
// For a weight tensor W[4096, 4096]:
// Divide into groups of 64 elements
// For each group: scale = max(abs(group)) / 6.0  (FP4 max value)
// Quantize: W_fp4 = round(W / scale) in FP4
// Dequantize: W_approx = W_fp4 * scale
//
// Storage: 4096 * 4096 * 0.5 bytes (FP4) + (4096 * 4096 / 64) * 2 bytes (FP16 scales)
//        = 8 MB + 0.5 MB = 8.5 MB (vs 16 MB FP8, 32 MB FP16)
//
// Quality loss: depends on model and calibration
// Published results (NVIDIA, 2024):
// GPT-3 175B: FP4 perplexity within 0.3% of FP16 with GPTQ-style calibration
// LLaMA-2 70B: FP4 perplexity within 0.5% of FP16
// Smaller models (7B): FP4 quality degrades noticeably (1-3% accuracy loss)
ℹ️ FP4 Is for Inference, Not Training

FP4 does not have sufficient dynamic range for training gradients, which require both very small and very large values. FP4 is exclusively an inference format, used to compress model weights for faster loading and reduced memory footprint. The compute happens in FP4 multiplication with FP32 accumulation — similar to INT4 quantization but with the benefit of floating-point representation.
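The group-scaling recipe above can be sketched in pure Python. The E2M1 magnitude set and the scale = max|w| / 6.0 rule come from the text; round-to-nearest-level is an assumed (and common) choice:

```python
# The 8 non-negative magnitudes representable in FP4 E2M1; with sign, 16 codes.
FP4_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_group_fp4(group):
    """Quantize one weight group to FP4 with a shared scale, then dequantize."""
    amax = max(abs(w) for w in group)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value onto 6.0
    dequant = []
    for w in group:
        level = min(FP4_LEVELS, key=lambda lvl: abs(lvl - abs(w) / scale))
        dequant.append(level * scale if w >= 0 else -level * scale)
    return dequant, scale

group = [0.01 * i - 0.3 for i in range(64)]  # toy 64-element weight group
dequant, scale = quantize_group_fp4(group)
max_err = max(abs(w - q) for w, q in zip(group, dequant))
# Error is bounded by half the widest level gap (4.0 -> 6.0) times the scale
assert max_err <= 1.0 * scale + 1e-12
```

One FP16 scale per 64 FP4 values adds 2 bytes per 32 bytes of weights — the ~6% overhead implied by the 8.5 MB total above.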

HBM3e: 8 TB/s Memory Bandwidth

HBM3e Specifications

// B200 HBM3e configuration:
// 8 HBM3e stacks (4 per die)
// Each stack: 24 GB capacity, ~1 TB/s bandwidth
// Total: 192 GB capacity, 8 TB/s bandwidth
//
// HBM3e vs HBM3 (H100):
// Per-pin data rate: 9.6 Gbps (HBM3e) vs 6.4 Gbps (HBM3)
// Pins per stack: 1024 (same)
// Per-stack bandwidth: 9.6 * 1024 / 8 = 1,228.8 GB/s (theoretical)
// Practical: ~1,000 GB/s per stack (efficiency ~81%)
// 8 stacks: ~8,000 GB/s total

Memory Bandwidth and Model Performance

The 2.4x bandwidth increase from H100 to B200 directly impacts memory-bound operations:

// LLM decode throughput (single token generation):
// Throughput ∝ memory_bandwidth / model_size
//
// H100 (80B FP16 model, FP16 weights):
// 3,350 GB/s / 160 GB = 20.9 tokens/s (batch 1)
//
// B200 (80B FP16 model, FP16 weights):
// 8,000 GB/s / 160 GB = 50.0 tokens/s (batch 1)
// Improvement: 2.4x
//
// B200 (80B FP8 model):
// 8,000 GB/s / 80 GB = 100.0 tokens/s (batch 1)
// Improvement: 4.8x vs H100 FP16
//
// B200 (80B FP4 model):
// 8,000 GB/s / 40 GB = 200.0 tokens/s (batch 1)
// Improvement: 9.6x vs H100 FP16

Single-Batch Decode Throughput: 80B-Parameter Model

| Configuration | Weights in HBM | Bandwidth | Decode throughput (batch 1) |
|---|---|---|---|
| H100, FP16 weights | 160 GB | 3,350 GB/s | 20.9 tok/s |
| B200, FP16 weights | 160 GB | 8,000 GB/s | 50.0 tok/s |
| B200, FP8 weights (2x compression) | 80 GB | 8,000 GB/s | 100.0 tok/s |
| B200, FP4 weights (4x compression) | 40 GB | 8,000 GB/s | 200.0 tok/s |
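The bandwidth roofline used above generalizes to a one-line model. It assumes batch-1 decode streams every weight from HBM exactly once per token, and ignores KV-cache reads and kernel overheads, so it is an upper bound:

```python
def decode_tok_per_s(bandwidth_gb_s: float, params_billion: float,
                     bytes_per_param: float) -> float:
    """Upper-bound decode rate: memory bandwidth / bytes of weights per token."""
    return bandwidth_gb_s / (params_billion * bytes_per_param)

# The 80B-parameter cases from the text
print(decode_tok_per_s(3350, 80, 2.0))  # H100, FP16 -> 20.9375
print(decode_tok_per_s(8000, 80, 2.0))  # B200, FP16 -> 50.0
print(decode_tok_per_s(8000, 80, 1.0))  # B200, FP8  -> 100.0
print(decode_tok_per_s(8000, 80, 0.5))  # B200, FP4  -> 200.0
```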

Second-Generation Transformer Engine

Dynamic Precision Selection

The Transformer Engine automatically selects the optimal precision for each layer and each tensor during training and inference:

// Transformer Engine 2.0 (Blackwell):
// 1. Monitors tensor statistics (magnitude distribution)
// 2. Selects per-tensor precision: FP4, FP8 (E4M3 or E5M2), BF16
// 3. Computes scaling factors for quantization
// 4. Executes tensor core operations at selected precision
// 5. Accumulates in FP32, then down-converts output
//
// Decision logic (simplified):
// If tensor values fit in FP4 range with < 0.1% outliers → use FP4
// If tensor values fit in FP8 E4M3 range → use FP8 E4M3
// If tensor values require FP8 E5M2 range (larger exponent) → use FP8 E5M2
// Otherwise → use BF16
//
// Different layers in the same model may use different precisions:
// Embedding layers: BF16 (wide value range)
// Attention QKV projections: FP8 (stable distribution)
// FFN layers: FP8 or FP4 (often quantization-friendly)
// Output logits: BF16 (precision-sensitive)

Micro-Tensor Scaling

Blackwell introduces micro-tensor scaling for FP4 and FP8: per-group scaling factors that allow each small group of elements to have its own dynamic range:

// Micro-tensor scaling for FP8:
// Instead of one scale per tensor (coarse):
//   scale = max(abs(tensor)) / 448.0  (FP8 E4M3 max)
//   Problem: if tensor has outliers, most values use only a fraction of FP8 range
//
// Micro-tensor scaling (per group of 128 elements):
//   For each group of 128 elements:
//     group_scale = max(abs(group)) / 448.0
//     group_fp8 = round(group / group_scale)
//   Each group uses full FP8 dynamic range
//   Overhead: 1 FP16 scale per 128 FP8 values = 1.6% storage overhead
//
// Result: FP8 with micro-tensor scaling matches FP16 quality
// on nearly all LLM benchmarks (GPT-4 class models)
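The outlier effect is easy to demonstrate. The sketch below uses a symmetric uniform 8-bit quantizer as a simplified stand-in for FP8 (real E4M3 is non-uniform) and a group size of 32, from the 32-128 range mentioned earlier:

```python
def quantize(values, scale, qmax=127):
    """Symmetric uniform quantizer: a simplified stand-in for FP8 E4M3."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def rms_error(a, b):
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

# 255 small weights plus one large outlier
tensor = [0.01 * ((i % 20) - 10) for i in range(255)] + [50.0]

# Per-tensor scaling: the outlier stretches the range; small values collapse to 0
coarse = quantize(tensor, max(abs(v) for v in tensor) / 127)

# Micro-tensor scaling: one scale per 32-element group isolates the outlier
fine = []
for i in range(0, len(tensor), 32):
    group = tensor[i:i + 32]
    fine.extend(quantize(group, max(abs(v) for v in group) / 127))

print(f"per-tensor RMS error: {rms_error(tensor, coarse):.4f}")
print(f"per-group RMS error:  {rms_error(tensor, fine):.4f}")
```

With per-group scales, only the outlier's own 32-element group loses resolution; the rest of the tensor keeps its full quantization range.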

Per-GPU Interconnect

// B200 NVLink 5.0:
// 18 NVLink 5.0 links per GPU
// Each link: 100 GB/s bidirectional (50 GB/s per direction)
// Total: 1,800 GB/s bidirectional per GPU
//
// vs H100 NVLink 4.0:
// 18 links × 50 GB/s = 900 GB/s per GPU
// B200: 2x per-GPU NVLink bandwidth

NVL72 with B200

In the NVL72 configuration, 72 B200 GPUs are connected via NVSwitch 4.0:

// NVL72 aggregate numbers:
// Total GPU memory: 72 × 192 GB = 13.8 TB HBM3e
// Total compute (FP8): 72 × 4,500 = 324 PFLOPS
// Total compute (FP4): 72 × 9,000 = 648 PFLOPS
// Total HBM bandwidth: 72 × 8,000 = 576 TB/s
// Total NVLink bisection: 130 TB/s
//
// For a 1 trillion parameter model (FP8):
// Model size: 1 TB
// Fits across 72 GPUs: 1 TB / 72 = 14.2 GB per GPU
// Each GPU has 192 GB: model fits easily with room for KV cache
// Per-GPU compute: 4,500 TFLOPS FP8
// Per-GPU model BW: only 14.2 GB to read → compute-bound
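The sizing arithmetic above as a reusable sketch (1 TB taken as 1,024 GB, matching the 14.2 GB/GPU figure):

```python
def shard_check(model_tb: float, gpus: int = 72, hbm_gb: float = 192.0):
    """Weights per GPU and remaining HBM headroom when a model is
    sharded evenly across an NVL72 rack."""
    per_gpu = model_tb * 1024 / gpus
    return per_gpu, hbm_gb - per_gpu

per_gpu, headroom = shard_check(1.0)  # 1T parameters in FP8 = 1 TB of weights
print(f"{per_gpu:.1f} GB weights/GPU, {headroom:.1f} GB for KV cache + activations")
```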

Decompression Engine

Hardware Decompression

Blackwell includes a dedicated hardware decompression engine that can decompress data as it is loaded from HBM:

// Decompression engine capabilities:
// - LZ4 decompression at HBM bandwidth (~8 TB/s compressed read rate)
// - Snappy decompression
// - Custom NVIDIA formats for weight compression
//
// Use case: store compressed model weights in HBM
// A 70B FP8 model (70 GB): compresses to ~50 GB with LZ4
// On load: decompress at 8 TB/s → effective bandwidth is > 8 TB/s for model data
// Savings: 20 GB less HBM used, same throughput
//
// The decompression engine operates in the memory pipeline
// No GPU compute cycles consumed for decompression
ℹ️ Decompression Helps Only If Data Is Compressible

Weight matrices are moderately compressible (1.2-1.5x with LZ4). Activations are poorly compressible (close to random data). The decompression engine primarily benefits weight loading in inference workloads. Training workloads see less benefit because activations (which are stored for backward pass) do not compress well.
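The compressibility gap can be demonstrated with a toy experiment. Here zlib stands in for the LZ4/Snappy engines (a different codec, but the same qualitative behavior), int8-quantized Gaussian values stand in for weights, and uniform random bytes for activations — all illustrative choices, not a measurement of the real engine:

```python
import random
import zlib

random.seed(0)
n = 1 << 16  # 64 KiB of each kind of data

# Weight-like data: int8-quantized Gaussian values (low-entropy byte stream)
weights = bytes(max(0, min(255, round(random.gauss(128, 20)))) for _ in range(n))

# Activation-like data: uniform random bytes (near-incompressible)
activations = bytes(random.getrandbits(8) for _ in range(n))

w_ratio = n / len(zlib.compress(weights, 9))
a_ratio = n / len(zlib.compress(activations, 9))
print(f"weight-like data:     {w_ratio:.2f}x compression")
print(f"activation-like data: {a_ratio:.2f}x compression")
```

Real FP16/FP8 weight tensors land in the 1.2-1.5x range quoted above because their exponent bytes repeat; activations after nonlinearities look much closer to the random stream.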

Reliability Features

Confidential Computing

B200 adds hardware support for confidential computing — encryption of GPU memory and compute to protect against physical and software attacks:

// Confidential computing features:
// - AES-256 encryption of all HBM data at rest
// - Encrypted NVLink traffic between GPUs
// - Secure boot and firmware attestation
// - Protected memory regions (TEE - Trusted Execution Environment)
//
// Performance overhead: ~1-3% (encryption/decryption in memory controller)
// Not enabled by default — requires explicit configuration
// Target use case: multi-tenant cloud inference (protecting customer data)

Enhanced RAS

// Reliability, Availability, Serviceability (RAS):
// - Row remapping: transparently remap faulty HBM rows to spare rows
// - Enhanced ECC: SECDED on HBM3e + chipkill capability
// - Per-SM isolation: a faulty SM can be disabled without affecting others
// - Live migration: move GPU workloads between B200s without stopping
//   (data center management feature, not application-level)

Performance Projections

Training

// GPT-4 class model (1.8T parameters, MoE) training on NVL72:
// FP8 mixed precision with micro-tensor scaling
// Per-GPU compute: 4,500 TFLOPS FP8
// Model parallelism: 72-way (TP + EP across NVL72)
// Expected MFU (Model FLOPS Utilization): 40-50%
// Effective throughput: 72 × 4,500 × 0.45 = 145.8 PFLOPS
//
// vs H100 DGX cluster (72 GPUs, IB-connected):
// Per-GPU compute: 1,979 TFLOPS FP8
// MFU: 35-40% (IB limits TP to 8-way per node)
// Effective: 72 × 1,979 × 0.375 = 53.4 PFLOPS
//
// B200 NVL72 advantage: 145.8 / 53.4 = 2.7x faster training

Inference

📊

B200 vs H100 Inference Performance Projections

| Workload | H100 (FP8) | B200 (FP8) | B200 (FP4) | B200 Speedup |
|---|---|---|---|---|
| LLaMA-70B decode (batch 1) | ~25 tok/s | ~60 tok/s | ~120 tok/s | 2.4-4.8x |
| LLaMA-70B decode (batch 64) | ~800 tok/s | ~2,000 tok/s | ~3,500 tok/s | 2.5-4.4x |
| LLaMA-70B prefill (4K tokens) | ~3,000 tok/s | ~7,000 tok/s | ~12,000 tok/s | 2.3-4x |
| GPT-4 1.8T (8-GPU TP) | ~50 tok/s | ~130 tok/s | ~250 tok/s | 2.6-5x |
| Stable Diffusion XL (512x512) | ~40 img/s | ~95 img/s | N/A | 2.4x |
Note: Projections based on published NVIDIA numbers and bandwidth/compute scaling analysis. Actual results depend on software optimization. FP4 numbers assume successful quantization without quality loss.

Summary

The B200 is a brute-force response to the scaling requirements of frontier AI models. The dual-die design delivers 2.6x more transistors by working around reticle limits, with a 10 TB/s on-package link that makes the two dies invisible to software. The 8 TB/s HBM3e provides the bandwidth to actually feed the 192 SMs with data. FP4 tensor cores double the effective compute density for quantized inference. And NVLink 5.0 at 1,800 GB/s per GPU enables 72-GPU NVL72 systems with full bisection bandwidth.

The key numbers: 9,000 TFLOPS FP4, 4,500 TFLOPS FP8, 8,000 GB/s HBM3e bandwidth, 192 GB capacity, 1,800 GB/s NVLink, 1,000 W TDP. At the system level, an NVL72 rack delivers 648 PFLOPS FP4 with 13.8 TB of memory — enough to run a 1-trillion parameter model in FP8 without model parallelism complexity beyond a single rack.

The engineering tradeoffs are clear: dual-die adds cross-die latency and manufacturing complexity, 1,000 W TDP demands liquid cooling, and FP4 precision requires careful quantization. But for the target workload — frontier LLM training and high-throughput inference — every tradeoff is worth it.