The NVIDIA B200 is the company's first GPU to ship as two dies in a single package. Each die contains 104 billion transistors on TSMC 4NP, for a total of 208 billion transistors — 2.6x the H100's 80 billion. The two dies are connected by a 10 TB/s on-package interconnect, making them appear to software as a single monolithic GPU. The B200 delivers 4,500 TFLOPS of dense FP8 compute (2.3x the H100), 9,000 TFLOPS of FP4 (a new precision tier), 192 GB of HBM3e at 8 TB/s of bandwidth (2.4x the H100), and 1,800 GB/s of NVLink 5.0 connectivity per GPU.
This is not an incremental upgrade. The B200 addresses the three bottlenecks that limited H100 performance on frontier LLM workloads: compute throughput (now 2-4x depending on precision), memory bandwidth (2.4x), and interconnect bandwidth (2x). This post dissects each architectural change, the engineering tradeoffs behind the dual-die design, and the performance implications for training and inference.
Dual-Die Architecture
Why Two Dies
A monolithic 208B-transistor die would require approximately 1,200 mm² — exceeding TSMC’s maximum reticle size (~858 mm²). Even if manufacturable, the yield would be catastrophic: at realistic defect densities, fewer than 10% of such dies would be functional.
The solution: two smaller dies (~530 mm² each), connected by a high-bandwidth on-package interconnect.
B200 Die Comparison
| Specification | H100 (Monolithic) | B200 (Per Die) | B200 (Total, 2 Dies) |
|---|---|---|---|
| Transistors | 80 billion | 104 billion | 208 billion |
| Die area | 814 mm² | ~530 mm² | ~1,060 mm² |
| Process | TSMC 4N | TSMC 4NP | TSMC 4NP |
| SMs | 132 | 96 | 192 |
| FP32 CUDA cores | 16,896 | 12,288 | 24,576 |
| Tensor cores (5th gen) | 528 | 384 | 768 |
| L2 cache | 50 MB | ~48 MB | ~96 MB |
| HBM stacks | 6 (HBM3) | 4 (HBM3e) | 8 (HBM3e) |
| HBM capacity | 80 GB | 96 GB | 192 GB |
| HBM bandwidth | 3,350 GB/s | 4,000 GB/s | 8,000 GB/s |
| TDP | 700 W | ~500 W | 1,000 W |
The On-Package Interconnect
The two B200 dies are connected by NVIDIA's NV-HBI die-to-die interface — implemented on a silicon interposer or organic substrate — providing 10 TB/s of bidirectional bandwidth:
// On-package interconnect specifications:
// Bandwidth: 10 TB/s bidirectional (5 TB/s per direction)
// Latency: ~10-20 ns (on-package, short traces)
// Width: not disclosed; e.g., ~4,096 lanes at ~10 Gbps each ≈ 5 TB/s per direction
//
// For comparison:
// HBM3e per-stack bandwidth: ~1 TB/s (8 stacks total)
// NVLink 5.0 per-GPU: 1.8 TB/s (off-package, between GPUs)
// On-package link: 10 TB/s (5.6x faster than NVLink off-package)
//
// This means cross-die communication within the B200 is:
// - 5.6x faster than GPU-to-GPU NVLink
// - Nearly invisible to performance
// - Software sees one unified GPU, not two separate GPUs
Unified GPU Abstraction
The two dies appear as a single GPU to CUDA:
# nvidia-smi shows one B200, not two
nvidia-smi -L
# GPU 0: NVIDIA B200 (UUID: GPU-xxxx)
# (single entry, single device)
# CUDA reports one device with combined resources
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// prop.multiProcessorCount = 192 (96 per die, combined)
// prop.totalGlobalMem = 192 GB (96 per die, combined)
// Kernel launch uses all 192 SMs transparently
// The block scheduler distributes blocks across both dies
// Cross-die memory access is handled by hardware
my_kernel<<<grid, block>>>(data, n);
// Thread blocks on Die 0 access Die 0's HBM at 4 TB/s (local)
// Thread blocks on Die 0 access Die 1's HBM through the interconnect:
//   capped by the remote HBM (4 TB/s) plus ~10-20 ns of extra latency;
//   the link itself (5 TB/s per direction) is not the bottleneck
// Thread blocks on Die 1 see the mirror image
While the on-package interconnect is fast (10 TB/s), it adds latency (~10-20 ns) compared to local HBM access. Kernels that access data primarily on one die’s HBM will run faster than kernels that access data uniformly across both dies’ HBM. The CUDA runtime and driver use NUMA-aware allocation to keep each die’s data local. For most kernels, this is handled automatically, but extremely latency-sensitive code may need explicit placement hints.
Fifth-Generation Tensor Cores
FP4 Support
The most significant compute addition in Blackwell is native FP4 (4-bit floating point) tensor core support:
// FP4 format (E2M1):
// 1 sign bit, 2 exponent bits, 1 mantissa bit
// Representable values: 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6 (16 encodings incl. -0)
// Typically used with per-group scaling factors
//
// FP4 tensor core operation:
// A[m,k] in FP4 × B[k,n] in FP4 + C[m,n] in FP32 → D[m,n] in FP32
// Each FP4 value requires a scaling factor (FP8 or FP16) per group
// Group size: typically 32-128 elements
// Per-SM FP4 throughput:
// 4 tensor cores per SM, each computes:
// m16n8k64 in FP4: 16 × 8 × 64 × 2 = 16,384 FP4 ops per cycle per TC
//   (this shape assumes 2:4 structured sparsity; dense is half)
// Total per SM: 4 × 16,384 = 65,536 FP4 ops per cycle
// Per GPU (192 SMs): 192 × 65,536 = 12,582,912 FP4 ops per cycle
// At ~1.5 GHz boost: ~18.9 PFLOPS with sparsity, ~9.0 PFLOPS dense
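The back-of-envelope above reduces to a single multiplication. A quick check (the clock and per-tensor-core op counts are the estimates from the comment, not published specifications):

```python
def tensor_core_pflops(sms, tcs_per_sm, ops_per_tc_per_cycle, clock_ghz):
    """Peak throughput = SMs x tensor cores/SM x ops/TC/cycle x clock."""
    return sms * tcs_per_sm * ops_per_tc_per_cycle * clock_ghz * 1e9 / 1e15

# 192 SMs x 4 TCs x 16,384 FP4 ops/cycle at ~1.5 GHz
peak = tensor_core_pflops(192, 4, 16384, 1.5)
# ~18.9 PFLOPS, matching the sparse figure above; dense is half
```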
B200 Tensor Core Throughput by Precision
| Precision | TFLOPS (Dense) | TFLOPS (Sparse) | vs H100 Dense | Primary Use Case |
|---|---|---|---|---|
| FP64 Tensor | 40 | N/A | 0.6x (67) | HPC, scientific computing |
| FP32 (TF32) | 1,100 | 2,250 | 2.2x (495) | Training accumulation |
| FP16 / BF16 | 2,250 | 4,500 | 2.3x (990) | Training, inference |
| FP8 (E4M3/E5M2) | 4,500 | 9,000 | 2.3x (1,979) | Inference, fine-tuning |
| FP4 (E2M1) | 9,000 | 18,000 | NEW | Inference (quantized models) |
| INT8 | 4,500 | 9,000 | 2.3x (1,979) | Integer quantized inference |
The FP4 Quality Question
FP4 can represent only 16 distinct code points; each group of elements shares a single scaling factor, so model quality depends entirely on the quantization scheme:
// FP4 quantization with group scaling:
// For a weight tensor W[4096, 4096]:
// Divide into groups of 64 elements
// For each group: scale = max(abs(group)) / 6.0 (FP4 max value)
// Quantize: W_fp4 = round(W / scale) in FP4
// Dequantize: W_approx = W_fp4 * scale
//
// Storage: 4096 * 4096 * 0.5 bytes (FP4) + (4096 * 4096 / 64) * 2 bytes (FP16 scales)
// = 8 MB + 0.5 MB = 8.5 MB (vs 32 MB FP8, 64 MB FP16)
//
// Quality loss: depends on model and calibration
// Published results (NVIDIA, 2024):
// GPT-3 175B: FP4 perplexity within 0.3% of FP16 with GPTQ-style calibration
// LLaMA-2 70B: FP4 perplexity within 0.5% of FP16
// Smaller models (7B): FP4 quality degrades noticeably (1-3% accuracy loss)
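The group-scaling recipe above can be sketched in NumPy. The E2M1 value grid follows the format description; round-to-nearest onto that grid is an assumption (hardware rounding may differ), so this is a sketch, not a production quantizer:

```python
import numpy as np

# E2M1 magnitudes per the format description above
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w, group_size=64):
    """Per-group absmax scaling to the FP4 max (6.0), then snap to grid."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)   # guard all-zero groups
    scaled = groups / scales
    # nearest representable FP4 magnitude; sign restored afterwards
    nearest = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[nearest], scales

def dequantize_fp4(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
q, s = quantize_fp4(w)
w_hat = dequantize_fp4(q, s)
# every quantized magnitude lies on the 8-point grid; the round-trip
# error per element is bounded by half the largest grid gap times the scale
```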
FP4 does not have sufficient dynamic range for training gradients, which require both very small and very large values. FP4 is exclusively an inference format, used to compress model weights for faster loading and reduced memory footprint. The compute happens in FP4 multiplication with FP32 accumulation — similar to INT4 quantization but with the benefit of floating-point representation.
HBM3e: 8 TB/s Memory Bandwidth
HBM3e Specifications
// B200 HBM3e configuration:
// 8 HBM3e stacks (4 per die)
// Each stack: 24 GB capacity, ~1 TB/s bandwidth
// Total: 192 GB capacity, 8 TB/s bandwidth
//
// HBM3e vs HBM3 (H100):
// Per-pin data rate: 9.6 Gbps (HBM3e) vs 6.4 Gbps (HBM3)
// Pins per stack: 1024 (same)
// Per-stack bandwidth: 9.6 * 1024 / 8 = 1,228.8 GB/s (theoretical)
// Practical: ~1,000 GB/s per stack (efficiency ~81%)
// 8 stacks: ~8,000 GB/s total
Memory Bandwidth and Model Performance
The 2.4x bandwidth increase from H100 to B200 directly impacts memory-bound operations:
// LLM decode throughput (single token generation):
// Throughput ∝ memory_bandwidth / model_size
//
// H100 (80B-parameter model, FP16 weights = 160 GB — note this exceeds
// one H100's 80 GB HBM, so read it as a per-GPU roofline, not a
// single-GPU deployment):
//   3,350 GB/s / 160 GB = 20.9 tokens/s (batch 1)
//
// B200 (80B FP16 model, FP16 weights):
// 8,000 GB/s / 160 GB = 50.0 tokens/s (batch 1)
// Improvement: 2.4x
//
// B200 (80B FP8 model):
// 8,000 GB/s / 80 GB = 100.0 tokens/s (batch 1)
// Improvement: 4.8x vs H100 FP16
//
// B200 (80B FP4 model):
// 8,000 GB/s / 40 GB = 200.0 tokens/s (batch 1)
// Improvement: 9.6x vs H100 FP16
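The roofline arithmetic above is just bandwidth divided by weight bytes. A small helper reproduces it (assumptions: every decode step reads all weights exactly once, and KV-cache traffic is ignored):

```python
def decode_tokens_per_sec(bandwidth_gb_s, params_billions, bytes_per_param):
    """Bandwidth-bound decode roofline: one full weight read per token."""
    model_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_gb

h100_fp16 = decode_tokens_per_sec(3350, 80, 2.0)   # ~20.9 tok/s
b200_fp16 = decode_tokens_per_sec(8000, 80, 2.0)   # 50.0 tok/s
b200_fp8  = decode_tokens_per_sec(8000, 80, 1.0)   # 100.0 tok/s
b200_fp4  = decode_tokens_per_sec(8000, 80, 0.5)   # 200.0 tok/s
```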
Single-Batch Decode Throughput: 70B Parameter Model
[Chart: tokens/s at batch 1 across GPU and precision configurations]
Second-Generation Transformer Engine
Dynamic Precision Selection
The Transformer Engine automatically selects the optimal precision for each layer and each tensor during training and inference:
// Transformer Engine 2.0 (Blackwell):
// 1. Monitors tensor statistics (magnitude distribution)
// 2. Selects per-tensor precision: FP4, FP8 (E4M3 or E5M2), BF16
// 3. Computes scaling factors for quantization
// 4. Executes tensor core operations at selected precision
// 5. Accumulates in FP32, then down-converts output
//
// Decision logic (simplified):
// If tensor values fit in FP4 range with < 0.1% outliers → use FP4
// If tensor values fit in FP8 E4M3 range → use FP8 E4M3
// If tensor values require FP8 E5M2 range (larger exponent) → use FP8 E5M2
// Otherwise → use BF16
//
// Different layers in the same model may use different precisions:
// Embedding layers: BF16 (wide value range)
// Attention QKV projections: FP8 (stable distribution)
// FFN layers: FP8 or FP4 (often quantization-friendly)
// Output logits: BF16 (precision-sensitive)
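A toy version of that decision logic can be written down. The range limits are derived from the format constants (FP4 max 6.0 / min 0.5; E4M3 max 448 / min normal 2^-6; E5M2 max 57,344 / min normal 2^-14), but the quantile-based outlier policy is a sketch of the simplified logic above, not the shipping heuristic:

```python
import numpy as np

def select_precision(tensor, outlier_frac=0.001):
    """Pick the narrowest format whose dynamic range covers all but
    outlier_frac of the nonzero magnitudes (sketch, not NVIDIA's code)."""
    mags = np.abs(tensor[tensor != 0])
    if mags.size == 0:
        return "fp4"
    lo, hi = np.quantile(mags, [outlier_frac, 1 - outlier_frac])
    spread = hi / lo
    if spread <= 6.0 / 0.5:              # FP4 E2M1 range ratio
        return "fp4"
    if spread <= 448.0 / 2.0**-6:        # FP8 E4M3 range ratio
        return "fp8_e4m3"
    if spread <= 57344.0 / 2.0**-14:     # FP8 E5M2 range ratio
        return "fp8_e5m2"
    return "bf16"

narrow = np.linspace(1.0, 5.0, 1000)                             # spread ~5
wide = np.concatenate([np.full(500, 1e-3), np.full(500, 1e3)])   # spread ~1e6
```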
Micro-Tensor Scaling
Blackwell introduces micro-tensor scaling for FP4 and FP8: per-group scaling factors that allow each small group of elements to have its own dynamic range:
// Micro-tensor scaling for FP8:
// Instead of one scale per tensor (coarse):
// scale = max(abs(tensor)) / 448.0 (FP8 E4M3 max)
// Problem: if tensor has outliers, most values use only a fraction of FP8 range
//
// Micro-tensor scaling (per group of 128 elements):
// For each group of 128 elements:
// group_scale = max(abs(group)) / 448.0
// group_fp8 = round(group / group_scale)
// Each group uses full FP8 dynamic range
// Overhead: 1 FP16 scale per 128 FP8 values = 1.6% storage overhead
//
// Result: FP8 with micro-tensor scaling matches FP16 quality
// on nearly all LLM benchmarks (GPT-4 class models)
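The underflow cost of a coarse scale can be demonstrated with a rough E4M3 emulator — clamp to ±448, round the mantissa to 3 bits, flush below the 2^-6 minimum normal. Subnormals are ignored, so this is a sketch, not a bit-exact codec:

```python
import numpy as np

E4M3_MAX = 448.0
E4M3_MIN_NORMAL = 2.0**-6

def to_e4m3(x):
    """Approximate E4M3: clamp, round mantissa to 3 bits, flush tiny values."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(x)                      # x = m * 2**e, 0.5 <= |m| < 1
    q = np.ldexp(np.round(m * 16) / 16, e)  # 3 fractional mantissa bits
    return np.where(np.abs(q) < E4M3_MIN_NORMAL, 0.0, q)

def fp8_roundtrip(x, group_size):
    """Absmax scale per group, quantize to pseudo-E4M3, dequantize."""
    g = x.reshape(-1, group_size)
    s = np.abs(g).max(axis=1, keepdims=True) / E4M3_MAX
    return (to_e4m3(g / s) * s).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
x[0] = 1000.0                               # one outlier inflates a coarse scale

# count values flushed to zero (lost entirely) under each granularity
lost_per_tensor = int(np.sum(fp8_roundtrip(x, 4096) == 0.0))
lost_per_group  = int(np.sum(fp8_roundtrip(x, 128) == 0.0))
# per-tensor scaling loses far more small values than per-group scaling
```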
NVLink 5.0
Per-GPU Interconnect
// B200 NVLink 5.0:
// 18 NVLink 5.0 links per GPU
// Each link: 100 GB/s bidirectional (50 GB/s per direction)
// Total: 1,800 GB/s bidirectional per GPU
//
// vs H100 NVLink 4.0:
// 18 links × 50 GB/s = 900 GB/s per GPU
// B200: 2x per-GPU NVLink bandwidth
NVL72 with B200
In the NVL72 configuration, 72 B200 GPUs are connected via NVSwitch 4.0:
// NVL72 aggregate numbers:
// Total GPU memory: 72 × 192 GB = 13.8 TB HBM3e
// Total compute (FP8): 72 × 4,500 = 324 PFLOPS
// Total compute (FP4): 72 × 9,000 = 648 PFLOPS
// Total HBM bandwidth: 72 × 8,000 = 576 TB/s
// Total NVLink bisection: 130 TB/s
//
// For a 1 trillion parameter model (FP8):
// Model size: 1 TB
// Fits across 72 GPUs: 1 TB / 72 ≈ 13.9 GB per GPU
// Each GPU has 192 GB: model fits easily with room for KV cache
// Per-GPU compute: 4,500 TFLOPS FP8
// Per-GPU model BW: only ~14 GB of weights to read per token → compute-bound
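The sharding arithmetic above, as a check (decimal units, 1 TB = 1,000 GB; a binary-TB convention gives the slightly larger 14.2 GB figure):

```python
def per_gpu_shard_gb(params, bytes_per_param, n_gpus=72):
    """Weight bytes per GPU when a model is sharded evenly across GPUs."""
    return params * bytes_per_param / n_gpus / 1e9

shard = per_gpu_shard_gb(1e12, 1)   # 1T params in FP8 → ~13.9 GB per GPU
headroom = 192 - shard              # ~178 GB left for KV cache and activations
```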
Decompression Engine
Hardware Decompression
Blackwell includes a dedicated hardware decompression engine that can decompress data as it is loaded from HBM:
// Decompression engine capabilities:
// - LZ4 decompression (NVIDIA quotes up to ~800 GB/s)
// - Snappy and Deflate decompression
// - Custom NVIDIA formats for weight compression
//
// Use case: store compressed model weights in HBM
// A 70B FP8 model (70 GB): compresses to ~50 GB with LZ4
// Savings: ~20 GB of HBM freed for KV cache
// Caveat: at ~800 GB/s the decompressor, not HBM, bounds how fast
// compressed weights can be streamed back out
//
// The decompression engine operates in the memory pipeline
// No GPU compute cycles consumed for decompression
Weight matrices are moderately compressible (1.2-1.5x with LZ4). Activations are poorly compressible (close to random data). The decompression engine primarily benefits weight loading in inference workloads. Training workloads see less benefit because activations (which are stored for backward pass) do not compress well.
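The compressibility gap is easy to reproduce in software. Here zlib stands in for the hardware LZ4 path (an assumption: the trend carries over, not the exact ratios), with INT8-quantized weights as the clustered-values case and raw float activations as the near-random case:

```python
import zlib
import numpy as np

def compression_ratio(arr, level=6):
    """Uncompressed/compressed size using zlib as a software stand-in."""
    raw = arr.tobytes()
    return len(raw) / len(zlib.compress(raw, level))

rng = np.random.default_rng(0)
# quantized weights cluster near zero → low byte entropy → compressible
weights = np.clip(rng.standard_normal(1 << 16) * 10, -127, 127).astype(np.int8)
# activations look like random float bits → barely compressible
activations = rng.standard_normal(1 << 16).astype(np.float32)

w_ratio = compression_ratio(weights)       # roughly the 1.2-1.5x regime
a_ratio = compression_ratio(activations)   # close to 1.0
```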
Reliability Features
Confidential Computing
B200 adds hardware support for confidential computing — encryption of GPU memory and compute to protect against physical and software attacks:
// Confidential computing features:
// - AES-256 encryption of all HBM data at rest
// - Encrypted NVLink traffic between GPUs
// - Secure boot and firmware attestation
// - Protected memory regions (TEE - Trusted Execution Environment)
//
// Performance overhead: ~1-3% (encryption/decryption in memory controller)
// Not enabled by default — requires explicit configuration
// Target use case: multi-tenant cloud inference (protecting customer data)
Enhanced RAS
// Reliability, Availability, Serviceability (RAS):
// - Row remapping: transparently remap faulty HBM rows to spare rows
// - Enhanced ECC: SECDED on HBM3e + chipkill capability
// - Per-SM isolation: a faulty SM can be disabled without affecting others
// - Live migration: move GPU workloads between B200s without stopping
// (data center management feature, not application-level)
Performance Projections
Training
// GPT-4 class model (1.8T parameters, MoE) training on NVL72:
// FP8 mixed precision with micro-tensor scaling
// Per-GPU compute: 4,500 TFLOPS FP8
// Model parallelism: 72-way (TP + EP across NVL72)
// Expected MFU (Model FLOPS Utilization): 40-50%
// Effective throughput: 72 × 4,500 × 0.45 = 145.8 PFLOPS
//
// vs H100 DGX cluster (72 GPUs, IB-connected):
// Per-GPU compute: 1,979 TFLOPS FP8
// MFU: 35-40% (IB limits TP to 8-way per node)
// Effective: 72 × 1,979 × 0.375 = 53.4 PFLOPS
//
// B200 NVL72 advantage: 145.8 / 53.4 = 2.7x faster training
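The cluster-level arithmetic above in one helper (the MFU figures are the projections from the comment, not measurements):

```python
def effective_pflops(n_gpus, tflops_per_gpu, mfu):
    """Aggregate sustained throughput = GPUs x peak per GPU x utilization."""
    return n_gpus * tflops_per_gpu * mfu / 1000.0

b200_nvl72 = effective_pflops(72, 4500, 0.45)    # 145.8 PFLOPS
h100_dgx   = effective_pflops(72, 1979, 0.375)   # ~53.4 PFLOPS
speedup = b200_nvl72 / h100_dgx                  # ~2.7x
```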
Inference
B200 vs H100 Inference Performance Projections
| Workload | H100 (FP8) | B200 (FP8) | B200 (FP4) | B200 Speedup |
|---|---|---|---|---|
| LLaMA-70B decode (batch 1) | ~25 tok/s | ~60 tok/s | ~120 tok/s | 2.4-4.8x |
| LLaMA-70B decode (batch 64) | ~800 tok/s | ~2,000 tok/s | ~3,500 tok/s | 2.5-4.4x |
| LLaMA-70B prefill (4K tokens) | ~3,000 tok/s | ~7,000 tok/s | ~12,000 tok/s | 2.3-4x |
| GPT-4 1.8T (8-GPU TP) | ~50 tok/s | ~130 tok/s | ~250 tok/s | 2.6-5x |
| Stable Diffusion XL (512x512) | ~40 img/s | ~95 img/s | N/A | 2.4x |
Summary
The B200 is a brute-force response to the scaling requirements of frontier AI models. The dual-die design delivers 2.6x more transistors by working around reticle limits, with a 10 TB/s on-package link that makes the two dies invisible to software. The 8 TB/s HBM3e provides the bandwidth to actually feed the 192 SMs with data. FP4 tensor cores double the effective compute density for quantized inference. And NVLink 5.0 at 1,800 GB/s per GPU enables 72-GPU NVL72 systems with full bisection bandwidth.
The key numbers: 9,000 TFLOPS FP4, 4,500 TFLOPS FP8, 8,000 GB/s HBM3e, 192 GB capacity, 1,800 GB/s NVLink, 1,000 W TDP. At the system level, an NVL72 rack delivers 648 PFLOPS FP4 with 13.8 TB of memory — enough to run a 1-trillion-parameter model in FP8 without model parallelism complexity beyond a single rack.
The engineering tradeoffs are clear: dual-die adds cross-die latency and manufacturing complexity, 1,000 W TDP demands liquid cooling, and FP4 precision requires careful quantization. But for the target workload — frontier LLM training and high-throughput inference — every tradeoff is worth it.