Part 14 of 30 in the series: GPU Hardware & AI Accelerators

A single bit flip in a GPU’s HBM during LLM training can corrupt a gradient update, propagate through the optimizer state, and silently degrade model quality over thousands of steps. In a 10,000-GPU training cluster running for weeks, the probability of at least one uncorrected error is not negligible — it approaches certainty. At the scale of frontier model training (100,000+ GPU-hours), silent data corruption (SDC) is a production reliability concern, not a theoretical one. Meta reported that during LLaMA 3 training, they experienced hardware-induced data corruption events roughly once every few days across their fleet.

Error-Correcting Code (ECC) memory is the primary defense. This post covers the physics of memory errors, how ECC works at the circuit level, the specific ECC implementations in NVIDIA data center GPUs (HBM, SRAM, register file), the performance and capacity overhead of ECC, error detection and reporting mechanisms, and what happens when ECC is insufficient.

The Physics of Memory Errors

Sources of Bit Flips

Memory errors have three primary physical causes:

  1. Cosmic rays and alpha particles. High-energy particles striking a silicon die can deposit enough charge to flip a stored bit. At sea level, the soft error rate (SER) is approximately $10^{-12}$ to $10^{-15}$ errors per bit per hour, depending on process node and altitude. Denver (altitude 1,600m) experiences roughly 3x higher cosmic ray flux than sea level.

  2. Thermal noise. At operating temperatures (70-95°C for GPU dies), thermal energy creates random voltage fluctuations. As process nodes shrink and operating voltages decrease, the noise margin between a stored “0” and “1” narrows, increasing vulnerability.

  3. Electromagnetic interference. Crosstalk between adjacent signal lines can induce voltage transients. At HBM densities (16+ Gb per die on HBM3e), bit cells are packed so tightly that crosstalk is a significant concern.

Error Rates in Practice

📊

Memory Error Rates in Data Center GPUs

| Error Source | Approximate Rate | Scale Impact (10K GPU cluster) | Consequence Without ECC |
|---|---|---|---|
| Cosmic ray (HBM) | ~1 error / GPU / month | ~300 errors / day | Random bit flip in weights/activations |
| Cosmic ray (SRAM) | ~0.1 error / GPU / month | ~30 errors / day | Flip in register or cache line |
| Thermal noise (HBM) | ~0.01 error / GPU / month | ~3 errors / day | Bit decay in stored data |
| Stuck bit (manufacturing) | ~0.001 per HBM die | Permanent | Consistent wrong value at one address |
| Row hammer (adjacent cell) | Possible at high density | Variable | Reads to one row flip bits in neighbor |
Note: Rates are approximate and depend heavily on process node, altitude, temperature, and shielding. At scale, even rare events become frequent.

Why Single-Bit Errors Matter for AI

A single-bit flip in an FP16 weight can change the value catastrophically:

// FP16 representation: 1 sign bit, 5 exponent bits, 10 mantissa bits
// Value: 1.5 = 0 01111 1000000000 (binary)
//
// Flip bit 14 (highest exponent bit):
// 0 11111 1000000000 = NaN (exponent all 1s = special value)
//
// Flip bit 8 (second-highest mantissa bit):
// 0 01111 1100000000 = 1.75 (small change, possibly recoverable)
//
// Flip bit 15 (sign):
// 1 01111 1000000000 = -1.5 (sign flip, large error)

A NaN propagates through every subsequent computation, eventually making the entire tensor NaN. A sign flip in a gradient can cause the optimizer to step in the wrong direction. Both are silent — no exception is raised, no error is logged — the training simply produces a worse model.
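The effect of each flip can be reproduced on the CPU. A minimal sketch in plain Python, using the `struct` module's half-precision format code (`flip_bit_fp16` is an illustrative helper, not from any GPU toolkit):

```python
import math
import struct

def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of the FP16 encoding of `value` and decode the result."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))  # FP16 -> uint16
    (out,) = struct.unpack("<e", struct.pack("<H", bits ^ (1 << bit)))
    return out

print(flip_bit_fp16(1.5, 14))  # highest exponent bit flips 1.5 to NaN
print(flip_bit_fp16(1.5, 8))   # mantissa bit flips 1.5 to 1.75
print(flip_bit_fp16(1.5, 15))  # sign bit flips 1.5 to -1.5
```

Note how far the damage ranges: a mantissa flip moves the value by a fraction, while an exponent flip can turn it into NaN or scale it by orders of magnitude.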

🚨 Silent Data Corruption Is the Worst Failure Mode

Unlike crashes or hangs, silent data corruption produces incorrect results without any error signal. The training loss may increase slightly, or the model may develop subtle quality issues that are only caught during evaluation days or weeks later. At scale, SDC is the primary argument for ECC memory in training clusters.

How ECC Works

SECDED: Single-Error Correct, Double-Error Detect

Data center GPUs use SECDED (Single-Error Correct, Double-Error Detect) codes. For every 64 bits of data, 8 additional parity bits are stored (72 bits total). The parity bits are computed using a Hamming code extended with an overall parity bit.

The encoding process:

$$\text{codeword}_{72} = \text{data}_{64} \;\|\; \text{parity}_{8}$$

Each parity bit covers a specific subset of data bits, determined by the Hamming parity-check matrix $H$:

$$\text{syndrome} = H \times \text{codeword}^{T}$$

  • Syndrome = 0 and overall parity checks out: No error detected. Data is correct.
  • Syndrome != 0 and overall parity fails (odd number of flips): Single-bit error. The syndrome identifies the bit position. Flip it to correct.
  • Syndrome != 0 but overall parity checks out (even number of flips): Double-bit error detected. Cannot correct — report as uncorrectable error (UE).
// ECC encoding example (simplified):
// Data: 64 bits = [d63 d62 d61 ... d1 d0]
// Parity bits: p0 = d0 XOR d1 XOR d3 XOR d5 XOR ...  (covers odd positions)
//              p1 = d0 XOR d2 XOR d3 XOR d6 XOR ...  (covers bit pairs)
//              p2 = d1 XOR d2 XOR d3 XOR d8 XOR ...  (covers nibbles)
//              ... (7 Hamming parity bits + 1 overall parity)

// On read:
// Compute syndrome from stored 72-bit codeword
// If syndrome == 0: no error, return data
// If syndrome != 0 and odd weight: single error, correct and return
// If syndrome != 0 and even weight: double error, signal UE

Overhead of SECDED

For every 64 bits of data, 8 parity bits are stored:

$$\text{overhead} = \frac{8}{64} = 12.5\%$$

This 12.5% overhead applies to both physical storage (extra parity cells) and transfer bandwidth:

📊

ECC Overhead on HBM

| Metric | Without ECC (hypothetical) | With ECC | Overhead |
|---|---|---|---|
| H100 HBM user-visible capacity | 80 GB | 80 GB | Parity held in dedicated cells, not user-addressable |
| H100 HBM raw bandwidth | 3,350 GB/s | 3,350 GB/s | N/A |
| H100 HBM effective data bandwidth | 3,350 GB/s | ~2,978 GB/s | 12.5% of bus transfers carry parity |
| A100 HBM user-visible capacity | 80 GB | 80 GB | Same: parity not user-addressable |
Note: NVIDIA GPUs expose the full 80 GB to users — the ECC parity is stored in a separate region not addressable by software. The bandwidth overhead means every 9 bytes transferred carry 8 bytes of data and 1 byte of parity.
ℹ️ ECC Capacity Is Not User-Visible on Modern GPUs

On A100 and H100, the HBM die includes dedicated storage for ECC parity bits that is separate from the user-addressable 80 GB. You do not lose 12.5% of user-visible capacity. However, the bandwidth overhead is real — every HBM read transfers both data and parity, consuming physical bus bandwidth.

ECC in GPU Subsystems

HBM ECC

HBM3 (H100) implements on-die ECC within the HBM stack itself. Each HBM die has ECC logic integrated into the DRAM array:

// HBM3 ECC flow:
// 1. CPU/GPU writes 256-bit data burst to HBM controller
// 2. HBM controller computes ECC parity (SECDED per 64-bit word)
// 3. HBM die stores data + parity in adjacent cells
// 4. On read: HBM die checks ECC before returning data
// 5. If single-bit error: corrected transparently
// 6. If double-bit error: error flag set, GPU gets UE interrupt

// HBM3e adds additional reliability:
// - On-die ECC corrects within the HBM stack
// - Link-level CRC protects the bus between HBM and GPU
// - GPU-side ECC provides end-to-end protection

SRAM ECC (L2 Cache, Shared Memory)

The L2 cache and shared memory use SRAM-based ECC, with slightly different parameters:

// L2 cache ECC:
// Each 32-byte sector has 7 ECC check bits
// SECDED per sector: corrects 1-bit, detects 2-bit
// Overhead: 7 bits per 256 bits = 2.7%
// Lower overhead than HBM because SRAM bit cells are more stable

// Shared memory ECC:
// Similar SECDED protection per 32-byte sector
// Can be disabled on some GPUs for extra capacity (not recommended)

Register File ECC

The register file on data center GPUs (V100, A100, H100) is ECC-protected:

// Register ECC:
// Each 32-bit register has ECC protection
// Parity bits are stored alongside the register data
// Correction happens in the same cycle as register read
// No additional latency for ECC-correct registers

// If an uncorrectable error is detected in a register:
// The warp is killed (hardware exception)
// The application receives a CUDA error
// Other warps/kernels on the same GPU are NOT affected (on Hopper)
📊

ECC Protection Across GPU Memory Subsystems

| Subsystem | ECC Type | Correction Capability | Detection Capability | Latency Overhead |
|---|---|---|---|---|
| HBM3/HBM3e | On-die SECDED | 1-bit per 64-bit word | 2-bit per 64-bit word | ~0 (integrated in HBM timing) |
| L2 Cache | SRAM SECDED | 1-bit per 32-byte sector | 2-bit per sector | ~0 (in read pipeline) |
| L1 Cache | Parity | None (detects only) | 1-bit | ~0 |
| Shared Memory | SRAM SECDED | 1-bit per sector | 2-bit per sector | ~0 |
| Register File | SECDED | 1-bit per 32-bit register | 2-bit per register | ~0 |
| NVLink | CRC + retry | Full packet | Full packet | ~100ns per retry |
| PCIe | LCRC + retry | Full TLP | Full TLP | ~microseconds per retry |
Note: L1 cache uses parity (detect-only) because it can re-fetch from L2 on error. L2 and below use SECDED because data loss at those levels is unrecoverable without HBM re-read.

Error Reporting and Monitoring

nvidia-smi Error Counters

# View ECC error counts
nvidia-smi -q -d ECC

# Output (example):
# ECC Errors
#     Volatile
#         SRAM Correctable              : 12
#         SRAM Uncorrectable            : 0
#         DRAM Correctable              : 847
#         DRAM Uncorrectable            : 0
#     Aggregate
#         SRAM Correctable              : 156
#         SRAM Uncorrectable            : 0
#         DRAM Correctable              : 23481
#         DRAM Uncorrectable            : 2
  • Volatile counters: Reset on driver reload or GPU reset.
  • Aggregate counters: Persist across reboots (stored in GPU infoROM).
  • Correctable: Single-bit errors that ECC silently fixed. Normal operation — these happen regularly.
  • Uncorrectable: Double-bit errors that ECC detected but could not fix. These cause application errors.
# Monitor ECC errors over time (useful for fleet health monitoring)
nvidia-smi dmon -d 1 -s e
# Reports ECC errors per second

# Query specific GPU for retirement status
nvidia-smi -q -d RETIRED_PAGES -i 0
# Shows pages retired due to uncorrectable ECC errors

# Check if GPU needs replacement
nvidia-smi -q -d RETIRED_PAGES -i 0 | grep "Pending"
# "Pending Blacklist": Yes → GPU has pending page retirements, needs reboot
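Fleet-health scripts typically shell out to `nvidia-smi -q -d ECC` and parse the counters. A minimal sketch, operating on captured output shaped like the example above (`parse_ecc` and the sample text are illustrative, not part of any NVIDIA tooling):

```python
def parse_ecc(report: str) -> dict[str, dict[str, int]]:
    """Parse 'Counter Name : value' pairs under the Volatile/Aggregate sections."""
    counters: dict[str, dict[str, int]] = {}
    section = None
    for line in report.splitlines():
        stripped = line.strip()
        if stripped in ("Volatile", "Aggregate"):
            section = stripped
            counters[section] = {}
        elif section and ":" in stripped:
            name, _, value = stripped.partition(":")
            counters[section][name.strip()] = int(value)
    return counters

sample = """\
ECC Errors
    Volatile
        DRAM Correctable              : 847
        DRAM Uncorrectable            : 0
    Aggregate
        DRAM Correctable              : 23481
        DRAM Uncorrectable            : 2
"""
errors = parse_ecc(sample)
if errors["Aggregate"]["DRAM Uncorrectable"] > 0:
    print("schedule GPU for diagnostics")
```

In production this would feed a metrics pipeline; the key signal is the trend of correctable errors per GPU, not any single reading.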

Page Retirement

When an HBM address experiences repeated uncorrectable errors, the GPU driver retires that page — marking it as unusable. The retired page is not allocated to future memory requests:

# View retired pages
nvidia-smi -q -d RETIRED_PAGES
# Retired pages due to multiple single bit ECC errors : 3
# Retired pages due to double bit ECC errors          : 1
# Retired pages pending                               : 0

# Impact: each retired page reduces usable HBM by 4 KB (page size)
# With 80 GB HBM, losing 100 pages = 400 KB = negligible
# If retired pages exceed a threshold (~64 pages), NVIDIA recommends GPU replacement

DCGM (Data Center GPU Manager) for Fleet Monitoring

# Enable DCGM health monitoring
dcgmi health -s meit -g 1
# m = memory, e = ECC, i = inforom, t = thermal

# Run diagnostics
dcgmi diag -r 3 -g 1
# Level 3: comprehensive GPU health check including ECC memory test

# Query ECC metrics via DCGM fields
dcgmi dmon -e 312,313,314,315
# 312: DRAM correctable errors
# 313: DRAM uncorrectable errors
# 314: SRAM correctable errors
# 315: SRAM uncorrectable errors

The Cost of ECC

Bandwidth Overhead

The 12.5% bandwidth overhead of HBM ECC is significant for bandwidth-limited workloads:

// H100 HBM3 bandwidth: 3,350 GB/s (raw)
// Of this, 12.5% carries ECC parity: 419 GB/s "wasted" on parity
// Effective data bandwidth: 3,350 * (64/72) = 2,978 GB/s

// For a memory-bound kernel achieving 90% of peak:
// With ECC: 2,978 * 0.9 = 2,680 GB/s effective
// Without ECC (hypothetical): 3,350 * 0.9 = 3,015 GB/s effective
// Difference: ~12.5% throughput loss on pure bandwidth tests

ECC Bandwidth Impact on Memory-Bound Kernels (H100)

Effective data throughput:

  • Vector add (peak bandwidth, with ECC): ~2,978 GB/s
  • Vector add (no ECC, theoretical): ~3,350 GB/s
  • GEMM (compute-bound): ~990 GB/s (ECC irrelevant, compute limited)
  • Attention (mixed bandwidth/compute): ~2,400 GB/s (partial bandwidth impact)
ECC Overhead Depends on Workload

The 12.5% bandwidth penalty only matters for purely memory-bandwidth-limited kernels. Compute-bound kernels (GEMM, tensor core operations) are not affected because they do not saturate HBM bandwidth. In LLM inference, the bandwidth-limited phases (attention, KV cache access) see the full 12.5% penalty, while the compute-limited phases (FFN GEMM) are unaffected.
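That workload dependence fits a one-line model: only the memory-bound fraction of runtime pays the parity tax. A sketch using the H100 figures above (`effective_slowdown` is an illustrative name, and the simple two-phase model is an assumption):

```python
RAW_BW_GBS = 3350.0           # H100 HBM3 raw bandwidth
ECC_DATA_FRACTION = 64 / 72   # 8 of every 72 bits on the bus are parity

def effective_slowdown(mem_bound_fraction: float) -> float:
    """Fractional runtime increase from ECC, given the share of runtime that
    is limited by HBM bandwidth (0.0 = pure compute, 1.0 = pure streaming)."""
    # Memory-bound time stretches by 72/64; compute-bound time is unchanged.
    return mem_bound_fraction * (1 / ECC_DATA_FRACTION - 1)

print(f"{RAW_BW_GBS * ECC_DATA_FRACTION:.0f} GB/s effective data bandwidth")
print(f"{effective_slowdown(1.0):.1%} slowdown for a pure streaming kernel")
print(f"{effective_slowdown(0.0):.1%} slowdown for a pure GEMM")
```

For an attention-heavy decode phase that is, say, 80% bandwidth-limited, the model predicts roughly a 10% runtime penalty from ECC.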

Power Overhead

ECC logic consumes additional power:

// ECC power overhead (estimated):
// HBM on-die ECC: ~2-3 W per HBM stack (included in HBM power budget)
// H100 has 6 HBM3 stacks: ~12-18 W for HBM ECC
// SRAM ECC (L2, shared mem, registers): ~5-10 W
// Total ECC power: ~17-28 W out of 700 W TDP = 2.5-4%

// On consumer GPUs (RTX 4090, 450W TDP):
// Memory is GDDR6X, which has no ECC (no HBM present)
// SRAM ECC still present on L2 and registers
// Power savings: ~12-18 W (HBM ECC not present)

Latency Overhead

ECC computation is pipelined into the memory read path. The latency overhead is effectively zero for normal operation:

// HBM read latency: ~500 cycles
// ECC decode: ~2-3 cycles (pipelined, overlaps with data transfer)
// Net additional latency: ~0 cycles (hidden by pipeline)

// Exception: when a correctable error is found
// Correction adds ~1-2 cycles to flip the erroneous bit
// Still negligible compared to 500-cycle base latency

// Exception: when an uncorrectable error is found
// Triggers an interrupt (~1000+ cycles)
// GPU halts the affected warp
// Error is logged, page may be retired

ECC vs No-ECC: The Datacenter Decision

When ECC Is Non-Negotiable

// Training at scale (10,000 GPUs, weeks of runtime):
// Expected correctable errors: ~10,000 per month across the cluster
//   (see the error-rate table above: ~1 HBM event / GPU / month)
// Expected uncorrectable errors: ~10-50 per month
// Without ECC: ~10,000 silent data corruptions per month
// With ECC: all of them corrected silently, UEs caught and the job restarted

// Medical/financial inference:
// A single bit flip could change a diagnosis or financial prediction
// ECC provides error detection even when correction fails (UE)
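Because memory errors arrive independently, Poisson arithmetic makes the scale argument concrete. A sketch, assuming the per-GPU rate from the table above and treating the multi-bit fraction as a rough assumption:

```python
import math

GPUS = 10_000
ERRORS_PER_GPU_MONTH = 1.0    # ~1 HBM event / GPU / month (table above)
MULTI_BIT_FRACTION = 1e-4     # assumed share of events that exceed SECDED

lam_month = GPUS * ERRORS_PER_GPU_MONTH      # expected events per cluster-month
p_at_least_one = 1 - math.exp(-lam_month)    # Poisson: P(>=1 event in a month)

print(f"expected events/month: {lam_month:.0f}")
print(f"P(at least one event): {p_at_least_one:.6f}")  # indistinguishable from 1
print(f"expected multi-bit events: {lam_month * MULTI_BIT_FRACTION:.1f}/month")
```

This is the sense in which "rare events become frequent": a per-GPU rate that is negligible for a workstation makes at least one monthly event a near-certainty for the cluster.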

When ECC Matters Less

// Small-scale inference (1-8 GPUs, short sessions):
// Expected errors: <1 per month
// Probability of SDC affecting a single inference: ~10^-10
// Consumer GPUs (no HBM ECC) are acceptable

// Evaluation and testing:
// Results are compared against ground truth
// A corrupted result is caught by the evaluation pipeline
// ECC is nice-to-have but not critical

The Real-World SDC Problem

Even with ECC, silent data corruption can occur from sources ECC does not protect:

// Sources of SDC that ECC CANNOT catch:
// 1. Logic errors in the GPU's compute pipeline (rare but documented)
//    - A faulty multiplier that produces wrong results
//    - No ECC on intermediate computation results
// 2. Software bugs in CUDA runtime or driver
// 3. Firmware bugs in the GPU microcontroller
// 4. Multi-bit errors exceeding SECDED capability
//    - Cosmic ray shower hitting adjacent bits
//    - ~1 in 10,000 error events produce multi-bit errors

// Google's experience ("Cores that don't count", published 2021):
// ~1 in 1000 machines exhibited "mercurial cores" —
// compute units that produce wrong results rarely but repeatedly
// These pass standard diagnostics but fail on specific operations
⚠️ ECC Is Necessary But Not Sufficient

ECC protects against memory bit flips but not against compute errors (faulty ALUs, incorrect tensor core results). Large-scale training increasingly uses application-level checksumming — computing checksums of activations or gradients and comparing across data-parallel replicas. If one replica’s checksum diverges, its state is discarded and restored from a healthy replica.
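A minimal sketch of that replica-checksum idea, with plain Python lists standing in for gradient tensors (`hash_gradients` and the three-replica setup are illustrative):

```python
import hashlib
import struct
from collections import Counter

def hash_gradients(grads: list[float]) -> str:
    """Checksum a gradient tensor's raw bytes; replicas compare these digests."""
    return hashlib.sha256(struct.pack(f"<{len(grads)}d", *grads)).hexdigest()

# Data-parallel replicas should hold bit-identical gradients after all-reduce;
# a silent bit flip on one replica shows up as a divergent digest.
healthy = [0.125, -0.5, 3.0]
corrupt = [0.125, -0.5, 3.0000000001]   # simulated silent corruption
digests = [hash_gradients(healthy), hash_gradients(healthy), hash_gradients(corrupt)]

majority, _ = Counter(digests).most_common(1)[0]
bad_replicas = [i for i, d in enumerate(digests) if d != majority]
print("replicas to restore from a healthy copy:", bad_replicas)
```

Majority vote over the digests localizes the corrupted replica without re-running any computation, which is why this check is cheap enough to run every few steps.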

Chipkill and Beyond: Future Error Correction

Chipkill for HBM

Chipkill extends SECDED to tolerate the failure of an entire DRAM chip within an HBM stack. HBM3e with chipkill can lose one of its 16 channels and continue operating with corrected data:

// Standard SECDED: corrects 1-bit per 64-bit word
// Chipkill: corrects all bits in a single DRAM chip (one channel)
// HBM3e: 16 channels per stack, 32 bits per channel
// Chipkill can correct a full 32-bit channel failure
// Overhead: higher than SECDED (~25% parity bits vs 12.5%)
// But protects against entire chip failures, not just single bits
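Chipkill-class codes treat a failed chip as an erasure at a known location, which is easier to correct than an error at an unknown position: for a single known-dead channel, one parity channel is enough to rebuild the lost data, RAID-style. A simplified sketch (real chipkill uses symbol-based Reed-Solomon-style codes over wider words, not plain XOR):

```python
def add_parity_channel(channels: list[int]) -> list[int]:
    """Append one parity word: the XOR of all data channels."""
    parity = 0
    for word in channels:
        parity ^= word
    return channels + [parity]

def rebuild(channels: list[int], dead: int) -> int:
    """Reconstruct the word of a known-failed channel from the survivors."""
    value = 0
    for i, word in enumerate(channels):
        if i != dead:
            value ^= word           # XOR of all channels is 0, so the
    return value                    # survivors' XOR equals the dead word

stack = add_parity_channel([0xDEAD, 0xBEEF, 0xCAFE, 0xF00D])
print(hex(rebuild(stack, dead=1)))  # channel 1 lost, fully recovered
```

The key difference from SECDED is that the failure location is known (the dead chip identifies itself), so the code only has to solve for the missing value, not find it.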

Adaptive ECC

Future GPUs may implement adaptive ECC that adjusts protection level based on error rates:

// Concept: adaptive ECC
// Normal operation: SECDED (12.5% overhead)
// When error rate rises (aging, temperature): upgrade to DEC-TED
//   (Double-Error Correct, Triple-Error Detect)
//   Overhead increases to ~25%, but stronger protection
// When a region shows persistent errors: retire and remap

Practical Recommendations

For training clusters:

  1. Always enable ECC on data center GPUs (it is enabled by default on A100/H100 and cannot be disabled).
  2. Monitor ECC counters via DCGM. Set alerts for correctable error rates above 100 per day per GPU (indicates potential hardware degradation).
  3. Replace GPUs with more than 64 retired pages or any uncorrectable SRAM errors.
  4. Implement application-level checksumming for training runs exceeding 1,000 GPU-hours.
  5. Checkpoint frequently — every 10-30 minutes — to limit the blast radius of an undetected corruption event.
  6. Run GPU diagnostics (dcgmi diag -r 3) before starting multi-week training runs.

For inference deployments:

  1. Use ECC GPUs (A100, H100) for production inference serving.
  2. Monitor correctable error rates as a proxy for hardware health.
  3. Implement output validation for safety-critical applications (medical, financial, autonomous).
  4. Consider N+1 redundancy — keep spare GPUs for hot-swap when ECC errors trigger replacement.