A single bit flip in a GPU’s HBM during LLM training can corrupt a gradient update, propagate through the optimizer state, and silently degrade model quality over thousands of steps. In a 10,000-GPU training cluster running for weeks, the probability of at least one uncorrected error is not negligible — it approaches certainty. At the scale of frontier model training (100,000+ GPU-hours), silent data corruption (SDC) is a production reliability concern, not a theoretical one. Meta reported that during LLaMA 3 training, they experienced hardware-induced data corruption events roughly once every few days across their fleet.
Error-Correcting Code (ECC) memory is the primary defense. This post covers the physics of memory errors, how ECC works at the circuit level, the specific ECC implementations in NVIDIA data center GPUs (HBM, SRAM, register file), the performance and capacity overhead of ECC, error detection and reporting mechanisms, and what happens when ECC is insufficient.
The Physics of Memory Errors
Sources of Bit Flips
Memory errors have three primary physical causes:
- Cosmic rays and alpha particles. High-energy particles striking a silicon die can deposit enough charge to flip a stored bit. The sea-level soft error rate (SER), measured in errors per bit per hour, depends strongly on process node and altitude; Denver (altitude 1,600 m) experiences roughly 3x higher cosmic ray flux than sea level.
- Thermal noise. At operating temperatures (70-95 °C for GPU dies), thermal energy creates random voltage fluctuations. As process nodes shrink and operating voltages decrease, the noise margin between a stored “0” and “1” narrows, increasing vulnerability.
- Electromagnetic interference. Crosstalk between adjacent signal lines can induce voltage transients. At HBM densities (16+ Gb per die on HBM3e), bit cells are packed so tightly that crosstalk is a significant concern.
Error Rates in Practice
Memory Error Rates in Data Center GPUs
| Error Source | Approximate Rate | Scale Impact (10K GPU cluster) | Consequence Without ECC |
|---|---|---|---|
| Cosmic ray (HBM) | ~1 error / GPU / month | ~300 errors / day | Random bit flip in weights/activations |
| Cosmic ray (SRAM) | ~0.1 error / GPU / month | ~30 errors / day | Flip in register or cache line |
| Thermal noise (HBM) | ~0.01 error / GPU / month | ~3 errors / day | Bit decay in stored data |
| Stuck bit (manufacturing) | ~0.001 per HBM die | Permanent | Consistent wrong value at one address |
| Row hammer (adjacent cell) | Possible at high density | Variable | Reads to one row flip bits in neighbor |
Why Single-Bit Errors Matter for AI
A single-bit flip in an FP16 weight can change the value catastrophically:
// FP16 representation: 1 sign bit, 5 exponent bits, 10 mantissa bits
// Value: 1.5 = 0 01111 1000000000 (binary)
//
// Flip bit 14 (highest exponent bit):
// 0 11111 1000000000 = NaN (exponent all 1s = special value)
//
// Flip bit 8 (second-highest mantissa bit):
// 0 01111 1100000000 = 1.75 (small change, possibly recoverable)
//
// Flip bit 15 (sign):
// 1 01111 1000000000 = -1.5 (sign flip, large error)
A NaN propagates through every subsequent computation, eventually making the entire tensor NaN. A sign flip in a gradient can cause the optimizer to step in the wrong direction. Both are silent — no exception is raised, no error is logged — the training simply produces a worse model.
Unlike crashes or hangs, silent data corruption produces incorrect results without any error signal. The training loss may increase slightly, or the model may develop subtle quality issues that are only caught during evaluation days or weeks later. At scale, SDC is the primary argument for ECC memory in training clusters.
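These flips can be reproduced directly with Python's standard library, which packs and unpacks IEEE-754 binary16 via the `struct` format code `"e"`. A minimal sketch (bit 0 is the least significant bit):

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of an FP16 value and decode the result."""
    (raw,) = struct.unpack("<H", struct.pack("<e", value))   # 16-bit pattern
    (out,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return out

print(flip_bit(1.5, 14))   # highest exponent bit: nan
print(flip_bit(1.5, 8))    # mantissa bit: 1.75
print(flip_bit(1.5, 15))   # sign bit: -1.5
```

Note that no exception or warning accompanies the NaN result: exactly the silence this section describes.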
How ECC Works
SECDED: Single-Error Correct, Double-Error Detect
Data center GPUs use SECDED (Single-Error Correct, Double-Error Detect) codes. For every 64 bits of data, 8 additional parity bits are stored (72 bits total). The parity bits are computed using a Hamming code extended with an overall parity bit.
The encoding process: each parity bit is the XOR of a fixed subset of data bits, with the subsets determined by the code's parity-check (Hamming) matrix. On read, the checks are recomputed and compared against the stored parity bits; the resulting syndrome determines the outcome:
- Syndrome = 0: No error detected. Data is correct.
- Syndrome != 0, overall parity check fails (odd number of bit flips): single-bit error. The syndrome encodes the bit position; flip it to correct.
- Syndrome != 0, overall parity check passes (even number of bit flips): double-bit error detected. Cannot correct; report as an uncorrectable error (UE).
// ECC encoding example (simplified):
// Data: 64 bits = [d63 d62 d61 ... d1 d0]
// Parity bits: p0 = XOR of data bits whose codeword position has bit 0 set
//              p1 = XOR of data bits whose codeword position has bit 1 set
//              p2 = XOR of data bits whose codeword position has bit 2 set
//              ... (7 Hamming parity bits + 1 overall parity bit)
// On read:
// Compute syndrome from stored 72-bit codeword
// If syndrome == 0: no error, return data
// If syndrome != 0 and overall parity fails: single error, correct and return
// If syndrome != 0 and overall parity passes: double error, signal UE
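The full encode/decode loop can be exercised end-to-end with a toy (72,64) extended-Hamming codec. The Python sketch below uses the textbook bit layout (parity at power-of-two positions), not the layout of any real memory controller:

```python
PARITY_POS = (1, 2, 4, 8, 16, 32, 64)   # Hamming parity bit positions
N = 71                                   # 64 data bits + 7 Hamming parity bits

def encode(data):
    """data: list of 64 bits -> 72-bit codeword (71 Hamming + 1 overall parity)."""
    assert len(data) == 64
    code = [0] * (N + 1)                 # 1-indexed positions 1..71
    it = iter(data)
    for pos in range(1, N + 1):          # data fills the non-power-of-two slots
        if pos not in PARITY_POS:
            code[pos] = next(it)
    for p in PARITY_POS:                 # parity p covers positions with bit p set
        for pos in range(1, N + 1):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    overall = 0
    for pos in range(1, N + 1):          # extra bit for overall (even) parity
        overall ^= code[pos]
    return code[1:] + [overall]

def decode(word):
    """72-bit codeword -> (status, 64 data bits); status: 'ok'|'corrected'|'UE'."""
    code = [0] + list(word[:N])
    syndrome = 0
    for p in PARITY_POS:                 # failing checks spell out the error position
        s = 0
        for pos in range(1, N + 1):
            if pos & p:
                s ^= code[pos]
        if s:
            syndrome |= p
    total = word[N]
    for pos in range(1, N + 1):          # overall parity across all 72 bits
        total ^= code[pos]
    if syndrome == 0 and total == 0:
        status = "ok"
    elif total == 1:                     # odd number of flips: single-bit, correctable
        status = "corrected"
        if 1 <= syndrome <= N:
            code[syndrome] ^= 1
    else:                                # even flips, nonzero syndrome: double-bit
        status = "UE"
    data = [code[pos] for pos in range(1, N + 1) if pos not in PARITY_POS]
    return status, data
```

Flipping any single bit of the codeword yields "corrected" with the original data back; flipping any two yields "UE".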
Overhead of SECDED
For every 64 bits of data, 8 parity bits are stored, making each codeword 72 bits: the parity adds 8/64 = 12.5% on top of the data. This overhead applies to both storage capacity and bandwidth:
ECC Overhead on HBM
| Metric | Without ECC | With ECC | Overhead |
|---|---|---|---|
| H100 HBM capacity (raw) | 80 GB | 80 GB | N/A |
| H100 HBM usable capacity | 80 GB | 80 GB | Parity held in dedicated cells, not user space |
| H100 HBM bandwidth (raw) | 3,350 GB/s | 3,350 GB/s | N/A |
| H100 HBM effective bandwidth | 3,350 GB/s | ~2,978 GB/s | 8/72 of the bus carries parity |
| A100 HBM usable capacity | 80 GB | 80 GB | Same: parity stored out-of-band |
On A100 and H100, the HBM die includes dedicated storage for ECC parity bits that is separate from the user-addressable 80 GB. You do not lose 12.5% of user-visible capacity. However, the bandwidth overhead is real — every HBM read transfers both data and parity, consuming physical bus bandwidth.
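The effective-bandwidth figure above is just the 64/72 code rate applied to the raw number; a quick arithmetic check (the 3,350 GB/s input is the spec-sheet figure, the rest is computed):

```python
RAW_BW_GBS = 3350                  # H100 HBM3 raw bandwidth (spec sheet)
DATA_BITS, CODE_BITS = 64, 72      # SECDED codeword: 64 data + 8 parity bits

effective = RAW_BW_GBS * DATA_BITS / CODE_BITS
parity_share = RAW_BW_GBS - effective
print(round(effective))            # ~2978 GB/s of data
print(round(parity_share))         # ~372 GB/s of the bus moves parity
print(f"{8/64:.1%} overhead vs data, {8/72:.1%} of the raw bus")
```

Note the two ways of quoting the same overhead: 12.5% extra bits relative to the data, which is 11.1% of the raw bus.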
ECC in GPU Subsystems
HBM ECC
HBM3 (H100) implements on-die ECC within the HBM stack itself. Each HBM die has ECC logic integrated into the DRAM array:
// HBM3 ECC flow:
// 1. CPU/GPU writes 256-bit data burst to HBM controller
// 2. HBM controller computes ECC parity (SECDED per 64-bit word)
// 3. HBM die stores data + parity in adjacent cells
// 4. On read: HBM die checks ECC before returning data
// 5. If single-bit error: corrected transparently
// 6. If double-bit error: error flag set, GPU gets UE interrupt
// HBM3e adds additional reliability:
// - On-die ECC corrects within the HBM stack
// - Link-level CRC protects the bus between HBM and GPU
// - GPU-side ECC provides end-to-end protection
SRAM ECC (L2 Cache, Shared Memory)
The L2 cache and shared memory use SRAM-based ECC, with slightly different parameters:
// L2 cache ECC:
// Each 32-byte sector has 7 ECC check bits
// SECDED per sector: corrects 1-bit, detects 2-bit
// Overhead: 7 bits per 256 bits = 2.7%
// Lower overhead than HBM because SRAM bit cells are more stable
// Shared memory ECC:
// Similar SECDED protection per 32-byte sector
// Can be disabled on some GPUs for extra capacity (not recommended)
Register File ECC
The register file on data center GPUs (V100, A100, H100) is ECC-protected:
// Register ECC:
// Each 32-bit register has ECC protection
// Parity bits are stored alongside the register data
// Correction happens in the same cycle as register read
// No additional latency when no error is present
// If an uncorrectable error is detected in a register:
// The warp is killed (hardware exception)
// The application receives a CUDA error
// Other warps/kernels on the same GPU are NOT affected (on Hopper)
ECC Protection Across GPU Memory Subsystems
| Subsystem | ECC Type | Correction Capability | Detection Capability | Latency Overhead |
|---|---|---|---|---|
| HBM3/HBM3e | On-die SECDED | 1-bit per 64-bit word | 2-bit per 64-bit word | ~0 (integrated in HBM timing) |
| L2 Cache | SRAM SECDED | 1-bit per 32-byte sector | 2-bit per sector | ~0 (in read pipeline) |
| L1 Cache | Parity | None (detects only) | 1-bit | ~0 |
| Shared Memory | SRAM SECDED | 1-bit per sector | 2-bit per sector | ~0 |
| Register File | SECDED | 1-bit per 32-bit register | 2-bit per register | ~0 |
| NVLink | CRC + retry | Full packet | Full packet | ~100ns per retry |
| PCIe | LCRC + retry | Full TLP | Full TLP | ~microseconds per retry |
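The last two rows protect links rather than cells: instead of correcting in place, the receiver detects corruption with a CRC and requests retransmission. A toy Python sketch of detect-and-retry, with `zlib.crc32` standing in for the link-layer CRC (NVLink and PCIe use their own polynomials and hardware replay buffers):

```python
import zlib

def send_with_crc(payload: bytes):
    """Attach a link-layer checksum to the packet (CRC32 as a stand-in)."""
    return payload, zlib.crc32(payload)

def flaky_link(packet, corrupt: bool):
    """Simulated link that flips one bit in flight when corrupt=True."""
    payload, crc = packet
    if corrupt:
        buf = bytearray(payload)
        buf[0] ^= 0x01              # single-bit transmission error
        payload = bytes(buf)
    return payload, crc

def receive(packet, retries=5):
    """CRC check with retry: corruption is detected and retransmitted,
    never silently accepted (unlike an unprotected link)."""
    for attempt in range(retries):
        payload, crc = flaky_link(packet, corrupt=(attempt == 0))
        if zlib.crc32(payload) == crc:
            return payload, attempt
    raise IOError("link error persisted past retry limit")

payload, attempts = receive(send_with_crc(b"nvlink packet"))
print(payload, attempts)            # original payload recovered on the retry
```

The design trade-off versus SECDED: a CRC detects far more error patterns per check bit, but recovery costs a round trip, which is acceptable on a link with a replay buffer and unacceptable inside a DRAM read pipeline.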
Error Reporting and Monitoring
nvidia-smi Error Counters
# View ECC error counts
nvidia-smi -q -d ECC
# Output (example):
# ECC Errors
# Volatile
# SRAM Correctable : 12
# SRAM Uncorrectable : 0
# DRAM Correctable : 847
# DRAM Uncorrectable : 0
# Aggregate
# SRAM Correctable : 156
# SRAM Uncorrectable : 0
# DRAM Correctable : 23481
# DRAM Uncorrectable : 2
- Volatile counters: Reset on driver reload or GPU reset.
- Aggregate counters: Persist across reboots (stored in GPU infoROM).
- Correctable: Single-bit errors that ECC silently fixed. Normal operation — these happen regularly.
- Uncorrectable: Double-bit errors that ECC detected but could not fix. These cause application errors.
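For fleet alerting, the `-q -d ECC` text output can be parsed with a few lines of Python. A hedged sketch against the example output above (real output varies across driver versions, so treat the exact field names as assumptions):

```python
import re

SAMPLE = """\
    ECC Errors
        Volatile
            SRAM Correctable   : 12
            SRAM Uncorrectable : 0
            DRAM Correctable   : 847
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable   : 156
            SRAM Uncorrectable : 0
            DRAM Correctable   : 23481
            DRAM Uncorrectable : 2
"""

def parse_ecc(text):
    """Return {scope: {counter: count}} from nvidia-smi -q -d ECC style text."""
    counts, scope = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line in ("Volatile", "Aggregate"):
            scope = line
            counts[scope] = {}
        else:
            m = re.match(r"(\w+ (?:Un)?[Cc]orrectable)\s*:\s*(\d+)", line)
            if m and scope:
                counts[scope][m.group(1)] = int(m.group(2))
    return counts

ecc = parse_ecc(SAMPLE)
# Any uncorrectable error is worth an alert; correctable rates need a threshold.
alerts = [k for k, v in ecc["Aggregate"].items()
          if "Uncorrectable" in k and v > 0]
print(alerts)    # ['DRAM Uncorrectable']
```

In production you would feed this from `nvidia-smi -q -d ECC` per GPU and track the correctable-error rate over time rather than the absolute count.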
# Monitor ECC errors over time (useful for fleet health monitoring)
nvidia-smi dmon -d 1 -s e
# Reports ECC errors per second
# Query specific GPU for retirement status
nvidia-smi -q -d RETIRED_PAGES -i 0
# Shows pages retired due to uncorrectable ECC errors
# Check if GPU needs replacement
nvidia-smi -q -d RETIRED_PAGES -i 0 | grep "Pending"
# "Pending Blacklist": Yes → GPU has pending page retirements, needs reboot
Page Retirement
When an HBM address experiences repeated uncorrectable errors, the GPU driver retires that page — marking it as unusable. The retired page is not allocated to future memory requests:
# View retired pages
nvidia-smi -q -d RETIRED_PAGES
# Retired pages due to multiple single bit ECC errors : 3
# Retired pages due to double bit ECC errors : 1
# Retired pages pending : 0
# Impact: each retired page reduces usable HBM by 4 KB (page size)
# With 80 GB HBM, losing 100 pages = 400 KB = negligible
# If retired pages exceed a threshold (~64 pages), NVIDIA recommends GPU replacement
DCGM (Data Center GPU Manager) for Fleet Monitoring
# Enable DCGM health monitoring
dcgmi health -s meit -g 1
# m = memory, e = ECC, i = inforom, t = thermal
# Run diagnostics
dcgmi diag -r 3 -g 1
# Level 3: comprehensive GPU health check including ECC memory test
# Query ECC metrics via DCGM fields
dcgmi dmon -e 312,313,314,315
# 312: DRAM correctable errors
# 313: DRAM uncorrectable errors
# 314: SRAM correctable errors
# 315: SRAM uncorrectable errors
The Cost of ECC
Bandwidth Overhead
The bandwidth overhead of HBM ECC (8 parity bits carried per 64 data bits, about 11% of the raw bus) is significant for bandwidth-limited workloads:
// H100 HBM3 bandwidth: 3,350 GB/s (raw)
// Of this, 8/72 ≈ 11.1% carries ECC parity: ~372 GB/s of the bus moves parity
// Effective data bandwidth: 3,350 * (64/72) = 2,978 GB/s
// For a memory-bound kernel achieving 90% of peak:
// With ECC: 2,978 * 0.9 = 2,680 GB/s effective
// Without ECC (hypothetical): 3,350 * 0.9 = 3,015 GB/s effective
// Difference: ~11% throughput loss on pure bandwidth tests
// (12.5% extra bits relative to the data = 11.1% of the raw bus)
ECC Bandwidth Impact on Memory-Bound Kernels (H100)
[Chart: effective data throughput, GB/s, with and without ECC]
The ECC bandwidth penalty only matters for purely memory-bandwidth-limited kernels. Compute-bound kernels (GEMM, tensor core operations) are not affected because they do not saturate HBM bandwidth. In LLM inference, the bandwidth-limited phases (attention, KV cache access) see the full penalty, while the compute-limited phases (FFN GEMM) are unaffected.
Power Overhead
ECC logic consumes additional power:
// ECC power overhead (estimated):
// HBM on-die ECC: ~2-3 W per HBM stack (included in HBM power budget)
// H100 has 6 HBM3 stacks: ~12-18 W for HBM ECC
// SRAM ECC (L2, shared mem, registers): ~5-10 W
// Total ECC power: ~17-28 W out of 700 W TDP = 2.5-4%
// On consumer GPUs (RTX 4090, 450W TDP):
// Memory is GDDR6X (no HBM); no user-visible memory ECC
// SRAM ECC still present on L2 and registers
// Power savings: ~12-18 W (HBM ECC not present)
Latency Overhead
ECC computation is pipelined into the memory read path. The latency overhead is effectively zero for normal operation:
// HBM read latency: ~500 cycles
// ECC decode: ~2-3 cycles (pipelined, overlaps with data transfer)
// Net additional latency: ~0 cycles (hidden by pipeline)
// Exception: when a correctable error is found
// Correction adds ~1-2 cycles to flip the erroneous bit
// Still negligible compared to 500-cycle base latency
// Exception: when an uncorrectable error is found
// Triggers an interrupt (~1000+ cycles)
// GPU halts the affected warp
// Error is logged, page may be retired
ECC vs No-ECC: The Datacenter Decision
When ECC Is Non-Negotiable
// Training at scale (1,000+ GPUs, weeks of runtime):
//   Expected correctable errors: thousands per month across the cluster
//     (~1 HBM error per GPU per month, per the table above)
//   Expected uncorrectable errors: ~10-50 per month
//   Without ECC: every one of those errors is a potential silent data corruption
//   With ECC: correctable errors are fixed silently; UEs are caught and the run restarted
// Medical/financial inference:
// A single bit flip could change a diagnosis or financial prediction
// ECC provides error detection even when correction fails (UE)
When ECC Matters Less
// Small-scale inference (1-8 GPUs, short sessions):
// Expected errors: <1 per month
// Probability of SDC affecting a single inference: ~10^-10
// Consumer GPUs (no HBM ECC) are acceptable
// Evaluation and testing:
// Results are compared against ground truth
// A corrupted result is caught by the evaluation pipeline
// ECC is nice-to-have but not critical
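The contrast between the two regimes follows from a simple Poisson model. A sketch using the ~1 HBM error per GPU-month rate from the table earlier (that rate is the only input; the rest is arithmetic):

```python
import math

ERRORS_PER_GPU_HOUR = 1 / (30 * 24)     # ~1 HBM error per GPU per month

def p_at_least_one_error(gpus: int, hours: float) -> float:
    """Poisson probability that a run sees at least one memory error."""
    expected = ERRORS_PER_GPU_HOUR * gpus * hours
    return 1 - math.exp(-expected)

print(f"{p_at_least_one_error(8, 1):.4f}")         # single node, 1-hour session
print(f"{p_at_least_one_error(10_000, 720):.6f}")  # 10K GPUs, 30 days: near-certain
```

The small-scale probability is on the order of 1%, per session rather than per inference; the large-scale one is indistinguishable from 1, which is why ECC is non-negotiable for cluster training.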
The Real-World SDC Problem
Even with ECC, silent data corruption can occur from sources ECC does not protect:
// Sources of SDC that ECC CANNOT catch:
// 1. Logic errors in the GPU's compute pipeline (rare but documented)
// - A faulty multiplier that produces wrong results
// - No ECC on intermediate computation results
// 2. Software bugs in CUDA runtime or driver
// 3. Firmware bugs in the GPU microcontroller
// 4. Multi-bit errors exceeding SECDED capability
// - Cosmic ray shower hitting adjacent bits
// - ~1 in 10,000 error events produce multi-bit errors
// Google's experience (published 2022):
// ~1 in 1000 machines exhibited "mercurial cores" —
// compute units that produce wrong results rarely but repeatedly
// These pass standard diagnostics but fail on specific operations
ECC protects against memory bit flips but not against compute errors (faulty ALUs, incorrect tensor core results). Large-scale training increasingly uses application-level checksumming — computing checksums of activations or gradients and comparing across data-parallel replicas. If one replica’s checksum diverges, its state is discarded and restored from a healthy replica.
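A minimal version of that cross-replica comparison, sketched in Python with byte buffers standing in for gradient tensors (the rank names and the SHA-256 choice are illustrative; a real system would hash device memory and exchange digests over the collective-communication fabric):

```python
import hashlib
from collections import Counter

def checksum(grad_bytes: bytes) -> str:
    """Digest of one replica's gradient buffer."""
    return hashlib.sha256(grad_bytes).hexdigest()

def find_divergent(replica_checksums):
    """Majority vote: replicas whose digest differs from the most common one."""
    majority, _ = Counter(replica_checksums.values()).most_common(1)[0]
    return [rank for rank, c in replica_checksums.items() if c != majority]

grads = bytes(1024)                 # identical gradients on healthy replicas
corrupted = bytearray(grads)
corrupted[100] ^= 0x40              # one flipped bit on rank2 (simulated SDC)

sums = {"rank0": checksum(grads),
        "rank1": checksum(grads),
        "rank2": checksum(bytes(corrupted))}
print(find_divergent(sums))         # ['rank2'] -> restore from a healthy peer
```

Majority voting requires at least three replicas to identify which copy is wrong; with only two, a mismatch tells you corruption occurred but not where.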
Chipkill and Beyond: Future Error Correction
Chipkill for HBM
Chipkill extends SECDED to tolerate the failure of an entire DRAM chip within an HBM stack. HBM3e with chipkill can lose one of its 16 channels and continue operating with corrected data:
// Standard SECDED: corrects 1-bit per 64-bit word
// Chipkill: corrects all bits in a single DRAM chip (one channel)
// HBM3e: 16 channels per stack, 32 bits per channel
// Chipkill can correct a full 32-bit channel failure
// Overhead: higher than SECDED (~25% parity bits vs 12.5%)
// But protects against entire chip failures, not just single bits
Adaptive ECC
Future GPUs may implement adaptive ECC that adjusts protection level based on error rates:
// Concept: adaptive ECC
// Normal operation: SECDED (12.5% overhead)
// When error rate rises (aging, temperature): upgrade to DEC-TED
// (Double-Error Correct, Triple-Error Detect)
// Overhead increases to ~25%, but stronger protection
// When a region shows persistent errors: retire and remap
Practical Recommendations
For training clusters:
- Run with ECC on data center GPUs; on A100/H100 it is enabled by default and cannot be disabled.
- Monitor ECC counters via DCGM. Set alerts for correctable error rates above 100 per day per GPU (indicates potential hardware degradation).
- Replace GPUs with more than 64 retired pages or any uncorrectable SRAM errors.
- Implement application-level checksumming for training runs exceeding 1,000 GPU-hours.
- Checkpoint frequently — every 10-30 minutes — to limit the blast radius of an undetected corruption event.
- Run GPU diagnostics (dcgmi diag -r 3) before starting multi-week training runs.
For inference deployments:
- Use ECC GPUs (A100, H100) for production inference serving.
- Monitor correctable error rates as a proxy for hardware health.
- Implement output validation for safety-critical applications (medical, financial, autonomous).
- Consider N+1 redundancy — keep spare GPUs for hot-swap when ECC errors trigger replacement.