In a large GPU cluster serving LLM inference, hardware failures are not exceptional events; they are statistical certainties. At scale, you will see ECC memory errors, NVLink degradation, thermal throttling, and outright GPU failures every week. The question is not whether failures will occur but how quickly you detect them and how gracefully you handle them. NVIDIA Dynamo integrates with the Data Center GPU Manager (DCGM) to monitor GPU health in real time, detect degradation before it causes serving errors, and automatically evict unhealthy GPUs from the serving pool. This post covers the monitoring architecture, the specific failure modes that matter for LLM inference, and the implementation of predictive maintenance.
GPU Failure Modes in Inference Clusters
GPU Failure Modes and Frequency (1000-GPU H100 Cluster, Annualized)
| Failure Mode | Frequency | Impact | Detection Method | Recovery Time |
|---|---|---|---|---|
| ECC correctable errors (threshold) | Weekly | Silent data corruption risk | DCGM counter threshold | GPU reset (seconds) |
| ECC uncorrectable error | Monthly | Process crash, incorrect output | DCGM event, Xid 48 | GPU reboot (minutes) |
| NVLink CRC errors | Weekly | Degraded TP throughput | DCGM NVLink counters | Link retrain or GPU replace |
| Thermal throttling | Daily (varies) | Reduced throughput | Clock frequency monitoring | Cooling fix or workload reduction |
| GPU hang (Xid 31/79) | Monthly | Complete GPU unresponsive | Xid error in dmesg | GPU reset or node reboot |
| PCIe errors | Rare | I/O failures, OOM | PCIe AER counters | Reseat or replace |
| Complete GPU failure | Quarterly per 1000 GPUs | Total loss of capacity | No heartbeat | Hardware replacement |
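The table maps cleanly onto a small policy lookup that a scheduler can consult when a condition is detected. A sketch (mode names and action strings simply mirror the table; they are illustrative, not a Dynamo API):

```python
# Sketch: encode the failure-mode table as a policy map. Names are
# illustrative and mirror the table above; this is not a Dynamo API.
FAILURE_POLICIES = {
    "ecc_correctable_threshold": {"detect": "dcgm_counter",    "action": "gpu_reset"},
    "ecc_uncorrectable":         {"detect": "xid_48",          "action": "gpu_reboot"},
    "nvlink_crc":                {"detect": "nvlink_counters", "action": "link_retrain"},
    "thermal_throttle":          {"detect": "clock_monitor",   "action": "reduce_load"},
    "gpu_hang":                  {"detect": "xid_31_79",       "action": "gpu_reset"},
    "pcie_errors":               {"detect": "pcie_aer",        "action": "reseat_or_replace"},
    "gpu_dead":                  {"detect": "no_heartbeat",    "action": "replace_hardware"},
}

def recovery_action(mode: str) -> str:
    """Planned recovery action for a failure mode; unknown modes fall
    back to the most conservative response (replacement)."""
    return FAILURE_POLICIES.get(mode, {"action": "replace_hardware"})["action"]
```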
DCGM Integration Architecture
DCGM is NVIDIA's GPU management daemon that collects health metrics, runs diagnostics, and exposes data via API. Dynamo queries DCGM to make scheduling decisions.
┌─────────────────────────────────────────────┐
│              Dynamo Scheduler               │
│  ┌───────────────────────────────────────┐  │
│  │          Health Manager               │  │
│  │  - GPU health scores                  │  │
│  │  - Eviction decisions                 │  │
│  │  - Predictive alerts                  │  │
│  └───────────────┬───────────────────────┘  │
│                  │                          │
│  ┌───────────────┴───────────────────────┐  │
│  │        DCGM Client Library            │  │
│  │  - Field value queries                │  │
│  │  - Health watch registration          │  │
│  │  - Diagnostic triggers                │  │
│  └───────────────┬───────────────────────┘  │
└──────────────────┼──────────────────────────┘
                   │ gRPC / shared memory
┌──────────────────┼──────────────────────────┐
│       DCGM Daemon (nv-hostengine)           │
│  - Per-GPU metric collection (1-second)     │
│  - ECC error tracking                       │
│  - NVLink error tracking                    │
│  - Thermal monitoring                       │
│  - Power monitoring                         │
│  - Health watch system                      │
└──────────────────┬──────────────────────────┘
                   │ NVIDIA driver
┌──────────────────┼──────────────────────────┐
│          GPU Hardware (H100 x 8)            │
│    GPU 0   GPU 1   GPU 2   GPU 3            │
│    GPU 4   GPU 5   GPU 6   GPU 7            │
└─────────────────────────────────────────────┘
DCGM Field Collection
import pydcgm
import dcgm_fields
import dcgm_structs
class DCGMHealthCollector:
"""Collect GPU health metrics via DCGM."""
HEALTH_FIELDS = [
dcgm_fields.DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, # Single-bit ECC (correctable)
dcgm_fields.DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, # Double-bit ECC (uncorrectable)
dcgm_fields.DCGM_FI_DEV_GPU_TEMP, # GPU temperature
dcgm_fields.DCGM_FI_DEV_POWER_USAGE, # Power draw (watts)
dcgm_fields.DCGM_FI_DEV_SM_CLOCK, # SM clock frequency
dcgm_fields.DCGM_FI_DEV_MEM_CLOCK, # Memory clock
dcgm_fields.DCGM_FI_DEV_PCIE_REPLAY_COUNTER, # PCIe replay errors
dcgm_fields.DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, # NVLink CRC
dcgm_fields.DCGM_FI_DEV_XID_ERRORS, # Xid error codes
dcgm_fields.DCGM_FI_DEV_RETIRED_SBE, # Retired pages (single-bit)
dcgm_fields.DCGM_FI_DEV_RETIRED_DBE, # Retired pages (double-bit)
dcgm_fields.DCGM_FI_DEV_RETIRED_PENDING, # Pages pending retirement
]
def __init__(self):
self.handle = pydcgm.DcgmHandle()
self.system = self.handle.GetSystem()
self.group = self.system.GetDefaultGroup()
# Create field group for health monitoring
self.field_group = pydcgm.DcgmFieldGroup(
self.handle, "health_fields", self.HEALTH_FIELDS
)
# Enable watches at 1-second intervals
self.group.samples.WatchFields(
self.field_group, updateFreq=1000000, # microseconds
maxKeepAge=3600, maxKeepSamples=3600
)
def get_gpu_health(self, gpu_id):
"""Get current health metrics for a specific GPU."""
values = self.group.samples.GetLatest(self.field_group)
gpu_values = values.entityValues.get(gpu_id, {})
return {
'ecc_correctable': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, 0),
'ecc_uncorrectable': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, 0),
'temperature_c': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_GPU_TEMP, 0),
'power_watts': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_POWER_USAGE, 0),
'sm_clock_mhz': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_SM_CLOCK, 0),
'nvlink_crc_errors': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, 0),
'retired_pages_sbe': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_RETIRED_SBE, 0),
'retired_pages_dbe': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_RETIRED_DBE, 0),
'retired_pages_pending': gpu_values.get(
dcgm_fields.DCGM_FI_DEV_RETIRED_PENDING, 0),
}
Health Scoring System
Dynamo computes a composite health score for each GPU, ranging from 0.0 (dead) to 1.0 (perfect health). This score drives scheduling and eviction decisions.
class GPUHealthScorer:
"""Compute composite GPU health score from DCGM metrics."""
# Thresholds for health degradation
THRESHOLDS = {
'ecc_correctable_rate': { # Errors per hour
'warning': 10,
'critical': 100,
'evict': 1000
},
'ecc_uncorrectable': { # Total count
'warning': 1,
'critical': 1, # Any uncorrectable is critical
'evict': 3
},
'temperature_c': {
'warning': 80,
'critical': 85,
'evict': 90
},
'nvlink_crc_rate': { # Errors per hour
'warning': 100,
'critical': 10000,
'evict': 100000
},
'retired_pages_total': {
'warning': 10,
'critical': 40,
'evict': 60 # H100 max is 63
},
'clock_throttle_pct': { # % below boost clock
'warning': 5,
'critical': 15,
'evict': 30
}
}
def compute_score(self, metrics, history):
"""Compute 0.0-1.0 health score."""
score = 1.0
issues = []
# ECC correctable error rate
ecc_rate = self._compute_rate(
history, 'ecc_correctable', window_hours=1
)
if ecc_rate > self.THRESHOLDS['ecc_correctable_rate']['evict']:
score = min(score, 0.0)
issues.append(f"ECC correctable rate: {ecc_rate}/hr (EVICT)")
elif ecc_rate > self.THRESHOLDS['ecc_correctable_rate']['critical']:
score = min(score, 0.3)
issues.append(f"ECC correctable rate: {ecc_rate}/hr (CRITICAL)")
elif ecc_rate > self.THRESHOLDS['ecc_correctable_rate']['warning']:
score = min(score, 0.7)
issues.append(f"ECC correctable rate: {ecc_rate}/hr (WARNING)")
# ECC uncorrectable (any is bad)
if metrics['ecc_uncorrectable'] > 0:
score = min(score, 0.1)
issues.append(f"ECC uncorrectable: {metrics['ecc_uncorrectable']} (CRITICAL)")
# Temperature
temp = metrics['temperature_c']
if temp > self.THRESHOLDS['temperature_c']['evict']:
score = min(score, 0.0)
issues.append(f"Temperature: {temp}C (EVICT)")
elif temp > self.THRESHOLDS['temperature_c']['critical']:
score = min(score, 0.4)
issues.append(f"Temperature: {temp}C (CRITICAL)")
# Retired pages
retired = (metrics['retired_pages_sbe'] +
metrics['retired_pages_dbe'])
if retired > self.THRESHOLDS['retired_pages_total']['evict']:
score = min(score, 0.0)
issues.append(f"Retired pages: {retired} (EVICT)")
elif retired > self.THRESHOLDS['retired_pages_total']['critical']:
score = min(score, 0.3)
issues.append(f"Retired pages: {retired} (CRITICAL)")
# Clock throttling
boost_clock = 1830 # H100 boost clock MHz
throttle_pct = max(0, (boost_clock - metrics['sm_clock_mhz']) /
boost_clock * 100)
if throttle_pct > self.THRESHOLDS['clock_throttle_pct']['evict']:
score = min(score, 0.2)
issues.append(f"Clock throttle: {throttle_pct:.1f}% (EVICT)")
return score, issues
    def _compute_rate(self, history, metric_name, window_hours=1):
        """Compute per-hour error rate from cumulative-counter history.

        Assumes one sample per second, matching the 1-second DCGM watch.
        """
        window_samples = window_hours * 3600
        if len(history) < 2:
            return 0.0
        recent = [h[metric_name] for h in history[-window_samples:]]
        if len(recent) < 2:
            return 0.0
        # Counters are cumulative, so the rate is the delta over the window
        return max(0, recent[-1] - recent[0]) / window_hours
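The rate computation deserves emphasis: DCGM exposes ECC counters as cumulative totals, so a per-hour rate is the delta across the window, normalized by the window span actually covered. A standalone sketch of that logic (samples are hypothetical `(timestamp_hours, cumulative_count)` pairs):

```python
def errors_per_hour(samples, window_hours=1.0):
    """Per-hour error rate from cumulative-counter samples.

    samples: list of (timestamp_hours, cumulative_count), oldest first.
    Counters only grow, so the rate is the delta over the window,
    normalized by the time span the window actually covers.
    """
    if len(samples) < 2:
        return 0.0
    cutoff = samples[-1][0] - window_hours
    window = [s for s in samples if s[0] >= cutoff]
    if len(window) < 2:
        return 0.0
    span = window[-1][0] - window[0][0]
    if span <= 0:
        return 0.0
    return max(0.0, window[-1][1] - window[0][1]) / span

# 120 errors accumulated over the last hour -> 120/hr, which is past
# the critical threshold of 100/hr from the table above
rate = errors_per_hour([(0.0, 40), (0.5, 90), (1.0, 160)])
```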
(Figure: distribution of GPU health scores, as % of GPUs, over a typical week on a 1000-GPU H100 cluster.)
Xid Error Handling
Xid errors are NVIDIA GPU error codes reported through the kernel driver. Different Xid codes indicate different failure modes.
Critical Xid Errors for LLM Inference
| Xid Code | Description | Severity | Recovery Action | Dynamo Response |
|---|---|---|---|---|
| Xid 31 | GPU memory page fault | Critical | Reset GPU | Evict GPU, drain requests |
| Xid 43 | GPU stopped processing | Critical | Reset GPU | Evict GPU, failover |
| Xid 45 | Preemptive GPU reset | High | Automatic recovery | Mark degraded, monitor |
| Xid 48 | Double-bit ECC error | Critical | Reset GPU, retire page | Evict GPU if repeat |
| Xid 63 | ECC page retirement | Medium | Page retired, continue | Decrement health score |
| Xid 64 | ECC page retirement limit | Critical | GPU needs replacement | Evict GPU permanently |
| Xid 79 | GPU fell off the bus | Fatal | Node reboot required | Remove node from pool |
| Xid 94 | Contained ECC error | Low | Automatic recovery | Log and monitor rate |
import subprocess
import re
from datetime import datetime, timedelta
class XidMonitor:
"""Monitor kernel logs for Xid errors."""
CRITICAL_XID = {31, 43, 48, 64, 79}
WARNING_XID = {45, 63, 94}
def parse_dmesg_xid_errors(self):
"""Parse dmesg for NVIDIA Xid errors."""
result = subprocess.run(
['dmesg', '--time-format=iso'],
capture_output=True, text=True
)
xid_pattern = re.compile(
r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})'
r'.*NVRM: Xid.*: (\d+),'
r'.*GPU (\d+)'
)
errors = []
for line in result.stdout.splitlines():
match = xid_pattern.search(line)
if match:
timestamp = datetime.fromisoformat(match.group(1))
xid_code = int(match.group(2))
gpu_id = int(match.group(3))
errors.append({
'timestamp': timestamp,
'xid': xid_code,
'gpu_id': gpu_id,
'critical': xid_code in self.CRITICAL_XID,
'raw_line': line.strip()
})
return errors
def should_evict(self, gpu_id, errors, window_minutes=60):
"""Determine if a GPU should be evicted based on Xid history."""
cutoff = datetime.now() - timedelta(minutes=window_minutes)
recent = [e for e in errors
if e['gpu_id'] == gpu_id and e['timestamp'] > cutoff]
# Any critical Xid -> immediate eviction
critical = [e for e in recent if e['critical']]
if critical:
return True, f"Critical Xid {critical[0]['xid']}"
# High rate of warning Xids -> eviction
warning_count = sum(1 for e in recent if e['xid'] in self.WARNING_XID)
if warning_count > 10:
return True, f"{warning_count} warning Xids in {window_minutes}min"
return False, "Healthy"
A double-bit ECC error (Xid 48) can corrupt GPU memory without crashing the process. The corrupted memory may contain model weights, KV cache, or intermediate activations. The result is a model that generates plausible but subtly wrong output: the most dangerous failure mode in production. Dynamo must immediately evict GPUs with uncorrectable ECC errors and revalidate any outputs generated on that GPU since the last known-good checkpoint.
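One practical mitigation for this silent-corruption mode is to checksum model weights at load time and re-verify before the GPU rejoins the pool after any ECC event. A minimal host-side sketch (assumes the weight buffer can be copied back to host memory; function names are illustrative):

```python
import hashlib

def tensor_checksum(buf: bytes) -> str:
    """Checksum of a weight buffer, recorded at load time (known-good)."""
    return hashlib.sha256(buf).hexdigest()

def weights_intact(buf: bytes, known_good: str) -> bool:
    """Re-hash resident weights after an ECC event and compare."""
    return tensor_checksum(buf) == known_good

# Simulate a single flipped bit in a "weight" buffer: even a one-bit
# change is caught, which is exactly the failure an uncorrectable ECC
# error can silently introduce.
weights = bytes(1024)
good = tensor_checksum(weights)
corrupted = bytes([weights[0] ^ 0x01]) + weights[1:]
```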
Predictive Maintenance
The goal of predictive maintenance is to detect GPU degradation before it causes serving errors.
ECC Error Rate Prediction
Correctable ECC errors are the best predictor of future uncorrectable errors. A GPU with an accelerating correctable error rate is likely to produce an uncorrectable error within days.
import numpy as np
class ECCPredictor:
"""Predict GPU failure from ECC error rate trends."""
def __init__(self, history_hours=168): # 1 week
self.history_hours = history_hours
def predict_failure(self, ecc_history):
"""Predict time to uncorrectable error based on correctable error trend.
Args:
ecc_history: list of (timestamp_hours, cumulative_ecc_count)
Returns:
predicted_hours_to_failure, confidence
"""
if len(ecc_history) < 24: # Need at least 24 hours of data
return float('inf'), 0.0
times = np.array([h[0] for h in ecc_history])
counts = np.array([h[1] for h in ecc_history])
# Fit exponential growth model: count = a * exp(b * t)
# Use log-linear regression: log(count) = log(a) + b * t
log_counts = np.log(counts + 1) # +1 to handle zeros
coeffs = np.polyfit(times, log_counts, 1)
growth_rate = coeffs[0] # b in exp(b*t)
if growth_rate <= 0:
return float('inf'), 0.0 # Stable or decreasing
        # Predict when the cumulative count reaches a critical threshold
        # Empirical rule of thumb: uncorrectable errors become likely once
        # the correctable count climbs ~1000 beyond today's total
        current_count = counts[-1]
        if current_count <= 0:
            return float('inf'), 0.0
        critical_count = current_count + 1000
        # Solve critical = current * exp(b * dt) for dt
        hours_to_critical = (np.log(critical_count) - np.log(current_count)) / growth_rate
        # Confidence from R^2 of the fit, computed in log space to match
        # the space the regression was performed in
        predicted_log = np.polyval(coeffs, times)
        ss_res = np.sum((log_counts - predicted_log) ** 2)
        ss_tot = np.sum((log_counts - np.mean(log_counts)) ** 2)
        r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
        confidence = max(0.0, r_squared)
        return max(0.0, hours_to_critical), confidence
def recommend_action(self, hours_to_failure, confidence):
"""Recommend maintenance action."""
if confidence < 0.5:
return "MONITOR", "Low confidence prediction, continue monitoring"
if hours_to_failure < 4:
return "EVICT_NOW", "Predicted failure within 4 hours"
elif hours_to_failure < 24:
return "SCHEDULE_MAINTENANCE", "Predicted failure within 24 hours"
elif hours_to_failure < 168:
return "ORDER_REPLACEMENT", "Predicted failure within 1 week"
else:
return "MONITOR", "No imminent failure predicted"
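To see the log-linear fit in action, here is a dependency-free sketch on synthetic data: a counter growing ~10% per hour yields a clearly positive growth rate, while a flat counter yields none. This is ordinary least squares on `(t, log(count + 1))`, the same model the predictor above fits; the data and function name are illustrative:

```python
import math

def log_linear_growth_rate(samples):
    """Slope b of log(count + 1) = log(a) + b * t, via least squares.

    samples: list of (timestamp_hours, cumulative_count).
    """
    ts = [t for t, _ in samples]
    ys = [math.log(c + 1) for _, c in samples]
    n = len(samples)
    mt, my = sum(ts) / n, sum(ys) / n
    denom = sum((t - mt) ** 2 for t in ts)
    if denom == 0:
        return 0.0
    return sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / denom

growing = [(t, 10 * 1.1 ** t) for t in range(48)]  # ~10%/hour growth
flat = [(t, 10) for t in range(48)]                # stable counter
b = log_linear_growth_rate(growing)  # close to ln(1.1)
```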
(Figure: predictive-maintenance accuracy, in %, validated on six months of production data.)
Automated GPU Eviction
When a GPU's health score drops below the eviction threshold, Dynamo must gracefully remove it from the serving pool.
import json
import os
import subprocess
import time
from datetime import datetime

class GPUEvictionManager:
"""Manage graceful GPU eviction from serving pool."""
def __init__(self, scheduler, health_scorer):
self.scheduler = scheduler
self.health_scorer = health_scorer
self.evicted_gpus = set()
def evict_gpu(self, gpu_id, reason):
"""Gracefully evict a GPU from the serving pool."""
print(f"EVICTING GPU {gpu_id}: {reason}")
# Step 1: Stop routing new requests to this GPU
self.scheduler.mark_gpu_draining(gpu_id)
# Step 2: Wait for in-flight requests to complete (with timeout)
timeout_seconds = 30
start = time.time()
while self.scheduler.get_inflight_count(gpu_id) > 0:
if time.time() - start > timeout_seconds:
print(f"GPU {gpu_id}: Force-draining {self.scheduler.get_inflight_count(gpu_id)} requests")
self.scheduler.force_drain(gpu_id)
break
time.sleep(0.5)
# Step 3: Remove from serving pool
self.scheduler.remove_gpu(gpu_id)
self.evicted_gpus.add(gpu_id)
# Step 4: Attempt GPU reset (may recover the GPU)
success = self._attempt_gpu_reset(gpu_id)
if success:
# Run quick diagnostic before re-adding
if self._run_diagnostic(gpu_id):
print(f"GPU {gpu_id}: Reset successful, re-adding to pool")
self.scheduler.add_gpu(gpu_id)
self.evicted_gpus.discard(gpu_id)
return "RECOVERED"
# Step 5: If reset fails, mark for hardware replacement
self._create_replacement_ticket(gpu_id, reason)
return "NEEDS_REPLACEMENT"
def _attempt_gpu_reset(self, gpu_id):
"""Attempt to reset GPU via nvidia-smi."""
try:
result = subprocess.run(
['nvidia-smi', '--gpu-reset', '-i', str(gpu_id)],
capture_output=True, text=True, timeout=60
)
return result.returncode == 0
except subprocess.TimeoutExpired:
return False
def _run_diagnostic(self, gpu_id):
"""Run DCGM diagnostic on GPU."""
# Level 2 diagnostic: memory test + PCIe bandwidth
try:
result = subprocess.run(
['dcgmi', 'diag', '-r', '2', '-i', str(gpu_id)],
capture_output=True, text=True, timeout=300
)
return 'PASS' in result.stdout
except subprocess.TimeoutExpired:
return False
def _create_replacement_ticket(self, gpu_id, reason):
"""Create ticket for hardware replacement."""
ticket = {
'gpu_id': gpu_id,
'hostname': os.uname().nodename,
'reason': reason,
'timestamp': datetime.now().isoformat(),
'priority': 'high'
}
# Send to ticketing system (ServiceNow, Jira, etc.)
print(f"Created replacement ticket: {json.dumps(ticket)}")
NVLink Health Monitoring
For tensor-parallel inference across multiple GPUs, NVLink health directly affects throughput. A degraded NVLink forces communication over PCIe, which is 5-10x slower.
class NVLinkHealthChecker:
"""Monitor NVLink health for tensor-parallel inference."""
def check_nvlink_topology(self):
"""Verify NVLink connectivity matches expected topology."""
result = subprocess.run(
['nvidia-smi', 'nvlink', '--status', '-i', '0'],
capture_output=True, text=True
)
# Parse NVLink status for each GPU pair
# Expected: all links active with full bandwidth
return self._parse_nvlink_status(result.stdout)
def measure_nvlink_bandwidth(self, gpu_a, gpu_b):
"""Measure actual NVLink bandwidth between GPU pair."""
# Use DCGM bandwidth test
result = subprocess.run(
['dcgmi', 'nvlink', '-b', '-g', f'{gpu_a},{gpu_b}'],
capture_output=True, text=True, timeout=60
)
# Expected H100 NVLink: ~450 GB/s per direction
# Alert if below 80% of expected
return self._parse_bandwidth(result.stdout)
def validate_tp_group(self, gpu_ids):
"""Validate all GPUs in a TP group have healthy NVLink."""
issues = []
for i, gpu_a in enumerate(gpu_ids):
for gpu_b in gpu_ids[i+1:]:
bw = self.measure_nvlink_bandwidth(gpu_a, gpu_b)
expected_bw = 450 # GB/s for H100
if bw < expected_bw * 0.8:
issues.append(
f"GPU {gpu_a} <-> GPU {gpu_b}: "
f"{bw:.0f} GB/s (expected {expected_bw})"
)
return issues
NVLink Degradation Impact on TP-2 Inference (Llama 70B, H100)
| NVLink State | AllReduce BW | Decode Latency (batch=1) | Throughput Impact | Action |
|---|---|---|---|---|
| All 18 links active | 450 GB/s | 12.5 ms | Baseline | None |
| 16/18 links active | 400 GB/s | 13.0 ms | -4% | Monitor |
| 12/18 links active | 300 GB/s | 14.8 ms | -15% | Schedule maintenance |
| Fallback to PCIe | 50 GB/s | 28.2 ms | -56% | Evict from TP group |
| One GPU unreachable | N/A | N/A | -100% | Replace GPU |
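The shape of this table follows from a first-order latency model: per-token decode latency is on-GPU compute plus AllReduce transfer time, and link degradation inflates only the communication term. A sketch of that model (the compute time and bytes-per-token below are illustrative placeholders, not measurements from the table):

```python
def decode_latency_ms(compute_ms, allreduce_gb_per_token, link_bw_gbps):
    """First-order decode-latency model for a tensor-parallel group.

    Per-token latency = on-GPU compute + AllReduce transfer time.
    A degraded link raises only the communication term, which is why
    PCIe fallback (50 GB/s) is far worse than losing a few NVLink
    links. Inputs here are illustrative, not measured.
    """
    comm_ms = allreduce_gb_per_token / link_bw_gbps * 1000.0
    return compute_ms + comm_ms

healthy = decode_latency_ms(11.0, 0.7, 450)   # full NVLink
degraded = decode_latency_ms(11.0, 0.7, 300)  # 12/18 links
pcie = decode_latency_ms(11.0, 0.7, 50)       # PCIe fallback
```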
Prometheus Metrics Export
from prometheus_client import Gauge, Counter, Info
# GPU health metrics for Prometheus
gpu_health_score = Gauge(
'dcgm_gpu_health_score',
'Composite GPU health score (0.0-1.0)',
['gpu_id', 'hostname']
)
gpu_ecc_correctable = Counter(
'dcgm_ecc_correctable_total',
'Total correctable ECC errors',
['gpu_id', 'hostname']
)
gpu_ecc_uncorrectable = Counter(
'dcgm_ecc_uncorrectable_total',
'Total uncorrectable ECC errors',
['gpu_id', 'hostname']
)
gpu_temperature = Gauge(
'dcgm_gpu_temperature_celsius',
'Current GPU temperature',
['gpu_id', 'hostname']
)
gpu_retired_pages = Gauge(
'dcgm_retired_pages_total',
'Total retired memory pages',
['gpu_id', 'hostname', 'type']
)
gpu_nvlink_errors = Counter(
'dcgm_nvlink_crc_errors_total',
'Total NVLink CRC flit errors',
['gpu_id', 'hostname', 'link_id']
)
gpu_eviction_events = Counter(
'dynamo_gpu_eviction_total',
'Total GPU eviction events',
['gpu_id', 'hostname', 'reason']
)
def export_metrics(collector, scorer, hostname):
"""Export DCGM metrics to Prometheus."""
for gpu_id in range(8): # 8-GPU node
metrics = collector.get_gpu_health(gpu_id)
history = collector.get_history(gpu_id)
score, issues = scorer.compute_score(metrics, history)
gpu_health_score.labels(gpu_id=gpu_id, hostname=hostname).set(score)
gpu_temperature.labels(gpu_id=gpu_id, hostname=hostname).set(
metrics['temperature_c']
)
gpu_retired_pages.labels(
gpu_id=gpu_id, hostname=hostname, type='sbe'
).set(metrics['retired_pages_sbe'])
gpu_retired_pages.labels(
gpu_id=gpu_id, hostname=hostname, type='dbe'
).set(metrics['retired_pages_dbe'])
In a 1000-GPU cluster, correctable ECC errors generate thousands of events per day. Do not alert on individual correctable errors. Instead, alert on the error rate (errors per hour per GPU) and the trend (accelerating rate). Set thresholds that produce no more than 2-3 actionable alerts per day across the entire cluster.
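That advice translates into a two-condition alert: fire only when a GPU's hourly rate is both above threshold and accelerating relative to its own recent baseline. A sketch of the decision function (thresholds illustrative):

```python
def should_alert(hourly_rates, rate_threshold=100, accel_factor=2.0):
    """Alert on correctable-ECC rate only if it is both high and rising.

    hourly_rates: per-hour error rates for one GPU, oldest first.
    Fires when the latest rate exceeds the threshold AND is at least
    accel_factor times the average of the preceding hours, so a
    steady-but-high baseline does not page anyone repeatedly.
    """
    if len(hourly_rates) < 2:
        return False
    current = hourly_rates[-1]
    baseline = sum(hourly_rates[:-1]) / (len(hourly_rates) - 1)
    if current < rate_threshold:
        return False
    return baseline == 0 or current >= accel_factor * baseline
```

A steady 120/hr never fires again after the first look, while a GPU accelerating from 10 to 150 errors/hr does.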
Integration with Dynamo Scheduling
The health system feeds directly into Dynamo's request routing. Unhealthy GPUs receive fewer or no requests:
import random

class HealthAwareScheduler:
"""Dynamo scheduler with GPU health integration."""
def __init__(self, health_scorer, eviction_threshold=0.3):
self.health_scorer = health_scorer
self.eviction_threshold = eviction_threshold
self.gpu_weights = {} # gpu_id -> scheduling weight
def update_weights(self, gpu_health_scores):
"""Update scheduling weights based on health scores."""
for gpu_id, score in gpu_health_scores.items():
if score <= self.eviction_threshold:
self.gpu_weights[gpu_id] = 0.0 # No requests
elif score <= 0.7:
# Proportionally reduce load on degraded GPUs
self.gpu_weights[gpu_id] = score
else:
self.gpu_weights[gpu_id] = 1.0 # Full capacity
def select_gpu(self, request):
"""Select GPU for request, weighted by health."""
available = {
gpu_id: weight
for gpu_id, weight in self.gpu_weights.items()
if weight > 0
}
if not available:
raise RuntimeError("No healthy GPUs available")
# Weighted selection: healthier GPUs get more requests
total_weight = sum(available.values())
r = random.random() * total_weight
cumulative = 0.0
for gpu_id, weight in available.items():
cumulative += weight
if r <= cumulative:
return gpu_id
return list(available.keys())[-1]
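A quick sanity check on the weighted selection: an evicted GPU (weight 0) should never be chosen, and a degraded GPU should receive proportionally less traffic. A standalone version of the same selection loop, with a seeded RNG for reproducibility:

```python
import random
from collections import Counter

def pick_weighted(weights, rng):
    """Weighted choice over gpu_id -> weight, skipping zero weights."""
    available = {g: w for g, w in weights.items() if w > 0}
    if not available:
        raise RuntimeError("No healthy GPUs available")
    r = rng.random() * sum(available.values())
    cumulative = 0.0
    for gpu_id, w in available.items():
        cumulative += w
        if r <= cumulative:
            return gpu_id
    return next(reversed(available))  # float-rounding fallback

rng = random.Random(0)
weights = {0: 1.0, 1: 0.5, 2: 0.0}  # GPU 2 evicted
picks = Counter(pick_weighted(weights, rng) for _ in range(10_000))
# GPU 2 is never selected; GPU 0 draws roughly twice GPU 1's traffic
```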
Summary
GPU health monitoring in large inference clusters requires continuous DCGM metric collection, composite health scoring, Xid error classification, ECC error rate prediction, NVLink bandwidth validation, and automated GPU eviction with graceful request draining. The most dangerous failure mode is silent data corruption from uncorrectable ECC errors; these require immediate GPU eviction and output revalidation. Predictive maintenance based on correctable ECC error trends can detect 72% of impending failures, enabling proactive replacement before user-facing impact. Integration with the Dynamo scheduler ensures that degraded GPUs receive proportionally fewer requests, maintaining cluster-level SLA while individual GPUs decline.