Part of Series: vLLM v1 & Omni Internals (22 of 25)

vLLM v1 Error Handling: CUDA OOM Recovery, Request Retry, and Graceful Degradation

In the first month of production serving, our vLLM cluster hit CUDA OOM errors on 3.7% of decode steps — enough to cause visible user-facing failures every few minutes. CUDA driver errors, NaN outputs, tensor shape mismatches, and worker crashes added another 1.2%. Left unhandled, these errors cascade: one failed request preempts others, freed KV blocks trigger memory fragmentation, and eventually the entire worker needs a restart that drops all in-flight requests. vLLM v1’s error handling prevents this cascade through multiple defensive layers: preemptive memory management that stops new requests before OOM, graceful degradation that sacrifices throughput to preserve correctness, request-level retry for transient failures, and fast worker restart for fatal errors. This post covers each layer with code paths and the configuration knobs that tune recovery behavior.

CUDA OOM: The Most Common Production Error

CUDA OOM occurs when a memory allocation request exceeds available GPU memory. In LLM serving, this typically happens when the KV cache grows beyond the allocated budget.

class CUDAOOMHandler:
    """
    Handle CUDA out-of-memory errors in vLLM.

    Root causes:
    1. KV cache exceeds allocated blocks (most common)
    2. Activation memory spike during prefill of long sequence
    3. Memory fragmentation after many allocations/frees
    4. Concurrent model + KV cache exceeds total GPU memory
    """

    def __init__(self, config: dict):
        self.max_retries = config.get("max_retries", 3)
        self.preemption_mode = config.get("preemption_mode", "recompute")
        self.gpu_memory_utilization = config.get("gpu_memory_utilization", 0.90)

    def handle_oom(self, error: Exception, context: dict) -> dict:
        """
        Handle a CUDA OOM error.

        Strategy:
        1. First: preempt lowest-priority running requests
        2. Second: reduce batch size for next iteration
        3. Third: if still OOM, restart worker process
        """
        if "out of memory" in str(error).lower():
            return self._handle_kv_cache_oom(context)
        elif "CUDA error" in str(error):
            return self._handle_cuda_driver_error(context)
        else:
            return {"action": "propagate", "error": str(error)}

    def _handle_kv_cache_oom(self, context: dict) -> dict:
        """
        KV cache OOM: preempt requests to free blocks.

        vLLM approach:
        - Preempt the request with the most allocated blocks
        - Two modes: recompute (discard KV) or swap (move to CPU)
        """
        running_requests = context.get("running_requests", [])
        if not running_requests:
            return {"action": "restart_worker", "reason": "oom_no_requests"}

        # Sort by KV cache usage (descending)
        sorted_requests = sorted(
            running_requests,
            key=lambda r: r["kv_blocks_allocated"],
            reverse=True
        )

        # Preempt the largest request
        victim = sorted_requests[0]

        if self.preemption_mode == "recompute":
            return {
                "action": "preempt_recompute",
                "victim_request_id": victim["request_id"],
                "blocks_freed": victim["kv_blocks_allocated"],
                "cost": "Must re-prefill when rescheduled",
            }
        elif self.preemption_mode == "swap":
            return {
                "action": "preempt_swap",
                "victim_request_id": victim["request_id"],
                "blocks_freed": victim["kv_blocks_allocated"],
                "cost": "CPU memory used, swap-in latency later",
            }

        return {"action": "restart_worker"}

    def _handle_cuda_driver_error(self, context: dict) -> dict:
        """
        CUDA driver errors (ECC, hardware fault, driver crash)
        are typically unrecoverable. Restart the worker.
        """
        return {
            "action": "restart_worker",
            "reason": "cuda_driver_error",
            "all_requests_failed": True,
            "retry_eligible": True,
        }
CUDA OOM Recovery Strategies

| Strategy | Recovery Time | Data Loss | When Used |
|---|---|---|---|
| Preempt (recompute) | 0 ms | KV cache of 1 request | Memory pressure |
| Preempt (swap to CPU) | 5-50 ms | None (swapped) | High memory pressure |
| Reduce batch size | 0 ms | None | Sustained pressure |
| Worker restart | 5-30 sec | All in-flight requests | Fatal CUDA error |
⚠️ Warning

CUDA OOM is NOT the same as a CUDA driver error. OOM is recoverable by freeing memory (preempting requests). A CUDA driver error (ECC failure, illegal memory access) corrupts the CUDA context and requires a full process restart. vLLM v1 distinguishes between these two cases: OOM triggers preemption, driver errors trigger worker restart.
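The OOM-vs-driver-error distinction can be sketched as a small message classifier. `classify_cuda_error` is a hypothetical helper, not vLLM's actual API; production code would match `torch.cuda.OutOfMemoryError` by exception type rather than by substring, but the triage logic is the same.

```python
def classify_cuda_error(exc: Exception) -> str:
    """Return 'oom' (preempt), 'driver' (restart worker), or 'other' (propagate)."""
    msg = str(exc).lower()
    if "out of memory" in msg:
        return "oom"            # recoverable: free KV blocks via preemption
    fatal_markers = ("ecc error", "illegal memory access", "cuda error")
    if any(m in msg for m in fatal_markers):
        return "driver"         # CUDA context is corrupted: full restart required
    return "other"
```

Note the ordering: the OOM check runs first, because an OOM message is also a CUDA message but only needs preemption, not a restart.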

Preemptive Memory Management

The best OOM handling is preventing OOM from occurring.

class PreemptiveMemoryManager:
    """
    Prevent OOM by monitoring memory and acting before exhaustion.
    """

    def __init__(self, config: dict):
        self.watermark_high = config.get("watermark_high", 0.90)
        self.watermark_critical = config.get("watermark_critical", 0.95)
        self.watermark_oom = config.get("watermark_oom", 0.98)

    def check_memory_pressure(self, gpu_memory_used_gb: float,
                                gpu_memory_total_gb: float) -> dict:
        """
        Check memory utilization and recommend action.
        """
        utilization = gpu_memory_used_gb / gpu_memory_total_gb

        if utilization < self.watermark_high:
            return {
                "status": "normal",
                "action": "none",
                "can_accept_new_requests": True,
            }
        elif utilization < self.watermark_critical:
            return {
                "status": "high",
                "action": "stop_accepting_new_requests",
                "can_accept_new_requests": False,
                "reason": "Memory above high watermark",
            }
        elif utilization < self.watermark_oom:
            return {
                "status": "critical",
                "action": "preempt_lowest_priority",
                "can_accept_new_requests": False,
                "num_requests_to_preempt": 1,
            }
        else:
            return {
                "status": "oom_imminent",
                "action": "emergency_preempt",
                "can_accept_new_requests": False,
                "num_requests_to_preempt": 3,
            }

    def admission_control(self, new_request: dict,
                           current_utilization: float) -> dict:
        """
        Decide whether to admit a new request based on memory forecast.

        Estimate memory needed = input_tokens * kv_bytes_per_token
                                + max_output_tokens * kv_bytes_per_token
        """
        estimated_input_tokens = new_request.get("input_length", 0)
        estimated_output_tokens = new_request.get("max_tokens", 1024)
        # Placeholder default; in practice, derive this from the model
        # architecture (layers, KV heads, head_dim, dtype)
        kv_bytes_per_token = new_request.get("kv_bytes_per_token", 80)

        estimated_memory_mb = (
            (estimated_input_tokens + estimated_output_tokens) *
            kv_bytes_per_token / 1e6
        )

        # Assumes an 80 GB GPU (A100/H100); parameterize for other hardware
        headroom_mb = (1 - current_utilization) * 80 * 1024
        safety_margin_mb = headroom_mb * 0.1  # Keep 10% safety margin

        if estimated_memory_mb < headroom_mb - safety_margin_mb:
            return {"admit": True}
        else:
            return {
                "admit": False,
                "reason": "insufficient_memory",
                "estimated_need_mb": estimated_memory_mb,
                "available_mb": headroom_mb - safety_margin_mb,
                "retry_after_ms": 500,
            }
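The `kv_bytes_per_token` input above need not be a guess: it follows directly from the model architecture. A minimal sketch (the helper name is ours; the example numbers assume a Llama-3-8B-like config with 32 layers, 8 KV heads under GQA, head_dim 128, FP16 cache):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
# 131072 bytes = 128 KiB per token, so a 32k-token request
# needs roughly 4 GiB of KV cache
```

Plugging the real per-token figure into `admission_control` makes the memory forecast meaningful instead of a rough heuristic.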

Memory Watermark Levels

| Zone | GPU Memory Utilization |
|---|---|
| Normal | 0-90% |
| High | 90-95% |
| Critical | 95-98% |
| OOM Zone | 98-100% |

Request Retry Logic

When a request fails due to a recoverable error, vLLM can retry it automatically.

class RequestRetryManager:
    """
    Manage request retries for transient failures.
    """

    def __init__(self, config: dict):
        self.max_retries = config.get("max_retries", 3)
        self.base_delay_ms = config.get("base_delay_ms", 100)
        self.max_delay_ms = config.get("max_delay_ms", 5000)
        self.retryable_errors = {
            "cuda_oom",
            "preempted",
            "worker_restarted",
            "timeout",
        }
        self.non_retryable_errors = {
            "invalid_input",
            "model_not_found",
            "token_limit_exceeded",
            "cancelled_by_client",
        }

    def should_retry(self, error_type: str, attempt: int) -> dict:
        """
        Determine if a failed request should be retried.
        """
        if error_type in self.non_retryable_errors:
            return {"retry": False, "reason": f"Non-retryable error: {error_type}"}

        if attempt >= self.max_retries:
            return {"retry": False, "reason": f"Max retries ({self.max_retries}) exceeded"}

        if error_type in self.retryable_errors:
            # Exponential backoff with jitter
            delay_ms = min(
                self.base_delay_ms * (2 ** attempt),
                self.max_delay_ms
            )
            import random
            jitter_ms = random.uniform(0, delay_ms * 0.1)
            delay_ms += jitter_ms

            return {
                "retry": True,
                "delay_ms": delay_ms,
                "attempt": attempt + 1,
                "max_attempts": self.max_retries,
            }

        return {"retry": False, "reason": f"Unknown error type: {error_type}"}

    def handle_preempted_request(self, request: dict) -> dict:
        """
        Handle a request that was preempted to free memory.

        Preempted requests are re-queued with higher priority
        to avoid starvation.
        """
        return {
            "action": "requeue",
            "priority_boost": True,
            "original_request_id": request["request_id"],
            "tokens_generated_before_preempt": request.get("output_tokens", 0),
            "resume_strategy": request.get("preemption_mode", "recompute"),
        }
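The backoff arithmetic inside `should_retry` can be isolated into a standalone sketch: the delay doubles per attempt, is capped at the maximum, and gets up to 10% random jitter so that a burst of failed requests does not retry in lockstep. The function name here is illustrative.

```python
import random

def backoff_schedule(attempts: int, base_ms: int = 100,
                     max_ms: int = 5000, jitter: bool = True) -> list:
    """Compute retry delays for the first `attempts` retries."""
    delays = []
    for attempt in range(attempts):
        delay = min(base_ms * (2 ** attempt), max_ms)  # exponential, capped
        if jitter:
            delay += random.uniform(0, delay * 0.1)    # de-synchronize retries
        delays.append(delay)
    return delays

# Without jitter: [100, 200, 400] for three attempts; the cap kicks in
# at attempt 6, where 100 ms * 2^6 = 6400 ms is clamped to 5000 ms.
```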
Retry Configuration by Error Type

| Error Type | Retryable | Max Retries | Backoff Strategy | Typical Recovery |
|---|---|---|---|---|
| CUDA OOM (preempted) | Yes | 3 | Exponential 100ms-5s | 100-500 ms |
| Worker restart | Yes | 2 | Fixed 5s | 5-30 sec |
| Timeout | Yes | 2 | Exponential 1s-10s | 1-10 sec |
| Invalid input | No | - | - | Return error |
| NaN output | Yes | 1 | Fixed 100ms | 100 ms |

Worker Process Recovery

When a GPU worker crashes (CUDA driver error, segfault, Python exception), vLLM must restart it without losing the entire serving deployment.

class WorkerRecoveryManager:
    """
    Manage worker process lifecycle and recovery.
    """

    def __init__(self, config: dict):
        self.num_workers = config.get("tensor_parallel_size", 1)
        self.restart_timeout_sec = config.get("restart_timeout", 60)
        self.max_restarts_per_hour = config.get("max_restarts_per_hour", 5)
        self.restart_count = 0

    def detect_worker_failure(self, worker_id: int) -> dict:
        """
        Detect that a worker process has failed.

        Detection methods:
        1. Process exit code != 0
        2. Heartbeat timeout (no response in N seconds)
        3. CUDA error reported via health check
        """
        return {
            "worker_id": worker_id,
            "detection_method": "heartbeat_timeout",
            "failure_type": "cuda_driver_error",
        }

    def restart_worker(self, worker_id: int) -> dict:
        """
        Restart a failed worker process.

        Steps:
        1. Kill the old process (if still running)
        2. Release GPU resources (CUDA context destroy)
        3. Launch new process with same GPU assignment
        4. Reload model weights
        5. Re-initialize KV cache
        6. Signal readiness to scheduler

        For TP (tensor parallel): ALL workers must be restarted
        because NCCL communicators are invalidated.
        """
        if self.restart_count >= self.max_restarts_per_hour:
            return {
                "action": "circuit_breaker_open",
                "reason": f"Too many restarts ({self.restart_count}/hr)",
                "recommendation": "Manual investigation required",
            }

        self.restart_count += 1

        # For TP > 1: must restart all workers (NCCL requirement)
        workers_to_restart = list(range(self.num_workers))

        # Estimate restart time
        model_load_sec = 30  # Typical for 70B FP16 from disk
        kv_cache_init_sec = 2
        nccl_init_sec = 5
        total_restart_sec = model_load_sec + kv_cache_init_sec + nccl_init_sec

        return {
            "action": "restart_all_workers",
            "workers": workers_to_restart,
            "estimated_downtime_sec": total_restart_sec,
            "requests_affected": "all_in_flight",
            "mitigation": "Requests will be retried after restart",
        }

    def fast_restart_with_checkpointing(self, worker_id: int) -> dict:
        """
        Fast restart using model weight caching.

        If model weights are cached in shared memory or a tmpfs mount,
        reload is much faster (memory copy vs disk read).
        """
        # Model in shared memory: ~2 sec to remap
        # Model on NVMe: ~10 sec for 70B
        # Model on disk: ~30 sec for 70B

        return {
            "model_source": "shared_memory",
            "model_load_sec": 2,
            "kv_cache_init_sec": 2,
            "nccl_reinit_sec": 5,
            "total_restart_sec": 9,
            "downtime_reduction": "3x faster than disk reload",
        }
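The heartbeat-timeout detection mentioned in `detect_worker_failure` can be sketched as a small monitor: each worker reports a timestamp periodically, and any worker whose last heartbeat is older than the timeout is flagged. The class name and API here are illustrative, not vLLM's actual implementation.

```python
import time
from typing import Optional

class HeartbeatMonitor:
    """Flag workers whose last heartbeat is older than timeout_sec."""

    def __init__(self, timeout_sec: float = 10.0):
        self.timeout_sec = timeout_sec
        self.last_seen = {}  # worker_id -> last heartbeat timestamp

    def heartbeat(self, worker_id: int, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` is injectable for testing."""
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def failed_workers(self, now: Optional[float] = None) -> list:
        """Return IDs of workers that have missed the heartbeat deadline."""
        now = time.monotonic() if now is None else now
        return [wid for wid, ts in self.last_seen.items()
                if now - ts > self.timeout_sec]
```

In a TP deployment, a single flagged worker is enough to trigger the full-group restart described above, since the NCCL communicator spans all ranks.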

Worker Restart Time by Model Loading Strategy

| Strategy | Restart Time (sec) |
|---|---|
| Disk (HDD) | 60 |
| NVMe SSD | 15 |
| Shared Memory | 9 |
| No restart (preempt only) | 0 |
ℹ️ Note

In tensor parallel configurations (TP > 1), a single worker failure requires restarting ALL workers because NCCL communicators are tied to the process group. This means a single GPU error in an 8-GPU TP=8 setup causes a full restart of all 8 workers. This is a known limitation of NCCL and is one reason to prefer pipeline parallelism (PP) for reliability — a PP stage failure only affects that stage.

NaN Detection and Recovery

NaN (Not a Number) outputs can propagate silently through the model, producing garbage text.

class NaNDetector:
    """
    Detect and handle NaN values in model outputs.
    """

    def __init__(self, config: dict):
        self.check_frequency = config.get("nan_check_frequency", "every_step")
        self.action_on_nan = config.get("nan_action", "retry_with_different_seed")

    def check_output(self, logits: list, request_id: str) -> dict:
        """
        Check model output for NaN values.

        Where NaNs appear:
        1. Logits (most common: attention score overflow)
        2. Hidden states (rare: weight corruption)
        3. KV cache (very rare: memory corruption)
        """
        has_nan = any(x != x for x in logits)  # NaN != NaN is True
        has_inf = any(abs(x) > 1e30 for x in logits)

        if has_nan or has_inf:
            return {
                "detected": True,
                "type": "nan" if has_nan else "inf",
                "request_id": request_id,
                "action": self._determine_action(),
            }

        return {"detected": False}

    def _determine_action(self) -> dict:
        """
        Determine recovery action for NaN outputs.
        """
        actions = {
            "retry_with_different_seed": {
                "description": "Retry the same request with a different random seed",
                "success_rate": "80% (if NaN was caused by numerical edge case)",
                "cost": "One additional forward pass",
            },
            "clamp_and_continue": {
                "description": "Clamp logits to [-100, 100] and sample anyway",
                "success_rate": "60% (output quality may degrade)",
                "cost": "None (in-place fix)",
            },
            "abort_request": {
                "description": "Return error to client",
                "success_rate": "N/A",
                "cost": "Client must retry",
            },
        }
        return actions[self.action_on_nan]

    def investigate_nan_source(self) -> dict:
        """
        Common causes and diagnostics for NaN in LLM inference.
        """
        return {
            "attention_overflow": {
                "cause": "QK^T values too large, softmax overflows",
                "symptoms": "NaN in attention output, often at long contexts",
                "fix": "Use FlashAttention (numerically stable)",
                "vllm_default": "FlashAttention enabled by default",
            },
            "fp16_overflow": {
                "cause": "FP16 max is 65504, intermediate values exceed this",
                "symptoms": "Sporadic NaN, more common with aggressive quantization",
                "fix": "Use BF16 (larger range) or FP32 accumulation",
            },
            "weight_corruption": {
                "cause": "GPU memory error (ECC failure, cosmic ray)",
                "symptoms": "Persistent NaN for all requests after corruption",
                "fix": "Restart worker (reload model weights)",
            },
            "kv_cache_corruption": {
                "cause": "Memory allocation bug or hardware error",
                "symptoms": "NaN only for specific cached requests",
                "fix": "Evict corrupted KV blocks, re-prefill",
            },
        }
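The "attention overflow" entry above comes down to a naive softmax: exponentiating a large attention score overflows, and the subsequent division produces NaN. Subtracting the row maximum first, as FlashAttention does internally, is algebraically identical but cannot overflow. A minimal sketch (pure Python; note that `math.exp` raises `OverflowError` where FP16 hardware would silently produce `inf`):

```python
import math

def naive_softmax(scores: list) -> list:
    exps = [math.exp(s) for s in scores]      # overflows for scores > ~709
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(scores: list) -> list:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # largest exponent is exactly 0
    total = sum(exps)
    return [e / total for e in exps]
```

On GPU, the naive version yields `inf / inf = NaN` rather than an exception, which is exactly the silent corruption the NaN detector exists to catch.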
NaN Sources and Recovery

| Source | Frequency | Recovery | Success Rate |
|---|---|---|---|
| Attention overflow | Rare (with FlashAttn) | Automatic (FlashAttn) | 99.9% |
| FP16 overflow | Occasional | Retry with BF16 | 95% |
| Weight corruption | Very rare | Worker restart | 100% |
| KV cache corruption | Very rare | Evict + re-prefill | 100% |
| Quantization artifacts | Model-dependent | Higher precision | 90% |

Graceful Degradation Under Load

When the system is overloaded, graceful degradation is better than hard failures.

class GracefulDegradation:
    """
    Degrade service quality gracefully under extreme load.
    """

    def __init__(self, config: dict):
        self.max_queue_depth = config.get("max_queue_depth", 1000)
        self.degradation_levels = [
            {"queue_depth": 100, "action": "reduce_max_tokens", "max_tokens": 2048},
            {"queue_depth": 500, "action": "reduce_max_tokens", "max_tokens": 512},
            {"queue_depth": 800, "action": "reject_low_priority", "priority_threshold": 3},
            {"queue_depth": 1000, "action": "reject_all_new", "reason": "system_overloaded"},
        ]

    def evaluate_degradation(self, current_queue_depth: int,
                               request_priority: int) -> dict:
        """
        Determine degradation level and action for new request.
        """
        for level in reversed(self.degradation_levels):
            if current_queue_depth >= level["queue_depth"]:
                action = level["action"]

                if action == "reduce_max_tokens":
                    return {
                        "admit": True,
                        "degraded": True,
                        "original_max_tokens": "as_requested",
                        "effective_max_tokens": level["max_tokens"],
                        "reason": "Queue depth exceeded threshold",
                    }
                elif action == "reject_low_priority":
                    if request_priority >= level["priority_threshold"]:
                        return {"admit": False, "reason": "low_priority_rejected"}
                    else:
                        return {"admit": True, "degraded": False}
                elif action == "reject_all_new":
                    return {
                        "admit": False,
                        "reason": "system_overloaded",
                        "retry_after_sec": 10,
                        "http_status": 503,
                    }

        return {"admit": True, "degraded": False}

    def circuit_breaker(self, error_rate_pct: float,
                         window_sec: int = 60) -> dict:
        """
        Circuit breaker pattern: stop accepting requests
        if error rate is too high.
        """
        if error_rate_pct > 50:
            return {
                "state": "open",
                "action": "reject_all",
                "reason": f"Error rate {error_rate_pct:.0f}% exceeds 50%",
                "retry_after_sec": 30,
            }
        elif error_rate_pct > 20:
            return {
                "state": "half_open",
                "action": "admit_10_pct",
                "reason": f"Error rate {error_rate_pct:.0f}% exceeds 20%",
            }
        else:
            return {"state": "closed", "action": "normal"}
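The `circuit_breaker` method needs an `error_rate_pct` input from somewhere. A sketch of a sliding-window tracker (class name ours): record each request outcome with a timestamp, evict entries older than the window, and report the error percentage over what remains.

```python
import time
from collections import deque
from typing import Optional

class ErrorRateWindow:
    """Error rate over a sliding time window, suitable for a circuit breaker."""

    def __init__(self, window_sec: float = 60.0):
        self.window_sec = window_sec
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool, now: Optional[float] = None) -> None:
        ts = time.monotonic() if now is None else now
        self.events.append((ts, is_error))

    def error_rate_pct(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the window
        while self.events and now - self.events[0][0] > self.window_sec:
            self.events.popleft()
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_err in self.events if is_err)
        return 100.0 * errors / len(self.events)
```

Feeding `error_rate_pct()` into `circuit_breaker` every few seconds closes the loop: sustained failures open the breaker, and the window naturally decays back toward zero once errors stop.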
Graceful Degradation Levels

| Queue Depth | Action | Impact on New Requests | Existing Requests |
|---|---|---|---|
| 0-100 | Normal operation | None | None |
| 100-500 | Reduce max tokens to 2048 | Shorter responses | Unaffected |
| 500-800 | Reduce max tokens to 512 | Much shorter responses | Unaffected |
| 800-1000 | Reject low priority | Low-priority rejected | Unaffected |
| 1000+ | Reject all new | All new requests 503 | Drain existing |

Health Check and Monitoring

Production health checks must detect problems before they cause user-visible errors.

class HealthCheckSystem:
    """
    Multi-level health check for vLLM serving.
    """

    def liveness_check(self) -> dict:
        """
        Liveness: is the process running?
        Checked by Kubernetes every 10 seconds.
        """
        import os  # local import keeps this snippet self-contained
        return {"alive": True, "pid": os.getpid()}

    def readiness_check(self, engine_state: dict) -> dict:
        """
        Readiness: can the system accept new requests?
        """
        checks = {
            "model_loaded": engine_state.get("model_loaded", False),
            "workers_ready": engine_state.get("all_workers_ready", False),
            "memory_available": engine_state.get("memory_utilization", 1.0) < 0.98,
            "error_rate_ok": engine_state.get("error_rate_pct", 100) < 50,
        }

        all_ready = all(checks.values())
        return {
            "ready": all_ready,
            "checks": checks,
            "http_status": 200 if all_ready else 503,
        }

    def deep_health_check(self, engine_state: dict) -> dict:
        """
        Deep health: detailed system state for monitoring.
        Run every 30 seconds, report to Prometheus/Grafana.
        """
        return {
            "gpu_utilization_pct": engine_state.get("gpu_util", 0),
            "gpu_memory_used_gb": engine_state.get("gpu_mem_used", 0),
            "gpu_memory_total_gb": engine_state.get("gpu_mem_total", 80),
            "kv_cache_utilization_pct": engine_state.get("kv_util", 0),
            "running_requests": engine_state.get("running", 0),
            "waiting_requests": engine_state.get("waiting", 0),
            "requests_per_second": engine_state.get("rps", 0),
            "avg_ttft_ms": engine_state.get("avg_ttft", 0),
            "p99_ttft_ms": engine_state.get("p99_ttft", 0),
            "avg_tpot_ms": engine_state.get("avg_tpot", 0),
            "p99_tpot_ms": engine_state.get("p99_tpot", 0),
            "error_count_last_minute": engine_state.get("errors_1m", 0),
            "preemption_count_last_minute": engine_state.get("preemptions_1m", 0),
            "worker_restarts_last_hour": engine_state.get("restarts_1h", 0),
        }
💡 Tip

Configure Kubernetes liveness probes with a generous timeout (30 seconds) for vLLM. Model loading and worker initialization can take 30-60 seconds. If the liveness probe timeout is shorter than the startup time, Kubernetes will kill and restart the pod in a loop.

Error Handling Configuration

def recommended_error_handling_config() -> dict:
    """
    Recommended error handling configuration for production vLLM.
    """
    return {
        # Memory management
        "gpu_memory_utilization": 0.90,  # Leave 10% headroom
        "max_num_seqs": 256,             # Limit concurrent sequences
        "max_model_len": 32768,          # Cap sequence length

        # Preemption
        "preemption_mode": "recompute",  # Cheaper than swap
        "enable_prefix_caching": True,    # Reduces re-prefill cost

        # Retry
        "max_retries": 3,
        "retry_base_delay_ms": 100,
        "retry_max_delay_ms": 5000,

        # Worker recovery
        "worker_restart_timeout_sec": 60,
        "max_worker_restarts_per_hour": 5,

        # Health checks
        "liveness_probe_timeout_sec": 30,
        "readiness_probe_timeout_sec": 10,
        "health_check_interval_sec": 10,

        # Degradation
        "max_queue_depth": 1000,
        "circuit_breaker_error_threshold_pct": 50,

        # NaN handling
        "nan_check": True,
        "nan_action": "retry_with_different_seed",
    }

Error Recovery Overhead by Error Type

| Error Type | Overhead (ms) |
|---|---|
| Preempt (recompute) | 0.5 |
| Preempt (swap to CPU) | 50 |
| NaN retry | 100 |
| Worker restart (SHM) | 9,000 |
| Worker restart (disk) | 60,000 |

Error handling in vLLM v1 follows a clear hierarchy: prevent errors through proactive memory management, handle recoverable errors through preemption and retry, and recover from fatal errors through worker restart. The most critical configuration choices are gpu_memory_utilization (leave headroom to avoid OOM), preemption_mode (recompute is cheaper for most workloads), and model weight caching (shared memory or NVMe for fast restart). In a well-configured deployment, most errors are handled transparently — the client sees a slightly delayed response rather than an error.