Part of Series: vLLM v1 & Omni Internals (25 of 25)

1. vLLM v1 Block Manager: Deconstructing KV Cache Memory Management at the Pointer Level
2. vLLM v1 Disaggregated Serving: The E/P/D/G Pipeline and Multimodal-First Architecture
3. vLLM OmniConnector: Async Multimodal Token Lifecycle Management
4. vLLM v1 Unified Scheduler: One Queue, No Prefill/Decode Distinction, and Persistent Batches
5. vLLM v1 Attention Backends: FlashAttention, FlashInfer, and PagedAttention Selection Logic
6. vLLM v1 Rejection Sampler: Native CFG and Speculative Verification Kernels
7. vLLM v1 Tensor Parallelism: Symmetric Workers, Incremental Updates, and NCCL Optimization
8. vLLM v1 Structured Output: The Native Grammar Engine and Token Mask Caching
9. vLLM v1 Prefix Caching: Hash Chains, LRU Eviction, and Hit Rate Optimization
10. vLLM v1 Multi-LoRA: Adapter Scheduling, Memory Management, and Batched Inference
11. vLLM v1 Performance Profiling: Finding and Fixing Bottlenecks in Production
12. vLLM v1 Speculative Decoding: Draft Model Integration and Token Verification Pipeline
13. vLLM v1 Vision Encoder: ViT Integration, Image Preprocessing, and Visual Token Pipeline
14. vLLM v1 Model Loading: Weight Distribution, safetensors Deserialization, and Progressive Startup
15. vLLM v1 Request Cancellation and Early Stopping: Freeing Resources Mid-Generation
16. vLLM v1 Quantized Inference: GPTQ, AWQ, FP8 Kernel Selection
17. vLLM v1 Distributed Execution: Ray Integration and Multi-Node Coordination
18. vLLM v1 KV Cache Offloading: GPU to CPU to SSD Tiered Memory
19. vLLM v1 Async Output: Detokenization, Streaming, and Queue Management
20. vLLM v1 Video and Audio: Temporal Encoding and Multi-Modal Batching
21. vLLM v1 Benchmarking: Systematic Optimization for Your Workload
22. vLLM v1 Error Handling: CUDA OOM Recovery, Request Retry, and Graceful Degradation
23. vLLM v1 Configuration Guide: gpu_memory_utilization, max_num_seqs, and Every Key Parameter
24. vLLM v1 Plugin Architecture: Custom Samplers, Schedulers, and Attention Backends
25. vLLM v1 Production Checklist: From Development to Reliable 24/7 Serving

vLLM v1 Production Checklist: From Development to Reliable 24/7 Serving

The first production outage always happens at 2 AM on a weekend: a configuration issue you did not catch in testing, a memory leak that only appears under sustained load, or a monitoring gap that left you blind to the actual problem. Moving vLLM from a development notebook to reliable 24/7 production serving is not about clicking “deploy” — it is about systematic preparation across six dimensions: hardware sizing for peak load (not average), configuration hardening to prevent known failure modes, monitoring that catches issues before users do, load testing that surfaces bottlenecks, a deployment strategy that limits blast radius, and operational runbooks so your on-call engineer knows exactly what to do at 2 AM.

Hardware Sizing

The first decision: how many GPUs of what type.

def hardware_sizing(
    model_id: str,
    target_throughput_rps: float,
    target_ttft_p99_ms: float,
    target_tpot_p99_ms: float,
    avg_input_tokens: int,
    avg_output_tokens: int,
    peak_multiplier: float = 2.0
) -> dict:
    """
    Size GPU fleet for production deployment.

    Key principle: size for PEAK load, not average.
    """
    # Model memory requirements
    model_configs = {
        "llama-3.1-70b-instruct": {
            "bf16_memory_gb": 140,
            "fp8_memory_gb": 70,
            "awq4_memory_gb": 38,
            "per_gpu_throughput_rps": {
                "bf16_tp8": 5.0,
                "fp8_tp4": 8.0,
                "awq4_tp2": 12.0,
            },
        },
        "llama-3.1-8b-instruct": {
            "bf16_memory_gb": 16,
            "fp8_memory_gb": 8,
            "awq4_memory_gb": 5,
            "per_gpu_throughput_rps": {
                "bf16_tp1": 15.0,
                "fp8_tp1": 25.0,
                "awq4_tp1": 30.0,
            },
        },
    }

    config = model_configs.get(model_id, {})

    # Calculate replicas needed for peak throughput
    peak_rps = target_throughput_rps * peak_multiplier

    # Choose quantization based on quality requirements
    # Recommendation: FP8 for quality-sensitive, AWQ4 for cost-sensitive
    quantization_options = []

    import math  # used below for ceiling division

    for quant, rps_per_replica in config.get("per_gpu_throughput_rps", {}).items():
        # Despite the key name, these rates are per replica (one TP group).
        gpus_per_replica = int(quant.split("tp")[1]) if "tp" in quant else 1
        # Round up: a fractional replica still requires a whole replica
        # (e.g. 100 RPS at 8 RPS per replica -> 13 replicas, not 12).
        replicas_needed = max(1, math.ceil(peak_rps / rps_per_replica))
        total_gpus = replicas_needed * gpus_per_replica

        quantization_options.append({
            "quantization": quant,
            "gpus_per_replica": gpus_per_replica,
            "replicas": replicas_needed,
            "total_gpus": total_gpus,
            "estimated_cost_per_hour": total_gpus * 3.00,  # H100 cost
        })

    return {
        "model": model_id,
        "target_peak_rps": peak_rps,
        "options": sorted(quantization_options, key=lambda x: x["total_gpus"]),
    }

Hardware Sizing: Llama 3.1 70B for 100 RPS Peak

Configuration   GPUs/Replica   Replicas   Total GPUs   Hourly Cost
BF16, TP=8      8              20         160          $480
FP8, TP=4       4              13         52           $156
AWQ4, TP=2      2              9          18           $54
FP8, TP=8       8              7          56           $168
⚠️ Warning

Always size for peak load, not average. If your average is 50 RPS but peak is 100 RPS, you need capacity for 100 RPS. Autoscaling can help but has a 30-60 second response time, during which requests will queue. For latency-sensitive deployments, provision for peak. For cost-sensitive deployments, provision for average + 30% and accept higher tail latency during peaks.
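The replica counts in the table above reduce to a ceiling division over per-replica throughput. A minimal sketch, with rates taken from the table (the function name is illustrative):

```python
import math

def replicas_for_peak(peak_rps: float, rps_per_replica: float) -> int:
    # A fractional replica still requires a whole replica: round up.
    return max(1, math.ceil(peak_rps / rps_per_replica))

# Llama 3.1 70B at 100 RPS peak, FP8 TP=4 at ~8 RPS per replica:
print(replicas_for_peak(100, 8.0))  # 13 replicas, i.e. 52 GPUs at TP=4
```

With autoscaling in the loop, treat this result as the floor for minReplicas during known peak windows, not as a ceiling.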

Pre-Deployment Configuration Checklist

def pre_deployment_checklist() -> dict:
    """
    Configuration checklist before going to production.
    """
    return {
        "model_validation": [
            {
                "check": "Run eval suite on quantized model",
                "why": "Verify quality after quantization",
                "command": "python eval.py --model /path/to/model --tasks mmlu,humaneval",
                "pass_criteria": "Within 2% of BF16 baseline",
            },
            {
                "check": "Test maximum context length",
                "why": "Verify KV cache handles max_model_len",
                "command": "Send request with max_model_len input tokens",
                "pass_criteria": "No OOM, correct output",
            },
            {
                "check": "Test chat template",
                "why": "Ensure correct prompt formatting",
                "command": "Compare output with reference implementation",
                "pass_criteria": "Identical tokenization",
            },
        ],
        "configuration_hardening": [
            {
                "param": "gpu_memory_utilization",
                "production_value": 0.90,
                "why": "Leave headroom for activation spikes",
            },
            {
                "param": "max_num_seqs",
                "production_value": "128-256",
                "why": "Prevents OOM from too many concurrent sequences",
            },
            {
                "param": "max_model_len",
                "production_value": "Set explicitly (don't use model default)",
                "why": "Controls worst-case memory per sequence",
            },
            {
                "param": "disable_log_requests",
                "production_value": True,
                "why": "Avoid logging every request (performance + privacy)",
            },
            {
                "param": "enable_prefix_caching",
                "production_value": True,
                "why": "Reduce TTFT for repeated system prompts",
            },
        ],
    }

Pre-Deployment Configuration

Parameter                Dev Value   Production Value   Why Change
gpu_memory_utilization   0.95        0.90               Safety headroom
max_num_seqs             1024        256                Prevent OOM
max_model_len            131072      Set per use case   Memory planning
disable_log_requests     False       True               Performance + privacy
enable_prefix_caching    False       True               TTFT reduction
enable_chunked_prefill   True        True               Stable TPOT
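The hardening rows above can be enforced mechanically before launch. A sketch of a pre-flight check, with thresholds from the table (the function and config keys are illustrative, not a vLLM API):

```python
def validate_production_config(cfg: dict) -> list:
    """Return checklist violations for a proposed serving config."""
    issues = []
    if cfg.get("gpu_memory_utilization", 1.0) > 0.90:
        issues.append("gpu_memory_utilization above 0.90: no headroom for activation spikes")
    if "max_model_len" not in cfg:
        issues.append("max_model_len not set explicitly: worst-case memory unplanned")
    if cfg.get("max_num_seqs", 0) > 256:
        issues.append("max_num_seqs above 256: OOM risk under concurrent load")
    if not cfg.get("disable_log_requests", False):
        issues.append("request logging enabled: performance and privacy cost")
    return issues

prod = {"gpu_memory_utilization": 0.90, "max_model_len": 8192,
        "max_num_seqs": 256, "disable_log_requests": True}
print(validate_production_config(prod))  # [] -> safe to proceed
```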

Monitoring Setup

Production monitoring requires four categories of metrics.

def monitoring_setup() -> dict:
    """
    Monitoring configuration for production vLLM.
    """
    return {
        "request_metrics": {
            "vllm_request_success_total": {
                "type": "counter",
                "labels": ["model", "status"],
                "alert": "Rate drops below expected baseline",
            },
            "vllm_request_duration_seconds": {
                "type": "histogram",
                "labels": ["model"],
                "buckets": [0.1, 0.5, 1, 2, 5, 10, 30, 60],
                "alert": "P99 exceeds SLO",
            },
            "vllm_time_to_first_token_seconds": {
                "type": "histogram",
                "labels": ["model"],
                "buckets": [0.05, 0.1, 0.2, 0.5, 1, 2, 5],
                "alert": "P99 TTFT exceeds SLO",
            },
            "vllm_time_per_output_token_seconds": {
                "type": "histogram",
                "labels": ["model"],
                "buckets": [0.01, 0.02, 0.05, 0.1, 0.2, 0.5],
                "alert": "P99 TPOT exceeds SLO",
            },
        },
        "system_metrics": {
            "vllm_gpu_cache_usage_percent": {
                "type": "gauge",
                "alert_threshold": 95,
                "alert": "KV cache utilization above 95%",
            },
            "vllm_num_requests_running": {
                "type": "gauge",
                "alert": "Exceeds max_num_seqs",
            },
            "vllm_num_requests_waiting": {
                "type": "gauge",
                "alert_threshold": 100,
                "alert": "Queue depth exceeds 100",
            },
            "vllm_num_preemptions_total": {
                "type": "counter",
                "alert": "Rate exceeds 10/minute",
            },
        },
        "gpu_metrics": {
            "gpu_utilization_percent": {
                "source": "DCGM or nvidia-smi",
                "alert_low": 20,  # Under-utilization
                "alert_high": 99,  # Saturated
            },
            "gpu_memory_used_bytes": {
                "source": "DCGM",
                "alert_threshold": "95% of total",
            },
            "gpu_temperature_celsius": {
                "source": "DCGM",
                "alert_threshold": 85,
            },
        },
        "application_metrics": {
            "error_rate_percent": {
                "formula": "errors / total_requests * 100",
                "alert_threshold": 1.0,
            },
            "availability_percent": {
                "formula": "successful_health_checks / total_checks * 100",
                "target": 99.9,
            },
        },
    }

Alert Thresholds for Production Monitoring

Metric            Warning    Critical   Action
KV Cache Usage    85%        95%        Scale up or reduce max_num_seqs
Queue Depth       50         200        Scale up replicas
P99 TTFT          1.5x SLO   2x SLO     Investigate prefill bottleneck
P99 TPOT          1.5x SLO   2x SLO     Reduce batch size
Error Rate        0.5%       2%         Check GPU health, logs
Preemption Rate   5/min      20/min     Increase KV cache budget
GPU Temperature   80 °C      85 °C      Check cooling, throttling
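The warning/critical pairs in this table map directly onto a two-level alert rule. A sketch of the evaluation logic (metric names and thresholds mirror the table; wiring into Prometheus/Alertmanager is left out):

```python
# (warning, critical) thresholds from the alert table above.
THRESHOLDS = {
    "kv_cache_usage_pct": (85, 95),
    "queue_depth": (50, 200),
    "error_rate_pct": (0.5, 2.0),
    "preemptions_per_min": (5, 20),
    "gpu_temp_c": (80, 85),
}

def alert_level(metric: str, value: float) -> str:
    """Classify a metric sample as ok, warning, or critical."""
    warn, crit = THRESHOLDS[metric]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"

print(alert_level("queue_depth", 120))  # "warning" -> scale-up candidate
```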

Load Testing

Before production deployment, systematic load testing validates capacity and identifies breaking points.

def load_testing_plan() -> dict:
    """
    Structured load testing plan for vLLM deployment.
    """
    return {
        "phase_1_baseline": {
            "description": "Single-request latency baseline",
            "concurrency": 1,
            "duration_min": 5,
            "measure": ["TTFT", "TPOT", "total_latency"],
            "purpose": "Establish minimum latency achievable",
            "tool": "curl or custom script",
        },
        "phase_2_ramp": {
            "description": "Gradual ramp from 1 to target RPS",
            "concurrency": "1 -> target_rps over 10 minutes",
            "duration_min": 15,
            "measure": ["throughput", "latency percentiles", "GPU utilization"],
            "purpose": "Find throughput vs latency curve",
            "tool": "locust, k6, or vllm benchmark script",
        },
        "phase_3_sustained": {
            "description": "Sustained load at target RPS",
            "concurrency": "target_rps",
            "duration_min": 60,
            "measure": ["stability", "memory growth", "error rate"],
            "purpose": "Verify system handles sustained production load",
            "pass_criteria": [
                "Error rate below 0.1%",
                "No memory growth (no leak)",
                "P99 latency stable (no degradation over time)",
            ],
        },
        "phase_4_overload": {
            "description": "2x target RPS (overload test)",
            "concurrency": "2x target_rps",
            "duration_min": 15,
            "measure": ["graceful degradation", "error handling", "recovery"],
            "purpose": "Verify system degrades gracefully under overload",
            "pass_criteria": [
                "No crashes",
                "Errors are proper 429/503 (not 500)",
                "Recovers within 30s after load reduction",
            ],
        },
        "phase_5_chaos": {
            "description": "Kill a GPU worker during load",
            "concurrency": "target_rps",
            "duration_min": 10,
            "measure": ["recovery time", "requests affected", "data loss"],
            "purpose": "Verify worker recovery under load",
            "pass_criteria": [
                "Worker recovers within 60s",
                "Failed requests are retried",
                "Healthy replicas absorb load during recovery",
            ],
        },
    }

Load Test Phases and Duration

1. Baseline (5 min)
2. Ramp (15 min)
3. Sustained (60 min)
4. Overload (15 min)
5. Chaos (10 min)
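Phase 2's ramp ("1 -> target_rps over 10 minutes") is just a linear per-minute schedule fed to the load generator. A minimal sketch (the function name is illustrative):

```python
def ramp_schedule(start_rps: float, target_rps: float, minutes: int) -> list:
    """Per-minute RPS steps, linear from start to target inclusive."""
    if minutes < 2:
        return [float(target_rps)]
    step = (target_rps - start_rps) / (minutes - 1)
    return [round(start_rps + i * step, 2) for i in range(minutes)]

print(ramp_schedule(1, 100, 10))
# [1.0, 12.0, 23.0, 34.0, 45.0, 56.0, 67.0, 78.0, 89.0, 100.0]
```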

Kubernetes Deployment Configuration

def kubernetes_config() -> dict:
    """
    Kubernetes deployment configuration for vLLM.
    """
    return {
        "pod_spec": {
            "resources": {
                "requests": {
                    "nvidia.com/gpu": 4,  # TP=4
                    "memory": "64Gi",
                    "cpu": "16",
                },
                "limits": {
                    "nvidia.com/gpu": 4,
                    "memory": "128Gi",
                    "cpu": "32",
                },
            },
            "topology_spread": {
                "topologyKey": "kubernetes.io/hostname",
                "whenUnsatisfiable": "DoNotSchedule",
                # A pod's 4 GPUs always land on one node (so NVLink is
                # available within a replica); this constraint spreads
                # replicas across nodes for availability.
                "purpose": "Spread replicas across nodes for availability",
            },
        },
        "probes": {
            "livenessProbe": {
                "httpGet": {"path": "/health", "port": 8000},
                "initialDelaySeconds": 120,  # Model loading time
                "periodSeconds": 10,
                "failureThreshold": 3,
                "timeoutSeconds": 5,
            },
            "readinessProbe": {
                "httpGet": {"path": "/health", "port": 8000},
                "initialDelaySeconds": 120,
                "periodSeconds": 5,
                "failureThreshold": 1,
                "timeoutSeconds": 5,
            },
            "startupProbe": {
                "httpGet": {"path": "/health", "port": 8000},
                "initialDelaySeconds": 30,
                "periodSeconds": 10,
                "failureThreshold": 30,  # 5 min total startup time
                "timeoutSeconds": 5,
            },
        },
        "hpa": {
            "minReplicas": 2,
            "maxReplicas": 10,
            "metrics": [
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "vllm_num_requests_waiting"},
                        "target": {"type": "AverageValue", "averageValue": 50},
                    },
                },
            ],
            "behavior": {
                "scaleUp": {"stabilizationWindowSeconds": 60},
                "scaleDown": {"stabilizationWindowSeconds": 300},
            },
        },
    }
ℹ️ Note

The startupProbe is critical for vLLM. Model loading can take 1-5 minutes depending on model size and storage speed. Without a startupProbe, the livenessProbe may kill the pod before the model finishes loading. Set initialDelaySeconds + failureThreshold * periodSeconds to be longer than the worst-case model loading time.
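The probe arithmetic is worth making explicit: the worst case before the kubelet gives up is the initial delay plus failureThreshold consecutive failed periods. A sketch against the startupProbe values above:

```python
def startup_budget_seconds(probe: dict) -> int:
    # Worst-case time before the kubelet kills the container: initial
    # delay, then failureThreshold consecutive failed probe periods.
    return probe["initialDelaySeconds"] + probe["failureThreshold"] * probe["periodSeconds"]

startup_probe = {"initialDelaySeconds": 30, "periodSeconds": 10, "failureThreshold": 30}
print(startup_budget_seconds(startup_probe))  # 330 s, covering a ~5 min model load
```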

Rollout Strategy

def rollout_strategy() -> dict:
    """
    Safe rollout strategy for vLLM production deployment.
    """
    return {
        "phase_1_canary": {
            "traffic": "5%",
            "duration": "1 hour",
            "criteria": [
                "Error rate below 0.1%",
                "Latency within 10% of previous version",
                "No GPU memory leaks",
            ],
            "rollback_trigger": "Any criteria violated",
        },
        "phase_2_progressive": {
            "traffic": "5% -> 25% -> 50% -> 100%",
            "step_duration": "30 minutes each",
            "criteria": "Same as canary",
            "monitoring": "Per-replica metrics comparison",
        },
        "rollback_procedure": {
            "trigger": "Error rate exceeds 1% or latency exceeds 2x SLO",
            "action": "Route all traffic to previous version",
            "time_to_rollback": "less than 1 minute (traffic routing change)",
            "data_impact": "In-flight requests on new version will fail",
        },
    }

Rollout Timeline

Phase           Traffic %   Duration   Gate Criteria
Canary          5%          1 hour     Error rate and latency
Progressive 1   25%         30 min     Same as canary
Progressive 2   50%         30 min     Same + GPU metrics
Full rollout    100%        -          All criteria green
Rollback        0% new      1 min      Any criteria red
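The canary gate can be expressed as a single predicate over canary vs. baseline metrics. A sketch with thresholds from the rollout plan above (field names are illustrative):

```python
def canary_gate_passes(canary: dict, baseline: dict) -> bool:
    # Error rate must stay below 0.1%.
    if canary["error_rate_pct"] >= 0.1:
        return False
    # Latency must stay within 10% of the previous version.
    if canary["p99_latency_ms"] > 1.10 * baseline["p99_latency_ms"]:
        return False
    # No sustained GPU memory growth (leak proxy).
    if canary["gpu_mem_growth_mb_per_h"] > 0:
        return False
    return True

ok = canary_gate_passes(
    {"error_rate_pct": 0.05, "p99_latency_ms": 1050, "gpu_mem_growth_mb_per_h": 0},
    {"p99_latency_ms": 1000},
)
print(ok)  # True -> advance to the next traffic step
```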

Operational Runbooks

def operational_runbooks() -> dict:
    """
    Runbooks for common production issues.
    """
    return {
        "high_latency": {
            "symptoms": "P99 TTFT or TPOT exceeding SLO",
            "diagnosis": [
                "Check GPU utilization (if low: scheduling issue)",
                "Check KV cache utilization (if high: memory pressure)",
                "Check queue depth (if high: under-provisioned)",
                "Check for long prefill requests (causing head-of-line blocking)",
            ],
            "remediation": [
                "If under-provisioned: scale up replicas",
                "If memory pressure: reduce max_num_seqs or max_model_len",
                "If long prefills: enable chunked_prefill if not already",
                "If GPU under-utilized: increase max_num_seqs",
            ],
        },
        "oom_crashes": {
            "symptoms": "Worker restarts, CUDA OOM in logs",
            "diagnosis": [
                "Check gpu_memory_utilization setting",
                "Check for unusually long sequences",
                "Check if activation memory spikes during prefill",
            ],
            "remediation": [
                "Reduce gpu_memory_utilization to 0.88",
                "Set max_model_len lower",
                "Reduce max_num_seqs",
                "Enable swap-based preemption for long sequences",
            ],
        },
        "model_quality_regression": {
            "symptoms": "User complaints, eval score drop",
            "diagnosis": [
                "Check model version (accidental wrong checkpoint)",
                "Check quantization (verify eval scores match pre-deploy)",
                "Check chat template (formatting errors cause quality drop)",
                "Check for NaN outputs (silent corruption)",
            ],
            "remediation": [
                "Rollback to previous known-good version",
                "Re-run evaluation suite",
                "Compare tokenization output with reference",
            ],
        },
        "gpu_hardware_failure": {
            "symptoms": "Xid errors in dmesg, ECC errors in nvidia-smi",
            "diagnosis": [
                "Check nvidia-smi for ECC errors",
                "Check dmesg for Xid errors (Xid 48 = DBE, fatal)",
                "Check GPU temperature (thermal throttling)",
            ],
            "remediation": [
                "Drain affected node (move traffic to healthy replicas)",
                "Replace GPU (if ECC uncorrectable)",
                "Restart vLLM worker (if correctable ECC)",
            ],
        },
    }
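The GPU-failure runbook's dmesg check can be partially automated. A sketch that scans dmesg output for fatal Xid codes (the log format shown is the typical NVRM pattern; extend the fatal set from NVIDIA's Xid catalog for your fleet):

```python
import re

# Xid codes that warrant draining the node: 48 is a double-bit ECC error
# (fatal, per the runbook above); 79 is "GPU has fallen off the bus".
FATAL_XIDS = {48, 79}

def fatal_xids(dmesg: str) -> list:
    """Return fatal Xid codes found in dmesg output."""
    # Typical NVRM line: "NVRM: Xid (PCI:0000:3b:00): 48, pid=1234, ..."
    codes = [int(c) for c in re.findall(r"Xid \(PCI:[^)]*\): (\d+)", dmesg)]
    return [c for c in codes if c in FATAL_XIDS]

log = ("NVRM: Xid (PCI:0000:3b:00): 48, pid=1234, Ch 00000008\n"
       "NVRM: Xid (PCI:0000:3b:00): 31, pid=5678, MMU Fault")
print(fatal_xids(log))  # [48] -> drain the node, schedule GPU replacement
```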

Production Readiness Scorecard

def production_readiness_scorecard() -> dict:
    """
    Scorecard to assess production readiness.
    Each item is pass/fail. All must pass before production deployment.
    """
    return {
        "infrastructure": [
            {"item": "GPU fleet sized for peak + 30% headroom", "critical": True},
            {"item": "Multi-AZ deployment for availability", "critical": True},
            {"item": "Model weights on fast storage (NVMe or shared memory)", "critical": False},
            {"item": "Network bandwidth sufficient for TP communication", "critical": True},
        ],
        "configuration": [
            {"item": "gpu_memory_utilization set to 0.90 or lower", "critical": True},
            {"item": "max_model_len set explicitly", "critical": True},
            {"item": "max_num_seqs tested under load", "critical": True},
            {"item": "Quantized model evaluated against baseline", "critical": True},
        ],
        "monitoring": [
            {"item": "Prometheus metrics collection configured", "critical": True},
            {"item": "Grafana dashboards for request and GPU metrics", "critical": True},
            {"item": "Alerts for error rate, latency, and GPU health", "critical": True},
            {"item": "Log aggregation (request logs, error logs)", "critical": False},
        ],
        "testing": [
            {"item": "Load test completed at 2x target RPS", "critical": True},
            {"item": "Sustained 1-hour load test passed", "critical": True},
            {"item": "Chaos test (worker kill) passed", "critical": False},
            {"item": "Rollback procedure tested", "critical": True},
        ],
        "operations": [
            {"item": "Runbooks for top 5 failure modes documented", "critical": True},
            {"item": "On-call rotation established", "critical": True},
            {"item": "Canary deployment pipeline configured", "critical": True},
            {"item": "Rollback can execute in under 1 minute", "critical": True},
        ],
    }

Production Readiness Scorecard Summary

Category         Total Items   Critical Items   Pass Requirement
Infrastructure   4             3                All critical must pass
Configuration    4             4                All critical must pass
Monitoring       4             3                All critical must pass
Testing          4             3                All critical must pass
Operations       4             4                All critical must pass
Total            20            17               17/17 critical = go
ℹ️ Note

The scorecard has 20 items, 17 of which are critical. You cannot ship to production until all 17 critical items pass. The 3 non-critical items (fast storage, chaos testing, log aggregation) should be completed within the first week of production operation. Do not skip the load testing phases — most production outages in LLM serving are caused by configurations that work at low load but fail at peak.
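The go/no-go rule ("17/17 critical = go") is a one-liner over the scorecard items. A sketch, where the field names mirror the scorecard above and "passed" is the assessment result you would fill in:

```python
def go_no_go(items: list) -> str:
    """'go' only when every critical scorecard item passes."""
    failed = [i["item"] for i in items if i["critical"] and not i["passed"]]
    return "go" if not failed else "no-go: " + "; ".join(failed)

scorecard_items = [
    {"item": "GPU fleet sized for peak + 30% headroom", "critical": True, "passed": True},
    {"item": "Chaos test (worker kill) passed", "critical": False, "passed": False},
    {"item": "Rollback can execute in under 1 minute", "critical": True, "passed": True},
]
print(go_no_go(scorecard_items))  # "go" -- the only failed item is non-critical
```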

Production deployment of vLLM v1 is not just about getting the model running — it is about keeping it running reliably at scale. The checklist in this post covers the full lifecycle: size hardware for peak load, harden configuration with safety margins, set up monitoring before the first request, load test until something breaks, deploy with canary rollout, and prepare runbooks for when things go wrong in production. Each item is a lesson learned from real-world LLM serving failures. Complete the checklist methodically, and your deployment will handle the inevitable production challenges gracefully.