The first production outage always happens at 2 AM on a weekend: a configuration issue you did not catch in testing, a memory leak that only appears under sustained load, or a monitoring gap that left you blind to the actual problem. Moving vLLM from a development notebook to reliable 24/7 production serving is not about clicking “deploy” — it is about systematic preparation across six dimensions: hardware sizing for peak load (not average), configuration hardening to prevent known failure modes, monitoring that catches issues before users do, load testing that surfaces bottlenecks, a deployment strategy that limits blast radius, and operational runbooks so your on-call engineer knows exactly what to do at 2 AM.
Hardware Sizing
The first decision: how many GPUs of what type.
```python
import math


def hardware_sizing(
    model_id: str,
    target_throughput_rps: float,
    target_ttft_p99_ms: float,
    target_tpot_p99_ms: float,
    avg_input_tokens: int,
    avg_output_tokens: int,
    peak_multiplier: float = 2.0,
) -> dict:
    """
    Size GPU fleet for production deployment.
    Key principle: size for PEAK load, not average.
    """
    # Model memory requirements and measured per-replica throughput
    model_configs = {
        "llama-3.1-70b-instruct": {
            "bf16_memory_gb": 140,
            "fp8_memory_gb": 70,
            "awq4_memory_gb": 38,
            "per_replica_throughput_rps": {
                "bf16_tp8": 5.0,
                "fp8_tp4": 8.0,
                "awq4_tp2": 12.0,
            },
        },
        "llama-3.1-8b-instruct": {
            "bf16_memory_gb": 16,
            "fp8_memory_gb": 8,
            "awq4_memory_gb": 5,
            "per_replica_throughput_rps": {
                "bf16_tp1": 15.0,
                "fp8_tp1": 25.0,
                "awq4_tp1": 30.0,
            },
        },
    }
    config = model_configs.get(model_id, {})

    # Calculate replicas needed for peak throughput
    peak_rps = target_throughput_rps * peak_multiplier

    # Choose quantization based on quality requirements.
    # Recommendation: FP8 for quality-sensitive, AWQ4 for cost-sensitive.
    quantization_options = []
    for quant, rps_per_replica in config.get("per_replica_throughput_rps", {}).items():
        gpus_per_replica = int(quant.split("_tp")[1]) if "_tp" in quant else 1
        # Round UP: partial replicas don't exist
        replicas_needed = max(1, math.ceil(peak_rps / rps_per_replica))
        total_gpus = replicas_needed * gpus_per_replica
        quantization_options.append({
            "quantization": quant,
            "gpus_per_replica": gpus_per_replica,
            "replicas": replicas_needed,
            "total_gpus": total_gpus,
            "estimated_cost_per_hour": total_gpus * 3.00,  # ~$3/hr per H100
        })

    return {
        "model": model_id,
        "target_peak_rps": peak_rps,
        "options": sorted(quantization_options, key=lambda x: x["total_gpus"]),
    }
```
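As a quick sanity check of the arithmetic behind the sizing table: replicas are peak RPS divided by per-replica throughput, rounded up. The throughput figures below are the illustrative numbers from the sketch above, not benchmarks.

```python
import math

# Illustrative per-replica throughputs for Llama 3.1 70B at 100 RPS peak:
# name -> (RPS per replica, GPUs per replica)
peak_rps = 100
options = {"bf16_tp8": (5.0, 8), "fp8_tp4": (8.0, 4), "awq4_tp2": (12.0, 2)}

fleet = {}
for name, (rps_per_replica, gpus_per_replica) in options.items():
    replicas = math.ceil(peak_rps / rps_per_replica)  # round UP: partial replicas don't exist
    fleet[name] = (replicas, replicas * gpus_per_replica)

# fleet == {"bf16_tp8": (20, 160), "fp8_tp4": (13, 52), "awq4_tp2": (9, 18)}
```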
Hardware Sizing: Llama 3.1 70B for 100 RPS Peak
| Configuration | GPUs/Replica | Replicas | Total GPUs | Hourly Cost |
|---|---|---|---|---|
| BF16, TP=8 | 8 | 20 | 160 | $480 |
| FP8, TP=4 | 4 | 13 | 52 | $156 |
| AWQ4, TP=2 | 2 | 9 | 18 | $54 |
| FP8, TP=8 | 8 | 7 | 56 | $168 |
Always size for peak load, not average. If your average is 50 RPS but peak is 100 RPS, you need capacity for 100 RPS. Autoscaling can help but has a 30-60 second response time, during which requests will queue. For latency-sensitive deployments, provision for peak. For cost-sensitive deployments, provision for average + 30% and accept higher tail latency during peaks.
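The autoscaling gap is easy to quantify: while new replicas spin up, excess requests queue at roughly (peak RPS minus provisioned RPS) times the scale-up lag. A rough sketch using the 50-average/100-peak example (the 45-second lag is an assumed mid-range value):

```python
def queued_during_scaleup(peak_rps: float, provisioned_rps: float, lag_seconds: float) -> float:
    """Requests that queue while autoscaling catches up (assumes no timeouts or drops)."""
    return max(0.0, peak_rps - provisioned_rps) * lag_seconds

# Provisioned for average + 30% (65 RPS), peak hits 100 RPS, 45 s autoscaler lag:
backlog = queued_during_scaleup(100, 65, 45)  # 1575 queued requests
```

A backlog that size is why latency-sensitive deployments should provision for peak rather than rely on reactive scaling.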
Pre-Deployment Configuration Checklist
```python
def pre_deployment_checklist() -> dict:
    """
    Configuration checklist before going to production.
    """
    return {
        "model_validation": [
            {
                "check": "Run eval suite on quantized model",
                "why": "Verify quality after quantization",
                "command": "python eval.py --model /path/to/model --tasks mmlu,humaneval",
                "pass_criteria": "Within 2% of BF16 baseline",
            },
            {
                "check": "Test maximum context length",
                "why": "Verify KV cache handles max_model_len",
                "command": "Send request with max_model_len input tokens",
                "pass_criteria": "No OOM, correct output",
            },
            {
                "check": "Test chat template",
                "why": "Ensure correct prompt formatting",
                "command": "Compare output with reference implementation",
                "pass_criteria": "Identical tokenization",
            },
        ],
        "configuration_hardening": [
            {
                "param": "gpu_memory_utilization",
                "production_value": 0.90,
                "why": "Leave headroom for activation spikes",
            },
            {
                "param": "max_num_seqs",
                "production_value": "128-256",
                "why": "Prevents OOM from too many concurrent sequences",
            },
            {
                "param": "max_model_len",
                "production_value": "Set explicitly (don't use model default)",
                "why": "Controls worst-case memory per sequence",
            },
            {
                "param": "disable_log_requests",
                "production_value": True,
                "why": "Avoid logging every request (performance + privacy)",
            },
            {
                "param": "enable_prefix_caching",
                "production_value": True,
                "why": "Reduce TTFT for repeated system prompts",
            },
        ],
    }
```
Pre-Deployment Configuration
| Parameter | Dev Value | Production Value | Why Change |
|---|---|---|---|
| gpu_memory_utilization | 0.95 | 0.90 | Safety headroom |
| max_num_seqs | 1024 | 256 | Prevent OOM |
| max_model_len | 131072 | Set per use case | Memory planning |
| disable_log_requests | False | True | Performance + privacy |
| enable_prefix_caching | False | True | TTFT reduction |
| enable_chunked_prefill | True | True | Stable TPOT |
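As a sketch, the production column translates into launch flags for vLLM's OpenAI-compatible server. Flag names follow recent vLLM releases; verify against `vllm serve --help` for your version. The model name and context length here are placeholders.

```python
def production_serve_command(model: str, max_model_len: int) -> list[str]:
    """Render the hardened production values as a vLLM serve command (sketch)."""
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", "0.90",   # safety headroom
        "--max-num-seqs", "256",              # bound concurrency
        "--max-model-len", str(max_model_len),  # explicit memory planning
        "--disable-log-requests",             # performance + privacy
        "--enable-prefix-caching",            # TTFT reduction
        "--enable-chunked-prefill",           # stable TPOT
    ]

cmd = production_serve_command("meta-llama/Llama-3.1-70B-Instruct", 8192)
```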
Monitoring Setup
Production monitoring requires four categories of metrics.
```python
def monitoring_setup() -> dict:
    """
    Monitoring configuration for production vLLM.
    """
    return {
        "request_metrics": {
            "vllm_request_success_total": {
                "type": "counter",
                "labels": ["model", "status"],
                "alert": "Rate drops below expected baseline",
            },
            "vllm_request_duration_seconds": {
                "type": "histogram",
                "labels": ["model"],
                "buckets": [0.1, 0.5, 1, 2, 5, 10, 30, 60],
                "alert": "P99 exceeds SLO",
            },
            "vllm_time_to_first_token_seconds": {
                "type": "histogram",
                "labels": ["model"],
                "buckets": [0.05, 0.1, 0.2, 0.5, 1, 2, 5],
                "alert": "P99 TTFT exceeds SLO",
            },
            "vllm_time_per_output_token_seconds": {
                "type": "histogram",
                "labels": ["model"],
                "buckets": [0.01, 0.02, 0.05, 0.1, 0.2, 0.5],
                "alert": "P99 TPOT exceeds SLO",
            },
        },
        "system_metrics": {
            "vllm_gpu_cache_usage_percent": {
                "type": "gauge",
                "alert_threshold": 95,
                "alert": "KV cache utilization above 95%",
            },
            "vllm_num_requests_running": {
                "type": "gauge",
                "alert": "Exceeds max_num_seqs",
            },
            "vllm_num_requests_waiting": {
                "type": "gauge",
                "alert_threshold": 100,
                "alert": "Queue depth exceeds 100",
            },
            "vllm_num_preemptions_total": {
                "type": "counter",
                "alert": "Rate exceeds 10/minute",
            },
        },
        "gpu_metrics": {
            "gpu_utilization_percent": {
                "source": "DCGM or nvidia-smi",
                "alert_low": 20,   # Under-utilization
                "alert_high": 99,  # Saturated
            },
            "gpu_memory_used_bytes": {
                "source": "DCGM",
                "alert_threshold": "95% of total",
            },
            "gpu_temperature_celsius": {
                "source": "DCGM",
                "alert_threshold": 85,
            },
        },
        "application_metrics": {
            "error_rate_percent": {
                "formula": "errors / total_requests * 100",
                "alert_threshold": 1.0,
            },
            "availability_percent": {
                "formula": "successful_health_checks / total_checks * 100",
                "target": 99.9,
            },
        },
    }
```
Alert Thresholds for Production Monitoring
| Metric | Warning | Critical | Action |
|---|---|---|---|
| KV Cache Usage | 85% | 95% | Scale up or reduce max_num_seqs |
| Queue Depth | 50 | 200 | Scale up replicas |
| P99 TTFT | 1.5x SLO | 2x SLO | Investigate prefill bottleneck |
| P99 TPOT | 1.5x SLO | 2x SLO | Reduce batch size |
| Error Rate | 0.5% | 2% | Check GPU health, logs |
| Preemption Rate | 5/min | 20/min | Increase KV cache budget |
| GPU Temperature | 80C | 85C | Check cooling, throttling |
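The warning/critical pairs in the table translate directly into an evaluation rule: fire critical at or above the critical threshold, warning at or above the warning threshold, otherwise ok. A minimal sketch with hypothetical metric names:

```python
def alert_level(metric: str, value: float) -> str:
    """Map a metric reading to ok/warning/critical per the thresholds table above."""
    thresholds = {  # metric -> (warning, critical)
        "kv_cache_usage_pct": (85, 95),
        "queue_depth": (50, 200),
        "error_rate_pct": (0.5, 2.0),
        "preemptions_per_min": (5, 20),
        "gpu_temp_c": (80, 85),
    }
    warn, crit = thresholds[metric]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"
```

In practice these rules would live in Prometheus alerting configuration rather than application code; the function just makes the table's semantics explicit.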
Load Testing
Before production deployment, systematic load testing validates capacity and identifies breaking points.
```python
def load_testing_plan() -> dict:
    """
    Structured load testing plan for vLLM deployment.
    """
    return {
        "phase_1_baseline": {
            "description": "Single-request latency baseline",
            "concurrency": 1,
            "duration_min": 5,
            "measure": ["TTFT", "TPOT", "total_latency"],
            "purpose": "Establish minimum latency achievable",
            "tool": "curl or custom script",
        },
        "phase_2_ramp": {
            "description": "Gradual ramp from 1 to target RPS",
            "concurrency": "1 -> target_rps over 10 minutes",
            "duration_min": 15,
            "measure": ["throughput", "latency percentiles", "GPU utilization"],
            "purpose": "Find throughput vs latency curve",
            "tool": "locust, k6, or vllm benchmark script",
        },
        "phase_3_sustained": {
            "description": "Sustained load at target RPS",
            "concurrency": "target_rps",
            "duration_min": 60,
            "measure": ["stability", "memory growth", "error rate"],
            "purpose": "Verify system handles sustained production load",
            "pass_criteria": [
                "Error rate below 0.1%",
                "No memory growth (no leak)",
                "P99 latency stable (no degradation over time)",
            ],
        },
        "phase_4_overload": {
            "description": "2x target RPS (overload test)",
            "concurrency": "2x target_rps",
            "duration_min": 15,
            "measure": ["graceful degradation", "error handling", "recovery"],
            "purpose": "Verify system degrades gracefully under overload",
            "pass_criteria": [
                "No crashes",
                "Errors are proper 429/503 (not 500)",
                "Recovers within 30s after load reduction",
            ],
        },
        "phase_5_chaos": {
            "description": "Kill a GPU worker during load",
            "concurrency": "target_rps",
            "duration_min": 10,
            "measure": ["recovery time", "requests affected", "data loss"],
            "purpose": "Verify worker recovery under load",
            "pass_criteria": [
                "Worker recovers within 60s",
                "Failed requests are retried",
                "Healthy replicas absorb load during recovery",
            ],
        },
    }
```
Load Test Phases and Duration
| Phase | Description | Concurrency | Duration |
|---|---|---|---|
| 1. Baseline | Single-request latency | 1 | 5 min |
| 2. Ramp | 1 to target RPS | Ramping | 15 min |
| 3. Sustained | Steady target load | Target RPS | 60 min |
| 4. Overload | Graceful degradation | 2x target RPS | 15 min |
| 5. Chaos | Worker kill under load | Target RPS | 10 min |
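When configuring the load generator for the ramp and sustained phases, Little's Law gives the worker count: concurrent in-flight requests are roughly arrival rate times mean request latency. A sketch with assumed example numbers:

```python
import math

def load_test_concurrency(target_rps: float, mean_latency_s: float) -> int:
    """Little's Law: in-flight requests ~= arrival rate x mean time in system."""
    return math.ceil(target_rps * mean_latency_s)

# Sustaining 100 RPS when a request takes ~4 s end-to-end needs ~400 concurrent workers
workers = load_test_concurrency(100, 4.0)
```

If your tool caps out below this number, it becomes the bottleneck and the test understates server capacity.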
Kubernetes Deployment Configuration
```python
def kubernetes_config() -> dict:
    """
    Kubernetes deployment configuration for vLLM.
    """
    return {
        "pod_spec": {
            "resources": {
                "requests": {
                    "nvidia.com/gpu": 4,  # TP=4; one 4-GPU pod keeps all GPUs on a single node (NVLink)
                    "memory": "64Gi",
                    "cpu": "16",
                },
                "limits": {
                    "nvidia.com/gpu": 4,
                    "memory": "128Gi",
                    "cpu": "32",
                },
            },
            "topology_spread": {
                "topologyKey": "kubernetes.io/hostname",
                "whenUnsatisfiable": "DoNotSchedule",
                "purpose": "Spread replicas across nodes so one node failure cannot take down all replicas",
            },
        },
        "probes": {
            "livenessProbe": {
                "httpGet": {"path": "/health", "port": 8000},
                "initialDelaySeconds": 120,  # Model loading time
                "periodSeconds": 10,
                "failureThreshold": 3,
                "timeoutSeconds": 5,
            },
            "readinessProbe": {
                "httpGet": {"path": "/health", "port": 8000},
                "initialDelaySeconds": 120,
                "periodSeconds": 5,
                "failureThreshold": 1,
                "timeoutSeconds": 5,
            },
            "startupProbe": {
                "httpGet": {"path": "/health", "port": 8000},
                "initialDelaySeconds": 30,
                "periodSeconds": 10,
                "failureThreshold": 30,  # ~5 min total startup budget
                "timeoutSeconds": 5,
            },
        },
        "hpa": {
            "minReplicas": 2,
            "maxReplicas": 10,
            "metrics": [
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "vllm_num_requests_waiting"},
                        "target": {"type": "AverageValue", "averageValue": 50},
                    },
                },
            ],
            "behavior": {
                "scaleUp": {"stabilizationWindowSeconds": 60},
                "scaleDown": {"stabilizationWindowSeconds": 300},
            },
        },
    }
```
The startupProbe is critical for vLLM. Model loading can take 1-5 minutes depending on model size and storage speed. Without a startupProbe, the livenessProbe may kill the pod before the model finishes loading. Set failureThreshold * periodSeconds to be longer than the worst-case model loading time.
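One way to derive `failureThreshold` from a measured worst-case load time, with a safety margin (the 1.5x margin and example numbers are assumptions, not vLLM defaults):

```python
import math

def startup_failure_threshold(worst_case_load_s: int, period_s: int = 10,
                              margin: float = 1.5) -> int:
    """Pick failureThreshold so failureThreshold * periodSeconds covers
    worst-case model loading time, with headroom for slow storage days."""
    return math.ceil(worst_case_load_s * margin / period_s)

# 4-minute worst-case load, 10 s probe period, 1.5x margin -> threshold 36 (6 min budget)
threshold = startup_failure_threshold(240)
```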
Rollout Strategy
```python
def rollout_strategy() -> dict:
    """
    Safe rollout strategy for vLLM production deployment.
    """
    return {
        "phase_1_canary": {
            "traffic": "5%",
            "duration": "1 hour",
            "criteria": [
                "Error rate below 0.1%",
                "Latency within 10% of previous version",
                "No GPU memory leaks",
            ],
            "rollback_trigger": "Any criteria violated",
        },
        "phase_2_progressive": {
            "traffic": "5% -> 25% -> 50% -> 100%",
            "step_duration": "30 minutes each",
            "criteria": "Same as canary",
            "monitoring": "Per-replica metrics comparison",
        },
        "rollback_procedure": {
            "trigger": "Error rate exceeds 1% or latency exceeds 2x SLO",
            "action": "Route all traffic to previous version",
            "time_to_rollback": "less than 1 minute (traffic routing change)",
            "data_impact": "In-flight requests on new version will fail",
        },
    }
```
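The canary gate criteria can be codified so promotion is a mechanical decision rather than a judgment call. A minimal sketch of the two quantitative gates (error rate below 0.1%, P99 latency within 10% of the previous version):

```python
def canary_gate(canary_error_pct: float, canary_p99_s: float,
                baseline_p99_s: float) -> bool:
    """True if the canary may advance to the next traffic step."""
    return canary_error_pct < 0.1 and canary_p99_s <= baseline_p99_s * 1.10

# Promote: low errors, latency within 10% of baseline
ok = canary_gate(canary_error_pct=0.05, canary_p99_s=1.05, baseline_p99_s=1.0)
```

Memory-leak detection does not fit a single snapshot check; it needs a trend over the canary window, which is why the canary phase runs a full hour.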
Rollout Timeline
| Phase | Traffic % | Duration | Gate Criteria |
|---|---|---|---|
| Canary | 5% | 1 hour | Error rate and latency |
| Progressive 1 | 25% | 30 min | Same as canary |
| Progressive 2 | 50% | 30 min | Same + GPU metrics |
| Full rollout | 100% | - | All criteria green |
| Rollback | 0% new | 1 min | Any criteria red |
Operational Runbooks
```python
def operational_runbooks() -> dict:
    """
    Runbooks for common production issues.
    """
    return {
        "high_latency": {
            "symptoms": "P99 TTFT or TPOT exceeding SLO",
            "diagnosis": [
                "Check GPU utilization (if low: scheduling issue)",
                "Check KV cache utilization (if high: memory pressure)",
                "Check queue depth (if high: under-provisioned)",
                "Check for long prefill requests (causing head-of-line blocking)",
            ],
            "remediation": [
                "If under-provisioned: scale up replicas",
                "If memory pressure: reduce max_num_seqs or max_model_len",
                "If long prefills: enable chunked_prefill if not already",
                "If GPU under-utilized: increase max_num_seqs",
            ],
        },
        "oom_crashes": {
            "symptoms": "Worker restarts, CUDA OOM in logs",
            "diagnosis": [
                "Check gpu_memory_utilization setting",
                "Check for unusually long sequences",
                "Check if activation memory spikes during prefill",
            ],
            "remediation": [
                "Reduce gpu_memory_utilization to 0.88",
                "Set max_model_len lower",
                "Reduce max_num_seqs",
                "Enable swap-based preemption for long sequences",
            ],
        },
        "model_quality_regression": {
            "symptoms": "User complaints, eval score drop",
            "diagnosis": [
                "Check model version (accidental wrong checkpoint)",
                "Check quantization (verify eval scores match pre-deploy)",
                "Check chat template (formatting errors cause quality drop)",
                "Check for NaN outputs (silent corruption)",
            ],
            "remediation": [
                "Rollback to previous known-good version",
                "Re-run evaluation suite",
                "Compare tokenization output with reference",
            ],
        },
        "gpu_hardware_failure": {
            "symptoms": "Xid errors in dmesg, ECC errors in nvidia-smi",
            "diagnosis": [
                "Check nvidia-smi for ECC errors",
                "Check dmesg for Xid errors (Xid 48 = DBE, fatal)",
                "Check GPU temperature (thermal throttling)",
            ],
            "remediation": [
                "Drain affected node (move traffic to healthy replicas)",
                "Replace GPU (if ECC uncorrectable)",
                "Restart vLLM worker (if correctable ECC)",
            ],
        },
    }
```
Production Readiness Scorecard
```python
def production_readiness_scorecard() -> dict:
    """
    Scorecard to assess production readiness.
    Each item is pass/fail. All must pass before production deployment.
    """
    return {
        "infrastructure": [
            {"item": "GPU fleet sized for peak + 30% headroom", "critical": True},
            {"item": "Multi-AZ deployment for availability", "critical": True},
            {"item": "Model weights on fast storage (NVMe or shared memory)", "critical": False},
            {"item": "Network bandwidth sufficient for TP communication", "critical": True},
        ],
        "configuration": [
            {"item": "gpu_memory_utilization set to 0.90 or lower", "critical": True},
            {"item": "max_model_len set explicitly", "critical": True},
            {"item": "max_num_seqs tested under load", "critical": True},
            {"item": "Quantized model evaluated against baseline", "critical": True},
        ],
        "monitoring": [
            {"item": "Prometheus metrics collection configured", "critical": True},
            {"item": "Grafana dashboards for request and GPU metrics", "critical": True},
            {"item": "Alerts for error rate, latency, and GPU health", "critical": True},
            {"item": "Log aggregation (request logs, error logs)", "critical": False},
        ],
        "testing": [
            {"item": "Load test completed at 2x target RPS", "critical": True},
            {"item": "Sustained 1-hour load test passed", "critical": True},
            {"item": "Chaos test (worker kill) passed", "critical": False},
            {"item": "Rollback procedure tested", "critical": True},
        ],
        "operations": [
            {"item": "Runbooks for top 5 failure modes documented", "critical": True},
            {"item": "On-call rotation established", "critical": True},
            {"item": "Canary deployment pipeline configured", "critical": True},
            {"item": "Rollback can execute in under 1 minute", "critical": True},
        ],
    }
```
Production Readiness Scorecard Summary
| Category | Total Items | Critical Items | Pass Requirement |
|---|---|---|---|
| Infrastructure | 4 | 3 | All critical must pass |
| Configuration | 4 | 4 | All critical must pass |
| Monitoring | 4 | 3 | All critical must pass |
| Testing | 4 | 3 | All critical must pass |
| Operations | 4 | 4 | All critical must pass |
| Total | 20 | 17 | 17/17 critical = go |
The scorecard has 20 items, 17 of which are critical. You cannot ship to production until all 17 critical items pass. The 3 non-critical items (fast storage, chaos testing, log aggregation) should be completed within the first week of production operation. Do not skip the load testing phases: in LLM serving, the most common production outages come from configurations that work at low load but fail at peak.
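The go/no-go rule is mechanical: every critical item in every category must pass. A sketch, assuming each scorecard item is extended with a hypothetical `passed` field recorded during review:

```python
def go_no_go(scorecard: dict) -> bool:
    """Ship only if every critical item across all categories passes."""
    return all(
        item["passed"]
        for items in scorecard.values()
        for item in items
        if item["critical"]
    )

# Abbreviated example: non-critical failures don't block the launch
card = {
    "testing": [
        {"item": "Load test at 2x target RPS", "critical": True, "passed": True},
        {"item": "Chaos test (worker kill)", "critical": False, "passed": False},
    ],
}
```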
Production deployment of vLLM v1 is not just about getting the model running — it is about keeping it running reliably at scale. The checklist in this post covers the full lifecycle: size hardware for peak load, harden configuration with safety margins, set up monitoring before the first request, load test until something breaks, deploy with canary rollout, and prepare runbooks for when things go wrong in production. Each item is a lesson learned from real-world LLM serving failures. Complete the checklist methodically, and your deployment will handle the inevitable production challenges gracefully.