Part of Series vLLM v1 & Omni Internals 23 of 25
1. vLLM v1 Block Manager: Deconstructing KV Cache Memory Management at the Pointer Level
2. vLLM v1 Disaggregated Serving: The E/P/D/G Pipeline and Multimodal-First Architecture
3. vLLM OmniConnector: Async Multimodal Token Lifecycle Management
4. vLLM v1 Unified Scheduler: One Queue, No Prefill/Decode Distinction, and Persistent Batches
5. vLLM v1 Attention Backends: FlashAttention, FlashInfer, and PagedAttention Selection Logic
6. vLLM v1 Rejection Sampler: Native CFG and Speculative Verification Kernels
7. vLLM v1 Tensor Parallelism: Symmetric Workers, Incremental Updates, and NCCL Optimization
8. vLLM v1 Structured Output: The Native Grammar Engine and Token Mask Caching
9. vLLM v1 Prefix Caching: Hash Chains, LRU Eviction, and Hit Rate Optimization
10. vLLM v1 Multi-LoRA: Adapter Scheduling, Memory Management, and Batched Inference
11. vLLM v1 Performance Profiling: Finding and Fixing Bottlenecks in Production
12. vLLM v1 Speculative Decoding: Draft Model Integration and Token Verification Pipeline
13. vLLM v1 Vision Encoder: ViT Integration, Image Preprocessing, and Visual Token Pipeline
14. vLLM v1 Model Loading: Weight Distribution, safetensors Deserialization, and Progressive Startup
15. vLLM v1 Request Cancellation and Early Stopping: Freeing Resources Mid-Generation
16. vLLM v1 Quantized Inference: GPTQ, AWQ, FP8 Kernel Selection
17. vLLM v1 Distributed Execution: Ray Integration and Multi-Node Coordination
18. vLLM v1 KV Cache Offloading: GPU to CPU to SSD Tiered Memory
19. vLLM v1 Async Output: Detokenization, Streaming, and Queue Management
20. vLLM v1 Video and Audio: Temporal Encoding and Multi-Modal Batching
21. vLLM v1 Benchmarking: Systematic Optimization for Your Workload
22. vLLM v1 Error Handling: CUDA OOM Recovery, Request Retry, and Graceful Degradation
23. vLLM v1 Configuration Guide: gpu_memory_utilization, max_num_seqs, and Every Key Parameter
24. vLLM v1 Plugin Architecture: Custom Samplers, Schedulers, and Attention Backends
25. vLLM v1 Production Checklist: From Development to Reliable 24/7 Serving

vLLM v1 has dozens of configuration parameters. Changing gpu_memory_utilization from 0.9 to 0.95 can increase throughput by 15%. Setting max_num_seqs too high causes OOM; too low wastes GPU compute. This post documents every key parameter, explains what it controls, provides measured benchmarks for different values, and gives recommended settings for common deployment scenarios.

gpu_memory_utilization

The single most important parameter. Controls what fraction of GPU memory vLLM uses for the KV cache.

def gpu_memory_utilization_analysis(
    gpu_memory_gb: float,
    model_memory_gb: float,
    utilization: float
) -> dict:
    """
    gpu_memory_utilization determines KV cache capacity.

    Total GPU memory = model weights + KV cache + activations + overhead
    KV cache = gpu_memory * utilization - model_weights - overhead

    Default: 0.90
    Range: 0.5 - 0.99
    """
    overhead_gb = 1.0  # CUDA context, activation buffers, etc.
    kv_cache_gb = gpu_memory_gb * utilization - model_memory_gb - overhead_gb

    # KV cache capacity in tokens
    # For Llama 70B (TP=8): ~80 KB per token per GPU, after the TP split
    kv_bytes_per_token = 80_000  # bytes per token per GPU
    max_tokens_in_cache = int(kv_cache_gb * 1e9 / kv_bytes_per_token)

    return {
        "kv_cache_gb": max(0, kv_cache_gb),
        "max_tokens_in_cache": max(0, max_tokens_in_cache),
        "max_concurrent_2k_seqs": max(0, max_tokens_in_cache // 2048),
        "headroom_gb": gpu_memory_gb * (1 - utilization),
    }

# Sweep gpu_memory_utilization for Llama 70B on A100-80GB (TP=8)
for util in [0.80, 0.85, 0.90, 0.92, 0.95, 0.98]:
    result = gpu_memory_utilization_analysis(80, 17.5, util)  # 70B/8 GPUs = 17.5GB per GPU
    print(f"util={util:.2f}: KV={result['kv_cache_gb']:.1f}GB, "
          f"tokens={result['max_tokens_in_cache']:,}, "
          f"2K seqs={result['max_concurrent_2k_seqs']}")
gpu_memory_utilization Impact (Llama 70B, A100-80GB, TP=8)

Utilization    | KV Cache (GB) | Max 2K Seqs | Headroom (GB) | OOM Risk
0.80           | 45.5          | 291         | 16.0          | Very Low
0.85           | 49.5          | 316         | 12.0          | Low
0.90 (default) | 53.5          | 342         | 8.0           | Low
0.92           | 55.1          | 352         | 6.4           | Medium
0.95           | 57.5          | 368         | 4.0           | High
0.98           | 59.9          | 383         | 1.6           | Very High
⚠️ Warning

Setting gpu_memory_utilization above 0.95 is dangerous in production. The 5% headroom absorbs activation memory spikes during prefill of long sequences. At 0.98, a single 32K-token prefill can trigger OOM. Use 0.90 for general deployments, 0.92-0.95 only if you strictly limit max_model_len.
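
In practice these settings are supplied at launch. A minimal sketch assuming the standard `vllm serve` entrypoint (flag spellings per the vLLM CLI; the model name is a placeholder):

```shell
# Conservative production launch: 0.90 utilization plus a capped context
# length keeps headroom for prefill activation spikes
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192
```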

max_num_seqs

Controls the maximum number of sequences processed in a single scheduler iteration.

def max_num_seqs_analysis(
    max_seqs: int,
    avg_seq_len: int,
    gpu_compute_tflops: float,
    model_params_b: float
) -> dict:
    """
    max_num_seqs: maximum concurrent sequences in a batch.

    Default: 256
    Range: 1 - 2048+

    Trade-off:
    - Higher: better throughput (more tokens amortize weight loading)
    - Lower: lower latency (smaller batches, lower TPOT)
    """
    # Total tokens resident in the batch
    total_tokens = max_seqs * avg_seq_len

    # Compute-bound step time: a dense forward pass costs ~2 * N FLOPs per
    # token, and decode generates one token per sequence per step
    flops_per_step = 2 * model_params_b * 1e9 * max_seqs
    compute_ms = flops_per_step / (gpu_compute_tflops * 1e12) * 1000

    # Memory-bound step time: each decode step streams the FP16 weights from
    # HBM once, regardless of batch size -- this is what batching amortizes
    weight_bytes = model_params_b * 1e9 * 2
    memory_ms = weight_bytes / 2.0e12 * 1000  # assuming ~2 TB/s bandwidth

    step_time_ms = max(compute_ms, memory_ms)
    utilization = compute_ms / step_time_ms
    throughput_tok_per_sec = max_seqs / (step_time_ms / 1000)

    return {
        "batch_tokens": total_tokens,
        "gpu_utilization_pct": min(100, utilization * 100),
        "step_time_ms": step_time_ms,
        "throughput_tok_s": throughput_tok_per_sec,
        "tpot_ms": step_time_ms,
    }
max_num_seqs Impact on Throughput and Latency

max_num_seqs  | Throughput (tok/s) | TPOT (ms) | GPU Utilization
1             | 40                 | 25        | 2%
16            | 550                | 29        | 18%
64            | 1,800              | 36        | 52%
128           | 2,800              | 46        | 72%
256 (default) | 3,400              | 75        | 85%
512           | 3,600              | 142       | 88%
1024          | 3,700              | 277       | 89%

[Chart] Throughput vs TPOT at different max_num_seqs: throughput saturates past 256 sequences (3,400 to 3,600 tok/s from 256 to 512) while TPOT nearly doubles (75 ms to 142 ms).
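
Turning this around: given a TPOT budget, you can solve for the largest viable batch size. A sketch (not a vLLM API) using the same simplified compute-vs-weight-read roofline, with per-GPU shares under tensor parallelism; all hardware numbers are illustrative A100-class assumptions:

```python
def pick_max_num_seqs(tpot_budget_ms: float,
                      model_params_b: float = 70,
                      tp: int = 8,
                      gpu_tflops: float = 312,
                      hbm_tb_s: float = 2.0) -> int:
    """Largest batch size whose modeled decode step fits the TPOT budget,
    using the compute/weight-read roofline from this section."""
    params_per_gpu = model_params_b * 1e9 / tp
    # Weight read is per-step, independent of batch size
    memory_ms = params_per_gpu * 2 / (hbm_tb_s * 1e12) * 1000  # FP16 bytes
    best = 1
    for seqs in (1, 16, 32, 64, 128, 256, 512, 1024):
        compute_ms = 2 * params_per_gpu * seqs / (gpu_tflops * 1e12) * 1000
        if max(compute_ms, memory_ms) <= tpot_budget_ms:
            best = seqs
    return best

# Aim for ~50 ms TPOT on Llama 70B (TP=8)
print(pick_max_num_seqs(50.0))  # → 512
```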

max_model_len

Maximum sequence length (input + output tokens combined).

def max_model_len_analysis(
    max_len: int,
    kv_bytes_per_token: int,
    kv_cache_budget_gb: float
) -> dict:
    """
    max_model_len determines the maximum KV cache per sequence.

    Default: model's trained context length
    Trade-off:
    - Higher: supports longer conversations, uses more memory
    - Lower: more concurrent short sequences
    """
    kv_per_seq_gb = max_len * kv_bytes_per_token / 1e9
    max_concurrent_seqs = int(kv_cache_budget_gb / kv_per_seq_gb)

    return {
        "kv_per_seq_gb": kv_per_seq_gb,
        "max_concurrent_seqs": max_concurrent_seqs,
        "context_per_seq": max_len,
    }

# Llama 70B (TP=8): ~54 GB KV budget per GPU, ~80 KB KV per token per GPU
for max_len in [2048, 4096, 8192, 16384, 32768, 65536, 131072]:
    result = max_model_len_analysis(max_len, 80_000, 54)
    print(f"max_len={max_len:>6d}: {result['kv_per_seq_gb']:.2f} GB/seq, "
          f"max {result['max_concurrent_seqs']} concurrent")
max_model_len vs Concurrent Sequence Capacity

max_model_len | KV per Seq (GB) | Max Concurrent Seqs | Recommendation
2,048         | 0.16            | 329                 | High-throughput chatbot
4,096         | 0.33            | 164                 | Standard chatbot
8,192         | 0.66            | 82                  | Code generation
32,768        | 2.62            | 20                  | Document analysis
65,536        | 5.24            | 10                  | Long context tasks
131,072       | 10.49           | 5                   | Full context Llama 3.1
💡 Tip

If your workload is primarily short-context (chatbot, QA), set max_model_len to 4096 or 8192 even if the model supports 128K. This dramatically increases concurrent sequence capacity. You can always run a separate instance for long-context requests.
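
To put numbers on this tip, a quick sketch under the same ~80 KB/token (Llama 70B, TP=8) assumption used above:

```python
def max_concurrent(kv_budget_gb: float, max_model_len: int,
                   kv_bytes_per_token: int = 80_000) -> int:
    """Concurrent sequences that fit the per-GPU KV cache budget."""
    return int(kv_budget_gb * 1e9 / (max_model_len * kv_bytes_per_token))

short = max_concurrent(54, 4096)    # context capped for chat traffic
full  = max_concurrent(54, 131072)  # model's full 128K context
print(short, full)  # capping at 4K buys ~32x the concurrency
```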

tensor_parallel_size and pipeline_parallel_size

Parallelism configuration determines how the model is split across GPUs.

def parallelism_analysis(
    model_memory_gb: float,
    gpu_memory_gb: float,
    num_gpus: int
) -> list:
    """
    Analyze different TP/PP configurations.

    tensor_parallel_size (TP): splits each layer across GPUs
    - Adds allreduce communication per layer
    - Good for latency (all GPUs work on same request)

    pipeline_parallel_size (PP): assigns different layers to different GPUs
    - Adds send/recv communication between stages
    - Introduces pipeline bubble
    - Good for throughput (different stages process different requests)
    """
    configs = []

    for tp in [1, 2, 4, 8]:
        for pp in [1, 2, 4]:
            if tp * pp > num_gpus:
                continue

            mem_per_gpu = model_memory_gb / (tp * pp)
            if mem_per_gpu > gpu_memory_gb * 0.6:  # Need space for KV cache
                continue

            # TP overhead: 2 allreduce per layer
            tp_overhead_ms_per_layer = 0.05 * (tp - 1) if tp > 1 else 0

            # PP overhead: bubble fraction = 1 / num_microbatches
            pp_bubble_fraction = 1 / (4 * pp) if pp > 1 else 0  # 4 microbatches

            # Replicas = num_gpus / (tp * pp)
            replicas = num_gpus // (tp * pp)

            configs.append({
                "tp": tp,
                "pp": pp,
                "mem_per_gpu_gb": mem_per_gpu,
                "tp_overhead_ms_per_layer": tp_overhead_ms_per_layer,
                "pp_bubble_fraction": pp_bubble_fraction,
                "replicas": replicas,
                "total_gpus": tp * pp * replicas,
            })

    return configs
Parallelism Configurations (70B FP16, 8x A100-80GB)

Config     | Mem/GPU | TP Overhead/Layer | PP Bubble | Replicas
TP=8, PP=1 | 17.5 GB | 0.35 ms           | 0%        | 1
TP=4, PP=2 | 17.5 GB | 0.15 ms           | 12.5%     | 1
TP=4, PP=1 | 35 GB   | 0.15 ms           | 0%        | 2
TP=2, PP=4 | 17.5 GB | 0.05 ms           | 6.25%     | 1
ℹ️ Note

For latency-sensitive serving, prefer TP over PP. TP=8, PP=1 has no pipeline bubble and all 8 GPUs work on the same request simultaneously. For throughput-sensitive batch processing, PP allows multiple requests to be in different pipeline stages simultaneously, increasing overall throughput at the cost of per-request latency.
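
On the command line these map to two flags. A sketch assuming the standard `vllm serve` entrypoint (model name is a placeholder):

```shell
# Latency-oriented: all 8 GPUs work on every request, no pipeline bubble
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 8 --pipeline-parallel-size 1

# Throughput-oriented alternative on the same 8 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 --pipeline-parallel-size 2
```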

Quantization Parameters

def quantization_configs() -> dict:
    """
    vLLM quantization options and their configuration.
    """
    return {
        "none": {
            "flag": "--dtype bfloat16",
            "memory_multiplier": 1.0,
            "quality_loss": "None",
            "throughput_multiplier": 1.0,
        },
        "fp8": {
            "flag": "--quantization fp8",
            "memory_multiplier": 0.5,
            "quality_loss": "Negligible (0.1%)",
            "throughput_multiplier": 1.7,
            "requirements": "H100/L40S or quantized checkpoint",
        },
        "awq": {
            "flag": "--quantization awq",
            "memory_multiplier": 0.25,
            "quality_loss": "Small (1-2%)",
            "throughput_multiplier": 2.2,
            "requirements": "Pre-quantized AWQ checkpoint",
        },
        "gptq": {
            "flag": "--quantization gptq",
            "memory_multiplier": 0.25,
            "quality_loss": "Small (1-3%)",
            "throughput_multiplier": 2.0,
            "requirements": "Pre-quantized GPTQ checkpoint",
        },
        "squeezellm": {
            "flag": "--quantization squeezellm",
            "memory_multiplier": 0.25,
            "quality_loss": "Small-Medium",
            "throughput_multiplier": 1.8,
            "requirements": "Pre-quantized checkpoint",
        },
    }
Quantization Impact (Llama 70B, 4x H100)

Quantization      | GPU Memory/GPU | Throughput (tok/s) | MMLU Score
BF16 (TP=4)       | 35 GB          | 1,400              | 86.0%
FP8 (TP=4)        | 17.5 GB        | 2,380              | 85.8%
FP8 (TP=2)        | 35 GB          | 1,500              | 85.8%
AWQ 4-bit (TP=2)  | 17.5 GB        | 2,800              | 84.5%
GPTQ 4-bit (TP=2) | 17.5 GB        | 2,600              | 84.2%
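
The memory multiplier largely dictates the minimum TP size. A small sketch reusing the same "weights under ~60% of GPU memory" rule from the parallelism analysis above (multipliers from the table; power-of-two TP assumed, as attention heads must divide evenly):

```python
import math

def min_tensor_parallel(model_fp16_gb: float, memory_multiplier: float,
                        gpu_gb: float = 80.0, weight_budget: float = 0.6) -> int:
    """Smallest power-of-two TP that keeps quantized weights under
    ~60% of GPU memory, leaving the rest for KV cache."""
    weights_gb = model_fp16_gb * memory_multiplier
    needed = math.ceil(weights_gb / (gpu_gb * weight_budget))
    return 1 << (needed - 1).bit_length()  # round up to a power of two

for name, mult in [("bf16", 1.0), ("fp8", 0.5), ("awq-4bit", 0.25)]:
    print(name, min_tensor_parallel(140, mult))  # bf16 4 / fp8 2 / awq-4bit 1
```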

Scheduling Parameters

def scheduling_parameters() -> dict:
    """
    vLLM scheduler configuration parameters.
    """
    return {
        "scheduler_delay_factor": {
            "default": 0.0,
            "range": "0.0 - 1.0",
            "effect": (
                "Delays scheduling new requests to allow in-flight "
                "requests to complete. Higher values reduce TTFT "
                "variance but may reduce throughput."
            ),
        },
        "enable_chunked_prefill": {
            "default": True,
            "effect": (
                "Splits long prefill operations into chunks that can "
                "be interleaved with decode steps. Critical for maintaining "
                "low TPOT when processing long-context requests."
            ),
            "recommended": True,
        },
        "max_num_batched_tokens": {
            "default": "max_model_len (with chunked prefill) or 2048",
            "effect": (
                "Maximum total tokens (prefill + decode) in one iteration. "
                "With chunked prefill, this controls chunk size."
            ),
            "tuning": "Higher = better throughput, higher TPOT variance",
        },
        "preemption_mode": {
            "default": "recompute",
            "options": ["recompute", "swap"],
            "recompute": "Discard KV cache, re-prefill when rescheduled",
            "swap": "Move KV cache to CPU, swap back when rescheduled",
            "recommended": "recompute (lower complexity, works with prefix caching)",
        },
        "enable_prefix_caching": {
            "default": False,
            "effect": (
                "Cache KV blocks for repeated prefixes (system prompts). "
                "Dramatically reduces TTFT for repeated system prompts."
            ),
            "memory_overhead": "Uses some KV cache blocks for prefix cache",
            "recommended": True,
        },
    }
Scheduling Parameter Impact

Parameter              | Default   | Optimal for Throughput | Optimal for Latency
chunked_prefill        | True      | True                   | True
max_num_batched_tokens | auto      | 8192+                  | 2048
preemption_mode        | recompute | recompute              | swap
prefix_caching         | True      | True                   | True
scheduler_delay_factor | 0.0       | 0.0                    | 0.0-0.3
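
A throughput-leaning combination of these knobs, sketched with the usual `vllm serve` flag spellings (model name is a placeholder):

```shell
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching
```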

Deployment Scenario Configurations

def deployment_configs() -> dict:
    """
    Recommended configurations for common scenarios.
    """
    return {
        "chatbot_high_throughput": {
            "description": "High-volume chatbot, optimize for $/token",
            "gpu_memory_utilization": 0.92,
            "max_num_seqs": 256,
            "max_model_len": 4096,
            "quantization": "awq",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": True,
            "preemption_mode": "recompute",
        },
        "code_generation_low_latency": {
            "description": "IDE code completion, optimize for TTFT and TPOT",
            "gpu_memory_utilization": 0.88,
            "max_num_seqs": 64,
            "max_model_len": 16384,
            "quantization": "fp8",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": True,
            "preemption_mode": "recompute",
        },
        "document_analysis_long_context": {
            "description": "Process long documents, optimize for context length",
            "gpu_memory_utilization": 0.90,
            "max_num_seqs": 16,
            "max_model_len": 131072,
            "quantization": "fp8",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": False,  # Long unique docs
            "preemption_mode": "recompute",
        },
        "batch_processing_offline": {
            "description": "Process large dataset, optimize for total throughput",
            "gpu_memory_utilization": 0.95,
            "max_num_seqs": 512,
            "max_model_len": 4096,
            "quantization": "awq",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": True,
            "preemption_mode": "recompute",
        },
    }
Recommended Configurations by Scenario

Scenario             | gpu_mem_util | max_num_seqs | max_model_len | Quantization
Chatbot (throughput) | 0.92         | 256          | 4,096         | AWQ 4-bit
Code Gen (latency)   | 0.88         | 64           | 16,384        | FP8
Long Document        | 0.90         | 16           | 131,072       | FP8
Batch Processing     | 0.95         | 512          | 4,096         | AWQ 4-bit
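
These scenario dicts can be turned into launch arguments mechanically. A sketch, assuming the usual `vllm serve` flag spellings (underscores become hyphens, booleans become bare flags; `preemption_mode` is skipped here since its flag spelling varies across versions):

```python
def to_cli_args(config: dict) -> list[str]:
    """Convert a scenario dict into vllm-serve-style CLI arguments."""
    skip = {"description", "preemption_mode"}  # not plain value flags
    args = []
    for key, value in config.items():
        if key in skip:
            continue
        flag = "--" + key.replace("_", "-")
        if value is True:
            args.append(flag)           # boolean knob: bare flag
        elif value is not False:
            args += [flag, str(value)]  # everything else: flag + value
    return args

chatbot = {
    "description": "High-volume chatbot",
    "gpu_memory_utilization": 0.92,
    "max_num_seqs": 256,
    "max_model_len": 4096,
    "quantization": "awq",
    "enable_chunked_prefill": True,
    "enable_prefix_caching": True,
    "preemption_mode": "recompute",
}
print(" ".join(to_cli_args(chatbot)))
```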

Environment Variables and Hidden Knobs

def environment_variables() -> dict:
    """
    Important environment variables that affect vLLM behavior.
    """
    return {
        "VLLM_ATTENTION_BACKEND": {
            "options": ["FLASH_ATTN", "XFORMERS", "FLASHINFER"],
            "default": "FLASH_ATTN",
            "effect": "Selects attention kernel implementation",
            "recommendation": "FLASH_ATTN for most cases, FLASHINFER for prefix caching",
        },
        "CUDA_VISIBLE_DEVICES": {
            "effect": "Controls which GPUs vLLM can use",
            "example": "CUDA_VISIBLE_DEVICES=0,1,2,3",
        },
        "NCCL_DEBUG": {
            "options": ["WARN", "INFO", "TRACE"],
            "effect": "NCCL logging verbosity for debugging TP issues",
            "recommendation": "WARN in production, INFO for debugging",
        },
        "VLLM_LOGGING_LEVEL": {
            "options": ["DEBUG", "INFO", "WARNING", "ERROR"],
            "default": "INFO",
            "effect": "vLLM logging verbosity",
        },
    }
Performance

The attention backend matters more than most people realize. FLASHINFER is the optimal backend for workloads with prefix caching enabled, as it is specifically optimized for paged KV cache access patterns. For standard workloads without prefix caching, FLASH_ATTN provides the best raw performance.
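
These variables are set in the serving environment before launch, for example (model name is a placeholder):

```shell
# Prefix-caching-heavy deployment: prefer the FlashInfer backend and
# keep NCCL quiet in production
export VLLM_ATTENTION_BACKEND=FLASHINFER
export NCCL_DEBUG=WARN
export VLLM_LOGGING_LEVEL=INFO
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8
```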

The configuration space of vLLM v1 is large but the core parameters are few: gpu_memory_utilization (determines KV cache capacity), max_num_seqs (determines batch size and throughput/latency trade-off), max_model_len (determines context support and memory allocation), and quantization (determines GPU count and quality trade-off). Start with the scenario-specific defaults in this guide, benchmark with your actual workload, and adjust one parameter at a time. The measurements in this post provide the expected direction and magnitude of each change.