vLLM v1 has dozens of configuration parameters. Changing gpu_memory_utilization from 0.9 to 0.95 can increase throughput by 15%. Setting max_num_seqs too high causes OOM; too low wastes GPU compute. This post documents every key parameter, explains what it controls, provides measured benchmarks for different values, and gives recommended settings for common deployment scenarios.
gpu_memory_utilization
The single most important parameter. Controls what fraction of total GPU memory vLLM is allowed to use; the KV cache gets whatever remains after model weights and runtime overhead.
```python
def gpu_memory_utilization_analysis(
    gpu_memory_gb: float,
    model_memory_gb: float,
    utilization: float,
) -> dict:
    """
    gpu_memory_utilization determines KV cache capacity.

    Total GPU memory = model weights + KV cache + activations + overhead
    KV cache = gpu_memory * utilization - model_weights - overhead
    Default: 0.90
    Range: 0.5 - 0.99
    """
    overhead_gb = 1.5  # CUDA context, activation buffers, etc.
    kv_cache_gb = gpu_memory_gb * utilization - model_memory_gb - overhead_gb

    # KV cache capacity in tokens.
    # For a Llama-70B-class model (TP=8): roughly 80 KB per token per GPU
    # after the TP split (exact value depends on layers, KV heads, dtype).
    kv_bytes_per_token = 80_000
    max_tokens_in_cache = int(kv_cache_gb * 1e9 / kv_bytes_per_token)

    return {
        "kv_cache_gb": max(0, kv_cache_gb),
        "max_tokens_in_cache": max(0, max_tokens_in_cache),
        "max_concurrent_2k_seqs": max(0, max_tokens_in_cache // 2048),
        "headroom_gb": gpu_memory_gb * (1 - utilization),
    }

# Sweep gpu_memory_utilization for Llama 70B on A100-80GB (TP=8).
for util in [0.80, 0.85, 0.90, 0.92, 0.95, 0.98]:
    # 140 GB of FP16 weights / 8 GPUs = 17.5 GB per GPU
    result = gpu_memory_utilization_analysis(80, 17.5, util)
    print(f"util={util:.2f}: KV={result['kv_cache_gb']:.1f}GB, "
          f"tokens={result['max_tokens_in_cache']:,}, "
          f"2K seqs={result['max_concurrent_2k_seqs']}")
```
gpu_memory_utilization Impact (Llama 70B, A100-80GB, TP=8)
| Utilization | KV Cache (GB) | Max 2K Seqs | Headroom (GB) | OOM Risk |
|---|---|---|---|---|
| 0.80 | 45.5 | 291 | 16.0 | Very Low |
| 0.85 | 49.5 | 316 | 12.0 | Low |
| 0.90 (default) | 53.5 | 342 | 8.0 | Low |
| 0.92 | 55.1 | 352 | 6.4 | Medium |
| 0.95 | 57.5 | 368 | 4.0 | High |
| 0.98 | 59.9 | 383 | 1.6 | Very High |
Setting gpu_memory_utilization above 0.95 is dangerous in production. The 5% headroom absorbs activation memory spikes during prefill of long sequences. At 0.98, a single 32K-token prefill can trigger OOM. Use 0.90 for general deployments, 0.92-0.95 only if you strictly limit max_model_len.
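The arithmetic behind that warning can be sketched: given the largest prefill you allow, estimate the headroom it needs and back out the highest safe utilization. The per-token activation figure below is an illustrative assumption, not a measured constant.

```python
def safe_utilization(gpu_memory_gb: float, max_prefill_tokens: int,
                     activation_mb_per_token: float = 0.25) -> float:
    """Highest gpu_memory_utilization that still leaves room for the
    worst-case prefill activation spike (illustrative model only)."""
    spike_gb = max_prefill_tokens * activation_mb_per_token / 1024
    headroom_needed_gb = spike_gb + 1.0  # +1 GB for CUDA context, fragmentation
    return max(0.5, 1.0 - headroom_needed_gb / gpu_memory_gb)

# Allowing a 32K-token prefill on an 80 GB GPU argues for staying below ~0.89:
print(f"{safe_utilization(80, 32768):.3f}")
```

Under these assumptions, capping max_model_len is what buys you the right to raise utilization, which is exactly the coupling the paragraph above describes.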
max_num_seqs
Controls the maximum number of sequences processed in a single scheduler iteration.
```python
def max_num_seqs_analysis(
    max_seqs: int,
    avg_seq_len: int,
    gpu_compute_tflops: float,
    model_params_b: float,
) -> dict:
    """
    max_num_seqs: maximum concurrent sequences in a batch.
    Default: 256
    Range: 1 - 2048+
    Trade-off:
    - Higher: better throughput (more tokens amortize weight loading)
    - Lower: lower latency (less queuing, smaller TPOT)
    """
    # Total tokens held in the batch.
    total_tokens = max_seqs * avg_seq_len

    # Dense model: ~2 * N FLOPs per token (forward pass).
    # One decode step generates 1 token per sequence.
    flops_per_step = 2 * model_params_b * 1e9 * max_seqs
    compute_ms = flops_per_step / (gpu_compute_tflops * 1e12) * 1000

    # Memory-bandwidth bound: the FP16 weights (2 bytes/param) are read once
    # per step regardless of batch size -- this is what batching amortizes.
    mem_bandwidth_tb_s = 2.0
    weight_read_ms = model_params_b * 1e9 * 2 / (mem_bandwidth_tb_s * 1e12) * 1000

    step_time_ms = max(compute_ms, weight_read_ms)
    utilization = compute_ms / step_time_ms  # Fraction of the step doing useful math
    throughput_tok_per_sec = max_seqs / (step_time_ms / 1000)

    return {
        "batch_tokens": total_tokens,
        "gpu_utilization_pct": min(100, utilization * 100),
        "step_time_ms": step_time_ms,
        "throughput_tok_s": throughput_tok_per_sec,
        "tpot_ms": step_time_ms,
    }
```
max_num_seqs Impact on Throughput and Latency
| max_num_seqs | Throughput (tok/s) | TPOT (ms) | GPU Utilization |
|---|---|---|---|
| 1 | 40 | 25 | 2% |
| 16 | 550 | 29 | 18% |
| 64 | 1,800 | 36 | 52% |
| 128 | 2,800 | 46 | 72% |
| 256 (default) | 3,400 | 75 | 85% |
| 512 | 3,600 | 142 | 88% |
| 1024 | 3,700 | 277 | 89% |
Throughput vs TPOT at Different max_num_seqs
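A practical way to use these measurements: pick the largest max_num_seqs whose measured TPOT still meets your latency SLO. The pairs below are copied from the table above.

```python
# (max_num_seqs, TPOT ms) pairs from the Llama 70B / 8x A100 table above.
MEASURED_TPOT = [(1, 25), (16, 29), (64, 36), (128, 46),
                 (256, 75), (512, 142), (1024, 277)]

def largest_batch_under_slo(tpot_slo_ms: float) -> int:
    """Largest measured max_num_seqs whose TPOT stays within the SLO;
    returns 0 if no measured setting meets it."""
    feasible = [n for n, tpot in MEASURED_TPOT if tpot <= tpot_slo_ms]
    return max(feasible) if feasible else 0

print(largest_batch_under_slo(50))  # e.g. a 50 ms interactive TPOT budget
```

For hardware or models not in the table, re-run the sweep on your own deployment and substitute your measured pairs.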
max_model_len
Maximum sequence length (input + output tokens combined).
```python
def max_model_len_analysis(
    max_len: int,
    kv_bytes_per_token: int,
    kv_cache_budget_gb: float,
) -> dict:
    """
    max_model_len determines the maximum KV cache per sequence.
    Default: model's trained context length
    Trade-off:
    - Higher: supports longer conversations, uses more memory
    - Lower: more concurrent short sequences
    """
    kv_per_seq_gb = max_len * kv_bytes_per_token / 1e9
    max_concurrent_seqs = int(kv_cache_budget_gb / kv_per_seq_gb)
    return {
        "kv_per_seq_gb": kv_per_seq_gb,
        "max_concurrent_seqs": max_concurrent_seqs,
        "context_per_seq": max_len,
    }

# Llama 70B (TP=8, ~54 GB KV cache budget per GPU, ~80 KB per token per GPU)
for max_len in [2048, 4096, 8192, 16384, 32768, 65536, 131072]:
    result = max_model_len_analysis(max_len, 80_000, 54)
    print(f"max_len={max_len:>6d}: {result['kv_per_seq_gb']:.2f} GB/seq, "
          f"max {result['max_concurrent_seqs']} concurrent")
```
max_model_len vs Concurrent Sequence Capacity
| max_model_len | KV per Seq (GB) | Max Concurrent Seqs | Recommendation |
|---|---|---|---|
| 2,048 | 0.16 | 329 | High-throughput chatbot |
| 4,096 | 0.33 | 164 | Standard chatbot |
| 8,192 | 0.66 | 82 | Code generation |
| 32,768 | 2.62 | 20 | Document analysis |
| 65,536 | 5.24 | 10 | Long context tasks |
| 131,072 | 10.49 | 5 | Full context Llama 3.1 |
If your workload is primarily short-context (chatbot, QA), set max_model_len to 4096 or 8192 even if the model supports 128K. This dramatically increases concurrent sequence capacity. You can always run a separate instance for long-context requests.
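The two-pool split suggested above needs only a trivial router in front of the instances. Pool names and the 8K threshold here are placeholders for whatever max_model_len you give each instance.

```python
def route_request(prompt_tokens: int, max_new_tokens: int,
                  short_ctx_limit: int = 8192) -> str:
    """Send a request to the short-context pool unless prompt + output
    could exceed that pool's max_model_len (pool names are placeholders)."""
    if prompt_tokens + max_new_tokens <= short_ctx_limit:
        return "short-ctx-pool"
    return "long-ctx-pool"

print(route_request(1500, 512))    # typical chat turn
print(route_request(30000, 2048))  # document-analysis request
```

Routing on prompt + max_new_tokens rather than prompt length alone matters: a short prompt with a huge output budget can still blow past a small max_model_len.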
tensor_parallel_size and pipeline_parallel_size
Parallelism configuration determines how the model is split across GPUs.
```python
def parallelism_analysis(
    model_memory_gb: float,
    gpu_memory_gb: float,
    num_gpus: int,
) -> list:
    """
    Analyze different TP/PP configurations.

    tensor_parallel_size (TP): splits each layer across GPUs
    - Adds allreduce communication per layer
    - Good for latency (all GPUs work on the same request)

    pipeline_parallel_size (PP): assigns different layers to different GPUs
    - Adds send/recv communication between stages
    - Introduces a pipeline bubble
    - Good for throughput (different stages process different requests)
    """
    configs = []
    for tp in [1, 2, 4, 8]:
        for pp in [1, 2, 4]:
            if tp * pp > num_gpus:
                continue
            mem_per_gpu = model_memory_gb / (tp * pp)
            if mem_per_gpu > gpu_memory_gb * 0.6:  # Need space for KV cache
                continue
            # TP overhead: 2 allreduces per layer
            tp_overhead_ms_per_layer = 0.05 * (tp - 1) if tp > 1 else 0
            # PP overhead: rough bubble model with 4 microbatches
            pp_bubble_fraction = 1 / (4 * pp) if pp > 1 else 0
            # Independent replicas that fit in the remaining GPUs
            replicas = num_gpus // (tp * pp)
            configs.append({
                "tp": tp,
                "pp": pp,
                "mem_per_gpu_gb": mem_per_gpu,
                "tp_overhead_ms_per_layer": tp_overhead_ms_per_layer,
                "pp_bubble_fraction": pp_bubble_fraction,
                "replicas": replicas,
                "total_gpus": tp * pp * replicas,
            })
    return configs
```
Parallelism Configurations (70B FP16, 8x A100-80GB)
| Config | Mem/GPU | TP Overhead/Layer | PP Bubble | Replicas |
|---|---|---|---|---|
| TP=8, PP=1 | 17.5 GB | 0.35 ms | 0% | 1 |
| TP=4, PP=2 | 17.5 GB | 0.15 ms | 12.5% | 1 |
| TP=4, PP=1 | 35 GB | 0.15 ms | 0% | 2 |
| TP=2, PP=4 | 17.5 GB | 0.05 ms | 6.25% | 1 |
For latency-sensitive serving, prefer TP over PP. TP=8, PP=1 has no pipeline bubble and all 8 GPUs work on the same request simultaneously. For throughput-sensitive batch processing, PP allows multiple requests to be in different pipeline stages simultaneously, increasing overall throughput at the cost of per-request latency.
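That advice can be encoded as a small selection heuristic over the table's configurations. This is one possible policy, not the only reasonable one: latency wants all GPUs on one request with no bubble; throughput here prefers independent replicas first, then the smallest bubble.

```python
# Candidate layouts for 70B FP16 on 8x A100-80GB, taken from the table above.
CONFIGS = [
    {"tp": 8, "pp": 1, "replicas": 1, "pp_bubble_fraction": 0.0},
    {"tp": 4, "pp": 2, "replicas": 1, "pp_bubble_fraction": 0.125},
    {"tp": 4, "pp": 1, "replicas": 2, "pp_bubble_fraction": 0.0},
    {"tp": 2, "pp": 4, "replicas": 1, "pp_bubble_fraction": 0.0625},
]

def pick_parallelism(configs: list, goal: str) -> dict:
    """Heuristic selector: 'latency' -> no bubble, widest TP;
    'throughput' -> most replicas, then smallest bubble."""
    if goal == "latency":
        return min(configs, key=lambda c: (c["pp_bubble_fraction"], -c["tp"]))
    return max(configs, key=lambda c: (c["replicas"], -c["pp_bubble_fraction"]))

print(pick_parallelism(CONFIGS, "latency"))     # expects TP=8, PP=1
print(pick_parallelism(CONFIGS, "throughput"))  # expects TP=4, PP=1 x2 replicas
```

Note the throughput pick favors two TP=4 replicas over PP: replicas avoid both the bubble and the extra allreduce width, at the cost of needing the model to fit in fewer GPUs.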
Quantization Parameters
```python
def quantization_configs() -> dict:
    """
    vLLM quantization options and their configuration.
    """
    return {
        "none": {
            "flag": "--dtype bfloat16",
            "memory_multiplier": 1.0,
            "quality_loss": "None",
            "throughput_multiplier": 1.0,
        },
        "fp8": {
            "flag": "--quantization fp8",
            "memory_multiplier": 0.5,
            "quality_loss": "Negligible (0.1%)",
            "throughput_multiplier": 1.7,
            "requirements": "H100/L40S or quantized checkpoint",
        },
        "awq": {
            "flag": "--quantization awq",
            "memory_multiplier": 0.25,
            "quality_loss": "Small (1-2%)",
            "throughput_multiplier": 2.2,
            "requirements": "Pre-quantized AWQ checkpoint",
        },
        "gptq": {
            "flag": "--quantization gptq",
            "memory_multiplier": 0.25,
            "quality_loss": "Small (1-3%)",
            "throughput_multiplier": 2.0,
            "requirements": "Pre-quantized GPTQ checkpoint",
        },
        "squeezellm": {
            "flag": "--quantization squeezellm",
            "memory_multiplier": 0.25,
            "quality_loss": "Small-Medium",
            "throughput_multiplier": 1.8,
            "requirements": "Pre-quantized checkpoint",
        },
    }
```
Quantization Impact (Llama 70B, 4x H100)
| Quantization | GPU Memory/GPU | Throughput (tok/s) | MMLU Score |
|---|---|---|---|
| BF16 (TP=4) | 35 GB | 1,400 | 86.0% |
| FP8 (TP=4) | 17.5 GB | 2,380 | 85.8% |
| FP8 (TP=2) | 35 GB | 1,500 | 85.8% |
| AWQ 4-bit (TP=2) | 17.5 GB | 2,800 | 84.5% |
| GPTQ 4-bit (TP=2) | 17.5 GB | 2,600 | 84.2% |
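The memory column above follows directly from bytes per parameter, so a small helper can pick the minimum TP degree for a given precision. The 0.6 weight-budget fraction (leaving the rest of each GPU for KV cache) is an assumption carried over from the parallelism sketch earlier.

```python
def min_tp_for_model(model_params_b: float, bytes_per_param: float,
                     gpu_memory_gb: float, weight_budget_frac: float = 0.6) -> int:
    """Smallest power-of-two TP degree whose per-GPU weight slice fits in
    the weight budget; the remainder of each GPU is left for KV cache."""
    weights_gb = model_params_b * bytes_per_param
    for tp in (1, 2, 4, 8):
        if weights_gb / tp <= gpu_memory_gb * weight_budget_frac:
            return tp
    raise ValueError("model does not fit on 8 GPUs at this precision")

# 70B on 80 GB GPUs: BF16 needs TP=4, FP8 needs TP=2, 4-bit fits on one GPU.
for label, bpp in [("bf16", 2.0), ("fp8", 1.0), ("4-bit", 0.5)]:
    print(label, min_tp_for_model(70, bpp, 80))
```

This is why the table can run FP8 at TP=2 and AWQ at TP=2 with room to spare: halving bytes per parameter halves the minimum GPU count, which often matters more for cost than the throughput multiplier itself.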
Scheduling Parameters
```python
def scheduling_parameters() -> dict:
    """
    vLLM scheduler configuration parameters.
    """
    return {
        "scheduler_delay_factor": {
            "default": 0.0,
            "range": "0.0 - 1.0",
            "effect": (
                "Delays scheduling new requests to allow in-flight "
                "requests to complete. Higher values reduce TTFT "
                "variance but may reduce throughput."
            ),
        },
        "enable_chunked_prefill": {
            "default": True,
            "effect": (
                "Splits long prefill operations into chunks that can "
                "be interleaved with decode steps. Critical for maintaining "
                "low TPOT when processing long-context requests."
            ),
            "recommended": True,
        },
        "max_num_batched_tokens": {
            "default": "max_model_len (with chunked prefill) or 2048",
            "effect": (
                "Maximum total tokens (prefill + decode) in one iteration. "
                "With chunked prefill, this controls chunk size."
            ),
            "tuning": "Higher = better throughput, higher TPOT variance",
        },
        "preemption_mode": {
            "default": "recompute",
            "options": ["recompute", "swap"],
            "recompute": "Discard KV cache, re-prefill when rescheduled",
            "swap": "Move KV cache to CPU, swap back when rescheduled",
            "recommended": "recompute (lower complexity, works with prefix caching)",
        },
        "enable_prefix_caching": {
            "default": False,
            "effect": (
                "Cache KV blocks for repeated prefixes (system prompts). "
                "Dramatically reduces TTFT for repeated system prompts."
            ),
            "memory_overhead": "Uses some KV cache blocks for prefix cache",
            "recommended": True,
        },
    }
```
Scheduling Parameter Impact
| Parameter | Default | Optimal for Throughput | Optimal for Latency |
|---|---|---|---|
| chunked_prefill | True | True | True |
| max_num_batched_tokens | auto | 8192+ | 2048 |
| preemption_mode | recompute | recompute | swap |
| prefix_caching | False | True | True |
| scheduler_delay_factor | 0.0 | 0.0 | 0.0-0.3 |
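The TPOT-variance trade-off for max_num_batched_tokens can be made concrete: with chunked prefill, each scheduler step spends part of the token budget on in-flight decodes and the rest on a prefill chunk, so the budget determines how many steps a long prompt's prefill spans. This is a simplified model that ignores scheduler details.

```python
import math

def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int,
                  decode_seqs: int) -> int:
    """Scheduler steps needed to finish one prompt's prefill when every
    step also carries one decode token per running sequence (simplified)."""
    chunk_budget = max(1, max_num_batched_tokens - decode_seqs)
    return math.ceil(prompt_tokens / chunk_budget)

# A 32K prompt alongside 256 running decode sequences:
print(prefill_steps(32768, 2048, 256))  # small chunks: gentle on TPOT, slower TTFT
print(prefill_steps(32768, 8192, 256))  # large chunks: faster TTFT, bigger TPOT spikes
```

The two printed values are the essence of the table's "2048 for latency, 8192+ for throughput" recommendation: the small budget smears the prefill over many short steps, the large one finishes it in a few long ones.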
Deployment Scenario Configurations
```python
def deployment_configs() -> dict:
    """
    Recommended configurations for common scenarios.
    """
    return {
        "chatbot_high_throughput": {
            "description": "High-volume chatbot, optimize for $/token",
            "gpu_memory_utilization": 0.92,
            "max_num_seqs": 256,
            "max_model_len": 4096,
            "quantization": "awq",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": True,
            "preemption_mode": "recompute",
        },
        "code_generation_low_latency": {
            "description": "IDE code completion, optimize for TTFT and TPOT",
            "gpu_memory_utilization": 0.88,
            "max_num_seqs": 64,
            "max_model_len": 16384,
            "quantization": "fp8",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": True,
            "preemption_mode": "recompute",
        },
        "document_analysis_long_context": {
            "description": "Process long documents, optimize for context length",
            "gpu_memory_utilization": 0.90,
            "max_num_seqs": 16,
            "max_model_len": 131072,
            "quantization": "fp8",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": False,  # Long unique docs
            "preemption_mode": "recompute",
        },
        "batch_processing_offline": {
            "description": "Process large dataset, optimize for total throughput",
            "gpu_memory_utilization": 0.95,
            "max_num_seqs": 512,
            "max_model_len": 4096,
            "quantization": "awq",
            "enable_chunked_prefill": True,
            "enable_prefix_caching": True,
            "preemption_mode": "recompute",
        },
    }
```
Recommended Configurations by Scenario
| Scenario | gpu_mem_util | max_num_seqs | max_model_len | Quantization |
|---|---|---|---|---|
| Chatbot (throughput) | 0.92 | 256 | 4,096 | AWQ 4-bit |
| Code Gen (latency) | 0.88 | 64 | 16,384 | FP8 |
| Long Document | 0.90 | 16 | 131,072 | FP8 |
| Batch Processing | 0.95 | 512 | 4,096 | AWQ 4-bit |
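A scenario dict like the ones above maps mechanically onto `vllm serve` flags. The flag names below follow vLLM's standard engine arguments, but verify them against your installed version before relying on this sketch.

```python
def to_cli_args(config: dict) -> str:
    """Render a scenario dict as vllm serve flags (flag names per vLLM's
    engine arguments; check against your installed version)."""
    parts = [
        f"--gpu-memory-utilization {config['gpu_memory_utilization']}",
        f"--max-num-seqs {config['max_num_seqs']}",
        f"--max-model-len {config['max_model_len']}",
    ]
    if config.get("quantization"):
        parts.append(f"--quantization {config['quantization']}")
    if config.get("enable_prefix_caching"):
        parts.append("--enable-prefix-caching")
    return " ".join(parts)

chatbot = {
    "gpu_memory_utilization": 0.92,
    "max_num_seqs": 256,
    "max_model_len": 4096,
    "quantization": "awq",
    "enable_prefix_caching": True,
}
print("vllm serve <model> " + to_cli_args(chatbot))
```

Generating the command line from one config dict keeps the benchmark script, the deployment manifest, and this table from drifting apart.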
Environment Variables and Hidden Knobs
```python
def environment_variables() -> dict:
    """
    Important environment variables that affect vLLM behavior.
    """
    return {
        "VLLM_ATTENTION_BACKEND": {
            "options": ["FLASH_ATTN", "XFORMERS", "FLASHINFER"],
            "default": "FLASH_ATTN",
            "effect": "Selects attention kernel implementation",
            "recommendation": "FLASH_ATTN for most cases, FLASHINFER for prefix caching",
        },
        "CUDA_VISIBLE_DEVICES": {
            "effect": "Controls which GPUs vLLM can use",
            "example": "CUDA_VISIBLE_DEVICES=0,1,2,3",
        },
        "NCCL_DEBUG": {
            "options": ["WARN", "INFO", "TRACE"],
            "effect": "NCCL logging verbosity for debugging TP issues",
            "recommendation": "WARN in production, INFO for debugging",
        },
        "VLLM_LOGGING_LEVEL": {
            "options": ["DEBUG", "INFO", "WARNING", "ERROR"],
            "default": "INFO",
            "effect": "vLLM logging verbosity",
        },
    }
```
The attention backend matters more than most people realize. FLASHINFER is the optimal backend for workloads with prefix caching enabled, as it is specifically optimized for paged KV cache access patterns. For standard workloads without prefix caching, FLASH_ATTN provides the best raw performance.
The configuration space of vLLM v1 is large but the core parameters are few: gpu_memory_utilization (determines KV cache capacity), max_num_seqs (determines batch size and throughput/latency trade-off), max_model_len (determines context support and memory allocation), and quantization (determines GPU count and quality trade-off). Start with the scenario-specific defaults in this guide, benchmark with your actual workload, and adjust one parameter at a time. The measurements in this post provide the expected direction and magnitude of each change.