Part of Series: vLLM v1 & Omni Internals (21 of 25)
1. vLLM v1 Block Manager: Deconstructing KV Cache Memory Management at the Pointer Level
2. vLLM v1 Disaggregated Serving: The E/P/D/G Pipeline and Multimodal-First Architecture
3. vLLM OmniConnector: Async Multimodal Token Lifecycle Management
4. vLLM v1 Unified Scheduler: One Queue, No Prefill/Decode Distinction, and Persistent Batches
5. vLLM v1 Attention Backends: FlashAttention, FlashInfer, and PagedAttention Selection Logic
6. vLLM v1 Rejection Sampler: Native CFG and Speculative Verification Kernels
7. vLLM v1 Tensor Parallelism: Symmetric Workers, Incremental Updates, and NCCL Optimization
8. vLLM v1 Structured Output: The Native Grammar Engine and Token Mask Caching
9. vLLM v1 Prefix Caching: Hash Chains, LRU Eviction, and Hit Rate Optimization
10. vLLM v1 Multi-LoRA: Adapter Scheduling, Memory Management, and Batched Inference
11. vLLM v1 Performance Profiling: Finding and Fixing Bottlenecks in Production
12. vLLM v1 Speculative Decoding: Draft Model Integration and Token Verification Pipeline
13. vLLM v1 Vision Encoder: ViT Integration, Image Preprocessing, and Visual Token Pipeline
14. vLLM v1 Model Loading: Weight Distribution, safetensors Deserialization, and Progressive Startup
15. vLLM v1 Request Cancellation and Early Stopping: Freeing Resources Mid-Generation
16. vLLM v1 Quantized Inference: GPTQ, AWQ, FP8 Kernel Selection
17. vLLM v1 Distributed Execution: Ray Integration and Multi-Node Coordination
18. vLLM v1 KV Cache Offloading: GPU to CPU to SSD Tiered Memory
19. vLLM v1 Async Output: Detokenization, Streaming, and Queue Management
20. vLLM v1 Video and Audio: Temporal Encoding and Multi-Modal Batching
21. vLLM v1 Benchmarking: Systematic Optimization for Your Workload
22. vLLM v1 Error Handling: CUDA OOM Recovery, Request Retry, and Graceful Degradation
23. vLLM v1 Configuration Guide: gpu_memory_utilization, max_num_seqs, and Every Key Parameter
24. vLLM v1 Plugin Architecture: Custom Samplers, Schedulers, and Attention Backends
25. vLLM v1 Production Checklist: From Development to Reliable 24/7 Serving

Benchmarking LLM inference is not a single number. Time-to-first-token (TTFT), time-per-output-token (TPOT), total throughput (tokens/sec), and memory utilization are four independent metrics that often trade off against each other. A configuration that maximizes throughput may double TTFT. This post provides a systematic framework for benchmarking vLLM v1 across the configuration space: batch size, max sequence length, quantization, tensor parallelism, KV cache allocation, and scheduling parameters.

The Four Core Metrics

Every LLM serving benchmark must measure these independently:

# Metric definitions
TTFT = time_of_first_generated_token - time_of_request_arrival
TPOT = (time_of_last_token - time_of_first_token) / (num_output_tokens - 1)
throughput = total_tokens_generated / wall_clock_time  # across all requests
memory_util = kv_cache_blocks_used / kv_cache_blocks_total

TTFT is dominated by prefill time — the forward pass over the entire input prompt. TPOT is essentially the latency of a single decode step: each step emits one token per running sequence, so a sequence's inter-token gap equals the step time. Throughput is the aggregate generation rate across all running sequences.
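The definitions above can be computed directly from per-request timestamps. A minimal sketch — the `RequestTrace` record and its field names are illustrative, not a vLLM type:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Illustrative per-request timing record (not a vLLM type)."""
    arrival: float      # request arrival time (s)
    token_times: list   # wall-clock time of each output token (s)

def ttft(r):
    # Time to first token: first output minus arrival
    return r.token_times[0] - r.arrival

def tpot(r):
    # Average inter-token gap over the decode phase only
    return (r.token_times[-1] - r.token_times[0]) / (len(r.token_times) - 1)

def throughput(traces, wall_clock):
    # Aggregate output tokens per second across all requests
    return sum(len(r.token_times) for r in traces) / wall_clock

r = RequestTrace(arrival=0.0, token_times=[0.050, 0.060, 0.070, 0.080])
print(f"TTFT={ttft(r)*1000:.1f} ms  TPOT={tpot(r)*1000:.1f} ms")
```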

The fundamental trade-off: higher batch sizes increase throughput but increase TPOT because each decode step takes longer with more sequences. TTFT increases when queued requests must wait for running sequences to free KV cache blocks.

vLLM’s Built-in Benchmarking Tool

vLLM ships with benchmark_serving.py which generates load and measures all four metrics:

# Basic benchmark: fixed request rate
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --max-model-len 4096 &

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-2-7b-hf \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered.json \
    --request-rate 10 \
    --num-prompts 1000 \
    --save-result results.json

The output includes percentile breakdowns:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  105.32
Request throughput (req/s):              9.49
Input token throughput (tok/s):          2847.12
Output token throughput (tok/s):         4521.88

TTFT (ms):  p50=45.2  p90=89.1  p95=124.3  p99=312.7
TPOT (ms):  p50=12.1  p90=18.4  p95=22.1   p99=35.8
💡 Tip

Always report p99 latencies, not just p50. A system with p50 TTFT of 45ms and p99 TTFT of 312ms has a 7x tail latency ratio, indicating queuing or preemption under load. Production SLAs are typically defined at p95 or p99.
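To check your own result files for tail blow-up, a nearest-rank percentile is enough (benchmark_serving.py already reports these, so this is only a sketch for ad-hoc analysis; the toy distribution below is invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (0 < p <= 100), no interpolation."""
    xs = sorted(samples)
    rank = max(1, math.ceil(len(xs) * p / 100))
    return xs[rank - 1]

# Toy TTFT distribution (ms) with a heavy tail: 98 fast, 2 slow requests
ttfts_ms = [45.2] * 98 + [312.7] * 2
p50 = percentile(ttfts_ms, 50)
p99 = percentile(ttfts_ms, 99)
print(f"p50={p50} ms  p99={p99} ms  tail ratio={p99 / p50:.1f}x")
```

A tail ratio near 1x means consistent latency; anything above ~3x warrants investigating queuing or preemption.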

Configuration Axes to Benchmark

Axis 1: GPU Memory Utilization

The --gpu-memory-utilization flag controls what fraction of GPU memory is allocated to the KV cache pool. Higher values allow more concurrent sequences but leave less headroom for activation memory spikes.

# Sweep memory utilization
for util in 0.80 0.85 0.90 0.92 0.95; do
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-2-70b-hf \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization $util \
        --max-model-len 4096 &
    sleep 30  # wait for model load

    python benchmarks/benchmark_serving.py \
        --backend vllm \
        --model meta-llama/Llama-2-70b-hf \
        --request-rate 20 \
        --num-prompts 500 \
        --save-result "results_util_${util}.json"

    kill %1
done
📊 Impact of GPU Memory Utilization — Llama 70B, 4xA100, RPS=20

Utilization   KV Blocks   Max Concurrent Seqs   Throughput (tok/s)   p99 TTFT (ms)
0.80          1,842       28                    3,150                189
0.85          2,105       32                    3,480                165
0.90          2,368       36                    3,780                142
0.92          2,473       38                    3,850                138
0.95          2,631       40                    OOM crash            ---

At 0.95 utilization, activation memory spikes during long-context prefill cause OOM. The safe maximum on A100 is typically 0.92.

Axis 2: Max Sequence Length

--max-model-len determines the maximum total tokens (input + output) per request. The KV cache pool itself stays the same size, but each admitted sequence must be able to grow to the maximum length, so larger values sharply reduce how many sequences the scheduler will run concurrently:

for maxlen in 2048 4096 8192 16384 32768; do
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-2-70b-hf \
        --tensor-parallel-size 4 \
        --max-model-len $maxlen \
        --gpu-memory-utilization 0.90 &
    sleep 30
    python benchmarks/benchmark_serving.py \
        --backend vllm \
        --model meta-llama/Llama-2-70b-hf \
        --request-rate 10 --num-prompts 300 \
        --save-result "results_maxlen_${maxlen}.json"
    kill %1
done
📊 Impact of Max Sequence Length — Llama 70B, 4xA100

Max Length   Blocks Available   Max Concurrent (avg input 512)   Throughput (tok/s)
2,048        2,368              72                               4,210
4,096        2,368              36                               3,780
8,192        2,368              18                               2,950
16,384       2,368              9                                1,820
32,768       2,368              4                                980
⚠️ Warning

Setting max-model-len higher than your actual workload’s longest sequence wastes KV cache. If your P99 input+output length is 3,000 tokens, set max-model-len to 4,096, not 32,768. The scheduler reserves blocks based on potential maximum, not actual usage.
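A back-of-envelope sketch of why admissible concurrency scales inversely with max-model-len. The fully conservative reserve-everything-up-front policy here is a simplification — v1 allocates blocks on demand, which is why the measured concurrency in the table above is higher — but the inverse scaling is the same mechanism (BLOCK_SIZE of 16 is vLLM's default):

```python
import math

BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default)

def worst_case_blocks(max_model_len):
    """Blocks one sequence needs if it grows all the way to max_model_len."""
    return math.ceil(max_model_len / BLOCK_SIZE)

def conservative_concurrency(total_blocks, max_model_len):
    """Sequences admissible if every one reserved its worst case up front."""
    return total_blocks // worst_case_blocks(max_model_len)

for maxlen in (2048, 4096, 8192, 16384, 32768):
    print(maxlen, conservative_concurrency(2368, maxlen))
```

Doubling max-model-len halves the conservative bound, which matches the halving pattern in the measured table.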

Axis 3: Max Number of Sequences

The --max-num-seqs parameter caps how many sequences can run concurrently:

# The relationship between max_num_seqs and throughput
# Throughput = batch_size * tokens_per_step / step_latency
# step_latency grows with batch_size (more KV cache reads)
# There's an optimal batch size where throughput peaks

# For Llama 70B on 4xA100:
# - Below batch 32: memory-bandwidth bound, throughput scales linearly
# - Batch 32-128: transition region
# - Above batch 128: compute-bound, throughput plateaus
# - Above batch 256: KV cache thrashing, throughput drops

Profiling with torch.profiler

vLLM supports PyTorch profiler integration to identify bottlenecks within the forward pass:

import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=2, warmup=2, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for i in range(7):
        outputs = llm.generate(
            ["Explain quantum computing in detail."],
            SamplingParams(max_tokens=128)
        )
        prof.step()

The profiler trace reveals the time split between:

Typical decode step breakdown (Llama 70B, batch=64, 4xA100):
  Attention (FlashAttention):    35%  (KV cache read is bandwidth-bound)
  Linear projections (GEMM):     42%  (QKV, O, gate, up, down)
  All-reduce (NCCL):              8%  (2 per layer, 160 total)
  Sampling + scheduling:          5%  (CPU-side)
  Kernel launch overhead:         4%
  Other (norms, activations):     6%


TTFT Optimization

TTFT is the most user-visible metric in interactive applications. It is primarily determined by prefill latency.

Chunked Prefill

vLLM v1 supports chunked prefill, which breaks long prompts into chunks and interleaves them with decode steps:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 2048

Without chunked prefill, a 4,096-token prompt blocks all decode steps for the duration of the prefill (approximately 200ms on 4xA100 for Llama 70B). With chunked prefill at chunk size 2,048, the prompt is processed in 2 chunks, and decode steps for other sequences run between chunks.

📊 TTFT with and without Chunked Prefill — Llama 70B, 4xA100

Input Length   No Chunking p50   No Chunking p99   Chunked (2048) p50   Chunked (2048) p99
512            28 ms             45 ms             30 ms                48 ms
2,048          95 ms             142 ms            98 ms                155 ms
4,096          198 ms            312 ms            210 ms               245 ms
8,192          410 ms            680 ms            425 ms               490 ms
16,384         850 ms            1,420 ms          880 ms               1,020 ms

Chunked prefill increases p50 TTFT slightly (the target request’s own prefill takes one extra scheduling round) but dramatically reduces p99 TTFT by preventing head-of-line blocking.
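The p99 improvement comes from bounding how long any single prefill can monopolize the GPU. A back-of-envelope sketch — ms_per_token is an illustrative constant derived from the roughly 200 ms per 4,096-token prefill figure above, about 0.05 ms per token:

```python
import math

def num_chunks(prompt_len, max_batched_tokens):
    """Scheduling rounds a prompt's prefill is split across."""
    return math.ceil(prompt_len / max_batched_tokens)

def worst_decode_stall_ms(prompt_len, max_batched_tokens, ms_per_token=0.05):
    """Longest stretch other sequences wait behind one prompt's prefill:
    the cost of a single chunk, since decodes interleave between chunks."""
    return min(prompt_len, max_batched_tokens) * ms_per_token

print(num_chunks(16384, 2048))              # 8 chunks
print(worst_decode_stall_ms(16384, 16384))  # no chunking: ~819 ms stall
print(worst_decode_stall_ms(16384, 2048))   # chunked: ~102 ms per stall
```

The total prefill work is unchanged; chunking only caps the length of each individual stall, which is exactly what tail latency percentiles measure.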

Prefix Caching

If many requests share a common prefix (system prompt, few-shot examples), prefix caching avoids redundant prefill computation:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --enable-prefix-caching
# Impact: shared 2048-token system prompt
# Without caching: every request prefills 2048 tokens
# With caching: first request prefills 2048 tokens, subsequent requests skip

# TTFT reduction for request with 2048 shared prefix + 512 unique:
# Without: prefill(2048 + 512) = 125 ms
# With: prefill(512) + cache_lookup(2048) = 32 ms + 0.1 ms = 32.1 ms
# Speedup: 3.9x
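The arithmetic above generalizes to a simple estimator. This is an idealized linear model, not vLLM's internal accounting: prefill cost is assumed proportional to uncached tokens, and the overhead_ms parameter is a hypothetical per-request fixed cost that explains why the measured 3.9x falls short of the ideal 5x:

```python
def prefix_cache_speedup(shared_tokens, unique_tokens, overhead_ms=0.0,
                         ms_per_token=0.0488):
    """Estimated TTFT speedup from a prefix cache hit, assuming prefill
    cost is linear in the number of uncached tokens (an idealization)."""
    cold = overhead_ms + (shared_tokens + unique_tokens) * ms_per_token
    warm = overhead_ms + unique_tokens * ms_per_token
    return cold / warm

# Ideal linear model gives the 5.0x upper bound; a fixed per-request
# overhead pulls the estimate toward the measured ~3.9x above.
print(round(prefix_cache_speedup(2048, 512), 1))                   # 5.0
print(round(prefix_cache_speedup(2048, 512, overhead_ms=7.0), 1))  # 4.1
```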

Throughput Optimization

Finding the Optimal Batch Size

Throughput is maximized at the batch size where GPU compute utilization is high but KV cache is not exhausted:

# Automated batch size sweep
import json
import subprocess
import time

results = []
for max_seqs in [8, 16, 32, 64, 128, 256, 512]:
    # Launch server with a specific max_num_seqs
    server = subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "meta-llama/Llama-2-70b-hf",
        "--tensor-parallel-size", "4",
        "--max-num-seqs", str(max_seqs),
        "--gpu-memory-utilization", "0.90",
    ])
    time.sleep(60)  # wait for model load (or poll the /health endpoint)

    # Run benchmark at saturation load
    subprocess.run([
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", "meta-llama/Llama-2-70b-hf",
        "--request-rate", "inf",  # send all at once
        "--num-prompts", "500",
        "--save-result", f"batch_{max_seqs}.json",
    ], check=True)

    with open(f"batch_{max_seqs}.json") as f:
        data = json.load(f)
    results.append({
        "max_seqs": max_seqs,
        "throughput": data["output_throughput"],
        "p99_tpot": data["tpot_p99"],
    })
    server.terminate()
    server.wait()
📊 Throughput vs Max Concurrent Sequences — Llama 70B, 4xA100

Max Seqs   Throughput (tok/s)   p50 TPOT (ms)   p99 TPOT (ms)   GPU Compute Util
8          1,280                6.2             8.1             22%
16         2,410                6.6             9.5             38%
32         3,780                8.4             12.8            58%
64         4,650                13.7            19.2            74%
128        5,120                24.9            35.1            85%
256        4,890                52.3            78.4            82%
512        3,200                159.8           245.0           68%

Peak throughput occurs at max_seqs=128, but p99 TPOT is 35ms. If your SLA requires TPOT under 20ms, the optimal point is max_seqs=64 at 4,650 tok/s.

At 256+ sequences, KV cache pressure causes preemption (sequences are swapped out and back in), which wastes compute on recomputation. At 512, thrashing dominates and throughput drops below the 32-sequence configuration.
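Picking the operating point from sweep data is mechanical: take the highest-throughput configuration that still meets the latency SLA. A sketch using the field names from the sweep script's result records:

```python
def best_config(results, tpot_sla_ms):
    """Highest-throughput setting whose p99 TPOT still meets the SLA."""
    feasible = [r for r in results if r["p99_tpot"] <= tpot_sla_ms]
    return max(feasible, key=lambda r: r["throughput"]) if feasible else None

sweep = [  # values from the table above
    {"max_seqs": 32,  "throughput": 3780, "p99_tpot": 12.8},
    {"max_seqs": 64,  "throughput": 4650, "p99_tpot": 19.2},
    {"max_seqs": 128, "throughput": 5120, "p99_tpot": 35.1},
]
print(best_config(sweep, tpot_sla_ms=20.0)["max_seqs"])  # 64
```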


Memory Profiling

Understanding where GPU memory goes is critical for optimization:

# Memory breakdown for Llama 70B on 4xA100 (TP=4)
# Total per GPU: 80 GB

model_weights = 70e9 * 2 / 4  # FP16, sharded 4 ways = 35 GB per GPU
# Actually lower due to GQA: ~32.5 GB per GPU

kv_cache_budget = 80 * 0.90 - 32.5 - 2.0  # 2 GB for activations/overhead
# kv_cache_budget = 37.5 GB per GPU

bytes_per_block = 5.24e6  # From block manager analysis (Llama 70B)
bytes_per_block_per_gpu = 5.24e6 / 4  # Sharded across TP
num_blocks = int(37.5e9 / (5.24e6 / 4))
# num_blocks = 28,626 blocks

tokens_per_block = 16
max_tokens_cached = num_blocks * tokens_per_block
# max_tokens_cached = 458,016 tokens across all sequences

For live monitoring while a benchmark runs:

watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
    --format=csv,noheader,nounits

Comparing Quantization Impact on Throughput

# Benchmark matrix: model format x batch size
for model in "meta-llama/Llama-2-70b-hf" \
             "TheBloke/Llama-2-70B-GPTQ" \
             "neuralmagic/Llama-2-70B-FP8"; do
    python -m vllm.entrypoints.openai.api_server \
        --model $model \
        --tensor-parallel-size 4 \
        --gpu-memory-utilization 0.90 &
    sleep 60
    python benchmarks/benchmark_serving.py \
        --request-rate inf --num-prompts 500 \
        --save-result "quant_${model##*/}.json"
    kill %1
done
📊 Quantization Impact — Llama 70B, 4xA100, Saturated Load

Format      Model Mem/GPU (GB)   KV Cache (GB)   Max Seqs   Peak Throughput (tok/s)
FP16        32.5                 37.5            128        5,120
FP8         16.3                 53.7            184        7,840
GPTQ-INT4   9.2                  60.8            208        8,350
AWQ-INT4    9.0                  61.0            209        8,280

Quantization provides a double benefit: smaller weights mean faster GEMM (bandwidth-bound at small batch) AND more KV cache capacity, which allows higher concurrency.
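The KV-cache side of the double benefit is pure arithmetic: every gigabyte not spent on weights goes to the cache pool. A naive bytes-per-parameter estimate — the measured table values differ slightly because of GQA, embeddings, and quantization scale tensors:

```python
def weight_gb_per_gpu(params_billion, bytes_per_param, tp_degree):
    """Naive dense-weight footprint, sharded across tensor-parallel ranks."""
    return params_billion * bytes_per_param / tp_degree

def kv_budget_gb(gpu_gb, mem_util, weight_gb, overhead_gb=2.0):
    """What remains for the KV cache pool after weights and overhead."""
    return gpu_gb * mem_util - weight_gb - overhead_gb

for fmt, bpp in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    w = weight_gb_per_gpu(70, bpp, tp_degree=4)
    print(f"{fmt}: weights {w:.1f} GB/GPU, KV budget "
          f"{kv_budget_gb(80, 0.90, w):.1f} GB/GPU")
```

Going from FP16 to INT4 roughly quadruples the weight savings into cache capacity, which is why max concurrency in the table jumps from 128 to 208.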

End-to-End Optimization Workflow

Here is the systematic workflow for optimizing a vLLM deployment:

# Step 1: Establish baseline
# Run with default settings, measure all 4 metrics

# Step 2: Identify the bottleneck
def identify_bottleneck(profile_data):
    if profile_data["gpu_compute_util"] < 50:    # percent
        return "BATCH_TOO_SMALL"
    if profile_data["kv_cache_util"] > 95:       # percent
        return "KV_CACHE_EXHAUSTED"
    if profile_data["preemption_rate"] > 0.01:   # fraction of requests
        return "TOO_MANY_SEQUENCES"
    if profile_data["p99_ttft"] > profile_data["sla_ttft"]:
        return "PREFILL_TOO_SLOW"
    if profile_data["nccl_fraction"] > 15:       # percent of step time
        return "COMMUNICATION_BOUND"
    return "NEAR_OPTIMAL"

# Step 3: Apply targeted optimization
optimizations = {
    "BATCH_TOO_SMALL": "Increase --max-num-seqs or request rate",
    "KV_CACHE_EXHAUSTED": "Reduce --max-model-len or use quantization",
    "TOO_MANY_SEQUENCES": "Reduce --max-num-seqs",
    "PREFILL_TOO_SLOW": "Enable chunked prefill or prefix caching",
    "COMMUNICATION_BOUND": "Reduce TP degree or use faster interconnect",
}

# Step 4: Re-benchmark and iterate

Production Monitoring Metrics

Deploy these Prometheus metrics for ongoing optimization:

# Key metrics to export
from prometheus_client import Histogram, Gauge, Counter

ttft_histogram = Histogram(
    "vllm_ttft_seconds",
    "Time to first token",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
tpot_histogram = Histogram(
    "vllm_tpot_seconds",
    "Time per output token",
    buckets=[0.005, 0.01, 0.02, 0.05, 0.1, 0.25]
)
kv_cache_usage = Gauge(
    "vllm_kv_cache_usage_ratio",
    "Fraction of KV cache blocks in use"
)
running_sequences = Gauge(
    "vllm_running_sequences",
    "Number of sequences currently being decoded"
)
preemption_counter = Counter(
    "vllm_preemptions_total",
    "Total number of sequence preemptions"
)
Performance

The preemption counter is the single most important metric for detecting throughput degradation. A preemption rate above 1% of requests indicates KV cache pressure. Each preemption discards computed KV cache and recomputes it later, so every preempted token ends up costing roughly twice its original compute. If you see preemptions, reduce max-num-seqs or increase KV cache capacity via quantization or more GPUs.
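Turning the counter into the 1% alert threshold takes two scrapes per window. A minimal sketch — the deltas would come from successive reads of vllm_preemptions_total and your request counter, and the threshold is the rule of thumb above:

```python
def preemption_rate(preemptions_delta, requests_delta):
    """Preemptions per request over a sampling window, computed from the
    increase of two counters between consecutive scrapes."""
    return preemptions_delta / max(requests_delta, 1)

# e.g. 15 preemptions across 1,000 requests in the last window
rate = preemption_rate(15, 1000)
print(f"{rate:.1%}", "reduce max-num-seqs" if rate > 0.01 else "ok")
```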

Summary

Benchmarking vLLM v1 requires measuring four independent metrics: TTFT, TPOT, throughput, and memory utilization. The optimal configuration depends on your SLA priorities. For latency-sensitive interactive serving, optimize for TTFT and TPOT by keeping batch sizes moderate (32-64) and enabling chunked prefill. For throughput-oriented batch processing, push batch sizes to 128+ and use quantized models to maximize KV cache capacity. Always benchmark with representative workload distributions (input/output length distributions from your actual traffic) rather than synthetic fixed-length prompts. Monitor preemption rate in production as the leading indicator of capacity exhaustion.