Benchmarking LLM inference is not a single number. Time-to-first-token (TTFT), time-per-output-token (TPOT), total throughput (tokens/sec), and memory utilization are four independent metrics that often trade off against each other. A configuration that maximizes throughput may double TTFT. This post provides a systematic framework for benchmarking vLLM v1 across the configuration space: batch size, max sequence length, quantization, tensor parallelism, KV cache allocation, and scheduling parameters.
The Four Core Metrics
Every LLM serving benchmark must measure these independently:
# Metric definitions
TTFT = time_of_first_generated_token - time_of_request_arrival
TPOT = (time_of_last_token - time_of_first_token) / (num_output_tokens - 1)
throughput = total_tokens_generated / wall_clock_time # across all requests
memory_util = kv_cache_blocks_used / kv_cache_blocks_total
TTFT is dominated by prefill time — the forward pass over the entire input prompt. TPOT is essentially the decode step latency: each step generates one token for every running sequence, so a sequence's TPOT equals the step time, and the step time itself grows with batch size. Throughput is the aggregate generation rate across all sequences.
The fundamental trade-off: higher batch sizes increase throughput but increase TPOT because each decode step takes longer with more sequences. TTFT increases when queued requests must wait for running sequences to free KV cache blocks.
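The definitions above can be computed directly from per-token timestamps collected by a streaming client. A minimal sketch (the helper names and the request/timestamp structure are mine, not vLLM's):

```python
def compute_request_metrics(arrival_time, token_times):
    """TTFT and TPOT for one request, from absolute timestamps in seconds.

    token_times[i] is when the i-th output token arrived at the client.
    """
    ttft = token_times[0] - arrival_time
    n = len(token_times)
    # TPOT averages the inter-token gaps after the first token
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def compute_throughput(requests, wall_clock_seconds):
    """Aggregate output tokens per second across all requests.

    requests is a list of (arrival_time, token_times) pairs.
    """
    total_tokens = sum(len(token_times) for _, token_times in requests)
    return total_tokens / wall_clock_seconds


# Example: request arrives at t=0.0, first token at 0.05 s, then one every 10 ms
times = [0.05 + 0.01 * i for i in range(5)]
ttft, tpot = compute_request_metrics(0.0, times)
# ttft = 0.05 s, tpot = 0.01 s
```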
vLLM’s Built-in Benchmarking Tool
vLLM ships with benchmark_serving.py which generates load and measures all four metrics:
# Basic benchmark: fixed request rate
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-7b-hf \
--max-model-len 4096 &
python benchmarks/benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-2-7b-hf \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered.json \
--request-rate 10 \
--num-prompts 1000 \
--save-result results.json
The output includes percentile breakdowns:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 105.32
Request throughput (req/s): 9.49
Input token throughput (tok/s): 2847.12
Output token throughput (tok/s): 4521.88
TTFT (ms): p50=45.2 p90=89.1 p95=124.3 p99=312.7
TPOT (ms): p50=12.1 p90=18.4 p95=22.1 p99=35.8
Always report p99 latencies, not just p50. A system with p50 TTFT of 45ms and p99 TTFT of 312ms has a 7x tail latency ratio, indicating queuing or preemption under load. Production SLAs are typically defined at p95 or p99.
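When analyzing raw per-request latencies yourself, rather than relying on the benchmark script's summary, a nearest-rank percentile is enough to spot tail blowups. A small sketch with synthetic numbers:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of the samples."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]


# Synthetic TTFT samples (ms): mostly fast, with a slow tail
ttfts = [45.0] * 89 + [90.0] * 9 + [320.0] * 2
p50, p99 = percentile(ttfts, 50), percentile(ttfts, 99)
tail_ratio = p99 / p50
# p50 = 45.0 ms, p99 = 320.0 ms -> tail ratio ~7x, a sign of queuing under load
```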
Configuration Axes to Benchmark
Axis 1: GPU Memory Utilization
The --gpu-memory-utilization flag controls what fraction of total GPU memory vLLM may use; whatever remains after model weights and activation workspace becomes the KV cache pool. Higher values allow more concurrent sequences but leave less headroom for activation memory spikes.
# Sweep memory utilization
for util in 0.80 0.85 0.90 0.92 0.95; do
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization $util \
    --max-model-len 4096 &
  sleep 30  # wait for model load
  python benchmarks/benchmark_serving.py \
    --request-rate 20 \
    --num-prompts 500 \
    --save-result "results_util_${util}.json"
  kill %1
done
Impact of GPU Memory Utilization — Llama 70B, 4xA100, RPS=20
| Utilization | KV Blocks | Max Concurrent Seqs | Throughput (tok/s) | p99 TTFT (ms) |
|---|---|---|---|---|
| 0.80 | 1,842 | 28 | 3,150 | 189 |
| 0.85 | 2,105 | 32 | 3,480 | 165 |
| 0.90 | 2,368 | 36 | 3,780 | 142 |
| 0.92 | 2,473 | 38 | 3,850 | 138 |
| 0.95 | 2,631 | 40 | OOM crash | --- |
At 0.95 utilization, activation memory spikes during long-context prefill cause OOM. The safe maximum on A100 is typically 0.92.
Axis 2: Max Sequence Length
--max-model-len determines the maximum total tokens (input + output) per request. Larger values do not shrink the KV cache pool itself — the block count stays fixed — but they reduce how many sequences can run concurrently, because the scheduler must assume each admitted sequence can grow to the maximum length:
for maxlen in 2048 4096 8192 16384 32768; do
  python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 4 \
    --max-model-len $maxlen \
    --gpu-memory-utilization 0.90 &
  sleep 30  # wait for model load
  python benchmarks/benchmark_serving.py \
    --request-rate 10 --num-prompts 300 \
    --save-result "results_maxlen_${maxlen}.json"
  kill %1
done
Impact of Max Sequence Length — Llama 70B, 4xA100
| Max Length | Blocks Available | Max Concurrent (avg input 512) | Throughput (tok/s) |
|---|---|---|---|
| 2,048 | 2,368 | 72 | 4,210 |
| 4,096 | 2,368 | 36 | 3,780 |
| 8,192 | 2,368 | 18 | 2,950 |
| 16,384 | 2,368 | 9 | 1,820 |
| 32,768 | 2,368 | 4 | 980 |
Setting max-model-len higher than your actual workload’s longest sequence wastes KV cache. If your P99 input+output length is 3,000 tokens, set max-model-len to 4,096, not 32,768. The scheduler reserves blocks based on potential maximum, not actual usage.
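This sizing rule can be mechanized: round the workload's p99 total length up to a comfortable boundary instead of defaulting to the model's full context window. A hypothetical helper (the doubling policy and starting granularity are my choices):

```python
def choose_max_model_len(p99_total_tokens, granularity=1024):
    """Round the workload's p99 (input + output) length up to the next
    power-of-two multiple of `granularity`, rather than defaulting to the
    model's full context window."""
    n = granularity
    while n < p99_total_tokens:
        n *= 2
    return n


# A workload whose p99 input+output length is 3,000 tokens:
choose_max_model_len(3000)  # -> 4096, not 32768
```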
Axis 3: Max Number of Sequences
The --max-num-seqs parameter caps how many sequences can run concurrently:
# The relationship between max_num_seqs and throughput
# Throughput = batch_size * tokens_per_step / step_latency
# step_latency grows with batch_size (more KV cache reads)
# There's an optimal batch size where throughput peaks
# For Llama 70B on 4xA100:
# - Below batch 32: memory-bandwidth bound, throughput scales linearly
# - Batch 32-128: transition region
# - Above batch 128: compute-bound, throughput plateaus
# - Above batch 256: KV cache thrashing, throughput drops
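The regimes sketched above can be captured with a toy latency model: a fixed per-step cost plus a per-sequence cost. The constants below are illustrative, not measurements, and the model deliberately ignores the thrashing regime above batch 256:

```python
def step_latency_ms(batch, fixed_ms=5.0, per_seq_ms=0.15):
    """Toy model: a fixed kernel/launch cost plus a per-sequence cost
    (KV cache reads grow with the number of running sequences).
    The constants are illustrative, not measured."""
    return fixed_ms + per_seq_ms * batch


def throughput_tok_s(batch):
    # One token per sequence per step
    return batch * 1000.0 / step_latency_ms(batch)


for b in [8, 32, 128, 512]:
    print(b, round(throughput_tok_s(b)))
# Small batches: near-linear scaling (fixed cost dominates).
# Large batches: throughput flattens toward 1000 / per_seq_ms.
```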
Profiling with torch.profiler
vLLM supports PyTorch profiler integration to identify bottlenecks within the forward pass:
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=2, warmup=2, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for i in range(7):  # 7 iterations = 2 wait + 2 warmup + 3 active
        outputs = llm.generate(
            ["Explain quantum computing in detail."],
            SamplingParams(max_tokens=128),
        )
        prof.step()
The profiler trace reveals the time split between:
Typical decode step breakdown (Llama 70B, batch=64, 4xA100):
Attention (FlashAttention): 35% (KV cache read is bandwidth-bound)
Linear projections (GEMM): 42% (QKV, O, gate, up, down)
All-reduce (NCCL): 8% (2 per layer, 160 total)
Sampling + scheduling: 5% (CPU-side)
Kernel launch overhead: 4%
Other (norms, activations): 6%
TTFT Optimization
TTFT is the most user-visible metric in interactive applications. It is primarily determined by prefill latency.
Chunked Prefill
vLLM v1 supports chunked prefill, which breaks long prompts into chunks and interleaves them with decode steps:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--enable-chunked-prefill \
--max-num-batched-tokens 2048
Without chunked prefill, a 4,096-token prompt blocks all decode steps for the duration of the prefill (approximately 200ms on 4xA100 for Llama 70B). With chunked prefill at chunk size 2,048, the prompt is processed in 2 chunks, and decode steps for other sequences run between chunks.
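The head-of-line blocking arithmetic is simple: the worst stall another request's decode can suffer is one chunk of prefill. A sketch assuming a hypothetical ~0.05 ms/token prefill rate, which reproduces the ~200 ms figure above:

```python
import math


def worst_decode_stall_ms(prompt_tokens, chunk_tokens, prefill_ms_per_token=0.05):
    """Longest a running decode can stall behind another request's prefill.
    Without chunking, the whole prompt is a single chunk.
    The per-token prefill rate is an assumed illustrative constant."""
    chunk = min(prompt_tokens, chunk_tokens)
    return chunk * prefill_ms_per_token


def prefill_rounds(prompt_tokens, chunk_tokens):
    """Number of scheduling rounds the prefill is spread over."""
    return math.ceil(prompt_tokens / chunk_tokens)


# 4,096-token prompt: ~205 ms stall unchunked vs ~102 ms with 2,048-token chunks
no_chunk = worst_decode_stall_ms(4096, 4096)
chunked = worst_decode_stall_ms(4096, 2048)
```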
TTFT with and without Chunked Prefill — Llama 70B, 4xA100
| Input Length | No Chunking p50 | No Chunking p99 | Chunked (2048) p50 | Chunked (2048) p99 |
|---|---|---|---|---|
| 512 | 28 ms | 45 ms | 30 ms | 48 ms |
| 2,048 | 95 ms | 142 ms | 98 ms | 155 ms |
| 4,096 | 198 ms | 312 ms | 210 ms | 245 ms |
| 8,192 | 410 ms | 680 ms | 425 ms | 490 ms |
| 16,384 | 850 ms | 1,420 ms | 880 ms | 1,020 ms |
Chunked prefill increases p50 TTFT slightly (the target request’s own prefill takes one extra scheduling round) but dramatically reduces p99 TTFT by preventing head-of-line blocking.
Prefix Caching
If many requests share a common prefix (system prompt, few-shot examples), prefix caching avoids redundant prefill computation:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--enable-prefix-caching
# Impact: shared 2048-token system prompt
# Without caching: every request prefills 2048 tokens
# With caching: first request prefills 2048 tokens, subsequent requests skip
# TTFT reduction for request with 2048 shared prefix + 512 unique:
# Without: prefill(2048 + 512) = 125 ms
# With: prefill(512) + cache_lookup(2048) = 32 ms + 0.1 ms = 32.1 ms
# Speedup: 3.9x
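Conceptually, the cache is keyed on block-aligned token prefixes, so two requests reuse exactly the blocks whose prefix hash chains match. A toy illustration of that lookup logic — not vLLM's actual implementation:

```python
def block_hashes(token_ids, block_size=16):
    """Hash each full block of tokens chained with its prefix, as a toy
    stand-in for prefix-cache keys. Partial trailing blocks are not cached."""
    hashes, prev = [], None
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        prev = hash((prev, tuple(token_ids[i:i + block_size])))
        hashes.append(prev)
    return hashes


def shared_prefix_blocks(a, b):
    """Count leading blocks two requests can share."""
    n = 0
    for x, y in zip(block_hashes(a), block_hashes(b)):
        if x != y:
            break
        n += 1
    return n


sys_prompt = list(range(2048))        # shared 2,048-token system prompt
req1 = sys_prompt + [9001, 9002, 9003]
req2 = sys_prompt + [7, 8, 9]
shared_prefix_blocks(req1, req2)      # -> 128 blocks (2,048 tokens) reused
```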
Throughput Optimization
Finding the Optimal Batch Size
Throughput is maximized at the batch size where GPU compute utilization is high but KV cache is not exhausted:
# Automated batch size sweep
import json
import subprocess
import time

results = []
for max_seqs in [8, 16, 32, 64, 128, 256, 512]:
    # Launch server with a specific max_num_seqs
    server = subprocess.Popen([
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "meta-llama/Llama-2-70b-hf",
        "--tensor-parallel-size", "4",
        "--max-num-seqs", str(max_seqs),
        "--gpu-memory-utilization", "0.90",
    ])
    time.sleep(60)  # wait for model load before sending traffic
    # Run benchmark at saturation load
    subprocess.run([
        "python", "benchmarks/benchmark_serving.py",
        "--request-rate", "inf",  # send all requests at once
        "--num-prompts", "500",
        "--save-result", f"batch_{max_seqs}.json",
    ], capture_output=True)
    with open(f"batch_{max_seqs}.json") as f:
        data = json.load(f)
    results.append({
        "max_seqs": max_seqs,
        "throughput": data["output_throughput"],
        "p99_tpot": data["tpot_p99"],
    })
    server.terminate()
    server.wait()
Throughput vs Max Concurrent Sequences — Llama 70B, 4xA100
| Max Seqs | Throughput (tok/s) | p50 TPOT (ms) | p99 TPOT (ms) | GPU Compute Util |
|---|---|---|---|---|
| 8 | 1,280 | 6.2 | 8.1 | 22% |
| 16 | 2,410 | 6.6 | 9.5 | 38% |
| 32 | 3,780 | 8.4 | 12.8 | 58% |
| 64 | 4,650 | 13.7 | 19.2 | 74% |
| 128 | 5,120 | 24.9 | 35.1 | 85% |
| 256 | 4,890 | 52.3 | 78.4 | 82% |
| 512 | 3,200 | 159.8 | 245.0 | 68% |
Peak throughput occurs at max_seqs=128, but p99 TPOT is 35ms. If your SLA requires TPOT under 20ms, the optimal point is max_seqs=64 at 4,650 tok/s.
At 256+ sequences, KV cache pressure causes preemption (sequences are swapped out and back in), which wastes compute on recomputation. At 512, thrashing dominates and throughput drops below the 32-sequence configuration.
Memory Profiling
Understanding where GPU memory goes is critical for optimization:
# Memory breakdown for Llama 70B on 4xA100 (TP=4)
# Total per GPU: 80 GB
model_weights = 70e9 * 2 / 4  # FP16, sharded 4 ways = 35 GB per GPU (nominal)
# Observed in practice: ~32.5 GB per GPU
kv_cache_budget = 80 * 0.90 - 32.5 - 2.0  # reserve 2 GB for activations/overhead
# kv_cache_budget = 37.5 GB per GPU
bytes_per_block = 5.24e6  # From block manager analysis (Llama 70B, 16-token blocks)
bytes_per_block_per_gpu = 5.24e6 / 4  # Each block is sharded across the TP ranks
num_blocks = int(37.5e9 / (5.24e6 / 4))
# num_blocks = 28,626 blocks
tokens_per_block = 16
max_tokens_cached = num_blocks * tokens_per_block
# max_tokens_cached = 458,016 tokens across all sequences
# Runtime memory monitoring
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
--format=csv,noheader,nounits
Comparing Quantization Impact on Throughput
# Benchmark matrix: model format x batch size
for model in "meta-llama/Llama-2-70b-hf" \
             "TheBloke/Llama-2-70B-GPTQ" \
             "neuralmagic/Llama-2-70B-FP8"; do
  python -m vllm.entrypoints.openai.api_server \
    --model $model \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90 &
  sleep 60  # wait for model load
  python benchmarks/benchmark_serving.py \
    --request-rate inf --num-prompts 500 \
    --save-result "quant_${model##*/}.json"
  kill %1
done
Quantization Impact — Llama 70B, 4xA100, Saturated Load
| Format | Model Mem/GPU (GB) | KV Cache (GB) | Max Seqs | Peak Throughput (tok/s) |
|---|---|---|---|---|
| FP16 | 32.5 | 37.5 | 128 | 5,120 |
| FP8 | 16.3 | 53.7 | 184 | 7,840 |
| GPTQ-INT4 | 9.2 | 60.8 | 208 | 8,350 |
| AWQ-INT4 | 9.0 | 61.0 | 209 | 8,280 |
Quantization provides a double benefit: smaller weights mean faster GEMM (bandwidth-bound at small batch) AND more KV cache capacity, which allows higher concurrency.
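The KV cache column in the table is just the memory identity from the profiling section: whatever the weights give up, the cache pool gains. A sketch with the per-GPU numbers used above (80 GB A100, 0.90 utilization, 2 GB activation/overhead reserve):

```python
def kv_cache_gb(weights_gb, total_gb=80.0, util=0.90, overhead_gb=2.0):
    """Per-GPU KV cache budget: usable memory minus weights and overhead."""
    return total_gb * util - weights_gb - overhead_gb


# Per-GPU weight footprints from the quantization table
for fmt, weights in [("FP16", 32.5), ("FP8", 16.3), ("GPTQ-INT4", 9.2)]:
    print(fmt, round(kv_cache_gb(weights), 1))
# FP16 37.5, FP8 53.7, GPTQ-INT4 60.8 -- matching the table's KV cache column
```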
End-to-End Optimization Workflow
Here is the systematic workflow for optimizing a vLLM deployment:
# Step 1: Establish baseline
# Run with default settings, measure all 4 metrics
# Step 2: Identify the bottleneck
def identify_bottleneck(profile_data):
    if profile_data["gpu_compute_util"] < 50:
        return "BATCH_TOO_SMALL"
    if profile_data["kv_cache_util"] > 95:
        return "KV_CACHE_EXHAUSTED"
    if profile_data["preemption_rate"] > 0.01:
        return "TOO_MANY_SEQUENCES"
    if profile_data["p99_ttft"] > profile_data["sla_ttft"]:
        return "PREFILL_TOO_SLOW"
    if profile_data["nccl_fraction"] > 15:
        return "COMMUNICATION_BOUND"
    return "NEAR_OPTIMAL"

# Step 3: Apply targeted optimization
optimizations = {
    "BATCH_TOO_SMALL": "Increase --max-num-seqs or request rate",
    "KV_CACHE_EXHAUSTED": "Reduce --max-model-len or use quantization",
    "TOO_MANY_SEQUENCES": "Reduce --max-num-seqs",
    "PREFILL_TOO_SLOW": "Enable chunked prefill or prefix caching",
    "COMMUNICATION_BOUND": "Reduce TP degree or use faster interconnect",
}
# Step 4: Re-benchmark and iterate
Production Monitoring Metrics
Deploy these Prometheus metrics for ongoing optimization:
# Key metrics to export
from prometheus_client import Histogram, Gauge, Counter

ttft_histogram = Histogram(
    "vllm_ttft_seconds",
    "Time to first token",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
)
tpot_histogram = Histogram(
    "vllm_tpot_seconds",
    "Time per output token",
    buckets=[0.005, 0.01, 0.02, 0.05, 0.1, 0.25],
)
kv_cache_usage = Gauge(
    "vllm_kv_cache_usage_ratio",
    "Fraction of KV cache blocks in use",
)
running_sequences = Gauge(
    "vllm_running_sequences",
    "Number of sequences currently being decoded",
)
preemption_counter = Counter(
    "vllm_preemptions_total",
    "Total number of sequence preemptions",
)
The preemption counter is the single most important metric for detecting throughput degradation. A preemption rate above 1% of requests indicates KV cache pressure. Each preemption discards computed KV cache and recomputes it later, roughly doubling the compute spent on the preempted tokens. If you see preemptions, reduce max-num-seqs or increase KV cache capacity via quantization or more GPUs.
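The 1% alert condition is easy to state precisely. A standalone sketch of the check (in production the counts would come from the Prometheus counters above; the class and threshold are mine):

```python
class PreemptionTracker:
    """Track preemptions per completed request and flag KV cache pressure.
    A standalone sketch; the 1% threshold matches the rule of thumb above."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold
        self.preemptions = 0
        self.completed = 0

    def record(self, was_preempted):
        self.completed += 1
        if was_preempted:
            self.preemptions += 1

    def under_pressure(self):
        rate = self.preemptions / self.completed if self.completed else 0.0
        return rate > self.threshold


tracker = PreemptionTracker()
for i in range(500):
    tracker.record(was_preempted=(i % 40 == 0))  # simulate a 2.6% preemption rate
tracker.under_pressure()  # -> True: reduce max-num-seqs or add KV capacity
```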
Summary
Benchmarking vLLM v1 requires measuring four independent metrics: TTFT, TPOT, throughput, and memory utilization. The optimal configuration depends on your SLA priorities. For latency-sensitive interactive serving, optimize for TTFT and TPOT by keeping batch sizes moderate (32-64) and enabling chunked prefill. For throughput-oriented batch processing, push batch sizes to 128+ and use quantized models to maximize KV cache capacity. Always benchmark with representative workload distributions (input/output length distributions from your actual traffic) rather than synthetic fixed-length prompts. Monitor preemption rate in production as the leading indicator of capacity exhaustion.