Single-request inference on an A100 achieves 15% GPU utilization — 85% of the chip sits idle while memory controllers fetch weights from HBM. Batching eight requests together pushes utilization to 72% and increases throughput from 45 tokens/sec to 210 tokens/sec. But batching has a cost: P50 latency rises from 22ms to 38ms because each decode step now processes eight sequences instead of one. The optimization challenge is finding the batch size that maximizes throughput without pushing latency past your SLA — typically batch size 8-16 for interactive serving, 32-64 for batch processing.
## The Batching Throughput Curve
Increasing batch size improves throughput until memory bandwidth or capacity becomes the bottleneck:
Batch Size Impact on Performance (Llama-7B, A100-80GB, seq_len=512)
| Batch Size | Throughput (tok/s) | P50 Latency (ms) | GPU Util | Regime |
|---|---|---|---|---|
| 1 | 45 | 22 | 15% | Memory-latency bound |
| 4 | 128 | 31 | 48% | Scaling well |
| 8 | 210 | 38 | 72% | Good balance |
| 16 | 285 | 56 | 88% | Near saturation |
| 32 | 320 | 100 | 95% | Bandwidth-limited |
| 64 | 330 | 195 | 96% | Diminishing returns + latency penalty |
The sweet spot here is batch size 8-16: throughput has reached 70-90% of peak while latency remains under 60ms. Beyond 32, throughput barely increases while latency doubles — the GPU is saturated and you’re just queuing more work.
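To reproduce a curve like this on your own hardware, here is a minimal benchmarking sketch, assuming a Hugging Face causal LM on a CUDA GPU; the model name and prompt are placeholders, and real serving traffic will shift the exact numbers:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="cuda")

def measure(batch_size, prompt="Explain batching in LLM inference.", new_tokens=128):
    """Return (tokens/sec, per-decode-step latency in ms) for one batch size."""
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    throughput = batch_size * new_tokens / elapsed   # aggregate tokens/sec
    step_latency_ms = 1000 * elapsed / new_tokens    # time per decode step
    return throughput, step_latency_ms

for bs in (1, 4, 8, 16, 32):
    print(bs, measure(bs))
```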
*(Figure: Throughput and Latency vs Batch Size, plotting the throughput and P50 latency columns from the table above.)*
At batch size 1, the GPU loads model weights from HBM for a single token’s worth of computation. At batch size 16, those same weights serve 16 tokens. Weight loading (the dominant cost in decode) is amortized 16x, which is why throughput scales nearly linearly until memory bandwidth saturates.
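Back-of-envelope, the amortization looks like this, assuming roughly 14 GB of FP16 weights that must be streamed from HBM once per decode step:

```python
WEIGHT_BYTES = 14e9  # Llama-7B in FP16

def weight_traffic_per_token(batch_size):
    """HBM weight reads attributed to each generated token: one decode step
    streams the full weights once, shared by every sequence in the batch."""
    return WEIGHT_BYTES / batch_size

for bs in (1, 8, 16):
    gb = weight_traffic_per_token(bs) / 1e9
    print(f"batch={bs:2d}: {gb:.2f} GB of weight reads per generated token")
# batch= 1: 14.00 GB of weight reads per generated token
# batch= 8: 1.75 GB of weight reads per generated token
# batch=16: 0.88 GB of weight reads per generated token
```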
## The Padding Problem
When batching variable-length sequences, all sequences must be padded to the length of the longest. This wastes compute on padding tokens that produce no useful output.
Padding Waste by Sequence Length Distribution
| Distribution | Avg Length | Max Length | Padding Waste | Effective Throughput Loss |
|---|---|---|---|---|
| Uniform (all same) | 512 | 512 | 0% | None |
| Moderate variation | 400 | 600 | 18% | ~15% throughput loss |
| High variation | 256 | 1024 | 42% | ~35% throughput loss |
| Extreme (chat + long doc) | 128 | 2048 | 68% | ~55% throughput loss |
With extreme length variation, more than half your compute is wasted on padding. This is the primary motivation for smarter batching strategies.
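A quick way to gauge this for your own traffic is to compare real token counts against the padded batch size; a minimal sketch:

```python
def padding_waste(lengths):
    """Fraction of batch compute spent on padding when every sequence
    is padded to the longest one in the batch."""
    padded_tokens = len(lengths) * max(lengths)
    real_tokens = sum(lengths)
    return 1 - real_tokens / padded_tokens

print(padding_waste([512, 512, 512, 512]))   # 0.0   (uniform lengths)
print(padding_waste([128, 300, 700, 2048]))  # ~0.61 (chat turns mixed with a long document)
```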
## Dynamic Batching: Group by Length
The simplest improvement: sort incoming requests by length and batch together similarly-sized sequences. This reduces padding to near-zero within each batch at the cost of slightly increased wait time for short sequences.
```python
def form_batch_by_length(queue, max_batch_size=16, length_tolerance=1.2):
    """Group requests with similar token lengths to minimize padding."""
    if not queue:
        return []
    # Shortest-first ordering keeps each batch's lengths tightly clustered.
    queue.sort(key=lambda r: len(r.tokens))
    batch = [queue[0]]
    base_length = len(queue[0].tokens)
    for req in queue[1:]:
        if len(batch) >= max_batch_size:
            break
        # The queue is sorted, so the first request over the tolerance ends the batch.
        if len(req.tokens) > base_length * length_tolerance:
            break
        batch.append(req)
    return batch
```
With a 20% length tolerance (length_tolerance=1.2), padding waste drops from 42% to under 8% for typical workloads.
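A usage sketch, with a hypothetical Request type that carries only the tokens field the function above relies on:

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: list  # prompt token IDs

queue = [Request(tokens=[0] * n) for n in (80, 95, 100, 310, 512, 530)]
batch = form_batch_by_length(queue, max_batch_size=16, length_tolerance=1.2)
print([len(r.tokens) for r in batch])  # [80, 95]; 100 exceeds 80 * 1.2 = 96
```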
## Continuous Batching: The Real Win
Static batching waits for an entire batch to complete before starting the next one. If one request generates 200 tokens and another generates 20, the short request's slot sits idle for 180 decode steps, contributing nothing while the batch waits for its longest member to finish.
Continuous batching (also called iteration-level batching) fixes this by inserting new requests into empty slots as completed requests exit:
*(Figure: Batching Strategy Throughput Comparison, in tok/s.)*

Continuous batching achieves 2-3x higher throughput than static batching because:
- No idle slots: Empty slots are immediately filled with new requests
- No padding waste: Each request processes only its own tokens
- No batch completion stalls: The GPU is always doing useful work
vLLM’s core innovation (PagedAttention + continuous batching) is exactly this: decouple memory allocation from batch composition so requests can enter and exit the batch at any iteration. The throughput improvement is transformative for production serving.
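The scheduling loop itself is simple. Here is a minimal sketch, assuming a decode_step callable that advances every active request by one token and a finished flag on each request; this is the shape of the idea, not vLLM's actual API:

```python
def continuous_batching_loop(waiting, decode_step, max_active=32):
    """Iteration-level scheduling: refill free slots every decode step
    instead of waiting for a whole batch to finish."""
    active = []
    while waiting or active:
        # Fill free slots from the waiting queue before each step.
        while waiting and len(active) < max_active:
            active.append(waiting.pop(0))
        # Advance every active request by one token.
        decode_step(active)
        # Retire finished requests immediately so their slots free up.
        active = [r for r in active if not r.finished]
```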
## Memory Constraints on Batch Size
The maximum batch size is constrained by GPU memory. Model weights are a fixed cost; the consumer that grows with batch size, and ultimately caps it, is the KV cache:
Memory Budget Analysis (A100-80GB, Llama-7B)
| Component | Size | Scales With |
|---|---|---|
| Model weights (FP16) | 14 GB | Fixed |
| KV cache per request (seq=2048) | 1.0 GB | batch_size x seq_len |
| Activations per request | 0.2 GB | batch_size x seq_len |
| CUDA overhead | ~2 GB | Fixed |
With PagedAttention, KV memory is allocated dynamically (only for actual tokens, not reserved for max sequence length), which typically allows 3-5x more concurrent requests.
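A back-of-envelope version of this budget, using Llama-7B's shapes (32 layers, 32 KV heads, head dim 128, FP16 KV entries) and treating the 512-token average request length as a workload assumption:

```python
GPU_MEM_GB  = 80
WEIGHTS_GB  = 14                                # Llama-7B, FP16
OVERHEAD_GB = 2                                 # CUDA context and workspace
KV_BYTES_PER_TOKEN  = 2 * 32 * 32 * 128 * 2    # K+V x layers x heads x head_dim x FP16
ACT_BYTES_PER_TOKEN = 0.2e9 / 2048             # from the table: 0.2 GB at seq_len=2048

def max_concurrent(tokens_per_request):
    """Requests that fit when each one holds tokens_per_request KV slots."""
    per_request_gb = tokens_per_request * (KV_BYTES_PER_TOKEN + ACT_BYTES_PER_TOKEN) / 1e9
    free_gb = GPU_MEM_GB - WEIGHTS_GB - OVERHEAD_GB
    return int(free_gb / per_request_gb)

print(max_concurrent(2048))  # static: every request reserves the full context (~50)
print(max_concurrent(512))   # paged: pay only for actual tokens; 512 avg assumed (~200)
```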
## Optimal Batch Size Selection
The optimal batch size depends on your optimization target:
Batch Size Selection Guide
| Optimization Target | Recommended BS | Rationale |
|---|---|---|
| Minimum latency | 1-4 | Lowest queueing delay |
| Balanced (interactive) | 8-16 | Good throughput with under 100ms P99 latency |
| Maximum throughput | 32-64 | GPU saturated, latency secondary |
| Continuous batching | 16-32 active | Dynamic sizing handles the rest |
In practice, most production deployments use continuous batching with a maximum of 16-32 concurrent requests, letting the batcher dynamically adjust based on memory pressure and request arrival rate.
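With vLLM, for instance, that concurrency cap is exposed as max_num_seqs. A minimal sketch follows; parameter names can vary across versions, so check the docs for your installed release, and the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Continuous batching is on by default; max_num_seqs caps concurrent requests
# and gpu_memory_utilization bounds how much HBM the KV cache may claim.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    max_num_seqs=32,
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(
    ["Summarize continuous batching in one sentence."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```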
## Conclusion
Batch processing transforms LLM inference from 15% GPU utilization to 95%+. Each step in the progression from static to dynamic to continuous batching delivers a meaningful throughput improvement. Continuous batching eliminates both padding waste and idle-slot waste, achieving 2-3x the throughput of static batching. The primary constraint is KV cache memory, which limits the maximum number of concurrent requests — making memory-efficient KV management (PagedAttention, KV quantization) a critical enabling technology.