Single-request inference on an A100 achieves 15% GPU utilization — 85% of the chip sits idle while memory controllers fetch weights from HBM. Batching eight requests together pushes utilization to 72% and increases throughput from 45 tokens/sec to 210 tokens/sec. But batching has a cost: P50 latency rises from 22ms to 38ms because each decode step now processes eight sequences instead of one. The optimization challenge is finding the batch size that maximizes throughput without pushing latency past your SLA — typically batch size 8-16 for interactive serving, 32-64 for batch processing.
## The Batching Throughput Curve
Increasing batch size improves throughput until memory bandwidth or capacity becomes the bottleneck:
Batch Size Impact on Performance (Llama-7B, A100-80GB, seq_len=512)
| Batch Size | Throughput (tok/s) | P50 Latency (ms) | GPU Util | Regime |
|---|---|---|---|---|
| 1 | 45 | 22 | 15% | Memory-latency bound |
| 4 | 128 | 31 | 48% | Scaling well |
| 8 | 210 | 38 | 72% | Good balance |
| 16 | 285 | 56 | 88% | Near saturation |
| 32 | 320 | 100 | 95% | Bandwidth-limited |
| 64 | 330 | 195 | 96% | Diminishing returns + latency penalty |
The sweet spot here is batch size 8-16: throughput has reached 70-90% of peak while latency remains under 60ms. Beyond 32, throughput barely increases while latency doubles — the GPU is saturated and you’re just queuing more work.
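To reproduce a curve like this on your own hardware, here is a minimal benchmarking sketch, assuming a Hugging Face causal LM on a CUDA GPU; the model name and prompt are placeholders, and real serving traffic will shift the exact numbers:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="cuda")

def measure(batch_size, prompt="Explain batching in LLM inference.", new_tokens=128):
    """Return (tokens/sec, per-decode-step latency in ms) for one batch size."""
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt", padding=True).to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    throughput = batch_size * new_tokens / elapsed   # aggregate tokens/sec
    step_latency_ms = 1000 * elapsed / new_tokens    # time per decode step
    return throughput, step_latency_ms

for bs in (1, 4, 8, 16, 32):
    print(bs, measure(bs))
```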
*(Figure: Throughput and Latency vs Batch Size, plotting the throughput and P50 latency columns from the table above.)*
At batch size 1, the GPU loads model weights from HBM for a single token’s worth of computation. At batch size 16, those same weights serve 16 tokens. Weight loading (the dominant cost in decode) is amortized 16x, which is why throughput scales nearly linearly until memory bandwidth saturates.
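Back-of-envelope, the amortization looks like this, assuming roughly 14 GB of FP16 weights that must be streamed from HBM once per decode step:

```python
WEIGHT_BYTES = 14e9  # Llama-7B in FP16

def weight_traffic_per_token(batch_size):
    """HBM weight reads attributed to each generated token: one decode step
    streams the full weights once, shared by every sequence in the batch."""
    return WEIGHT_BYTES / batch_size

for bs in (1, 8, 16):
    gb = weight_traffic_per_token(bs) / 1e9
    print(f"batch={bs:2d}: {gb:.2f} GB of weight reads per generated token")
# batch= 1: 14.00 GB of weight reads per generated token
# batch= 8: 1.75 GB of weight reads per generated token
# batch=16: 0.88 GB of weight reads per generated token
```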
## The Padding Problem
When batching variable-length sequences, all sequences must be padded to the length of the longest. This wastes compute on padding tokens that produce no useful output.
Padding Waste by Sequence Length Distribution
| Distribution | Avg Length | Max Length | Padding Waste | Effective Throughput Loss |
|---|---|---|---|---|
| Uniform (all same) | 512 | 512 | 0% | None |
| Moderate variation | 400 | 600 | 18% | ~15% throughput loss |
| High variation | 256 | 1024 | 42% | ~35% throughput loss |
| Extreme (chat + long doc) | 128 | 2048 | 68% | ~55% throughput loss |
With extreme length variation, more than half your compute is wasted on padding. This is the primary motivation for smarter batching strategies.
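A quick way to gauge this for your own traffic is to compare real token counts against the padded batch size; a minimal sketch:

```python
def padding_waste(lengths):
    """Fraction of batch compute spent on padding when every sequence
    is padded to the longest one in the batch."""
    padded_tokens = len(lengths) * max(lengths)
    real_tokens = sum(lengths)
    return 1 - real_tokens / padded_tokens

print(padding_waste([512, 512, 512, 512]))   # 0.0   (uniform lengths)
print(padding_waste([128, 300, 700, 2048]))  # ~0.61 (chat turns mixed with a long document)
```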
## Dynamic Batching: Group by Length
The simplest improvement: sort incoming requests by length and batch together similarly-sized sequences. This reduces padding to near-zero within each batch at the cost of slightly increased wait time for short sequences.
```python
def form_batch_by_length(queue, max_batch_size=16, length_tolerance=1.2):
    """Group requests with similar token lengths to minimize padding."""
    if not queue:
        return []
    # Shortest-first ordering keeps each batch's lengths tightly clustered.
    queue.sort(key=lambda r: len(r.tokens))
    batch = [queue[0]]
    base_length = len(queue[0].tokens)
    for req in queue[1:]:
        if len(batch) >= max_batch_size:
            break
        # The queue is sorted, so the first request over the tolerance ends the batch.
        if len(req.tokens) > base_length * length_tolerance:
            break
        batch.append(req)
    return batch
```
With a 20% length tolerance (length_tolerance=1.2), padding waste drops from 42% to under 8% for typical workloads.
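A usage sketch, with a hypothetical Request type that carries only the tokens field the function above relies on:

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: list  # prompt token IDs

queue = [Request(tokens=[0] * n) for n in (80, 95, 100, 310, 512, 530)]
batch = form_batch_by_length(queue, max_batch_size=16, length_tolerance=1.2)
print([len(r.tokens) for r in batch])  # [80, 95]; 100 exceeds 80 * 1.2 = 96
```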
## Continuous Batching: The Real Win
Static batching waits for an entire batch to complete before starting the next one. If one request generates 200 tokens and another generates 20, the short request's slot sits idle for 180 decode steps, contributing nothing while the batch waits for its longest member to finish.
Continuous batching (also called iteration-level batching) fixes this by inserting new requests into empty slots as completed requests exit:
*(Figure: Batching Strategy Throughput Comparison, in tok/s.)*

Continuous batching achieves 2-3x higher throughput than static batching because:
- No idle slots: Empty slots are immediately filled with new requests
- No padding waste: Each request processes only its own tokens
- No batch completion stalls: The GPU is always doing useful work
vLLM’s core innovation (PagedAttention + continuous batching) is exactly this: decouple memory allocation from batch composition so requests can enter and exit the batch at any iteration. The throughput improvement is transformative for production serving.
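The scheduling loop itself is simple. Here is a minimal sketch, assuming a decode_step callable that advances every active request by one token and a finished flag on each request; this is the shape of the idea, not vLLM's actual API:

```python
def continuous_batching_loop(waiting, decode_step, max_active=32):
    """Iteration-level scheduling: refill free slots every decode step
    instead of waiting for a whole batch to finish."""
    active = []
    while waiting or active:
        # Fill free slots from the waiting queue before each step.
        while waiting and len(active) < max_active:
            active.append(waiting.pop(0))
        # Advance every active request by one token.
        decode_step(active)
        # Retire finished requests immediately so their slots free up.
        active = [r for r in active if not r.finished]
```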
## Memory Constraints on Batch Size
The maximum batch size is constrained by GPU memory. Model weights are a fixed cost; the consumer that grows with batch size, and ultimately caps it, is the KV cache:
Memory Budget Analysis (A100-80GB, Llama-7B)
| Component | Size | Scales With |
|---|---|---|
| Model weights (FP16) | 14 GB | Fixed |
| KV cache per request (seq=2048) | 1.0 GB | batch_size x seq_len |
| Activations per request | 0.2 GB | batch_size x seq_len |
| CUDA overhead | ~2 GB | Fixed |
With PagedAttention, KV memory is allocated dynamically (only for actual tokens, not reserved for max sequence length), which typically allows 3-5x more concurrent requests.
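A back-of-envelope version of this budget, using Llama-7B's shapes (32 layers, 32 KV heads, head dim 128, FP16 KV entries) and treating the 512-token average request length as a workload assumption:

```python
GPU_MEM_GB  = 80
WEIGHTS_GB  = 14                                # Llama-7B, FP16
OVERHEAD_GB = 2                                 # CUDA context and workspace
KV_BYTES_PER_TOKEN  = 2 * 32 * 32 * 128 * 2    # K+V x layers x heads x head_dim x FP16
ACT_BYTES_PER_TOKEN = 0.2e9 / 2048             # from the table: 0.2 GB at seq_len=2048

def max_concurrent(tokens_per_request):
    """Requests that fit when each one holds tokens_per_request KV slots."""
    per_request_gb = tokens_per_request * (KV_BYTES_PER_TOKEN + ACT_BYTES_PER_TOKEN) / 1e9
    free_gb = GPU_MEM_GB - WEIGHTS_GB - OVERHEAD_GB
    return int(free_gb / per_request_gb)

print(max_concurrent(2048))  # static: every request reserves the full context (~50)
print(max_concurrent(512))   # paged: pay only for actual tokens; 512 avg assumed (~200)
```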
## Optimal Batch Size Selection
The optimal batch size depends on your optimization target:
Batch Size Selection Guide
| Optimization Target | Recommended BS | Rationale |
|---|---|---|
| Minimum latency | 1-4 | Lowest queueing delay |
| Balanced (interactive) | 8-16 | Good throughput with under 100ms P99 latency |
| Maximum throughput | 32-64 | GPU saturated, latency secondary |
| Continuous batching | 16-32 active | Dynamic sizing handles the rest |
In practice, most production deployments use continuous batching with a maximum of 16-32 concurrent requests, letting the batcher dynamically adjust based on memory pressure and request arrival rate.
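With vLLM, for instance, that concurrency cap is exposed as max_num_seqs. A minimal sketch follows; parameter names can vary across versions, so check the docs for your installed release, and the model name is a placeholder:

```python
from vllm import LLM, SamplingParams

# Continuous batching is on by default; max_num_seqs caps concurrent requests
# and gpu_memory_utilization bounds how much HBM the KV cache may claim.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    max_num_seqs=32,
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(
    ["Summarize continuous batching in one sentence."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```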
## Conclusion
Batch processing transforms LLM inference from 15% GPU utilization to 95%+. Each step in the progression from static to dynamic to continuous batching delivers a meaningful throughput improvement. Continuous batching eliminates both padding waste and idle-slot waste, achieving 2-3x the throughput of static batching. The primary constraint is KV cache memory, which limits the maximum number of concurrent requests — making memory-efficient KV management (PagedAttention, KV quantization) a critical enabling technology.