Single-request inference on an A100 achieves 15% GPU utilization — 85% of the chip sits idle while memory controllers fetch weights from HBM. Batching eight requests together pushes utilization to 72% and increases throughput from 45 tokens/sec to 210 tokens/sec. But batching has a cost: P50 latency rises from 22ms to 38ms because each decode step now processes eight sequences instead of one. The optimization challenge is finding the batch size that maximizes throughput without pushing latency past your SLA — typically batch size 8-16 for interactive serving, 32-64 for batch processing.

The Batching Throughput Curve

Increasing batch size improves throughput until memory bandwidth or capacity becomes the bottleneck:

Batch Size Impact on Performance (Llama-7B, A100-80GB, seq_len=512)

| Batch Size | Throughput (tok/s) | P50 Latency (ms) | GPU Util | Regime |
|---|---|---|---|---|
| 1 | 45 | 22 | 15% | Memory-latency bound |
| 4 | 128 | 31 | 48% | Scaling well |
| 8 | 210 | 38 | 72% | Good balance |
| 16 | 285 | 56 | 88% | Near saturation |
| 32 | 320 | 100 | 95% | Bandwidth-limited |
| 64 | 330 | 195 | 96% | Diminishing returns + latency penalty |

The sweet spot here is batch size 8-16: throughput has reached 70-90% of peak while latency remains under 60ms. Beyond 32, throughput barely increases while latency doubles — the GPU is saturated and you’re just queuing more work.
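The sweet-spot choice can be automated: given measured (batch size, throughput, latency) points, pick the largest batch size whose P50 latency still meets the SLA. A minimal sketch using the numbers from the table above (the `MEASUREMENTS` list and the SLA values are illustrative assumptions):

```python
# Measured (batch_size, throughput_tok_s, p50_latency_ms) points from the table above.
MEASUREMENTS = [
    (1, 45, 22), (4, 128, 31), (8, 210, 38),
    (16, 285, 56), (32, 320, 100), (64, 330, 195),
]

def pick_batch_size(measurements, latency_sla_ms):
    """Largest batch size whose P50 latency stays within the SLA."""
    feasible = [m for m in measurements if m[2] <= latency_sla_ms]
    if not feasible:
        return min(measurements)[0]  # nothing meets the SLA; fall back to BS=1
    return max(feasible)[0]          # tuples sort by batch size first

print(pick_batch_size(MEASUREMENTS, latency_sla_ms=60))   # interactive → 16
print(pick_batch_size(MEASUREMENTS, latency_sla_ms=200))  # batch processing → 64
```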

(Figure: Throughput and Latency vs Batch Size. Line chart: throughput climbs from 45 to 330 tok/s while P50 latency grows from 22 ms to 195 ms across batch sizes 1 to 64.)
Why Batching Helps

At batch size 1, the GPU loads model weights from HBM for a single token’s worth of computation. At batch size 16, those same weights serve 16 tokens. Weight loading (the dominant cost in decode) is amortized 16x, which is why throughput scales nearly linearly until memory bandwidth saturates.
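The amortization argument can be made quantitative with a rough memory-bandwidth roofline: in a memory-bound decode step, step time is at least weights_read / bandwidth, so the throughput ceiling scales linearly with batch size. This is only a sketch — the 14 GB weight size and ~2 TB/s A100-80GB HBM bandwidth are approximations, and the model ignores KV-cache and activation traffic, which is why the measured 45 tok/s at BS=1 sits well below the ceiling:

```python
WEIGHT_BYTES = 14e9      # Llama-7B weights in FP16
HBM_BANDWIDTH = 2.0e12   # A100-80GB HBM, ~2 TB/s (approximate)

def decode_throughput_ceiling(batch_size):
    """Upper bound on tok/s when every decode step must stream all weights once."""
    step_time_s = WEIGHT_BYTES / HBM_BANDWIDTH  # same cost for 1 or N sequences
    return batch_size / step_time_s             # N tokens produced per step

for bs in (1, 8, 16):
    print(f"BS={bs:2d}: <= {decode_throughput_ceiling(bs):.0f} tok/s")
```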

The Padding Problem

When batching variable-length sequences, all sequences must be padded to the length of the longest. This wastes compute on padding tokens that produce no useful output.

Padding Waste by Sequence Length Distribution

| Distribution | Avg Length | Max Length | Padding Waste | Effective Throughput Loss |
|---|---|---|---|---|
| Uniform (all same) | 512 | 512 | 0% | None |
| Moderate variation | 400 | 600 | 18% | ~15% throughput loss |
| High variation | 256 | 1024 | 42% | ~35% throughput loss |
| Extreme (chat + long doc) | 128 | 2048 | 68% | ~55% throughput loss |

Note: Batch size 16 in all cases. Waste = padding_tokens / total_tokens.

With extreme length variation, more than half your compute is wasted on padding. This is the primary motivation for smarter batching strategies.
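The waste figure follows from a simple ratio: pad every sequence in the batch to the longest, then divide padding tokens by total padded tokens. A small sketch (the example lengths are made up):

```python
def padding_waste(lengths):
    """Fraction of batch compute spent on padding tokens."""
    padded_total = len(lengths) * max(lengths)  # every row padded to the longest
    actual_total = sum(lengths)
    return (padded_total - actual_total) / padded_total

# A hypothetical mixed batch: short chat turns next to one long document.
print(f"{padding_waste([100, 250, 400, 512]):.0%}")  # → 38%
```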

Dynamic Batching: Group by Length

The simplest improvement: sort incoming requests by length and batch together similarly-sized sequences. This reduces padding to near-zero within each batch at the cost of slightly increased wait time for short sequences.

```python
def form_batch_by_length(queue, max_batch_size=16, length_tolerance=1.2):
    """Group requests with similar lengths to minimize padding."""
    if not queue:
        return []
    queue.sort(key=lambda r: len(r.tokens))

    batch = [queue[0]]
    base_length = len(queue[0].tokens)

    for req in queue[1:]:
        if len(batch) >= max_batch_size:
            break
        if len(req.tokens) > base_length * length_tolerance:
            break  # sorted order: every remaining request is even longer
        batch.append(req)

    # Remove the batched requests (a sorted prefix) so they aren't scheduled twice.
    del queue[:len(batch)]
    return batch
```

With a 20% length tolerance (length_tolerance=1.2), padding waste drops from 42% to under 8% for typical workloads.
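The effect of length-sorting can be checked by batching the same workload in arrival order versus sorted order; the lengths and batch size below are illustrative, not measured:

```python
def batch_waste(lengths, batch_size):
    """Average padding waste when slicing the list into consecutive batches."""
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    total_padded = sum(len(b) * max(b) for b in batches)  # each batch pads to its max
    total_actual = sum(lengths)
    return (total_padded - total_actual) / total_padded

lengths = [64, 900, 128, 850, 96, 1000, 80, 780]  # mixed chat + long-doc workload
print(f"arrival order: {batch_waste(lengths, 4):.0%}")          # → arrival order: 49%
print(f"length-sorted: {batch_waste(sorted(lengths), 4):.0%}")  # → length-sorted: 14%
```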

Continuous Batching: The Real Win

Static batching waits for an entire batch to complete before starting the next one. If one request generates 200 tokens and another generates 20, the GPU sits idle for 180 token steps while the short request’s slot is wasted.
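The idle-slot cost is easy to quantify: a static batch runs until its longest request finishes, so every shorter request wastes (longest - own) decode steps. With hypothetical generation lengths:

```python
gen_lengths = [200, 20, 150, 60]  # tokens generated by each request in the batch
steps = max(gen_lengths)          # batch runs until the longest request finishes
wasted = sum(steps - n for n in gen_lengths)
utilization = sum(gen_lengths) / (steps * len(gen_lengths))
print(f"{wasted} wasted slot-steps, {utilization:.0%} slot utilization")
# → 370 wasted slot-steps, 54% slot utilization
```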

Continuous batching (also called iteration-level batching) fixes this by inserting new requests into empty slots as completed requests exit:

Batching Strategy Throughput Comparison

| Strategy | Throughput (tok/s) | vs. no batching |
|---|---|---|
| No batching (BS=1) | 45 | baseline |
| Static batching (BS=16) | 285 | +533% |
| Dynamic batching (BS=16, length-sorted) | 340 | +656% |
| Continuous batching (BS=16) | 620 | +1278% |

Note: Continuous batching is ~2.2x static batching at the same batch size.

Continuous batching achieves 2-3x higher throughput than static batching because:

  1. No idle slots: Empty slots are immediately filled with new requests
  2. No padding waste: Each request processes only its own tokens
  3. No batch completion stalls: The GPU is always doing useful work
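A toy scheduler makes the difference concrete: each iteration decodes one token for every active request, and the moment a request finishes, its slot is refilled from the queue. The request lengths and slot count below are illustrative:

```python
from collections import deque

def continuous_batching_steps(gen_lengths, num_slots):
    """Decode steps needed when finished slots are refilled every iteration."""
    queue = deque(gen_lengths)
    slots = [queue.popleft() for _ in range(min(num_slots, len(queue)))]
    steps = 0
    while slots:
        steps += 1
        slots = [n - 1 for n in slots]       # one decode token per active slot
        slots = [n for n in slots if n > 0]  # finished requests exit...
        while queue and len(slots) < num_slots:
            slots.append(queue.popleft())    # ...and new ones enter immediately
    return steps

lengths = [200, 20, 150, 60, 40, 90, 30, 210]
print(continuous_batching_steps(lengths, num_slots=4))  # → 300
# Static batching needs the sum of per-batch maxima instead:
print(max(lengths[:4]) + max(lengths[4:]))              # → 410 (two batches of 4)
```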
ℹ️ This Is What vLLM Does

vLLM’s core innovation (PagedAttention + continuous batching) is exactly this: decouple memory allocation from batch composition so requests can enter and exit the batch at any iteration. The throughput improvement is transformative for production serving.

Memory Constraints on Batch Size

The maximum batch size is constrained by GPU memory. The dominant memory consumer is the KV cache:

Memory Budget Analysis (A100-80GB, Llama-7B)

| Component | Size | Scales With |
|---|---|---|
| Model weights (FP16) | 14 GB | Fixed |
| KV cache per request (seq=2048) | 1.0 GB | batch_size x seq_len |
| Activations per request | 0.2 GB | batch_size x seq_len |
| CUDA overhead | ~2 GB | Fixed |

Note: Available for KV: 80 - 14 - 2 = 64 GB -> max ~64 concurrent requests at seq_len=2048

With PagedAttention, KV memory is allocated dynamically (only for actual tokens, not reserved for max sequence length), which typically allows 3-5x more concurrent requests.
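The 1.0 GB per-request figure can be reproduced from the standard KV-cache formula: 2 (K and V) x layers x hidden_size x seq_len x bytes_per_element, using Llama-7B's published shape (32 layers, hidden size 4096) and an FP16 cache:

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence: K and V tensors at every layer."""
    return 2 * n_layers * hidden_size * seq_len * bytes_per_elem

# Llama-7B: 32 layers, hidden size 4096, FP16 cache, 2048-token sequence.
gb = kv_cache_bytes(32, 4096, 2048) / 2**30
print(f"{gb:.1f} GiB per request")  # → 1.0 GiB per request
```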

Optimal Batch Size Selection

The optimal batch size depends on your optimization target:

Batch Size Selection Guide

| Optimization Target | Recommended BS | Rationale |
|---|---|---|
| Minimum latency | 1-4 | Lowest queueing delay |
| Balanced (interactive) | 8-16 | Good throughput with under 100ms P99 latency |
| Maximum throughput | 32-64 | GPU saturated, latency secondary |
| Continuous batching | 16-32 active | Dynamic sizing handles the rest |

In practice, most production deployments use continuous batching with a maximum of 16-32 concurrent requests, letting the batcher dynamically adjust based on memory pressure and request arrival rate.

Conclusion

Batch processing transforms LLM inference from 15% GPU utilization to 95%+. The progression from static batching -> dynamic batching -> continuous batching each delivers a meaningful throughput improvement. Continuous batching eliminates both padding waste and idle slot waste, achieving 2-3x throughput over static batching. The primary constraint is KV cache memory, which limits the maximum number of concurrent requests — making memory-efficient KV management (PagedAttention, KV quantization) a critical enabling technology.