Static batching wastes GPU cycles waiting for the longest sequence. Continuous batching fills those gaps by adding new requests as others complete. But implementing it correctly requires solving several non-trivial scheduling problems.
## The Static Batching Problem
In static batching, a batch executes together until completion:
```
Time →
Request A: [████████████████████████████████████████] 1000 tokens
Request B: [████████]                                  200 tokens
Request C: [████████████████]                          400 tokens

GPU utilization during Request A's tail: 33% (only A running)
Total batch time: 1000 iterations
Wasted compute: 1400 token-iterations
```
With continuous batching:
```
Time →
Request A: [████████████████████████████████████████] 1000 tokens
Request B: [████████] 200 tokens
Request D:           [████████████] 300 tokens (added when B finished)
Request C: [████████████████] 400 tokens
Request E:                   [████████████████████] 500 tokens (added when C finished)

GPU utilization: ~95% throughout
```
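The gap between the two diagrams can be reproduced with a minimal discrete-time simulation. The request lengths and slot count below are the illustrative values from the diagrams, not measured data; note that with only the five requests shown, Request A's long tail still idles slots at the end, so the simulated utilization lands around 80% rather than the ~95% steady-state figure, which assumes a continuous stream of new arrivals.

```python
def simulate(lengths, slots, backlog=()):
    """Token-level batching simulation. Returns (iterations, utilization).

    Static batching: pass backlog=() so freed slots stay idle.
    Continuous batching: pass waiting requests in `backlog`; they are
    admitted the moment a running request finishes.
    """
    running = list(lengths[:slots])
    backlog = list(backlog)
    busy = total = it = 0
    while running:
        it += 1
        busy += len(running)   # slots doing useful work this iteration
        total += slots         # slots available this iteration
        running = [r - 1 for r in running if r > 1]
        while backlog and len(running) < slots:
            running.append(backlog.pop(0))  # fill the freed slot
    return it, busy / total

# Static: A=1000, B=200, C=400 run together until A completes.
it_s, util_s = simulate([1000, 200, 400], slots=3)

# Continuous: D=300 replaces B, E=500 replaces C as they finish.
it_c, util_c = simulate([1000, 200, 400], slots=3, backlog=[300, 500])

print(f"static:     {it_s} iterations, {util_s:.0%} utilization")
print(f"continuous: {it_c} iterations, {util_c:.0%} utilization")
```

The static run wastes 3000 − 1600 = 1400 token-iterations, matching the figure above; the continuous run completes 2400 tokens of work in the same 1000 iterations.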
## The Scheduler Architecture
vLLM's scheduler operates at iteration granularity:
```python
class IterationLevelScheduler:
    """
    Schedules sequences for each decode iteration.

    Must balance:
    - Memory constraints (KV cache)
    - Fairness (starvation prevention)
    - Throughput (batch size)
    - Latency SLOs
    """

    def __init__(self, config: SchedulerConfig):
        self.waiting_queue: List[SequenceGroup] = []  # New requests
        self.running_queue: List[SequenceGroup] = []  # Active sequences
        self.swapped_queue: List[SequenceGroup] = []  # Preempted to CPU
        self.block_manager = BlockManager(config)

    def schedule(self) -> SchedulerOutputs:
        """
        Called before each decode iteration.
        Returns which sequences to run and memory operations to perform.
        """
        # Phase 1: Handle running sequences
        running_scheduled = self._schedule_running()
        # Phase 2: Try to swap in preempted sequences
        swapped_scheduled = self._schedule_swapped()
        # Phase 3: Admit new sequences if memory is available
        waiting_scheduled = self._schedule_waiting()

        return SchedulerOutputs(
            scheduled_seqs=running_scheduled + swapped_scheduled + waiting_scheduled,
            preemption_ops=self._preempt_if_needed(),
            swap_in_ops=self._get_swap_ins(),
            swap_out_ops=self._get_swap_outs(),
        )
```
## Memory-Aware Scheduling
The scheduler must coordinate with the block manager:
```python
def _can_admit_sequence(self, seq_group: SequenceGroup) -> bool:
    """Check if we can admit a new sequence without preemption."""
    # Calculate blocks needed for this sequence's prompt
    prompt_tokens = len(seq_group.prompt_token_ids)
    blocks_needed = (prompt_tokens + self.block_size - 1) // self.block_size

    # Reserve space for at least N decode iterations
    min_decode_blocks = self.config.min_decode_blocks  # e.g., 16
    total_blocks_needed = blocks_needed + min_decode_blocks

    # Check against available blocks
    available_blocks = self.block_manager.get_available_blocks()
    if total_blocks_needed <= available_blocks:
        return True

    # Check whether preemption could free enough blocks
    preemptable_blocks = self._calculate_preemptable_blocks()
    return total_blocks_needed <= available_blocks + preemptable_blocks

def _calculate_preemptable_blocks(self) -> int:
    """
    Calculate how many blocks we could free via preemption.
    Lower-priority sequences can be preempted.
    """
    preemptable = 0
    for seq_group in reversed(self.running_queue):  # Lowest priority last
        if seq_group.priority < self.config.preemption_threshold:
            preemptable += self.block_manager.get_blocks_for_seq(seq_group)
    return preemptable
```
Preemption by swapping requires copying the KV cache to CPU memory (swap out) and back (swap in). On an A100 this costs roughly 2 ms per GB swapped, so frequent preemption can come to dominate latency.
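The bytes moved per preempted sequence follow directly from the KV-cache geometry. As a sketch: the model dimensions below are Llama-2-7B-style values chosen purely for illustration, and the per-GB cost is the ~2 ms/GB estimate quoted above, not a measurement:

```python
# KV-cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.
# Llama-2-7B-style dimensions, fp16 — illustrative assumptions only.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 32, 128, 2

def kv_bytes(num_tokens: int) -> int:
    """Total KV-cache bytes held by a sequence of `num_tokens` tokens."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * num_tokens

def swap_cost_ms(num_tokens: int, ms_per_gb: float = 2.0) -> float:
    """Estimated swap time using the per-GB figure from the text."""
    return kv_bytes(num_tokens) / 1e9 * ms_per_gb

seq_len = 2048  # a 2K-token sequence
print(f"KV cache: {kv_bytes(seq_len) / 1e9:.2f} GB, "
      f"swap: {swap_cost_ms(seq_len):.2f} ms")
```

At these dimensions a single 2K-token sequence already holds about 1 GB of KV cache (512 KB per token), which is why preempting a handful of long sequences every few iterations quickly adds up.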
## Preemption Strategies
Two main preemption approaches:
### 1. Recomputation (No Swap)
```python
class RecomputationPreemption:
    """
    Discard the KV cache and recompute on resume.
    Best when: prefill is cheap relative to swap overhead.
    """

    def preempt(self, seq_group: SequenceGroup) -> PreemptionResult:
        # Simply free the blocks — the sequence will re-prefill on resume
        freed_blocks = self.block_manager.free_blocks(seq_group)

        # Track how much work will be redone
        recompute_tokens = seq_group.get_completed_tokens()

        return PreemptionResult(
            method='recompute',
            freed_blocks=freed_blocks,
            recompute_cost_tokens=recompute_tokens,
            swap_cost_bytes=0,
        )
```
### 2. Swapping (Preserve State)
```python
class SwapPreemption:
    """
    Copy the KV cache to CPU memory.
    Best when: the sequence has generated many tokens (expensive to recompute).
    """

    def preempt(self, seq_group: SequenceGroup) -> PreemptionResult:
        # Get the physical blocks to swap
        blocks = self.block_manager.get_blocks(seq_group)

        # Calculate the swap cost
        bytes_to_swap = len(blocks) * self.block_size_bytes

        # Initiate an async copy to CPU
        cpu_blocks = self.cpu_allocator.allocate(len(blocks))
        self.swap_engine.swap_out_async(blocks, cpu_blocks)

        # Update the block table to point to CPU
        self.block_manager.swap_out(seq_group, cpu_blocks)

        return PreemptionResult(
            method='swap',
            freed_blocks=len(blocks),
            recompute_cost_tokens=0,
            swap_cost_bytes=bytes_to_swap,
        )
```
### Decision Logic
```python
def choose_preemption_method(self, seq_group: SequenceGroup) -> str:
    """Choose between recomputation and swapping."""
    generated_tokens = seq_group.get_num_generated_tokens()
    blocks_used = self.block_manager.get_blocks_for_seq(seq_group)

    # Estimate the cost of each option
    recompute_cost_ms = self._estimate_prefill_time(
        seq_group.get_prompt_len() + generated_tokens
    )
    swap_cost_ms = self._estimate_swap_time(blocks_used * self.block_size_bytes)

    # Also consider: will this sequence likely be resumed soon?
    resume_probability = self._estimate_resume_probability(seq_group)

    # If recompute is cheap, or the sequence won't resume soon, recompute
    if recompute_cost_ms < swap_cost_ms or resume_probability < 0.5:
        return 'recompute'
    return 'swap'
```
## Priority and Fairness
Without fairness mechanisms, long sequences can starve:
```python
import time

class FairnessScheduler(IterationLevelScheduler):
    """Implements weighted fair queueing for inference requests."""

    def __init__(self, config: SchedulerConfig):
        super().__init__(config)
        self.virtual_time = 0.0

    def _calculate_priority(self, seq_group: SequenceGroup) -> float:
        """
        Priority based on virtual finish time.
        Lower = higher priority (will finish earlier in a fair schedule).
        """
        # Base weight from the request, e.g. 1.0 for normal priority
        weight = seq_group.priority_weight

        # Calculate the "fair" finish time
        remaining_tokens = seq_group.max_tokens - seq_group.get_num_generated_tokens()
        virtual_finish_time = self.virtual_time + remaining_tokens / weight

        # Adjust for waiting time (prevent starvation)
        wait_time = time.time() - seq_group.arrival_time
        starvation_boost = wait_time / self.config.max_wait_time  # 0 to 1

        return virtual_finish_time * (1 - 0.3 * starvation_boost)

    def _schedule_running(self) -> List[SequenceGroup]:
        # Sort by priority (lowest virtual finish time first)
        self.running_queue.sort(key=self._calculate_priority)

        # Take the top N that fit in memory
        scheduled = []
        memory_used = 0
        for seq_group in self.running_queue:
            blocks_needed = self._estimate_next_iteration_blocks(seq_group)
            if memory_used + blocks_needed <= self.config.max_batch_blocks:
                scheduled.append(seq_group)
                memory_used += blocks_needed

        # Advance virtual time
        if scheduled:
            self.virtual_time += 1.0 / len(scheduled)
        return scheduled
```
## Performance Characteristics
### Continuous vs. Static Batching Performance
| Metric | Static (batch=8) | Continuous (max=64) | Improvement |
|---|---|---|---|
| Throughput | 1,247 tok/s | 3,891 tok/s | +212% |
| P50 Latency | 45ms/tok | 52ms/tok | -15% |
| P99 Latency | 89ms/tok | 142ms/tok | -60% |
| GPU Utilization | 62% | 94% | +52% |
| Memory Efficiency | 34% | 89% | +162% |
Continuous batching improves throughput at the cost of tail latency: each request shares the GPU with more concurrent requests, which increases per-token latency.
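The improvement column follows directly from the two measurement columns. A quick check of the arithmetic, with the sign convention that positive means an improvement (so the latency rows, where lower is better, come out negative):

```python
def pct_change(static, continuous, lower_is_better=False):
    """Relative change in percent, signed so that positive = improvement."""
    if lower_is_better:
        return (static - continuous) / static * 100
    return (continuous - static) / static * 100

rows = [
    ("Throughput (tok/s)",    1247, 3891, False),
    ("P50 latency (ms/tok)",    45,   52, True),
    ("P99 latency (ms/tok)",    89,  142, True),
    ("GPU utilization (%)",     62,   94, False),
    ("Memory efficiency (%)",   34,   89, False),
]
for name, s, c, lower in rows:
    print(f"{name:24s} {pct_change(s, c, lower):+7.1f}%")
```

This reproduces the table to within rounding (the P50 row comes out at -15.6%, which the table rounds to -15%).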
## Chunked Prefill Integration
Modern continuous batching also chunks prefill:
```python
def schedule_with_chunked_prefill(self) -> SchedulerOutputs:
    """
    Interleave prefill chunks with decode iterations.
    Prevents a long prefill from blocking decode.
    """
    decode_seqs = []
    prefill_seqs = []
    prefill_budget = self.config.prefill_chunk_size  # e.g., 512 tokens

    # First, schedule decode (high priority)
    for seq in self.running_queue:
        if seq.is_decoding():
            decode_seqs.append(seq)

    # Then, schedule prefill chunks with the remaining budget
    for seq in self.waiting_queue:
        if prefill_budget <= 0:
            break
        if seq.is_prefilling():
            remaining_prefill = seq.get_remaining_prefill_tokens()
            chunk_size = min(remaining_prefill, prefill_budget)
            if chunk_size > 0 and self._can_schedule_prefill_chunk(seq, chunk_size):
                prefill_seqs.append((seq, chunk_size))
                prefill_budget -= chunk_size

    return SchedulerOutputs(
        decode_seqs=decode_seqs,
        prefill_seqs=prefill_seqs,
    )
```
### Latency Impact of Prefill Chunking

*(Figure: latency impact of prefill chunking, in ms; chart not recoverable.)*

## Debugging Scheduler Issues
Common scheduler problems and diagnosis:
```python
class SchedulerDiagnostics:
    def diagnose_low_throughput(self):
        metrics = self.get_metrics()

        if metrics.avg_batch_size < self.config.max_batch_size * 0.5:
            if metrics.memory_utilization > 0.9:
                print("Issue: Memory pressure limiting batch size")
                print(" - Consider: Reduce max_model_len or increase GPU memory")
            elif metrics.waiting_queue_avg > 0:
                print("Issue: Scheduler not admitting waiting requests")
                print(" - Check: Block allocation failures")

        if metrics.preemption_rate > 0.1:
            print(f"Issue: High preemption rate ({metrics.preemption_rate:.1%})")
            print(f" - Average swap size: {metrics.avg_swap_size_mb:.1f} MB")
            print(" - Consider: Increase GPU memory or reduce max sequence length")

        if metrics.avg_waiting_time > self.config.slo_ms:
            print("Issue: Queue time exceeding SLO")
            print(f" - Avg wait: {metrics.avg_waiting_time:.1f} ms")
            print(" - Consider: Add more replicas or reduce batch size for latency")
```
## Conclusion
Continuous batching transforms inference from a batch-processing problem into a real-time scheduling problem. The key challenges are:
- Memory coordination: Scheduler and block manager must cooperate
- Preemption strategy: Balance recomputation vs. swap costs
- Fairness: Prevent long-sequence starvation
- Prefill chunking: Don't let prefill block decode
Properly implemented, continuous batching delivers 2-4x throughput improvement with acceptable latency trade-offs.