You have deployed Llama 70B on an H100, and your SLA is P99 latency under 200 ms. At batch size 8, you hit 210 tokens/sec throughput and 180 ms P99, comfortably within SLA. At batch size 16, throughput rises to 285 tokens/sec, but P99 spikes to 350 ms and you blow the SLA. The scheduler is your only lever: batching policy, queueing discipline, and fairness mechanisms determine whether you meet your latency target while maximizing throughput. This post treats LLM serving as a queueing system, where batching trades throughput for latency, continuous decoding creates head-of-line blocking, and fairness between tenants requires explicit scheduling policy.
## Basic model: arrivals, service, and batching
Simplify the system to three quantities:
- requests arrive at rate λ (requests/sec)
- each token decode step costs t1 seconds on average if run alone
- batching B requests into one decode step costs roughly T(B) = t1 + c·(B − 1), where the marginal cost c per extra request is much smaller than t1

Throughput (tokens/sec) with batch size B is then X(B) = B / T(B), which rises with B because T(B) grows much more slowly than B itself.
Latency has two parts:
- queueing delay waiting to be batched
- service time once in a batch
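As a concrete sketch, assume the per-step cost grows linearly with batch size, T(B) = t1 + c·(B − 1); then throughput B / T(B) rises steeply at first and flattens. The values of t1 and c below are illustrative numbers, not measurements from any real GPU:

```python
# Illustrative cost model: t1 = solo step cost, c = marginal cost per
# extra request in the batch. These numbers are made up for the sketch.
def step_time(batch_size: int, t1: float = 0.020, c: float = 0.002) -> float:
    """Per-iteration decode cost T(B) = t1 + c * (B - 1), in seconds."""
    return t1 + c * (batch_size - 1)

def throughput(batch_size: int, t1: float = 0.020, c: float = 0.002) -> float:
    """Tokens per second: B tokens emitted per step of length T(B)."""
    return batch_size / step_time(batch_size, t1, c)

for b in (1, 4, 16):
    print(b, round(throughput(b), 1))
```

Doubling the batch size less than doubles the step time, which is exactly why batching pays, and why the service-time component of latency still creeps up.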
## Batching trade-offs
| Batch size | Throughput | Queueing delay | Tail latency risk |
|---|---|---|---|
| 1 | Low | None | Low |
| 4 | Medium | Small | Moderate |
| 16 | High | Can be large | High if arrivals bursty |
## Simple schedulers

### FIFO with fixed batch size
```python
class FifoBatchScheduler:
    def __init__(self, max_batch_size):
        self.queue = []
        self.max_batch = max_batch_size

    def enqueue(self, req):
        self.queue.append(req)

    def form_batch(self):
        # Wait until a full batch has accumulated.
        if len(self.queue) < self.max_batch:
            return []
        batch = self.queue[: self.max_batch]
        self.queue = self.queue[self.max_batch :]
        return batch
```
Pros:
- simple
- high utilization at high load
Cons:
- at low load, requests sit in the queue waiting for the batch to fill, inflating latency
### Timeout-based batching
```python
import time

class TimeoutBatchScheduler:
    def __init__(self, max_batch_size, max_wait_ms):
        self.queue = []
        self.max_batch = max_batch_size
        self.max_wait = max_wait_ms / 1000.0

    def enqueue(self, req):
        req.arrival = time.time()
        self.queue.append(req)

    def form_batch(self):
        if not self.queue:
            return []
        now = time.time()
        # Dispatch when the batch is full OR the oldest request has
        # waited max_wait seconds, whichever comes first.
        if (len(self.queue) >= self.max_batch
                or now - self.queue[0].arrival >= self.max_wait):
            batch = self.queue[: self.max_batch]
            self.queue = self.queue[self.max_batch :]
            return batch
        return []
```
This caps queueing delay at roughly max_wait and stabilizes p99 under bursty arrivals.
## Continuous decoding and iteration-level batches
Unlike prefill, decode steps repeat for each token. With continuous batching, at each decode iteration you:
- drop finished requests
- add new arrivals
- form a batch across all active requests
The GPU runs per-iteration batches of active sequences. Scheduling is about which sequences make it into each iteration and in what groupings.
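The three per-iteration steps above can be sketched as a minimal loop; `run_iterations`, `decode_step`, and the request attributes are illustrative stand-ins, not the API of any real serving engine:

```python
from collections import deque

def run_iterations(active, incoming: deque, max_batch: int, decode_step):
    """One decode iteration per loop: drop finished, admit new, run batch."""
    while active or incoming:
        # 1) drop finished requests
        active = [r for r in active if not r.done]
        # 2) admit new arrivals up to the batch cap
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())
        # 3) run one decode step over all active sequences
        if active:
            decode_step(active)
    return active
```

The key property is that admission happens every iteration, so a new request never waits for an entire long generation to finish before joining a batch.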
## Fairness vs throughput
Greedy batching (always filling the largest possible batch) can starve small, latency-sensitive requests behind a stream of long, high-throughput generations.
Simple mitigation:
- age-based priority: weight requests by waiting time
- tenant-aware limits: cap concurrent tokens per tenant
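Age-based priority can be as simple as sorting the queue by waiting time before cutting the batch. This is a sketch, assuming each request carries an `arrival` timestamp set at enqueue time:

```python
import time

def select_batch(queue, max_batch: int, now=None):
    """Pick the max_batch requests that have waited the longest."""
    now = time.time() if now is None else now
    # Oldest requests (largest waiting time) come first.
    scored = sorted(queue, key=lambda r: now - r.arrival, reverse=True)
    return scored[:max_batch]
```

A weighted variant would add a per-tenant or per-priority term to the sort key, which is where tenant-aware caps plug in.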
### Fairness policy examples
| Policy | Pros | Cons |
|---|---|---|
| Pure FIFO | Simple, fair by arrival | May waste batching opportunities |
| Greedy size-based | High throughput | Can hurt small requests |
| Age-weighted | Balances both | Slightly more complex |
## Metrics you should track
- tokens/sec per GPU (throughput)
- queueing delay distribution
- time-in-system distribution (end-to-end latency)
- utilization of GPU (SM active %, memory BW)
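The delay distributions can be tracked from per-request wait samples with a nearest-rank percentile; a minimal sketch (in production you would use a histogram or reservoir rather than a raw list):

```python
def percentile(samples, q: float) -> float:
    """Nearest-rank percentile over raw samples, q in [0, 100]."""
    if not samples:
        raise ValueError("no samples")
    xs = sorted(samples)
    k = max(0, min(len(xs) - 1, int(round(q / 100 * (len(xs) - 1)))))
    return xs[k]

# Illustrative wait times (seconds); note one burst-induced outlier.
waits = [0.010, 0.012, 0.015, 0.020, 0.180]
print(percentile(waits, 50), percentile(waits, 99))
```

The point of tracking the full distribution rather than the mean is visible even in this toy data: the median is 15 ms while the p99 is dominated by the single outlier.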
## Example: latency vs batch size

| Batch size | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Median latency (ms) | | | | | |
| p99 latency (ms) | | | | 180 | 350 |

The p99 entries at batch sizes 8 and 16 are the figures from the deployment in the introduction; fill in the remaining cells from your own benchmark.
## Practical guidance

- In low-traffic environments:
  - prioritize latency over throughput
  - small batches, tight timeouts
- In high-traffic environments:
  - prioritize throughput, but cap max wait
  - use continuous batching with age-based fairness
- For multi-tenant setups:
  - enforce per-tenant token budgets
  - monitor per-tenant p99 separately
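A per-tenant token budget reduces to an admission check on concurrent in-flight tokens. A minimal sketch, with hypothetical class and method names rather than the API of any particular serving stack:

```python
from collections import defaultdict

class TenantBudget:
    def __init__(self, max_tokens_per_tenant: int):
        self.cap = max_tokens_per_tenant
        self.in_flight = defaultdict(int)

    def try_admit(self, tenant: str, tokens: int) -> bool:
        """Admit only if the tenant stays under its concurrent-token cap."""
        if self.in_flight[tenant] + tokens > self.cap:
            return False
        self.in_flight[tenant] += tokens
        return True

    def release(self, tenant: str, tokens: int) -> None:
        """Return a request's tokens to the tenant's budget on completion."""
        self.in_flight[tenant] -= tokens
```

Rejected requests go back to the queue rather than being dropped, so one tenant's burst delays only that tenant's own traffic.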
## Conclusion
Scheduling is where your throughput goals meet your latency SLOs:
- batching boosts throughput but hurts tail latency if unmanaged
- age- and tenant-aware policies prevent starvation
- continuous batching makes the most of active sequences
You don't control arrivals, but you do control how you group and order work on the GPU; treat that as a first-class optimization problem.