You have deployed Llama 70B on an H100, and your SLA is P99 latency under 200ms. At batch size 8, you hit 210 tokens/sec throughput and 180ms P99, within SLA. At batch size 16, throughput rises to 285 tokens/sec but P99 spikes to 350ms, and you exceed your SLA. The scheduler is your only lever: batching policy, queueing discipline, and fairness mechanisms determine whether you meet your latency target while maximizing throughput. This post treats LLM serving as a queueing system where batching trades throughput for latency, continuous decoding creates head-of-line blocking, and fairness between tenants requires explicit scheduling policy.

Basic model: arrivals, service, and batching

Simplify:

  • requests arrive at some rate λ
  • each token decode step costs T on average if run alone
  • batching B requests into one step costs T_B ≤ B·T

Throughput (tokens/s) with batch size B: R_B ≈ B / T_B

Latency has two parts:

  • queueing delay waiting to be batched
  • service time once in a batch
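
As a back-of-envelope sketch (my own simplification, assuming roughly uniform arrivals at rate λ, so a request waits on average for (B − 1)/2 later arrivals before its batch fills):

```python
def throughput_and_wait(lam, T_B, B):
    """Back-of-envelope batching model.

    lam: arrival rate (requests/s)
    T_B: time for one batched decode step (s)
    B:   batch size

    Returns (tokens/s, mean queueing delay to fill a batch).
    """
    throughput = B / T_B            # each step emits B tokens in T_B seconds
    fill_wait = (B - 1) / (2 * lam) # avg wait for (B-1)/2 later arrivals
    return throughput, fill_wait
```

Plugging in lam=10 req/s, T_B=50ms, B=8 gives 160 tokens/s but 350ms of average fill wait, which already illustrates the batch-size tension above.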

Batching trade-offs

| Batch size | Throughput | Queueing delay | Tail latency risk |
|---|---|---|---|
| 1 | Low | None | Low |
| 4 | Medium | Small | Moderate |
| 16 | High | Can be large | High if arrivals bursty |

Simple schedulers

FIFO with fixed batch size

class FifoBatchScheduler:
    def __init__(self, max_batch_size):
        self.queue = []
        self.max_batch = max_batch_size

    def enqueue(self, req):
        self.queue.append(req)

    def form_batch(self):
        if not self.queue:
            return []
        # take up to max_batch requests
        batch = self.queue[: self.max_batch]
        self.queue = self.queue[self.max_batch :]
        return batch

Pros:

  • simple
  • high utilization at high load

Cons:

  • at low load, requests can sit in the queue waiting for a batch to fill, inflating latency

Timeout-based batching

import time

class TimeoutBatchScheduler:
    def __init__(self, max_batch_size, max_wait_ms):
        self.queue = []
        self.max_batch = max_batch_size
        self.max_wait = max_wait_ms / 1000.0

    def enqueue(self, req):
        req.arrival = time.time()
        self.queue.append(req)

    def form_batch(self):
        if not self.queue:
            return []
        now = time.time()
        # If oldest request waited long enough, form whatever batch we have
        if now - self.queue[0].arrival >= self.max_wait or len(self.queue) >= self.max_batch:
            batch = self.queue[: self.max_batch]
            self.queue = self.queue[self.max_batch :]
            return batch
        return []

This caps queueing delay at roughly max_wait_ms and stabilizes P99 latency.

Continuous decoding and iteration-level batches

Unlike prefill, decode steps repeat for each token. With continuous batching, at each decode iteration you:

  • drop finished requests
  • add new arrivals
  • form a batch across all active requests
💡 Think in iterations, not requests

The GPU runs per-iteration batches of active sequences. Scheduling is about which sequences make it into each iteration and in what groupings.

Fairness vs throughput

Greedy batching (always filling the largest possible batch) can starve small, latency-sensitive requests behind a stream of long, high-throughput ones.

Simple mitigation:

  • age-based priority: weight requests by waiting time
  • tenant-aware limits: cap concurrent tokens per tenant

Fairness policy examples

| Policy | Pros | Cons |
|---|---|---|
| Pure FIFO | Simple, fair by arrival | May waste batching opportunities |
| Greedy size-based | High throughput | Can hurt small requests |
| Age-weighted | Balances both | Slightly more complex |

Metrics you should track

  • tokens/sec per GPU (throughput)
  • queueing delay distribution
  • time-in-system distribution (end-to-end latency)
  • utilization of GPU (SM active %, memory BW)
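
Tracking the delay and latency distributions means computing percentiles over samples. A minimal nearest-rank sketch (production systems typically use streaming estimators instead of sorting full windows):

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile, e.g. q=0.99 for P99 latency."""
    s = sorted(samples)
    idx = max(0, math.ceil(q * len(s)) - 1)
    return s[idx]
```

Feeding per-request queueing delays and end-to-end latencies through this gives the P50/P99 numbers the tables below are built from.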

Example: latency vs batch size

| Metric | B=1 | B=2 | B=4 | B=8 | B=16 |
|---|---|---|---|---|---|
| Median latency (ms) | 80 | 85 | 95 | 120 | 180 |
| P99 latency (ms) | 120 | 135 | 170 | 250 | 420 |

Practical guidance

  • In low-traffic environments:
    • prioritize latency over throughput
    • small batches, tight timeouts
  • In high-traffic environments:
    • prioritize throughput, but cap max wait
    • use continuous batching with age-based fairness
  • For multi-tenant setups:
    • enforce per-tenant token budgets
    • monitor per-tenant p99 separately
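
A per-tenant token budget can be sketched as a small admission gate (illustrative interface, not a real serving API; the budget counts tokens concurrently in flight per tenant):

```python
from collections import defaultdict

class TenantBudget:
    """Cap concurrent in-flight tokens per tenant."""

    def __init__(self, budget_per_tenant):
        self.budget = budget_per_tenant
        self.in_flight = defaultdict(int)

    def admit(self, tenant, tokens):
        """Admit a request if the tenant stays under budget."""
        if self.in_flight[tenant] + tokens > self.budget:
            return False  # over budget: request stays queued
        self.in_flight[tenant] += tokens
        return True

    def release(self, tenant, tokens):
        """Return capacity when a request finishes."""
        self.in_flight[tenant] -= tokens
```

The scheduler calls admit() before adding a sequence to an iteration and release() when it finishes, so one chatty tenant cannot monopolize the batch.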

Conclusion

Scheduling is where your throughput goals meet your latency SLOs:

  • batching boosts throughput but hurts tail latency if unmanaged
  • age- and tenant-aware policies prevent starvation
  • continuous batching makes the most of active sequences

You don't control arrivals, but you do control how you group and order work on the GPU. Treat that as a first-class optimization problem.