You have deployed Llama 70B on an H100, and your SLA is P99 latency under 200ms. At batch size 8, you hit 210 tokens/sec throughput and 180ms P99, within SLA. At batch size 16, throughput rises to 285 tokens/sec but P99 spikes to 350ms, and you exceed your SLA. The scheduler is your only lever: batching policy, queueing discipline, and fairness mechanisms determine whether you meet your latency target while maximizing throughput. This post treats LLM serving as a queueing system where batching trades throughput for latency, continuous decoding creates head-of-line blocking, and fairness between tenants requires explicit scheduling policy.

Basic model: arrivals, service, and batching

Simplify:

  • requests arrive at some rate λ
  • each token decode step costs T on average if run alone
  • batching B requests into one step costs T_B ≤ B·T

Throughput (tokens/s) with batch size B: R_B ≈ B / T_B

Latency has two parts:

  • queueing delay waiting to be batched
  • service time once in a batch
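
As a back-of-envelope sketch (my own simplification, assuming roughly uniform arrivals at rate λ, so a request waits on average for (B − 1)/2 later arrivals before its batch fills):

```python
def throughput_and_wait(lam, T_B, B):
    """Back-of-envelope batching model.

    lam: arrival rate (requests/s)
    T_B: time for one batched decode step (s)
    B:   batch size

    Returns (tokens/s, mean queueing delay to fill a batch).
    """
    throughput = B / T_B            # each step emits B tokens in T_B seconds
    fill_wait = (B - 1) / (2 * lam) # avg wait for (B-1)/2 later arrivals
    return throughput, fill_wait
```

Plugging in lam=10 req/s, T_B=50ms, B=8 gives 160 tokens/s but 350ms of average fill wait, which already illustrates the batch-size tension above.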

Batching trade-offs

| Batch size | Throughput | Queueing delay | Tail latency risk |
|---|---|---|---|
| 1 | Low | None | Low |
| 4 | Medium | Small | Moderate |
| 16 | High | Can be large | High if arrivals bursty |

Simple schedulers

FIFO with fixed batch size

class FifoBatchScheduler:
    def __init__(self, max_batch_size):
        self.queue = []
        self.max_batch = max_batch_size

    def enqueue(self, req):
        self.queue.append(req)

    def form_batch(self):
        if not self.queue:
            return []
        # take up to max_batch requests
        batch = self.queue[: self.max_batch]
        self.queue = self.queue[self.max_batch :]
        return batch

Pros:

  • simple
  • high utilization at high load

Cons:

  • at low load, requests can sit in the queue waiting for a batch to fill, inflating latency

Timeout-based batching

import time

class TimeoutBatchScheduler:
    def __init__(self, max_batch_size, max_wait_ms):
        self.queue = []
        self.max_batch = max_batch_size
        self.max_wait = max_wait_ms / 1000.0

    def enqueue(self, req):
        req.arrival = time.time()
        self.queue.append(req)

    def form_batch(self):
        if not self.queue:
            return []
        now = time.time()
        # If oldest request waited long enough, form whatever batch we have
        if now - self.queue[0].arrival >= self.max_wait or len(self.queue) >= self.max_batch:
            batch = self.queue[: self.max_batch]
            self.queue = self.queue[self.max_batch :]
            return batch
        return []

This caps queueing delay at roughly max_wait_ms and stabilizes P99 latency.

Continuous decoding and iteration-level batches

Unlike prefill, decode steps repeat for each token. With continuous batching, at each decode iteration you:

  • drop finished requests
  • add new arrivals
  • form a batch across all active requests
💡 Think in iterations, not requests

The GPU runs per-iteration batches of active sequences. Scheduling is about which sequences make it into each iteration and in what groupings.

Fairness vs throughput

Greedy batching (always filling the largest possible batch) can starve small, latency-sensitive requests behind a stream of long, high-throughput ones.

Simple mitigation:

  • age-based priority: weight requests by waiting time
  • tenant-aware limits: cap concurrent tokens per tenant

Fairness policy examples

| Policy | Pros | Cons |
|---|---|---|
| Pure FIFO | Simple, fair by arrival | May waste batching opportunities |
| Greedy size-based | High throughput | Can hurt small requests |
| Age-weighted | Balances both | Slightly more complex |

Metrics you should track

  • tokens/sec per GPU (throughput)
  • queueing delay distribution
  • time-in-system distribution (end-to-end latency)
  • utilization of GPU (SM active %, memory BW)
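
Tracking the delay and latency distributions means computing percentiles over samples. A minimal nearest-rank sketch (production systems typically use streaming estimators instead of sorting full windows):

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile, e.g. q=0.99 for P99 latency."""
    s = sorted(samples)
    idx = max(0, math.ceil(q * len(s)) - 1)
    return s[idx]
```

Feeding per-request queueing delays and end-to-end latencies through this gives the P50/P99 numbers the tables below are built from.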

Example: latency vs batch size

| Metric | B=1 | B=2 | B=4 | B=8 | B=16 |
|---|---|---|---|---|---|
| Median latency (ms) | 80 | 85 | 95 | 120 | 180 |
| P99 latency (ms) | 120 | 135 | 170 | 250 | 420 |

Practical guidance

  • In low-traffic environments:
    • prioritize latency over throughput
    • small batches, tight timeouts
  • In high-traffic environments:
    • prioritize throughput, but cap max wait
    • use continuous batching with age-based fairness
  • For multi-tenant setups:
    • enforce per-tenant token budgets
    • monitor per-tenant p99 separately
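
A per-tenant token budget can be sketched as a small admission gate (illustrative interface, not a real serving API; the budget counts tokens concurrently in flight per tenant):

```python
from collections import defaultdict

class TenantBudget:
    """Cap concurrent in-flight tokens per tenant."""

    def __init__(self, budget_per_tenant):
        self.budget = budget_per_tenant
        self.in_flight = defaultdict(int)

    def admit(self, tenant, tokens):
        """Admit a request if the tenant stays under budget."""
        if self.in_flight[tenant] + tokens > self.budget:
            return False  # over budget: request stays queued
        self.in_flight[tenant] += tokens
        return True

    def release(self, tenant, tokens):
        """Return capacity when a request finishes."""
        self.in_flight[tenant] -= tokens
```

The scheduler calls admit() before adding a sequence to an iteration and release() when it finishes, so one chatty tenant cannot monopolize the batch.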

Conclusion

Scheduling is where your throughput goals meet your latency SLOs:

  • batching boosts throughput but hurts tail latency if unmanaged
  • age- and tenant-aware policies prevent starvation
  • continuous batching makes the most of active sequences

You don't control arrivals, but you do control how you group and order work on the GPU. Treat that as a first-class optimization problem.