Part of Series: vLLM v1 & Omni Internals (4 of 25)

vLLM v1 Unified Scheduler: One Queue, No Prefill/Decode Distinction, and Persistent Batches

vLLM v0’s scheduler was a maintenance nightmare. Every time we wanted to change scheduling policy — prioritize decodes over prefills, tune chunked prefill chunk sizes, implement prefix caching — we had to modify three separate code paths that all did variations of the same thing. The prefill path handled new requests. The decode path handled autoregressive token generation. And chunked prefill bolted on a third path to split long prompts across iterations. Each path built input tensors differently, tracked request state differently, and communicated with the block manager through different APIs. Debugging became an exercise in checking whether a bug existed in all three paths or just one. The v1 scheduler collapses this complexity into a single abstraction: every request, whether prefilling or decoding, is just a request ID and a token count.

vLLM v1 collapses all three into a single scheduling abstraction. The scheduler does not know or care whether a request is in the prefill phase or the decode phase. The scheduling output is a dictionary: request_id -> num_tokens. A request getting its full prompt processed in one shot appears as {req_42: 2048}. A request generating its next autoregressive token appears as {req_42: 1}. A request being chunked appears as {req_42: 512}. The downstream model runner processes all of these identically.

v0 Scheduler: The Problem

Three Separate Paths

In v0, the Scheduler class had distinct methods:

# v0: Simplified structure
class SchedulerV0:
    def _schedule_prefills(self, budget):
        """Select new requests to prefill, allocate KV blocks."""
        scheduled = []
        for req in self.waiting_queue:
            tokens_needed = len(req.prompt_tokens)
            if budget.can_fit(tokens_needed):
                blocks = self.block_manager.allocate(req)
                scheduled.append(ScheduledPrefill(req, blocks, tokens_needed))
                budget.subtract(tokens_needed)
        return scheduled

    def _schedule_decodes(self, budget):
        """Select running requests for next decode step."""
        scheduled = []
        for req in self.running_queue:
            if budget.can_fit(1):  # 1 token per decode step
                scheduled.append(ScheduledDecode(req, num_tokens=1))
                budget.subtract(1)
        return scheduled

    def _schedule_chunked_prefills(self, budget):
        """Handle partially-prefilled requests."""
        scheduled = []
        for req in self.partial_queue:
            remaining = req.prompt_length - req.num_computed_tokens
            chunk_size = min(remaining, budget.remaining_tokens)
            if chunk_size > 0:
                scheduled.append(ScheduledChunk(req, chunk_size))
                budget.subtract(chunk_size)
        return scheduled

    def schedule(self):
        budget = SchedulingBudget(
            max_tokens=self.max_num_batched_tokens,
            max_sequences=self.max_num_seqs,
        )
        # Priority: decodes first (they're cheap), then prefills
        decodes = self._schedule_decodes(budget)
        prefills = self._schedule_prefills(budget)
        chunks = self._schedule_chunked_prefills(budget)
        return SchedulerOutput(decodes, prefills, chunks)

This design had several problems:

  1. Code duplication: Each path built input tensors differently, tracked state differently, and interacted with the block manager through different APIs.
  2. Priority coupling: The scheduling order (decodes first, then prefills) was hardcoded. Changing the policy required touching all three methods.
  3. Asymmetric workers: In v0, worker 0 was “special” — it ran the scheduler and broadcast decisions to other TP workers. This created a communication bottleneck and a single point of failure.
ℹ️ The v0 Worker Asymmetry Problem

In v0 with TP=8, worker 0 executed the scheduler, built the input tensors, and broadcast them to workers 1-7. This meant worker 0 had higher CPU load (scheduler + tensor building) and was the serialization point for every iteration. If worker 0 was slow, all 7 other GPUs waited idle. In v1, all workers are symmetric — the scheduler runs once and broadcasts only a compact scheduling decision, not full tensors.

v1 Unified Scheduler

The Core Abstraction

The v1 scheduler produces a single data structure: a mapping from request ID to the number of tokens that request should process in this iteration.

from dataclasses import dataclass

@dataclass
class SchedulerOutput:
    """The complete scheduling decision for one iteration."""

    # The core: request_id -> num_tokens_to_process
    num_scheduled_tokens: dict  # {request_id: int}

    # Requests that are finished (reached a stop condition or max_tokens)
    finished_requests: set

    # Requests that must be preempted (evicted to free KV blocks)
    preempted_requests: set

    # New requests entering the system this iteration
    new_requests: set

The num_scheduled_tokens dict is the entire scheduling decision. Every downstream component — the model runner, the attention backend, the sampler — reads only this dict.
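To make the contract concrete, here is a minimal sketch of a runner consuming the dict. The names (`build_flat_input`, the sample data) are illustrative, not vLLM's actual API:

```python
def build_flat_input(num_scheduled_tokens, pending_tokens):
    """Flatten the scheduling decision into one token stream.

    num_scheduled_tokens: {request_id: tokens to process this step}
    pending_tokens: {request_id: tokens the request would feed next}

    The runner never branches on prefill vs. decode: a decode request
    simply contributes a one-element slice.
    """
    flat = []
    for req_id, n in num_scheduled_tokens.items():
        flat.extend(pending_tokens[req_id][:n])
    return flat

# A decode request (1 token) and a chunked prefill (3 of 4 remaining
# prompt tokens) flow through the identical path:
decision = {"req_a": 1, "req_b": 3}
pending = {"req_a": [42], "req_b": [7, 8, 9, 10]}
flat = build_flat_input(decision, pending)
```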

Unified Scheduling Logic

from collections import OrderedDict, deque

class UnifiedScheduler:
    """
    v1 scheduler. No prefill/decode distinction.
    Every request is just (request_id, num_tokens_to_process).
    """

    def __init__(self, config):
        self.max_num_batched_tokens = config.max_num_batched_tokens
        self.max_num_seqs = config.max_num_seqs
        self.requests = {}           # request_id -> RequestState
        self.waiting = deque()       # New requests not yet started
        self.running = OrderedDict() # Active requests (prefilling or decoding)

    def schedule(self):
        num_scheduled_tokens = {}
        new_requests = set()
        finished_requests = set()
        preempted_requests = set()

        token_budget = self.max_num_batched_tokens
        seq_budget = self.max_num_seqs

        # Phase 1: Schedule running requests (both prefilling and decoding)
        for req_id, state in list(self.running.items()):
            if seq_budget <= 0 or token_budget <= 0:
                # Must preempt: not enough budget
                preempted_requests.add(req_id)
                continue

            if state.is_finished():
                finished_requests.add(req_id)
                continue

            # How many tokens does this request need?
            remaining_prefill = state.prompt_length - state.num_computed_tokens
            if remaining_prefill > 0:
                # Still prefilling: schedule up to the remaining prompt tokens
                num_tokens = min(remaining_prefill, token_budget)
            else:
                # Decoding: schedule exactly 1 token
                num_tokens = 1

            num_scheduled_tokens[req_id] = num_tokens
            token_budget -= num_tokens
            seq_budget -= 1

        # Drop finished and preempted requests from the running set
        for req_id in finished_requests | preempted_requests:
            self.running.pop(req_id, None)

        # Phase 2: Admit new requests from the waiting queue
        while self.waiting and seq_budget > 0 and token_budget > 0:
            req_id = self.waiting[0]
            state = self.requests[req_id]

            # Can we fit at least some tokens from this request?
            prompt_len = state.prompt_length
            num_tokens = min(prompt_len, token_budget)

            if num_tokens <= 0:
                break

            self.waiting.popleft()
            self.running[req_id] = state
            new_requests.add(req_id)
            num_scheduled_tokens[req_id] = num_tokens
            token_budget -= num_tokens
            seq_budget -= 1

        return SchedulerOutput(
            num_scheduled_tokens=num_scheduled_tokens,
            finished_requests=finished_requests,
            preempted_requests=preempted_requests,
            new_requests=new_requests,
        )

What “No Distinction” Means

The key insight is in Phase 1. For each running request, the scheduler computes remaining_prefill = prompt_length - num_computed_tokens. If positive, the request still has prompt tokens to process — it gets up to remaining_prefill tokens. If zero, the request is decoding — it gets exactly 1 token. The scheduler does not maintain separate queues for “prefill requests” and “decode requests”. There is one pool of running requests, and each request’s state determines how many tokens it needs.

This unification naturally handles chunked prefill. If a new request has a 4,096-token prompt but only 512 tokens fit in the budget, the scheduler assigns num_tokens = 512. Next iteration, remaining_prefill = 4096 - 512 = 3584, and the scheduler assigns another chunk. The request transitions seamlessly from chunked prefill to decode without changing queues or state machines.
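The chunk-then-decode progression can be traced with a toy simulation of the budget rule (a sketch of the `min(remaining, budget)` logic above, not vLLM code):

```python
def simulate_request(prompt_length, token_budget_per_iter, decode_steps):
    """Trace the per-iteration token counts the unified scheduler would
    assign to one request: chunked prefill, then decode, with no
    explicit state-machine transition in between."""
    num_computed = 0
    schedule = []
    # Prefill phase: emerges from min(remaining, budget), never special-cased
    while num_computed < prompt_length:
        n = min(prompt_length - num_computed, token_budget_per_iter)
        schedule.append(n)
        num_computed += n
    # Decode phase: remaining_prefill == 0, so exactly 1 token per iteration
    schedule.extend([1] * decode_steps)
    return schedule

# A 4,096-token prompt with a 512-token per-iteration budget:
schedule = simulate_request(prompt_length=4096,
                            token_budget_per_iter=512,
                            decode_steps=3)
```

Eight iterations of 512-token chunks, then 1 token per iteration, with the phase switch falling out of the arithmetic.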

📊

Scheduling Decision Examples

Request       Prompt Length   Computed Tokens   Remaining Prefill   Scheduled Tokens   Phase
req_1         2048            2048              0                   1                  Decode
req_2         4096            0                 4096                4096               Full prefill
req_3         8192            3072              5120                512                Chunked prefill
req_4         512             512               0                   1                  Decode
req_5 (new)   1024            0                 1024                1024               New + full prefill
req_6 (new)   16384           0                 16384               256                New + chunked prefill

The scheduling output for the above iteration would be:

{
    "req_1": 1,
    "req_2": 4096,
    "req_3": 512,
    "req_4": 1,
    "req_5": 1024,
    "req_6": 256,
}

Total tokens: 1 + 4096 + 512 + 1 + 1024 + 256 = 5890. If the token budget were 6000, 110 tokens remain unused — not enough to admit another request.

Persistent Batches

The v0 Problem: Tensor Rebuilding

In v0, the model runner rebuilt the full input batch tensor from scratch every iteration:

# v0: Build input from scratch every iteration
def prepare_input(self, scheduler_output):
    input_ids = []
    positions = []
    block_tables = []

    for req in scheduler_output.all_requests():
        if req.is_prefill:
            input_ids.extend(req.prompt_tokens[req.start:req.end])
            positions.extend(range(req.start, req.end))
        else:
            input_ids.append(req.last_generated_token)
            positions.append(req.current_position)
        block_tables.append(req.block_table)

    return ModelInput(
        input_ids=torch.tensor(input_ids, device="cuda"),
        positions=torch.tensor(positions, device="cuda"),
        block_tables=self._pad_block_tables(block_tables),
    )

Every iteration, this code iterated over all requests, built Python lists, converted them to PyTorch tensors, and transferred them to GPU. For a batch of 256 decode requests, this meant:

  • 256 iterations through the request list
  • 256 Python list appends per tensor
  • 3 torch.tensor() calls with device="cuda" (CPU-to-GPU transfer)
  • 1 block table padding operation (variable-length to fixed-length)

Profiling showed this took 0.5-2 ms per iteration. At 30 ms per decode iteration for a 70B model, that is 1.7-6.7% of total iteration time — pure overhead.

v1: Cache and Update Incrementally

v1 maintains persistent GPU tensors that survive across iterations. Each iteration, only the changed elements are updated:

class PersistentBatch:
    """
    Maintains GPU-resident tensors across iterations.
    Only sends incremental updates (new tokens, position increments).
    """

    def __init__(self, max_batch_size, max_seq_len, device):
        # Pre-allocated GPU tensors (created once, reused every iteration)
        self.input_ids = torch.zeros(max_batch_size, dtype=torch.long, device=device)
        self.positions = torch.zeros(max_batch_size, dtype=torch.long, device=device)
        self.seq_lens = torch.zeros(max_batch_size, dtype=torch.int32, device=device)

        # Slot mapping: request_id -> slot index in the batch tensors
        self.slot_map = {}       # request_id -> int
        self.free_slots = list(range(max_batch_size - 1, -1, -1))  # Stack
        self.num_active = 0

    def apply_scheduling_decision(self, scheduler_output):
        """
        Update the persistent tensors based on the scheduling decision.
        Only touches slots that changed.
        """
        updates_input_ids = []
        updates_positions = []
        update_indices = []

        # Handle finished requests: free their slots
        for req_id in scheduler_output.finished_requests:
            if req_id in self.slot_map:
                slot = self.slot_map.pop(req_id)
                self.free_slots.append(slot)
                self.num_active -= 1

        # Handle preempted requests: free their slots
        for req_id in scheduler_output.preempted_requests:
            if req_id in self.slot_map:
                slot = self.slot_map.pop(req_id)
                self.free_slots.append(slot)
                self.num_active -= 1

        # Handle new requests: allocate slots
        for req_id in scheduler_output.new_requests:
            slot = self.free_slots.pop()
            self.slot_map[req_id] = slot
            self.num_active += 1

        # For each scheduled request, compute the incremental update
        for req_id, num_tokens in scheduler_output.num_scheduled_tokens.items():
            slot = self.slot_map[req_id]
            request_state = self._get_request_state(req_id)

            if num_tokens == 1:
                # Decode: update single token and increment position
                new_token = request_state.last_generated_token
                new_position = request_state.current_position
                updates_input_ids.append(new_token)
                updates_positions.append(new_position)
                update_indices.append(slot)
            else:
                # Prefill (full or chunked): this is a bulk update
                # For prefill, we use a separate code path that writes
                # the prompt tokens directly into a contiguous region
                self._write_prefill_tokens(slot, request_state, num_tokens)

        # Batch-update decode slots (the common case)
        if update_indices:
            indices = torch.tensor(update_indices, device=self.input_ids.device)
            tokens = torch.tensor(updates_input_ids, device=self.input_ids.device)
            positions = torch.tensor(updates_positions, device=self.input_ids.device)
            self.input_ids.index_copy_(0, indices, tokens)
            self.positions.index_copy_(0, indices, positions)

    def get_model_input(self):
        """Return views over the active slots (no rebuild, no host copy)."""
        # Slots fragment as requests finish, so gather the active slot
        # indices rather than slicing a contiguous prefix.
        active = torch.tensor(
            sorted(self.slot_map.values()),
            dtype=torch.long, device=self.input_ids.device,
        )
        return ModelInput(
            input_ids=self.input_ids[active],
            positions=self.positions[active],
            seq_lens=self.seq_lens[active],
        )

The index_copy_ Optimization

torch.index_copy_ is a CUDA kernel that writes values into specific indices of a tensor in one GPU operation. For 256 decode requests, this is one kernel launch that updates 256 elements — roughly 5 microseconds. Compare to v0’s approach: 256 Python-level list operations (200 us), tensor creation from list (100 us), and CPU-to-GPU transfer (200 us). The persistent batch reduces tensor preparation from 0.5-2 ms to 0.01-0.05 ms per iteration.
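The slot bookkeeping behind the persistent batch can be exercised on its own. This toy sketch mirrors the `slot_map`/`free_slots` structure above (simplified, no tensors): freed slots are recycled, so the pre-allocated tensors never grow.

```python
class SlotMap:
    """Toy model of the persistent batch's slot bookkeeping: a free-slot
    stack plus a request_id -> slot dict."""

    def __init__(self, max_batch_size):
        # Stack ordered so that pop() hands out slot 0 first
        self.free_slots = list(range(max_batch_size - 1, -1, -1))
        self.slot_map = {}

    def admit(self, req_id):
        """Assign a free slot to a newly scheduled request."""
        slot = self.free_slots.pop()
        self.slot_map[req_id] = slot
        return slot

    def release(self, req_id):
        """Return a finished/preempted request's slot to the stack."""
        self.free_slots.append(self.slot_map.pop(req_id))

batch = SlotMap(max_batch_size=4)
slot_a = batch.admit("req_a")   # gets slot 0
slot_b = batch.admit("req_b")   # gets slot 1
batch.release("req_a")          # slot 0 goes back on the stack
slot_c = batch.admit("req_c")   # reuses slot 0
```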

Quantifying the Savings

Tensor Preparation Time per Iteration

Batch Size   v0 (ms)   v1 (ms)
32           0.5       0.01
128          1.0       0.02
256          1.8       0.04
512          3.2       0.08

The savings compound over thousands of iterations. For a typical serving workload processing 1,000 iterations per second at batch size 256:

v0 overhead = 1.8 ms × 1000 = 1.8 seconds/second

That is not a typo. v0 spent 1.8 seconds out of every second on tensor preparation — meaning it could not sustain 1,000 iterations per second. The actual throughput was lower because tensor preparation was on the critical path.

v1 overhead = 0.04 ms × 1000 = 0.04 seconds/second

The persistent batch reduces tensor preparation to 4% of a single second, leaving 96% for actual model computation.

📊

Iteration Overhead Breakdown (Llama 70B, TP=8, Batch=256)

Component                       v0 Time (ms)   v1 Time (ms)   Reduction
Scheduling decision             0.1            0.08           20%
Tensor preparation              1.8            0.04           97.8%
Scheduler -> Worker broadcast   0.3            0.05           83.3%
Model forward pass              28.0           28.0           0%
Sampling                        0.2            0.2            0%
Total iteration                 30.4           28.37          6.7%

The 6.7% reduction in total iteration time translates directly to 6.7% higher decode throughput. For a 256-request batch generating tokens at 30 ms/iteration:

v0 throughput = 256 / 30.4 ms = 8,421 tokens/sec

v1 throughput = 256 / 28.37 ms = 9,024 tokens/sec

That is 603 additional tokens per second from eliminating overhead — no algorithmic change to the model, no hardware upgrade.

Symmetric Tensor Parallelism

v0: Worker 0 Was Special

In v0, worker 0 ran the scheduler, built input tensors, and broadcast them to workers 1 through N-1:

Worker 0: [Schedule] -> [Build Tensors] -> [Broadcast] -> [Forward Pass] -> [Sample]
Worker 1:              (idle, waiting)   -> [Receive]   -> [Forward Pass] -> (idle)
Worker 2:              (idle, waiting)   -> [Receive]   -> [Forward Pass] -> (idle)
...

Worker 0 had more work than other workers. The time from “scheduling done” to “all workers start forward pass” included tensor building (1.8 ms) and broadcast (0.3 ms). Workers 1-7 were idle during this 2.1 ms window.

v1: All Workers Are Identical

In v1, the scheduler broadcasts only the compact num_scheduled_tokens dict — not full tensors. Each worker maintains its own PersistentBatch and applies the scheduling decision locally:

Scheduler:  [Schedule] -> [Broadcast decision dict]
Worker 0:   [Receive dict] -> [Apply to local PersistentBatch] -> [Forward Pass] -> [Sample]
Worker 1:   [Receive dict] -> [Apply to local PersistentBatch] -> [Forward Pass] -> [Sample]
Worker 2:   [Receive dict] -> [Apply to local PersistentBatch] -> [Forward Pass] -> [Sample]
...

class SymmetricWorker:
    """
    v1 worker. All workers are identical.
    Each maintains its own persistent batch and request state cache.
    """

    def __init__(self, rank, tp_size, model, config):
        self.rank = rank
        self.tp_size = tp_size
        self.model = model
        self.persistent_batch = PersistentBatch(
            max_batch_size=config.max_num_seqs,
            max_seq_len=config.max_model_len,
            device=f"cuda:{rank}",
        )
        self.request_cache = {}  # request_id -> local RequestState

    def execute_iteration(self, scheduler_decision):
        """
        All workers execute the same logic.
        No worker is 'special'.
        """
        # Step 1: Update local persistent batch
        self.persistent_batch.apply_scheduling_decision(scheduler_decision)

        # Step 2: Get model input (no rebuild, just a view)
        model_input = self.persistent_batch.get_model_input()

        # Step 3: Forward pass (TP-sharded, NCCL all-reduce inside)
        hidden_states = self.model.forward(model_input)

        # Step 4: Sample (all workers sample independently, results are identical)
        # In v1, all workers produce the same logits after all-reduce,
        # so sampling on any worker yields the same token.
        output_tokens = self.model.sample(hidden_states)

        # Step 5: Update local request state
        for req_id, token_id in output_tokens.items():
            self.request_cache[req_id].append_token(token_id)

        return output_tokens

ℹ️ Why Sampling on All Workers Is Safe

After the final all-reduce in the last transformer layer, all TP workers hold identical hidden states. The output projection (unembedding) and softmax are element-wise or row-wise operations — they produce identical logits on all workers. Greedy sampling (argmax) is deterministic, so all workers select the same token. For stochastic sampling (temperature, top-p), all workers must use the same random seed, which vLLM v1 ensures by synchronizing the RNG state at the start of each iteration via the scheduling decision.
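The determinism argument for stochastic sampling can be demonstrated with a toy sampler: plain `random` standing in for the GPU RNG, with an illustrative three-token distribution. Identical logits plus an identical per-iteration seed yield identical draws.

```python
import random

def sample_token(logit_weights, seed):
    """Stochastic sampling with an explicitly synchronized seed.
    If every TP worker seeds its RNG identically at the start of the
    iteration, all workers draw the same token from the same
    distribution."""
    rng = random.Random(seed)
    token_ids = list(range(len(logit_weights)))
    return rng.choices(token_ids, weights=logit_weights, k=1)[0]

# Two "workers" see the same post-all-reduce logits and the same
# per-iteration seed, so they must agree:
w0 = sample_token([0.1, 0.7, 0.2], seed=1234)
w1 = sample_token([0.1, 0.7, 0.2], seed=1234)
```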

Broadcast Cost: Dict vs. Full Tensors

The scheduling decision dict is tiny. For a batch of 256 requests, the dict contains 256 key-value pairs. Serialized as a flat array of (int64, int32) pairs:

decision_size = 256 × (8 + 4) = 3,072 bytes = 3 KB

In v0, the broadcast included the full input tensor:

tensor_size = 256 × 4 (int32 input_ids) + 256 × 4 (int32 positions) + block_tables ≈ 100 KB

Plus the overhead of PyTorch tensor serialization. The v1 broadcast is 33x smaller, and because it is a simple byte buffer (not a PyTorch tensor), it avoids the CUDA-Python serialization overhead entirely.
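The 3 KB figure is easy to verify with an illustrative encoding of the decision as flat (int64, int32) pairs (vLLM's actual wire format may differ):

```python
import struct

def serialize_decision(num_scheduled_tokens):
    """Pack {request_index: num_tokens} as little-endian (int64, int32)
    pairs -- 12 bytes per scheduled request, no tensor serialization."""
    return b"".join(
        struct.pack("<qi", req_idx, n)
        for req_idx, n in num_scheduled_tokens.items()
    )

# A 256-request decode batch: every request gets 1 token
payload = serialize_decision({i: 1 for i in range(256)})
```

`struct.pack("<qi", ...)` produces exactly 12 bytes per pair, so the whole batch serializes to 3,072 bytes.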

📊

Scheduler Broadcast Cost

Metric                          v0                      v1                  Improvement
Payload size (batch=256)        ~100 KB                 3 KB                33x smaller
Serialization time              0.2 ms (torch pickle)   0.005 ms (memcpy)   40x faster
Network transfer (InfiniBand)   0.1 ms                  0.003 ms            33x faster
Total broadcast overhead        0.3 ms                  0.008 ms            37.5x faster

Request State Caching on Workers

Each worker caches per-request state locally, eliminating the need to broadcast request metadata every iteration:

from dataclasses import dataclass

@dataclass
class LocalRequestState:
    """Cached on each worker. Updated incrementally."""
    request_id: str
    prompt_tokens: list            # Set once when request arrives
    prompt_length: int             # Set once
    num_computed_tokens: int       # Incremented each iteration
    generated_tokens: list         # Appended each iteration
    kv_block_ids: list             # Updated by block manager
    current_position: int          # Incremented each iteration
    sampling_params: dict          # Set once (temperature, top_p, etc.)

    def append_token(self, token_id):
        self.generated_tokens.append(token_id)
        self.current_position += 1
        self.num_computed_tokens += 1

    def is_finished(self):
        if len(self.generated_tokens) >= self.sampling_params.get("max_tokens", float("inf")):
            return True
        last_token = self.generated_tokens[-1] if self.generated_tokens else None
        if last_token in self.sampling_params.get("stop_token_ids", []):
            return True
        return False

When a new request arrives, the scheduler broadcasts its full metadata once:

class NewRequestMessage:
    request_id: str
    prompt_tokens: list
    sampling_params: dict

For a 2,048-token prompt, this is approximately 2048 × 4 + 200 ≈ 8.4 KB — sent once. All subsequent iterations send only the 12-byte (request_id, num_tokens) pair.

Over the lifetime of a request generating 500 tokens, the total communication is:

v0 total = 500 × 100 KB = 50 MB (broadcast full tensors each iteration)

v1 total = 8.4 KB (initial) + 500 × 12 bytes = 14.4 KB

A 3,472× reduction in total scheduler-to-worker communication.

The schedule() Method: Complete Pseudocode

Here is the full scheduling algorithm with all edge cases:

from collections import OrderedDict, deque

class UnifiedSchedulerV1:
    def __init__(self, config, block_manager):
        self.max_num_batched_tokens = config.max_num_batched_tokens  # e.g., 8192
        self.max_num_seqs = config.max_num_seqs                      # e.g., 256
        self.block_manager = block_manager
        self.waiting = deque()       # Requests not yet started
        self.running = OrderedDict() # Active requests
        self.request_states = {}     # All request states

    def add_request(self, request):
        """Called by the API server when a new request arrives."""
        state = RequestState(
            request_id=request.request_id,
            prompt_tokens=request.prompt_tokens,
            prompt_length=len(request.prompt_tokens),
            num_computed_tokens=0,
            generated_tokens=[],
            sampling_params=request.sampling_params,
        )
        self.request_states[request.request_id] = state
        self.waiting.append(request.request_id)

    def schedule(self):
        """
        The main scheduling method. Called once per iteration.

        Returns: SchedulerOutput containing {request_id: num_tokens}
        for all requests to process this iteration.
        """
        scheduled = {}
        new_requests = set()
        finished = set()
        preempted = set()

        token_budget = self.max_num_batched_tokens
        seq_budget = self.max_num_seqs

        # ---- Phase 1: Handle running requests ----
        # Running requests have priority over new requests.
        # This ensures decode latency is not penalized by incoming prefills.

        to_remove = []
        for req_id in self.running:
            state = self.request_states[req_id]

            # Check if request is finished
            if state.is_finished():
                finished.add(req_id)
                to_remove.append(req_id)
                self.block_manager.free(req_id)
                continue

            # Check budget
            if seq_budget <= 0:
                preempted.add(req_id)
                to_remove.append(req_id)
                self.block_manager.free(req_id)
                continue

            # Compute tokens needed
            remaining_prefill = state.prompt_length - state.num_computed_tokens
            if remaining_prefill > 0:
                # Still in prefill phase: schedule as many as budget allows
                num_tokens = min(remaining_prefill, token_budget)
            else:
                # In decode phase: exactly 1 token
                num_tokens = 1

            if num_tokens <= 0:
                # No token budget left; preempt lowest-priority
                preempted.add(req_id)
                to_remove.append(req_id)
                self.block_manager.free(req_id)
                continue

            # Allocate KV blocks for the new tokens
            can_allocate = self.block_manager.can_allocate(req_id, num_tokens)
            if not can_allocate:
                # Not enough KV cache memory; preempt this request
                preempted.add(req_id)
                to_remove.append(req_id)
                self.block_manager.free(req_id)
                continue

            self.block_manager.allocate(req_id, num_tokens)
            scheduled[req_id] = num_tokens
            token_budget -= num_tokens
            seq_budget -= 1

        for req_id in to_remove:
            self.running.pop(req_id, None)

        # ---- Phase 2: Admit new requests ----
        # Fill remaining budget with waiting requests.

        while self.waiting and seq_budget > 0 and token_budget > 0:
            req_id = self.waiting[0]
            state = self.request_states[req_id]

            # How many tokens can we schedule for this request?
            num_tokens = min(state.prompt_length, token_budget)
            if num_tokens <= 0:
                break

            # Can we allocate KV blocks?
            can_allocate = self.block_manager.can_allocate(req_id, num_tokens)
            if not can_allocate:
                break  # No memory; stop admitting

            self.waiting.popleft()
            self.running[req_id] = True
            new_requests.add(req_id)
            self.block_manager.allocate(req_id, num_tokens)
            scheduled[req_id] = num_tokens
            token_budget -= num_tokens
            seq_budget -= 1

        # ---- Phase 3: Build output ----
        return SchedulerOutput(
            num_scheduled_tokens=scheduled,
            finished_requests=finished,
            preempted_requests=preempted,
            new_requests=new_requests,
        )

    def update_from_output(self, scheduler_output, output_tokens):
        """Called after the model runner finishes an iteration."""
        for req_id, num_tokens in scheduler_output.num_scheduled_tokens.items():
            state = self.request_states.get(req_id)
            if state is None:
                continue
            # Advance by the number of tokens actually processed:
            # 1 for a decode step, up to a full chunk for prefill.
            state.num_computed_tokens += num_tokens
            # A sampled token exists only once the prompt is fully computed
            if req_id in output_tokens:
                state.generated_tokens.append(output_tokens[req_id])

⚠️ Preemption Order Matters

When the scheduler must preempt running requests (KV cache exhausted or sequence budget exceeded), the eviction order is critical. vLLM v1 preempts requests in LIFO order — the most recently admitted request is evicted first. This preserves KV cache for requests that have already generated many tokens (expensive to recompute) at the expense of newer requests (cheaper to re-prefill).
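A sketch of LIFO victim selection over the admission-ordered `running` structure (an illustrative helper, not the scheduler's actual method):

```python
from collections import OrderedDict

def pick_preemption_victims(running, num_to_evict):
    """LIFO preemption: evict the most recently admitted requests first.
    `running` is an OrderedDict in admission order, as in the scheduler,
    so the last keys are the newest (cheapest to re-prefill) requests."""
    newest = list(running)[-num_to_evict:]
    # Evict newest-first, so the oldest of the victims goes last
    return list(reversed(newest))

# req_1 was admitted first, req_4 last:
running = OrderedDict.fromkeys(["req_1", "req_2", "req_3", "req_4"])
victims = pick_preemption_victims(running, 2)
```

Evicting `req_4` before `req_3` preserves the KV cache of the longest-running requests.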

Interaction with the Block Manager

The unified scheduler communicates with the block manager through two simple operations:

import math

class BlockManagerInterface:
    def can_allocate(self, request_id, num_tokens):
        """
        Can the block manager accommodate `num_tokens` new tokens
        for this request?

        This checks:
        1. Are there enough free physical blocks?
        2. If the request's last block is partially filled,
           can the new tokens fit in the remaining slots?
        """
        tokens_in_last_block = self.get_last_block_fill(request_id)
        remaining_in_last_block = self.block_size - tokens_in_last_block

        if num_tokens <= remaining_in_last_block:
            return True  # Fits in the existing block, no allocation needed

        extra_tokens = num_tokens - remaining_in_last_block
        extra_blocks_needed = math.ceil(extra_tokens / self.block_size)
        return extra_blocks_needed <= self.num_free_blocks

    def allocate(self, request_id, num_tokens):
        """Allocate physical blocks for num_tokens new tokens."""
        # ... (same logic as can_allocate, but actually allocates)
        pass

    def free(self, request_id):
        """Free all blocks belonging to this request."""
        blocks = self.request_blocks.pop(request_id, [])
        for block in blocks:
            block.ref_count -= 1
            if block.ref_count == 0:
                self.free_pool.append(block.block_id)

The scheduler does not know about block sizes, physical addresses, or GPU memory layout. It asks “can I have N tokens?” and gets a boolean. This clean interface is what makes the unified scheduler possible: the complexity of memory management is entirely encapsulated in the block manager.
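The arithmetic inside `can_allocate` reduces to a small pure function. This helper restates the check above under an assumed block size of 16 tokens (vLLM's default):

```python
import math

def blocks_needed(num_tokens, last_block_fill, block_size=16):
    """How many new physical blocks `num_tokens` additional tokens
    require, given how full the request's last block already is."""
    remaining = block_size - last_block_fill
    if num_tokens <= remaining:
        return 0  # fits in the existing partially filled block
    return math.ceil((num_tokens - remaining) / block_size)

# 6 tokens fit in the 6 free slots of a block holding 10/16 tokens;
# 20 tokens spill 14 into one new block; 33 tokens from an empty
# block need two extra blocks beyond the first 16 slots.
fits, spills, big = blocks_needed(6, 10), blocks_needed(20, 10), blocks_needed(33, 0)
```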

Performance Comparison: v0 vs v1

End-to-End Throughput (Llama 70B, TP=8, A100 80GB)

Batch Size   v0 (tokens/sec)   v1 (tokens/sec)
64           2,100             2,240
128          3,800             4,180
256          5,600             6,380
📊

Latency Comparison: P50 TPOT (Llama 70B, TP=8, Batch=256)

Metric                       v0        v1        Improvement
P50 TPOT                     30.4 ms   28.4 ms   6.6%
P99 TPOT                     42.1 ms   33.2 ms   21.1%
Scheduling jitter (stddev)   1.2 ms    0.15 ms   87.5%
Max batch size at SLO=40ms   210       256       +21.9%

The P99 improvement (21.1%) is larger than the P50 improvement (6.6%) because v0’s tensor rebuilding time varied significantly with batch composition. When multiple large prefills ran simultaneously, v0’s tensor preparation could spike to 5+ ms. v1’s persistent batch has nearly constant overhead regardless of batch composition.

Chunked Prefill: A Natural Consequence

In v0, chunked prefill required explicit logic:

# v0: Chunked prefill was a special case
if request.is_chunked_prefill:
    chunk_size = min(remaining, self.chunk_budget)
    # Special handling for partial prefill state...
    # Special tensor building for chunk boundaries...

In v1, chunked prefill is not a feature — it is an emergent property of the unified scheduler. When the token budget cannot fit a full prompt, the scheduler naturally assigns a partial chunk:

# v1: Chunked prefill happens automatically
remaining_prefill = state.prompt_length - state.num_computed_tokens
num_tokens = min(remaining_prefill, token_budget)
# That's it. No special case. No chunk tracking.

The num_computed_tokens field tracks progress. Each iteration, after the forward pass, num_computed_tokens is incremented by the number of tokens processed. When num_computed_tokens == prompt_length, the request transitions to decode. No explicit state transition. No queue change. The scheduler’s remaining_prefill calculation simply returns 0, and the request gets 1 token (decode mode).

This eliminates an entire class of bugs related to chunk boundary handling, partial state tracking, and queue transitions that plagued v0’s chunked prefill implementation.

💡 Scheduling Policy Is Now Trivially Configurable

Because the scheduler’s core logic is just “compute num_tokens per request” and “fill a budget,” the scheduling policy can be changed by reordering the iteration or adjusting the budget allocation strategy. Want to prioritize prefills over decodes? Iterate over waiting requests before running requests. Want to limit prefill to 50% of the budget? Set prefill_budget = token_budget * 0.5. In v0, these changes required modifying three separate methods and ensuring consistency across them.
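For example, the 50% prefill cap mentioned above becomes a one-line budget split rather than a cross-cutting change (an illustrative sketch, not vLLM's actual configuration API):

```python
def split_budget(token_budget, prefill_fraction=0.5):
    """Cap prefill at a fraction of the per-iteration token budget,
    reserving the remainder for decode tokens. The unified design makes
    this kind of policy tweak local to a single spot in schedule()."""
    prefill_budget = int(token_budget * prefill_fraction)
    decode_budget = token_budget - prefill_budget
    return prefill_budget, decode_budget

# An 8,192-token budget split evenly between prefill and decode:
budgets = split_budget(8192)
```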