Part of Series: vLLM v1 & Omni Internals (19 of 25)

vLLM v1 Async Output: Detokenization, Streaming, and Queue Management

Every token vLLM generates starts as an integer token ID on the GPU. By the time it reaches the client as part of an HTTP response, it has passed through sampling, detokenization, output queue management, and SSE (Server-Sent Events) streaming. In vLLM v1, this entire pipeline is asynchronous — the GPU never waits for detokenization or network I/O. This post traces the output processing path from sampled token ID to client-received text, covering the detokenizer architecture, incremental decoding, queue management, and backpressure mechanisms.

The Output Processing Pipeline

The pipeline has four stages:

GPU Sampler -> Output Queue -> Detokenizer -> HTTP Stream
     |              |               |              |
  token_id     RequestOutput    text delta     SSE chunk
  (int32)      (id + logprobs)  (string)       (bytes)
     |              |               |              |
   ~0.01ms       ~0.05ms         ~0.1ms         ~0.5ms

Each stage runs on a different execution context:

  • GPU Sampler: CUDA kernel on GPU
  • Output Queue: asyncio queue on engine event loop
  • Detokenizer: CPU thread (can be a separate process)
  • HTTP Stream: asyncio coroutine on API server event loop
# Simplified pipeline flow
class AsyncLLMEngine:
    async def generate(self, prompt: str, params: SamplingParams):
        request_id = self._add_request(prompt, params)

        async for output in self._stream_results(request_id):
            yield output  # Yields RequestOutput with text delta

    async def _stream_results(self, request_id: str):
        while True:
            # Non-blocking wait for next output from the engine
            output = await self.output_queue.get(request_id)
            if output.finished:
                yield output
                break
            yield output
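The decoupling can be seen end to end in a toy producer/consumer sketch (hypothetical names, not vLLM's actual API): the engine loop pushes deltas into the queue and never waits on the consumer, while the streaming coroutine drains it independently.

```python
import asyncio

async def engine_loop(q: asyncio.Queue) -> None:
    # Stand-in for the GPU engine: emit one text delta per decode step;
    # None marks the end of generation.
    for delta in ["Hel", "lo", "!", None]:
        await q.put(delta)

async def stream_client(q: asyncio.Queue) -> str:
    # Stand-in for the API server coroutine draining the per-request queue.
    parts = []
    while (delta := await q.get()) is not None:
        parts.append(delta)
    return "".join(parts)

async def main() -> str:
    q = asyncio.Queue()
    _, text = await asyncio.gather(engine_loop(q), stream_client(q))
    return text

print(asyncio.run(main()))  # Hello!
```

Because the queue is unbounded here, the producer never blocks; the real engine uses a bounded queue with backpressure, covered below.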

The Detokenizer

Detokenization converts token IDs back to text. This is not a simple lookup — subword tokenizers (BPE, SentencePiece) can produce tokens that span character boundaries, requiring stateful incremental decoding.

The Incremental Decoding Problem

Consider the token sequence [1234, 5678] where the pair decodes to "hello world". Decoding token 1234 in isolation may not yield the correct prefix: byte-level BPE tokens can end partway through a multi-byte UTF-8 character (producing replacement characters), and SentencePiece models attach word-boundary markers whose rendering depends on neighboring tokens. The safe unit to decode is the sequence, not the individual token.
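A minimal illustration with raw UTF-8 bytes (standing in for a byte-level tokenizer's per-token byte sequences, not any real tokenizer) shows the failure mode:

```python
# Two hypothetical byte-level tokens that split "é" (0xC3 0xA9) mid-character.
piece_a = b"caf\xc3"  # "caf" plus the first byte of "é"
piece_b = b"\xa9"     # the second byte of "é"

# Naive: decode each token independently -> replacement characters.
naive = (piece_a.decode("utf-8", errors="replace")
         + piece_b.decode("utf-8", errors="replace"))

# Buffered: decode the concatenation, then emit only the new suffix.
full = (piece_a + piece_b).decode("utf-8")

print(naive)  # caf??  (two U+FFFD replacement characters)
print(full)   # café
```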

The standard approach: maintain a buffer of undecoded tokens and decode the entire sequence, then emit only the new characters:

class IncrementalDetokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.token_buffer = []

    def decode_next(self, token_id: int) -> str:
        """Decode one new token, return the new text delta."""
        self.token_buffer.append(token_id)

        # Decode the full token buffer
        full_text = self.tokenizer.decode(
            self.token_buffer,
            skip_special_tokens=True
        )

        # Decode all-but-last tokens to find the stable prefix
        if len(self.token_buffer) > 1:
            prefix_text = self.tokenizer.decode(
                self.token_buffer[:-1],
                skip_special_tokens=True
            )
        else:
            prefix_text = ""

        # The delta is whatever the new token added. This also handles
        # the case where the last token changes the rendering of earlier
        # tokens, since both strings are decoded from the same buffer.
        new_text = full_text[len(prefix_text):]

        # Optimization: trim the buffer once the prefix is stable.
        # Keeping the last 8 tokens preserves enough context for correct
        # decoding; the delta computation above only ever compares
        # decodings of the same window, so it stays consistent.
        if len(self.token_buffer) > 16:
            self.token_buffer = self.token_buffer[-8:]

        return new_text
⚠️ Warning

Naive incremental decoding (decode only the new token) produces incorrect output for multi-byte UTF-8 characters and tokenizers that use merge rules. The buffer-based approach decodes the full sequence but only emits the delta, guaranteeing correctness at the cost of O(n) decoding per step where n is the sequence length.

Detokenizer Performance

The detokenizer’s cost grows linearly with sequence length because it re-decodes the full buffer:

# Detokenization cost per token at various sequence lengths
# Using Llama tokenizer (SentencePiece BPE, 32K vocab)
# Measured on single CPU core

# Sequence length 100:   0.012 ms per token
# Sequence length 1000:  0.089 ms per token
# Sequence length 4000:  0.342 ms per token
# Sequence length 16000: 1.380 ms per token
📊 Detokenization Latency per Token (Llama Tokenizer, 1 CPU Core)

Sequence Length | Time/Token (ms) | Fraction of Decode Step
            100 | 0.012           | 0.08%
            500 | 0.048           | 0.32%
          1,000 | 0.089           | 0.59%
          4,000 | 0.342           | 2.28%
         16,000 | 1.380           | 9.20%
         32,000 | 2.810           | 18.73%

At 32K tokens, detokenization takes 2.8ms per token — nearly 19% of a 15ms decode step. vLLM addresses this with the buffer trimming optimization (keeping only the last 8 tokens as context) and by running the detokenizer on a separate thread.

vLLM’s Optimized Detokenizer

vLLM v1 uses a more efficient approach that avoids full re-decoding:

class VLLMDetokenizer:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.all_token_ids = []
        # Offsets into all_token_ids delimiting the context window:
        # tokens in [prefix_offset, read_offset) are already-emitted
        # context; tokens from read_offset onward are not yet emitted.
        self.prefix_offset = 0
        self.read_offset = 0

    def decode_token(self, token_id: int) -> str:
        self.all_token_ids.append(token_id)

        # Only convert tokens from prefix_offset onward
        prefix_tokens = self.tokenizer.convert_ids_to_tokens(
            self.all_token_ids[self.prefix_offset:self.read_offset]
        )
        new_tokens = self.tokenizer.convert_ids_to_tokens(
            self.all_token_ids[self.prefix_offset:]
        )

        # Render both windows to strings; the delta is the new suffix
        prefix_str = self.tokenizer.convert_tokens_to_string(prefix_tokens)
        full_str = self.tokenizer.convert_tokens_to_string(new_tokens)
        delta = full_str[len(prefix_str):]

        # Advance the window: keep the last few tokens as decoding
        # context and mark everything up to now as emitted
        if len(self.all_token_ids) > 6:
            self.prefix_offset = len(self.all_token_ids) - 6
        self.read_offset = len(self.all_token_ids)

        return delta

This maintains O(1) amortized cost by only decoding a sliding window of tokens rather than the full sequence.
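The sliding-window idea can be exercised end to end with a toy tokenizer. The StubTokenizer below is hypothetical (a real SentencePiece model behaves analogously, with "▁" marking word boundaries); the point is that concatenating the per-step deltas reproduces the full decode even though each step only decodes a two-token window.

```python
class StubTokenizer:
    """Toy SentencePiece-like tokenizer; '▁' marks a word boundary."""
    PIECES = {0: "▁hello", 1: "▁wor", 2: "ld", 3: "!"}

    def convert_ids_to_tokens(self, ids):
        return [self.PIECES[i] for i in ids]

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens).replace("▁", " ").lstrip()

def stream_decode(tokenizer, ids, window=2):
    """Emit per-token text deltas keeping only `window` tokens of context."""
    buf, deltas = [], []
    for tid in ids:
        buf.append(tid)
        full = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(buf))
        prev = tokenizer.convert_tokens_to_string(
            tokenizer.convert_ids_to_tokens(buf[:-1]))
        deltas.append(full[len(prev):])
        buf = buf[-window:]  # trim: per-step cost stays O(window)
    return deltas

print("".join(stream_decode(StubTokenizer(), [0, 1, 2, 3])))  # hello world!
```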

Output Queue Architecture

The output queue connects the engine loop (which runs forward passes) to the API server (which streams results to clients):

class OutputQueue:
    def __init__(self):
        self._queues = {}  # request_id -> asyncio.Queue
        self._finished = {}  # request_id -> bool

    def create_queue(self, request_id: str) -> asyncio.Queue:
        q = asyncio.Queue(maxsize=64)
        self._queues[request_id] = q
        self._finished[request_id] = False
        return q

    def put_output(self, request_id: str, output: RequestOutput):
        """Called by engine after each decode step."""
        q = self._queues.get(request_id)
        if q is None:
            return  # Request was cancelled

        try:
            q.put_nowait(output)
        except asyncio.QueueFull:
            # Backpressure: drop oldest if client is slow
            try:
                q.get_nowait()  # Drop oldest
            except asyncio.QueueEmpty:
                pass
            q.put_nowait(output)

        if output.finished:
            self._finished[request_id] = True

    def remove_queue(self, request_id: str):
        self._queues.pop(request_id, None)
        self._finished.pop(request_id, None)
ℹ️ Note

The queue maxsize of 64 means at most 64 undelivered outputs can accumulate per request. At a TPOT of 15ms, this represents roughly 1 second of buffered tokens. If the client falls behind by more than 1 second, the oldest outputs are dropped from the queue (but the full response text is always available from the final RequestOutput).
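The drop-oldest behavior is easy to verify in isolation, with small integers standing in for RequestOutput objects:

```python
import asyncio

def put_with_backpressure(q: asyncio.Queue, item) -> None:
    # Same pattern as put_output above: on overflow, evict the oldest entry.
    try:
        q.put_nowait(item)
    except asyncio.QueueFull:
        q.get_nowait()      # drop the oldest undelivered output
        q.put_nowait(item)

async def demo() -> list:
    q = asyncio.Queue(maxsize=3)
    for i in range(5):
        put_with_backpressure(q, i)
    return [q.get_nowait() for _ in range(q.qsize())]

print(asyncio.run(demo()))  # [2, 3, 4]
```

Outputs 0 and 1 are silently dropped, which is why the cumulative text must also travel in the final RequestOutput.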

SSE Streaming Implementation

The API server streams tokens to clients using Server-Sent Events over HTTP:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    if request.stream:
        return StreamingResponse(
            stream_chat_response(request),
            media_type="text/event-stream"
        )
    else:
        return await complete_chat_response(request)

async def stream_chat_response(request: ChatRequest):
    request_id = engine.add_request(request)
    queue = engine.output_queue.create_queue(request_id)

    try:
        while True:
            output = await asyncio.wait_for(queue.get(), timeout=30.0)

            chunk = {
                "id": request_id,
                "object": "chat.completion.chunk",
                "choices": [{
                    "index": 0,
                    "delta": {"content": output.text_delta},
                    "finish_reason": output.finish_reason
                }]
            }

            yield f"data: {json.dumps(chunk)}\n\n"

            if output.finished:
                yield "data: [DONE]\n\n"
                break
    except asyncio.TimeoutError:
        yield f"data: {json.dumps({'error': 'timeout'})}\n\n"
    finally:
        engine.output_queue.remove_queue(request_id)
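On the receiving end, a client splits the byte stream on blank lines and strips the `data: ` prefix. A minimal parser for the chunk format shown above (a sketch, not a full SSE implementation: it ignores `event:`/`id:` fields and multi-line data):

```python
import json

def parse_sse_events(raw: str) -> list:
    """Extract the JSON payload of each `data:` event from a raw SSE stream."""
    events = []
    for block in raw.split("\n\n"):
        line = block.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":   # stream terminator, per the server above
            break
        events.append(json.loads(payload))
    return events

raw = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
text = "".join(e["choices"][0]["delta"]["content"] for e in parse_sse_events(raw))
print(text)  # Hello
```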

SSE Chunking and Buffering

Each SSE chunk contains one token’s worth of text. The HTTP layer may buffer these, defeating the purpose of streaming. vLLM forces flushing:

# In the ASGI server configuration
# uvicorn must not buffer SSE responses
# This is handled by the StreamingResponse class
# which sets Transfer-Encoding: chunked

# Nginx/reverse proxy configuration must also disable buffering:
# proxy_buffering off;
# proxy_cache off;
# X-Accel-Buffering: no;

Batch Output Processing

After each decode step, the engine produces outputs for all sequences in the batch simultaneously. Processing these outputs efficiently is critical:

class EngineOutputProcessor:
    # Note: the detokenizer here is a per-sequence manager keyed by
    # seq_id (one incremental decoder per live sequence), not the
    # single-sequence VLLMDetokenizer shown above.
    def __init__(self, detokenizer, output_queue: OutputQueue):
        self.detokenizer = detokenizer
        self.output_queue = output_queue

    def process_batch_outputs(self, sampler_output: SamplerOutput):
        """Process outputs from one decode step for all sequences."""
        for seq_output in sampler_output.outputs:
            seq_id = seq_output.seq_id
            token_id = seq_output.token_id
            logprobs = seq_output.logprobs
            finished = seq_output.finished
            finish_reason = seq_output.finish_reason

            # Detokenize
            text_delta = self.detokenizer.decode_token(seq_id, token_id)

            # Create output object
            output = RequestOutput(
                request_id=seq_output.request_id,
                token_id=token_id,
                text_delta=text_delta,
                logprobs=logprobs,
                finished=finished,
                finish_reason=finish_reason,
                cumulative_text=self.detokenizer.get_full_text(seq_id)
            )

            # Enqueue for client delivery
            self.output_queue.put_output(seq_output.request_id, output)

            if finished:
                self.detokenizer.remove_sequence(seq_id)

The batch processing loop runs on the engine thread after torch.cuda.synchronize() returns the sampled token IDs. For a batch of 128 sequences, this takes approximately:

📊 Batch Output Processing Time (128 sequences, Llama tokenizer)

Component                       | Time (ms) | Per Sequence (µs)
GPU -> CPU transfer (token IDs) | 0.015     | 0.12
Detokenization (all seqs)       | 0.82      | 6.4
Logprob formatting              | 0.35      | 2.7
Queue enqueue (all seqs)        | 0.08      | 0.6
Total                           | 1.27      | 9.9

At 1.27ms for 128 sequences, output processing adds less than 10% overhead to a 15ms decode step.

Logprob Processing

When clients request log probabilities, the output pipeline must extract and format them:

class LogprobProcessor:
    def __init__(self, tokenizer, num_logprobs: int):
        self.tokenizer = tokenizer
        self.num_logprobs = num_logprobs

    def process(self, logits: torch.Tensor,
                sampled_token: int) -> dict:
        """Extract top-k logprobs from logits tensor."""
        # logits shape: [vocab_size]
        log_probs = torch.log_softmax(logits, dim=-1)

        # Top-k logprobs
        top_values, top_indices = torch.topk(
            log_probs, self.num_logprobs
        )

        result = {
            "token": self.tokenizer.decode([sampled_token]),
            "token_logprob": log_probs[sampled_token].item(),
            "top_logprobs": {}
        }

        for val, idx in zip(top_values.tolist(), top_indices.tolist()):
            token_str = self.tokenizer.decode([idx])
            result["top_logprobs"][token_str] = val

        return result
Performance

Logprob extraction with torch.topk requires transferring the full logits tensor (vocab_size * 4 bytes) from GPU to CPU if processed on CPU, or executing the topk on GPU and transferring only k values. For Llama’s 32K vocab, the full tensor is 128 KB per sequence. With 128 sequences, that is 16 MB per step. vLLM processes topk on GPU and transfers only the results (k * 8 bytes per sequence), reducing the transfer to under 10 KB.
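The arithmetic, spelled out (vocab size and batch from the example above; k = 5 is an assumed top-k count):

```python
vocab_size = 32_000   # Llama vocab
batch = 128
k = 5                 # assumed number of requested top logprobs
bytes_per_float = 4   # fp32

# Full-logits transfer: every sequence ships its whole distribution.
full_bytes = vocab_size * bytes_per_float * batch

# GPU-side topk: only k (value, index) pairs per sequence, ~8 bytes each.
topk_bytes = k * 8 * batch

print(full_bytes)  # 16384000  (~16 MB per step)
print(topk_bytes)  # 5120      (~5 KB per step)
```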

Request Cancellation and Cleanup

When a client disconnects mid-stream, the output pipeline must clean up resources:

class CancellationHandler:
    async def monitor_client_disconnect(self, request_id: str,
                                         http_request):
        """Monitor for client disconnect and cancel the request."""
        while True:
            if await http_request.is_disconnected():
                # Client gone — cancel the request
                await self.engine.abort(request_id)
                self.output_queue.remove_queue(request_id)
                self.detokenizer.remove_sequence(request_id)
                return
            await asyncio.sleep(0.1)  # Check every 100ms

    def abort_request(self, request_id: str):
        """Engine-side abort: free KV cache blocks."""
        seq = self.scheduler.get_sequence(request_id)
        if seq is not None:
            # Free KV cache blocks immediately
            self.block_manager.free(seq.block_table)
            self.scheduler.remove_sequence(seq)

The 100ms polling interval for disconnect detection means up to 100ms of wasted decode steps after a client disconnects. At TPOT=15ms, that is roughly 6-7 wasted tokens per cancelled request.

Non-Streaming Response Assembly

For non-streaming requests, the output pipeline accumulates all tokens before returning:

async def complete_chat_response(request: ChatRequest):
    """Non-streaming: collect all tokens, return single response."""
    request_id = engine.add_request(request)
    queue = engine.output_queue.create_queue(request_id)

    full_text = ""
    all_logprobs = []
    finish_reason = None

    while True:
        output = await queue.get()
        full_text += output.text_delta
        if output.logprobs:
            all_logprobs.append(output.logprobs)
        if output.finished:
            finish_reason = output.finish_reason
            break

    engine.output_queue.remove_queue(request_id)

    return {
        "id": request_id,
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": full_text},
            "finish_reason": finish_reason,
            "logprobs": all_logprobs if all_logprobs else None
        }],
        "usage": {
            "prompt_tokens": output.prompt_token_count,
            "completion_tokens": output.completion_token_count,
            "total_tokens": output.prompt_token_count + output.completion_token_count
        }
    }

Throughput Under Streaming Load

Streaming has a measurable impact on API server throughput because each token triggers an SSE write:

📊 API Server Throughput — Streaming vs Non-Streaming

Mode          | Concurrent Clients | Requests/sec | Output tok/s | CPU Util (API server)
Non-streaming | 32                 | 4.8          | 4,610        | 15%
Streaming     | 32                 | 4.7          | 4,520        | 28%
Non-streaming | 128                | 14.2         | 4,580        | 42%
Streaming     | 128                | 13.8         | 4,410        | 68%
Non-streaming | 512                | 15.1         | 4,590        | 65%
Streaming     | 512                | 13.2         | 4,180        | 95%


At 512 concurrent streaming clients, the API server CPU hits 95% utilization, becoming the bottleneck rather than the GPU. This is because each token generates an SSE write (JSON serialization + HTTP chunk + TCP send) for each of 512 connections.

💡 Tip

For high-concurrency streaming deployments, run the API server with multiple uvicorn workers (--workers 4) and place a load balancer in front. Each worker handles a subset of connections. Alternatively, use the gRPC endpoint which has lower per-message overhead than SSE over HTTP.

End-to-End Latency Breakdown

The full latency from token sampling to client receipt:

Token sampled on GPU:                    t=0.00 ms
GPU->CPU transfer (token ID):            t=0.01 ms
Detokenization:                          t=0.02 ms
Queue enqueue + dequeue:                 t=0.05 ms
JSON serialization:                      t=0.08 ms
SSE write to socket buffer:              t=0.12 ms
TCP transmission (local):                t=0.15 ms
TCP transmission (same datacenter):      t=0.45 ms
TCP transmission (cross-region):         t=35.0 ms
Client-side parse + render:              t=0.5-5.0 ms (varies)

The output processing pipeline adds approximately 0.15ms of overhead per token — negligible compared to the 15ms decode step. Network latency is the dominant factor for client-perceived inter-token latency.

Summary

vLLM v1’s async output pipeline decouples GPU computation from client-facing I/O through a three-stage architecture: detokenizer, output queue, and SSE streamer. The detokenizer uses a sliding-window approach to maintain O(1) amortized cost per token rather than re-decoding the full sequence. Output queues with bounded capacity (64 entries) provide backpressure when clients fall behind. SSE streaming adds roughly 13 percentage points of API-server CPU at 32 concurrent connections (26 points at 128) and becomes the bottleneck at 512+ connections. Total output processing latency is about 0.15ms per token, roughly 1% of a typical decode step. For production deployments with high streaming concurrency, multiple API server workers behind a load balancer keep the output pipeline from becoming the throughput bottleneck.