Part of Series: Inference Optimization Timeline (35 of 60)

A user sends a chat completion request. About 85 milliseconds later, the first token arrives. Between those two events, the request passes through 12 distinct software layers, crosses 3 hardware boundaries (NIC, CPU, GPU), and launches on the order of a thousand CUDA kernels. This post traces every hop.

The Request Path

User Browser/API Client
  |  HTTPS POST /v1/chat/completions
  v
[1] API Gateway (nginx/envoy)                    ~2ms
  |  TLS termination, rate limit, auth
  v
[2] Load Balancer / Router (Dynamo)               ~3ms
  |  KV-aware routing, model selection
  v
[3] Engine Process (vLLM v1)                       ~1ms
  |  Request validation, tokenization
  v
[4] Scheduler (continuous batching)                ~0.1ms
  |  Admission, batch assembly
  v
[5] Model Runner                                   ~0.5ms
  |  Prepare inputs, select graph/eager
  v
[6] Model Forward Pass (80 layers)                 ~74ms (prefill, 2K prompt)
  |  Linear layers -> Attention -> FFN
  v
[7] Attention Kernel (FlashAttention/FlashInfer)   (within [6])
  |  Tiled QKV computation in SRAM
  v
[8] KV Cache Manager (PagedAttention)              (within [6])
  |  Block allocation, page table lookup
  v
[9] Sampling                                       ~0.3ms
  |  Temperature, top-p, logit processing
  v
[10] Detokenizer                                   ~0.05ms
  |   Token ID -> string
  v
[11] SSE Stream Writer                             ~0.1ms
  |   Format response, write to socket
  v
[12] Network Return Path                           ~2ms
  |   TCP/TLS back to client
  v
User receives first token

Total first-token latency: approximately 83ms for a 2K prompt on Llama 70B with 8x H100 (TP=8).

Layer 1: API Gateway

The gateway terminates TLS, validates API keys, enforces rate limits, and routes to the correct model deployment.

# Simplified nginx config for LLM inference gateway
upstream llm_router {
    server dynamo-router-1:8080;
    server dynamo-router-2:8080;
    keepalive 256;  # Persistent connections to router
}

server {
    listen 443 ssl http2;

    # TLS 1.3 with 0-RTT for returning clients
    ssl_protocols TLSv1.3;
    ssl_early_data on;  # 0-RTT saves one round trip

    location /v1/chat/completions {
        # Rate limiting per API key
        limit_req zone=api_key burst=100 nodelay;

        # Streaming: disable buffering for SSE
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Timeout: long requests can generate for minutes
        proxy_read_timeout 300s;

        proxy_pass http://llm_router;
    }
}

Gateway Latency Breakdown

| Component | P50 (µs) | P99 (µs) | Notes |
|---|---|---|---|
| TLS handshake (new) | 800 | 2000 | Full TLS 1.3 handshake |
| TLS handshake (resumed) | 100 | 300 | Session ticket |
| TLS 0-RTT | 0 | 0 | Data sent with ClientHello |
| API key validation | 50 | 200 | Redis lookup |
| Rate limit check | 20 | 80 | In-memory counter |
| Proxy header + forward | 100 | 400 | nginx internal |
| Total (returning client) | 270 | 1000 | |

Layer 2: Router (Dynamo KV-Aware Routing)

The router selects which vLLM engine instance handles the request. In 2026, this is not round-robin. NVIDIA Dynamo's router considers:

  1. KV cache locality - does this engine already have relevant KV cache from a previous turn?
  2. Queue depth - how many requests are waiting?
  3. GPU memory utilization - can the engine accept more KV cache?
  4. Model variant - is the request for the base model or a LoRA adapter?

class DynamoRouter:
    """KV-aware request router for LLM inference cluster."""

    def __init__(self, engines):
        self.engines = engines  # List of vLLM engine endpoints
        self.prefix_cache = {}  # prefix_hash -> engine_id

    def route(self, request):
        """Select the best engine for this request."""
        # Extract routing signals
        prompt_tokens = self.tokenize(request.messages)
        prefix_hash = self._compute_prefix_hash(prompt_tokens)

        # Priority 1: KV cache hit (avoid redundant prefill)
        if prefix_hash in self.prefix_cache:
            engine_id = self.prefix_cache[prefix_hash]
            engine = self.engines[engine_id]
            if engine.queue_depth < engine.max_queue:
                return engine_id

        # Priority 2: least queue depth (minimize waiting time)
        candidates = sorted(
            self.engines,
            key=lambda e: (e.queue_depth, e.gpu_memory_used_pct)
        )

        selected = candidates[0]

        # Register prefix for future cache hits
        self.prefix_cache[prefix_hash] = selected.id

        return selected.id

    def _compute_prefix_hash(self, tokens):
        """Hash the system prompt + first part of the user message.
        Most multi-turn conversations share the system prompt."""
        import hashlib
        # Hash up to the first 2048 prompt tokens
        prefix = tokens[:2048]
        raw = b"".join(int(t).to_bytes(4, "little") for t in prefix)
        return hashlib.sha256(raw).hexdigest()[:16]

Router latency: 1-3ms (includes network hop to router + routing decision + forwarding to engine).

Layer 3: Engine Process (vLLM v1)

The vLLM engine receives the request, validates it, and tokenizes the input:

class VLLMEngineV1:
    """Simplified vLLM v1 engine entry point."""

    def __init__(self, model_config, scheduler_config):
        self.model_config = model_config
        self.tokenizer = AutoTokenizer.from_pretrained(model_config.model)
        self.scheduler = UnifiedScheduler(scheduler_config)
        self.model_runner = ModelRunner(model_config)
        self.kv_cache_manager = BlockManager(scheduler_config)

    async def generate(self, request):
        """Entry point for a new request."""
        # Validate parameters
        self._validate_params(request.sampling_params)

        # Tokenize
        prompt_tokens = self.tokenizer.encode(
            request.prompt,
            add_special_tokens=True,
        )

        # Check prompt length against model max
        if len(prompt_tokens) > self.model_config.max_model_len:
            raise ValueError(
                f"Prompt length {len(prompt_tokens)} exceeds "
                f"max {self.model_config.max_model_len}"
            )

        # Create sequence group
        seq = Sequence(
            seq_id=self._next_seq_id(),
            prompt_token_ids=prompt_tokens,
            sampling_params=request.sampling_params,
        )

        # Add to scheduler queue
        self.scheduler.add_request(seq)

        # Return async generator for streaming
        async for output in self._stream_outputs(seq):
            yield output

Layer 4: Scheduler (Unified Continuous Batching)

The scheduler runs a tight loop: every iteration, it decides which requests to include in the next forward pass.

class UnifiedScheduler:
    """vLLM v1 unified scheduler: single queue for prefill and decode."""

    def __init__(self, config):
        self.max_num_batched_tokens = config.max_num_batched_tokens  # e.g., 8192
        self.max_num_seqs = config.max_num_seqs  # e.g., 256
        self.waiting_queue = []   # New requests awaiting prefill
        self.running_queue = []   # Active requests in decode phase

    def schedule(self):
        """One scheduling decision. Called every iteration (~30-50ms)."""
        batch = ScheduledBatch()
        token_budget = self.max_num_batched_tokens
        seq_budget = self.max_num_seqs

        # Step 1: Include all running decode requests (1 token each)
        for seq in self.running_queue:
            if seq_budget <= 0:
                break
            batch.add_decode(seq)
            token_budget -= 1
            seq_budget -= 1

        # Step 2: Fill remaining budget with prefill chunks
        # (iterate over a copy: completed prefills are moved to running_queue)
        for seq in list(self.waiting_queue):
            if token_budget <= 0 or seq_budget <= 0:
                break

            remaining_prefill = seq.prompt_len - seq.num_prefilled
            chunk_size = min(remaining_prefill, token_budget)

            if chunk_size > 0:
                batch.add_prefill_chunk(seq, chunk_size)
                token_budget -= chunk_size
                seq_budget -= 1

                if seq.num_prefilled + chunk_size >= seq.prompt_len:
                    # Prefill complete, move to running queue
                    self.waiting_queue.remove(seq)
                    self.running_queue.append(seq)

        return batch

Scheduler decision time: approximately 0.1ms (pure CPU, operates on metadata only).

Layer 5: Model Runner

The model runner translates the scheduled batch into GPU tensor operations:

class ModelRunner:
    """Prepare inputs and execute model forward pass."""

    def __init__(self, model_config):
        self.model = load_model(model_config)
        self.kv_caches = allocate_kv_caches(model_config)  # Paged KV tensors, one per layer
        self.cuda_graphs = CUDAGraphManager(self.model)
        self.graph_batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]

    def execute(self, scheduled_batch):
        """Execute one forward pass for the scheduled batch."""

        # Prepare input tensors
        input_ids = scheduled_batch.get_input_ids()      # [total_tokens]
        positions = scheduled_batch.get_positions()        # [total_tokens]
        slot_mapping = scheduled_batch.get_slot_mapping()  # [total_tokens]

        # Build attention metadata
        attn_metadata = self._build_attn_metadata(scheduled_batch)

        # Choose execution mode
        if scheduled_batch.is_decode_only():
            # Pure decode: use CUDA graph
            padded_bs = self._pad_to_graph_size(len(scheduled_batch.decode_seqs))
            logits = self.cuda_graphs.replay(
                padded_bs, input_ids, positions, slot_mapping
            )
            logits = logits[:len(scheduled_batch.decode_seqs)]
        else:
            # Mixed prefill + decode: eager execution
            logits = self.model.forward(
                input_ids=input_ids,
                positions=positions,
                kv_caches=self.kv_caches,
                attn_metadata=attn_metadata,
            )

        return logits

Layer 6: Model Forward Pass

For Llama 70B with TP=8 across 8x H100:

class LlamaForCausalLM:
    """Llama model forward pass - 80 transformer layers."""

    def forward(self, input_ids, positions, kv_caches, attn_metadata):
        # Embedding lookup: input_ids -> hidden_states
        # [total_tokens] -> [total_tokens, 8192]
        hidden_states = self.embed_tokens(input_ids)  # ~0.1ms

        # 80 transformer layers
        for i, layer in enumerate(self.layers):
            hidden_states = layer(
                hidden_states=hidden_states,
                positions=positions,
                kv_cache=kv_caches[i],
                attn_metadata=attn_metadata,
            )
            # Each layer: ~1ms (prefill, 2K tokens, TP=8)
            # Breakdown per layer:
            #   QKV projection: 0.15ms (GEMM)
            #   RoPE: 0.02ms
            #   Attention kernel: 0.25ms (FlashAttention)
            #   O projection: 0.05ms (GEMM)
            #   All-reduce (TP): 0.08ms
            #   RMSNorm: 0.02ms
            #   Gate+Up projection: 0.15ms (GEMM)
            #   SiLU + multiply: 0.02ms
            #   Down projection: 0.08ms (GEMM)
            #   All-reduce (TP): 0.08ms
            #   Residual add: 0.01ms
            # Total per layer: ~0.91ms

        # Final RMSNorm + output projection
        hidden_states = self.norm(hidden_states)
        logits = self.lm_head(hidden_states)  # [total_tokens, vocab_size/TP]

        # All-gather logits across TP ranks (for sampling)
        logits = tensor_model_parallel_all_gather(logits)

        return logits

Per-Layer Latency Breakdown (Llama 70B, TP=8, H100, Prefill 2K tokens)

| Operation | Time (ms) | Bottleneck | Kernel |
|---|---|---|---|
| QKV projection | 0.15 | Compute (GEMM) | sm90_xmma_gemm |
| RoPE embedding | 0.02 | Compute | rotary_embedding_kernel |
| FlashAttention-3 | 0.25 | Compute (tiled) | flash_fwd_kernel |
| O projection | 0.05 | Compute (GEMM) | sm90_xmma_gemm |
| All-reduce (TP, attn) | 0.08 | NVLink bandwidth | ncclAllReduce |
| RMSNorm | 0.02 | Bandwidth | rmsnorm_kernel |
| Gate+up projection | 0.15 | Compute (GEMM) | sm90_xmma_gemm |
| SiLU activation | 0.02 | Bandwidth | silu_mul_kernel |
| Down projection | 0.08 | Compute (GEMM) | sm90_xmma_gemm |
| All-reduce (TP, FFN) | 0.08 | NVLink bandwidth | ncclAllReduce |
| Residual add | 0.01 | Bandwidth | add_kernel |
| Total per layer | 0.91 | | |

Total for 80 layers: 80 × 0.91 = 72.8 ms. Add embedding (0.1ms) + final norm (0.02ms) + LM head (0.5ms) + all-gather (0.3ms) = 73.7ms for the prefill forward pass.
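The arithmetic above is easy to sanity-check in a few lines (values copied from the per-layer table; a back-of-envelope sketch, not a measurement):

```python
# Per-layer timings from the table above (ms; Llama 70B, TP=8, H100, 2K prefill)
per_layer = {
    "qkv_proj": 0.15, "rope": 0.02, "attention": 0.25, "o_proj": 0.05,
    "allreduce_attn": 0.08, "rmsnorm": 0.02, "gate_up_proj": 0.15,
    "silu": 0.02, "down_proj": 0.08, "allreduce_ffn": 0.08, "residual": 0.01,
}
layer_ms = sum(per_layer.values())              # 0.91 ms per layer
forward_ms = 80 * layer_ms                      # 72.8 ms for 80 layers
total_ms = forward_ms + 0.1 + 0.02 + 0.5 + 0.3  # + embed, norm, lm_head, all-gather
print(f"{layer_ms:.2f} ms/layer, {total_ms:.1f} ms prefill forward pass")
```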

Layer 7: Attention Kernel Deep Dive

During prefill, FlashAttention-3 computes full causal attention. During decode, FlashInfer's BatchDecodeWithPagedKV handles paged KV cache lookups.

# What happens inside the attention kernel during prefill:

def flash_attention_3_simplified(Q, K, V, causal=True):
    """Simplified FlashAttention-3 (Hopper) execution flow.
    Real implementation uses WGMMA instructions and TMA;
    this sketch assumes num_heads == num_kv_heads (no GQA)."""

    # Q: [batch, num_heads, seq_len, head_dim]
    # K, V: [batch, num_kv_heads, seq_len, head_dim]
    B, H, S, D = Q.shape

    # Tile sizes for H100 (192KB shared memory per SM)
    BLOCK_Q = 128   # Query tile size
    BLOCK_KV = 128  # KV tile size

    # Output accumulator and log-sum-exp
    O = torch.zeros_like(Q)
    L = torch.zeros(B, H, S, 1, device=Q.device)

    # Outer loop: tiles of queries
    for q_start in range(0, S, BLOCK_Q):
        q_end = min(q_start + BLOCK_Q, S)
        q_tile = Q[:, :, q_start:q_end, :]  # Load Q tile to SRAM

        # Inner loop: tiles of keys/values
        # (with causal masking, only iterate up to q_end)
        kv_end = q_end if causal else S

        o_tile = torch.zeros_like(q_tile)
        m_tile = torch.full((B, H, q_end - q_start, 1), float("-inf"),
                            device=Q.device)                             # Running max
        l_tile = torch.zeros(B, H, q_end - q_start, 1, device=Q.device)  # Running sum

        for kv_start in range(0, kv_end, BLOCK_KV):
            kv_block_end = min(kv_start + BLOCK_KV, kv_end)
            k_tile = K[:, :, kv_start:kv_block_end, :]  # Load K tile to SRAM
            v_tile = V[:, :, kv_start:kv_block_end, :]  # Load V tile to SRAM

            # Compute attention scores in SRAM (no HBM write)
            scores = q_tile @ k_tile.transpose(-2, -1) / (D ** 0.5)

            # Apply causal mask: query i may not attend to key j > i
            if causal:
                q_idx = torch.arange(q_start, q_end, device=Q.device)[:, None]
                kv_idx = torch.arange(kv_start, kv_block_end, device=Q.device)[None, :]
                scores = scores.masked_fill(kv_idx > q_idx, float("-inf"))

            # Online softmax: rescale running statistics to the new max
            new_max = torch.maximum(m_tile, scores.max(dim=-1, keepdim=True).values)
            scores_exp = torch.exp(scores - new_max)
            correction = torch.exp(m_tile - new_max)
            o_tile = o_tile * correction + scores_exp @ v_tile
            l_tile = l_tile * correction + scores_exp.sum(dim=-1, keepdim=True)
            m_tile = new_max

        # Write output tile to HBM, normalized by the softmax denominator
        O[:, :, q_start:q_end, :] = o_tile / l_tile
        L[:, :, q_start:q_end, :] = m_tile + torch.log(l_tile)

    return O

During decode, the attention kernel changes fundamentally:

def paged_decode_attention(Q, kv_cache, page_table, seq_lens):
    """Decode attention with paged KV cache.
    Q: [batch, num_heads, 1, head_dim] (one query per request)
    kv_cache: [num_blocks, 2, num_kv_heads, block_size, head_dim]
    page_table: [batch, max_blocks] (maps logical to physical blocks)
    seq_lens: [batch] (current sequence length per request)
    """
    batch_size = Q.shape[0]
    output = torch.zeros_like(Q)

    for b in range(batch_size):
        # Gather this request's KV blocks from the page table
        num_blocks = (seq_lens[b] + 15) // 16  # block_size = 16
        k_blocks = []
        v_blocks = []
        for block_idx in range(num_blocks):
            physical_block = page_table[b, block_idx]
            k_blocks.append(kv_cache[physical_block, 0])  # K
            v_blocks.append(kv_cache[physical_block, 1])  # V

        # Concatenate gathered blocks
        K = torch.cat(k_blocks, dim=1)[:, :seq_lens[b], :]
        V = torch.cat(v_blocks, dim=1)[:, :seq_lens[b], :]

        # Single query attention: Q [1, head_dim] x K [seq_len, head_dim]
        scores = (Q[b] @ K.transpose(-2, -1)) / (Q.shape[-1] ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        output[b] = attn @ V

    return output

Layer 8: KV Cache Manager

The block manager allocates and tracks KV cache blocks using PagedAttention:

class BlockManager:
    """Manage paged KV cache allocation."""

    def __init__(self, num_gpu_blocks, block_size=16):
        self.num_blocks = num_gpu_blocks   # e.g., 65536 blocks
        self.block_size = block_size       # 16 tokens per block
        self.free_blocks = list(range(num_gpu_blocks))
        self.block_tables = {}  # seq_id -> list of physical block indices
        self.num_tokens = {}    # seq_id -> number of tokens currently stored

    def allocate(self, seq_id, num_tokens):
        """Allocate blocks for a sequence's prompt."""
        num_blocks_needed = (num_tokens + self.block_size - 1) // self.block_size

        if len(self.free_blocks) < num_blocks_needed:
            raise RuntimeError("Out of KV cache blocks")

        allocated = [self.free_blocks.pop() for _ in range(num_blocks_needed)]
        self.block_tables[seq_id] = allocated
        self.num_tokens[seq_id] = num_tokens
        return allocated

    def append_slot(self, seq_id):
        """Allocate one more token slot for a decode step."""
        blocks = self.block_tables[seq_id]
        tokens = self.num_tokens[seq_id]

        if tokens == len(blocks) * self.block_size:
            # All allocated blocks are full: grab a fresh one
            if not self.free_blocks:
                raise RuntimeError("Out of KV cache blocks")
            blocks.append(self.free_blocks.pop())

        slot = tokens % self.block_size  # Slot index within the last block
        self.num_tokens[seq_id] = tokens + 1
        return blocks[-1], slot
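To see where a block count like 65,536 comes from, it helps to run the KV memory math. The sketch below assumes Llama 70B's GQA shape (80 layers, 8 KV heads, head_dim 128) and FP16 KV entries; the figures are illustrative:

```python
# KV cache memory math for Llama 70B (GQA: 80 layers, 8 KV heads, head_dim 128)
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2          # FP16
block_size = 16             # tokens per PagedAttention block

# K and V, every layer, every KV head, per token (summed across TP ranks)
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(bytes_per_token / 1024)   # 320.0 KiB per token

bytes_per_block = bytes_per_token * block_size
num_blocks = 65536
total_gib = num_blocks * bytes_per_block / 2**30
print(total_gib)                # 320.0 GiB of KV capacity in total
```

At 16 tokens per block, each block holds 5 MiB of KV, so 65,536 blocks back roughly one million cached tokens, or about 40 GiB per GPU when split across 8 TP ranks.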

Layer 9: Sampling

After the forward pass produces logits, sampling selects the next token:

class Sampler:
    """GPU-accelerated token sampling."""

    def sample(self, logits, sampling_params_batch):
        """Sample next tokens for all sequences in the batch.

        logits: [batch_size, vocab_size]
        sampling_params_batch: list of SamplingParams per sequence
        """
        # Step 1: Apply logit processors (all on GPU).
        # Order matters: penalties first, then temperature, then top-p.
        for i, params in enumerate(sampling_params_batch):
            if params.repetition_penalty != 1.0:
                logits[i] = self._apply_repetition_penalty(
                    logits[i], params.previous_tokens, params.repetition_penalty
                )

            if params.temperature not in (0.0, 1.0):
                # temperature == 0 means greedy; never divide by zero
                logits[i] /= params.temperature

            if params.top_p < 1.0:
                logits[i] = self._top_p_filter(logits[i], params.top_p)

        # Step 2: Convert to probabilities
        probs = torch.softmax(logits, dim=-1)

        # Step 3: Sample (greedy: argmax; otherwise: multinomial)
        next_tokens = torch.zeros(
            logits.shape[0], dtype=torch.long, device=logits.device
        )
        for i, params in enumerate(sampling_params_batch):
            if params.temperature == 0:
                next_tokens[i] = logits[i].argmax()
            else:
                next_tokens[i] = torch.multinomial(probs[i], 1).squeeze()

        return next_tokens

    def _top_p_filter(self, logits, p):
        """Nucleus sampling: keep the smallest set of tokens whose
        cumulative probability reaches p."""
        sorted_logits, sorted_indices = torch.sort(logits, descending=True)
        cumulative_probs = torch.cumsum(
            torch.softmax(sorted_logits, dim=-1), dim=-1
        )
        # Remove tokens once cumulative probability exceeds p,
        # shifted right by one so the boundary token is kept
        sorted_indices_to_remove = cumulative_probs > p
        sorted_indices_to_remove[1:] = sorted_indices_to_remove[:-1].clone()
        sorted_indices_to_remove[0] = False
        indices_to_remove = sorted_indices[sorted_indices_to_remove]
        logits[indices_to_remove] = float("-inf")
        return logits

Sampling latency: 0.2-0.5ms depending on the number of logit processors and whether structured output constraints are active.

Layer 10-12: Detokenization and Streaming

import json

class StreamingResponseWriter:
    """Convert token IDs to text and stream as SSE."""

    def __init__(self, tokenizer, request_id):
        self.tokenizer = tokenizer
        self.request_id = request_id
        self.buffer = ""  # Partial-token buffer for multi-byte characters

    async def stream_token(self, token_id, response):
        """Convert a token to text and write one SSE chunk.
        `response` is any async HTTP response object with a write() method."""
        # Detokenize (handles partial UTF-8 sequences)
        token_text = self.tokenizer.decode(
            [token_id], skip_special_tokens=True
        )

        if token_text:
            # Format as SSE (Server-Sent Events)
            chunk = {
                "id": f"chatcmpl-{self.request_id}",
                "object": "chat.completion.chunk",
                "choices": [{
                    "index": 0,
                    "delta": {"content": token_text},
                    "finish_reason": None,
                }],
            }
            # Write to the HTTP response stream
            await response.write(f"data: {json.dumps(chunk)}\n\n".encode())

End-to-End Latency Budget

End-to-End Latency Budget: TTFT (Llama 70B, TP=8, H100, 2K Prompt)

| Stage | Gateway | Router | Tokenize | Schedule | Model Runner | Forward Pass | Sampling | Detokenize | Stream |
|---|---|---|---|---|---|---|---|---|---|
| Latency (ms) | 2 | 3 | 1 | 0.1 | 0.5 | 73.7 | 0.3 | 0.05 | 0.1 |

TTFT Latency Budget Summary

| Category | Components | Time (ms) | Percentage |
|---|---|---|---|
| Network + gateway | TLS, auth, routing | 5.0 | 6.2% |
| Engine overhead | Tokenize, schedule, runner | 1.6 | 2.0% |
| GPU forward pass | 80 layers (GEMM + attention + comm) | 73.7 | 91.4% |
| Post-processing | Sample, detokenize, stream | 0.45 | 0.6% |
| Total TTFT | | 80.75 | 100% |

91% of TTFT is the GPU forward pass. All other components combined account for roughly 7ms. This is why GPU kernel optimization (FlashAttention, quantization, tensor parallelism) dominates inference optimization: the forward pass is the overwhelming bottleneck.
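Amdahl's law makes this concrete: since only about 91% of TTFT is forward pass, even an ideal 2x kernel speedup yields well under a 2x TTFT improvement. A back-of-envelope sketch using the budget above:

```python
# Amdahl's law applied to the TTFT budget above
total_ttft_ms = 80.75
gpu_fraction = 73.7 / total_ttft_ms   # ~0.914 of TTFT is forward pass

def ttft_after_speedup(s):
    """New TTFT if the GPU forward pass alone becomes s times faster."""
    return total_ttft_ms * ((1 - gpu_fraction) + gpu_fraction / s)

for s in (2, 4, 8):
    new = ttft_after_speedup(s)
    print(f"{s}x forward speedup -> {new:5.1f} ms TTFT "
          f"({total_ttft_ms / new:.2f}x overall)")
```

A 2x forward-pass speedup cuts TTFT to about 43.9ms, a 1.84x overall gain; by 8x, the ~7ms of non-GPU work becomes a meaningful share of what remains.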

Decode Iteration Breakdown

After the first token, each subsequent token follows a faster path (no prefill, just decode):

def decode_iteration_trace():
    """Trace of a single decode iteration with batch=128 requests."""
    trace = {
        "scheduler_decision": 0.08,    # ms - pick decode batch
        "prepare_inputs": 0.12,         # ms - copy token IDs, positions
        "cuda_graph_replay": 0.005,     # ms - launch graph
        "gpu_forward_80_layers": 33.2,  # ms - bandwidth-bound
        "sampling_batch": 0.25,         # ms - 128 samples
        "kv_cache_append": 0.02,        # ms - update slot mapping
        "detokenize_batch": 0.15,       # ms - 128 token decodes
        "stream_responses": 0.10,       # ms - SSE to 128 clients
    }
    trace["total_itl"] = sum(trace.values())
    # total_itl ~= 33.9ms per decode iteration
    # = 33.9ms inter-token latency for each of the 128 clients
    return trace

Decode Iteration Latency vs Batch Size (Llama 70B, TP=8, H100)

| Batch size | 1 | 8 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|
| Total ITL (ms) | 33.5 | 33.6 | 33.8 | 33.9 | 34.1 | 35.8 | 42.3 |
| GPU forward only (ms) | 33.2 | 33.2 | 33.3 | 33.4 | 33.5 | 35.0 | 41.2 |
| Throughput (tokens/s) | 30 | 238 | 947 | 1888 | 3753 | 7151 | 12104 |

Decode ITL stays nearly flat from batch=1 to batch=128 (33.5ms to 34.1ms) because the forward pass is bandwidth-bound: loading weights takes the same time regardless of batch size. But throughput scales linearly: 128 tokens per iteration instead of 1, for 125x throughput at only 1.8% higher latency. This is why continuous batching is critical for production serving economics.
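The throughput column follows directly from ITL: every client receives one token per iteration, so cluster throughput is batch_size / ITL. A quick check against the numbers in the table:

```python
# Throughput = batch_size / ITL, using the ITL measurements from the table above
itl_ms = {1: 33.5, 8: 33.6, 32: 33.8, 64: 33.9, 128: 34.1, 256: 35.8, 512: 42.3}

for batch, itl in itl_ms.items():
    tokens_per_s = batch / (itl / 1000)  # tokens/second across the whole batch
    print(f"batch={batch:>3}: {tokens_per_s:>7.0f} tok/s")
# Matches the throughput row above (to rounding): batch=128 is ~125x the
# throughput of batch=1 at only (34.1/33.5 - 1) ~= 1.8% higher latency.
```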

The Complete Timing Diagram

Time (ms)
0        10       20       30       40       50       60       70       80       90
|--------|--------|--------|--------|--------|--------|--------|--------|--------|
[GW][Router][Tok][======== GPU Forward Pass (80 layers) ==============][S][D][SSE]
 2ms  3ms   1ms                    73.7ms                               0.3 0.05 0.1

Per-token decode after TTFT:
|[Sched][==== GPU Decode (80 layers, BW-bound) ====][S][D][SSE]|
  0.1ms              33.2ms                          0.3 0.05 0.1
  Total ITL: ~33.9ms
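Combining the two budgets gives the wall-clock time for a complete response: TTFT + (n - 1) x ITL. A sketch using the figures above (batch=128 decode):

```python
# Total generation time = TTFT + (output_tokens - 1) * ITL
TTFT_MS = 80.75   # first token (from the TTFT budget)
ITL_MS = 33.9     # each subsequent token (decode at batch=128)

def generation_time_s(output_tokens):
    """Wall-clock seconds to stream a full response of output_tokens tokens."""
    return (TTFT_MS + (output_tokens - 1) * ITL_MS) / 1000

for n in (1, 100, 500):
    print(f"{n:>4} tokens: {generation_time_s(n):6.2f} s")
# A 500-token response takes ~17 s, over 99% of it in decode iterations.
```

Prefill dominates TTFT, but decode dominates total latency: past a few dozen output tokens, ITL is nearly all that matters to the user's end-to-end wait.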

Where the Optimizations Live

Each layer of the stack has specific optimization techniques:

OPTIMIZATION_MAP = {
    "layer_1_gateway": {
        "techniques": ["TLS session resumption", "HTTP/3 QUIC", "0-RTT"],
        "savings_ms": 1.5,
        "complexity": "low",
    },
    "layer_2_router": {
        "techniques": ["KV-aware routing", "prefix hash matching", "load prediction"],
        "savings_ms": 0,  # No latency savings, but higher cache hit rate
        "throughput_gain": "20-40% from prefix cache hits",
    },
    "layer_3_tokenizer": {
        "techniques": ["Rust tokenizer (tokenizers library)", "batch tokenization"],
        "savings_ms": 0.5,
        "complexity": "low",
    },
    "layer_4_scheduler": {
        "techniques": ["Chunked prefill", "priority scheduling", "SLO-aware admission"],
        "savings_ms": 0,  # Trades TTFT for ITL or vice versa
        "impact": "Reduces tail latency by 2-5x",
    },
    "layer_5_model_runner": {
        "techniques": ["CUDA graph capture", "persistent tensor buffers"],
        "savings_ms": 8.0,  # Eliminates launch overhead
        "complexity": "medium",
    },
    "layer_6_forward_pass": {
        "techniques": [
            "FP8 quantization (2x speedup)",
            "FlashAttention-3 (1.3x over FA-2)",
            "Tensor parallelism (Nx speedup)",
            "Speculative decoding (2-3x decode speedup)",
        ],
        "savings_ms": "30-60% of forward pass time",
        "complexity": "high",
    },
    "layer_7_attention": {
        "techniques": ["FlashInfer paged KV", "FP8 KV cache", "H2O eviction"],
        "savings_ms": "Varies with context length",
        "memory_savings": "2-4x KV cache reduction",
    },
    "layer_9_sampling": {
        "techniques": ["Batched GPU sampling", "speculative draft verification"],
        "savings_ms": 0.1,
        "complexity": "low",
    },
}

Profiling the Full Stack

To identify bottlenecks in a production deployment, instrument every layer:

import time
import torch

class StackProfiler:
    """Instrument the full inference stack for bottleneck identification."""

    def __init__(self):
        self.timings = {}

    def profile_request(self, engine, request):
        """Profile a single request through every stack layer."""
        self.timings = {}

        # Layer 1-2: Network (measure from client side)
        t0 = time.perf_counter()
        # ... gateway + router handled externally ...

        # Layer 3: Tokenization
        t_tok_start = time.perf_counter()
        tokens = engine.tokenizer.encode(request.prompt)
        self.timings["tokenize_ms"] = (time.perf_counter() - t_tok_start) * 1000

        # Layer 4: Scheduling
        t_sched_start = time.perf_counter()
        batch = engine.scheduler.schedule()
        self.timings["schedule_ms"] = (time.perf_counter() - t_sched_start) * 1000

        # Layer 5-8: Model execution (GPU profiled with events)
        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)

        start_event.record()
        logits = engine.model_runner.execute(batch)
        end_event.record()
        torch.cuda.synchronize()
        self.timings["gpu_forward_ms"] = start_event.elapsed_time(end_event)

        # Layer 9: Sampling
        t_sample_start = time.perf_counter()
        next_tokens = engine.sampler.sample(logits, batch.sampling_params)
        torch.cuda.synchronize()
        self.timings["sample_ms"] = (time.perf_counter() - t_sample_start) * 1000

        # Layer 10: Detokenization
        t_detok_start = time.perf_counter()
        text = engine.tokenizer.decode(next_tokens.tolist())
        self.timings["detokenize_ms"] = (time.perf_counter() - t_detok_start) * 1000

        self.timings["total_ms"] = sum(self.timings.values())
        self.timings["gpu_pct"] = (
            self.timings["gpu_forward_ms"] / self.timings["total_ms"] * 100
        )

        return self.timings
Stack Layer Optimization Priority

| Priority | Layer | Technique | Expected Impact |
|---|---|---|---|
| 1 (highest) | Forward pass | FP8 quantization | 2x throughput, 50% latency reduction |
| 2 | Forward pass | Tensor parallelism (TP=8) | 8x faster prefill, 8x HBM bandwidth |
| 3 | Model runner | CUDA graphs | 20% decode latency reduction |
| 4 | Scheduler | Chunked prefill | 3-5x better tail ITL |
| 5 | Router | KV-aware routing | 20-40% prefix cache hit rate |
| 6 | Attention | FP8 KV cache | 2x more concurrent requests |
| 7 | Gateway | TLS 0-RTT | 1-2ms TTFT reduction |

Every millisecond in this trace maps to a specific software component and hardware resource. Optimizing the inference stack in 2026 means knowing which component owns which milliseconds and applying the right technique at each layer: faster networks at the gateway, smarter routing at the router, better scheduling in the engine, faster kernels on the GPU, and efficient streaming on the return path. The GPU forward pass dominates at 91% of TTFT, making kernel-level optimizations (quantization, attention backends, TP) the highest-impact investments. But the remaining 9% (scheduling, routing, sampling) determines tail latency behavior and can make the difference between a 1.2x and a 15x P99/P50 ratio.