Attention Kernel Comparison: FlashAttention vs FlashInfer vs xformers vs Triton

Your choice of attention kernel can leave 40% of your GPU’s potential on the table. I learned this when we profiled our 70B serving stack and found the decode phase running at 58% of memory bandwidth — not because the workload was fundamentally limited, but because we’d defaulted to FlashAttention for everything, including paged KV cache decode where it’s simply not designed to excel. Switching to FlashInfer for decode alone recovered 18% throughput. The attention kernel is not a detail; it’s the difference between underutilizing your hardware and saturating it. Four major implementations compete for LLM inference workloads, and each one makes fundamentally different tradeoffs that matter when you’re serving production traffic.

The Contenders

📊 Attention Kernel Overview

| Kernel | Language | Prefill | Decode (Paged) | GQA/MQA | FP8 | Best For |
|---|---|---|---|---|---|---|
| FlashAttention-2 | CUDA | Excellent | No paging support | Yes | No | Training, prefill-heavy |
| FlashAttention-3 | CUDA (Hopper) | Best | No paging support | Yes | Yes | H100 prefill |
| FlashInfer | CUDA + Triton | Good | Excellent (BatchDecodeWithPagedKVCacheWrapper) | Yes | Yes | Serving with paged KV |
| xformers | CUDA | Good | Limited | Yes | No | PyTorch research |
| Custom Triton | Triton | Varies | Varies | Manual | Limited | Novel attention patterns |

FlashAttention-2/3

The gold standard for prefill. Tiled attention keeps intermediates in SRAM, reducing HBM traffic from O(N²) to O(N). FA-3 adds Hopper-specific optimizations (WGMMA, TMA, warp specialization) for ~1.3x over FA-2 on H100.

Limitation: Designed for contiguous Q, K, V tensors. Does not natively support paged KV cache (non-contiguous blocks). Serving systems use FA for prefill (contiguous KV) and a separate paged kernel for decode.
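
For reference, the prefill path through the flash-attn package looks like this (a minimal sketch; the shapes and head counts are illustrative, not taken from the profiled stack above):

import torch
from flash_attn import flash_attn_func

# Contiguous Q/K/V: (batch, seqlen, num_heads, head_dim), fp16/bf16 on GPU
q = torch.randn(1, 4096, 64, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 4096, 8, 128, dtype=torch.float16, device="cuda")  # GQA: 8 KV heads
v = torch.randn(1, 4096, 8, 128, dtype=torch.float16, device="cuda")

# Causal prefill; GQA is handled natively when num_heads is a multiple of num_kv_heads
out = flash_attn_func(q, k, v, causal=True)  # -> (1, 4096, 64, 128)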

FlashInfer

Built for serving. Provides both contiguous and paged attention kernels:

import torch
import flashinfer

# Prefill: contiguous Q/K/V for a single sequence
output = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)

# Decode: paged KV cache goes through a plan/run wrapper that is given the
# page table (indptr, page indices, last-page lengths) for each request.
# (API shown as of FlashInfer 0.2; older releases use begin_forward/forward.)
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")
decode.plan(
    kv_indptr, kv_page_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)
output = decode.run(q, paged_kv_cache)

Advantage over FlashAttention: native paged KV cache support means vLLM/SGLang can use one library for both prefill and decode instead of switching between FlashAttention and a custom paged kernel.

xformers Memory-Efficient Attention

Meta’s library. Its memory_efficient_attention() automatically selects the best available backend for the given inputs:

from xformers.ops import memory_efficient_attention, LowerTriangularMask

# attn_bias encodes causal masking without materializing an explicit mask tensor
output = memory_efficient_attention(query, key, value, attn_bias=LowerTriangularMask())

Advantage: cleanest PyTorch API, good for research. Disadvantage: slower than FlashAttention for production workloads, limited paged KV support.

Custom Triton Kernels

For novel attention patterns (sliding window, block sparse, linear attention), write a custom Triton kernel:

import triton
import triton.language as tl

@triton.jit
def sliding_window_attention_kernel(
    Q, K, V, Out,
    stride_qb, stride_qh, stride_qs, stride_qd,
    window_size: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    # Custom attention with sliding window
    # Each query attends to only [max(0, pos-W), pos] keys
    pid = tl.program_id(0)
    # ... kernel implementation

Advantage: maximum flexibility, Python-like syntax. Disadvantage: 20-40% slower than hand-tuned CUDA for standard patterns, harder to optimize for specific GPU architectures.
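
The usual lever for clawing back some of that gap is Triton's built-in autotuner, which benchmarks candidate configurations and caches the winner per problem size. A minimal, runnable sketch of the pattern on a trivial elementwise kernel (the configs and the scale_kernel name are illustrative, not tuned values; the same decorator stacks onto an attention kernel like the one above):

import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],  # re-benchmark and cache the best config per problem size
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * 2.0, mask=mask)

x = torch.randn(10_000, device="cuda")
out = torch.empty_like(x)
# The launch grid is a function of the tuned meta-parameters
scale_kernel[lambda meta: (triton.cdiv(10_000, meta["BLOCK"]),)](x, out, 10_000)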

Throughput Benchmarks

Prefill throughput (Llama 70B GQA, H100, seq_len=4096):

| Kernel | TFLOPS Achieved |
|---|---|
| FlashAttention-3 | 290 (best on H100) |
| FlashAttention-2 | 225 |
| FlashInfer | 210 |
| xformers | 180 |
| Triton (custom) | 165 |

Decode throughput (Llama 70B GQA, H100, batch=64, seq=2048), as memory-bandwidth utilization normalized to FlashInfer's paged kernel = 100 (FlashAttention-2's higher number comes from contiguous KV, which a paged serving stack cannot use):

| Kernel | Relative Bandwidth Utilization |
|---|---|
| FlashInfer (paged) | 100 (best for paged decode) |
| vLLM PagedAttention v2 | 92 |
| FlashAttention-2 (contiguous) | 110 (not usable with paging) |
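
Numbers like these are easy to sanity-check: decode re-reads the entire KV cache once per generated token, so achieved bandwidth is just bytes-per-step times steps-per-second. A back-of-envelope sketch with illustrative (not measured) numbers for a Llama-70B-style GQA model:

# Back-of-envelope decode bandwidth check (all numbers illustrative)
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2  # Llama-70B-style GQA, FP16 KV
batch, seq_len = 64, 2048

# Bytes read per decode step: K and V for every cached token, all layers, whole batch
kv_bytes_per_step = batch * seq_len * layers * kv_heads * head_dim * 2 * dtype_bytes  # ~43 GB

steps_per_sec = 30    # hypothetical measured decode rate
peak_bw = 3.35e12     # H100 SXM HBM3 peak, ~3.35 TB/s

print(f"{kv_bytes_per_step * steps_per_sec / peak_bw:.0%} of peak")  # ~38%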
💡 The Practical Decision

For production serving (vLLM/SGLang): use FlashAttention-3 for prefill (highest throughput) and FlashInfer for decode (best paged KV support). For training: use FlashAttention-2/3 exclusively. For research with novel attention: use Triton for prototyping, then port to CUDA if it works.
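
In vLLM this choice is exposed directly: the VLLM_ATTENTION_BACKEND environment variable overrides the engine's automatic backend selection (accepted values vary by release, so treat the sketch below as indicative and check your version's docs):

import os

# Must be set before the engine initializes; FLASH_ATTN and XFORMERS
# are among the other accepted values
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=4)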

When Each Wins

📊 Decision Matrix: Which Attention Kernel to Use

| Scenario | Best Choice | Why |
|---|---|---|
| Training on H100 | FlashAttention-3 | Highest prefill TFLOPS, FP8 support |
| Training on A100 | FlashAttention-2 | Best Ampere optimization |
| vLLM serving (decode) | FlashInfer | Native paged KV, batch decode kernel |
| SGLang serving | FlashInfer | RadixAttention integration |
| Research (novel attention) | Triton | Fastest iteration, Python-like syntax |
| PyTorch prototyping | torch SDPA | Zero setup, auto-selects backend (see sketch below) |
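
For that last row, PyTorch's built-in scaled_dot_product_attention is the zero-dependency starting point: it dispatches to a FlashAttention-style kernel, a memory-efficient kernel, or a math fallback depending on hardware, dtypes, and shapes. A minimal sketch with illustrative shapes:

import torch
import torch.nn.functional as F

# SDPA expects (batch, num_heads, seqlen, head_dim)
q = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.float16, device="cuda")

# Backend selection is automatic; is_causal=True avoids materializing a mask
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)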