vLLM is one of the most widely deployed open-source LLM inference engines. It handles everything from single-GPU laptop deployments to multi-node clusters serving thousands of concurrent requests. Understanding how it works at the code level — not just the concepts, but the actual modules, classes, and data flow — is essential for anyone operating, extending, or debugging a vLLM deployment.
This post is a guided tour of the vLLM codebase. We trace the lifecycle of a single request from HTTP arrival to token output, identifying every major component it touches.
The Request Lifecycle at 10,000 Feet
When a request arrives at vLLM, it flows through these stages:

API Server → Engine → Scheduler (+ Block Manager) → Worker → Model Executor → Attention Backend → CUDA Kernels
Each stage corresponds to a major module in the codebase:
vLLM Source Code Modules
| Module | Path | Key Classes | Responsibility |
|---|---|---|---|
| API Server | vllm/entrypoints/ | OpenAIServingChat, APIServer | HTTP endpoint, request parsing, streaming |
| Engine | vllm/engine/ | LLMEngine, AsyncLLMEngine | Orchestrate scheduler + workers |
| Scheduler | vllm/core/scheduler.py | Scheduler | Batch composition, preemption, budget |
| Block Manager | vllm/core/block_manager.py | BlockSpaceManager | KV cache block allocation/deallocation |
| Worker | vllm/worker/ | Worker, GPUModelRunner | GPU execution, TP coordination |
| Model Executor | vllm/model_executor/ | ModelRunner, model implementations | Forward pass, CUDA graphs, input prep |
| Attention | vllm/attention/ | AttentionBackend, PagedAttention | Attention kernel selection and execution |
| CUDA Kernels | csrc/ | C++/CUDA implementations | Paged attention, cache ops, activations |
LLMEngine: The Central Coordinator
The LLMEngine is the entry point for all inference. It ties together the scheduler, workers, and tokenizer into a single step() loop:
```python
class LLMEngine:
    def step(self):
        # 1. Scheduler decides what to run
        scheduler_output = self.scheduler.schedule()

        # 2. If nothing to do, return empty
        if scheduler_output.is_empty():
            return []

        # 3. Send work to GPU workers
        model_output = self.model_executor.execute_model(scheduler_output)

        # 4. Process outputs (decode tokens, check stop conditions)
        request_outputs = self._process_model_outputs(model_output)

        # 5. Update scheduler state (mark finished, update token counts)
        self.scheduler.update(scheduler_output, model_output)
        return request_outputs
```
The step() method runs once per iteration (~20-50ms). The AsyncLLMEngine wraps this in an async loop that continuously calls step() and streams results back to clients.
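That wrapping loop can be sketched with a toy stand-in; `ToyEngine` and `run_engine_loop` are illustrative names, not vLLM's actual classes, and the real AsyncLLMEngine pushes each output onto a per-request stream rather than collecting a list.

```python
import asyncio

class ToyEngine:
    """Stand-in for LLMEngine: each step() 'finishes' one pending request."""
    def __init__(self, pending):
        self.pending = list(pending)

    def step(self):
        # Mirrors step() returning RequestOutput objects for sequences
        # that produced tokens this iteration.
        return [self.pending.pop(0)] if self.pending else []

async def run_engine_loop(engine):
    """Continuously call step() and collect results, as the async wrapper does."""
    results = []
    while engine.pending:
        outputs = engine.step()   # one scheduler + GPU iteration
        results.extend(outputs)   # vLLM streams these back to waiting clients
        await asyncio.sleep(0)    # yield control between iterations
    return results

print(asyncio.run(run_engine_loop(ToyEngine(["req-1", "req-2"]))))
# → ['req-1', 'req-2']
```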
vLLM has two modes: offline (LLM class for batch processing) and online (AsyncLLMEngine for serving). Both use the same scheduler and workers underneath. The difference is how requests enter (all at once vs streaming) and how results exit (returned vs streamed).
The Scheduler: Brain of the System
The scheduler (detailed in Part 2 of this series) maintains three queues — waiting, running, and swapped — and decides the batch composition each iteration. Its output is a SchedulerOutput containing:
- Which sequences to prefill (and how many tokens each)
- Which sequences to continue decoding (1 token each)
- Which sequences to preempt (and whether to swap or recompute)
- Block allocation/deallocation instructions for the block manager
The scheduler operates entirely on CPU. Its execution time (~0.5-2ms) is negligible compared to the GPU forward pass (~10-50ms). But its decisions determine throughput — a bad scheduling policy can waste 50%+ of GPU capacity.
Block Manager: Virtual Memory for KV Cache
The BlockSpaceManager implements the paged KV cache allocation described in the PagedAttention paper. It divides GPU HBM into fixed-size blocks (default: 16 tokens per block) and manages them like an OS manages physical memory pages:
```python
class BlockSpaceManager:
    def __init__(self, block_size, num_gpu_blocks, num_cpu_blocks):
        self.block_size = block_size
        self.gpu_allocator = BlockAllocator(num_gpu_blocks)  # free list over GPU blocks
        self.cpu_allocator = BlockAllocator(num_cpu_blocks)  # swap space
        self.block_tables = {}  # seq_id -> list of physical block IDs

    def allocate(self, seq_id, num_blocks):
        blocks = [self.gpu_allocator.allocate() for _ in range(num_blocks)]
        self.block_tables[seq_id] = blocks

    def free(self, seq_id):
        for block in self.block_tables[seq_id]:
            self.gpu_allocator.free(block)
        del self.block_tables[seq_id]

    def swap_out(self, seq_id):
        # Copy this sequence's GPU blocks to CPU blocks
        gpu_blocks = self.block_tables[seq_id]
        cpu_blocks = [self.cpu_allocator.allocate() for _ in gpu_blocks]
        # ... initiate GPU->CPU copy
        for block in gpu_blocks:
            self.gpu_allocator.free(block)
        self.block_tables[seq_id] = cpu_blocks
```
The block table maps logical block indices to physical block pointers. This indirection is what enables: (a) non-contiguous KV cache storage, (b) zero-fragmentation allocation, (c) copy-on-write for beam search, (d) prefix sharing across requests.
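The indirection itself is two integer operations per token. A minimal sketch (`locate_token` is an illustrative name, not a vLLM function) of the address translation the paged kernels perform:

```python
BLOCK_SIZE = 16  # tokens per block, vLLM's default

def locate_token(block_table, token_idx, block_size=BLOCK_SIZE):
    """Translate a logical token index into (physical_block, offset) —
    the per-token lookup a paged attention kernel performs."""
    logical_block = token_idx // block_size
    offset = token_idx % block_size
    return block_table[logical_block], offset

# A sequence whose KV cache lives in three scattered physical blocks:
block_table = [7, 2, 41]
print(locate_token(block_table, 0))   # → (7, 0)   first token
print(locate_token(block_table, 17))  # → (2, 1)   second block, slot 1
print(locate_token(block_table, 40))  # → (41, 8)  third block, slot 8
```

Because consecutive logical blocks can map to arbitrary physical blocks, sequences grow one block at a time with no contiguity requirement — the source of the zero-fragmentation property.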
Workers and Model Runner
Workers execute on each GPU. In tensor-parallel setups, multiple workers coordinate:
```python
class Worker:
    def execute_model(self, scheduler_output):
        # 1. Prepare model inputs (token IDs, positions, attention metadata)
        inputs = self.model_runner.prepare_input(scheduler_output)

        # 2. Run model forward pass (potentially via CUDA graph)
        output = self.model_runner.execute(inputs)

        # 3. Sample next tokens from logits
        sampled = self.model_runner.sample(output.logits)
        return sampled
```
The ModelRunner handles the critical details:
- Input preparation: Packing variable-length sequences into padded tensors, computing position IDs, building attention metadata (block tables, sequence lengths)
- CUDA graph management: For decode steps with fixed batch sizes, captured CUDA graphs eliminate kernel launch overhead
- Tensor parallelism: Distributing computation across GPUs, handling all-reduce communication
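A toy version of input preparation for a decode step makes the metadata concrete. This is a simplified sketch — `prepare_decode_inputs` and its argument shape are hypothetical, not vLLM's API — showing the flattened token IDs, position IDs, and the KV-cache "slot mapping" telling the cache kernel where each new token's key/value is written:

```python
def prepare_decode_inputs(sequences, block_size=16):
    """Toy decode-step input prep: one new token per sequence.
    `sequences` maps seq_id -> (next_token_id, tokens_so_far, block_table).
    A slot index is physical_block * block_size + offset."""
    token_ids, positions, slot_mapping = [], [], []
    for _, (tok, seq_len, block_table) in sequences.items():
        token_ids.append(tok)
        positions.append(seq_len)  # the next position index
        block = block_table[seq_len // block_size]
        slot_mapping.append(block * block_size + seq_len % block_size)
    return token_ids, positions, slot_mapping

# Two sequences: one 3 tokens long (1 block), one 17 tokens long (2 blocks)
seqs = {"a": (578, 3, [5]), "b": (42, 17, [9, 1])}
print(prepare_decode_inputs(seqs))
# → ([578, 42], [3, 17], [83, 17])
```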
In a typical iteration at batch=64 on Llama 70B with H100: GPU forward pass ~30ms (90%), input preparation ~2ms (6%), scheduling ~1ms (3%), sampling ~0.5ms (1.5%). The forward pass dominates. Optimizing anything else gives marginal returns.
Attention Backends
vLLM abstracts attention computation behind an AttentionBackend interface, allowing different kernels for different scenarios:
vLLM Attention Backend Selection
| Backend | Used For | Key Property | When Selected |
|---|---|---|---|
| FlashAttention-2 | Prefill | Contiguous Q,K,V — maximum throughput | Default for prefill on CUDA GPUs |
| PagedAttention v2 | Decode | Handles non-contiguous KV blocks via block tables | Default for decode |
| FlashInfer | Both | Alternative implementation with different tradeoffs | When explicitly selected |
| Torch SDPA | Fallback | PyTorch native, broadest compatibility | When no optimized backend available |
The critical split: prefill uses FlashAttention (contiguous KV, maximum throughput) while decode uses PagedAttention (non-contiguous blocks, necessary for paged memory). Part 3 of this series details the PagedAttention kernel implementation.
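The split can be expressed as a small decision function. This is a toy mirror of the logic (the real selection in vllm/attention/ also weighs GPU architecture, head size, and dtype; `select_backend` is an illustrative name):

```python
def select_backend(is_prefill, has_flash_attn=True, use_flashinfer=False):
    """Simplified sketch of per-phase attention backend selection."""
    if use_flashinfer:
        return "FlashInfer"        # explicitly requested; handles both phases
    if is_prefill:
        # Prefill sees contiguous Q/K/V, so the fastest dense kernel wins
        return "FlashAttention-2" if has_flash_attn else "Torch SDPA"
    # Decode must follow block tables into scattered KV blocks
    return "PagedAttention v2"

print(select_backend(is_prefill=True))   # → FlashAttention-2
print(select_backend(is_prefill=False))  # → PagedAttention v2
```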
The CUDA Kernels
The csrc/ directory contains the C++/CUDA code that makes vLLM fast:
- `csrc/attention/`: Paged attention kernels (v1 and v2). The most performance-critical CUDA code.
- `csrc/cache_kernels.cu`: KV cache operations — `reshape_and_cache` (write new KV to blocks), swap (GPU to CPU copy), copy (for COW beam search).
- `csrc/activation_kernels.cu`: Fused activation functions (SiLU, GELU).
- `csrc/layernorm_kernels.cu`: Fused RMS normalization.
- `csrc/quantization/`: Quantized GEMM kernels for INT4/INT8/FP8.
These custom CUDA kernels exist because PyTorch’s default operators don’t handle the paged memory layout. Standard torch.nn.functional.scaled_dot_product_attention expects contiguous tensors — vLLM’s KV cache lives in scattered blocks that require custom addressing.
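The addressing problem is easy to see in miniature. The following pure-Python sketch (a tiny block size of 4 for brevity; `write_kv`/`gather_kv` are illustrative stand-ins for the CUDA kernels) shows a `reshape_and_cache`-style scattered write and the block-table-guided gather a paged attention kernel must do:

```python
# Simulated KV cache: 4 physical blocks of 4 slots each
BLOCK_SIZE = 4
kv_cache = [[None] * BLOCK_SIZE for _ in range(4)]

def write_kv(block_table, token_idx, value):
    """reshape_and_cache in miniature: write one token's KV into its slot."""
    block = block_table[token_idx // BLOCK_SIZE]
    kv_cache[block][token_idx % BLOCK_SIZE] = value

def gather_kv(block_table, seq_len):
    """What a paged kernel does per query: follow the block table to
    collect a sequence's KV values from non-contiguous physical blocks."""
    return [kv_cache[block_table[i // BLOCK_SIZE]][i % BLOCK_SIZE]
            for i in range(seq_len)]

table = [2, 0]  # logical block 0 -> physical block 2, logical 1 -> physical 0
for i in range(6):
    write_kv(table, i, f"kv{i}")
print(gather_kv(table, 6))  # → ['kv0', 'kv1', 'kv2', 'kv3', 'kv4', 'kv5']
```

A dense attention kernel assumes the inner list comprehension is a single contiguous pointer walk; the extra table lookup per block is exactly what the custom kernels fold into their memory access pattern.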
How It All Connects: End-to-End Trace
Let’s trace a single request through every component:
1. User sends "Explain transformers" to the OpenAI-compatible API
2. API server tokenizes: `[849, 11187, 88146]` (3 tokens)
3. `LLMEngine.add_request()` creates a `SequenceGroup` and places it in the scheduler's waiting queue
4. `Scheduler._schedule()` sees the waiting request. Budget allows 3 prefill tokens. Block manager allocates 1 block (16-token capacity). Request moves to running queue.
5. `Worker.execute_model()` prepares inputs: `token_ids=[849, 11187, 88146]`, `positions=[0, 1, 2]`, attention metadata (no KV cache to read yet)
6. Model forward pass: FlashAttention for prefill (3 tokens, contiguous). Produces logits for position 2.
7. Sampling: Top-p selects token 578 ("Trans")
8. Engine updates: Token 578 appended to sequence. Block manager records KV cache now has 4 tokens (3 prompt + 1 generated).
9. Next iteration: Scheduler sees 1 running request. Decode: `token_ids=[578]`, `positions=[3]`, PagedAttention reads KV from block, produces logits for position 3.
10. Repeat until EOS token or max length.
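The block arithmetic in the trace is worth making explicit — the single block allocated at prefill keeps serving the request until the 17th token, at which point the block manager must allocate a second block:

```python
import math

BLOCK_SIZE = 16  # vLLM's default block size

def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Blocks the block manager must hold for a sequence of this length."""
    return math.ceil(num_tokens / block_size)

# The traced request: 3 prompt tokens, then one generated token per step.
prompt_len = 3
for generated in (0, 1, 13, 14):
    total = prompt_len + generated
    print(f"{total} tokens -> {blocks_needed(total)} block(s)")
# 3 tokens -> 1 block(s) ... 16 tokens -> 1 block(s), 17 tokens -> 2 block(s):
# the 14th generated token (17 total) triggers allocation of a second block.
```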
Each iteration takes ~10-50ms depending on model size and batch. The request generates tokens at 20-100 tokens/sec, streamed back to the client in real time.
You don’t need to understand every line of code to use vLLM effectively. But knowing the architecture helps you: (1) choose the right configuration parameters (max_num_batched_tokens, max_num_seqs, gpu_memory_utilization), (2) diagnose performance issues (is the bottleneck scheduling, memory, or compute?), and (3) understand when vLLM is the right tool vs when SGLang or TRT-LLM might be better.
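For reference, those knobs surface as server flags. The flag names below are the ones named above; the model name and values are illustrative only, to be tuned per workload and GPU:

```shell
# --max-num-batched-tokens: per-iteration token budget for the scheduler
# --max-num-seqs: cap on concurrently running sequences
# --gpu-memory-utilization: fraction of HBM given to weights + KV cache blocks
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 4
```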
The next two posts in this series go deeper: Part 2 covers the scheduler’s algorithms, and Part 3 covers the PagedAttention CUDA kernel — the two most technically interesting components of the system.