vLLM is one of the most widely deployed open-source LLM inference engines. It handles everything from single-GPU laptop deployments to multi-node clusters serving thousands of concurrent requests. Understanding how it works at the code level — not just the concepts, but the actual modules, classes, and data flow — is essential for anyone operating, extending, or debugging a vLLM deployment.
This post is a guided tour of the vLLM codebase. We trace the lifecycle of a single request from HTTP arrival to token output, identifying every major component it touches.
The Request Lifecycle at 10,000 Feet
When a request arrives at vLLM, it flows through these stages:

API Server → Engine → Scheduler (+ Block Manager) → Worker → Model Executor → Attention Backend → CUDA Kernels
Each stage corresponds to a major module in the codebase:
vLLM Source Code Modules
| Module | Path | Key Classes | Responsibility |
|---|---|---|---|
| API Server | vllm/entrypoints/ | OpenAIServingChat, APIServer | HTTP endpoint, request parsing, streaming |
| Engine | vllm/engine/ | LLMEngine, AsyncLLMEngine | Orchestrate scheduler + workers |
| Scheduler | vllm/core/scheduler.py | Scheduler | Batch composition, preemption, budget |
| Block Manager | vllm/core/block_manager.py | BlockSpaceManager | KV cache block allocation/deallocation |
| Worker | vllm/worker/ | Worker, GPUModelRunner | GPU execution, TP coordination |
| Model Executor | vllm/model_executor/ | ModelRunner, model implementations | Forward pass, CUDA graphs, input prep |
| Attention | vllm/attention/ | AttentionBackend, PagedAttention | Attention kernel selection and execution |
| CUDA Kernels | csrc/ | C++/CUDA implementations | Paged attention, cache ops, activations |
LLMEngine: The Central Coordinator
The LLMEngine is the entry point for all inference. It ties together the scheduler, workers, and tokenizer into a single step() loop:
```python
class LLMEngine:
    def step(self):
        # 1. Scheduler decides what to run
        scheduler_output = self.scheduler.schedule()

        # 2. If nothing to do, return empty
        if scheduler_output.is_empty():
            return []

        # 3. Send work to GPU workers
        model_output = self.model_executor.execute_model(scheduler_output)

        # 4. Process outputs (decode tokens, check stop conditions)
        request_outputs = self._process_model_outputs(model_output)

        # 5. Update scheduler state (mark finished, update token counts)
        self.scheduler.update(scheduler_output, model_output)
        return request_outputs
```
The step() method runs once per iteration (~20-50ms). The AsyncLLMEngine wraps this in an async loop that continuously calls step() and streams results back to clients.
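That wrapping loop can be sketched with a toy stand-in; `ToyEngine` and `run_engine_loop` are illustrative names, not vLLM's actual classes, and the real AsyncLLMEngine pushes each output onto a per-request stream rather than collecting a list.

```python
import asyncio

class ToyEngine:
    """Stand-in for LLMEngine: each step() 'finishes' one pending request."""
    def __init__(self, pending):
        self.pending = list(pending)

    def step(self):
        # Mirrors step() returning RequestOutput objects for sequences
        # that produced tokens this iteration.
        return [self.pending.pop(0)] if self.pending else []

async def run_engine_loop(engine):
    """Continuously call step() and collect results, as the async wrapper does."""
    results = []
    while engine.pending:
        outputs = engine.step()   # one scheduler + GPU iteration
        results.extend(outputs)   # vLLM streams these back to waiting clients
        await asyncio.sleep(0)    # yield control between iterations
    return results

print(asyncio.run(run_engine_loop(ToyEngine(["req-1", "req-2"]))))
# → ['req-1', 'req-2']
```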
vLLM has two modes: offline (LLM class for batch processing) and online (AsyncLLMEngine for serving). Both use the same scheduler and workers underneath. The difference is how requests enter (all at once vs streaming) and how results exit (returned vs streamed).
The Scheduler: Brain of the System
The scheduler (detailed in Part 2 of this series) maintains three queues — waiting, running, and swapped — and decides the batch composition each iteration. Its output is a SchedulerOutput containing:
- Which sequences to prefill (and how many tokens each)
- Which sequences to continue decoding (1 token each)
- Which sequences to preempt (and whether to swap or recompute)
- Block allocation/deallocation instructions for the block manager
The scheduler operates entirely on CPU. Its execution time (~0.5-2ms) is negligible compared to the GPU forward pass (~10-50ms). But its decisions determine throughput — a bad scheduling policy can waste 50%+ of GPU capacity.
Block Manager: Virtual Memory for KV Cache
The BlockSpaceManager implements the paged KV cache allocation described in the PagedAttention paper. It divides GPU HBM into fixed-size blocks (default: 16 tokens per block) and manages them like an OS manages physical memory pages:
```python
class BlockSpaceManager:
    def __init__(self, block_size, num_gpu_blocks, num_cpu_blocks):
        self.block_size = block_size
        self.gpu_allocator = BlockAllocator(num_gpu_blocks)  # free list over GPU blocks
        self.cpu_allocator = BlockAllocator(num_cpu_blocks)  # swap space
        self.block_tables = {}  # seq_id -> list of physical block IDs

    def allocate(self, seq_id, num_blocks):
        blocks = [self.gpu_allocator.allocate() for _ in range(num_blocks)]
        self.block_tables[seq_id] = blocks

    def free(self, seq_id):
        for block in self.block_tables[seq_id]:
            self.gpu_allocator.free(block)
        del self.block_tables[seq_id]

    def swap_out(self, seq_id):
        # Copy this sequence's GPU blocks to CPU blocks
        gpu_blocks = self.block_tables[seq_id]
        cpu_blocks = [self.cpu_allocator.allocate() for _ in gpu_blocks]
        # ... initiate GPU->CPU copy
        for block in gpu_blocks:
            self.gpu_allocator.free(block)
        self.block_tables[seq_id] = cpu_blocks
```
The block table maps logical block indices to physical block pointers. This indirection is what enables: (a) non-contiguous KV cache storage, (b) zero-fragmentation allocation, (c) copy-on-write for beam search, (d) prefix sharing across requests.
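The indirection itself is two integer operations per token. A minimal sketch (`locate_token` is an illustrative name, not a vLLM function) of the address translation the paged kernels perform:

```python
BLOCK_SIZE = 16  # tokens per block, vLLM's default

def locate_token(block_table, token_idx, block_size=BLOCK_SIZE):
    """Translate a logical token index into (physical_block, offset) —
    the per-token lookup a paged attention kernel performs."""
    logical_block = token_idx // block_size
    offset = token_idx % block_size
    return block_table[logical_block], offset

# A sequence whose KV cache lives in three scattered physical blocks:
block_table = [7, 2, 41]
print(locate_token(block_table, 0))   # → (7, 0)   first token
print(locate_token(block_table, 17))  # → (2, 1)   second block, slot 1
print(locate_token(block_table, 40))  # → (41, 8)  third block, slot 8
```

Because consecutive logical blocks can map to arbitrary physical blocks, sequences grow one block at a time with no contiguity requirement — the source of the zero-fragmentation property.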
Workers and Model Runner
Workers execute on each GPU. In tensor-parallel setups, multiple workers coordinate:
```python
class Worker:
    def execute_model(self, scheduler_output):
        # 1. Prepare model inputs (token IDs, positions, attention metadata)
        inputs = self.model_runner.prepare_input(scheduler_output)

        # 2. Run model forward pass (potentially via CUDA graph)
        output = self.model_runner.execute(inputs)

        # 3. Sample next tokens from logits
        sampled = self.model_runner.sample(output.logits)
        return sampled
```
The ModelRunner handles the critical details:
- Input preparation: Packing variable-length sequences into padded tensors, computing position IDs, building attention metadata (block tables, sequence lengths)
- CUDA graph management: For decode steps with fixed batch sizes, captured CUDA graphs eliminate kernel launch overhead
- Tensor parallelism: Distributing computation across GPUs, handling all-reduce communication
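A toy version of input preparation for a decode step makes the metadata concrete. This is a simplified sketch — `prepare_decode_inputs` and its argument shape are hypothetical, not vLLM's API — showing the flattened token IDs, position IDs, and the KV-cache "slot mapping" telling the cache kernel where each new token's key/value is written:

```python
def prepare_decode_inputs(sequences, block_size=16):
    """Toy decode-step input prep: one new token per sequence.
    `sequences` maps seq_id -> (next_token_id, tokens_so_far, block_table).
    A slot index is physical_block * block_size + offset."""
    token_ids, positions, slot_mapping = [], [], []
    for _, (tok, seq_len, block_table) in sequences.items():
        token_ids.append(tok)
        positions.append(seq_len)  # the next position index
        block = block_table[seq_len // block_size]
        slot_mapping.append(block * block_size + seq_len % block_size)
    return token_ids, positions, slot_mapping

# Two sequences: one 3 tokens long (1 block), one 17 tokens long (2 blocks)
seqs = {"a": (578, 3, [5]), "b": (42, 17, [9, 1])}
print(prepare_decode_inputs(seqs))
# → ([578, 42], [3, 17], [83, 17])
```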
In a typical iteration at batch=64 on Llama 70B with H100: GPU forward pass ~30ms (90%), input preparation ~2ms (6%), scheduling ~1ms (3%), sampling ~0.5ms (1.5%). The forward pass dominates. Optimizing anything else gives marginal returns.
Attention Backends
vLLM abstracts attention computation behind an AttentionBackend interface, allowing different kernels for different scenarios:
vLLM Attention Backend Selection
| Backend | Used For | Key Property | When Selected |
|---|---|---|---|
| FlashAttention-2 | Prefill | Contiguous Q,K,V — maximum throughput | Default for prefill on CUDA GPUs |
| PagedAttention v2 | Decode | Handles non-contiguous KV blocks via block tables | Default for decode |
| FlashInfer | Both | Alternative implementation with different tradeoffs | When explicitly selected |
| Torch SDPA | Fallback | PyTorch native, broadest compatibility | When no optimized backend available |
The critical split: prefill uses FlashAttention (contiguous KV, maximum throughput) while decode uses PagedAttention (non-contiguous blocks, necessary for paged memory). Part 3 of this series details the PagedAttention kernel implementation.
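The split can be expressed as a small decision function. This is a toy mirror of the logic (the real selection in vllm/attention/ also weighs GPU architecture, head size, and dtype; `select_backend` is an illustrative name):

```python
def select_backend(is_prefill, has_flash_attn=True, use_flashinfer=False):
    """Simplified sketch of per-phase attention backend selection."""
    if use_flashinfer:
        return "FlashInfer"        # explicitly requested; handles both phases
    if is_prefill:
        # Prefill sees contiguous Q/K/V, so the fastest dense kernel wins
        return "FlashAttention-2" if has_flash_attn else "Torch SDPA"
    # Decode must follow block tables into scattered KV blocks
    return "PagedAttention v2"

print(select_backend(is_prefill=True))   # → FlashAttention-2
print(select_backend(is_prefill=False))  # → PagedAttention v2
```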
The CUDA Kernels
The csrc/ directory contains the C++/CUDA code that makes vLLM fast:
- `csrc/attention/`: Paged attention kernels (v1 and v2). The most performance-critical CUDA code.
- `csrc/cache_kernels.cu`: KV cache operations — `reshape_and_cache` (write new KV to blocks), swap (GPU to CPU copy), copy (for COW beam search).
- `csrc/activation_kernels.cu`: Fused activation functions (SiLU, GELU).
- `csrc/layernorm_kernels.cu`: Fused RMS normalization.
- `csrc/quantization/`: Quantized GEMM kernels for INT4/INT8/FP8.
These custom CUDA kernels exist because PyTorch’s default operators don’t handle the paged memory layout. Standard torch.nn.functional.scaled_dot_product_attention expects contiguous tensors — vLLM’s KV cache lives in scattered blocks that require custom addressing.
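The addressing problem is easy to see in miniature. The following pure-Python sketch (a tiny block size of 4 for brevity; `write_kv`/`gather_kv` are illustrative stand-ins for the CUDA kernels) shows a `reshape_and_cache`-style scattered write and the block-table-guided gather a paged attention kernel must do:

```python
# Simulated KV cache: 4 physical blocks of 4 slots each
BLOCK_SIZE = 4
kv_cache = [[None] * BLOCK_SIZE for _ in range(4)]

def write_kv(block_table, token_idx, value):
    """reshape_and_cache in miniature: write one token's KV into its slot."""
    block = block_table[token_idx // BLOCK_SIZE]
    kv_cache[block][token_idx % BLOCK_SIZE] = value

def gather_kv(block_table, seq_len):
    """What a paged kernel does per query: follow the block table to
    collect a sequence's KV values from non-contiguous physical blocks."""
    return [kv_cache[block_table[i // BLOCK_SIZE]][i % BLOCK_SIZE]
            for i in range(seq_len)]

table = [2, 0]  # logical block 0 -> physical block 2, logical 1 -> physical 0
for i in range(6):
    write_kv(table, i, f"kv{i}")
print(gather_kv(table, 6))  # → ['kv0', 'kv1', 'kv2', 'kv3', 'kv4', 'kv5']
```

A dense attention kernel assumes the inner list comprehension is a single contiguous pointer walk; the extra table lookup per block is exactly what the custom kernels fold into their memory access pattern.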
How It All Connects: End-to-End Trace
Let’s trace a single request through every component:
1. User sends "Explain transformers" to the OpenAI-compatible API
2. API server tokenizes: `[849, 11187, 88146]` (3 tokens)
3. `LLMEngine.add_request()` creates a `SequenceGroup` and places it in the scheduler's waiting queue
4. `Scheduler._schedule()` sees the waiting request. Budget allows 3 prefill tokens. Block manager allocates 1 block (16-token capacity). Request moves to running queue.
5. `Worker.execute_model()` prepares inputs: `token_ids=[849, 11187, 88146]`, `positions=[0, 1, 2]`, attention metadata (no KV cache to read yet)
6. Model forward pass: FlashAttention for prefill (3 tokens, contiguous). Produces logits for position 2.
7. Sampling: Top-p selects token 578 ("Trans")
8. Engine updates: Token 578 appended to sequence. Block manager records KV cache now has 4 tokens (3 prompt + 1 generated).
9. Next iteration: Scheduler sees 1 running request. Decode: `token_ids=[578]`, `positions=[3]`, PagedAttention reads KV from block, produces logits for position 3.
10. Repeat until EOS token or max length.
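The block arithmetic in the trace is worth making explicit — the single block allocated at prefill keeps serving the request until the 17th token, at which point the block manager must allocate a second block:

```python
import math

BLOCK_SIZE = 16  # vLLM's default block size

def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Blocks the block manager must hold for a sequence of this length."""
    return math.ceil(num_tokens / block_size)

# The traced request: 3 prompt tokens, then one generated token per step.
prompt_len = 3
for generated in (0, 1, 13, 14):
    total = prompt_len + generated
    print(f"{total} tokens -> {blocks_needed(total)} block(s)")
# 3 tokens -> 1 block(s) ... 16 tokens -> 1 block(s), 17 tokens -> 2 block(s):
# the 14th generated token (17 total) triggers allocation of a second block.
```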
Each iteration takes ~10-50ms depending on model size and batch. The request generates tokens at 20-100 tokens/sec, streamed back to the client in real time.
You don’t need to understand every line of code to use vLLM effectively. But knowing the architecture helps you: (1) choose the right configuration parameters (max_num_batched_tokens, max_num_seqs, gpu_memory_utilization), (2) diagnose performance issues (is the bottleneck scheduling, memory, or compute?), and (3) understand when vLLM is the right tool vs when SGLang or TRT-LLM might be better.
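For reference, those knobs surface as server flags. The flag names below are the ones named above; the model name and values are illustrative only, to be tuned per workload and GPU:

```shell
# --max-num-batched-tokens: per-iteration token budget for the scheduler
# --max-num-seqs: cap on concurrently running sequences
# --gpu-memory-utilization: fraction of HBM given to weights + KV cache blocks
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 4
```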
The next two posts in this series go deeper: Part 2 covers the scheduler’s algorithms, and Part 3 covers the PagedAttention CUDA kernel — the two most technically interesting components of the system.