When you type a long prompt into ChatGPT and wait for the first token to appear, you are waiting for the prefill phase. This phase processes the entire prompt through every layer of the transformer before a single output token can be generated. For a 4096-token prompt on Llama 70B, prefill alone can take over a second on a single A100 GPU. Understanding why prefill behaves the way it does — and how to optimize it — is essential for building responsive inference systems.

This article examines prefill from first principles: its computational profile, why it is fundamentally different from decode, and the techniques that modern serving systems use to reduce time-to-first-token (TTFT) without sacrificing decode throughput.

Prefill vs Decode: Two Fundamentally Different Workloads

The transformer inference pipeline has two phases, and they have almost opposite computational characteristics. Conflating them leads to poor optimization decisions.

Prefill processes the entire prompt at once. If the prompt has L tokens, the model performs attention and feed-forward computations over all L tokens simultaneously in a single forward pass. The result is the KV cache for all prompt tokens and the logits for the next token.

Decode generates tokens one at a time. Each step processes exactly one new token, appends it to the KV cache, and produces logits for the token after it.

The critical distinction is arithmetic intensity — the ratio of compute (FLOPs) to memory traffic (bytes moved). This ratio determines whether a workload is compute-bound or memory-bandwidth-bound.

# Prefill: large matrix multiplications (compute-bound)
# Q, K, V shapes: [batch=1, seq_len=4096, d_model=8192]
# Attention scores: [4096, 4096] -- large GEMM
# FFN: [4096, 8192] x [8192, 28672] -- large GEMM

# Decode: thin matrix-vector products (memory-bound)
# Q, K, V shapes: [batch=1, seq_len=1, d_model=8192]
# Attention scores: [1, seq_len] -- dot product against full KV cache
# FFN: [1, 8192] x [8192, 28672] -- matrix-vector product

During prefill, the GEMMs are large enough to saturate the GPU’s tensor cores. An A100’s tensor cores deliver 312 TFLOPS (FP16), and a single prefill forward pass on a 70B model with L = 4096 generates enough FLOPs to keep them busy. During decode, the same weight matrices are loaded from HBM but multiplied against a single vector — the tensor cores are starved, and performance is gated by how fast you can stream weights from memory.

ℹ️ The Roofline Perspective

On an A100 (312 TFLOPS, 2 TB/s HBM bandwidth), the crossover arithmetic intensity is 312 × 10^12 / (2 × 10^12) = 156 FLOPs/byte. Prefill GEMMs with large L easily exceed this. Decode with batch size 1 sits at roughly 1-2 FLOPs/byte — deep in the memory-bound regime.
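These numbers are simple to check. A quick sketch using the A100 figures quoted above; the decode intensity assumes one multiply-add (2 FLOPs) per 2-byte FP16 weight, which is the batch-1 case:

```python
peak_flops = 312e12          # A100 FP16 tensor-core throughput, FLOP/s
hbm_bandwidth = 2e12         # A100 HBM bandwidth, bytes/s

# Crossover arithmetic intensity: below this, a kernel is memory-bound
crossover = peak_flops / hbm_bandwidth    # 156 FLOPs/byte

# Decode at batch size 1: each FP16 weight (2 bytes) participates in
# one multiply-add (2 FLOPs) per step
decode_intensity = 2 / 2                  # 1 FLOP/byte
```

Larger decode batches raise the intensity proportionally, which is why batching is the primary decode optimization.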

This difference has profound implications for optimization strategy:

| Property | Prefill | Decode |
| --- | --- | --- |
| Tokens processed per step | L (entire prompt) | 1 |
| Dominant operation | Large GEMMs | Matrix-vector products |
| Bottleneck | Compute (tensor cores) | Memory bandwidth (HBM) |
| Arithmetic intensity | High (≫100 FLOPs/byte) | Low (~1-2 FLOPs/byte) |
| Scaling with model size | Near linear in FLOPs | Near linear in parameter bytes |
| Primary optimization target | Reduce FLOPs, parallelize compute | Increase batch size, reduce memory traffic |

Prefill Computational Profile

Let us quantify the FLOPs involved in prefill for a single layer of a transformer. Consider a model with hidden dimension D, number of attention heads H, head dimension d = D/H, and FFN intermediate dimension D_ff (typically 4D, or 8D/3 for SwiGLU).

Attention FLOPs

For a prompt of length L:

  1. QKV projection: three linear projections from D to D. FLOPs: 3 × 2LD^2 = 6LD^2
  2. Attention scores (QK^T): for each head, multiply [L, d] by [d, L]. FLOPs per head: 2L^2 d. Total across H heads: 2L^2 d H = 2L^2 D
  3. Attention output (softmax(QK^T)V): same shape. FLOPs: 2L^2 D
  4. Output projection: linear from D to D. FLOPs: 2LD^2

Total attention FLOPs per layer: 8LD^2 + 4L^2 D

FFN FLOPs

For a standard two-layer FFN with intermediate dimension D_ff:

  1. Up projection: [L, D] × [D, D_ff]. FLOPs: 2 L D D_ff
  2. Down projection: [L, D_ff] × [D_ff, D]. FLOPs: 2 L D D_ff

Total FFN FLOPs per layer: 4 L D D_ff

For SwiGLU (used in Llama), there is an additional gate projection, making it 6 L D D_ff; with D_ff = 8D/3, this simplifies to 16LD^2.
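The formulas above can be collected into a small calculator (a sketch: it counts a multiply-add as 2 FLOPs and, like the derivation, ignores GQA and causal masking, so absolute numbers will differ from measured profiles):

```python
def attention_flops(L, D):
    """Per-layer attention FLOPs: 8*L*D^2 (projections) + 4*L^2*D (scores + values)."""
    return 8 * L * D**2 + 4 * L**2 * D

def swiglu_ffn_flops(L, D):
    """Per-layer SwiGLU FFN FLOPs: 6*L*D*D_ff with D_ff = 8D/3, i.e. 16*L*D^2."""
    return 16 * L * D**2

# Llama-70B-like dimensions
L, D = 4096, 8192
attn, ffn = attention_flops(L, D), swiglu_ffn_flops(L, D)
```

The linear terms grow with L while the score term grows with L^2, which is the crossover analyzed next.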

When Does Attention Dominate?

The attention-specific term is 4L^2 D (the QK^T and softmax-V multiplications). The rest scales as O(LD^2). Attention dominates when:

4L^2 D ≫ c · LD^2

which simplifies to L ≫ c · D/4. For Llama 70B (D = 8192), with the constant factor from FFN included, attention starts dominating around L ≈ 4000-8000 tokens.

📊 FLOPs Breakdown per Layer: Llama 70B (D=8192, D_ff=28672)

| Prompt Length (L) | Attention QKV+O (TFLOPs) | Attention Scores (TFLOPs) | FFN (TFLOPs) | Attention % of Total |
| --- | --- | --- | --- | --- |
| 256 | 0.028 | 0.001 | 0.060 | 32% |
| 1024 | 0.110 | 0.017 | 0.240 | 35% |
| 4096 | 0.440 | 0.275 | 0.960 | 43% |
| 8192 | 0.879 | 1.100 | 1.920 | 51% |
| 16384 | 1.759 | 4.400 | 3.840 | 62% |
| 32768 | 3.518 | 17.600 | 7.680 | 73% |

(Chart: Attention vs FFN FLOPs scaling per layer, Llama 70B; same data as the table above.)

At L = 32768 (a typical “long context” scenario), attention consumes nearly 3x the FLOPs of the FFN. This is why long-context prefill optimization focuses heavily on attention.

Arithmetic Intensity of Prefill GEMMs

Consider the FFN up-projection: [L, D] × [D, D_ff]. FLOPs = 2 L D D_ff. Bytes loaded = 2(L·D + D·D_ff + L·D_ff) in FP16 (2 bytes per element). For L = 4096, D = 8192, D_ff = 28672:

  • FLOPs: 2 × 4096 × 8192 × 28672 ≈ 1.92 × 10^12
  • Bytes: 2 × (4096×8192 + 8192×28672 + 4096×28672) ≈ 7.7 × 10^8
  • Arithmetic intensity: ~2500 FLOPs/byte

This vastly exceeds the A100’s crossover point of 156 FLOPs/byte. Prefill is deep in the compute-bound regime.
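Recomputing the up-projection’s arithmetic intensity directly (FP16, 2 bytes per element):

```python
L, D, Dff = 4096, 8192, 28672

flops = 2 * L * D * Dff                          # GEMM multiply-adds
bytes_moved = 2 * (L * D + D * Dff + L * Dff)    # inputs + weights + outputs
intensity = flops / bytes_moved                  # ~2500 FLOPs/byte
```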

⚠️ Arithmetic Intensity Caveat

The attention score computation QK^T has lower arithmetic intensity than the FFN projections because the output matrix [L, L] is large relative to the inputs. For very long sequences with small D, this can push attention toward memory-bound territory — which is exactly the regime where FlashAttention provides the most benefit.

Prefill Latency: The TTFT Problem

Time-to-first-token (TTFT) is the wall-clock time from when a request arrives to when the first output token is produced. It is dominated by prefill time. For interactive applications — chatbots, code assistants, search — TTFT directly affects perceived responsiveness. Users notice delays above 200-500ms.

For Llama 70B on a single A100 (80GB), approximate prefill times:

📊 Prefill Latency: Llama 70B on 1x A100-80GB (FP16)

| Prompt Length | Prefill Time (ms) | TTFT (ms) | Prefill TFLOPs |
| --- | --- | --- | --- |
| 128 | 45 | 52 | 1.7 |
| 512 | 120 | 130 | 6.8 |
| 1024 | 210 | 225 | 13.7 |
| 2048 | 390 | 410 | 27.7 |
| 4096 | 780 | 810 | 57.5 |
| 8192 | 1800 | 1850 | 125 |
| 32768 | 12500 | 12600 | 920 |

A 32K-token prompt takes over 12 seconds just for prefill. During that time, the GPU is exclusively occupied — no decode steps for any other request can proceed. This is the head-of-line blocking problem.

Chunked Prefill: Breaking the Head-of-Line Block

Long prefills monopolize the GPU. If you are serving hundreds of concurrent users and one sends a 32K-token prompt, every other user’s decode latency (time-between-tokens, TBT) spikes because the GPU cannot interleave decode steps during a monolithic prefill.

Chunked prefill solves this by splitting a long prompt into smaller chunks and processing one chunk per scheduling iteration, interleaved with decode steps from other requests.

How Chunked Prefill Works

Instead of processing all L tokens in one forward pass, break them into chunks of size C:

class ChunkedPrefillScheduler:
    def __init__(self, chunk_size=512, max_batch_tokens=4096):
        self.chunk_size = chunk_size
        self.max_batch_tokens = max_batch_tokens

    def schedule_iteration(self, prefill_queue, decode_queue):
        """Schedule one iteration: mix prefill chunks with decode tokens."""
        batch = []
        token_budget = self.max_batch_tokens

        # First, include all pending decode requests (1 token each)
        for req in decode_queue:
            if token_budget >= 1:
                batch.append(ScheduleUnit(req, num_tokens=1, is_prefill=False))
                token_budget -= 1

        # Fill remaining budget with prefill chunks
        for req in prefill_queue:
            remaining = req.prompt_length - req.tokens_prefilled
            chunk = min(remaining, self.chunk_size, token_budget)
            if chunk > 0:
                batch.append(ScheduleUnit(req, num_tokens=chunk, is_prefill=True))
                token_budget -= chunk
                req.tokens_prefilled += chunk

        return batch

Each iteration processes at most max_batch_tokens tokens total. Decode requests contribute 1 token each and are always prioritized. The remaining token budget is allocated to prefill chunks.

Choosing Chunk Size

The chunk size C controls the trade-off between TTFT and TBT:

  • Small chunks (128-256 tokens): Minimal impact on decode latency, but prefill takes many iterations to complete, increasing TTFT.
  • Large chunks (2048-4096 tokens): Prefill completes faster (lower TTFT), but each iteration takes longer, increasing TBT for decode requests.

The Sarathi-Serve paper provides a principled analysis. They observe that the optimal chunk size depends on the model, hardware, and the decode batch size. Their key insight: the prefill chunk should be sized so that the iteration time remains within the “decode latency SLO” — typically 50-100ms per token.
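One way to see the TTFT side of the trade-off is to count scheduler iterations (a sketch; real iteration times also depend on the interleaved decode batch, which this ignores):

```python
import math

def prefill_iterations(prompt_len, chunk_size):
    """Scheduler iterations needed before the prompt is fully prefilled."""
    return math.ceil(prompt_len / chunk_size)

# A 32K-token prompt under different chunk sizes: smaller chunks mean more
# iterations before the first token, but each iteration stays short enough
# to keep decode TBT low for other requests.
iters = {C: prefill_iterations(32768, C) for C in (256, 512, 1024, 2048, 4096)}
```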

📊 Chunk Size Trade-offs: Llama 70B on A100 (200 concurrent users)

| Chunk Size | Median TTFT (ms) | P99 TBT (ms) | Throughput (tok/s) | GPU Util % |
| --- | --- | --- | --- | --- |
| No chunking | 1200 | 1180 | 3200 | 72% |
| 256 | 1800 | 28 | 3050 | 88% |
| 512 | 1400 | 35 | 3180 | 91% |
| 1024 | 1250 | 52 | 3250 | 93% |
| 2048 | 1180 | 85 | 3280 | 94% |
| 4096 | 1150 | 142 | 3300 | 95% |

(Chart: TTFT vs TBT trade-off by chunk size; same data as the table above.)

A chunk size of 512-1024 often provides the best balance. The TTFT penalty over no chunking is modest (10-20%), while TBT improves dramatically (from 1180ms down to 35-52ms).

ℹ️ Chunked Prefill in Production

vLLM enables chunked prefill via --enable-chunked-prefill with a configurable --max-num-batched-tokens. Sarathi-Serve extends this with a decode-priority scheduler that dynamically adjusts chunk sizes based on load. TensorRT-LLM supports “inflight batching” which achieves a similar effect.

The Attention Computation Across Chunks

A subtlety: when processing chunk i of a prefill, the attention computation must attend to all previously-prefilled tokens (chunks 0 through i-1), not just the current chunk. This means the KV cache from earlier chunks must be available, and the attention FLOPs for chunk i are O(C × iC), not just O(C^2).

def chunked_prefill_attention(query_chunk, kv_cache_so_far, new_keys, new_values):
    """
    Attention for a prefill chunk that must attend to all prior chunks.

    query_chunk: [chunk_size, d_model] -- current chunk's queries
    kv_cache_so_far: accumulated K,V from previous chunks
    new_keys, new_values: K,V for current chunk
    """
    # Concatenate prior KV cache with current chunk
    all_keys = torch.cat([kv_cache_so_far.keys, new_keys], dim=0)
    all_values = torch.cat([kv_cache_so_far.values, new_values], dim=0)

    # Attention: [chunk_size, total_prefilled + chunk_size]
    # This is a rectangular attention, not square
    scores = torch.matmul(query_chunk, all_keys.transpose(-2, -1))
    scores = scores / math.sqrt(d_head)

    # Causal mask: each query can attend to keys at positions <= its own
    scores = apply_causal_mask(scores, offset=kv_cache_so_far.length)

    attn_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attn_weights, all_values)

    # Update KV cache
    kv_cache_so_far.append(new_keys, new_values)

    return output

The total FLOPs across all chunks sum to the same as a monolithic prefill — chunking does not add extra computation. But it does add scheduling overhead and may reduce GPU utilization per iteration if the chunks are too small to saturate the tensor cores.
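The no-extra-FLOPs claim is easy to verify by counting causal query-key pairs, since each pair costs the same multiply-accumulate work under either schedule (a self-contained sketch):

```python
def causal_pairs_monolithic(L):
    """A query at position p attends to p + 1 keys (itself plus all earlier)."""
    return sum(p + 1 for p in range(L))

def causal_pairs_chunked(L, C):
    """Same count, accumulated chunk by chunk as in chunked prefill."""
    total = 0
    for start in range(0, L, C):
        chunk = min(C, L - start)
        # Each query attends to all previously prefilled tokens (start)
        # plus the causal portion of its own chunk
        total += sum(start + q + 1 for q in range(chunk))
    return total
```

The two counts agree for any chunk size, including when C does not divide L.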

FlashAttention in Prefill

The attention mechanism computes softmax(QK^T / √d) V. The naive implementation materializes the full [L, L] attention matrix in HBM, consuming O(L^2) memory and generating O(L^2) bytes of HBM traffic.

FlashAttention eliminates this by tiling the computation. Instead of materializing the full attention matrix, it processes blocks of QQ, KK, VV in SRAM (shared memory), computing partial softmax results and combining them using the online softmax trick.

Why Prefill Benefits More Than Decode

FlashAttention’s benefit is proportional to how much HBM traffic it eliminates. During prefill with prompt length L:

  • Standard attention HBM traffic: O(L^2) for the attention matrix (read and write)
  • FlashAttention HBM traffic: O(L^2 d^2 / M), where M is the SRAM size; since M is much larger than d^2 in practice, this cuts the quadratic term by one to two orders of magnitude

For L = 4096, this is a massive reduction. During decode, the attention “matrix” is [1, L] — already small. FlashAttention still helps decode (it avoids materializing intermediate results), but the relative improvement is much smaller.

# Simplified FlashAttention algorithm for prefill
import math
import torch

def apply_causal_mask(scores, q_offset, k_offset):
    """Mask out keys at positions greater than each query's global position."""
    n_q, n_k = scores.shape
    q_pos = torch.arange(q_offset, q_offset + n_q).unsqueeze(1)
    k_pos = torch.arange(k_offset, k_offset + n_k).unsqueeze(0)
    return scores.masked_fill(k_pos > q_pos, float('-inf'))

def flash_attention_prefill(Q, K, V, block_size_q=128, block_size_kv=256):
    """
    FlashAttention: tiled computation avoiding O(L^2) materialization.
    Q, K, V: [L, d] -- full prompt tensors
    """
    L, d = Q.shape
    output = torch.zeros_like(Q)

    for i in range(0, L, block_size_q):
        q_block = Q[i:i+block_size_q]           # Load Q block to SRAM
        n_q = q_block.shape[0]                  # Handle a partial final block
        o_block = torch.zeros_like(q_block)     # Accumulator in SRAM
        m_block = torch.full((n_q,), float('-inf'))  # Running row maxima
        l_block = torch.zeros(n_q)              # Running row sums

        for j in range(0, L, block_size_kv):
            if j > i + n_q - 1:                 # Block entirely masked: skip
                break
            k_block = K[j:j+block_size_kv]      # Load K block to SRAM
            v_block = V[j:j+block_size_kv]      # Load V block to SRAM

            # Compute attention scores in SRAM -- never written to HBM
            s_block = torch.matmul(q_block, k_block.T) / math.sqrt(d)

            # Apply causal mask
            s_block = apply_causal_mask(s_block, q_offset=i, k_offset=j)

            # Online softmax update
            m_new = torch.maximum(m_block, s_block.max(dim=-1).values)
            exp_s = torch.exp(s_block - m_new.unsqueeze(-1))
            l_new = l_block * torch.exp(m_block - m_new) + exp_s.sum(dim=-1)

            # Rescale previous output and add new contribution
            o_block = o_block * (l_block / l_new).unsqueeze(-1) * \
                      torch.exp(m_block - m_new).unsqueeze(-1)
            o_block = o_block + torch.matmul(exp_s, v_block) / l_new.unsqueeze(-1)

            m_block = m_new
            l_block = l_new

        output[i:i+block_size_q] = o_block  # Write final result to HBM

    return output

The key insight: the [block_size_q, block_size_kv] attention score matrix lives entirely in SRAM and is never written to HBM. This eliminates the dominant memory bottleneck for long-sequence prefill.

FlashAttention Speedup for Prefill

📊 FlashAttention vs Standard Attention: Prefill on A100 (Llama 70B, FP16)

| Prompt Length | Standard (ms) | FlashAttention (ms) | Speedup | Peak Memory Saved |
| --- | --- | --- | --- | --- |
| 512 | 42 | 28 | 1.5x | 0.5 GB |
| 1024 | 125 | 52 | 2.4x | 2.0 GB |
| 2048 | 410 | 138 | 3.0x | 8.0 GB |
| 4096 | 1520 | 420 | 3.6x | 32 GB |
| 8192 | 5800 | 1250 | 4.6x | 128 GB |
| 16384 | OOM | 4200 | N/A | N/A |

(Chart: FlashAttention speedup vs prompt length; same data as the table above.)

At 8192 tokens, FlashAttention delivers a 4.6x speedup. At 16384 tokens, standard attention causes an out-of-memory error on an 80GB A100 (each 16384 × 16384 FP16 attention matrix is 512 MB per head, so the materialized matrices across all heads run to tens of gigabytes on top of the model weights), while FlashAttention completes comfortably.

ℹ️ FlashAttention Versions

FlashAttention-2 improved on the original by better work partitioning across warps and reducing non-matmul FLOPs. FlashAttention-3 (Hopper architecture) exploits hardware features like TMA (tensor memory accelerator) and warp specialization for an additional 1.5-1.8x speedup on H100 GPUs.

FlashAttention and Chunked Prefill Interaction

FlashAttention and chunked prefill are complementary. Chunked prefill reduces the effective sequence length per iteration (from L to C), which reduces the attention FLOPs per iteration. FlashAttention reduces the HBM traffic for whatever attention computation remains. When combined:

  • Each prefill chunk of size C processes attention using FlashAttention kernels
  • The KV cache from previous chunks is accessed efficiently through FlashAttention’s block-wise iteration
  • Memory overhead per iteration is O(C) instead of O(L^2)

Tensor Parallelism for Prefill Latency

When a single GPU cannot deliver acceptable TTFT, spread the model across N GPUs using tensor parallelism (TP). Each GPU processes 1/N of each layer’s computation, then synchronizes via all-reduce.

How Tensor Parallelism Splits Prefill

For each transformer layer, tensor parallelism splits the weight matrices column-wise (for the first linear layer) and row-wise (for the second):

# Tensor-parallel attention on GPU rank `rank` out of `world_size` GPUs
class TPAttention:
    def __init__(self, d_model, n_heads, rank, world_size):
        # Each GPU handles n_heads / world_size attention heads
        self.local_heads = n_heads // world_size
        self.local_dim = self.local_heads * (d_model // n_heads)

        # Each GPU holds a shard of QKV and output projections
        self.wq = nn.Linear(d_model, self.local_dim, bias=False)  # Column-parallel
        self.wk = nn.Linear(d_model, self.local_dim, bias=False)
        self.wv = nn.Linear(d_model, self.local_dim, bias=False)
        self.wo = nn.Linear(self.local_dim, d_model, bias=False)  # Row-parallel

    def forward(self, x):
        # x: [L, d_model] -- full hidden states (replicated across GPUs)
        q = self.wq(x)  # [L, local_dim] -- only this GPU's heads
        k = self.wk(x)
        v = self.wv(x)

        # Local attention -- only on this GPU's heads
        attn_output = flash_attention(q, k, v)  # [L, local_dim]

        # Row-parallel output projection
        output = self.wo(attn_output)  # [L, d_model] -- partial result

        # All-reduce to sum partial results across GPUs
        dist.all_reduce(output, op=dist.ReduceOp.SUM)

        return output

Communication Cost

Each transformer layer requires 2 all-reduce operations in the standard Megatron-LM scheme: one after attention, one after FFN. For N GPUs connected via NVLink:

  • All-reduce of an [L, D] tensor in FP16 moves 2 × (N-1)/N × L × D × 2 bytes per GPU
  • For L = 4096, D = 8192, N = 8: each all-reduce moves approximately 2 × 7/8 × 4096 × 8192 × 2 ≈ 117 MB
  • On NVLink (600 GB/s bidirectional on A100): approximately 0.2 ms per all-reduce
  • For 80 layers: 80 × 2 × 0.2 = 32 ms total communication overhead

The communication overhead is small relative to the compute savings: splitting across 8 GPUs reduces compute per GPU by 8x, while adding only 32ms of communication.
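The estimate can be reproduced with a small ring all-reduce traffic model (the 600 GB/s NVLink figure and FP16 element size are the assumptions from the text):

```python
def allreduce_ms(L, D, n_gpus, bw=600e9, dtype_bytes=2):
    """Per-GPU ring all-reduce time for an [L, D] activation tensor, in ms."""
    traffic = 2 * (n_gpus - 1) / n_gpus * L * D * dtype_bytes   # bytes moved
    return traffic / bw * 1e3

per_op = allreduce_ms(4096, 8192, 8)   # ~0.2 ms (~117 MB of traffic)
total_ms = 80 * 2 * per_op             # 80 layers, 2 all-reduces per layer
```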

📊 Tensor Parallelism Scaling: Llama 70B Prefill (L=4096, FP16)

| TP Degree | Prefill Time (ms) | Speedup | Communication Overhead (ms) | GPU Utilization |
| --- | --- | --- | --- | --- |
| 1 GPU | 780 | 1.0x | 0 | 89% |
| 2 GPUs | 410 | 1.9x | 12 | 85% |
| 4 GPUs | 215 | 3.6x | 22 | 80% |
| 8 GPUs | 120 | 6.5x | 35 | 73% |

(Chart: Prefill latency scaling with tensor parallelism, measured vs ideal linear scaling; same data as the table above.)

Scaling efficiency drops from 95% at TP=2 to 81% at TP=8 due to communication overhead and reduced per-GPU utilization. For TTFT-sensitive applications, TP=4 often provides the best cost-efficiency point.

⚠️ TP vs PP for Prefill

Pipeline parallelism (PP) is poorly suited for prefill latency reduction. PP adds pipeline bubble overhead proportional to the number of stages, and prefill is a single forward pass (no microbatching benefit). TP reduces latency of each layer; PP only helps throughput. For TTFT optimization, always prefer TP.

Prefix Caching: Reusing Shared Computation

Many LLM applications share a common system prompt across requests. A customer support chatbot might prepend 2000 tokens of instructions and few-shot examples to every user query. Without prefix caching, every request redundantly computes the KV cache for this shared prefix.

The Opportunity

If 100 requests per second share a 2000-token system prompt, you are redundantly computing 100 × prefill(2000) per second. With prefix caching, you compute it once and reuse the KV cache.

Savings per request:

  • Avoided FLOPs: For Llama 70B, prefilling 2000 tokens costs approximately 50 TFLOPs. At 100 req/s, that is 5 PFLOPs/s saved.
  • Avoided TTFT: The shared prefix portion of TTFT drops from ~200ms to ~0ms (just a memory copy or reference).

RadixAttention (SGLang)

SGLang implements prefix caching using a radix tree (trie) data structure to store KV cache entries:

class RadixTree:
    """
    Trie-based KV cache sharing. Each node corresponds to a token,
    and stores a pointer to the KV cache block for that position.
    """
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, token_ids):
        """
        Find the longest prefix of token_ids that exists in the tree.
        Returns (matched_length, kv_cache_blocks).
        """
        node = self.root
        matched = 0
        kv_blocks = []

        for token_id in token_ids:
            if token_id in node.children:
                node = node.children[token_id]
                kv_blocks.append(node.kv_block)
                matched += 1
            else:
                break

        return matched, kv_blocks

    def insert(self, token_ids, kv_blocks):
        """Insert a new sequence's KV cache into the tree."""
        node = self.root
        for i, token_id in enumerate(token_ids):
            if token_id not in node.children:
                node.children[token_id] = RadixNode(kv_block=kv_blocks[i])
            node = node.children[token_id]

When a new request arrives, the scheduler queries the radix tree: “What is the longest prefix of this request’s tokens that already has a cached KV?” Only the uncached suffix needs prefill.

The radix tree structure naturally handles multiple different prefixes and partial overlaps. If requests A and B share the first 1000 tokens but diverge after that, the tree stores the shared 1000-token KV cache once and branches.
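A runnable miniature of this behavior, assuming a minimal RadixNode (a children dict plus a kv_block tag standing in for real cache block pointers):

```python
class RadixNode:
    def __init__(self, kv_block=None):
        self.children = {}
        self.kv_block = kv_block

class RadixTree:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, token_ids):
        """Longest cached prefix of token_ids and its KV blocks."""
        node, matched, kv_blocks = self.root, 0, []
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            kv_blocks.append(node.kv_block)
            matched += 1
        return matched, kv_blocks

    def insert(self, token_ids, kv_blocks):
        """Record a sequence's KV blocks, reusing existing nodes."""
        node = self.root
        for t, blk in zip(token_ids, kv_blocks):
            node = node.children.setdefault(t, RadixNode(kv_block=blk))

tree = RadixTree()
tree.insert([5, 6, 7, 8], ["blk0", "blk1", "blk2", "blk3"])   # request A
matched, blocks = tree.match_prefix([5, 6, 7, 9])             # request B
# matched == 3: only token 9 needs prefill; KV for [5, 6, 7] is reused
```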

vLLM Automatic Prefix Caching (APC)

vLLM takes a different approach: hash-based prefix caching. It divides the KV cache into fixed-size blocks (typically 16 tokens per block) and computes a hash of the token content for each block:

class HashBasedPrefixCache:
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.cache = {}  # hash -> physical_block_id

    def compute_block_hash(self, token_ids, block_idx):
        """
        Hash is computed over all tokens up to and including this block,
        ensuring that identical prefixes produce identical hashes.
        """
        prefix_end = (block_idx + 1) * self.block_size
        prefix_tokens = tuple(token_ids[:prefix_end])
        return hash(prefix_tokens)

    def allocate_with_prefix_caching(self, token_ids):
        """Allocate blocks, reusing cached blocks where possible."""
        num_blocks = (len(token_ids) + self.block_size - 1) // self.block_size
        num_full_blocks = len(token_ids) // self.block_size
        physical_blocks = []
        first_uncached_block = num_full_blocks  # Assume all full blocks cached

        # Only full blocks are eligible for reuse: a partial final block
        # would hash differently once more tokens arrive
        for i in range(num_full_blocks):
            block_hash = self.compute_block_hash(token_ids, i)
            if block_hash in self.cache:
                physical_blocks.append(self.cache[block_hash])
            else:
                first_uncached_block = i
                break

        # Allocate new blocks for the uncached suffix
        for i in range(first_uncached_block, num_blocks):
            new_block = self.block_manager.allocate()
            if i < num_full_blocks:
                # Register only full blocks in the hash table
                self.cache[self.compute_block_hash(token_ids, i)] = new_block
            physical_blocks.append(new_block)

        cached_tokens = first_uncached_block * self.block_size
        return physical_blocks, cached_tokens

The hash-based approach has a key advantage: it does not require explicit management of a tree structure and handles arbitrary sharing patterns automatically. If two requests happen to share any prefix (even without being explicitly marked as using the same system prompt), the cache detects and exploits this.
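The prefix-chaining property is easy to demonstrate in isolation. Because each block's hash covers every token up to and including that block, two prompts share cached blocks exactly up to their first divergence (a sketch with a block size of 4 for readability; vLLM's default is 16):

```python
def block_hashes(token_ids, block_size=4):
    """One hash per full block; each hash covers the entire prefix so far."""
    n_full = len(token_ids) // block_size
    return [hash(tuple(token_ids[:(i + 1) * block_size])) for i in range(n_full)]

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
b = block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 0,  0,  0,  0])
shared = sum(1 for x, y in zip(a, b) if x == y)
# shared == 2: the first two blocks hit the cache; the third must be prefilled
```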

ℹ️ Enabling Prefix Caching

In vLLM: --enable-prefix-caching. In SGLang, RadixAttention is enabled by default. Both systems use LRU eviction when the cache is full. The cache hit rate depends on the workload — high for chatbot scenarios (shared system prompts), low for diverse single-turn requests.

Prefix Caching Impact

📊 Prefix Caching: Llama 70B with 2048-token System Prompt

| Scenario | TTFT (ms) | Prefill Tokens | Cache Hit Rate | Throughput Gain |
| --- | --- | --- | --- | --- |
| No caching | 410 | 2048 + query | 0% | 1.0x |
| APC (cold start) | 415 | 2048 + query | 0% | 1.0x |
| APC (warm, same prompt) | 85 | query only | 100% | 2.8x |
| APC (warm, similar prompt) | 180 | partial + query | 62% | 1.9x |
| RadixAttention (warm) | 80 | query only | 100% | 2.9x |

With a warm cache and a shared 2048-token system prompt, TTFT drops from 410ms to 80-85ms — a 5x reduction. The throughput improvement comes from freeing GPU compute that would otherwise be spent on redundant prefill.

Prompt Optimization Techniques

Beyond system-level optimizations, reducing the prompt itself directly improves prefill performance. Every token in the prompt contributes to prefill FLOPs, and for long prompts the relationship is superlinear due to the O(L^2 D) attention term.

Structured Prompting

Replace verbose natural-language instructions with structured formats:

# Before: 847 tokens
You are a helpful customer support agent for Acme Corp. When a customer
asks about their order status, you should look up their order number
in the system and provide them with the current status. If the order
is delayed, apologize and offer a 10% discount code. Always be polite
and professional. Here are some examples of good responses...
[... 20 few-shot examples ...]

# After: 312 tokens
Role: Acme Corp support agent
Rules:
- Order status: lookup order_number, return status
- Delayed orders: apologize, offer 10% discount code DELAY10
- Tone: polite, professional
[... 3 few-shot examples ...]

This is a 63% reduction in prompt tokens, translating to roughly 63% less FFN compute and approximately 86% less attention compute (0.37^2 ≈ 0.14).
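The savings arithmetic, spelled out:

```python
before, after = 847, 312
ratio = after / before              # ~0.37 of the original tokens remain

linear_savings = 1 - ratio          # FFN and projection FLOPs scale with L
quadratic_savings = 1 - ratio**2    # attention-score FLOPs scale with L^2
# ~63% less linear-term compute, ~86% less quadratic-term compute
```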

Few-Shot to Zero-Shot Conversion

Few-shot examples are expensive. Each example adds hundreds of tokens to the prompt, and those tokens feed the quadratic O(L^2) attention term. If the model performs adequately with zero-shot or one-shot prompting (often the case for well-tuned instruction-following models), the TTFT savings are significant.

📊 Prompt Length Impact on TTFT: Llama 70B on A100

| Prompting Strategy | Prompt Tokens | TTFT (ms) | Relative FLOPs |
| --- | --- | --- | --- |
| 20-shot with verbose instructions | 3800 | 720 | 1.0x |
| 5-shot with verbose instructions | 1400 | 280 | 0.26x |
| 5-shot with structured instructions | 800 | 165 | 0.13x |
| 1-shot with structured instructions | 350 | 78 | 0.04x |
| Zero-shot with structured instructions | 180 | 48 | 0.01x |

Going from 20-shot (3800 tokens) to zero-shot with structured prompts (180 tokens) reduces TTFT by 15x. Even if zero-shot quality is slightly lower, the latency improvement often justifies it for real-time applications.

Prompt Compression

Research systems like LLMLingua and AutoCompressor can compress prompts by 2-10x while preserving most of the information content. The approach typically involves:

  1. Using a small model to score token importance
  2. Removing low-importance tokens
  3. Optionally learning “summary tokens” that encode compressed information

For production systems, a simpler approach works well: precompute a compressed version of static prompt components and cache both the compressed text and its KV cache.

Context Window Management

For retrieval-augmented generation (RAG), the number and length of retrieved passages directly affects prefill cost. Strategies:

  • Truncate passages: use only the first N tokens of each passage (information density is usually front-loaded)
  • Reduce top-k: Retrieve fewer passages (3 instead of 10)
  • Hierarchical retrieval: First retrieve short summaries, then fetch full passages only if needed

Each of these reduces L, with quadratic benefits on the attention computation.

When Prefill Is NOT the Bottleneck

Not every workload is prefill-bound. Understanding when prefill is irrelevant saves you from premature optimization.

Short Prompts (fewer than 256 tokens)

With a 128-token prompt on Llama 70B, prefill takes approximately 45ms. This is below the perceptual threshold for most users. The decode phase (generating 500 tokens at 50ms each) dominates the total request time by a factor of 500x.

For short-prompt workloads, optimizing decode throughput (larger decode batches, speculative decoding, KV cache quantization) has far more impact than optimizing prefill.

High Decode-to-Prefill Ratio

If the average output length is much longer than the average input length, the system spends most of its time decoding. Consider a code generation task:

  • Input: 200 tokens (function signature + docstring)
  • Output: 2000 tokens (implementation)
  • Prefill time: ~55ms
  • Decode time: ~40,000ms (at 50 tokens/s)
  • Prefill fraction: 0.14%

Optimizing prefill here is essentially meaningless. Focus on decode throughput.
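Checking the fraction above:

```python
prefill_ms = 55
decode_ms = 2000 / 50 * 1000    # 2000 tokens at 50 tokens/s = 40,000 ms
prefill_fraction = prefill_ms / (prefill_ms + decode_ms)
# ~0.14%: decode dominates the request almost entirely
```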

Streaming Applications

In streaming applications where the user starts reading the output immediately, TTFT matters only for the perceived start of the response. If the output streams at a comfortable reading speed (say, 3-5 tokens per second rendered), even a 500ms TTFT is acceptable because the user cannot read faster than the generation speed anyway.

Batch Inference Workloads

For offline batch processing (evaluating a dataset, generating synthetic data), individual request latency is irrelevant. Only aggregate throughput matters. In this regime:

  • Large monolithic prefills are fine (no real-time users waiting)
  • Chunked prefill adds overhead without benefit
  • The optimization target is maximizing GPU utilization across the batch

⚠️ Profile Before Optimizing

Always profile your actual workload before investing in prefill optimization. Use vLLM’s --collect-detailed-traces or SGLang’s built-in metrics to determine what fraction of GPU time is spent in prefill vs decode. If prefill is less than 20% of total GPU time, optimize decode first.

Putting It All Together: A Decision Framework

Given a specific deployment scenario, here is how to decide which prefill optimizations to apply:

Step 1: Measure your baseline. Profile TTFT, TBT, and throughput. Determine the prefill-to-decode time ratio.

Step 2: Enable FlashAttention. This is a universal win with zero trade-offs. If you are not already using FlashAttention (or FlashDecoding for the decode phase), enable it immediately. Expected improvement: 2-4x on prefill latency for prompts longer than 1024 tokens.

Step 3: Enable prefix caching. If your workload has shared prefixes (system prompts, common RAG passages), enable APC in vLLM or use SGLang’s RadixAttention. Expected improvement: proportional to the fraction of shared tokens, up to 5x for long shared prefixes.

Step 4: Choose tensor parallelism degree. If TTFT still exceeds your SLO after FlashAttention and prefix caching, increase TP. Each doubling of TP roughly halves prefill time (with diminishing returns due to communication overhead). TP=4 is usually the sweet spot for cost efficiency.

Step 5: Enable chunked prefill. If you are serving concurrent users and long prefills are causing TBT spikes for decode users, enable chunked prefill. Tune chunk size to balance TTFT and TBT for your latency SLOs.

Step 6: Optimize prompts. If prefill is still a bottleneck, reduce prompt length. Convert few-shot to zero-shot, use structured prompts, compress retrieved contexts.

Decision tree:

Is TTFT > SLO?
├── No → Focus on decode optimization
└── Yes → Is FlashAttention enabled?
    ├── No → Enable it (2-4x improvement)
    └── Yes → Are there shared prefixes?
        ├── Yes → Enable prefix caching (up to 5x for shared portion)
        └── No/Still slow → Increase tensor parallelism
            └── Still slow? → Enable chunked prefill + optimize prompts

Performance Summary

📊 Combined Optimization Impact: Llama 70B, 4096-token Prompt, A100

| Configuration | TTFT (ms) | Cumulative Speedup | Notes |
| --- | --- | --- | --- |
| Baseline (1 GPU, standard attention) | 1520 | 1.0x | Memory-bound attention |
| + FlashAttention | 420 | 3.6x | Eliminates O(L^2) HBM traffic |
| + Prefix caching (2K shared prefix) | 180 | 8.4x | Avoids 50% of prefill compute |
| + TP=4 | 55 | 27.6x | 4-way compute split |
| + Prompt optimization (2K tokens) | 22 | 69x | Reduced prompt from 4096 to 2048 |

(Chart: Cumulative TTFT reduction across optimization stages; same data as the table above.)

Starting from a 1520ms TTFT, the combination of FlashAttention, prefix caching, tensor parallelism, and prompt optimization brings it down to 22ms — a 69x improvement. Each technique addresses a different aspect of the problem: FlashAttention reduces memory traffic, prefix caching eliminates redundant computation, tensor parallelism splits the remaining compute across GPUs, and prompt optimization reduces the work that needs to be done in the first place.

The key takeaway: prefill is compute-bound for long sequences, which makes it fundamentally amenable to parallelization and algorithmic optimization. Unlike decode (where you are fighting memory bandwidth, a hardware constraint), prefill performance can be dramatically improved through software techniques alone. The challenge is choosing the right combination for your specific workload, hardware, and latency requirements.