Part of Series: Inference Optimization Timeline (60 of 60)

The original Transformer handled 512 tokens. Gemini 1.5 Pro handles 1 million — a 2,000x increase achieved through a decade of fixing bottlenecks as they appeared. Transformer-XL introduced segment recurrence to break the 512-token barrier. RoPE enabled position encoding beyond training length. FlashAttention made dense attention fast enough for 32K contexts. Ring Attention distributed long sequences across GPUs for 128K-1M contexts. Each innovation unlocked new task categories: 4K isn’t enough for a research paper, 8K isn’t enough for a codebase, 32K isn’t enough for legal contracts with appendices. But long context has costs: quadratic memory scaling, attention dilution where important tokens drown in noise, and retrieval degradation as context grows. This post traces the full technical arc from 512 tokens to 1M+ and the tradeoffs at each step.

Why Long Context Matters

The Limitations of Short Context

The original GPT-2 had a context window of 1,024 tokens — roughly 750 words. GPT-3 extended this to 2,048 tokens. At these lengths, the model could handle a few paragraphs of conversation but nothing more. Any task requiring awareness of information spread across a longer document was out of reach.

📊

What Fits in Different Context Lengths

| Context Length | Approximate Words | Example Content | Fits? |
|---|---|---|---|
| 2K tokens | ~1,500 | A short blog post | Yes |
| 4K tokens | ~3,000 | A research paper abstract + intro | Barely |
| 8K tokens | ~6,000 | A full research paper | No (truncated) |
| 32K tokens | ~24,000 | A short novel chapter | Yes |
| 128K tokens | ~96,000 | A full codebase (medium project) | Yes |
| 1M tokens | ~750,000 | Multiple books or an entire codebase | Yes |

Long Context vs. RAG

Retrieval-Augmented Generation (RAG) is often presented as an alternative to long context. Instead of feeding the entire document into the model, you retrieve the relevant chunks and include only those. This works well for factual lookup tasks, but it has fundamental limitations.

RAG requires knowing which chunks are relevant before generating the answer. For questions that depend on synthesizing information from multiple sections of a document — “How does the conclusion contradict the methodology?” or “Summarize the key themes across all chapters” — retrieval often fails because no single chunk contains the answer.

Long context eliminates this retrieval step entirely. The model sees everything and can attend to any part of the input. The tradeoff is computational cost: processing 128K tokens is far more expensive than processing 4K tokens of retrieved context.

ℹ️ Long Context vs. RAG: Not Either/Or

In practice, production systems often combine both approaches. Long context handles tasks where global understanding is needed (summarization, code analysis, multi-document reasoning). RAG handles tasks where the answer is localized (factual Q&A over large corpora). The choice depends on the task, latency budget, and cost constraints.

Use Cases Enabled by Long Context

Document processing: Legal contracts, medical records, and financial reports often exceed 50K tokens. Long context allows end-to-end processing without chunking.

Code understanding: A medium-sized software project might have 100K-500K tokens of source code. With sufficient context, the model can understand cross-file dependencies, API contracts, and architectural patterns.

Multi-turn conversation: Long conversations accumulate context over time. A 128K context window can hold a few hundred pages of conversation history, enabling the model to reference earlier parts of the discussion.

Few-shot learning with many examples: More context means more in-context examples, which directly improves task performance for pattern matching and classification.

The Quadratic Problem

The fundamental challenge with extending context length is the self-attention mechanism. Standard self-attention computes a score between every pair of tokens, giving it O(n²) time and memory complexity, where n is the sequence length.

📊

Attention Computation Cost vs Sequence Length

| Sequence Length | Attention FLOPs (per layer) | KV Cache Memory (FP16) | Wall Time (A100) |
|---|---|---|---|
| 2K | 8M | 32 MB | 0.3 ms |
| 8K | 128M | 128 MB | 2.1 ms |
| 32K | 2B | 512 MB | 18 ms |
| 128K | 33B | 2 GB | 240 ms |
| 512K | 524B | 8 GB | ~3.8 s |
| 1M | 2T | 16 GB | ~15 s |

Note: Estimates for a single attention head with d_model=128. Actual models have multiple heads and layers, multiplying these costs.

Attention FLOPs Growth (relative to 2K baseline)

| Context Length | Relative Cost |
|---|---|
| 2K | 1x (baseline) |
| 8K | 16x |
| 32K | 256x |
| 128K | 4,096x |
| 512K | 65,536x |
| 1M | 262,144x |

This quadratic scaling means that naively doubling the context length quadruples the computation. Going from 2K to 1M tokens is a 512x increase in length and therefore a 512² ≈ 262,000x increase in attention FLOPs. Without architectural innovations to reduce this cost, long context would be completely impractical.
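The relative costs are easy to reproduce. The sketch below counts only the two big matrix products in attention (score computation and value aggregation) for a single head with d_head=128; absolute FLOP counts depend on what you include, but the ratios match the relative-cost figures above:

```python
def attention_flops(n, d_head=128):
    """Scores (QK^T) cost 2*n*n*d_head FLOPs; value aggregation costs the same."""
    return 4 * n * n * d_head

baseline = attention_flops(2_048)
for n in (2_048, 8_192, 32_768, 131_072, 524_288, 1_048_576):
    print(f"{n:>9} tokens: {attention_flops(n) // baseline:>7,}x baseline")
```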

Historical Foundation: Transformer-XL Segment Recurrence

Transformer-XL (Dai et al., 2019) was among the first architectures to seriously tackle the context length limitation. Its key insight was that you do not need to process the entire sequence at once — you can process it in segments and carry information forward through a recurrence mechanism.

How Segment Recurrence Works

Instead of treating each segment independently (which loses all cross-segment information), Transformer-XL caches the hidden states from the previous segment and makes them available to the current segment during attention computation:

# Conceptual Transformer-XL forward pass (projections W_q, W_k, W_v defined elsewhere)
def transformer_xl_forward(current_segment, cached_states):
    """
    current_segment: [batch, seg_len, d_model]
    cached_states: [batch, mem_len, d_model] from previous segment
    """
    # Concatenate cached states with current segment for key/value
    extended_context = torch.cat([cached_states, current_segment], dim=1)

    # Query comes from current segment only
    Q = W_q(current_segment)           # [batch, seg_len, d_model]
    K = W_k(extended_context)          # [batch, seg_len + mem_len, d_model]
    V = W_v(extended_context)          # [batch, seg_len + mem_len, d_model]

    # Attention spans both current and cached segments
    scores = Q @ K.transpose(-1, -2) / math.sqrt(Q.shape[-1])
    attention = torch.softmax(scores, dim=-1) @ V

    # Cache current hidden states for next segment
    new_cache = current_segment.detach()  # Stop gradients through cache

    return attention, new_cache

The .detach() call is critical: gradients do not flow through the cached states, which keeps training memory manageable. The model learns to produce hidden states that are useful when attended to by future segments, even though it never directly optimizes for this objective through backpropagation across segments.

Relative Positional Encoding

Segment recurrence creates a problem for positional encoding. If you use absolute positions (position 0, 1, 2, …, 511 in each segment), then position 0 in the current segment and position 0 in the cached segment have the same encoding, even though they are semantically at very different positions in the overall sequence.

Transformer-XL solved this with relative positional encoding, which encodes the distance between tokens rather than their absolute position. This was a precursor to the rotary positional encoding (RoPE) used in most modern LLMs.

📊

Transformer-XL vs Vanilla Transformer (2019 Benchmarks)

| Model | Effective Context | WikiText-103 PPL | enwik8 (bpc) | Memory Usage |
|---|---|---|---|---|
| Vanilla Transformer | 512 tokens | 24.0 | 1.08 | 1x |
| Transformer-XL (seg=512, mem=512) | 1,024 tokens | 20.5 | 1.03 | 1.5x |
| Transformer-XL (seg=512, mem=2048) | 2,560 tokens | 18.3 | 0.99 | 2.5x |

Note: Perplexity (PPL) and bits-per-character (bpc) are lower-is-better metrics.
ℹ️ Historical Significance

Transformer-XL’s segment recurrence was influential but is not used in modern LLMs directly. Its legacy lives on through two ideas: (1) relative positional encoding, which evolved into RoPE, and (2) the concept of extending effective context beyond the training window, which is now done through RoPE scaling rather than recurrence.

Limitations of Segment Recurrence

Despite its elegance, segment recurrence had practical drawbacks. The effective context grew linearly with the number of layers (each layer could “see” one more cached segment back), meaning deep networks were needed for long-range dependencies. The information from distant segments was also progressively degraded as it passed through multiple layers of processing. Modern approaches achieve much longer contexts more directly.

Sliding Window Attention

Sliding window attention, popularized by Mistral 7B (2023), takes a different approach to the quadratic problem. Instead of attending to all previous tokens, each layer only attends to a fixed window of the most recent W tokens.

How It Works

import math
import torch

def sliding_window_attention(Q, K, V, window_size=4096):
    """
    Each token attends only to itself and the previous window_size - 1 tokens.
    Q, K, V: [batch, seq_len, d_k]
    """
    seq_len, d_k = Q.shape[1], Q.shape[-1]

    # Banded causal mask: token i attends to tokens max(0, i - window_size + 1)..i
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask = torch.triu(causal, diagonal=-(window_size - 1))

    scores = Q @ K.transpose(-1, -2) / math.sqrt(d_k)
    scores = scores.masked_fill(~mask, float('-inf'))

    return torch.softmax(scores, dim=-1) @ V

With a window size of W, each token computes attention over at most W key-value pairs instead of all n pairs. This reduces the per-token cost from O(n) to O(W), making the total cost O(n·W) — linear in sequence length when W is fixed.

Effective Context Through Layer Stacking

The key insight is that while each layer has a limited window, stacking L layers creates an effective receptive field of L × W tokens. In Mistral 7B with W = 4096 and L = 32 layers, the theoretical receptive field is 32 × 4096 = 131,072 tokens, even though each layer only looks at 4,096 tokens.

📊

Sliding Window vs Full Attention

| Method | Per-Layer Cost | KV Cache Size | Effective Context | Quality on Long Tasks |
|---|---|---|---|---|
| Full Attention | O(n²) | O(n) | n tokens | Best |
| Sliding Window (W=4096) | O(n·W) | O(W) | L·W tokens (theoretical) | Good for most tasks |
| Sliding Window (W=1024) | O(n·W) | O(W) | L·W tokens (theoretical) | Degraded on long-range |
⚠️ Receptive Field vs. Effective Context

The theoretical receptive field of L × W does not mean the model can effectively use information that far back. Information must propagate through multiple layers to reach distant tokens, and this propagation is lossy. In practice, sliding window models perform well on most tasks but can struggle with tasks requiring precise retrieval of information from early in a long sequence.

Memory Savings

The most immediate benefit of sliding window attention is memory savings during inference. Instead of storing KV cache entries for all previous tokens (which grows linearly with sequence length), you only need to store the most recent W entries. For Mistral 7B with a 4,096-token window, the FP16 KV cache is capped at roughly 0.5 GB no matter how long the sequence gets, compared to about 16 GB for full attention at 128K.
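The cap is easy to see in a sketch. The dimensions below are Mistral-7B-like (32 layers, 8 KV heads, head dimension 128, FP16); exact totals vary with batch size and dtype, but the point is that the windowed cache stops growing:

```python
def sliding_window_kv_bytes(seq_len, window=None, layers=32,
                            kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache stores K and V for min(seq_len, window) positions per layer."""
    cached = seq_len if window is None else min(seq_len, window)
    return 2 * layers * kv_heads * head_dim * cached * dtype_bytes

# The windowed cache is identical at 128K and at 1M tokens
assert sliding_window_kv_bytes(131_072, window=4_096) == \
       sliding_window_kv_bytes(1_048_576, window=4_096)
```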

RoPE and Context Length Extension

Rotary Position Embedding (RoPE), introduced by Su et al. (2021), has become the standard positional encoding in modern LLMs (Llama, Mistral, Qwen, and many others). RoPE encodes position by rotating the query and key vectors in 2D subspaces at frequencies that depend on the position.

How RoPE Works

For a token at position m, RoPE applies a rotation to each pair of dimensions (2i, 2i+1) in the query and key vectors:

x'_{2i}   = x_{2i} cos(m·θ_i) - x_{2i+1} sin(m·θ_i)
x'_{2i+1} = x_{2i} sin(m·θ_i) + x_{2i+1} cos(m·θ_i)

where θ_i = 10000^(-2i/d) defines the rotation frequency for dimension pair i. Low-frequency dimensions rotate slowly and encode coarse positional information; high-frequency dimensions rotate quickly and encode fine-grained position.

The inner product between two RoPE-encoded vectors depends only on the relative distance between them, making RoPE a form of relative positional encoding. This is what makes context extension possible.
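This shift-invariance can be checked numerically. The sketch below implements the rotation from the formula above in NumPy and verifies that moving both tokens by the same offset leaves the attention score unchanged:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive dimension pair of x by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same relative distance (7), different absolute positions: identical score
s1 = rope(q, 10) @ rope(k, 3)
s2 = rope(q, 110) @ rope(k, 103)
assert np.allclose(s1, s2)
```

Because each 2D rotation satisfies R(mθ)ᵀ R(nθ) = R((n - m)θ), only the gap n - m survives in the inner product.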

RoPE Scaling for Context Extension

A model trained with RoPE at context length L_train has only seen rotation angles up to L_train × θ_i. When you try to use it at length L > L_train, the rotation angles exceed what was seen during training, causing performance to degrade.

Several techniques extend context by modifying the RoPE frequencies:

Position Interpolation (PI): Scale all positions by L_train / L_target so that the maximum rotation angle stays within the training range. A model trained at 4K and extended to 32K would divide all positions by 8.

def position_interpolation(position, base_context=4096, target_context=32768):
    scale = base_context / target_context
    return position * scale  # Compress positions into original range

NTK-Aware Scaling: Instead of scaling all frequencies uniformly, modify the RoPE base θ to spread the rotation angles more evenly. This preserves high-frequency (local) positional information while extending low-frequency (global) range.

import torch

def ntk_aware_rope(position, dim, base=10000, scale=4):
    """NTK-aware RoPE scaling: returns the rotation angles for each dim pair."""
    # Enlarge the base, which slows all rotations; the exponent spreads the
    # change so high-frequency (local) dimensions are barely affected
    new_base = base * (scale ** (dim / (dim - 2)))
    theta = 1.0 / (new_base ** (torch.arange(0, dim, 2).float() / dim))
    return position * theta

YaRN (Yet another RoPE extensioN): Combines NTK-by-parts frequency scaling with an attention temperature adjustment. YaRN treats frequency bands differently — high-frequency dimensions are left unscaled (extrapolated) to preserve fine-grained local position, low-frequency dimensions are fully interpolated, and the band in between is blended.
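The band-splitting can be sketched in a few lines. The α = 1 and β = 32 rotation-count cutoffs below follow the YaRN paper's NTK-by-parts scheme; treat this as an illustrative reimplementation, not the reference code:

```python
import numpy as np

def yarn_scaled_theta(dim=128, base=10000.0, scale=8.0,
                      train_ctx=4096, alpha=1.0, beta=32.0):
    """Scale each RoPE frequency by how many full turns it makes in training."""
    theta = base ** (-np.arange(0, dim, 2) / dim)
    rotations = train_ctx * theta / (2 * np.pi)   # turns within training window
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 1 (fast-rotating, local detail): theta kept as-is (extrapolation)
    # ramp = 0 (slow-rotating, global position): theta / scale (interpolation)
    return theta * (ramp + (1.0 - ramp) / scale)
```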

📊

RoPE Scaling Methods Comparison

| Method | Extension Factor | Fine-tuning Needed | Quality at Extended Length | Quality at Original Length |
|---|---|---|---|---|
| No scaling (extrapolation) | 2-4x | No | Poor (rapid degradation) | Unchanged |
| Position Interpolation | 4-8x | Yes (~1K steps) | Good | Slightly degraded |
| NTK-Aware | 4-8x | Optional | Good | Preserved |
| YaRN | 8-32x | Yes (~400 steps) | Very good | Preserved |
| Llama 3.1 (custom RoPE) | 16x (8K to 128K) | Yes (full training) | Excellent | Preserved |

Note: Extension factor is relative to original training context length.
The Cost of Context Extension

RoPE scaling is computationally free at inference time — it just changes the rotation angles. The cost is in the fine-tuning required to adapt the model to the new positional distribution. Position Interpolation requires about 1,000 gradient steps on long-context data. YaRN needs about 400 steps. This is orders of magnitude cheaper than training from scratch.

Ring Attention for Distributed Long Sequences

When context lengths reach hundreds of thousands or millions of tokens, a single GPU cannot hold the KV cache in memory. Ring attention (Liu et al., 2023) distributes the sequence across multiple devices and computes attention in a communication-efficient ring topology.

The Core Idea

Split the sequence into chunks and distribute them across P devices arranged in a ring. Each device holds one chunk of queries and one chunk of key-value pairs. Attention is computed in P rounds: in each round, every device computes the partial attention between its local queries and the current key-value chunk, then passes the key-value chunk to the next device in the ring.

def ring_attention(Q_local, K_local, V_local, num_devices, device_id):
    """
    Each device holds Q_local and starts with its own K_local, V_local.
    After P rounds, each device has computed full attention for its queries.
    """
    seq_chunk = Q_local.shape[1]  # tokens per device
    d_k = Q_local.shape[-1]

    # Initialize accumulators for online softmax
    running_max = torch.full((Q_local.shape[0], seq_chunk, 1), float('-inf'))
    running_sum = torch.zeros_like(running_max)
    running_output = torch.zeros_like(Q_local)

    K_current, V_current = K_local, V_local

    for round_idx in range(num_devices):
        # Compute scores against the K,V chunk this device currently holds
        scores = Q_local @ K_current.transpose(-1, -2) / math.sqrt(d_k)

        # Apply causal mask (only for appropriate chunks)
        if needs_causal_mask(device_id, round_idx, num_devices):
            scores = apply_causal_mask(scores)

        # Online softmax update (numerically stable accumulation)
        local_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, local_max)

        # Rescale previous accumulator
        correction = torch.exp(running_max - new_max)
        running_output = running_output * correction
        running_sum = running_sum * correction

        # Add current block's contribution
        local_exp = torch.exp(scores - new_max)
        running_output += local_exp @ V_current
        running_sum += local_exp.sum(dim=-1, keepdim=True)
        running_max = new_max

        # Ring communication: send K,V to next device, receive from previous
        K_current = ring_send_recv(K_current)
        V_current = ring_send_recv(V_current)

    return running_output / running_sum

Why Ring Topology Is Efficient

The key property of ring attention is that computation and communication overlap. While a device is computing attention with the current K,V block, it is simultaneously sending/receiving the next K,V block. If the computation time exceeds the communication time (which is typically true for large enough chunks), the communication cost is completely hidden.
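Whether the transfer actually hides is a back-of-envelope check: per-round compute grows quadratically with the chunk size while the K,V transfer grows only linearly. The hardware numbers below (400 TFLOPS sustained compute, 100 GB/s per link) are placeholders, not measurements:

```python
def ring_round_times(chunk, n_heads=32, head_dim=128,
                     tflops=400.0, link_gbps=100.0, dtype_bytes=2):
    """Per-round compute vs. communication time for one ring-attention step."""
    # QK^T and the value aggregation each cost 2*chunk*chunk*head_dim per head
    flops = 4 * chunk * chunk * head_dim * n_heads
    # Each round ships one K chunk and one V chunk to the ring neighbor
    comm_bytes = 2 * chunk * n_heads * head_dim * dtype_bytes
    return flops / (tflops * 1e12), comm_bytes / (link_gbps * 1e9)

compute_t, comm_t = ring_round_times(chunk=131_072)
print(f"compute {compute_t*1e3:.0f} ms vs comm {comm_t*1e3:.0f} ms per round")
```

With 128K-token chunks the compute dwarfs the transfer, so communication is fully hidden; shrink the chunk far enough and the inequality flips, which is the overhead trend shown in the scaling numbers below.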

📊

Ring Attention Scaling

| Devices | Tokens per Device | Total Context | Communication Overhead | Throughput vs Single GPU |
|---|---|---|---|---|
| 1 | 128K | 128K | 0% | 1.0x |
| 4 | 128K | 512K | ~5% | 3.8x |
| 8 | 128K | 1M | ~8% | 7.4x |
| 16 | 64K | 1M | ~12% | 14.1x |
| 32 | 32K | 1M | ~18% | 26.2x |

Note: Communication overhead increases as chunk size decreases because the compute-to-communication ratio shrinks.
ℹ️ Ring Attention and Flash Attention

Ring attention composes naturally with Flash Attention. Each local attention computation (Q_local against the current K,V chunk) uses Flash Attention’s tiled, memory-efficient algorithm. The ring structure handles the distribution across devices while Flash Attention handles the efficiency within each device. This combination is what makes million-token contexts practical.

Chunked Prefill for Long Prompts

When a user submits a long prompt (say, a 100K-token document), the model must process the entire prompt before generating the first output token. This prefill phase is compute-bound and can take many seconds. Chunked prefill splits the prompt into smaller chunks and processes them sequentially, which provides several benefits.

Why Chunk the Prefill?

Memory management: Processing 128K tokens at once requires enormous temporary memory for the attention computation. Chunking reduces peak memory usage.

Scheduling flexibility: In a serving system handling multiple requests, a 128K-token prefill would monopolize the GPU for seconds. Chunking allows the scheduler to interleave prefill chunks from one request with decode steps from other requests, improving overall throughput and latency.

Pipeline efficiency: Chunks can be distributed across pipeline stages more evenly than a single monolithic prefill.

def chunked_prefill(prompt_tokens, model, chunk_size=8192):
    """
    Process a long prompt in chunks, building up the KV cache incrementally.
    """
    kv_cache = None
    num_chunks = (len(prompt_tokens) + chunk_size - 1) // chunk_size

    for i in range(num_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, len(prompt_tokens))
        chunk = prompt_tokens[start:end]

        # Process chunk, attending to all previous KV cache entries + current chunk
        hidden, new_kv = model.forward(
            chunk,
            kv_cache=kv_cache,
            is_prefill=True
        )

        # Append new KV entries to cache
        if kv_cache is None:
            kv_cache = new_kv
        else:
            kv_cache = concatenate_kv(kv_cache, new_kv)

    return kv_cache  # Ready for autoregressive decoding
📊

Chunked Prefill Impact on Serving

| Chunk Size | Prefill Latency (128K prompt) | Decode Latency Impact | GPU Utilization |
|---|---|---|---|
| No chunking (128K at once) | 4.2 s | Blocked for 4.2 s | 95% (prefill only) |
| 32K chunks | 4.5 s | Interleaved, ~50 ms bubbles | 88% |
| 8K chunks | 4.8 s | Interleaved, ~12 ms bubbles | 82% |
| 2K chunks | 5.5 s | Nearly transparent | 75% |

Note: Smaller chunks reduce decode latency impact but increase total prefill time due to overhead.

Memory Requirements at Scale

Understanding memory consumption is critical for capacity planning. The KV cache is the dominant memory consumer for long-context inference.

KV Cache Calculation

For a model with L layers, H key-value heads, head dimension d_h, and sequence length n, the KV cache size is:

KV Cache = 2 × L × H × d_h × n × bytes_per_element

The factor of 2 accounts for both keys and values.
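The formula translates directly into a helper; the configuration plugged in below (32 layers, 8 KV heads, head dimension 128, FP16) is Llama-3.1-8B-like:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

print(kv_cache_bytes(32, 8, 128, 131_072) / 2**30, "GiB")  # 16.0 GiB at 128K
```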

📊

KV Cache Memory for Popular Models

| Model | Layers | KV Heads | Head Dim | Context | KV Cache (FP16) |
|---|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 (GQA) | 128 | 8K | 1 GB |
| Llama 3.1 8B | 32 | 8 (GQA) | 128 | 128K | 16 GB |
| Llama 3.1 70B | 80 | 8 (GQA) | 128 | 128K | 40 GB |
| Llama 3.1 405B | 126 | 8 (GQA) | 128 | 128K | 63 GB |
| Gemini 1.5 Pro (est.) | ~96 | ~16 | 128 | 1M | ~786 GB |

Note: GQA (Grouped Query Attention) reduces KV heads relative to query heads, dramatically reducing cache size. Cache sizes follow the formula above.
GQA Is Essential for Long Context

Grouped Query Attention (GQA) is not optional for long-context models. Llama 3.1 70B uses only 8 KV heads (vs 64 query heads), reducing the KV cache by 8x compared to standard multi-head attention. Without GQA, the 128K-context KV cache would be roughly 320 GB instead of 40 GB — larger than the model weights themselves.

Total Memory Budget

For a complete picture, the total GPU memory needed for inference includes model weights, KV cache, and activation memory:

📊

Total Memory Budget for Long-Context Inference

| Component | 8B model @ 8K | 8B model @ 128K | 70B model @ 128K |
|---|---|---|---|
| Model weights (FP16) | 16 GB | 16 GB | 140 GB |
| KV cache (FP16) | 1 GB | 16 GB | 40 GB |
| Activations (peak) | 1 GB | 4 GB | 8 GB |
| Total | 18 GB | 36 GB | 188 GB |
| GPUs needed (80GB H100) | 1 | 1 | 3 (tensor parallel) |

How Gemini Reaches 1M Tokens

Google’s Gemini 1.5 Pro was the first production model to support 1 million tokens of context. While Google has not published full architectural details, the system likely combines several techniques:

Sparse/efficient attention: Some form of attention sparsity (possibly learned or structured) to reduce the quadratic cost. This might include a mix of local and global attention patterns.

Ring attention across TPU pods: Google’s TPU v5 pods provide massive interconnect bandwidth (up to 4.8 Tbps per chip via ICI). Ring attention across a TPU pod can distribute the 1M-token KV cache with minimal communication overhead.

Multi-query or grouped-query attention: Reducing KV heads is essential at 1M context. Even with 8 KV heads and 128-dim heads, a 100-layer model would need 2 × 100 × 8 × 128 × 10^6 × 2 bytes ≈ 409 GB for the KV cache alone.

Quantized KV cache: Quantizing the KV cache to INT8 or INT4 halves or quarters the memory requirement without significant quality loss.

Hierarchical attention: Some evidence suggests Gemini uses a form of hierarchical attention where early layers process local context and later layers have access to broader context, reducing the average attention span.

Estimated Cost per 1M Token Query (Gemini-class model)

| Stage | Estimated Time |
|---|---|
| Prefill (1M tokens, compute-bound) | ~30 s |
| First-token latency | ~32 s |
| Decode per token (memory-bound) | ~0.05 s |
| Generate 1K tokens | ~50 s |

How Llama 3.1 Reaches 128K Tokens

Meta’s Llama 3.1 paper provides more detail about their long-context approach:

Training strategy: Llama 3.1 was trained in stages. The initial pretraining used 8K context. The model then underwent continued pretraining with progressively longer sequences: 16K, 32K, 64K, and finally 128K. This staged approach is more stable than training at 128K from the start.

RoPE frequency adjustment: Llama 3.1 uses a modified RoPE with an increased base (θ base = 500,000 instead of the original 10,000). This larger base allows the rotation angles to stay within a well-behaved range at longer positions.

GQA with 8 KV heads: All Llama 3.1 models use 8 KV heads regardless of model size, keeping the KV cache manageable at 128K context.

Long-context data: A significant fraction of the continued-pretraining data consisted of long documents (books, code repositories, long articles) to teach the model to actually use long context effectively.

📊

Llama 3.1 Long-Context Performance

| Task | 4K Context | 32K Context | 128K Context |
|---|---|---|---|
| RULER (synthetic retrieval) | 95.2% | 93.8% | 88.6% |
| Needle in haystack (single) | 100% | 99.5% | 98.7% |
| Needle in haystack (multi) | 98.1% | 94.2% | 87.3% |
| Long document QA | 82.4% | 86.1% | 84.7% |
| Code repo understanding | 71.3% | 78.9% | 81.2% |

Note: Performance at 128K is strong but not as good as at shorter contexts for retrieval tasks. Understanding tasks benefit from more context.
💡 Staged Training Is Key

Llama 3.1’s staged approach — pretraining at 8K then extending to 128K — is now standard practice. Training at very long context from the start wastes compute on short-range patterns that could be learned more efficiently with shorter sequences. The extension phase is relatively cheap (about 5% of total pretraining compute).

When Long Context Hurts

Long context is not universally beneficial. There are several failure modes and costs that practitioners must understand.

Cost Scaling

The most obvious cost is computational. Processing 128K tokens costs roughly 128K / 4K = 32x more than processing 4K tokens for the prefill phase. For API-based services, this translates directly to cost:

📊

API Cost Comparison at Different Context Lengths

| Provider/Model | Input Cost (per 1M tokens) | 128K Prompt Cost | 4K Prompt Cost | Ratio |
|---|---|---|---|---|
| GPT-4o | $2.50 | $0.32 | $0.01 | 32x |
| Claude 3.5 Sonnet | $3.00 | $0.38 | $0.012 | 32x |
| Gemini 1.5 Pro | $1.25 | $0.16 | $0.005 | 32x |
| Llama 3.1 70B (self-hosted) | ~$0.80 | ~$0.10 | ~$0.003 | 32x |

Note: Costs are approximate and subject to change. The 32x ratio reflects linear input-token pricing.

Attention Dilution

As context length increases, the attention distribution becomes more spread out. Each token has more keys to attend to, which can dilute the attention weights on the most relevant tokens. This manifests as:

  • Decreased retrieval accuracy: The “needle in a haystack” task (finding a specific piece of information in a long context) becomes harder as the haystack grows.
  • Lost in the middle: Several studies have shown that LLMs are better at using information at the beginning and end of the context than in the middle. This “U-shaped” attention pattern means that simply having long context does not guarantee the model will use all of it effectively.
  • Decreased generation quality: For tasks like summarization, very long input can lead to less focused, more generic outputs as the model tries to attend to too much information simultaneously.
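
Attention dilution can be illustrated with a toy model: one "needle" key gets a fixed score advantage over n i.i.d. noise keys, and its post-softmax weight shrinks as the haystack grows. This is a deliberately simplified sketch, not a measurement of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def needle_weight(n_keys: int, needle_margin: float = 4.0) -> float:
    """Softmax weight on a single relevant key among n_keys distractors.

    Toy model: distractor scores are N(0, 1); the needle scores a fixed
    `needle_margin`. Not a real attention head, just an illustration of
    how a constant score advantage gets diluted by more keys.
    """
    scores = rng.standard_normal(n_keys)
    scores[0] = needle_margin
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[0]

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} keys -> needle weight {needle_weight(n):.5f}")
```

With a fixed margin, the needle's weight falls roughly in proportion to the number of distractors, which is one intuition for why retrieval accuracy degrades at longer contexts.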

Retrieval Accuracy vs Context Length (Typical LLM)

| Context Length | Retrieval Accuracy | Qualitative |
|---|---|---|
| 4K | 99% | Near-perfect |
| 16K | 97% | |
| 32K | 94% | |
| 64K | 90% | |
| 128K | 85% | Noticeable degradation |
| 512K | 75% | |
| 1M | 65% | Significant degradation |

Latency Impact

Long context dramatically increases time-to-first-token (TTFT), which is often the most important latency metric for interactive applications:

📊

Time-to-First-Token vs Context Length

| Context Length | TTFT (8B model, 1x H100) | TTFT (70B model, 8x H100) | User Experience |
|---|---|---|---|
| 2K tokens | 50 ms | 120 ms | Instant |
| 8K tokens | 180 ms | 450 ms | Acceptable |
| 32K tokens | 700 ms | 1.8 s | Noticeable delay |
| 128K tokens | 2.8 s | 7.2 s | Significant wait |
| 512K tokens | 11 s | ~30 s | Very long wait |

Note: Estimates for optimized serving with Flash Attention. Actual numbers vary with implementation.
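
A back-of-envelope model for numbers like these: prefill is roughly matmul-bound, so FLOPs ≈ 2 × parameters × prompt tokens, divided by achieved throughput. The sketch below assumes an H100-class peak of 989 TFLOPS (BF16, dense) and 50% utilization, both illustrative assumptions; it ignores the quadratic attention term and multi-GPU overheads, so it reproduces the trend in the table rather than its exact values:

```python
def estimate_ttft(params_b: float, context_tokens: int,
                  peak_tflops: float = 989.0, mfu: float = 0.5) -> float:
    """Very rough prefill-time estimate in seconds.

    Assumes TTFT is dominated by prefill compute:
        FLOPs ~= 2 * params * tokens   (matmul-dominated; attention ignored)
    peak_tflops defaults to H100 BF16 dense; mfu is an assumed utilization.
    """
    flops = 2 * params_b * 1e9 * context_tokens
    return flops / (peak_tflops * 1e12 * mfu)

for n in (2_048, 8_192, 32_768, 131_072):
    print(f"{n:>7} tokens: ~{estimate_ttft(8, n) * 1000:.0f} ms")
```

The key takeaway is the linearity: under this model, a 64x longer prompt means a 64x longer TTFT.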

When to Avoid Long Context

⚠️ Think Before You Fill the Context

Long context should not be the default. Consider these guidelines:

  • If your task can be solved with a 4K-token prompt and RAG, do that. It is 32x cheaper and faster.
  • Use long context when the task genuinely requires global understanding (summarization, cross-reference analysis, code understanding).
  • Be aware that model quality degrades gradually with context length. Test your specific use case at the target length.
  • Monitor TTFT in production. Users will not wait 10 seconds for a response in an interactive application.

Flash Attention: The Enabler

No discussion of long context is complete without Flash Attention (Dao et al., 2022). While Flash Attention does not change the asymptotic complexity of attention (it remains O(n²)), it reduces the constant factor dramatically by eliminating the need to materialize the full n × n attention matrix in GPU HBM.

How Flash Attention Helps Long Context

Standard attention computes the full n × n score matrix, stores it in HBM, applies softmax, then multiplies by the values. For n = 128K, this matrix alone requires 128K × 128K × 2 bytes = 32 GB in FP16 — more than the memory of most GPUs.

Flash Attention tiles the computation so that only small blocks of the attention matrix exist at any time, stored in fast SRAM (shared memory) rather than slow HBM. This reduces memory usage from O(n²) to O(n) and also improves speed by 2-4x due to reduced HBM traffic.
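
The tiling idea can be sketched in a few lines of numpy: stream over KV blocks while keeping only a running row-max and softmax denominator (the "online softmax" trick), so the full n × n score matrix never exists. This is a single-head toy sketch of the algorithm, not the real kernel, which also orchestrates SRAM tiles and backward-pass recomputation:

```python
import numpy as np

def tiled_attention(Q, K, V, block: int = 128) -> np.ndarray:
    """Single-head attention that never materializes the full n x n matrix.

    Streams over KV blocks with an online softmax: a running row-wise max m
    and running denominator l, rescaling previously accumulated results
    whenever the max changes. This is the core trick behind Flash Attention.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)              # running row-wise max of scores
    l = np.zeros(n)                      # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale           # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)        # rescale factor for previous tiles
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Sanity check against naive softmax attention:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

Each tile's scores occupy only n × block entries, which is why peak memory scales with n instead of n².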

📊

Standard vs Flash Attention at Long Context

| Sequence Length | Standard Attention Memory | Flash Attention Memory | Flash Speedup |
|---|---|---|---|
| 4K | 32 MB | 4 MB | 1.5x |
| 16K | 512 MB | 16 MB | 2.1x |
| 64K | 8 GB | 64 MB | 2.8x |
| 128K | 32 GB (OOM on most GPUs) | 128 MB | 3.2x |

Note: Memory shown is for the attention computation workspace only, not KV cache.

Without Flash Attention, 128K context would be impractical on current hardware. With it, the bottleneck shifts from attention computation memory to KV cache storage.

Putting It All Together: A Modern Long-Context Stack

A production 128K-token serving system in 2025 typically combines:

  1. RoPE with extended base frequency for positional encoding that generalizes beyond training length
  2. GQA with 8 KV heads to keep the KV cache manageable
  3. Flash Attention v2 for memory-efficient, fast attention computation
  4. Paged KV cache (vLLM-style) for flexible memory management across requests
  5. Chunked prefill for scheduling efficiency in multi-request serving
  6. KV cache quantization (INT8 or FP8) to further reduce memory footprint
  7. Tensor parallelism across multiple GPUs for larger models
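
To see why items 2 and 6 matter, it helps to put numbers on the KV cache. A sketch assuming Llama-3.1-70B-like shapes (80 layers, 8 KV heads after GQA, head dimension 128); the formula is general, the defaults are the assumption:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size:
    2 (K and V) x layers x kv_heads x head_dim x tokens x element size.
    Default shapes assume a Llama-3.1-70B-like model with GQA."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GB = 1024 ** 3
print(f"128K, FP16: {kv_cache_bytes(131_072) / GB:.1f} GB")
print(f"128K, FP8 : {kv_cache_bytes(131_072, bytes_per_elem=1) / GB:.1f} GB")
```

Under these assumptions a single 128K-token request costs about 40 GB of KV cache at FP16; FP8 halves that, and without GQA (64 KV heads instead of 8) it would be 8x larger, which is why the full stack above is needed rather than any one technique.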

For 1M-token systems, add:

  1. Ring attention or sequence parallelism for distributing the sequence across devices
  2. Sparse or hierarchical attention to reduce the effective quadratic cost
  3. Specialized hardware (TPU v5, Grace Hopper) with high-bandwidth interconnects
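
Item 1 can be sketched with plain numpy: shard the sequence across simulated devices, keep each device's queries local, and rotate the KV shards around the ring while folding each incoming block into a running online softmax. Real implementations overlap this communication with compute on accelerators; this sketch shows only the data flow, for non-causal attention:

```python
import numpy as np

def local_attend(Q, K, V, m, l, out):
    """Fold one incoming KV shard into the running (max, denom, numerator)."""
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    m_new = np.maximum(m, S.max(axis=1))
    alpha = np.exp(m - m_new)
    p = np.exp(S - m_new[:, None])
    return m_new, l * alpha + p.sum(axis=1), out * alpha[:, None] + p @ V

def ring_attention(Q, K, V, n_devices: int = 4) -> np.ndarray:
    """Simulated ring attention: the sequence is sharded across n_devices;
    each "device" keeps its Q shard and passes KV shards around the ring,
    so no device ever holds more than 1/n_devices of the keys and values."""
    Qs = np.array_split(Q, n_devices)
    kv = list(zip(np.array_split(K, n_devices), np.array_split(V, n_devices)))
    outs = []
    for i, Qi in enumerate(Qs):          # each device, in parallel on real HW
        m = np.full(Qi.shape[0], -np.inf)
        l = np.zeros(Qi.shape[0])
        out = np.zeros_like(Qi)
        for step in range(n_devices):    # KV shards rotate around the ring
            Kb, Vb = kv[(i + step) % n_devices]
            m, l, out = local_attend(Qi, Kb, Vb, m, l, out)
        outs.append(out / l[:, None])
    return np.concatenate(outs)
```

Because each device only ever sees one KV shard at a time, per-device memory scales with sequence length divided by the ring size, which is what makes 1M-token contexts fit at all.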

Long Context Technique Effectiveness

| Technique stack | Typical context | Multiplier vs 512 tokens |
|---|---|---|
| Vanilla Transformer (2017) | 512 tokens | 1x |
| + Segment recurrence | ~4K | 8x |
| + RoPE scaling | ~32K | 64x |
| + Flash Attention + GQA | ~128K | 256x |
| + Ring Attention | ~1M | ~2,000x |

Conclusion

The journey from 2K to 1M tokens has been driven by a stack of innovations, each addressing a different bottleneck:

  • Positional encoding (Transformer-XL relative PE to RoPE to RoPE scaling) solved the problem of representing position at arbitrary lengths.
  • Attention efficiency (Flash Attention, sliding window) solved the memory and compute cost of the attention operation itself.
  • KV cache reduction (GQA, quantization, paged allocation) solved the memory cost of storing past context.
  • Distribution (ring attention, sequence parallelism) solved the problem of fitting long sequences across multiple devices.
  • Serving (chunked prefill, continuous batching) solved the problem of efficiently serving long-context requests alongside short ones.

The frontier continues to push forward. Context lengths of 10M tokens are being explored in research, and new architectures (state-space models, linear attention variants) promise to break the quadratic barrier entirely. But for now, the Transformer with these accumulated optimizations remains the dominant architecture, capable of processing context lengths that would have seemed impossible just a few years ago.

The practical lesson is clear: long context is a powerful capability, but it is not free. Every doubling of context length doubles the cost and increases latency. The best systems use long context surgically — for tasks that genuinely benefit from it — while falling back to cheaper approaches (RAG, summarization, truncation) when the task does not require global understanding over the full input.