Part of Series: Inference Optimization Timeline (60 of 60)

The original Transformer handled 512 tokens. Gemini 1.5 Pro handles 1 million — a 2,000x increase achieved through a decade of fixing bottlenecks as they appeared. Transformer-XL introduced segment recurrence to break the 512-token barrier. RoPE enabled position encoding beyond training length. FlashAttention made dense attention fast enough for 32K contexts. Ring Attention distributed long sequences across GPUs for 128K-1M contexts. Each innovation unlocked new task categories: 4K isn’t enough for a research paper, 8K isn’t enough for a codebase, 32K isn’t enough for legal contracts with appendices. But long context has costs: quadratic memory scaling, attention dilution where important tokens drown in noise, and retrieval degradation as context grows. This post traces the full technical arc from 512 tokens to 1M+ and the tradeoffs at each step.

Why Long Context Matters

The Limitations of Short Context

The original GPT-2 had a context window of 1,024 tokens — roughly 750 words. GPT-3 extended this to 2,048 tokens. At these lengths, the model could handle a few paragraphs of conversation but nothing more. Any task requiring awareness of information spread across a longer document was out of reach.

📊

What Fits in Different Context Lengths

| Context Length | Approximate Words | Example Content | Fits? |
|---|---|---|---|
| 2K tokens | ~1,500 | A short blog post | Yes |
| 4K tokens | ~3,000 | A research paper abstract + intro | Barely |
| 8K tokens | ~6,000 | A full research paper | No (truncated) |
| 32K tokens | ~24,000 | A short novel chapter | Yes |
| 128K tokens | ~96,000 | A full codebase (medium project) | Yes |
| 1M tokens | ~750,000 | Multiple books or an entire codebase | Yes |

Long Context vs. RAG

Retrieval-Augmented Generation (RAG) is often presented as an alternative to long context. Instead of feeding the entire document into the model, you retrieve the relevant chunks and include only those. This works well for factual lookup tasks, but it has fundamental limitations.

RAG requires knowing which chunks are relevant before generating the answer. For questions that depend on synthesizing information from multiple sections of a document — “How does the conclusion contradict the methodology?” or “Summarize the key themes across all chapters” — retrieval often fails because no single chunk contains the answer.

Long context eliminates this retrieval step entirely. The model sees everything and can attend to any part of the input. The tradeoff is computational cost: processing 128K tokens is far more expensive than processing 4K tokens of retrieved context.

ℹ️ Long Context vs. RAG: Not Either/Or

In practice, production systems often combine both approaches. Long context handles tasks where global understanding is needed (summarization, code analysis, multi-document reasoning). RAG handles tasks where the answer is localized (factual Q&A over large corpora). The choice depends on the task, latency budget, and cost constraints.

Use Cases Enabled by Long Context

Document processing: Legal contracts, medical records, and financial reports often exceed 50K tokens. Long context allows end-to-end processing without chunking.

Code understanding: A medium-sized software project might have 100K-500K tokens of source code. With sufficient context, the model can understand cross-file dependencies, API contracts, and architectural patterns.

Multi-turn conversation: Long conversations accumulate context over time. A 128K context window can hold a few hundred pages of conversation history, enabling the model to reference earlier parts of the discussion.

Few-shot learning with many examples: More context means more in-context examples, which directly improves task performance for pattern matching and classification.

The Quadratic Problem

The fundamental challenge with extending context length is the self-attention mechanism. Standard self-attention computes a score between every pair of tokens, giving it O(n²) time and memory complexity, where n is the sequence length.

📊

Attention Computation Cost vs Sequence Length

| Sequence Length | Attention FLOPs (per layer) | KV Cache Memory (FP16) | Wall Time (A100) |
|---|---|---|---|
| 2K | 8M | 32 MB | 0.3 ms |
| 8K | 128M | 128 MB | 2.1 ms |
| 32K | 2B | 512 MB | 18 ms |
| 128K | 33B | 2 GB | 240 ms |
| 512K | 524B | 8 GB | ~3.8 s |
| 1M | 2T | 16 GB | ~15 s |

Note: Estimates for a single attention head with d_model=128. Actual models have multiple heads and layers, multiplying these costs.

Attention FLOPs Growth (relative to 2K baseline)

| Context Length | Relative Cost |
|---|---|
| 2K | 1x (baseline) |
| 8K | 16x |
| 32K | 256x |
| 128K | 4,096x |
| 512K | 65,536x |
| 1M | 262,144x |

This quadratic scaling means that naively doubling the context length quadruples the computation. Going from 2K to 1M tokens is a 512x increase in length and therefore a 512² ≈ 262,000x increase in attention FLOPs. Without architectural innovations to reduce this cost, long context would be completely impractical.
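The relative costs are easy to reproduce. The sketch below counts only the two big matrix products in attention (score computation and value aggregation) for a single head with d_head=128; absolute FLOP counts depend on what you include, but the ratios match the relative-cost figures above:

```python
def attention_flops(n, d_head=128):
    """Scores (QK^T) cost 2*n*n*d_head FLOPs; value aggregation costs the same."""
    return 4 * n * n * d_head

baseline = attention_flops(2_048)
for n in (2_048, 8_192, 32_768, 131_072, 524_288, 1_048_576):
    print(f"{n:>9} tokens: {attention_flops(n) // baseline:>7,}x baseline")
```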

Historical Foundation: Transformer-XL Segment Recurrence

Transformer-XL (Dai et al., 2019) was among the first architectures to seriously tackle the context length limitation. Its key insight was that you do not need to process the entire sequence at once — you can process it in segments and carry information forward through a recurrence mechanism.

How Segment Recurrence Works

Instead of treating each segment independently (which loses all cross-segment information), Transformer-XL caches the hidden states from the previous segment and makes them available to the current segment during attention computation:

# Conceptual Transformer-XL forward pass (projections W_q, W_k, W_v defined elsewhere)
def transformer_xl_forward(current_segment, cached_states):
    """
    current_segment: [batch, seg_len, d_model]
    cached_states: [batch, mem_len, d_model] from previous segment
    """
    # Concatenate cached states with current segment for key/value
    extended_context = torch.cat([cached_states, current_segment], dim=1)

    # Query comes from current segment only
    Q = W_q(current_segment)           # [batch, seg_len, d_model]
    K = W_k(extended_context)          # [batch, seg_len + mem_len, d_model]
    V = W_v(extended_context)          # [batch, seg_len + mem_len, d_model]

    # Attention spans both current and cached segments
    scores = Q @ K.transpose(-1, -2) / math.sqrt(Q.shape[-1])
    attention = torch.softmax(scores, dim=-1) @ V

    # Cache current hidden states for next segment
    new_cache = current_segment.detach()  # Stop gradients through cache

    return attention, new_cache

The .detach() call is critical: gradients do not flow through the cached states, which keeps training memory manageable. The model learns to produce hidden states that are useful when attended to by future segments, even though it never directly optimizes for this objective through backpropagation across segments.

Relative Positional Encoding

Segment recurrence creates a problem for positional encoding. If you use absolute positions (position 0, 1, 2, …, 511 in each segment), then position 0 in the current segment and position 0 in the cached segment have the same encoding, even though they are semantically at very different positions in the overall sequence.

Transformer-XL solved this with relative positional encoding, which encodes the distance between tokens rather than their absolute position. This was a precursor to the rotary positional encoding (RoPE) used in most modern LLMs.

📊

Transformer-XL vs Vanilla Transformer (2019 Benchmarks)

| Model | Effective Context | WikiText-103 PPL | enwik8 (bpc) | Memory Usage |
|---|---|---|---|---|
| Vanilla Transformer | 512 tokens | 24.0 | 1.08 | 1x |
| Transformer-XL (seg=512, mem=512) | 1,024 tokens | 20.5 | 1.03 | 1.5x |
| Transformer-XL (seg=512, mem=2048) | 2,560 tokens | 18.3 | 0.99 | 2.5x |

Note: Perplexity (PPL) and bits-per-character (bpc) are lower-is-better metrics.
ℹ️ Historical Significance

Transformer-XL’s segment recurrence was influential but is not used in modern LLMs directly. Its legacy lives on through two ideas: (1) relative positional encoding, which evolved into RoPE, and (2) the concept of extending effective context beyond the training window, which is now done through RoPE scaling rather than recurrence.

Limitations of Segment Recurrence

Despite its elegance, segment recurrence had practical drawbacks. The effective context grew linearly with the number of layers (each layer could “see” one more cached segment back), meaning deep networks were needed for long-range dependencies. The information from distant segments was also progressively degraded as it passed through multiple layers of processing. Modern approaches achieve much longer contexts more directly.

Sliding Window Attention

Sliding window attention, popularized by Mistral 7B (2023), takes a different approach to the quadratic problem. Instead of attending to all previous tokens, each layer only attends to a fixed window of the most recent W tokens.

How It Works

import math
import torch

def sliding_window_attention(Q, K, V, window_size=4096):
    """
    Each token attends only to itself and the previous window_size - 1 tokens.
    Q, K, V: [batch, seq_len, d_k]
    """
    seq_len, d_k = Q.shape[1], Q.shape[-1]

    # Banded causal mask: token i attends to tokens max(0, i - window_size + 1)..i
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask = torch.triu(causal, diagonal=-(window_size - 1))

    scores = Q @ K.transpose(-1, -2) / math.sqrt(d_k)
    scores = scores.masked_fill(~mask, float('-inf'))

    return torch.softmax(scores, dim=-1) @ V

With a window size of W, each token computes attention over at most W key-value pairs instead of all n pairs. This reduces the per-token cost from O(n) to O(W), making the total cost O(n·W) — linear in sequence length when W is fixed.

Effective Context Through Layer Stacking

The key insight is that while each layer has a limited window, stacking L layers creates an effective receptive field of L × W tokens. In Mistral 7B with W = 4096 and L = 32 layers, the theoretical receptive field is 32 × 4096 = 131,072 tokens, even though each layer only looks at 4,096 tokens.

📊

Sliding Window vs Full Attention

| Method | Per-Layer Cost | KV Cache Size | Effective Context | Quality on Long Tasks |
|---|---|---|---|---|
| Full Attention | O(n²) | O(n) | n tokens | Best |
| Sliding Window (W=4096) | O(n·W) | O(W) | L·W tokens (theoretical) | Good for most tasks |
| Sliding Window (W=1024) | O(n·W) | O(W) | L·W tokens (theoretical) | Degraded on long-range |
⚠️ Receptive Field vs. Effective Context

The theoretical receptive field of L × W does not mean the model can effectively use information that far back. Information must propagate through multiple layers to reach distant tokens, and this propagation is lossy. In practice, sliding window models perform well on most tasks but can struggle with tasks requiring precise retrieval of information from early in a long sequence.

Memory Savings

The most immediate benefit of sliding window attention is memory savings during inference. Instead of storing KV cache entries for all previous tokens (which grows linearly with sequence length), you only need to store the most recent W entries. For Mistral 7B with a 4,096-token window, the FP16 KV cache is capped at roughly 0.5 GB no matter how long the sequence gets, compared to about 16 GB for full attention at 128K.
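The cap is easy to see in a sketch. The dimensions below are Mistral-7B-like (32 layers, 8 KV heads, head dimension 128, FP16); exact totals vary with batch size and dtype, but the point is that the windowed cache stops growing:

```python
def sliding_window_kv_bytes(seq_len, window=None, layers=32,
                            kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache stores K and V for min(seq_len, window) positions per layer."""
    cached = seq_len if window is None else min(seq_len, window)
    return 2 * layers * kv_heads * head_dim * cached * dtype_bytes

# The windowed cache is identical at 128K and at 1M tokens
assert sliding_window_kv_bytes(131_072, window=4_096) == \
       sliding_window_kv_bytes(1_048_576, window=4_096)
```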

RoPE and Context Length Extension

Rotary Position Embedding (RoPE), introduced by Su et al. (2021), has become the standard positional encoding in modern LLMs (Llama, Mistral, Qwen, and many others). RoPE encodes position by rotating the query and key vectors in 2D subspaces at frequencies that depend on the position.

How RoPE Works

For a token at position m, RoPE applies a rotation to each pair of dimensions (2i, 2i+1) in the query and key vectors:

x'_{2i}   = x_{2i} cos(m·θ_i) - x_{2i+1} sin(m·θ_i)
x'_{2i+1} = x_{2i} sin(m·θ_i) + x_{2i+1} cos(m·θ_i)

where θ_i = 10000^(-2i/d) defines the rotation frequency for dimension pair i. Low-frequency dimensions rotate slowly and encode coarse positional information; high-frequency dimensions rotate quickly and encode fine-grained position.

The inner product between two RoPE-encoded vectors depends only on the relative distance between them, making RoPE a form of relative positional encoding. This is what makes context extension possible.
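This shift-invariance can be checked numerically. The sketch below implements the rotation from the formula above in NumPy and verifies that moving both tokens by the same offset leaves the attention score unchanged:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive dimension pair of x by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same relative distance (7), different absolute positions: identical score
s1 = rope(q, 10) @ rope(k, 3)
s2 = rope(q, 110) @ rope(k, 103)
assert np.allclose(s1, s2)
```

Because each 2D rotation satisfies R(mθ)ᵀ R(nθ) = R((n - m)θ), only the gap n - m survives in the inner product.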

RoPE Scaling for Context Extension

A model trained with RoPE at context length L_train has only seen rotation angles up to L_train × θ_i. When you try to use it at length L > L_train, the rotation angles exceed what was seen during training, causing performance to degrade.

Several techniques extend context by modifying the RoPE frequencies:

Position Interpolation (PI): Scale all positions by L_train / L_target so that the maximum rotation angle stays within the training range. A model trained at 4K and extended to 32K would divide all positions by 8.

def position_interpolation(position, base_context=4096, target_context=32768):
    scale = base_context / target_context
    return position * scale  # Compress positions into original range

NTK-Aware Scaling: Instead of scaling all frequencies uniformly, modify the RoPE base θ to spread the rotation angles more evenly. This preserves high-frequency (local) positional information while extending low-frequency (global) range.

import torch

def ntk_aware_rope(position, dim, base=10000, scale=4):
    """NTK-aware RoPE scaling: returns the rotation angles for each dim pair."""
    # Enlarge the base, which slows all rotations; the exponent spreads the
    # change so high-frequency (local) dimensions are barely affected
    new_base = base * (scale ** (dim / (dim - 2)))
    theta = 1.0 / (new_base ** (torch.arange(0, dim, 2).float() / dim))
    return position * theta

YaRN (Yet another RoPE extensioN): Combines NTK-by-parts frequency scaling with an attention temperature adjustment. YaRN treats frequency bands differently — high-frequency dimensions are left unscaled (extrapolated) to preserve fine-grained local position, low-frequency dimensions are fully interpolated, and the band in between is blended.
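The band-splitting can be sketched in a few lines. The α = 1 and β = 32 rotation-count cutoffs below follow the YaRN paper's NTK-by-parts scheme; treat this as an illustrative reimplementation, not the reference code:

```python
import numpy as np

def yarn_scaled_theta(dim=128, base=10000.0, scale=8.0,
                      train_ctx=4096, alpha=1.0, beta=32.0):
    """Scale each RoPE frequency by how many full turns it makes in training."""
    theta = base ** (-np.arange(0, dim, 2) / dim)
    rotations = train_ctx * theta / (2 * np.pi)   # turns within training window
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 1 (fast-rotating, local detail): theta kept as-is (extrapolation)
    # ramp = 0 (slow-rotating, global position): theta / scale (interpolation)
    return theta * (ramp + (1.0 - ramp) / scale)
```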

📊

RoPE Scaling Methods Comparison

| Method | Extension Factor | Fine-tuning Needed | Quality at Extended Length | Quality at Original Length |
|---|---|---|---|---|
| No scaling (extrapolation) | 2-4x | No | Poor (rapid degradation) | Unchanged |
| Position Interpolation | 4-8x | Yes (~1K steps) | Good | Slightly degraded |
| NTK-Aware | 4-8x | Optional | Good | Preserved |
| YaRN | 8-32x | Yes (~400 steps) | Very good | Preserved |
| Llama 3.1 (custom RoPE) | 16x (8K to 128K) | Yes (full training) | Excellent | Preserved |

Note: Extension factor is relative to original training context length.
The Cost of Context Extension

RoPE scaling is computationally free at inference time — it just changes the rotation angles. The cost is in the fine-tuning required to adapt the model to the new positional distribution. Position Interpolation requires about 1,000 gradient steps on long-context data. YaRN needs about 400 steps. This is orders of magnitude cheaper than training from scratch.

Ring Attention for Distributed Long Sequences

When context lengths reach hundreds of thousands or millions of tokens, a single GPU cannot hold the KV cache in memory. Ring attention (Liu et al., 2023) distributes the sequence across multiple devices and computes attention in a communication-efficient ring topology.

The Core Idea

Split the sequence into chunks and distribute them across P devices arranged in a ring. Each device holds one chunk of queries and one chunk of key-value pairs. Attention is computed in P rounds: in each round, every device computes the partial attention between its local queries and the current key-value chunk, then passes the key-value chunk to the next device in the ring.

def ring_attention(Q_local, K_local, V_local, num_devices, device_id):
    """
    Each device holds Q_local and starts with its own K_local, V_local.
    After P rounds, each device has computed full attention for its queries.
    """
    seq_chunk = Q_local.shape[1]  # tokens per device
    d_k = Q_local.shape[-1]

    # Initialize accumulators for online softmax
    running_max = torch.full((Q_local.shape[0], seq_chunk, 1), float('-inf'))
    running_sum = torch.zeros_like(running_max)
    running_output = torch.zeros_like(Q_local)

    K_current, V_current = K_local, V_local

    for round_idx in range(num_devices):
        # Compute scores against the K,V chunk this device currently holds
        scores = Q_local @ K_current.transpose(-1, -2) / math.sqrt(d_k)

        # Apply causal mask (only for appropriate chunks)
        if needs_causal_mask(device_id, round_idx, num_devices):
            scores = apply_causal_mask(scores)

        # Online softmax update (numerically stable accumulation)
        local_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, local_max)

        # Rescale previous accumulator
        correction = torch.exp(running_max - new_max)
        running_output = running_output * correction
        running_sum = running_sum * correction

        # Add current block's contribution
        local_exp = torch.exp(scores - new_max)
        running_output += local_exp @ V_current
        running_sum += local_exp.sum(dim=-1, keepdim=True)
        running_max = new_max

        # Ring communication: send K,V to next device, receive from previous
        K_current = ring_send_recv(K_current)
        V_current = ring_send_recv(V_current)

    return running_output / running_sum

Why Ring Topology Is Efficient

The key property of ring attention is that computation and communication overlap. While a device is computing attention with the current K,V block, it is simultaneously sending/receiving the next K,V block. If the computation time exceeds the communication time (which is typically true for large enough chunks), the communication cost is completely hidden.
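Whether the transfer actually hides is a back-of-envelope check: per-round compute grows quadratically with the chunk size while the K,V transfer grows only linearly. The hardware numbers below (400 TFLOPS sustained compute, 100 GB/s per link) are placeholders, not measurements:

```python
def ring_round_times(chunk, n_heads=32, head_dim=128,
                     tflops=400.0, link_gbps=100.0, dtype_bytes=2):
    """Per-round compute vs. communication time for one ring-attention step."""
    # QK^T and the value aggregation each cost 2*chunk*chunk*head_dim per head
    flops = 4 * chunk * chunk * head_dim * n_heads
    # Each round ships one K chunk and one V chunk to the ring neighbor
    comm_bytes = 2 * chunk * n_heads * head_dim * dtype_bytes
    return flops / (tflops * 1e12), comm_bytes / (link_gbps * 1e9)

compute_t, comm_t = ring_round_times(chunk=131_072)
print(f"compute {compute_t*1e3:.0f} ms vs comm {comm_t*1e3:.0f} ms per round")
```

With 128K-token chunks the compute dwarfs the transfer, so communication is fully hidden; shrink the chunk far enough and the inequality flips, which is the overhead trend shown in the scaling numbers below.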

📊

Ring Attention Scaling

| Devices | Tokens per Device | Total Context | Communication Overhead | Throughput vs Single GPU |
|---|---|---|---|---|
| 1 | 128K | 128K | 0% | 1.0x |
| 4 | 128K | 512K | ~5% | 3.8x |
| 8 | 128K | 1M | ~8% | 7.4x |
| 16 | 64K | 1M | ~12% | 14.1x |
| 32 | 32K | 1M | ~18% | 26.2x |

Note: Communication overhead increases as chunk size decreases because the compute-to-communication ratio shrinks.
ℹ️ Ring Attention and Flash Attention

Ring attention composes naturally with Flash Attention. Each local attention computation (Q_local against the current K,V chunk) uses Flash Attention’s tiled, memory-efficient algorithm. The ring structure handles the distribution across devices while Flash Attention handles the efficiency within each device. This combination is what makes million-token contexts practical.

Chunked Prefill for Long Prompts

When a user submits a long prompt (say, a 100K-token document), the model must process the entire prompt before generating the first output token. This prefill phase is compute-bound and can take many seconds. Chunked prefill splits the prompt into smaller chunks and processes them sequentially, which provides several benefits.

Why Chunk the Prefill?

Memory management: Processing 128K tokens at once requires enormous temporary memory for the attention computation. Chunking reduces peak memory usage.

Scheduling flexibility: In a serving system handling multiple requests, a 128K-token prefill would monopolize the GPU for seconds. Chunking allows the scheduler to interleave prefill chunks from one request with decode steps from other requests, improving overall throughput and latency.

Pipeline efficiency: Chunks can be distributed across pipeline stages more evenly than a single monolithic prefill.

def chunked_prefill(prompt_tokens, model, chunk_size=8192):
    """
    Process a long prompt in chunks, building up the KV cache incrementally.
    """
    kv_cache = None
    num_chunks = (len(prompt_tokens) + chunk_size - 1) // chunk_size

    for i in range(num_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, len(prompt_tokens))
        chunk = prompt_tokens[start:end]

        # Process chunk, attending to all previous KV cache entries + current chunk
        hidden, new_kv = model.forward(
            chunk,
            kv_cache=kv_cache,
            is_prefill=True
        )

        # Append new KV entries to cache
        if kv_cache is None:
            kv_cache = new_kv
        else:
            kv_cache = concatenate_kv(kv_cache, new_kv)

    return kv_cache  # Ready for autoregressive decoding
📊

Chunked Prefill Impact on Serving

| Chunk Size | Prefill Latency (128K prompt) | Decode Latency Impact | GPU Utilization |
|---|---|---|---|
| No chunking (128K at once) | 4.2 s | Blocked for 4.2 s | 95% (prefill only) |
| 32K chunks | 4.5 s | Interleaved, ~50 ms bubbles | 88% |
| 8K chunks | 4.8 s | Interleaved, ~12 ms bubbles | 82% |
| 2K chunks | 5.5 s | Nearly transparent | 75% |

Note: Smaller chunks reduce decode latency impact but increase total prefill time due to overhead.

Memory Requirements at Scale

Understanding memory consumption is critical for capacity planning. The KV cache is the dominant memory consumer for long-context inference.

KV Cache Calculation

For a model with L layers, H key-value heads, head dimension d_h, and sequence length n, the KV cache size is:

KV Cache = 2 × L × H × d_h × n × bytes_per_element

The factor of 2 accounts for both keys and values.
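The formula translates directly into a helper; the configuration plugged in below (32 layers, 8 KV heads, head dimension 128, FP16) is Llama-3.1-8B-like:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    """2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

print(kv_cache_bytes(32, 8, 128, 131_072) / 2**30, "GiB")  # 16.0 GiB at 128K
```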

📊

KV Cache Memory for Popular Models

| Model | Layers | KV Heads | Head Dim | Context | KV Cache (FP16) |
|---|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 (GQA) | 128 | 8K | 1 GB |
| Llama 3.1 8B | 32 | 8 (GQA) | 128 | 128K | 16 GB |
| Llama 3.1 70B | 80 | 8 (GQA) | 128 | 128K | 40 GB |
| Llama 3.1 405B | 126 | 8 (GQA) | 128 | 128K | 63 GB |
| Gemini 1.5 Pro (est.) | ~96 | ~16 | 128 | 1M | ~786 GB |

Note: GQA (Grouped Query Attention) reduces KV heads relative to query heads, dramatically reducing cache size. Cache sizes follow the formula above.
GQA Is Essential for Long Context

Grouped Query Attention (GQA) is not optional for long-context models. Llama 3.1 70B uses only 8 KV heads (vs 64 query heads), reducing the KV cache by 8x compared to standard multi-head attention. Without GQA, the 128K-context KV cache would be roughly 320 GB instead of 40 GB — larger than the model weights themselves.

Total Memory Budget

For a complete picture, the total GPU memory needed for inference includes model weights, KV cache, and activation memory:

📊

Total Memory Budget for Long-Context Inference

| Component | 8B model @ 8K | 8B model @ 128K | 70B model @ 128K |
|---|---|---|---|
| Model weights (FP16) | 16 GB | 16 GB | 140 GB |
| KV cache (FP16) | 1 GB | 16 GB | 40 GB |
| Activations (peak) | 1 GB | 4 GB | 8 GB |
| Total | 18 GB | 36 GB | 188 GB |
| GPUs needed (80GB H100) | 1 | 1 | 3 (tensor parallel) |

How Gemini Reaches 1M Tokens

Google’s Gemini 1.5 Pro was the first production model to support 1 million tokens of context. While Google has not published full architectural details, the system likely combines several techniques:

Sparse/efficient attention: Some form of attention sparsity (possibly learned or structured) to reduce the quadratic cost. This might include a mix of local and global attention patterns.

Ring attention across TPU pods: Google’s TPU v5 pods provide massive interconnect bandwidth (up to 4.8 Tbps per chip via ICI). Ring attention across a TPU pod can distribute the 1M-token KV cache with minimal communication overhead.

Multi-query or grouped-query attention: Reducing KV heads is essential at 1M context. Even with 8 KV heads and 128-dim heads, a 100-layer model would need 2 × 100 × 8 × 128 × 10^6 × 2 bytes ≈ 409 GB for the KV cache alone.

Quantized KV cache: Quantizing the KV cache to INT8 or INT4 halves or quarters the memory requirement without significant quality loss.

Hierarchical attention: Some evidence suggests Gemini uses a form of hierarchical attention where early layers process local context and later layers have access to broader context, reducing the average attention span.

Estimated Cost per 1M Token Query (Gemini-class model)

| Stage | Estimated Time |
|---|---|
| Prefill (1M tokens, compute-bound) | ~30 s |
| First-token latency | ~32 s |
| Decode per token (memory-bound) | ~0.05 s |
| Generate 1K tokens | ~50 s |

How Llama 3.1 Reaches 128K Tokens

Meta’s Llama 3.1 paper provides more detail about their long-context approach:

Training strategy: Llama 3.1 was trained in stages. The initial pretraining used 8K context. The model then underwent continued pretraining with progressively longer sequences: 16K, 32K, 64K, and finally 128K. This staged approach is more stable than training at 128K from the start.

RoPE frequency adjustment: Llama 3.1 uses a modified RoPE with an increased base (θ base = 500,000 instead of the original 10,000). This larger base allows the rotation angles to stay within a well-behaved range at longer positions.

GQA with 8 KV heads: All Llama 3.1 models use 8 KV heads regardless of model size, keeping the KV cache manageable at 128K context.

Long-context data: A significant fraction of the continued-pretraining data consisted of long documents (books, code repositories, long articles) to teach the model to actually use long context effectively.

📊

Llama 3.1 Long-Context Performance

| Task | 4K Context | 32K Context | 128K Context |
|---|---|---|---|
| RULER (synthetic retrieval) | 95.2% | 93.8% | 88.6% |
| Needle in haystack (single) | 100% | 99.5% | 98.7% |
| Needle in haystack (multi) | 98.1% | 94.2% | 87.3% |
| Long document QA | 82.4% | 86.1% | 84.7% |
| Code repo understanding | 71.3% | 78.9% | 81.2% |

Note: Performance at 128K is strong but not as good as at shorter contexts for retrieval tasks. Understanding tasks benefit from more context.
💡 Staged Training Is Key

Llama 3.1’s staged approach — pretraining at 8K then extending to 128K — is now standard practice. Training at very long context from the start wastes compute on short-range patterns that could be learned more efficiently with shorter sequences. The extension phase is relatively cheap (about 5% of total pretraining compute).

When Long Context Hurts

Long context is not universally beneficial. There are several failure modes and costs that practitioners must understand.

Cost Scaling

The most obvious cost is computational. Processing 128K tokens costs roughly 128K / 4K = 32x more than processing 4K tokens for the prefill phase. For API-based services, this translates directly to cost:

📊

API Cost Comparison at Different Context Lengths

| Provider/Model | Input Cost (per 1M tokens) | 128K Prompt Cost | 4K Prompt Cost | Ratio |
|---|---|---|---|---|
| GPT-4o | $2.50 | $0.32 | $0.01 | 32x |
| Claude 3.5 Sonnet | $3.00 | $0.38 | $0.012 | 32x |
| Gemini 1.5 Pro | $1.25 | $0.16 | $0.005 | 32x |
| Llama 3.1 70B (self-hosted) | ~$0.80 | ~$0.10 | ~$0.003 | 32x |

Note: Costs are approximate and subject to change. The 32x ratio reflects linear input-token pricing.

Attention Dilution

As context length increases, the attention distribution becomes more spread out. Each token has more keys to attend to, which can dilute the attention weights on the most relevant tokens. This manifests as:

  • Decreased retrieval accuracy: The “needle in a haystack” task (finding a specific piece of information in a long context) becomes harder as the haystack grows.
  • Lost in the middle: Several studies have shown that LLMs are better at using information at the beginning and end of the context than in the middle. This “U-shaped” attention pattern means that simply having long context does not guarantee the model will use all of it effectively.
  • Decreased generation quality: For tasks like summarization, very long input can lead to less focused, more generic outputs as the model tries to attend to too much information simultaneously.
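
Attention dilution can be illustrated with a toy model: one "needle" key gets a fixed score advantage over n i.i.d. noise keys, and its post-softmax weight shrinks as the haystack grows. This is a deliberately simplified sketch, not a measurement of any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def needle_weight(n_keys: int, needle_margin: float = 4.0) -> float:
    """Softmax weight on a single relevant key among n_keys distractors.

    Toy model: distractor scores are N(0, 1); the needle scores a fixed
    `needle_margin`. Not a real attention head, just an illustration of
    how a constant score advantage gets diluted by more keys.
    """
    scores = rng.standard_normal(n_keys)
    scores[0] = needle_margin
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[0]

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} keys -> needle weight {needle_weight(n):.5f}")
```

With a fixed margin, the needle's weight falls roughly in proportion to the number of distractors, which is one intuition for why retrieval accuracy degrades at longer contexts.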

Retrieval Accuracy vs Context Length (Typical LLM)

| Context Length | Retrieval Accuracy | Qualitative |
|---|---|---|
| 4K | 99% | Near-perfect |
| 16K | 97% | |
| 32K | 94% | |
| 64K | 90% | |
| 128K | 85% | Noticeable degradation |
| 512K | 75% | |
| 1M | 65% | Significant degradation |

Latency Impact

Long context dramatically increases time-to-first-token (TTFT), which is often the most important latency metric for interactive applications:

📊

Time-to-First-Token vs Context Length

| Context Length | TTFT (8B model, 1x H100) | TTFT (70B model, 8x H100) | User Experience |
|---|---|---|---|
| 2K tokens | 50 ms | 120 ms | Instant |
| 8K tokens | 180 ms | 450 ms | Acceptable |
| 32K tokens | 700 ms | 1.8 s | Noticeable delay |
| 128K tokens | 2.8 s | 7.2 s | Significant wait |
| 512K tokens | 11 s | ~30 s | Very long wait |

Note: Estimates for optimized serving with Flash Attention. Actual numbers vary with implementation.
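
A back-of-envelope model for numbers like these: prefill is roughly matmul-bound, so FLOPs ≈ 2 × parameters × prompt tokens, divided by achieved throughput. The sketch below assumes an H100-class peak of 989 TFLOPS (BF16, dense) and 50% utilization, both illustrative assumptions; it ignores the quadratic attention term and multi-GPU overheads, so it reproduces the trend in the table rather than its exact values:

```python
def estimate_ttft(params_b: float, context_tokens: int,
                  peak_tflops: float = 989.0, mfu: float = 0.5) -> float:
    """Very rough prefill-time estimate in seconds.

    Assumes TTFT is dominated by prefill compute:
        FLOPs ~= 2 * params * tokens   (matmul-dominated; attention ignored)
    peak_tflops defaults to H100 BF16 dense; mfu is an assumed utilization.
    """
    flops = 2 * params_b * 1e9 * context_tokens
    return flops / (peak_tflops * 1e12 * mfu)

for n in (2_048, 8_192, 32_768, 131_072):
    print(f"{n:>7} tokens: ~{estimate_ttft(8, n) * 1000:.0f} ms")
```

The key takeaway is the linearity: under this model, a 64x longer prompt means a 64x longer TTFT.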

When to Avoid Long Context

⚠️ Think Before You Fill the Context

Long context should not be the default. Consider these guidelines:

  • If your task can be solved with a 4K-token prompt and RAG, do that. It is 32x cheaper and faster.
  • Use long context when the task genuinely requires global understanding (summarization, cross-reference analysis, code understanding).
  • Be aware that model quality degrades gradually with context length. Test your specific use case at the target length.
  • Monitor TTFT in production. Users will not wait 10 seconds for a response in an interactive application.

Flash Attention: The Enabler

No discussion of long context is complete without Flash Attention (Dao et al., 2022). While Flash Attention does not change the asymptotic complexity of attention (it remains O(n²)), it reduces the constant factor dramatically by eliminating the need to materialize the full n × n attention matrix in GPU HBM.

How Flash Attention Helps Long Context

Standard attention computes the full n × n score matrix, stores it in HBM, applies softmax, then multiplies by the values. For n = 128K, this matrix alone requires 128K × 128K × 2 bytes = 32 GB in FP16 — more than the memory of most GPUs.

Flash Attention tiles the computation so that only small blocks of the attention matrix exist at any time, stored in fast SRAM (shared memory) rather than slow HBM. This reduces memory usage from O(n²) to O(n) and also improves speed by 2-4x due to reduced HBM traffic.
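
The tiling idea can be sketched in a few lines of numpy: stream over KV blocks while keeping only a running row-max and softmax denominator (the "online softmax" trick), so the full n × n score matrix never exists. This is a single-head toy sketch of the algorithm, not the real kernel, which also orchestrates SRAM tiles and backward-pass recomputation:

```python
import numpy as np

def tiled_attention(Q, K, V, block: int = 128) -> np.ndarray:
    """Single-head attention that never materializes the full n x n matrix.

    Streams over KV blocks with an online softmax: a running row-wise max m
    and running denominator l, rescaling previously accumulated results
    whenever the max changes. This is the core trick behind Flash Attention.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)              # running row-wise max of scores
    l = np.zeros(n)                      # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale           # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)        # rescale factor for previous tiles
        p = np.exp(S - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Sanity check against naive softmax attention:
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

Each tile's scores occupy only n × block entries, which is why peak memory scales with n instead of n².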

📊

Standard vs Flash Attention at Long Context

| Sequence Length | Standard Attention Memory | Flash Attention Memory | Flash Speedup |
|---|---|---|---|
| 4K | 32 MB | 4 MB | 1.5x |
| 16K | 512 MB | 16 MB | 2.1x |
| 64K | 8 GB | 64 MB | 2.8x |
| 128K | 32 GB (OOM on most GPUs) | 128 MB | 3.2x |

Note: Memory shown is for the attention computation workspace only, not KV cache.

Without Flash Attention, 128K context would be impractical on current hardware. With it, the bottleneck shifts from attention computation memory to KV cache storage.

Putting It All Together: A Modern Long-Context Stack

A production 128K-token serving system in 2025 typically combines:

  1. RoPE with extended base frequency for positional encoding that generalizes beyond training length
  2. GQA with 8 KV heads to keep the KV cache manageable
  3. Flash Attention v2 for memory-efficient, fast attention computation
  4. Paged KV cache (vLLM-style) for flexible memory management across requests
  5. Chunked prefill for scheduling efficiency in multi-request serving
  6. KV cache quantization (INT8 or FP8) to further reduce memory footprint
  7. Tensor parallelism across multiple GPUs for larger models
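
To see why items 2 and 6 matter, it helps to put numbers on the KV cache. A sketch assuming Llama-3.1-70B-like shapes (80 layers, 8 KV heads after GQA, head dimension 128); the formula is general, the defaults are the assumption:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size:
    2 (K and V) x layers x kv_heads x head_dim x tokens x element size.
    Default shapes assume a Llama-3.1-70B-like model with GQA."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

GB = 1024 ** 3
print(f"128K, FP16: {kv_cache_bytes(131_072) / GB:.1f} GB")
print(f"128K, FP8 : {kv_cache_bytes(131_072, bytes_per_elem=1) / GB:.1f} GB")
```

Under these assumptions a single 128K-token request costs about 40 GB of KV cache at FP16; FP8 halves that, and without GQA (64 KV heads instead of 8) it would be 8x larger, which is why the full stack above is needed rather than any one technique.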

For 1M-token systems, add:

  1. Ring attention or sequence parallelism for distributing the sequence across devices
  2. Sparse or hierarchical attention to reduce the effective quadratic cost
  3. Specialized hardware (TPU v5, Grace Hopper) with high-bandwidth interconnects
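
Item 1 can be sketched with plain numpy: shard the sequence across simulated devices, keep each device's queries local, and rotate the KV shards around the ring while folding each incoming block into a running online softmax. Real implementations overlap this communication with compute on accelerators; this sketch shows only the data flow, for non-causal attention:

```python
import numpy as np

def local_attend(Q, K, V, m, l, out):
    """Fold one incoming KV shard into the running (max, denom, numerator)."""
    S = (Q @ K.T) / np.sqrt(Q.shape[1])
    m_new = np.maximum(m, S.max(axis=1))
    alpha = np.exp(m - m_new)
    p = np.exp(S - m_new[:, None])
    return m_new, l * alpha + p.sum(axis=1), out * alpha[:, None] + p @ V

def ring_attention(Q, K, V, n_devices: int = 4) -> np.ndarray:
    """Simulated ring attention: the sequence is sharded across n_devices;
    each "device" keeps its Q shard and passes KV shards around the ring,
    so no device ever holds more than 1/n_devices of the keys and values."""
    Qs = np.array_split(Q, n_devices)
    kv = list(zip(np.array_split(K, n_devices), np.array_split(V, n_devices)))
    outs = []
    for i, Qi in enumerate(Qs):          # each device, in parallel on real HW
        m = np.full(Qi.shape[0], -np.inf)
        l = np.zeros(Qi.shape[0])
        out = np.zeros_like(Qi)
        for step in range(n_devices):    # KV shards rotate around the ring
            Kb, Vb = kv[(i + step) % n_devices]
            m, l, out = local_attend(Qi, Kb, Vb, m, l, out)
        outs.append(out / l[:, None])
    return np.concatenate(outs)
```

Because each device only ever sees one KV shard at a time, per-device memory scales with sequence length divided by the ring size, which is what makes 1M-token contexts fit at all.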

Long Context Technique Effectiveness

| Technique stack | Typical context | Multiplier vs 512 tokens |
|---|---|---|
| Vanilla Transformer (2017) | 512 tokens | 1x |
| + Segment recurrence | ~4K | 8x |
| + RoPE scaling | ~32K | 64x |
| + Flash Attention + GQA | ~128K | 256x |
| + Ring Attention | ~1M | ~2,000x |

Conclusion

The journey from 2K to 1M tokens has been driven by a stack of innovations, each addressing a different bottleneck:

  • Positional encoding (Transformer-XL relative PE to RoPE to RoPE scaling) solved the problem of representing position at arbitrary lengths.
  • Attention efficiency (Flash Attention, sliding window) solved the memory and compute cost of the attention operation itself.
  • KV cache reduction (GQA, quantization, paged allocation) solved the memory cost of storing past context.
  • Distribution (ring attention, sequence parallelism) solved the problem of fitting long sequences across multiple devices.
  • Serving (chunked prefill, continuous batching) solved the problem of efficiently serving long-context requests alongside short ones.

The frontier continues to push forward. Context lengths of 10M tokens are being explored in research, and new architectures (state-space models, linear attention variants) promise to break the quadratic barrier entirely. But for now, the Transformer with these accumulated optimizations remains the dominant architecture, capable of processing context lengths that would have seemed impossible just a few years ago.

The practical lesson is clear: long context is a powerful capability, but it is not free. Every doubling of context length doubles the cost and increases latency. The best systems use long context surgically — for tasks that genuinely benefit from it — while falling back to cheaper approaches (RAG, summarization, truncation) when the task does not require global understanding over the full input.