Standard scaled dot-product attention scales as O(n²) in sequence length. At 1 million tokens, that is roughly 10¹² operations per layer — even with FlashAttention’s IO optimization, this requires enormous compute. MiniMax-01 takes a fundamentally different approach: Lightning Attention, a linear attention mechanism that scales as O(n), combined with a 456B-parameter MoE architecture. The result: training on 1 million tokens and inference extrapolation to 4 million — a 20-32x longer context window than comparable models.

The Long-Context Problem

Attention’s O(n²) cost creates an escalating wall:

📊

Attention Cost at Different Context Lengths (per layer, per head, d=128)

| Context Length | Attention FLOPs | KV Cache (FP16) | Wall Time Estimate |
|---|---|---|---|
| 4K | 4.2M | 1 MB | 0.1 ms |
| 32K | 268M | 8 MB | 2 ms |
| 128K | 4.3B | 32 MB | 25 ms |
| 1M | 262B | 256 MB | 1.5 sec |
| 4M | 4.2T | 1 GB | 24 sec |

Note: Standard attention. Even FlashAttention only reduces memory traffic, not compute FLOPs.

FlashAttention solves the memory problem (no O(n²) HBM traffic) but not the compute problem. At 1M tokens, the quadratic FLOPs are simply too expensive for production serving. You need a subquadratic attention mechanism.

Lightning Attention: Linear Scaling

Lightning Attention is MiniMax’s solution — a linear attention variant that processes sequences in O(n) time and memory.

The Linear Attention Idea

Standard attention: Attn(Q, K, V) = softmax(QKᵀ/√d) · V

The softmax creates the O(n²) bottleneck — it requires materializing the full n×n score matrix. Linear attention removes the softmax and uses a kernel trick:

LinearAttn(Q, K, V) = φ(Q) · (φ(K)ᵀV)

where φ is a feature map. The key: by computing φ(K)ᵀV first (an O(nd²) operation yielding a d×d matrix), then multiplying by φ(Q) (also O(nd²)), we avoid the n×n matrix entirely. Total cost: O(nd²) — linear in n.
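A minimal NumPy sketch of the kernel trick, using the ELU+1 feature map from the linear-Transformer literature as a stand-in for Lightning Attention’s actual φ (which is not public). By associativity of matrix multiplication, the O(nd²) path gives the same output as the explicit n×n path:

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a simple positive feature map (a stand-in, not MiniMax's kernel)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linear attention in O(n d^2), never forming the n x n matrix."""
    KV = phi(K).T @ V                       # (d, d) summary: O(n d^2)
    z = phi(K).sum(axis=0)                  # (d,) normalizer accumulator
    return (phi(Q) @ KV) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = rng.standard_normal((3, n, d)) * 0.1

fast = linear_attention(Q, K, V)

# Same result via the explicit O(n^2) score matrix — what we want to avoid
W = phi(Q) @ phi(K).T                       # (n, n)
slow = (W @ V) / W.sum(axis=1, keepdims=True)
assert np.allclose(fast, slow)
```

The d×d summary is the whole trick: it is independent of sequence length, so cost grows linearly with n.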

Why Previous Linear Attention Failed

Performer (2020), the linear Transformer of Katharopoulos et al. (2020), and other linear attention variants achieved the O(n) scaling but with significant quality degradation. The softmax in standard attention performs two critical functions:

  1. Normalization: Attention weights sum to 1, creating a proper weighted average
  2. Sharpening: The exponential amplifies score differences, allowing focused attention on relevant tokens

Without softmax, attention distributions become too uniform — the model can’t focus. Quality drops by 2-5 perplexity points, making linear attention impractical for frontier models.
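The sharpening point is easy to see numerically. For the same raw scores (hypothetical values), the exponential in softmax concentrates far more mass on the top token than plain linear normalization does:

```python
import numpy as np

scores = np.array([4.0, 2.0, 1.0, 0.5])        # hypothetical attention logits

soft = np.exp(scores) / np.exp(scores).sum()   # exponential sharpening
flat = scores / scores.sum()                   # no exponential: near-uniform

print(soft.round(3))   # top token dominates the distribution
print(flat.round(3))   # mass spread broadly across tokens
```

Softmax gives the top token over 80% of the attention mass here, while linear normalization leaves it near 50% — the “too uniform” failure mode described above.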

Lightning Attention’s Innovation

Lightning Attention addresses both problems through a hybrid approach:

  1. Improved kernel function: Instead of naive ReLU or ELU feature maps, Lightning Attention uses a carefully designed φ that preserves the sharpening property of softmax while remaining computationally efficient.

  2. Chunk-wise computation: Sequences are divided into chunks. Within each chunk, the attention can use a more precise local computation. Across chunks, the linear formulation carries information forward through a compressed state.

  3. Integration with the MoE FFN: The linear attention mechanism is co-designed with the MoE layers. Experts can specialize for different regions of long contexts — some experts handle local patterns (recent tokens), others handle global patterns (distant context).
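The chunk-wise idea in point 2 can be sketched for causal linear attention: exact score computation within each chunk, plus a compressed (d, d) state summarizing everything before the chunk. This is an illustrative reconstruction with the ELU+1 feature map, not MiniMax’s actual kernel:

```python
import numpy as np

def phi(x):
    # ELU(x) + 1 feature map (illustrative stand-in)
    return np.where(x > 0, x + 1.0, np.exp(x))

def chunked_causal_linear_attention(Q, K, V, chunk=64):
    n, d = Q.shape
    S = np.zeros((d, d))      # sum of phi(k_j) v_j^T over all previous chunks
    z = np.zeros(d)           # sum of phi(k_j), for the normalizer
    out = np.empty_like(V)
    for s in range(0, n, chunk):
        q, k, v = phi(Q[s:s+chunk]), phi(K[s:s+chunk]), V[s:s+chunk]
        local = np.tril(q @ k.T)            # exact causal scores inside the chunk
        num = local @ v + q @ S             # inter-chunk info via compressed state
        den = local.sum(axis=1) + q @ z
        out[s:s+chunk] = num / den[:, None]
        S += k.T @ v                        # fold this chunk into the state
        z += k.sum(axis=0)
    return out

rng = np.random.default_rng(1)
n, d = 256, 32
Q, K, V = rng.standard_normal((3, n, d)) * 0.1
out = chunked_causal_linear_attention(Q, K, V)
```

Only the (d, d) state crosses chunk boundaries — which is also what makes the distributed training story below work.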

ℹ️ Connection to Mamba

Lightning Attention shares a philosophical similarity with Mamba (covered in Inference Optimization Timeline Part 12): both replace quadratic attention with a linear-time mechanism that carries state forward. The key difference: Mamba uses a state-space model formulation, while Lightning Attention stays within the attention framework with a modified kernel. This makes Lightning Attention easier to integrate into existing transformer architectures.

Architecture: 456B MoE with Lightning Attention

📊

MiniMax-01 Architecture

| Spec | MiniMax-01 | DeepSeek V3 | Kimi K2 |
|---|---|---|---|
| Total params | 456B | 671B | 1T |
| Activated params | 45.9B | 37B | 32B |
| Experts | 32 | 256 + 1 shared | 384 |
| Attention | Lightning (linear) | Standard + MLA | Standard + MLA |
| Max context (train) | 1M tokens | 128K tokens | 128K tokens |
| Max context (inference) | 4M tokens | 128K tokens | 128K tokens |
| KV cache scaling | O(n) per layer | O(n) per layer (MLA compressed) | O(n) per layer (MLA compressed) |

The critical differentiator: Lightning Attention’s linear compute scaling. While DeepSeek V3 and Kimi K2 reduce KV cache memory through MLA compression, they still pay O(n²) compute for attention. MiniMax-01 pays O(n) for both compute and memory.

Training for 1M Context

Training on 1M-token sequences with 456B parameters requires solving several problems:

Memory Management

A 1M-token sequence at d_model=8192 requires massive activation memory. Solutions:

  • Activation checkpointing: Recompute activations during backward pass instead of storing them
  • Sequence parallelism: Distribute the sequence across multiple GPUs, each holding a segment
  • Progressive context extension: Train initially on shorter sequences (32K), gradually extend to 128K, 512K, then 1M

Computation-Communication Overlap

With sequence distributed across GPUs, Lightning Attention’s linear formulation enables efficient distributed computation. Each GPU processes its chunk and passes a compressed state to the next — no all-to-all communication needed for attention (unlike Ring Attention, which must pass KV blocks in a ring).

Communication Volume: Ring Attention vs Lightning Attention (1M tokens, 8 GPUs)

| Method | What moves between GPUs | Relative communication |
|---|---|---|
| Ring Attention | O(n·d) KV blocks passed in ring | 100 |
| Lightning Attention | O(d²) state passed forward | 8 |

Lightning Attention’s compressed state is only O(d²) per layer — vastly smaller than Ring Attention’s O(n·d) KV blocks. This is why MiniMax can train at 1M tokens efficiently.
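A back-of-envelope check of these magnitudes, assuming d_model = 8192 and FP16 (illustrative values, not figures from the MiniMax paper):

```python
n = 1_000_000          # sequence length
d = 8192               # model width (assumed)
fp16 = 2               # bytes per element

ring_kv = n * d * 2 * fp16        # K and V blocks circulated around the ring
state   = d * d * fp16            # one (d, d) Lightning state passed forward

print(f"Ring Attention KV traffic: {ring_kv / 1e9:.1f} GB per layer")
print(f"Lightning state:           {state / 1e6:.1f} MB per layer")
```

Tens of gigabytes versus on the order of a hundred megabytes per layer — the state being independent of n is what makes the gap grow with context length.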

Extrapolation to 4M Tokens

MiniMax-01 extrapolates from 1M training context to 4M inference context. This works because:

  1. Linear attention has no position-dependent components that break: Unlike RoPE, where unseen rotation angles cause attention score degradation, Lightning Attention’s state-based formulation naturally handles longer sequences.

  2. The compressed state carries sufficient information: The O(d²) state matrix accumulates context information across the entire sequence. As long as the state has capacity to represent the relevant information, additional tokens can be processed.

  3. MoE routing adapts: Different experts activate for different parts of the context, effectively increasing the model’s capacity for longer inputs without proportional compute increase.

⚠️ Extrapolation Is Not Free

While MiniMax-01 can process 4M tokens, quality degrades gradually beyond the 1M training length. Evaluation on long-context benchmarks (Needle-in-a-Haystack) shows near-perfect retrieval up to 1M tokens, with accuracy dropping to 85-90% at 4M tokens. The extrapolation is useful but not lossless.

Performance Analysis

📊

MiniMax-01 Long-Context Performance

| Benchmark | MiniMax-01 | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| NIAH (128K) | 99.8% | 99.5% | 99.7% |
| NIAH (1M) | 98.2% | N/A (128K limit) | N/A (200K limit) |
| RULER (128K) | 91.4% | 89.2% | 90.1% |
| MMLU | 88.5% | 88.7% | 88.3% |
| HumanEval | 83.2% | 90.2% | 92.0% |

Note: MiniMax-01 matches frontier models on standard benchmarks while offering dramatically longer context.

MiniMax-01’s strength is clear: it matches GPT-4o and Claude on standard benchmarks while offering 8-20x longer context windows. The tradeoff: slightly lower scores on code generation (HumanEval), likely because the linear attention mechanism loses some of the precise token-level focus that softmax attention provides for code.

Implications for Serving

4M-token context creates new serving challenges:

  • Memory: KV cache for 4M tokens at d=8192 with 32 heads: 4M × 32 heads × 128 head-dim × 2 bytes (FP16) × 2 (K and V) = 64 GB per layer. Across 64 layers: 4 TB. This doesn’t fit on any single GPU.
  • Throughput: Processing a 4M-token prompt takes minutes even with linear attention. Batch size is effectively 1 for very long contexts.
  • Serving pattern: Long-context requests are rare but expensive. A disaggregated architecture (covered in Inference Timeline Part 10) helps: dedicate specific nodes to long-context prefill while others handle short-context decode.
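The memory bullet’s arithmetic is easy to reproduce (using binary units, i.e. 4M = 4·2²⁰ tokens):

```python
tokens   = 4 * 2**20   # "4M" tokens
heads    = 32
head_dim = 128
fp16     = 2           # bytes per element
kv_pair  = 2           # K and V

per_layer = tokens * heads * head_dim * fp16 * kv_pair
total     = per_layer * 64                       # 64 layers

print(per_layer // 2**30, "GiB per layer")       # 64
print(total // 2**40, "TiB across 64 layers")    # 4
```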
💡 When Linear Attention Actually Helps in Serving

Lightning Attention’s O(n) scaling doesn’t help for typical serving workloads (2K-32K tokens) — FlashAttention with standard attention is fast enough and higher quality. The benefit is exclusively for very long sequences (128K+) where quadratic attention becomes impractical. If your workload is mostly short contexts, standard attention + FlashAttention remains the better choice.

What MiniMax-01 Means for the Field

MiniMax-01 demonstrates that linear attention is viable for frontier-quality models when:

  1. The kernel function is carefully designed (not naive ReLU/ELU)
  2. The architecture is co-optimized (MoE + Lightning Attention)
  3. Training is progressive (short to long context)
  4. The use case genuinely requires very long context (1M+ tokens)

For most applications, standard attention + FlashAttention + RoPE scaling remains the pragmatic choice. But for document-scale processing, code repository understanding, and multi-document reasoning, Lightning Attention opens possibilities that quadratic attention simply cannot reach.

The frontier model landscape in 2025 now has two viable attention paradigms: softmax-based (DeepSeek V3, Kimi K2, Llama 4) and linear (MiniMax-01, with Mamba hybrids as a third path). The next post in this series surveys where all frontier models are converging and where they diverge.