Standard scaled dot-product attention scales as O(n²) in sequence length. At 1 million tokens, that is on the order of 10^12 score-matrix entries per layer; even with FlashAttention’s IO optimization, this requires enormous compute. MiniMax-01 takes a fundamentally different approach: Lightning Attention, a linear attention mechanism that scales as O(n), combined with a 456B-parameter MoE architecture. The result: training on 1 million tokens and inference extrapolation to 4 million, a 20-32x longer context window than comparable models.
The Long-Context Problem
Attention’s O(n²) cost creates an escalating wall:
Attention Cost at Different Context Lengths (per layer, per head, d=128)
| Context Length | Attention FLOPs | KV Cache (FP16) | Wall Time Estimate |
|---|---|---|---|
| 4K | 4.2M | 1 MB | 0.1 ms |
| 32K | 268M | 8 MB | 2 ms |
| 128K | 4.3B | 32 MB | 25 ms |
| 1M | 262B | 256 MB | 1.5 sec |
| 4M | 4.2T | 1 GB | 24 sec |
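The quadratic trend in the table can be reproduced with a back-of-envelope calculator. This sketch counts raw n×n score-matrix entries and FP16 K/V bytes; the exact constants depend on counting conventions (causal masking, K-only vs. K+V), so treat the outputs as order-of-magnitude figures rather than the table's exact values:

```python
# Back-of-envelope cost model for quadratic attention (per layer, per
# head, d=128). Counts raw n x n score entries and FP16 K+V bytes;
# constants are illustrative and may differ from published tables.
def attention_cost(n, d=128):
    score_entries = n * n              # the O(n^2) term
    kv_cache_bytes = 2 * n * d * 2     # K and V, 2 bytes each (FP16)
    return score_entries, kv_cache_bytes

for n in (4_096, 32_768, 131_072, 1_048_576, 4_194_304):
    scores, kv = attention_cost(n)
    print(f"{n:>9} tokens: {scores:.1e} score entries, {kv / 2**20:>6.0f} MiB KV")
```

Doubling the context quadruples the score-entry count while the KV cache only doubles, which is why compute, not memory, becomes the binding constraint at long contexts.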
FlashAttention solves the memory problem (it avoids materializing the full score matrix in HBM) but not the compute problem. At 1M tokens, the quadratic FLOPs are simply too expensive for production serving. You need a subquadratic attention mechanism.
Lightning Attention: Linear Scaling
Lightning Attention is MiniMax’s solution: a linear attention variant that processes sequences in O(n) time, carrying a fixed-size state instead of a growing KV cache.
The Linear Attention Idea
Standard attention:

    Attn(Q, K, V) = softmax(QKᵀ / √d) V

The softmax creates the bottleneck: it requires materializing the full n×n score matrix. Linear attention removes the softmax and uses a kernel trick:

    Attn(Q, K, V) ≈ φ(Q) (φ(K)ᵀ V)

where φ is a feature map applied row-wise to queries and keys. The key: by computing φ(K)ᵀV first (an O(n·d²) operation producing a d×d matrix), then multiplying by φ(Q) (another O(n·d²) operation), we avoid the n×n matrix. Total cost: O(n·d²), linear in n.
Why Previous Linear Attention Failed
Performer (2020), the Linear Transformer (Katharopoulos et al., 2020), and other linear attention variants achieved O(n) scaling but with significant quality degradation. The softmax in standard attention performs two critical functions:
- Normalization: Attention weights sum to 1, creating a proper weighted average
- Sharpening: The exponential amplifies score differences, allowing focused attention on relevant tokens
Without softmax, attention distributions become too uniform — the model can’t focus. Quality drops by 2-5 perplexity points, making linear attention impractical for frontier models.
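The sharpening point shows up even in a toy example: given the same scores, softmax concentrates far more mass on the top token than a naive ReLU-normalized kernel does (the numbers below are illustrative, not drawn from any model):

```python
import numpy as np

# Softmax vs. naive kernel normalization on the same attention scores.
# The exponential amplifies the gap to the top score; plain ReLU
# normalization leaves the distribution flatter (higher entropy).
scores = np.array([4.0, 1.0, 0.5, 0.1])

softmax_w = np.exp(scores) / np.exp(scores).sum()
relu_w = np.maximum(scores, 0.0) / np.maximum(scores, 0.0).sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

print("softmax top weight:", round(float(softmax_w.max()), 3))  # ~0.91
print("relu top weight:   ", round(float(relu_w.max()), 3))     # ~0.71
print("entropies:", round(entropy(softmax_w), 3), "vs", round(entropy(relu_w), 3))
```

The softmax distribution has visibly lower entropy: the model can commit to the relevant token, which is exactly what naive linear kernels lose.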
Lightning Attention’s Innovation
Lightning Attention addresses both problems through a hybrid approach:
- Improved kernel function: Instead of naive ReLU or ELU feature maps, Lightning Attention uses a carefully designed feature map φ that preserves the sharpening property of softmax while remaining computationally efficient.
- Chunk-wise computation: Sequences are divided into chunks. Within each chunk, attention can use a more precise local computation; across chunks, the linear formulation carries information forward through a compressed state.
- Integration with the MoE FFN: The linear attention mechanism is co-designed with the MoE layers. Experts can specialize for different regions of long contexts: some experts handle local patterns (recent tokens), others handle global patterns (distant context).
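The chunk-wise idea can be sketched as follows: exact causal attention inside each chunk, with a running d×d state carrying earlier chunks forward. This is a simplified, unnormalized version with a placeholder feature map; the actual kernel and normalization are MiniMax's own:

```python
import numpy as np

# Chunk-wise linear attention (simplified, unnormalized): within a
# chunk, exact causal attention; across chunks, a compressed d x d
# state accumulates phi(K)^T V from everything already seen.
rng = np.random.default_rng(0)
n, d, chunk = 256, 32, 64
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
phi = lambda x: np.maximum(x, 0.0)        # placeholder positive feature map

state = np.zeros((d, d))                  # compressed summary of past chunks
out = np.zeros_like(V)
for s in range(0, n, chunk):
    q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
    inter = phi(q) @ state                         # past chunks, via d x d state
    intra = np.tril(phi(q) @ phi(k).T) @ v         # exact local causal attention
    out[s:s+chunk] = inter + intra
    state += phi(k).T @ v                          # fold this chunk into the state

# Matches full causal linear attention computed in one shot
ref = np.tril(phi(Q) @ phi(K).T) @ V
print(np.allclose(out, ref))                       # True
```

The state never grows with sequence length, so the per-chunk cost is constant: this is what makes both long-context training and cross-GPU hand-off cheap.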
Lightning Attention shares a philosophical similarity with Mamba (covered in Inference Optimization Timeline Part 12): both replace quadratic attention with a linear-time mechanism that carries state forward. The key difference: Mamba uses a state-space model formulation, while Lightning Attention stays within the attention framework with a modified kernel. This makes Lightning Attention easier to integrate into existing transformer architectures.
Architecture: 456B MoE with Lightning Attention
MiniMax-01 Architecture
| Spec | MiniMax-01 | DeepSeek V3 | Kimi K2 |
|---|---|---|---|
| Total params | 456B | 671B | 1T |
| Activated params | 45.9B | 37B | 32B |
| Experts | 32 | 256 + 1 shared | 384 |
| Attention | Lightning (linear) | Standard + MLA | Standard + MLA |
| Max context (train) | 1M tokens | 128K tokens | 128K tokens |
| Max context (inference) | 4M tokens | 128K tokens | 128K tokens |
| KV cache scaling | O(n) per layer | O(n) per layer (MLA compressed) | O(n) per layer (MLA compressed) |
The critical differentiator: Lightning Attention’s linear compute scaling. While DeepSeek V3 and Kimi K2 reduce KV cache memory through MLA compression, they still pay O(n²) compute for attention. MiniMax-01 pays O(n) for both compute and memory.
Training for 1M Context
Training on 1M-token sequences with 456B parameters requires solving several problems:
Memory Management
A 1M-token sequence at d_model=8192 requires massive activation memory (a single FP16 activation tensor is 1M × 8192 × 2 bytes = 16 GB per layer). Solutions:
- Activation checkpointing: Recompute activations during backward pass instead of storing them
- Sequence parallelism: Distribute the sequence across multiple GPUs, each holding a segment
- Progressive context extension: Train initially on shorter sequences (32K), gradually extend to 128K, 512K, then 1M
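A quick estimate shows why these techniques are necessary. The sketch below assumes d_model=8192, 64 layers, FP16, and counts only one residual-stream tensor per layer, which understates real training footprints:

```python
# Rough activation-memory estimate: one FP16 residual-stream tensor of
# shape (seq_len, d_model) per layer. Real training stores several
# tensors per layer, so this is a lower bound.
def activation_gib(seq_len, d_model=8192, layers=64, bytes_per=2):
    return layers * seq_len * d_model * bytes_per / 2**30

for seq_len in (32_768, 131_072, 524_288, 1_048_576):
    print(f"{seq_len:>9} tokens: {activation_gib(seq_len):>6.0f} GiB")
```

Even this lower bound exceeds 1 TiB at 1M tokens, far beyond a single GPU, so checkpointing and sequence parallelism are mandatory rather than optional.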
Computation-Communication Overlap
With sequence distributed across GPUs, Lightning Attention’s linear formulation enables efficient distributed computation. Each GPU processes its chunk and passes a compressed state to the next — no all-to-all communication needed for attention (unlike Ring Attention, which must pass KV blocks in a ring).
Communication Volume: Ring Attention vs Lightning Attention (1M tokens, 8 GPUs)
Lightning Attention’s compressed state is only O(d²) per head per layer, vastly smaller than Ring Attention’s O(n·d) KV blocks. This is why MiniMax can train at 1M tokens efficiently.
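A rough per-layer comparison under assumed parameters (FP16, 8 ranks, per-head dim 128, 64 heads; Ring Attention circulates each KV segment through every rank, Lightning hands one d×d state per head to the next rank; the accounting is simplified):

```python
# Rough per-layer communication-volume comparison (assumed FP16,
# P=8 ranks, head dim d=128, H=64 heads; simplified accounting).
def ring_attention_bytes(n, d=128, heads=64, P=8, bytes_per=2):
    block = (n // P) * d * heads * 2 * bytes_per   # one segment's K and V
    return P * (P - 1) * block                     # every block visits every rank

def lightning_bytes(d=128, heads=64, P=8, bytes_per=2):
    return (P - 1) * d * d * heads * bytes_per     # one d x d state per hop

n = 1_048_576
print(f"Ring:      {ring_attention_bytes(n) / 2**30:,.0f} GiB")
print(f"Lightning: {lightning_bytes() / 2**20:,.0f} MiB")
```

Under these assumptions the gap is roughly four orders of magnitude, and crucially the Lightning volume is independent of sequence length.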
Extrapolation to 4M Tokens
MiniMax-01 extrapolates from 1M training context to 4M inference context. This works because:
- Linear attention has no position-dependent components that break: Unlike RoPE, where unseen rotation angles cause attention score degradation, Lightning Attention’s state-based formulation naturally handles longer sequences.
- The compressed state carries sufficient information: The d×d state matrix accumulates context information across the entire sequence. As long as the state has capacity to represent the relevant information, additional tokens can be processed.
- MoE routing adapts: Different experts activate for different parts of the context, effectively increasing the model’s capacity for longer inputs without proportional compute increase.
While MiniMax-01 can process 4M tokens, quality degrades gradually beyond the 1M training length. Evaluation on long-context benchmarks (Needle-in-a-Haystack) shows near-perfect retrieval up to 1M tokens, with accuracy dropping to 85-90% at 4M tokens. The extrapolation is useful but not lossless.
Performance Analysis
MiniMax-01 Long-Context Performance
| Benchmark | MiniMax-01 | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| NIAH (128K) | 99.8% | 99.5% | 99.7% |
| NIAH (1M) | 98.2% | N/A (128K limit) | N/A (200K limit) |
| RULER (128K) | 91.4% | 89.2% | 90.1% |
| MMLU | 88.5% | 88.7% | 88.3% |
| HumanEval | 83.2% | 90.2% | 92.0% |
MiniMax-01’s strength is clear: it matches GPT-4o and Claude on standard benchmarks while offering 8-20x longer context windows. The tradeoff: slightly lower scores on code generation (HumanEval), likely because the linear attention mechanism loses some of the precise token-level focus that softmax attention provides for code.
Implications for Serving
4M-token context creates new serving challenges:
- Memory: KV cache for 4M tokens with 32 heads of dim 128 (FP16): 4M × 4096 × 2 bytes × 2 for K and V ≈ 64 GB per layer. Across 64 layers: 4 TB. This doesn’t fit on any single GPU.
- Throughput: Processing a 4M-token prompt takes minutes even with linear attention. Batch size is effectively 1 for very long contexts.
- Serving pattern: Long-context requests are rare but expensive. A disaggregated architecture (covered in Inference Timeline Part 10) helps: dedicate specific nodes to long-context prefill while others handle short-context decode.
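The memory figure above can be sanity-checked with a short calculation (assuming FP16, 32 KV heads of dim 128, 64 layers, and counting only softmax-style K/V tensors):

```python
# KV-cache sanity check: 4M tokens, 32 heads x 128 dims, FP16 K and V.
def kv_cache_tib(n_tokens, heads=32, head_dim=128, layers=64, bytes_per=2):
    per_layer = n_tokens * heads * head_dim * 2 * bytes_per  # K and V
    return layers * per_layer / 2**40

print(round(kv_cache_tib(4_194_304), 1))  # ~4 TiB across 64 layers
```

At 64 GiB per layer, even a few long-context requests saturate a multi-node memory pool, which is what motivates the disaggregated serving pattern.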
Lightning Attention’s O(n) scaling doesn’t help for typical serving workloads (2K-32K tokens); FlashAttention with standard attention is fast enough and higher quality. The benefit is exclusively for very long sequences (128K+) where quadratic attention becomes impractical. If your workload is mostly short contexts, standard attention + FlashAttention remains the better choice.
What MiniMax-01 Means for the Field
MiniMax-01 demonstrates that linear attention is viable for frontier-quality models when:
- The kernel function is carefully designed (not naive ReLU/ELU)
- The architecture is co-optimized (MoE + Lightning Attention)
- Training is progressive (short to long context)
- The use case genuinely requires very long context (1M+ tokens)
For most applications, standard attention + FlashAttention + RoPE scaling remains the pragmatic choice. But for document-scale processing, code repository understanding, and multi-document reasoning, Lightning Attention opens possibilities that quadratic attention simply cannot reach.
The frontier model landscape in 2025 now has two viable attention paradigms: softmax-based (DeepSeek V3, Kimi K2, Llama 4) and linear (MiniMax-01, with Mamba hybrids as a third path). The next post in this series surveys where all frontier models are converging and where they diverge.