Attention is the most successful sequence modeling mechanism in the history of deep learning. It is also $O(n^2)$ in sequence length. For a 1-million-token context window, the attention matrix alone contains $n^2 = 10^{12}$ entries — one trillion floating-point values that must be computed, stored (or streamed), and multiplied against. FlashAttention reduces the memory traffic, but the compute remains quadratic. At some point, the question becomes inescapable: can we build a sequence model that is $O(n)$ and still competitive with attention for language?
State space models (SSMs) are the most serious attempt at answering yes. The lineage runs from classical control theory through the S4 architecture’s HiPPO matrix to Mamba’s selective state spaces, which finally made SSMs competitive with transformers on language modeling. This post traces the full arc: why attention is quadratic, why the obvious fix (linear attention) degrades quality, how continuous-time state space models offer a different path, what Mamba’s input-dependent parameterization changes, why the parallel scan is GPU-friendly, where SSMs still fall short, and how hybrid architectures try to get the best of both worlds.
1. Why Attention Is O(n^2) and Why It Matters
The Quadratic Bottleneck
Standard scaled dot-product attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

where $Q, K, V \in \mathbb{R}^{n \times d}$. The matrix product $QK^\top$ produces an $n \times n$ matrix. Computing it requires $O(n^2 d)$ FLOPs. Multiplying the resulting attention weights by $V$ costs another $O(n^2 d)$ FLOPs. Total: $O(n^2 d)$ per head, per layer.
For a single attention head with $d = 128$:
Attention FLOPs by Sequence Length (Single Head, d=128)
| Sequence Length (n) | QK^T FLOPs | Softmax + AV FLOPs | Total FLOPs | Relative to n=1K |
|---|---|---|---|---|
| 1,024 | 2.68e8 | 2.68e8 | 5.37e8 | 1x |
| 4,096 | 4.29e9 | 4.29e9 | 8.59e9 | 16x |
| 16,384 | 6.87e10 | 6.87e10 | 1.37e11 | 256x |
| 65,536 | 1.10e12 | 1.10e12 | 2.20e12 | 4,096x |
| 1,048,576 | 2.81e14 | 2.81e14 | 5.63e14 | 1,048,576x |
With 32 heads and 80 layers, a 1M-token forward pass requires $\approx 1.4 \times 10^{18}$ FLOPs for attention alone. An H100 at 990 TFLOPS would need over 1,400 seconds — just for attention, ignoring FFN layers. This is not practically viable without sparse attention, sliding windows, or a fundamentally different architecture.
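The table values follow directly from the $2n^2d$ cost of each matmul; a quick sketch to reproduce them:

```python
def attention_flops(n: int, d: int = 128) -> int:
    """FLOPs for one attention head: QK^T plus AV, each ~2*n^2*d (softmax omitted)."""
    return 2 * n * n * d + 2 * n * n * d

for n in (1_024, 4_096, 16_384, 65_536, 1_048_576):
    ratio = attention_flops(n) / attention_flops(1_024)
    print(f"n={n:>9}: {attention_flops(n):.2e} FLOPs ({ratio:,.0f}x)")
```

Doubling $n$ quadruples the total, which is exactly the 16x / 256x / ... progression in the table.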
Attention compute grows quadratically. GPU FLOPS grow roughly 2x per hardware generation (every 2 years). To double the context length at constant latency, you need 4x the compute — two hardware generations. At this rate, going from 128K tokens to 1M tokens requires 64x more compute for attention alone. Hardware improvements alone cannot bridge this gap.
The Memory Problem at Inference
During autoregressive generation, each new token must attend to all previous tokens. The KV cache grows linearly with sequence length, and each decode step’s attention computation grows linearly too (it is computing one row of the $n \times n$ matrix). But the KV cache memory is the binding constraint: for a 70B model at FP16 (80 layers, 32 heads of dimension 128), the KV cache for a 1M-token context is approximately:

$$2 \;(\text{K and V}) \times 80 \;\text{layers} \times 32 \;\text{heads} \times 128 \;\text{dim} \times 10^6 \;\text{tokens} \times 2 \;\text{bytes} \approx 1.3 \;\text{TB}$$

This exceeds the memory of any single GPU by more than an order of magnitude. Even with GQA (8 KV heads instead of 32), it is still over 300 GB, several times what an 80GB A100 can hold even before accounting for model weights.
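The arithmetic can be packaged as a small helper (a sketch; the layer and head counts are illustrative 70B-class assumptions, not the config of any specific model):

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV-cache size: 2 tensors (K and V) per layer per token, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

full = kv_cache_bytes(1_000_000)               # full MHA: ~1.3e12 bytes (~1.3 TB)
gqa = kv_cache_bytes(1_000_000, n_kv_heads=8)  # GQA with 8 KV heads: 4x smaller
```

Reducing the KV head count (GQA) shrinks the cache proportionally, but the linear growth in `n_tokens` remains.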
An $O(n)$ model that maintains a fixed-size state regardless of sequence length would solve both problems: linear compute for processing and constant memory per token at inference.
2. Linear Attention: The First Attempt at O(n)
The Kernel Trick
The most direct route to $O(n)$ attention is to remove the softmax. Standard attention at query position $i$ can be written as:

$$\text{Attn}(Q, K, V)_i = \frac{\sum_j \exp(q_i^\top k_j)\, v_j}{\sum_j \exp(q_i^\top k_j)}$$

The $\exp$ and normalization (softmax) are what force us to compute all $n^2$ pairwise scores before we can normalize. If we replace $\exp(q_i^\top k_j)$ with a kernel function that decomposes as an inner product of feature maps, $\text{sim}(q_i, k_j) = \phi(q_i)^\top \phi(k_j)$, we get:

$$\text{Attn}_i = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}$$

The key insight is associativity. Instead of computing the $n \times n$ attention matrix, we can first compute $S = \sum_j \phi(k_j)\, v_j^\top$, which is a $d' \times d$ matrix (where $d'$ is the feature map dimension). Then for each query $q_i$, the output is $\phi(q_i)^\top S$, which is $O(d' d)$ per query. Total cost: $O(n\, d'\, d)$ — linear in $n$.
```python
# Standard attention
# 1. Compute the n x n attention matrix -- must be fully materialized,
#    because softmax's row-wise normalization breaks associativity.
scores = Q @ K.T / sqrt(d)    # O(n^2 d)
weights = softmax(scores)     # O(n^2)
output = weights @ V          # O(n^2 d)
# Total: O(n^2 d) -- quadratic in n
```
```python
# Linear attention (kernel trick)
# 1. Apply feature map
Q_phi = phi(Q)                # n x d'
K_phi = phi(K)                # n x d'
# 2. Compute the KV summary FIRST (associativity!)
S = K_phi.T @ V               # d' x d  -- independent of n
Z = K_phi.sum(dim=0)          # d'      -- independent of n
# 3. Compute output per query (broadcast the per-row scalar normalizer)
output = (Q_phi @ S) / (Q_phi @ Z)[:, None]   # O(n d' d)
# Total: O(n d' d) -- linear in n
```
Why Linear Attention Degrades
On paper, linear attention is the perfect solution: same expressiveness class, $O(n)$ compute, $O(1)$ state at inference (just maintain $S$ and $Z$ as running sums). In practice, it produces significantly worse language models.
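The running-sum view of decoding can be sketched in a few lines. This is a minimal numpy sketch; the feature map $\phi(x) = \mathrm{elu}(x) + 1$ is an assumed choice from the linear-attention literature, not prescribed by the text:

```python
import numpy as np

def phi(x):
    """Assumed feature map: elu(x) + 1, which keeps features positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Decode-style linear attention: O(1) state (S, Z) per step, O(n) total."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k_j) v_j^T
    Z = np.zeros(d)                 # running sum of phi(k_j)
    out = np.empty((n, V.shape[1]))
    for t in range(n):
        q, k = phi(Q[t]), phi(K[t])
        S += np.outer(k, V[t])      # constant work per token, independent of t
        Z += k
        out[t] = (q @ S) / (q @ Z + 1e-9)
    return out
```

Each decode step touches only the fixed-size `S` and `Z`, which is exactly the property that makes the approach attractive for inference.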
The problem is the softmax. Softmax does two critical things:
- Sparsification: It concentrates attention weights on a few relevant tokens. The $\exp$ function amplifies differences — tokens with slightly higher scores get exponentially more weight.
- Normalization with interaction: The denominator $\sum_j \exp(q_i^\top k_j)$ depends on all scores, creating global competition among keys. Each key’s attention weight depends on every other key.
Without softmax, the attention weights are “flat.” The model cannot sharply focus on specific tokens. For language modeling, this means the model struggles with:
- Copying: “Repeat the word in quotes: ‘elephant’” requires sharp attention to a specific position.
- Retrieval: “What was the name mentioned in paragraph 3?” requires attending to a specific context span.
- In-context learning: Few-shot learning requires attending to the specific examples and their answers.
Perplexity Gap: Linear Attention vs. Standard Attention (WikiText-103)
The perplexity gap of 20-40% for linear attention variants is too large for practical deployment. Numerous papers tried to close this gap with better feature maps ($\mathrm{elu}(x)+1$, random Fourier features, etc.), but none achieved parity with softmax attention on language benchmarks. The gap is not in the feature map — it is in the fundamental inability to compute a sharp, context-dependent reweighting without the nonlinear softmax.
This is what motivated the SSM approach: rather than trying to linearize attention, start from a completely different mathematical framework.
3. State Space Models: The Continuous-Time Perspective
From Attention to Recurrence
Attention processes a sequence by letting every position interact with every other position. Recurrence processes a sequence by maintaining a hidden state that is updated at each step. The classical RNN is:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t), \qquad y_t = W_y h_t$$

RNNs are $O(n)$ in sequence length and use $O(1)$ memory at inference — exactly what we want. But vanilla RNNs (and LSTMs, GRUs) fail to model long-range dependencies due to vanishing/exploding gradients, and they cannot be parallelized during training because each step depends on the previous one.
State space models start from a different place: continuous-time linear dynamical systems, borrowed from control theory.
The Continuous-Time State Space
A linear time-invariant (LTI) system is defined by:

$$h'(t) = A\,h(t) + B\,x(t)$$
$$y(t) = C\,h(t) + D\,x(t)$$

where $h(t) \in \mathbb{R}^N$ is the hidden state, $x(t)$ is the input signal, $y(t)$ is the output, and $A$, $B$, $C$, $D$ are the system matrices.
This is a linear ODE. Given an input signal , the state evolves continuously, and the output is a linear function of the state. The key property: because the system is linear, it can be solved analytically and admits a convolution representation.
Discretization
To process discrete sequences (like token embeddings), we discretize the continuous system using a step size $\Delta$. The zero-order hold (ZOH) discretization gives:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$$

The discrete-time system is then:

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t$$
This is a linear recurrence. At inference, we process tokens one at a time, updating the hidden state with a single matrix-vector multiply — constant work per step regardless of sequence length, with $O(N)$ memory. During training, because the system is linear, we can unroll it as a convolution and compute all outputs in parallel.
The Dual View: Recurrence and Convolution
The linear recurrence can be unrolled:

$$y_t = \sum_{k=0}^{t} C \bar{A}^{k} \bar{B}\, x_{t-k}$$

The output is a convolution of the input sequence with the kernel:

$$\bar{K} = \left(C\bar{B},\; C\bar{A}\bar{B},\; C\bar{A}^2\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\right)$$

During training, we can compute this convolution using FFT in $O(n \log n)$ time. During inference, we use the recurrence in $O(n)$ total time with $O(N)$ state.
SSMs have two computation modes: (1) Recurrence mode for inference — process one token at a time, update the state, constant work per step, $O(N)$ memory. (2) Convolution mode for training — compute the full kernel $\bar{K}$, convolve with the input using FFT, $O(n \log n)$ time, fully parallelizable. This dual mode is the fundamental advantage over RNNs, which can only compute sequentially.
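The equivalence of the two modes can be checked numerically. A minimal numpy sketch with a diagonal $A$ (so ZOH discretization is elementwise) and arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 32                            # state size, sequence length
a = -np.abs(rng.normal(size=N)) - 0.1   # stable diagonal A (negative reals)
B = rng.normal(size=N)
C = rng.normal(size=N)
dt = 0.1

# Zero-order-hold discretization (diagonal, so elementwise)
Ab = np.exp(a * dt)                     # A_bar = exp(dt * A)
Bb = (np.exp(a * dt) - 1.0) / a * B     # B_bar = (dt A)^-1 (exp(dt A) - I) dt B

x = rng.normal(size=L)

# Mode 1: recurrence (inference): h_t = Ab h_{t-1} + Bb x_t, y_t = C h_t
h = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    h = Ab * h + Bb * x[t]
    y_rec[t] = C @ h

# Mode 2: convolution (training): kernel K_k = C Ab^k Bb
kernel = np.array([C @ (Ab ** k * Bb) for k in range(L)])
y_conv = np.array([kernel[: t + 1] @ x[t::-1] for t in range(L)])

assert np.allclose(y_rec, y_conv)       # the two modes agree
```

In a real implementation the convolution would use an FFT rather than the explicit $O(L^2)$ loop shown here; the point is only that both modes compute the same outputs.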
4. S4: The HiPPO Matrix and Structured State Spaces
The generic linear SSM described above does not work well out of the box. The $A$ matrix is the critical component, and a random $A$ leads to either exploding states (eigenvalues outside the unit circle) or rapid forgetting (eigenvalues too close to zero). The S4 architecture (Gu et al., 2022) solved this with two key ideas: the HiPPO matrix and structured parameterization.
The HiPPO Framework
HiPPO (High-order Polynomial Projection Operator) provides a principled initialization for $A$ that optimally compresses the history of the input signal into the hidden state. The idea is to choose $A$ such that the hidden state maintains a polynomial approximation of the input history.
Specifically, the HiPPO-LegS matrix is defined by:

$$A_{nk} = -\begin{cases} \sqrt{(2n+1)(2k+1)} & n > k \\ n + 1 & n = k \\ 0 & n < k \end{cases}$$

This specific matrix has the property that the hidden state at time $t$ contains the coefficients of the Legendre polynomial expansion of the input signal over $[0, t]$. In other words, the state is an optimal compressed representation of the entire history, in the sense of minimizing the error of the polynomial approximation.
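A direct construction of this matrix, following the formula above:

```python
import numpy as np

def hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS: A[n,k] = -sqrt((2n+1)(2k+1)) for n > k, -(n+1) on the diagonal."""
    A = np.zeros((N, N))
    for n in range(N):
        A[n, n] = -(n + 1)
        for k in range(n):
            A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
    return A
```

Because the matrix is lower-triangular, its eigenvalues are the diagonal entries $-(n+1)$: all strictly negative, so the continuous-time dynamics are stable by construction rather than by luck of initialization.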
Why HiPPO Matters
The HiPPO initialization solves the long-range dependency problem that plagued RNNs. With a random $A$ matrix, the state quickly forgets old inputs (exponential decay). With the HiPPO matrix, the state retains information about the entire history with graceful degradation — recent inputs are represented with higher precision than distant ones, but nothing is completely lost.
On the Long Range Arena benchmark (tasks requiring dependencies over 1K-16K steps), S4 with HiPPO initialization achieved dramatic improvements over transformers:
Long Range Arena Accuracy (Higher is Better)
The gap between S4 with HiPPO and S4 with a random $A$ (+27.8 points) demonstrates that the initialization is doing most of the work. The HiPPO matrix gives the model a principled inductive bias for remembering long-range history.
The S4 Architecture
S4 stacks multiple SSM layers, each with its own parameters. The architecture processes each channel of the input independently (like a depthwise convolution), then mixes channels with a linear projection. The full block is:
- Linear projection: input → expanded dimension
- SSM layer: process each channel with independent state space dynamics
- Nonlinear activation (GELU)
- Linear projection: expanded dimension → output dimension
- Residual connection
During training, the SSM layer uses the convolution mode (FFT-based), making the entire model parallelizable. During inference, it switches to recurrence mode.
S4 excels on continuous signal tasks (audio, time series, long-range classification) but underperforms transformers on language modeling. The reason: S4’s parameters are input-independent. The same dynamics process every token identically, regardless of content. Language requires content-based routing — attending to semantically relevant tokens, not just recently proximate ones. This is exactly what attention’s softmax achieves, and what S4 lacks.
5. Mamba: Selective State Spaces
Mamba (Gu and Dao, 2023) is the architecture that finally made SSMs competitive with transformers for language modeling. The key insight is deceptively simple: make the SSM parameters depend on the input.
The Selection Mechanism
In S4, the matrices $B$, $C$, and the step size $\Delta$ are learned parameters that remain fixed during inference. Every input token is processed by the same dynamics. Mamba makes them functions of the input:

$$B_t = W_B\, x_t, \qquad C_t = W_C\, x_t, \qquad \Delta_t = \mathrm{softplus}(W_\Delta\, x_t)$$

where $x_t$ is the input at position $t$. The discretized dynamics become:

$$h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t$$

with $\bar{A}_t = \exp(\Delta_t A)$ and $\bar{B}_t$ derived from $\Delta_t$ and $B_t$.
This is no longer a linear time-invariant system. The matrices change at every step based on the input. This means:
- Content-based selection: The model can selectively incorporate or ignore information based on what the current token is. When processing the word “elephant” in the context of a copying task, $\Delta_t$ can be large (updating the state significantly). When processing filler words, $\Delta_t$ can be small (preserving the existing state).
- Input-dependent gating: The $B_t$ and $C_t$ matrices control what goes into the state and what comes out, respectively. By making them input-dependent, the model can learn to write specific information to the state and read it back based on context.
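A minimal single-layer sketch of the selection mechanism, with diagonal $A$, a per-channel state, and a simplified $\bar{B}_t \approx \Delta_t B_t$ discretization; the projection matrices `W_B`, `W_C`, `W_dt` are illustrative stand-ins for the learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, L = 8, 4, 16                       # channels, state size per channel, length
A = -np.exp(rng.normal(size=(d, N)))     # fixed diagonal A, negative for stability
W_B = rng.normal(size=(N, d)) * 0.1      # hypothetical projections producing the
W_C = rng.normal(size=(N, d)) * 0.1      # input-dependent B_t, C_t, dt_t
W_dt = rng.normal(size=d) * 0.1

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x):                    # x: (L, d)
    h = np.zeros((d, N))                 # one length-N state per channel
    ys = []
    for x_t in x:
        dt = softplus(W_dt * x_t)        # (d,) input-dependent step size
        B_t, C_t = W_B @ x_t, W_C @ x_t  # (N,) input-dependent in/out projections
        A_bar = np.exp(dt[:, None] * A)  # discretize: A_bar = exp(dt * A)
        h = A_bar * h + (dt[:, None] * B_t[None, :]) * x_t[:, None]
        ys.append(h @ C_t)               # y_t = C_t . h_t, per channel
    return np.stack(ys)                  # (L, d)

y = selective_ssm(rng.normal(size=(L, d)))
```

When `dt` is near zero, `A_bar` is near 1 and the state passes through unchanged; when `dt` is large, the state is overwritten by the current input — exactly the keep-or-write behavior described above.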
Why Selection Closes the Gap
The input-dependent parameterization solves the exact problem that made S4 weak at language: the inability to perform content-based reasoning. Consider a synthetic copying task:
```
Input:  a b c d [COPY] _ _ _ _
Output: _ _ _ _ [COPY] a b c d
```
With fixed SSM parameters (S4), the model processes a, b, c, d the same way it processes [COPY] and the padding tokens. It has no way to “decide” to store the specific values of a b c d and retrieve them later.
With Mamba’s input-dependent $\Delta_t$, $B_t$, $C_t$, the model can learn:
- When $x_t$ is a content token (a, b, c, d): write the token value into the state (large $\Delta_t$, aligned $B_t$).
- When $x_t$ is a padding token: leave the state unchanged (small $\Delta_t$).
- When $x_t$ is [COPY]: set $C_t$ to start reading from the state.
This is analogous to what attention achieves: selectively aggregating information based on content. The mechanism is different (state update vs. pairwise comparison), but the functional capability is similar.
Language Modeling Perplexity: Mamba vs. Transformer (Matched Parameters)
| Model | Parameters | Tokens Trained | Pile Perplexity | LAMBADA Acc. | HellaSwag Acc. |
|---|---|---|---|---|---|
| Transformer (GPT-3 arch.) | 125M | 300B | 29.6 | 38.4% | 33.7% |
| Mamba-130M | 130M | 300B | 29.1 | 40.1% | 35.3% |
| Transformer (GPT-3 arch.) | 1.3B | 300B | 14.5 | 63.2% | 52.1% |
| Mamba-1.4B | 1.4B | 300B | 14.2 | 64.5% | 53.6% |
| Transformer (GPT-3 arch.) | 2.8B | 300B | 12.4 | 67.8% | 59.2% |
| Mamba-2.8B | 2.8B | 300B | 12.0 | 69.2% | 60.4% |
6. The Hardware Story: Making Scans GPU-Friendly
Mamba’s input-dependent parameterization means the convolution trick from S4 no longer works (the system is time-varying, so there is no fixed convolution kernel). The recurrence must be computed directly. But a naive sequential recurrence is slow on GPUs — it cannot utilize the massive parallelism.
The Parallel Prefix Scan
The key algorithmic tool is the parallel prefix scan (also called parallel scan or Blelloch scan). Given a sequence of operations $h_t = a_t h_{t-1} + b_t$ (which is Mamba’s recurrence with $a_t = \bar{A}_t$ and $b_t = \bar{B}_t x_t$), the parallel scan computes all $h_t$ in $O(\log n)$ parallel steps using $O(n)$ total work.
The insight: the recurrence is an associative operation. Define the binary operation $\bullet$ on tuples as:

$$(a_1, b_1) \bullet (a_2, b_2) = (a_2 a_1,\; a_2 b_1 + b_2)$$

Then the composition of two consecutive recurrence steps is:

$$h_t = a_t(a_{t-1} h_{t-2} + b_{t-1}) + b_t = (a_t a_{t-1})\, h_{t-2} + (a_t b_{t-1} + b_t)$$

which matches the definition. Because $\bullet$ is associative, we can compute the prefix scan (cumulative application of $\bullet$) using a parallel tree reduction in $O(\log n)$ steps.
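This can be checked numerically. A small sketch comparing the sequential recurrence against a prefix scan over the $\bullet$ operation, with `itertools.accumulate` standing in for the parallel tree reduction:

```python
import itertools
import random

# The recurrence h_t = a_t * h_{t-1} + b_t as a binary operation on (a, b) pairs.
def combine(p, q):
    a1, b1 = p
    a2, b2 = q
    return (a2 * a1, a2 * b1 + b2)

random.seed(0)
steps = [(random.uniform(0.5, 1.0), random.uniform(-1.0, 1.0)) for _ in range(64)]

# Sequential recurrence (what a naive RNN-style loop computes), h_{-1} = 0.
h, hs = 0.0, []
for a, b in steps:
    h = a * h + b
    hs.append(h)

# Prefix scan over the associative operation yields the same h_t values;
# associativity is what licenses evaluating it as an O(log n)-depth tree.
prefix = list(itertools.accumulate(steps, combine))
assert all(abs(p[1] - h) < 1e-9 for p, h in zip(prefix, hs))
```

`accumulate` here applies $\bullet$ left to right, but because the operation is associative, any bracketing gives the same prefixes, which is exactly what the GPU tree reduction exploits.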
The Fused SRAM Kernel
The parallel scan gives us $O(\log n)$ parallel time (the $\log n$ factor comes from the tree depth). But for Mamba’s practical performance, the raw algorithmic complexity matters less than the memory access pattern.
Gu and Dao implemented Mamba’s core computation as a fused CUDA kernel that keeps the entire state evolution in GPU SRAM (shared memory), avoiding HBM round-trips for intermediate states. The kernel:
- Loads a chunk of the input sequence into SRAM.
- Computes the discretized $\bar{A}_t$, $\bar{B}_t$ for each position in the chunk (input-dependent computation).
- Runs the parallel scan within the chunk, keeping all intermediate states in SRAM.
- Writes only the final output to HBM.
This is conceptually similar to FlashAttention’s tiling strategy: keep the hot intermediate data in fast on-chip memory, minimize traffic to slow HBM. The difference is that Mamba’s intermediate data is the hidden state sequence, not an attention matrix.
Throughput: Mamba vs. Transformer (A100, Sequence Length Sweep)
The throughput advantage of Mamba grows dramatically with sequence length. At 2K tokens, Mamba is only ~8% faster (the overhead of attention is small at short sequences). At 32K tokens, Mamba is 17x faster. At 128K tokens, the transformer runs out of memory while Mamba continues to operate.
Inference-Time Efficiency
At inference time, Mamba’s advantage is even more dramatic. A transformer must read the entire KV cache at each decode step — $O(n)$ memory access per token, growing with context length. Mamba maintains a fixed-size state (typically 16-64 floats per channel) that is updated with a constant amount of work per token, regardless of how long the context is.
Per-Token Decode Cost: Mamba vs. Transformer (70B-scale)
| Context Length | Transformer Decode (ms/tok) | Mamba Decode (ms/tok) | Transformer KV Cache | Mamba State Size |
|---|---|---|---|---|
| 1K | 35 | 32 | 1.3 GB | 16 MB |
| 4K | 38 | 32 | 5.2 GB | 16 MB |
| 16K | 52 | 32 | 20.8 GB | 16 MB |
| 64K | 105 | 32 | 83.2 GB | 16 MB |
| 256K | OOM | 32 | 332 GB | 16 MB |
Mamba’s per-token decode cost is constant regardless of context length. The hidden state is a fixed-size vector (e.g., state dimension $N = 16$ with expansion factor 2 and model dimension 4096 gives $4096 \times 2 \times 16$ values, about 256 KB per layer at FP16). This is fundamentally different from attention, where the KV cache grows without bound. For long-context applications, this is transformative: a 1M-token conversation costs the same per token as a 100-token conversation.
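The state-size arithmetic, using the illustrative config from the text (model dim 4096, expansion 2, $N = 16$, FP16):

```python
d_model, expand, N, bytes_fp16 = 4096, 2, 16, 2

# Mamba: fixed-size state per layer, independent of context length.
mamba_state_bytes = d_model * expand * N * bytes_fp16   # 262,144 B = 256 KB

# Transformer (MHA at the same model dim): per-layer KV cache grows per token.
kv_bytes_per_token = 2 * d_model * bytes_fp16           # 16 KB per token per layer
kv_at_1m = kv_bytes_per_token * 1_000_000               # ~16.4 GB per layer
```

At a 1M-token context, the per-layer KV cache is four orders of magnitude larger than the fixed Mamba state, which is the gap the decode table above reflects.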
7. Mamba vs. Transformer: Strengths and Weaknesses
Mamba is not a strict upgrade over transformers. Each architecture has regimes where it excels and regimes where it struggles.
Where Mamba Wins
- Long sequences: $O(n)$ compute and constant memory make Mamba practical for sequence lengths where attention is infeasible.
- Autoregressive generation speed: Constant per-token decode cost means generation speed does not degrade with context length.
- Memory efficiency: No KV cache means more memory available for batching, enabling higher throughput.
- Continuous signals: Audio, time series, DNA sequences — domains where the long-range structure is smooth and continuous, matching SSM’s continuous-time inductive bias.
Where Attention Wins
- In-context learning: Transformers excel at few-shot in-context learning because attention can directly copy patterns from examples. Mamba must compress all context into the fixed-size state, losing fine-grained information.
- Exact retrieval: “What was the 5th word in the 3rd paragraph?” requires attending to a specific position. Attention can do this with a sharp attention weight. Mamba’s compressed state loses positional specificity.
- Copying and matching: Tasks that require comparing or copying specific subsequences are hard for SSMs. The information must pass through the state bottleneck.
In-Context Learning: Mamba vs. Transformer (5-Shot Accuracy)
The pattern is clear: on aggregate benchmarks (MMLU), the gap is small. On tasks that specifically require precise in-context retrieval or copying, the gap is large. This reflects the fundamental difference: attention maintains access to every past token, while Mamba compresses the past into a fixed-size state.
The Information Bottleneck
The core limitation of SSMs can be understood as an information bottleneck. At any point during processing, the model’s “memory” of the past is entirely contained in the hidden state $h_t$. If the state dimension is $N = 16$ per channel (Mamba’s default), then with $d$ channels the model can store at most $32 \cdot N \cdot d$ bits of information about an arbitrarily long past (at 32-bit precision).
Attention, by contrast, stores the full KV cache — every past token’s key and value vectors. For a 1K-token context with $d = 4096$, that is $2 \times 1024 \times 4096 \times 32 \approx 268$ million bits. The information capacity is roughly two orders of magnitude larger, and the gap widens linearly with context length.
Increasing Mamba’s state dimension $N$ reduces the information bottleneck but increases per-step compute ($O(N)$ per channel for the state update, $O(N^2)$ for the $\bar{A}h$ multiplication if $A$ is dense). Mamba uses a diagonal $A$ matrix to keep the state update $O(N)$, but the information capacity is still limited by $N$. There is no free lunch: $O(n)$ compute fundamentally trades away the ability to store information about the past.
8. Hybrid Architectures: The Best of Both Worlds
Given that attention excels at precise retrieval and SSMs excel at efficient long-range processing, the natural question is: can we combine them?
Jamba: Interleaving Attention and Mamba
AI21 Labs’ Jamba (2024) is the first large-scale hybrid architecture. The design is straightforward: alternate between Mamba layers and attention layers in the model’s depth.
A 52B-parameter Jamba model uses:
- Mamba layers: For the majority of the layers (approximately 6 out of every 8 layers). These handle the bulk of sequence processing efficiently.
- Attention layers: Interspersed every few layers (approximately 2 out of every 8). These provide the precise retrieval capability that Mamba lacks.
- MoE (Mixture of Experts): In the FFN layers, for parameter efficiency.
The attention layers maintain a KV cache, but because there are far fewer attention layers than in a pure transformer, the KV cache is proportionally smaller.
KV Cache Size: Pure Transformer vs. Jamba (52B Parameter Class)
| Architecture | Attention Layers | KV Cache (64K context) | KV Cache (256K context) | Relative Size |
|---|---|---|---|---|
| Pure Transformer (52B) | 80 | 83.2 GB | 332.8 GB | 1.0x |
| Jamba (52B) | 20 (of 80) | 20.8 GB | 83.2 GB | 0.25x |
| Pure Mamba (52B equiv.) | 0 | 0.016 GB | 0.016 GB | ~0x |
Mamba-2: Structured State Space Duality
Mamba-2 (Dao and Gu, 2024) takes a more theoretical approach to the hybrid question. The key result is the State Space Duality (SSD) framework, which shows that structured SSMs and structured attention are mathematically equivalent under certain conditions.
Specifically, Mamba-2 shows that an SSM with a specific structure on the $A$ matrix (diagonal, with a particular scalar structure) computes the same function as a form of linear attention with a specific causal mask. This duality means:
- Unified implementation: The same computation can be performed either as a recurrence (SSM mode) or as a matrix multiplication (attention-like mode). The implementation can choose the faster option based on sequence length and hardware.
- Hybrid within a single layer: Rather than alternating SSM and attention layers, a single layer can smoothly interpolate between SSM-like and attention-like computation.
Mamba-2’s practical improvements:
- 2-8x faster training throughput than Mamba-1 (because the SSD formulation enables larger matrix multiplications that better utilize tensor cores).
- Slightly better perplexity at matched compute budget.
- Simpler implementation (the core operation reduces to a structured matrix multiply).
Other Hybrid Approaches
The hybrid design space is rapidly expanding:
- Griffin (Google DeepMind, 2024): Uses a gated linear recurrence (simpler than Mamba’s SSM) with local attention (sliding window). Achieves strong results with a clean, simple design.
- RWKV (Bo Peng et al., 2023-2024): A linear attention variant with channel-wise gating. Not technically an SSM, but shares the $O(n)$ compute property and constant inference memory. RWKV-6 uses a data-dependent linear recurrence similar to Mamba’s selection mechanism.
- StripedHyena (Together AI, 2023): Alternates between gated convolution layers and attention layers. Designed for maximum hardware efficiency.
- Zamba (Zyphra, 2024): Shared attention layers across the depth, with Mamba blocks between them. Reduces attention parameter count while maintaining retrieval capability.
MMLU Accuracy vs. Inference Throughput (7B-class Models)
The trend is clear: hybrids with a small number of attention layers achieve quality close to pure transformers while retaining most of the speed advantage of pure SSMs. The sweet spot appears to be 10-25% attention layers, giving a 3-5x reduction in KV cache and 1.3-1.8x inference speedup with minimal quality loss.
9. Where Things Stand in 2025
The Current Landscape
As of early 2025, the state of SSMs and hybrid architectures can be summarized as follows:
Frontier LLMs are still transformers. GPT-4, Claude 3.5, Gemini 1.5, and Llama 3.1 are all attention-based. FlashAttention, GQA, and various KV cache optimizations have pushed the practical limits of attention far enough to support 128K-1M token context windows. The quality advantage of attention on in-context learning tasks — the capability that matters most for frontier model performance — has kept the industry on the transformer path.
Mamba and hybrids are gaining traction for specific use cases. Domains where the SSM advantages are decisive:
- DNA/protein modeling: Sequences of millions of bases where long-range dependencies matter and exact retrieval is less important. Models like Evo (2.7B parameters, 131K context on DNA) use SSM architectures.
- Audio processing: Continuous signals where the continuous-time inductive bias of SSMs is a natural fit.
- Edge/mobile deployment: The constant memory footprint makes SSMs attractive for resource-constrained inference.
- Long-context generation: Applications that primarily generate long sequences (e.g., long-form writing) where the generation speed advantage is most valuable.
Hybrid architectures are the emerging consensus for the next generation. Several groups are training large-scale hybrid models. The design philosophy: use attention for the capabilities that need it (retrieval, copying, in-context learning), use SSMs for the efficient bulk processing of long sequences, and minimize the attention footprint to reduce memory and compute costs.
Architecture Comparison Summary (7B-Class Models)
| Property | Pure Transformer | Pure Mamba | Hybrid (75% SSM + 25% Attn) |
|---|---|---|---|
| Training compute | O(n^2 d) attention | O(n d N) scan | O(n^2 d) for attn layers, O(n d N) for SSM layers |
| Inference memory | KV cache grows O(n) | Fixed state O(N) | Reduced KV cache (25% of layers) |
| Per-token decode cost | O(n) at length n | O(1) | ~O(n/4) with 25% attn |
| In-context learning | Excellent | Weak | Good (from attn layers) |
| Long-range modeling | Good with FlashAttn | Excellent | Excellent |
| Exact retrieval | Excellent | Poor | Good (from attn layers) |
| 1M token context | Requires multi-GPU | Single GPU feasible | Significantly reduced KV cache |
| Hardware efficiency | Good (mature ecosystem) | Good (fused scan kernel) | Moderate (two code paths) |
Open Questions
Several fundamental questions remain unresolved:
- Scaling laws for hybrids: Do hybrid architectures have the same scaling exponents as pure transformers? Early evidence suggests yes (Jamba scales predictably), but we do not yet have Chinchilla-scale compute-optimal experiments for hybrids.
- The right state dimension: Mamba-1 uses $N = 16$; Mamba-2 uses larger values (e.g., $N = 128$). How should $N$ scale with model size and context length? The information-theoretic limits of the compressed state are not fully understood.
- Distillation from attention to SSM: Can we train a transformer and distill into a hybrid, getting the best training signal from attention while deploying the efficient SSM? Early results from Zyphra and others suggest this works, but the quality gap from distillation is nonzero.
- Hardware co-design: Current GPUs are optimized for matrix multiplications (attention). Specialized hardware for parallel scans (SSMs) could shift the efficiency comparison. Groq’s LPU and other novel architectures may favor recurrent models.
- Beyond language: For multimodal models processing images, audio, and text together, the optimal mix of attention and SSM may be different for each modality. Images may benefit from full attention (spatial relationships), while audio may be best served by SSMs (temporal dynamics).
10. Summary: The Alternative Is Real, With Caveats
State space models, and Mamba in particular, have demonstrated that $O(n)$ sequence modeling is not just a theoretical possibility but a practical one. The progression from S4’s continuous-time formulation and HiPPO initialization through Mamba’s input-dependent parameterization to Mamba-2’s state-space duality represents a coherent line of research that has produced models genuinely competitive with transformers on language benchmarks.
The key ideas to take away:
- Linear attention removes the softmax to achieve $O(n)$ compute, but the quality degradation is too severe for language modeling. The softmax is doing more work than it appears.
- S4 uses continuous-time state space dynamics with the HiPPO matrix to model long-range dependencies, but its fixed parameters prevent content-based reasoning.
- Mamba makes the SSM parameters input-dependent, enabling content-based selection that closes the gap with attention. The parallel scan makes this GPU-efficient.
- Hybrid architectures combine a few attention layers (for retrieval and in-context learning) with many SSM layers (for efficient processing), achieving near-transformer quality with significantly reduced memory and compute for long sequences.
The transformer is not going away. Its ecosystem is vast, its scaling properties are well-understood, and its capabilities on the tasks that matter most for frontier AI — in-context learning, complex reasoning, precise retrieval — remain unmatched. But the pure-attention architecture is likely not the endpoint. The future probably involves models that use attention surgically, where it provides the most value, and efficient linear-time mechanisms for the bulk of sequence processing. Mamba and its descendants are the leading candidates for the efficient half of that equation.
This concludes the Inference Optimization Timeline series. From KV cache optimization through continuous batching, speculative decoding, disaggregated serving, constrained generation, and now alternative architectures — the theme throughout has been the same: understanding the hardware-software interface, identifying the true bottleneck, and restructuring computation to match what the hardware can deliver.