Mistral 7B serves 32K context windows with the same memory footprint that Llama 2 needs for 4K. The trick: sliding window attention, which bounds the KV cache to the window size per layer rather than the full sequence length. Each token only attends to the last 4K tokens, but information propagates across the full context through 32 layers of overlapping windows. The quality cost is negligible (MMLU: 62.5% vs 63.0% for full attention), but the memory savings enable deployment scenarios that full-attention models cannot reach.
Sliding Window Attention: The Mechanism
Standard vs Sliding Window
import torch
import torch.nn.functional as F
class StandardAttention:
    """
    Standard causal attention: each token attends to ALL previous tokens.

    KV cache memory per layer grows as O(N) in sequence length N.
    """

    def forward(self, Q, K, V):
        """Compute causal self-attention.

        Q, K, V: [batch, num_heads, seq_len, head_dim].
        Returns a tensor shaped like V. Note: mutates `scores` in place
        but leaves Q, K, V untouched.
        """
        seq_len = Q.shape[2]
        d_k = Q.shape[3]
        # Scaled dot-product scores: Q @ K^T / sqrt(d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
        # Causal mask: token i may attend only to tokens 0..i. Build it on
        # the same device as the scores so GPU inputs work too (the
        # original always allocated the mask on the CPU).
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device),
            diagonal=1,
        )
        scores.masked_fill_(causal_mask, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V)

    def kv_cache_size(self, seq_len, num_layers, num_heads, head_dim):
        """Total KV cache bytes (FP16): grows linearly with seq_len."""
        bytes_per_element = 2  # FP16
        # Per layer: 2 tensors (K and V) * num_heads * seq_len * head_dim
        per_layer = 2 * num_heads * seq_len * head_dim * bytes_per_element
        return per_layer * num_layers
class SlidingWindowAttention:
    """
    Sliding window attention: each token attends to the last W tokens only.

    KV cache memory per layer is O(W) — constant in sequence length.
    """

    def __init__(self, window_size=4096):
        self.W = window_size

    def forward(self, Q, K, V):
        """Compute banded causal attention over a window of W tokens.

        Q, K, V: [batch, num_heads, seq_len, head_dim].
        Token i attends to tokens max(0, i-W+1)..i.
        """
        seq_len = Q.shape[2]
        d_k = Q.shape[3]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
        # Position j is visible from position i iff 0 <= i - j < W.
        # Build the banded causal mask in one vectorized step on the
        # scores' device (the original used an O(seq_len) Python loop of
        # CPU-side slice assignments).
        pos = torch.arange(seq_len, device=scores.device)
        dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # dist[i, j] = i - j
        invalid = (dist < 0) | (dist >= self.W)
        scores.masked_fill_(invalid, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        return torch.matmul(attn_weights, V)

    def kv_cache_size(self, seq_len, num_layers, num_heads, head_dim):
        """Total KV cache bytes (FP16): bounded by the window size W."""
        bytes_per_element = 2  # FP16
        effective_len = min(seq_len, self.W)
        per_layer = 2 * num_heads * effective_len * head_dim * bytes_per_element
        return per_layer * num_layers
Memory Comparison
def memory_comparison():
    """
    Compare KV cache memory for standard vs sliding window attention
    across a range of sequence lengths (Mistral 7B configuration).

    Returns a dict keyed by sequence length with per-variant MB figures
    and the standard/sliding memory ratio.
    """
    num_layers = 32
    num_kv_heads = 8  # GQA with 8 KV heads
    head_dim = 128
    window_size = 4096

    def cache_bytes(effective_len):
        # 2 tensors (K, V) * heads * len * head_dim * 2 bytes (FP16) * layers
        return 2 * num_kv_heads * effective_len * head_dim * 2 * num_layers

    results = {}
    for seq_len in (1024, 4096, 8192, 16384, 32768, 65536, 131072):
        standard_bytes = cache_bytes(seq_len)
        sliding_bytes = cache_bytes(min(seq_len, window_size))
        results[seq_len] = {
            'standard_MB': standard_bytes / 1e6,
            'sliding_MB': sliding_bytes / 1e6,
            'ratio': standard_bytes / sliding_bytes if sliding_bytes > 0 else 0,
        }
    return results
KV Cache Memory: Standard vs Sliding Window (Mistral 7B, W=4096)
| Sequence Length | Standard KV Cache | Sliding Window KV Cache | Memory Savings |
|---|---|---|---|
| 1,024 | 128 MB | 128 MB | 1.0x (no savings) |
| 4,096 | 512 MB | 512 MB | 1.0x (at window size) |
| 8,192 | 1,024 MB | 512 MB | 2.0x |
| 32,768 | 4,096 MB | 512 MB | 8.0x |
| 131,072 | 16,384 MB | 512 MB | 32.0x |
At 128K tokens, sliding window attention uses 32x less KV cache memory than standard attention (512 MB vs 16.4 GB). The KV cache is bounded at the window size regardless of how long the sequence gets. This makes long-context serving feasible on smaller GPUs.
Information Propagation Across Layers
The Multi-Layer Receptive Field
def receptive_field_analysis():
    """
    Although each layer attends to only W tokens, stacking layers grows
    the effective receptive field: layer l reaches back roughly l * W
    tokens, because each layer's window attends over the previous
    layer's already-widened representations.

    For Mistral 7B (W=4096, 32 layers) the top layer's theoretical reach
    is 32 * 4096 = 131,072 tokens.

    Returns {layer: {'theoretical_max': ..., 'practical_coverage': ...}}.
    """
    window = 4096
    depth = 32
    full_context = 131072
    # Layer 1 covers 4,096 tokens; layer 8 -> 32,768; layer 16 -> 65,536;
    # layer 32 -> 131,072 (the full theoretical context).
    return {
        layer: {
            'theoretical_max': layer * window,
            'practical_coverage': min(layer * window, full_context),
        }
        for layer in range(1, depth + 1)
    }
def information_dilution():
    """
    The catch: signal from distant tokens must hop through many layers of
    attention + FFN, and each hop mixes it with ~4096 nearby tokens,
    diluting it. Direct, precise recall of distant tokens is therefore
    weaker than in standard attention, where token N can attend to
    token 0 directly.

    Returns a small illustrative model of that dilution per layer depth.
    """
    dilution_model = {}
    dilution_model['layer_0'] = {
        'direct_attention_range': '4096 tokens',
        'information_from_token_0': 'Direct (if within window)',
    }
    dilution_model['layer_16'] = {
        'direct_attention_range': '4096 tokens (positions 61440-65536)',
        'information_from_token_0': 'Indirect: passed through 16 layers of mixing',
        'estimated_signal_strength': '~30% of direct attention',
    }
    dilution_model['layer_32'] = {
        'direct_attention_range': '4096 tokens (positions 126976-131072)',
        'information_from_token_0': 'Highly indirect: 32 layers of mixing',
        'estimated_signal_strength': '~5-10% of direct attention',
    }
    return dilution_model
Quality Impact: Needle in a Haystack
def needle_in_haystack_swa():
    """
    Needle-in-a-haystack: plant a fact at position P in a long context,
    then ask about it at the end.

    Standard attention can attend to P directly; sliding window attention
    must relay the fact across layers, so recall degrades with distance.
    Returns one entry per distance regime.
    """
    # (name, needle distance, standard recall, SWA recall, note)
    rows = [
        ('within_window', 2000, 0.98, 0.97,
         'Nearly identical — needle is in direct attention range'),
        ('slightly_outside', 8000, 0.96, 0.89,
         'SWA starts degrading — needs 2 hops'),
        ('far_outside', 32000, 0.93, 0.72,
         'Significant degradation — 8 hops required'),
        ('very_far', 100000, 0.88, 0.45,
         'Major degradation — information heavily diluted'),
    ]
    return {
        name: {
            'needle_distance': distance,
            'standard_recall': standard,
            'sliding_window_recall': sliding,
            'note': note,
        }
        for name, distance, standard, sliding, note in rows
    }
Needle-in-Haystack Recall: Standard vs Sliding Window Attention
Sliding window attention degrades significantly for precise recall of distant tokens. At 32K tokens (8x the window), recall drops to 72%. For most NLP tasks (summarization, QA, code generation), this is acceptable because relevant context is usually nearby. For tasks requiring precise recall of specific distant facts, standard attention is superior.
Mistral 7B: Complete Architecture
class Mistral7BConfig:
    """
    Full Mistral 7B configuration: transformer backbone, grouped query
    attention, sliding window attention, and the headline differences
    versus Llama 2 7B.
    """

    def __init__(self):
        # --- Transformer backbone ---
        self.d_model = 4096
        self.num_layers = 32
        self.num_heads = 32
        self.head_dim = 128
        self.d_ff = 14336
        self.vocab_size = 32000
        # --- Grouped Query Attention: 32 query heads share 8 KV heads ---
        self.num_kv_heads = 8
        self.gqa_ratio = 4
        # --- Sliding Window Attention ---
        self.sliding_window = 4096
        self.max_position = 32768  # Handles sequences up to 32K tokens
        # --- Activation (SwiGLU FFN) ---
        self.activation = 'silu'
        # --- Rotary positional encoding ---
        self.rope = True
        self.rope_theta = 10000.0
        # --- Total parameter count ---
        self.total_params = 7.3e9
        # --- Key innovations over Llama 2 7B ---
        self.innovations = [
            "Sliding window attention (W=4096)",
            "Grouped query attention (8 KV heads vs 32)",
            "Larger FFN (14336 vs 11008)",
            "Better training data and hyperparameters",
        ]
def mistral_vs_llama2():
    """
    Mistral 7B vs Llama 2 7B: same parameter class, different design
    choices. Returns {dimension: {'mistral': ..., 'llama2': ...}}.
    """
    mistral = {
        'params': '7.3B', 'layers': 32, 'd_model': 4096, 'heads': 32,
        'kv_heads': 8, 'd_ff': 14336, 'attention': 'SWA (W=4096)',
        'context': '32K (effective)', 'kv_cache_at_8K': '512 MB',
    }
    llama2 = {
        'params': '6.7B', 'layers': 32, 'd_model': 4096, 'heads': 32,
        'kv_heads': 32, 'd_ff': 11008, 'attention': 'Full causal',
        'context': '4096', 'kv_cache_at_8K': '1024 MB',
    }
    # Zip the two side-by-side, keeping the dimension order above.
    return {
        key: {'mistral': mistral[key], 'llama2': llama2[key]}
        for key in mistral
    }
Mistral 7B vs Llama 2 7B/13B
| Benchmark | Mistral 7B | Llama 2 7B | Llama 2 13B | Winner |
|---|---|---|---|---|
| MMLU (5-shot) | 60.1% | 45.3% | 54.8% | Mistral (+5.3% vs 13B) |
| HumanEval | 30.5% | 12.8% | 18.3% | Mistral |
| GSM-8K | 52.2% | 14.6% | 28.7% | Mistral |
| HellaSwag | 81.3% | 77.2% | 80.7% | Mistral (marginal) |
| ARC-C | 78.0% | 67.3% | 72.4% | Mistral |
| KV cache (8K ctx) | 512 MB | 1024 MB | 1536 MB | Mistral (2-3x less) |
Rolling Buffer KV Cache
Implementation
class RollingBufferKVCache:
    """
    Rolling buffer KV cache for sliding window attention: a fixed-size
    circular buffer that overwrites the oldest entries once full, so the
    cache never holds more than `max_window` positions per layer.

    This implementation supports batch size 1 only (it stores no batch
    dimension). The original silently discarded entries beyond batch
    index 0; update() now raises instead.
    """

    def __init__(self, max_window, num_layers, num_kv_heads, head_dim, dtype=torch.float16):
        self.max_window = max_window
        self.num_layers = num_layers
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        # Pre-allocated storage: [num_layers, 2, num_kv_heads, max_window, head_dim]
        # where axis 1 indexes K (0) and V (1).
        self.buffer = torch.zeros(
            num_layers, 2, num_kv_heads, max_window, head_dim,
            dtype=dtype
        )
        self.position = 0  # Next write slot (circular); once full, also the oldest entry
        self.length = 0    # Number of valid entries (saturates at max_window)

    def update(self, layer_idx, new_k, new_v):
        """
        Append new K, V entries, overwriting the oldest slots when full.

        new_k, new_v: [batch, num_kv_heads, num_new_tokens, head_dim],
        with batch == 1.

        Raises:
            ValueError: if the batch dimension is not 1 (entries beyond
                batch 0 would otherwise be silently dropped).
        """
        if new_k.shape[0] != 1 or new_v.shape[0] != 1:
            raise ValueError("RollingBufferKVCache supports batch size 1 only")
        num_new = new_k.shape[2]
        # Sequential writes so that even when num_new > max_window the
        # last W tokens deterministically win (fancy indexing with
        # duplicate indices would not guarantee that).
        for i in range(num_new):
            write_pos = (self.position + i) % self.max_window
            self.buffer[layer_idx, 0, :, write_pos, :] = new_k[0, :, i, :]
            self.buffer[layer_idx, 1, :, write_pos, :] = new_v[0, :, i, :]
        self.position = (self.position + num_new) % self.max_window
        self.length = min(self.length + num_new, self.max_window)

    def get_kv(self, layer_idx):
        """
        Return (K, V) as [1, num_kv_heads, length, head_dim], ordered
        oldest-to-newest; only valid entries (up to the window size).
        """
        if self.length < self.max_window:
            # Buffer not yet full: valid entries are a contiguous prefix.
            k = self.buffer[layer_idx, 0, :, :self.length, :]
            v = self.buffer[layer_idx, 1, :, :self.length, :]
        else:
            # Full buffer: self.position points at the oldest slot, so
            # rotate the circular contents into chronological order.
            start = self.position
            indices = [(start + i) % self.max_window for i in range(self.max_window)]
            k = self.buffer[layer_idx, 0, :, indices, :]
            v = self.buffer[layer_idx, 1, :, indices, :]
        return k.unsqueeze(0), v.unsqueeze(0)  # Re-add the batch dim

    def memory_usage_bytes(self):
        """Total bytes held by the pre-allocated buffer."""
        return self.buffer.nelement() * self.buffer.element_size()
def rolling_buffer_demo():
    """
    Walk through the rolling buffer with a tiny window (W=4).

    Feeding tokens A..H into a window of 4:
        after D the buffer holds [A B C D] (full, length=4);
        E overwrites A -> [E B C D]; F overwrites B -> [E F C D];
        G overwrites C -> [E F G D]; H overwrites D -> [E F G H].
    Token H therefore attends to E, F, G, H — the last W=4 tokens,
    which is exactly the sliding window behavior.
    """
    demo_window = 4  # Small window so the rollover is easy to trace
    return RollingBufferKVCache(
        max_window=demo_window, num_layers=1, num_kv_heads=1, head_dim=4
    )
Mixtral 8x7B: First Open-Source MoE
Architecture
class Mixtral8x7BConfig:
    """
    Mixtral 8x7B: the Mistral 7B backbone with the dense FFN replaced by
    a mixture-of-experts FFN. Released December 2023 — the first
    competitive open-source MoE.
    """

    def __init__(self):
        # Attention stack, identical to Mistral 7B
        self.d_model = 4096
        self.num_layers = 32
        self.num_heads = 32
        self.head_dim = 128
        self.num_kv_heads = 8
        # MoE FFN replacing the dense FFN
        self.num_experts = 8
        self.top_k = 2
        self.expert_d_ff = 14336  # Each expert matches the Mistral 7B FFN size
        # Sliding window attention, same as Mistral 7B
        self.sliding_window = 4096
        # Total vs active parameter counts
        self.total_params = 46.7e9   # 8 experts * ~5.6B FFN + shared components
        self.active_params = 12.9e9  # 2 active experts + attention + embeddings
        # Vocabulary
        self.vocab_size = 32000

    def parameter_breakdown(self):
        """Return the parameter count (in billions) of each component."""
        # Q and O projections are full-width; K and V use the 8 KV heads.
        q_and_o = 2 * self.d_model * self.num_heads * self.head_dim
        k_and_v = 2 * self.d_model * self.num_kv_heads * self.head_dim
        attention_params = self.num_layers * (q_and_o + k_and_v)
        # SwiGLU uses three projections per expert: gate, up, down.
        per_expert = 3 * self.d_model * self.expert_d_ff
        expert_params = self.num_layers * self.num_experts * per_expert
        router_params = self.num_layers * self.d_model * self.num_experts
        embedding_params = 2 * self.vocab_size * self.d_model  # Input + output
        billion = 1e9
        return {
            'attention_B': attention_params / billion,
            'experts_B': expert_params / billion,
            'router_B': router_params / billion,
            'embedding_B': embedding_params / billion,
        }
Impact on the Open-Source Ecosystem
def mixtral_impact():
    """
    Mixtral 8x7B's influence on the open-source LLM ecosystem, as
    before/after/impact snapshots per contribution.
    """
    def entry(before, after, impact):
        return {'before': before, 'after': after, 'impact': impact}

    return {
        'proved_moe_works_open': entry(
            'MoE was only used by Google (GShard, Switch) behind closed APIs',
            'Open weights showed MoE quality. Community could study routing.',
            'Enabled DeepSeek, Kimi, and others to build on MoE openly',
        ),
        'efficiency_demonstration': entry(
            'Open models required large dense models for GPT-3.5 quality',
            'Mixtral matched Llama 2 70B with 12.9B active params (5.4x fewer)',
            'Established MoE as the path to efficient open-source frontier models',
        ),
        'inference_tooling': entry(
            'No open-source MoE serving infrastructure',
            'vLLM, TGI, llama.cpp added MoE support for Mixtral',
            'Created the serving ecosystem that DeepSeek V3 later used',
        ),
        'fine_tuning_paradigm': entry(
            'Fine-tuning MoE was unexplored in open source',
            'Community developed LoRA-for-MoE, expert-specific fine-tuning',
            'Established best practices for MoE adaptation',
        ),
    }
Mixtral 8x7B vs Dense Alternatives
| Benchmark | Mixtral 8x7B | Llama 2 70B | GPT-3.5 Turbo |
|---|---|---|---|
| MMLU (5-shot) | 70.6% | 69.8% | 70.0% |
| HumanEval | 40.2% | 29.9% | 48.1% |
| GSM-8K | 58.4% | 56.8% | 57.1% |
| HellaSwag | 81.8% | 85.3% | 78.4% |
| Active Params | 12.9B | 70B | ~20B (est.) |
| Inference FLOPs/token | ~25.8G | ~140G | ~40G |
Mistral’s Broader Model Family
def mistral_model_family():
    """
    Mistral's expanding model family, keyed by model, with release date,
    size, headline feature, licensing, and significance.
    """
    def entry(release, params, key_feature, open_weights, significance):
        return {
            'release': release,
            'params': params,
            'key_feature': key_feature,
            'open_weights': open_weights,
            'significance': significance,
        }

    return {
        'mistral_7B': entry(
            'September 2023', '7.3B',
            'Sliding window attention, GQA', True,
            'First "GPT-3.5 at 7B" claim',
        ),
        'mixtral_8x7B': entry(
            'December 2023', '46.7B (12.9B active)',
            'First open-source MoE', True,
            'Proved MoE works for open models',
        ),
        'mistral_small': entry(
            'February 2024', '~22B',
            'Optimized for cost-effective API serving', False,
            'Mistral starts monetizing through API',
        ),
        'mistral_large': entry(
            'February 2024', '~123B',
            'Frontier-quality, multilingual', False,
            'Competing with GPT-4 / Claude for enterprise',
        ),
        'codestral': entry(
            'May 2024', '~22B',
            'Code-specialized model', True,
            'Competitive with GPT-4 on code benchmarks',
        ),
        'mistral_nemo': entry(
            'July 2024', '12B',
            'Collaboration with NVIDIA, optimized for TensorRT-LLM', True,
            'Strong 12B model for practical deployment',
        ),
        'pixtral': entry(
            'September 2024', '12B',
            'Multimodal (vision + language)', True,
            'Mistral enters multimodal space',
        ),
    }
SWA Variants and Successors
Alternatives to Fixed Sliding Window
class AlternatingSWA:
    """
    Variant: alternate sliding-window and full-attention layers.

    Even layers use sliding window (local) attention; odd layers use
    full causal (global) attention, restoring direct long-range access
    at every other layer.
    """

    def __init__(self, d_model, num_heads, head_dim, window_size):
        self.window_size = window_size
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = head_dim

    def forward(self, Q, K, V, layer_idx):
        """Dispatch to local or global attention based on layer parity."""
        if layer_idx % 2 == 0:
            # Sliding window (local)
            return self._sliding_window_attention(Q, K, V)
        # Full causal (global)
        return self._full_causal_attention(Q, K, V)

    def _sliding_window_attention(self, Q, K, V):
        """Banded causal attention: token i sees positions (i-W, i]."""
        seq_len = Q.shape[2]
        d_k = Q.shape[3]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
        # Vectorized banded mask on the scores' device (replaces the
        # original O(seq_len) Python loop of CPU-side slice writes):
        # position j is visible from i iff 0 <= i - j < W.
        pos = torch.arange(seq_len, device=scores.device)
        dist = pos.unsqueeze(1) - pos.unsqueeze(0)
        scores.masked_fill_((dist < 0) | (dist >= self.window_size), float('-inf'))
        return torch.matmul(F.softmax(scores, dim=-1), V)

    def _full_causal_attention(self, Q, K, V):
        """Standard causal attention: token i sees positions [0, i]."""
        seq_len = Q.shape[2]
        d_k = Q.shape[3]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
        # Build the mask on scores.device so GPU inputs work too.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=scores.device),
            diagonal=1,
        )
        scores.masked_fill_(causal_mask, float('-inf'))
        return torch.matmul(F.softmax(scores, dim=-1), V)
class ChunkedPrefillWithSWA:
    """
    Chunked prefill for sliding window attention: process the prompt in
    window-sized chunks, carrying KV state across chunks through the
    rolling buffer cache.
    """

    def __init__(self, window_size=4096):
        self.W = window_size

    def prefill(self, tokens, model):
        """
        Run a long prompt through the model one W-sized chunk at a time.

        Each chunk attends fully within itself and to the previous window
        of tokens via the rolling buffer KV cache. Returns the final
        hidden state and the populated cache.
        """
        total = len(tokens)
        kv_cache = RollingBufferKVCache(
            max_window=self.W,
            num_layers=model.num_layers,
            num_kv_heads=model.num_kv_heads,
            head_dim=model.head_dim,
        )
        for start in range(0, total, self.W):
            chunk_tokens = tokens[start:start + self.W]
            # Forward pass: attention sees the current chunk + rolling buffer.
            hidden = model.forward_chunk(chunk_tokens, kv_cache)
            # Push this chunk's new K/V entries into the circular cache.
            for layer_idx in range(model.num_layers):
                new_k, new_v = model.get_kv_for_layer(layer_idx, hidden)
                kv_cache.update(layer_idx, new_k, new_v)
        return hidden, kv_cache
What Mistral Changed
def mistral_legacy():
    """
    Mistral's lasting contributions to the open-source LLM ecosystem,
    each with an adoption status, supporting detail, and a verdict.
    """
    def item(status, detail, verdict):
        return {'status': status, 'detail': detail, 'verdict': verdict}

    return {
        'sliding_window_adoption': item(
            'Partially adopted',
            'Some models use SWA variants. Others (Llama 3, DeepSeek) '
            'use full attention with GQA, arguing the quality tradeoff '
            'is not worth the memory savings at their scale.',
            'SWA remains useful for small-model deployment where '
            'KV cache memory is the bottleneck.',
        ),
        'gqa_standardization': item(
            'Universally adopted',
            'Mistral 7B used 8 KV heads (4:1 GQA ratio). '
            'Every major model since uses GQA.',
            'GQA is now the default. Mistral helped establish this.',
        ),
        'moe_open_source': item(
            'Transformative',
            'Mixtral created the open MoE ecosystem. '
            'Inference frameworks, fine-tuning tools, and serving '
            'infrastructure all trace back to Mixtral support.',
            'Without Mixtral, the open MoE ecosystem would be '
            'years behind where it is.',
        ),
        'open_weight_standard': item(
            'Widely adopted',
            'Mistral released weights under Apache 2.0, setting '
            'the standard for truly open models (not just "community license").',
            'Influenced Meta, DeepSeek, and others toward more '
            'permissive licensing.',
        ),
    }
Mistral’s influence on the LLM landscape extends beyond any single architectural innovation. Sliding window attention proved that attention does not need to be quadratic in practice. Mixtral proved MoE could work in the open. The combination accelerated the entire open-source frontier by at least a year. While newer models have moved beyond SWA (preferring GQA with full attention and larger contexts), the serving infrastructure and MoE ecosystem that Mistral and Mixtral created remains the foundation for everything that followed.