Part of Series: Frontier Model Architectures

Chinese Frontier Models: DeepSeek, Qwen, Yi, and Kimi — Architecture Comparison

DeepSeek-V3’s Multi-head Latent Attention compresses KV cache from 1.8 GB to 128 MB per batch — a 93% reduction that makes 128K context windows practical for production serving. Western labs (OpenAI, Anthropic, Meta) have not published equivalent KV compression techniques, giving Chinese models a structural cost advantage. When MLA-equipped models serve the same traffic at one-tenth the memory footprint, the TCO gap is not incremental; it is existential for deployment at scale.

Model Overview

📊 Chinese Frontier Model Specifications

| Model | Total Params | Active Params | Architecture | Context Length | Release |
|---|---|---|---|---|---|
| DeepSeek-V3 | 671B | 37B | MoE + MLA | 128K | 2024-12 |
| DeepSeek-R1 | 671B | 37B | MoE + MLA + CoT | 128K | 2025-01 |
| Qwen2.5-72B | 72B | 72B | Dense | 128K | 2024-09 |
| Qwen2.5-MoE-57B | 57B | 14B | MoE | 128K | 2024-09 |
| Yi-Lightning | ~200B | ~30B | MoE | 16K | 2024-07 |
| Kimi 1.5 | ~200B | Unknown | Dense/MoE | 2M | 2025-01 |

DeepSeek-V3: Multi-head Latent Attention

DeepSeek-V3’s most significant innovation is MLA (Multi-head Latent Attention), which compresses the KV cache through low-rank projection.

Standard Multi-Head Attention KV Cache

In standard attention as deployed in Llama-class models (GQA with 8 KV heads), each layer stores the full K and V projections:

# GQA: KV cache per token per layer
# K: [num_kv_heads, head_dim] = [8, 128] = 1,024 values
# V: [num_kv_heads, head_dim] = [8, 128] = 1,024 values
# Total: 2,048 values * 2 bytes (FP16) = 4,096 bytes per token per layer

MLA: Compressed KV Cache

MLA projects KV into a low-dimensional latent space:

import torch
import torch.nn as nn
from typing import Optional

class MultiHeadLatentAttention(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int,
                 head_dim: int, kv_lora_rank: int):
        super().__init__()
        self.hidden_dim = hidden_dim      # 7168
        self.num_heads = num_heads        # 128
        self.head_dim = head_dim          # 128
        self.kv_lora_rank = kv_lora_rank  # 512

        # Compressed KV projection
        # Instead of projecting to [num_kv_heads * head_dim] directly,
        # project to low-rank [kv_lora_rank]
        self.kv_compress = nn.Linear(hidden_dim, kv_lora_rank)
        # Then decompress during attention
        self.kv_decompress_k = nn.Linear(kv_lora_rank,
                                          num_heads * head_dim)
        self.kv_decompress_v = nn.Linear(kv_lora_rank,
                                          num_heads * head_dim)

        # Query projection (standard)
        self.q_proj = nn.Linear(hidden_dim, num_heads * head_dim)
        self.o_proj = nn.Linear(num_heads * head_dim, hidden_dim)

    def forward(self, x: torch.Tensor,
                kv_cache: Optional[torch.Tensor] = None):
        # kv_cache plumbing is omitted for clarity; in DeepSeek-V3 only the
        # compressed latent (plus a small decoupled RoPE key) is cached
        batch, seq_len, _ = x.shape

        # Query: standard projection
        q = self.q_proj(x).view(batch, seq_len, self.num_heads, self.head_dim)

        # KV: compress to low-rank latent
        kv_latent = self.kv_compress(x)
        # Shape: [batch, seq_len, kv_lora_rank]
        # This is what gets CACHED — only 512 values instead of 32,768

        # Decompress for attention computation
        k = self.kv_decompress_k(kv_latent).view(
            batch, seq_len, self.num_heads, self.head_dim
        )
        v = self.kv_decompress_v(kv_latent).view(
            batch, seq_len, self.num_heads, self.head_dim
        )

        # Standard attention (flash_attention stands in for any fused
        # scaled-dot-product attention kernel)
        attn_out = flash_attention(q, k, v)
        return self.o_proj(attn_out.reshape(batch, seq_len, -1))

The KV cache savings:

# Standard MHA (Llama 70B equivalent):
# Cache per token per layer = 2 * num_kv_heads * head_dim * dtype
# = 2 * 8 * 128 * 2 = 4,096 bytes

# MLA (DeepSeek-V3):
# Cache per token per layer = kv_lora_rank * dtype
# = 512 * 2 = 1,024 bytes

# Compression ratio: 4,096 / 1,024 = 4x
# But DeepSeek uses 128 attention heads (not GQA), so vs standard MHA:
# Standard 128-head: 2 * 128 * 128 * 2 = 65,536 bytes
# MLA: 512 * 2 = 1,024 bytes
# Compression: 64x
📊 KV Cache per Token per Layer

| Model | Attention Type | Cache Bytes/Token/Layer | vs DeepSeek MLA |
|---|---|---|---|
| Llama 70B (GQA, 8 KV heads) | GQA | 4,096 | 4.0x more |
| Llama 405B (GQA, 8 KV heads) | GQA | 4,096 | 4.0x more |
| GPT-4 (est. MHA, 96 heads) | MHA | 49,152 | 48.0x more |
| DeepSeek-V3 (MLA, rank 512) | MLA | 1,024 | 1.0x (baseline) |
| Qwen2.5-72B (GQA, 8 KV heads) | GQA | 4,096 | 4.0x more |

Performance

MLA enables DeepSeek-V3 to cache 4-64x more tokens in the same GPU memory compared to standard attention models. For a 128K context, MLA saves roughly 7.7 GiB of KV cache per layer compared to standard 128-head MHA; across 61 layers that is about 470 GiB of savings, enough headroom to serve roughly 4x more concurrent sequences.
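The savings arithmetic can be reproduced in a few lines, using the dimensions quoted in this section (128 heads, head dim 128, latent rank 512, 61 layers, FP16, 128K tokens):

```python
# KV cache footprint: 128-head MHA vs MLA latent caching (FP16, GiB units).
TOKENS, LAYERS, FP16_BYTES = 128_000, 61, 2

mha_per_tok = 2 * 128 * 128 * FP16_BYTES  # K+V, 128 heads x head_dim 128 = 65,536 B
mla_per_tok = 512 * FP16_BYTES            # only the rank-512 latent is cached = 1,024 B

saved_per_layer_gib = (mha_per_tok - mla_per_tok) * TOKENS / 2**30
total_saved_gib = saved_per_layer_gib * LAYERS

print(f"{saved_per_layer_gib:.2f} GiB/layer, {total_saved_gib:.0f} GiB total")
# 7.69 GiB/layer, 469 GiB total
```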

DeepSeek-V3: Auxiliary-Loss-Free MoE

Traditional MoE uses an auxiliary loss to encourage balanced expert utilization:

# Traditional auxiliary loss (Mixtral, Switch Transformer)
def aux_loss(router_probs, expert_assignments, num_experts):
    """router_probs: [tokens, num_experts] softmax output;
    expert_assignments: [tokens, num_experts] one-hot top-k mask."""
    # Fraction of tokens assigned to each expert
    f = expert_assignments.float().mean(0)  # [num_experts]
    # Routing probability mass for each expert
    p = router_probs.mean(0)  # [num_experts]
    # Encourages a uniform assignment distribution; added to the LM loss
    return num_experts * (f * p).sum()

DeepSeek-V3 replaces this with a bias-based approach:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxLossFreeRouter(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int,
                 top_k: int):
        super().__init__()
        self.num_experts = num_experts
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Learnable expert biases — NOT trained by gradient descent
        # Updated by a running average of expert load
        self.expert_bias = nn.Parameter(
            torch.zeros(num_experts), requires_grad=False
        )
        self.top_k = top_k
        self.bias_update_speed = 0.001

    def forward(self, x: torch.Tensor):
        # Raw routing scores
        scores = self.gate(x)  # [batch, seq, num_experts]

        # Add bias to encourage balanced routing
        biased_scores = scores + self.expert_bias

        # Select top-K experts
        top_k_scores, top_k_indices = torch.topk(
            biased_scores, self.top_k, dim=-1
        )

        # Normalize over the selected experts. Per the DeepSeek-V3 report,
        # the bias steers selection only; gate weights use the raw scores
        weights = F.softmax(scores.gather(-1, top_k_indices), dim=-1)

        # Update bias based on load (no gradient needed)
        with torch.no_grad():
            load = torch.zeros(self.num_experts, device=x.device)
            load.scatter_add_(0, top_k_indices.view(-1),
                             torch.ones_like(top_k_indices.view(-1),
                                            dtype=torch.float))
            avg_load = load.mean()
            # Increase bias for underloaded experts
            # Decrease bias for overloaded experts
            self.expert_bias += self.bias_update_speed * (avg_load - load)

        return weights, top_k_indices

This avoids the auxiliary loss interfering with the main training objective while still maintaining balanced expert utilization.
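A toy simulation illustrates the mechanism (pure Python; the scores, expert count, and update speed here are made up for illustration, not DeepSeek's): even with a router heavily skewed toward one expert, the load-driven bias update pushes utilization toward uniform.

```python
import random

random.seed(0)
NUM_EXPERTS, TOP_K, STEPS = 8, 2, 500
# Fixed, skewed router scores: expert 0 dominates without any correction.
tokens = [[random.gauss(0, 1) for _ in range(NUM_EXPERTS)] for _ in range(256)]
for scores in tokens:
    scores[0] += 3.0

bias = [0.0] * NUM_EXPERTS
update_speed = 0.001
load = [0] * NUM_EXPERTS
for _ in range(STEPS):
    load = [0] * NUM_EXPERTS
    for scores in tokens:
        biased = [s + b for s, b in zip(scores, bias)]
        # top-k selection on the biased scores
        for e in sorted(range(NUM_EXPERTS), key=biased.__getitem__)[-TOP_K:]:
            load[e] += 1
    avg = sum(load) / NUM_EXPERTS
    # increase bias for underloaded experts, decrease for overloaded ones
    bias = [b + update_speed * (avg - l) for b, l in zip(bias, load)]

frac = [l / sum(load) for l in load]
print([round(f, 2) for f in frac])   # roughly uniform (~0.12 each)
print(round(bias[0], 1))             # strongly negative: offsets expert 0's edge
```

Without the bias, expert 0 would absorb about half of all top-2 slots; with it, the final load fractions hover near 1/8 each.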

Qwen2.5: Dense Scaling with Data Engineering

Qwen2.5 takes a different approach: instead of architectural novelty, it scales a standard dense transformer with massive data curation.

# Qwen2.5-72B architecture (published)
class Qwen25Config:
    hidden_dim = 8192
    num_layers = 80
    num_attention_heads = 64
    num_kv_heads = 8          # GQA with 8 KV heads
    head_dim = 128
    intermediate_size = 29568  # MLP hidden dim
    vocab_size = 152064       # Large multilingual vocab
    max_position_embeddings = 131072  # 128K context
    rope_theta = 1000000.0    # High RoPE base for long context

    # Notable: standard architecture, no MoE, no MLA
    # Competitive through data quality and scale

Qwen’s Technical Innovations

# Large vocabulary (152K tokens)
# - Covers Chinese, English, code, and 29 other languages
# - Reduces sequence length for CJK text by ~30% vs Llama 3's 128K vocab
# - Trade-off: larger embedding table (152K * 8192 * 2 = 2.5 GB)

# YaRN for long context
# - Extends RoPE from 4K to 128K through interpolation
# - No architectural change, just modified position encoding

# Attention sink tokens
# - First few tokens get disproportionate attention
# - Qwen preserves these as "sink" tokens during KV cache eviction
# - Prevents quality degradation in long conversations
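The sink-token idea (popularized by the StreamingLLM work) can be written as an eviction policy. The function below is illustrative only: Qwen's exact eviction logic is not published, and real deployments use ~4K-token windows rather than these toy sizes.

```python
def evict_kv(cache_positions: list[int], num_sink: int = 4,
             window: int = 8) -> list[int]:
    """Keep the first num_sink positions (attention sinks) plus the most
    recent window positions; evict everything in between.
    Illustrative policy only, with toy sink/window sizes."""
    if len(cache_positions) <= num_sink + window:
        return cache_positions
    return cache_positions[:num_sink] + cache_positions[-window:]

print(evict_kv(list(range(20))))
# [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

Dropping the sink positions instead would remove the tokens that soak up excess attention mass, which is exactly the degradation this trick prevents.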
📊 Qwen2.5 Vocabulary Efficiency (Tokens per 1K Characters)

| Language | Llama 3 (128K vocab) | Qwen2.5 (152K vocab) | Efficiency Gain |
|---|---|---|---|
| English | 267 | 258 | 3.4% |
| Chinese | 450 | 312 | 30.7% |
| Japanese | 520 | 345 | 33.7% |
| Python code | 285 | 270 | 5.3% |
| Mixed CJK+English | 380 | 290 | 23.7% |

The 30% token reduction for Chinese text directly translates to 30% less KV cache usage and 30% faster prefill for Chinese-language workloads.
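Using the Chinese row of the table (450 vs 312 tokens per 1K characters), the per-token GQA cache figure of 4,096 bytes/layer from earlier, and an assumed 100K-character document (a hypothetical workload, not a published benchmark), the memory effect works out as:

```python
# Chinese tokenization rates from the table (tokens per 1K characters).
llama_rate, qwen_rate = 450, 312
chars = 100_000            # hypothetical 100K-character Chinese document
kv_per_tok_layer = 4096    # GQA-8, FP16 (same for both models)
layers = 80                # both Qwen2.5-72B and Llama 3-70B use 80 layers

def kv_gib(rate: int) -> float:
    tokens = chars / 1000 * rate
    return tokens * kv_per_tok_layer * layers / 2**30

reduction = 1 - qwen_rate / llama_rate
print(f"token reduction: {reduction:.1%}")                        # 30.7%
print(f"KV cache: {kv_gib(llama_rate):.1f} vs {kv_gib(qwen_rate):.1f} GiB")
# KV cache: 13.7 vs 9.5 GiB
```

Because KV cache scales linearly with token count, the percentage saved in tokens carries over one-to-one to cache bytes and prefill FLOPs.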

Yi-Lightning: Efficient Training Architecture

Yi-Lightning from 01.AI focuses on training efficiency — achieving competitive quality with fewer compute resources:

# Yi-Lightning architecture (inferred from public information)
class YiLightningConfig:
    # MoE architecture
    total_params = "~200B"
    active_params = "~30B"
    num_experts = 64
    active_experts = 4

    # Efficient training innovations:
    # 1. Progressive training: start small, grow model
    # 2. Staged data mixing: different data ratios per stage
    # 3. Expert initialization from dense checkpoint

Progressive Training

Yi-Lightning uses a grow-and-train strategy:

# Stage 1: Train dense 7B model on 2T tokens
# Stage 2: "Grow" to MoE by replicating FFN into 64 experts
# Stage 3: Continue training MoE on 500B tokens
# Stage 4: Fine-tune on high-quality instruction data

import copy

import torch
import torch.nn as nn

def grow_dense_to_moe(dense_model: nn.Module, num_experts: int):
    """Convert dense FFN layers to MoE layers."""
    for layer in dense_model.layers:
        dense_ffn = layer.ffn

        # Create experts by copying the dense FFN
        experts = nn.ModuleList([
            copy.deepcopy(dense_ffn) for _ in range(num_experts)
        ])

        # Add noise to break symmetry
        for expert in experts:
            for param in expert.parameters():
                param.data += torch.randn_like(param) * 0.01

        # Replace the dense FFN with a MoE layer (MoELayer: any standard
        # top-k routed implementation; not defined in this sketch)
        layer.ffn = MoELayer(
            router=nn.Linear(dense_ffn.hidden_dim, num_experts),
            experts=experts,
            top_k=4
        )

    return dense_model

This approach is 3-5x cheaper than training a MoE model from scratch because the experts start from a strong initialization.
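A rough sanity check with the 6ND approximation (training compute C ≈ 6 × params × tokens, counting active parameters only) shows the FLOPs side of this claim. The token budgets come from the staged plan above; the from-scratch budget of 2.5T tokens is an assumption for comparison.

```python
def train_flops(active_params: float, tokens: float) -> float:
    """C ~= 6 * N * D rule of thumb (N = active params, D = training tokens)."""
    return 6 * active_params * tokens

B, T = 1e9, 1e12
progressive = train_flops(7 * B, 2.0 * T) + train_flops(30 * B, 0.5 * T)
from_scratch = train_flops(30 * B, 2.5 * T)   # same 2.5T, all at MoE size

print(f"{from_scratch / progressive:.1f}x")   # ~2.6x from FLOPs alone
```

FLOPs alone give about 2.6x under these assumptions; the 3-5x figure quoted above additionally credits the warm start with needing fewer MoE-stage tokens to reach the same quality.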

Kimi: Extreme Long Context

Kimi by Moonshot AI supports 2M token context windows. This requires fundamentally different attention computation:

# Kimi's approach to 2M context (inferred from behavior)
class KimiLongContext:
    def __init__(self):
        self.max_context = 2_000_000

        # Cannot use standard attention: O(n^2) for 2M tokens
        # (a 2M x 2M score matrix is ~4e12 entries per head, infeasible)

        # Uses hierarchical attention:
        # Level 1: Local window attention (4K window)
        # Level 2: Strided attention (every 16th token)
        # Level 3: Global tokens (CLS-style summary tokens)

    def hierarchical_attention(self, q, k, v, seq_len):
        # sliding_window_attention, attention, global_summary_attention, and
        # self.gate_network are placeholders for the corresponding kernels
        # Local attention: each token attends to +/- 2048 neighbors
        local_out = sliding_window_attention(q, k, v, window=4096)

        # Strided attention: every 16th token across full context
        stride = 16
        strided_k = k[:, ::stride, :, :]
        strided_v = v[:, ::stride, :, :]
        strided_out = attention(q, strided_k, strided_v)

        # Global tokens: summary tokens attend to everything
        # These are inserted every 4096 tokens
        global_out = global_summary_attention(q, k, v, interval=4096)

        # Combine via gating
        gate = self.gate_network(local_out, strided_out, global_out)
        return gate[0] * local_out + gate[1] * strided_out + gate[2] * global_out
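A back-of-envelope count of attention score entries per head at 2M tokens, using the window, stride, and interval from the sketch above, shows where the savings come from (and that the strided level still dominates):

```python
n = 2_000_000
window, stride, interval = 4096, 16, 4096

full    = n * n               # dense attention: ~4e12 score entries per head
local   = n * window          # sliding-window level
strided = n * (n // stride)   # every query vs every 16th key
global_ = (n // interval) * n # ~488 summary tokens attend to everything

hier = local + strided + global_
print(f"dense/hierarchical: {full / hier:.1f}x")  # ~15.4x
```

The compute reduction is modest because the strided level is still O(n²/stride); the larger win is KV memory, since distant context can be retained only at stride-16 granularity plus the summary tokens.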
📊 KV Cache Requirements by Context Length

| Context Length | Standard Attn Cache | GQA (8 heads) Cache | MLA Cache | Kimi (est.) Cache |
|---|---|---|---|---|
| 4K | 1.0 GB | 0.13 GB | 0.03 GB | 0.10 GB |
| 32K | 8.0 GB | 1.0 GB | 0.25 GB | 0.75 GB |
| 128K | 32.0 GB | 4.0 GB | 1.0 GB | 2.8 GB |
| 512K | 128.0 GB | 16.0 GB | 4.0 GB | 8.5 GB |
| 2M | 512.0 GB | 64.0 GB | 16.0 GB | 28.0 GB |

KV Cache at 128K Context (GB)

MLA (DeepSeek): 1.0, Kimi (hierarchical): 2.8, GQA-8 (Qwen/Llama): 4.0, standard MHA: 32.0

Serving Performance Comparison

📊 Serving Throughput — 4xA100-80GB (or equivalent)

| Model | GPU Config | FP16 Throughput | INT4 Throughput | p50 TPOT (batch=1) |
|---|---|---|---|---|
| DeepSeek-V3 | 8xA100 EP=8 | 8,200 tok/s | N/A (FP8 only) | 8.2 ms |
| Qwen2.5-72B | 4xA100 TP=4 | 5,100 tok/s | 8,400 tok/s | 14.1 ms |
| Qwen2.5-MoE-57B | 4xA100 EP=4 | 7,800 tok/s | 10,200 tok/s | 9.8 ms |
| Yi-Lightning | 4xA100 EP=4 | 6,500 tok/s | N/A | 11.5 ms |
| Llama 70B (reference) | 4xA100 TP=4 | 5,120 tok/s | 8,350 tok/s | 12.5 ms |

DeepSeek-V3’s MLA attention reduces the KV cache bottleneck during decode, resulting in lower TPOT despite being a much larger model (671B total). Qwen2.5-MoE-57B achieves high throughput by combining MoE efficiency with a compact total parameter count.

Quality Benchmarks

Architecture matters for serving, but quality determines if the model is worth serving:

📊 Benchmark Scores (Higher is Better)

| Model | MMLU | HumanEval | GSM8K | MATH | MT-Bench |
|---|---|---|---|---|---|
| DeepSeek-V3 | 87.1 | 82.6 | 91.5 | 61.6 | 8.8 |
| DeepSeek-R1 | 90.8 | 86.2 | 97.3 | 79.8 | — |
| Qwen2.5-72B | 85.3 | 86.4 | 91.6 | 60.4 | 8.6 |
| Yi-Lightning | 82.8 | 74.3 | 86.0 | 51.2 | 8.1 |
| Llama 3.1-70B | 82.0 | 80.5 | 84.5 | 50.4 | 8.3 |
| GPT-4o | 88.7 | 90.2 | 95.8 | 76.6 | 9.1 |

DeepSeek-R1 (reasoning variant) achieves GPT-4o-level scores while being servable on 8 GPUs. Qwen2.5-72B matches Llama 70B quality with better multilingual performance, especially for CJK languages.

Deployment Considerations

deployment_guide = {
    "DeepSeek-V3": {
        "min_gpus": 8,          # 671B INT4 ≈ 335 GB (FP8 weights are ~671 GB)
        "recommended": "8xA100-80GB or 8xH100",
        "parallelism": "EP=8 (one expert group per GPU)",
        "quantization": "FP8 official, INT4 community",
        "serving_framework": "vLLM, SGLang (MLA support needed)",
        "kv_cache_advantage": "4x more concurrent seqs vs Llama 70B",
        "complexity": "High — MLA + MoE requires specialized kernels"
    },
    "Qwen2.5-72B": {
        "min_gpus": 2,          # 72B FP16 = 144 GB, or INT4 on 2 GPUs
        "recommended": "4xA100-80GB TP=4",
        "parallelism": "TP=2 or TP=4",
        "quantization": "GPTQ, AWQ, FP8 all supported",
        "serving_framework": "vLLM, TGI, SGLang (standard Llama-like)",
        "kv_cache_advantage": "None vs Llama (same GQA design)",
        "complexity": "Low — standard dense transformer"
    },
    "Qwen2.5-MoE-57B": {
        "min_gpus": 2,          # 57B FP16 = 114 GB
        "recommended": "4xA100-80GB EP=4",
        "parallelism": "EP=4 or EP=2+TP=2",
        "quantization": "GPTQ, AWQ supported",
        "serving_framework": "vLLM (MoE support)",
        "kv_cache_advantage": "None (standard GQA)",
        "complexity": "Medium — standard MoE routing"
    }
}
💡 Tip

For most production deployments serving Chinese and English content, Qwen2.5-72B is the pragmatic choice: standard architecture (every framework supports it), strong multilingual performance, and the 152K vocabulary reduces token count for CJK text by 30%. DeepSeek-V3 offers higher quality but requires 2x the GPUs and specialized MLA kernel support.

Innovation Impact on the Field

Each model contributes specific techniques that are being adopted more broadly:

innovations_and_adoption = {
    "MLA (DeepSeek)": {
        "impact": "93% KV cache reduction with minimal quality loss",
        "adoption": "Being integrated into vLLM, SGLang",
        "limitation": "Requires custom attention kernels",
        "future": "Likely standard for models > 100B params"
    },
    "Auxiliary-loss-free routing (DeepSeek)": {
        "impact": "Better MoE training stability",
        "adoption": "Referenced in subsequent MoE papers",
        "limitation": "Bias update requires careful tuning",
        "future": "May replace auxiliary loss in most MoE models"
    },
    "Large multilingual vocab (Qwen)": {
        "impact": "30% token reduction for CJK languages",
        "adoption": "Llama 3 also expanded to 128K vocab",
        "limitation": "Larger embedding table",
        "future": "Vocab size converging to 128-256K across all models"
    },
    "Progressive MoE training (Yi)": {
        "impact": "3-5x training cost reduction",
        "adoption": "Referenced in efficiency-focused research",
        "limitation": "Expert diversity may be limited by initialization",
        "future": "Standard for cost-constrained MoE training"
    }
}

Training Cost Estimate (Relative to Llama 70B)

Yi-Lightning: 0.6, Llama 70B: 1.0, Qwen2.5-72B: 1.2, DeepSeek-V3: 2.8, Llama 405B: 5.8

Summary

Chinese frontier models introduce distinct architectural innovations that affect serving characteristics. DeepSeek-V3’s MLA compresses KV cache by 4-64x versus standard attention, enabling 4x more concurrent sequences on the same GPU memory — the most impactful serving optimization among these models. Its auxiliary-loss-free MoE routing improves expert utilization without training instability. Qwen2.5-72B offers the simplest deployment path: standard dense transformer with GQA, compatible with all serving frameworks, and its 152K vocabulary reduces CJK token counts by 30%. Yi-Lightning’s progressive MoE training from a dense checkpoint reduces training cost by 3-5x. Kimi’s hierarchical attention enables 2M context at manageable memory cost. For production deployments, Qwen2.5-72B (INT4 on 2 GPUs) offers the best simplicity-to-quality ratio, while DeepSeek-V3 (FP8 on 8 GPUs) provides the highest quality with a KV cache efficiency advantage that scales with concurrency.