By mid-2025, the frontier model landscape has crystallized. A handful of design choices are now universal — every serious LLM uses them. Others remain hotly contested, with different labs making different bets. Understanding this convergence-divergence pattern tells you which problems are solved and where the next breakthroughs will come from.

This post surveys the major frontier models of 2025, maps their architectural choices, and identifies the patterns.

The Universal Stack: What Every Frontier Model Shares

These choices are effectively settled. Deviating from them results in measurably worse models:


The Settled Stack (Universal Across Frontier Models)

| Component | Consensus Choice | Why It Won | Series Reference |
|---|---|---|---|
| Normalization | RMSNorm, Pre-Norm | 10-15% faster than LayerNorm, more stable than Post-Norm | Transformer Anatomy Part 7 |
| FFN Activation | SwiGLU | Consistent quality improvement over GELU/ReLU | Transformer Anatomy Part 9 |
| Position Encoding | RoPE | Enables context extension, no learned parameters | Transformer Anatomy Part 4 |
| Tokenization | BPE, 100K+ vocab | Optimal compression, multilingual coverage | Transformer Anatomy Part 2 |
| Architecture | Causal decoder-only | Best for generation, simplest to scale | Transformer Anatomy Part 13 |
| Weight Tying | Input/output embeddings shared | Free memory savings | Transformer Anatomy Part 11 |
| Training Precision | BF16 or FP8 | Range matters more than precision for stability | Inference Timeline (mixed precision) |
| Residual Connections | Identity skip + scaling | Non-negotiable for depth | Transformer Anatomy Part 8 |

If you are designing a new LLM in 2025 and deviate from ANY of these choices, you need a very strong reason. The Transformer Anatomy series covers each of these in detail.

The Big Split: MoE vs Dense

The most significant architectural divergence in 2025:


Frontier Models: MoE vs Dense

| Model | Type | Total Params | Activated | Activation Ratio |
|---|---|---|---|---|
| Kimi K2 | MoE | 1,000B | 32B | 3.2% |
| DeepSeek V3 | MoE | 671B | 37B | 5.5% |
| MiniMax-01 | MoE | 456B | 45.9B | 10.1% |
| Qwen 2.5 72B | Dense | 72B | 72B | 100% |
| Llama 3.1 405B | Dense | 405B | 405B | 100% |
| GPT-4o (rumored) | MoE | ~1.8T | ~200B | ~11% |

MoE is winning for open models because it delivers frontier quality at 5-10x lower training cost. DeepSeek V3 matches Llama 3.1 405B quality at roughly 1/10 the training compute. Kimi K2 reaches frontier performance with 32B activated parameters out of 1T total.

Dense persists where serving simplicity matters (Meta’s Llama: no expert parallelism overhead) or where training budget is not the constraint.
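The economics above follow from sparse activation: a router selects a few experts per token, so compute scales with activated parameters while capacity scales with total parameters. A minimal sketch of top-k routing, using hypothetical sizes rather than any particular model's config:

```python
import numpy as np

def topk_moe_layer(x, expert_weights, gate_weights, k=2):
    """Route one token through the top-k of n experts (toy MoE FFN layer).

    x: (d,) token activation; expert_weights: (n, d, d); gate_weights: (n, d).
    """
    logits = gate_weights @ x                      # (n,) router scores
    topk = np.argsort(logits)[-k:]                 # indices of the k best experts
    probs = np.exp(logits[topk] - logits[topk].max())
    probs /= probs.sum()                           # softmax over selected experts
    # Only k of n expert matmuls run -- the source of MoE's compute savings
    return sum(p * (expert_weights[i] @ x) for p, i in zip(probs, topk))

rng = np.random.default_rng(0)
n, d, k = 64, 16, 2                                # hypothetical: 64 experts, pick 2
x = rng.standard_normal(d)
out = topk_moe_layer(x, rng.standard_normal((n, d, d)),
                     rng.standard_normal((n, d)), k=k)
print(out.shape, f"activated fraction ~ {k/n:.1%}")  # 2 of 64 experts fire
```

Real frontier routers add load-balancing terms (or DeepSeek's loss-free balancing) on top of this, but the activated-fraction arithmetic is the same.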

Training Compute Efficiency: Quality per FLOP

(quality per training FLOP (relative))
Llama 3.1 405B (dense) Baseline
100 quality per training FLOP (relative)
DeepSeek V3 671B (MoE) 8.5x more efficient
850 quality per training FLOP (relative)
Kimi K2 1T (MoE) ~7x more efficient
700 quality per training FLOP (relative)
The MoE Serving Tax

MoE models are cheaper to train but harder to serve. All 1T parameters must be accessible in memory even though only 32B fire per token. Expert parallelism adds inter-node communication latency. For latency-sensitive serving, a dense 70B model can outperform a 1T MoE model on time-to-first-token despite lower quality. The tradeoff depends on your SLO.
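The serving tax is easy to quantify. A rough memory sketch, assuming FP8 weights (1 byte/param) and ignoring KV cache and activations:

```python
def weight_memory_gb(total_params_b, bytes_per_param=1):
    """Memory just to hold the weights, in GB (FP8 = 1 byte/param assumed)."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

dense_70b = weight_memory_gb(70)    # 70 GB: fits comfortably on one multi-GPU node
moe_1t = weight_memory_gb(1000)     # 1000 GB: must be sharded across nodes,
                                    # even though only ~32 GB of it fires per token
print(dense_70b, moe_1t)
```

The dense model pays for all its memory with compute every token; the MoE model pays for memory (and the inter-node traffic to reach it) that mostly sits idle.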

Attention Mechanism Divergence

The second major split: how to handle the KV cache memory problem.


Attention Mechanisms Across Frontier Models

| Model | Attention Type | KV Heads | KV Cache Reduction | Max Context |
|---|---|---|---|---|
| Llama 3.1 | GQA-8 | 8 of 64 | 8x vs MHA | 128K |
| DeepSeek V3 | MLA | Latent d_c=512 | ~23x vs MHA | 128K |
| Kimi K2 | MLA | Latent (similar to DSV3) | ~23x vs MHA | 128K |
| MiniMax-01 | Lightning (linear) | N/A (no KV cache) | No quadratic scaling | 4M |
| Qwen 2.5 | GQA | Varies by size | 4-8x vs MHA | 128K |

Three distinct approaches:

  1. GQA (Llama, Qwen): Simple, well-understood. 8x KV reduction. Sufficient for 128K context.
  2. MLA (DeepSeek, Kimi): Aggressive compression via latent vectors. 23x reduction. Same context length but far more concurrent users.
  3. Linear Attention (MiniMax): Eliminate quadratic scaling entirely. Enables 1M+ context. But slight quality tradeoff on short-context tasks.

The trend: MLA is becoming the default for high-performance MoE models. GQA remains standard for dense models and smaller MoE variants. Linear attention is niche but critical for very long context.
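The cache sizes behind these ratios are straightforward to compute. A sketch using a hypothetical 64-head, 128-dim config — the GQA figure matches the table's 8x, while the MLA ratio depends heavily on the baseline config (the decoupled RoPE-key width of 64 is an assumption), so treat that number as illustrative:

```python
def kv_elems_per_token_per_layer(n_kv_heads, head_dim):
    # One K vector and one V vector cached per KV head
    return 2 * n_kv_heads * head_dim

head_dim, n_heads = 128, 64                            # hypothetical MHA baseline
mha = kv_elems_per_token_per_layer(n_heads, head_dim)  # 16384 elems/token/layer
gqa8 = kv_elems_per_token_per_layer(8, head_dim)       # 2048 elems -> 8x smaller
mla = 512 + 64   # one shared latent (d_c=512) + one decoupled RoPE key (assumed 64)
print(mha / gqa8, round(mha / mla, 1))                 # 8x GQA, ~28x MLA here
```

Multiply by layer count, sequence length, and bytes per element to get the full cache footprint — at 128K context this difference is what determines how many concurrent users fit on a node.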

Context Length Race

Maximum Context Length by Model (2025)

| Model | Max Context |
|---|---|
| Llama 3.1 | 128K |
| Kimi K2 | 128K |
| Claude 3.5+ | 200K |
| Gemini 1.5 Pro | 1M |
| MiniMax-01 (training) | 1M |
| MiniMax-01 (inference) | 4M |

Most frontier models cluster around 128K. The outliers — Gemini at 1M, MiniMax at 4M — use fundamentally different approaches (specialized long-context training and linear attention respectively).

For most production applications, 128K is sufficient. The long-context models serve specific niches: entire codebases, multi-document analysis, long-form content processing.

Training Innovations

Optimizers

The optimizer landscape is shifting:


Optimizer Choices Across Frontier Models

| Model | Optimizer | Key Innovation |
|---|---|---|
| Llama 3.1 | AdamW | Standard, well-understood |
| DeepSeek V3 | AdamW + FP8 training | First 671B model in FP8 without instability |
| Kimi K2 | MuonClip | Newton-style optimizer scaled to 1T, zero instabilities |
| Qwen 2.5 | AdamW | Standard |

MuonClip (Kimi K2) is the most novel: it uses Newton-Schulz orthogonalization for preconditioning with a clipping mechanism for stability. The result — training 1T parameters on 15.5T tokens with zero instabilities — is a significant engineering achievement.
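The orthogonalization step can be sketched with the classic Newton-Schulz polar iteration, which pushes a matrix's singular values toward 1 using only matmuls. This is an illustration of the idea, not MuonClip's exact update — Kimi uses tuned iteration coefficients plus the clipping mechanism on top:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the polar factor UV^T of g via X <- 1.5*X - 0.5*X X^T X."""
    x = g / (np.linalg.norm(g) + 1e-7)   # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # each singular value s -> 1.5s - 0.5s^3 -> 1
    return x

g = np.array([[3.0, 0.0], [1.0, 2.0]])   # stand-in for a gradient matrix
q = newton_schulz_orthogonalize(g)
print(np.round(q.T @ q, 3))              # ~identity: the update is orthogonalized
```

The appeal for 1T-scale training is that this preconditioner is pure matmul (GPU-friendly, no eigendecompositions) and normalizes the update's scale across directions.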

FP8 Training

DeepSeek V3 proved that FP8 training works at 671B scale. This nearly doubles training throughput compared to BF16. Expect FP8 to become standard for large-scale training in 2025-2026.

Multi-Token Prediction

DeepSeek V3’s multi-token prediction training objective — predicting tokens 2, 3, …, K steps ahead — provides richer gradients AND enables self-speculative decoding at inference. This dual benefit makes it likely to be adopted broadly.
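Mechanically, the objective just adds extra shifted targets per position. A sketch of target construction for prediction depth K (hypothetical layout, not DeepSeek's exact head architecture):

```python
def multi_token_targets(tokens, max_depth):
    """At position t, the depth-d head predicts tokens[t + d] (d=1 is standard LM loss)."""
    return {d: [(t, tokens[t + d]) for t in range(len(tokens) - d)]
            for d in range(1, max_depth + 1)}

toks = ["the", "cat", "sat", "on", "the", "mat"]
targets = multi_token_targets(toks, max_depth=3)
# Depth 1 is the usual next-token loss; depths 2-3 supply the extra training signal
print(targets[2][0])   # position 0's depth-2 target pair
```

At inference, the same depth-d heads can draft d tokens ahead for the main model to verify — that is the self-speculative decoding benefit.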

Reasoning Models

The emergence of reasoning models (o1, DeepSeek R1, QwQ) represents a paradigm shift from scaling training compute to scaling inference compute:


Reasoning Models: Training vs Inference Compute

| Model | Approach | Inference Tokens per Query | Quality Gain |
|---|---|---|---|
| Standard LLM | Direct answer | 50-500 | Baseline |
| o1 | Internal chain-of-thought | 1K-50K | +15-30% on math/code |
| DeepSeek R1 | RL-trained reasoning | 1K-20K | +10-25% on reasoning |
| QwQ | Long-form reasoning | 2K-10K | +8-20% on reasoning |
Note: Quality gain varies by task. Reasoning models excel on math, code, and logic but offer minimal improvement on creative or retrieval tasks.

Reasoning models change the inference cost equation dramatically — covered in detail in Inference Optimization Timeline Part 13.
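The cost shift can be bounded from the table's token counts alone (assuming equal per-token price for simplicity):

```python
def cost_multiplier(reasoning_tokens, standard_tokens):
    """How many times more decode compute a reasoning query consumes."""
    return reasoning_tokens / standard_tokens

# Table figures: standard LLM 50-500 tokens, o1-style 1K-50K tokens per query
low = cost_multiplier(1_000, 500)    # best case: 2x
high = cost_multiplier(50_000, 50)   # worst case: 1000x
print(low, high)
```

A two-to-three-order-of-magnitude spread per query is why inference-time compute allocation (when to think longer, when to stop) is now an optimization target in its own right.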

Agentic Capabilities

2025 frontier models are increasingly designed for agentic use — tool calling, multi-step planning, code execution:

SWE-bench Verified Scores (Agentic Coding)

| Model | % Solved |
|---|---|
| Claude 3.5 Sonnet | 49 |
| GPT-4o | 38 |
| DeepSeek V3 | 42 |
| Kimi K2 (best open model) | 65.8 |

Kimi K2’s dominance on agentic benchmarks suggests that MoE architecture provides a natural advantage: different experts can specialize in different tools, code patterns, and reasoning modes. The diverse expert pool gives the model more “tools in its toolbox” for complex multi-step tasks.

What Comes Next

Based on the convergence-divergence patterns:

Near-certain (2025-2026):

  • MoE becomes the default for all models above 100B activated parameters
  • FP8 training becomes standard (following DeepSeek’s proof-of-concept)
  • MLA or similar KV compression replaces GQA for high-end models
  • Reasoning capabilities become a standard feature, not a separate model

Likely (2026-2027):

  • FP4 training on next-gen hardware (NVIDIA Blackwell natively supports FP4)
  • 1M+ context becomes mainstream through improved linear attention or hybrid architectures
  • Expert counts reach 1000+ with even finer-grained specialization
  • Self-speculative decoding (via multi-token prediction) replaces separate draft models

Speculative (2027+):

  • Attention-free architectures (Mamba descendants) for specific domains
  • Continuous learning during inference (not just during training)
  • Hardware co-designed with specific attention mechanisms (custom silicon for MLA or linear attention)

💡 Where to Innovate

If you’re looking for research directions, focus on the divergence points: attention mechanisms (MLA vs linear vs hybrid), expert routing strategies (loss-free balancing, dynamic expert allocation), and inference-time compute allocation (when to think more, when to stop). These are the active frontiers where the next breakthroughs will emerge. The settled components (RMSNorm, SwiGLU, RoPE, BPE) offer little room for improvement.

This completes the Frontier Model Architectures series. Combined with the Transformer Anatomy series (which explains every component from first principles) and the Inference Optimization Timeline (which covers how to make these models fast), you now have a comprehensive map of the LLM landscape in 2025 — where it came from, where it is, and where it is going.