Llama 3.1 405B scores 88.6% on MMLU. GPT-4o scores 88.7%. For the first time in LLM history, the open-closed capability gap has compressed to rounding error on general knowledge benchmarks. Closed models still hold a narrow lead on the hardest reasoning benchmarks (GPQA Diamond: 59.4% for the best closed model vs 59.1% for the best open model) and on safety and alignment, but the 18-month lag that once separated open from closed has collapsed to 3-6 months. The cost implications are seismic: you can now self-host near-frontier intelligence at a fraction of API prices.
Defining the Landscape
Open-Weight Models
Open-weight models release their trained parameters publicly. Users can download, run, fine-tune, and modify them without API access.
OPEN_WEIGHT_MODELS_2026 = {
"Llama 3.1 405B": {
"lab": "Meta",
"params": "405B",
"architecture": "Dense",
"license": "Llama 3.1 Community License",
"available_since": "July 2024",
},
"DeepSeek V3": {
"lab": "DeepSeek",
"params": "671B (37B active)",
"architecture": "MoE",
"license": "MIT-like",
"available_since": "December 2024",
},
"Qwen 2.5 72B": {
"lab": "Alibaba",
"params": "72B",
"architecture": "Dense",
"license": "Apache 2.0",
"available_since": "September 2024",
},
"Llama 4 (projected)": {
"lab": "Meta",
"params": "Unknown (likely MoE)",
"architecture": "MoE (rumored)",
"license": "Expected open-weight",
"available_since": "2025",
},
"Kimi K2": {
"lab": "Moonshot AI",
"params": "1T (32B active)",
"architecture": "MoE",
"license": "Permissive",
"available_since": "2025",
},
}
Closed-Source Models
Closed models are accessible only through APIs. The trained weights are not released.
CLOSED_SOURCE_MODELS_2026 = {
"GPT-4o": {
"lab": "OpenAI",
"params": "Unknown (rumored 1.8T MoE)",
"architecture": "Unknown (rumored MoE)",
"access": "API only",
},
"Claude 3.5 Sonnet": {
"lab": "Anthropic",
"params": "Unknown",
"architecture": "Dense (likely)",
"access": "API only",
},
"Gemini 2.0": {
"lab": "Google",
"params": "Unknown",
"architecture": "Likely MoE, multimodal",
"access": "API only",
},
"Grok-3": {
"lab": "xAI",
"params": "Unknown (1T+ estimated)",
"architecture": "MoE",
"access": "API + X Premium",
},
}
Where Open Models Match Closed
General Knowledge (MMLU)
On MMLU, the flagship knowledge benchmark, open models have reached parity:
MMLU Scores (5-shot)
| Model | Type | MMLU Score | Gap from Best |
|---|---|---|---|
| GPT-4o | Closed | 88.7 | - |
| Llama 3.1 405B | Open | 88.6 | -0.1 |
| DeepSeek V3 | Open | 88.5 | -0.2 |
| Claude 3.5 Sonnet | Closed | 88.3 | -0.4 |
| Qwen 2.5 72B | Open | 85.3 | -3.4 |
At the 405B / 671B scale, open models match GPT-4o on general knowledge. The gap is within noise. Even the smaller Qwen 2.5 72B is within 4 points.
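The "within noise" claim can be checked directly. MMLU has roughly 14,000 test questions (an approximate count, used here as an assumption), so the binomial standard error on a single score is larger than the observed gaps. A minimal sketch:

```python
import math

def mmlu_standard_error(score_pct, n_questions=14_000):
    """Binomial standard error of a benchmark accuracy, in percentage points."""
    p = score_pct / 100
    return math.sqrt(p * (1 - p) / n_questions) * 100

se = mmlu_standard_error(88.7)   # standard error around GPT-4o's score
ci_95 = 1.96 * se                # 95% confidence half-width

print(f"standard error: {se:.2f} pts, 95% CI: +/-{ci_95:.2f} pts")
# The 0.1-point GPT-4o vs Llama 3.1 405B gap sits well inside this interval.
```

Any gap under about half a point is statistically indistinguishable from zero on a benchmark of this size.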
Coding (HumanEval, MBPP)
Code generation is another area where open models have caught up:
def code_benchmark_analysis():
"""
Systematic code benchmark comparison.
"""
results = {
"HumanEval": {
"DeepSeek V3": 92.7,
"Claude 3.5 Sonnet": 92.0,
"GPT-4o": 90.2,
"Llama 3.1 405B": 89.0,
"Qwen 2.5 72B": 86.6,
},
"MBPP": {
"DeepSeek V3": 90.1,
"GPT-4o": 88.5,
"Qwen 2.5 72B": 88.2,
"Claude 3.5 Sonnet": 87.8,
"Llama 3.1 405B": 82.4,
},
}
return results
HumanEval Scores: Open vs Closed
[Chart: HumanEval pass@1 (%)]
DeepSeek V3 leads HumanEval, ahead of all closed models. The coding gap has effectively closed for standard benchmarks.
Mathematics
On mathematical reasoning, the picture is more nuanced:
Math Benchmark Comparison
| Benchmark | DeepSeek V3 (Open) | Claude 3.5 (Closed) | GPT-4o (Closed) | Llama 405B (Open) |
|---|---|---|---|---|
| GSM8K | 97.8 | 96.4 | 95.8 | 96.8 |
| MATH-500 | 90.2 | 78.3 | 76.6 | 73.8 |
| AIME 2024 | 39.2 | 16.0 | 26.7 | 23.3 |
DeepSeek V3 leads on math benchmarks by a significant margin. This is an area where the best open model is clearly ahead of every closed model in this comparison.
Where Closed Models Still Lead
Complex Reasoning (Hard Benchmarks)
On the most challenging reasoning benchmarks, closed models maintain an edge — but the lead is shrinking:
def reasoning_gap_analysis():
"""
Analyze the reasoning gap between open and closed models.
"""
hard_benchmarks = {
"GPQA Diamond": {
"Claude 3.5 Sonnet": 59.4,
"DeepSeek V3": 59.1,
"GPT-4o": 53.6,
"Llama 3.1 405B": 51.1,
"gap": "Closed=59.4, Open=59.1 (gap: 0.3 points)",
},
"ARC-Challenge (hard reasoning)": {
"GPT-4o": 96.4,
"Claude 3.5 Sonnet": 96.7,
"DeepSeek V3": 95.2,
"Llama 3.1 405B": 93.2,
"gap": "Closed=96.7, Open=95.2 (gap: 1.5 points)",
},
}
return hard_benchmarks
Safety and Alignment
Closed models have a persistent advantage in safety properties:
Safety and Alignment Comparison
| Metric | Claude 3.5 Sonnet | GPT-4o | DeepSeek V3 | Llama 3.1 405B |
|---|---|---|---|---|
| Harmful request refusal | 95%+ | 90%+ | 75% | 80% |
| Jailbreak resistance | High | High | Moderate | Moderate |
| Truthfulness (TruthfulQA) | 88.5 | 85.2 | 82.0 | 78.4 |
| Calibration quality | Strong | Good | Moderate | Weak |
| Instruction following | Excellent | Excellent | Good | Good |
Closed models invest substantially more in alignment. Anthropic’s Constitutional AI pipeline and OpenAI’s extensive RLHF produce measurably safer models. Open models like DeepSeek V3 and Llama 3.1 have basic safety training but are more susceptible to jailbreaks and more willing to produce potentially harmful content. For safety-critical applications (healthcare, legal, education for minors), this gap matters.
Agentic Capabilities
For complex multi-step tasks requiring tool use, planning, and self-correction, closed models currently lead:
def agentic_capability_comparison():
"""
Compare agentic (multi-step, tool-using) capabilities.
"""
capabilities = {
"tool_use_accuracy": {
"Claude 3.5 Sonnet": 88,
"GPT-4o": 85,
"DeepSeek V3": 78,
"Llama 3.1 405B": 72,
"notes": "% of correct tool invocations on complex tasks",
},
"multi_step_planning": {
"Claude 3.5 Sonnet": 82,
"GPT-4o": 80,
"DeepSeek V3": 74,
"Llama 3.1 405B": 68,
"notes": "% of plans that lead to correct outcome",
},
"self_correction": {
"Claude 3.5 Sonnet": 75,
"GPT-4o": 72,
"DeepSeek V3": 65,
"Llama 3.1 405B": 58,
"notes": "% of times model corrects its own errors when prompted",
},
}
return capabilities
Cost Analysis: Self-Hosted Open vs API Closed
Cost Model
def cost_comparison(
monthly_tokens_M=100, # 100M tokens per month
gpu_cost_per_hour=2.5, # A100 80GB spot price
api_costs=None,
):
"""
Compare cost of self-hosting open models vs using closed APIs.
"""
if api_costs is None:
api_costs = {
# Per million tokens (input/output average)
"GPT-4o": 7.50, # $2.50 input, $10.00 output, averaged
"Claude 3.5 Sonnet": 9.00, # $3 input, $15 output, averaged
"DeepSeek V3 API": 0.55, # DeepSeek's own API pricing
}
# Self-hosted configurations
self_hosted = {
"Llama 3.1 70B (INT4, 1x A100)": {
"gpus": 1,
"throughput_tps": 25, # tokens per second, batch 1
"monthly_tokens_M": 25 * 3600 * 24 * 30 / 1e6, # ~64.8M
"monthly_cost": 1 * gpu_cost_per_hour * 24 * 30, # $1,800
},
"DeepSeek V3 (INT4, 4x A100)": {
"gpus": 4,
"throughput_tps": 15,
"monthly_tokens_M": 15 * 3600 * 24 * 30 / 1e6, # ~38.9M
"monthly_cost": 4 * gpu_cost_per_hour * 24 * 30, # $7,200
},
"Llama 3.1 405B (FP16, 12x A100)": {
"gpus": 12,
"throughput_tps": 30,
"monthly_tokens_M": 30 * 3600 * 24 * 30 / 1e6, # ~77.8M
"monthly_cost": 12 * gpu_cost_per_hour * 24 * 30, # $21,600
},
"Qwen 2.5 72B (INT4, 1x A100)": {
"gpus": 1,
"throughput_tps": 22,
"monthly_tokens_M": 22 * 3600 * 24 * 30 / 1e6, # ~57.0M
"monthly_cost": 1 * gpu_cost_per_hour * 24 * 30, # $1,800
},
}
# API costs for 100M tokens/month
api_monthly_costs = {
name: cost * monthly_tokens_M
for name, cost in api_costs.items()
}
# Self-hosted cost per million tokens
self_hosted_per_M = {
name: cfg["monthly_cost"] / cfg["monthly_tokens_M"]
for name, cfg in self_hosted.items()
}
return {
"api_monthly": api_monthly_costs,
"self_hosted_monthly": {n: c["monthly_cost"] for n, c in self_hosted.items()},
"self_hosted_per_M_tokens": self_hosted_per_M,
}
Cost Comparison: 100M Tokens/Month
| Model | Type | Monthly Cost | Cost/1M Tokens | Quality Tier |
|---|---|---|---|---|
| GPT-4o API | Closed API | $750 | $7.50 | Frontier |
| Claude 3.5 Sonnet API | Closed API | $900 | $9.00 | Frontier |
| DeepSeek V3 API | Open API | $55 | $0.55 | Frontier |
| Llama 3.1 70B (self-hosted) | Self-hosted | $1,800 | $0.03* | Strong |
| Qwen 2.5 72B (self-hosted) | Self-hosted | $1,800 | $0.03* | Strong |
| Llama 3.1 405B (self-hosted) | Self-hosted | $21,600 | $0.28* | Frontier |
*Self-hosted per-token costs assume continuous batching at full utilization, where aggregate throughput is far above the batch-1 figures in the code; at batch-1 rates, per-token cost is orders of magnitude higher. Actual costs depend heavily on utilization.
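The footnote matters more than the table: self-hosted per-token cost is a pure function of GPU cost, batched throughput, and utilization. A sketch of that relationship (the throughput and utilization numbers are illustrative assumptions, not measurements):

```python
def effective_cost_per_m_tokens(monthly_gpu_cost, batched_tps, utilization):
    """Cost per million tokens for a self-hosted deployment.

    batched_tps: aggregate tokens/sec under continuous batching (far above
    batch-1 throughput); utilization: fraction of the month the GPUs are
    actually serving traffic.
    """
    tokens_m = batched_tps * 3600 * 24 * 30 * utilization / 1e6
    return monthly_gpu_cost / tokens_m

# Hypothetical: a $1,800/month single-A100 deployment at an assumed
# 2,000 tok/s batched throughput and 70% utilization.
cost = effective_cost_per_m_tokens(1800, batched_tps=2000, utilization=0.7)
print(f"${cost:.2f} per 1M tokens")  # ~ $0.50 per 1M tokens
```

High utilization and heavy batching, not raw GPU price, are what make the low per-token figures in the table achievable.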
The Crossover Point
def find_cost_crossover(
self_hosted_monthly_fixed, # GPU rental cost
self_hosted_capacity_M, # Max tokens per month
api_cost_per_M, # API cost per million tokens
):
"""
Find the monthly token volume where self-hosting becomes cheaper than API.
"""
# API cost = api_cost_per_M * volume_M
# Self-hosted cost = self_hosted_monthly_fixed (regardless of volume)
# Crossover: api_cost_per_M * volume_M = self_hosted_monthly_fixed
crossover_M = self_hosted_monthly_fixed / api_cost_per_M
return {
"crossover_tokens_M": crossover_M,
"self_hosted_capacity_M": self_hosted_capacity_M,
"makes_sense": crossover_M <= self_hosted_capacity_M,
}
# Examples:
# Llama 70B ($1,800/month) vs GPT-4o ($7.50/M tokens)
#   Crossover: $1,800 / $7.50 = 240M tokens/month
#   Batch-1 capacity: ~65M tokens/month
#   Verdict: batch-1 throughput cannot reach the crossover; batched serving
#   at roughly 4x the batch-1 rate is needed before self-hosting breaks even
# Llama 70B ($1,800/month) vs Claude 3.5 Sonnet ($9/M tokens)
#   Crossover: $1,800 / $9 = 200M tokens/month
Monthly Cost vs Token Volume (Log Scale)
[Chart: USD/month vs monthly token volume]
Self-hosting open models is cost-effective when: (1) volume exceeds 50M+ tokens/month, (2) you need data privacy (no data leaves your infrastructure), (3) you need to fine-tune for a specific domain, or (4) you need guaranteed latency SLOs. For low volume or rapid prototyping, APIs are cheaper and simpler.
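The four criteria above can be folded into a one-line policy check; a minimal sketch (the 50M tokens/month threshold is the one stated in the text):

```python
def should_self_host(monthly_tokens_m, needs_privacy=False,
                     needs_fine_tuning=False, needs_latency_slo=False):
    """Return True when self-hosting an open model is likely the right call."""
    hard_requirements = needs_privacy or needs_fine_tuning or needs_latency_slo
    return hard_requirements or monthly_tokens_m > 50

print(should_self_host(10))                      # low volume, no constraints: False
print(should_self_host(10, needs_privacy=True))  # privacy forces self-hosting: True
```

Note that any single hard requirement overrides the volume threshold: privacy, fine-tuning, and latency needs cannot be bought from an API at any price.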
The Narrowing Gap Trend
Historical Gap Analysis
def gap_history():
"""
Track the open-closed gap over time.
"""
timeline = {
"2023-03 (GPT-4 launch)": {
"best_closed_mmlu": 86.4, # GPT-4
"best_open_mmlu": 70.0, # Llama 65B
"gap": 16.4,
},
"2023-07 (Llama 2)": {
"best_closed_mmlu": 86.4,
"best_open_mmlu": 68.9, # Llama 2 70B
"gap": 17.5,
},
"2024-01 (Mixtral)": {
"best_closed_mmlu": 86.4,
"best_open_mmlu": 75.0, # Mixtral 8x7B
"gap": 11.4,
},
"2024-04 (Llama 3)": {
"best_closed_mmlu": 87.0, # GPT-4 Turbo
"best_open_mmlu": 82.0, # Llama 3 70B
"gap": 5.0,
},
"2024-07 (Llama 3.1)": {
"best_closed_mmlu": 88.7, # GPT-4o
"best_open_mmlu": 88.6, # Llama 3.1 405B
"gap": 0.1,
},
"2024-12 (DeepSeek V3)": {
"best_closed_mmlu": 88.7,
"best_open_mmlu": 88.5, # DeepSeek V3
"gap": 0.2,
},
"2025-06 (projected)": {
"best_closed_mmlu": 90.0, # Estimated
"best_open_mmlu": 89.5, # Estimated
"gap": 0.5,
},
}
return timeline
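Pulling just the gap column out of the timeline shows how fast the collapse was; a quick check using the document's own (non-projected) numbers:

```python
# MMLU gap (best closed minus best open), from the timeline above.
gap_timeline = {
    "2023-03": 16.4, "2023-07": 17.5, "2024-01": 11.4,
    "2024-04": 5.0, "2024-07": 0.1, "2024-12": 0.2,
}

# From the Llama 2-era peak to Llama 3.1, the gap shrank ~175x in 12 months.
fold_reduction = gap_timeline["2023-07"] / gap_timeline["2024-07"]
print(f"{fold_reduction:.0f}x reduction, Jul 2023 -> Jul 2024")
```

The slight uptick at the end (0.1 to 0.2) is within benchmark noise; the story is the order-of-magnitude collapse in a single year.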
MMLU Gap: Closed vs Open (Best Models)
[Chart: MMLU score gap (closed minus open)]
Gap by Capability Domain
def gap_by_domain():
"""
Current gap between best open and best closed model, by domain.
"""
gaps = {
"general_knowledge": {
"best_open": ("DeepSeek V3", 88.5),
"best_closed": ("GPT-4o", 88.7),
"gap": 0.2,
"status": "closed",
},
"coding": {
"best_open": ("DeepSeek V3", 92.7),
"best_closed": ("Claude 3.5", 92.0),
"gap": -0.7, # Open leads
"status": "open leads",
},
"math": {
"best_open": ("DeepSeek V3", 90.2),
"best_closed": ("Claude 3.5", 78.3),
"gap": -11.9, # Open leads
"status": "open leads significantly",
},
"safety": {
"best_open": ("Llama 3.1 405B", 80),
"best_closed": ("Claude 3.5", 95),
"gap": 15,
"status": "closed leads significantly",
},
"instruction_following": {
"best_open": ("DeepSeek V3", 85),
"best_closed": ("Claude 3.5", 92),
"gap": 7,
"status": "closed leads",
},
"multilingual_chinese": {
"best_open": ("Qwen 2.5 72B", 86.1),
"best_closed": ("GPT-4o", 78.5),
"gap": -7.6, # Open leads
"status": "open leads",
},
"long_context": {
"best_open": ("DeepSeek V3", "128K"),
"best_closed": ("Gemini 1.5 Pro", "1M"),
"gap": "8x context length advantage for closed",
"status": "closed leads on length",
},
}
return gaps
Capability Gap Summary: Open vs Closed (2025-2026)
| Domain | Leader | Gap Size | Trend |
|---|---|---|---|
| General knowledge (MMLU) | Tie | 0.2 points | Gap closed |
| Coding (HumanEval) | Open (DeepSeek V3) | Open leads by 0.7 | Narrowed, now reversed |
| Math (MATH-500) | Open (DeepSeek V3) | Open leads by 12 | Open pulled ahead |
| Safety/alignment | Closed (Claude) | Closed leads by 15% | Persistent gap |
| Instruction following | Closed (Claude) | Closed leads by 7% | Narrowing slowly |
| Multilingual (Chinese) | Open (Qwen) | Open leads by 7.6 | Open dominates |
| Long context | Closed (Gemini) | 8x context advantage | Narrowing (128K vs 1M) |
| Agentic/tool use | Closed (Claude) | Closed leads by ~10% | Narrowing |
Why the Gap Persists (Where It Does)
Safety and Alignment
def why_safety_gap_persists():
"""
Structural reasons the safety gap between open and closed models persists.
"""
reasons = {
"rlhf_investment": {
"closed_labs": "Millions of dollars on RLHF data and compute. "
"Anthropic has 100+ people on alignment.",
"open_labs": "Basic DPO/GRPO with limited preference data. "
"Alignment is not the primary research focus.",
},
"iterative_red_teaming": {
"closed_labs": "Multiple rounds of red-teaming with internal and "
"external teams, feeding back into training.",
"open_labs": "Single round of safety training, limited red-teaming.",
},
"incentive_structure": {
"closed_labs": "API providers face legal/reputational risk from unsafe outputs. "
"Strong incentive to invest in safety.",
"open_labs": "Releasing weights means you cannot control deployment. "
"Less incentive (and ability) to invest in post-deployment safety.",
},
"constitutional_ai": {
"closed_labs": "Anthropic's CAI scales safety training with AI-generated data.",
"open_labs": "No equivalent methodology widely adopted in open-source.",
},
}
return reasons
Agentic Capabilities
def why_agentic_gap_persists():
"""
Why closed models lead on agentic/tool-use tasks.
"""
reasons = {
"tool_use_training_data": {
"description": "Closed labs have proprietary tool-use datasets from "
"API usage logs. Open labs must construct synthetic data.",
},
"system_prompt_optimization": {
"description": "Closed models are optimized for specific system prompts "
"that enable tool use patterns. Open models receive generic "
"instruction-following training.",
},
"multi_turn_optimization": {
"description": "Agentic tasks involve many turns of interaction. "
"Closed labs optimize for multi-turn coherence; "
"open models are typically evaluated on single-turn.",
},
}
return reasons
Unique Advantages of Open Models
Fine-Tuning
def open_model_advantages():
"""
Advantages that open models have over closed models.
"""
advantages = {
"fine_tuning": {
"description": "Train on your domain-specific data",
"impact": "10-30% improvement on domain-specific tasks",
"examples": "Medical, legal, financial, code-specific domains",
"with_closed_models": "Limited fine-tuning via API (expensive, less control)",
},
"data_privacy": {
"description": "No data leaves your infrastructure",
"impact": "Required for HIPAA, GDPR, defense, financial compliance",
"with_closed_models": "Data sent to third-party API (compliance risk)",
},
"latency_control": {
"description": "Full control over serving infrastructure",
"impact": "Predictable latency, no API rate limits",
"with_closed_models": "Subject to provider's capacity and rate limits",
},
"customization": {
"description": "Modify architecture, add adapters, change tokenizer",
"impact": "Extreme flexibility for specialized use cases",
"with_closed_models": "No architectural modification possible",
},
"cost_at_scale": {
"description": "Fixed GPU cost regardless of token volume",
"impact": "Dramatically cheaper at high volume",
"with_closed_models": "Cost scales linearly with usage",
},
"research_and_development": {
"description": "Full access to internals for research",
"impact": "Essential for ML research, interpretability, safety research",
"with_closed_models": "Black box",
},
}
return advantages
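The fine-tuning advantage is usually exercised through parameter-efficient methods like LoRA, where the trainable footprint is tiny relative to the base model. A back-of-envelope sketch (the dimensions are rough Llama-70B-class assumptions; GQA makes the real k/v projections smaller than this square-matrix approximation):

```python
def lora_trainable_params(d_model, rank, n_layers, matrices_per_layer=4):
    """Approximate LoRA trainable parameters: each adapted d x d weight
    matrix gets two low-rank factors, A (rank x d) and B (d x rank)."""
    per_matrix = 2 * rank * d_model
    return n_layers * matrices_per_layer * per_matrix

trainable = lora_trainable_params(d_model=8192, rank=16, n_layers=80)
base = 70e9
print(f"{trainable / 1e6:.0f}M trainable ({trainable / base:.3%} of 70B)")
# Roughly 84M trainable parameters, ~0.12% of the base model.
```

This is why fine-tuning a 70B open model is feasible on modest hardware: gradients and optimizer state are only needed for the adapter weights.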
Quantization and Optimization
Open models can be aggressively optimized for specific hardware:
def quantization_options():
"""
Quantization options available only for open models.
"""
options = {
"INT8 (GPTQ/AWQ)": {
"quality_loss": "0.5-1%",
"memory_reduction": "2x",
"speed_improvement": "1.5-2x",
"available_for_closed": False,
},
"INT4 (GPTQ/AWQ)": {
"quality_loss": "1-3%",
"memory_reduction": "4x",
"speed_improvement": "2-3x",
"available_for_closed": False,
},
"INT4 + speculative decoding": {
"quality_loss": "0% (verified by large model)",
"speed_improvement": "2-4x",
"available_for_closed": False,
},
"GGUF (llama.cpp)": {
"quality_loss": "Variable (Q2_K to Q8_0)",
"platform": "CPU + GPU hybrid, any hardware",
"available_for_closed": False,
},
}
return options
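Quantization choices map directly to memory budgets, which is what determines the GPU counts used in the cost section. A simple weight-memory estimate (KV cache and activation overhead excluded):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone: params x bits / 8, in GB."""
    return params_billion * bits_per_weight / 8

for name, params, bits in [
    ("Llama 3.1 70B FP16", 70, 16),
    ("Llama 3.1 70B INT4", 70, 4),
    ("Llama 3.1 405B FP16", 405, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB")
```

INT4 brings a 70B model (~35 GB of weights) within a single 80 GB A100, while FP16 405B (~810 GB) requires a multi-GPU node, consistent with the self-hosted configurations above.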
Decision Framework
When to Use What
def decision_framework():
"""
Decision tree for choosing between open and closed models.
"""
decisions = {
"need_data_privacy": {
"answer": "Open model (self-hosted)",
"reason": "Cannot send data to third-party API",
},
"need_maximum_safety": {
"answer": "Claude (closed)",
"reason": "Best alignment, Constitutional AI",
},
"need_best_math_code": {
"answer": "DeepSeek V3 (open)",
"reason": "Leads on MATH-500 and HumanEval",
},
"need_chinese_language": {
"answer": "Qwen 2.5 (open)",
"reason": "Best CJK performance",
},
"budget_under_100_per_month": {
"answer": "DeepSeek V3 API or open model on consumer GPU",
"reason": "Cheapest options",
},
"need_1M_context": {
"answer": "Gemini (closed)",
"reason": "Only model with 1M verified context",
},
"need_fine_tuning": {
"answer": "Open model",
"reason": "Full control over training",
},
"prototype_quickly": {
"answer": "Closed API (any provider)",
"reason": "No infrastructure to manage",
},
"production_at_scale": {
"answer": "Open model (self-hosted)",
"reason": "Cost-effective at high volume",
},
}
return decisions
Decision Matrix: Open vs Closed
| Requirement | Best Choice | Specific Model | Reason |
|---|---|---|---|
| Data privacy required | Open (self-hosted) | Llama 3.1 70B/405B | No data leaves your infra |
| Maximum safety | Closed | Claude 3.5 Sonnet | Best alignment |
| Best code/math | Open | DeepSeek V3 | Leads benchmarks |
| Chinese language | Open | Qwen 2.5 72B | Best CJK quality |
| Lowest cost | Open API | DeepSeek V3 API | $0.55/M tokens |
| 1M context | Closed | Gemini 2.0 | Only option |
| Fine-tuning needed | Open | Llama 3.1 / Qwen 2.5 | Full weight access |
| Quick prototype | Closed API | Any provider | Zero infrastructure |
The 2026 Outlook
Trends
def outlook_2026():
"""
Projected trends for open vs closed models in 2026.
"""
trends = {
"gap_closing_on_benchmarks": {
"prediction": "Open models will match or lead on all standard benchmarks",
"confidence": "Very high",
"driver": "DeepSeek, Meta, Alibaba all investing heavily",
},
"safety_gap_persists": {
"prediction": "Closed models will maintain 10-15% safety advantage",
"confidence": "High",
"driver": "Alignment requires sustained investment that open labs under-prioritize",
},
"cost_gap_widens": {
"prediction": "Self-hosting will become 10-50x cheaper than APIs",
"confidence": "High",
"driver": "Better quantization, cheaper GPUs, more efficient models",
},
"moe_becomes_dominant_for_open": {
"prediction": "Most frontier open models will be MoE by 2026",
"confidence": "High",
"driver": "DeepSeek V3 proved MoE is strictly better for training efficiency",
},
"agentic_gap_narrows": {
"prediction": "Open models will close the agentic/tool-use gap",
"confidence": "Medium",
"driver": "Tool-use training data becoming more available",
},
}
return trends
What Would Change the Trajectory
def trajectory_changers():
"""
Events that could alter the open vs closed trajectory.
"""
scenarios = {
"regulation_restricts_open_weights": {
"impact": "Open models stagnate, closed models gain permanent lead",
"likelihood": "Low-medium in US, higher in EU",
},
"breakthrough_in_efficient_alignment": {
"impact": "Open models close safety gap quickly",
"likelihood": "Medium — active research area",
},
"new_architectural_paradigm": {
"impact": "Whoever adopts first gains temporary lead",
"likelihood": "Low — current paradigm is well-optimized",
},
"compute_becomes_abundant": {
"impact": "Open models can train at closed-model scale",
"likelihood": "Medium — GPU supply increasing rapidly",
},
}
return scenarios
Practical Recommendations
For Startups
def startup_recommendations():
"""
Model selection advice for startups in 2026.
"""
return {
"default": "Start with DeepSeek V3 API ($0.55/M tokens) for prototyping. "
"Switch to self-hosted Llama/Qwen when volume exceeds 50M tokens/month.",
"safety_critical": "Use Claude API for anything involving healthcare, legal, "
"or content moderation. The safety premium is worth it.",
"budget_constrained": "Qwen 2.5 7B or Llama 3 8B on a single consumer GPU. "
"Good enough for many tasks at near-zero marginal cost.",
"need_customization": "Self-host Llama 3.1 70B and fine-tune with LoRA on "
"your domain data. Best quality per dollar for specialized tasks.",
}
For Enterprises
def enterprise_recommendations():
"""
Model selection advice for enterprises in 2026.
"""
return {
"regulated_industry": "Self-hosted open model (Llama or Qwen). "
"Required for HIPAA, SOC2, FedRAMP compliance.",
"general_purpose": "Claude or GPT-4o API for maximum quality and safety. "
"Cost is justified by reduced risk.",
"high_volume": "Self-hosted DeepSeek V3 or Llama 405B. "
"At 1B+ tokens/month, self-hosting saves 90%+ vs API.",
"multilingual_global": "Qwen 2.5 for CJK markets, Llama 3 for Western markets. "
"Consider separate deployments per region.",
}
The Convergence Thesis
Architecture Has Converged
The most striking finding: open and closed models now use nearly identical architectures. The differentiation has shifted entirely to data and post-training.
def convergence_evidence():
"""
Evidence that architecture has converged across all labs.
"""
universal_choices = {
"architecture": "Causal decoder-only (100% of frontier models)",
"normalization": "RMSNorm, pre-norm (100%)",
"activation": "SwiGLU (95%+)",
"position_encoding": "RoPE (95%+)",
"tokenizer": "BPE, 100K+ vocab (90%+)",
}
differentiating_choices = {
"moe_vs_dense": "Still debated (MoE trending upward)",
"attention_variant": "GQA vs MLA vs MHA (GQA dominant)",
"training_data": "Primary differentiator (not architectural)",
"post_training": "Primary differentiator (RLHF methodology)",
"safety_training": "Significant differentiator (Claude vs others)",
}
return universal_choices, differentiating_choices
Summary
The open-closed model gap in 2026 has a clear shape:
Gap closed (open models match or lead):
- General knowledge (MMLU): Tied
- Coding (HumanEval): Open leads (DeepSeek V3)
- Math (MATH-500): Open leads significantly (DeepSeek V3)
- Multilingual (Chinese): Open leads (Qwen 2.5)
Gap still open (closed models lead):
- Safety and alignment: Closed leads by 15%+ (Claude)
- Agentic/tool use: Closed leads by 10% (Claude, GPT-4o)
- Long context (>200K): Closed leads (Gemini at 1M)
- Instruction following quality: Closed leads by 5-7%
Cost:
- Self-hosted open models: $0.02-0.30 per million tokens
- Closed APIs: $0.55-9.00 per million tokens
- Self-hosting is 3-300x cheaper depending on model and volume
The trajectory: The benchmark gap has collapsed from 16 points (2023) to under 1 point (2025). The remaining gaps are in dimensions that require sustained investment in alignment, agentic training data, and infrastructure (long context). These gaps are narrowing but will persist because they reflect different lab priorities, not architectural limitations.
The most important practical implication: model choice in 2026 should be driven by requirements (privacy, safety, cost, fine-tuning needs), not by raw quality. On benchmarks, the top open and closed models are functionally equivalent.