Part of Series: Frontier Model Architectures (5 of 27)

Qwen 2.5: Alibaba's Architecture, Training Recipe, and What Makes It Competitive

Alibaba’s Qwen 2.5 72B beats Llama 3.1 70B on code (LiveCodeBench: 38.2% vs 33.1%) and multilingual tasks (MGSM zh: 75.6% vs 61.9%) while matching it on English. The 72B model was trained on 18+ trillion tokens, 20% more data than Llama 3, with a bilingual tokenizer that needs only 298 tokens per 1,000 Chinese characters, versus 342 for Llama 3 and 583 for Llama 2. When Western labs under-invest in CJK compression, Chinese labs gain a substantially larger effective context window for free.
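Those tokenizer numbers translate directly into effective context: fewer tokens per character means more text fits in a fixed window. A quick sanity check on the per-1,000-character counts (the same figures appear in the tokenization comparison later in this article):

```python
# Tokens needed per 1,000 Chinese characters (lower is better).
TOKENS_PER_1K_ZH = {"Qwen 2.5": 298, "Llama 3": 342, "Llama 2": 583}

# Chinese characters that fit in a 128K-token context window.
context_tokens = 131_072
chars_in_context = {
    name: int(context_tokens / t * 1000) for name, t in TOKENS_PER_1K_ZH.items()
}

# Relative token savings for the same Chinese text.
savings_vs_llama3 = 1 - 298 / 342   # ~13% fewer tokens
savings_vs_llama2 = 1 - 298 / 583   # ~49% fewer tokens
```

The same 128K window holds roughly 440K Chinese characters under Qwen's tokenizer, versus about 383K under Llama 3's.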

Architecture Overview

The Qwen 2.5 Family

Qwen 2.5 spans from 0.5B to 72B dense models, plus MoE variants (Qwen 2.5-MoE). The architecture follows the modern consensus with a few distinctive choices.

QWEN25_CONFIGS = {
    "0.5B": {
        "d_model": 896,
        "num_layers": 24,
        "num_q_heads": 14,
        "num_kv_heads": 2,
        "head_dim": 64,
        "d_ff": 4864,
        "vocab_size": 151936,
        "context": 131072,
        "total_params": "0.49B",
    },
    "3B": {
        "d_model": 2048,
        "num_layers": 36,
        "num_q_heads": 16,
        "num_kv_heads": 2,
        "head_dim": 128,
        "d_ff": 11008,
        "vocab_size": 151936,
        "context": 131072,
        "total_params": "3.09B",
    },
    "7B": {
        "d_model": 3584,
        "num_layers": 28,
        "num_q_heads": 28,
        "num_kv_heads": 4,
        "head_dim": 128,
        "d_ff": 18944,
        "vocab_size": 151936,
        "context": 131072,
        "total_params": "7.62B",
    },
    "14B": {
        "d_model": 5120,
        "num_layers": 48,
        "num_q_heads": 40,
        "num_kv_heads": 8,
        "head_dim": 128,
        "d_ff": 13824,
        "vocab_size": 151936,
        "context": 131072,
        "total_params": "14.77B",
    },
    "32B": {
        "d_model": 5120,
        "num_layers": 64,
        "num_q_heads": 40,
        "num_kv_heads": 8,
        "head_dim": 128,
        "d_ff": 27648,
        "vocab_size": 152064,
        "context": 131072,
        "total_params": "32.76B",
    },
    "72B": {
        "d_model": 8192,
        "num_layers": 80,
        "num_q_heads": 64,
        "num_kv_heads": 8,
        "head_dim": 128,
        "d_ff": 29568,
        "vocab_size": 152064,
        "context": 131072,
        "total_params": "72.71B",
    },
}

Qwen 2.5 Architecture Compared to Llama 3

| Feature | Qwen 2.5 72B | Llama 3.1 70B | Notes |
|---|---|---|---|
| d_model | 8192 | 8192 | Same |
| Layers | 80 | 80 | Same |
| Q heads | 64 | 64 | Same |
| KV heads | 8 | 8 | Same |
| Head dim | 128 | 128 | Same |
| FFN dim | 29568 | 28672 | Qwen slightly larger |
| Vocab size | 152064 | 128256 | Qwen 19% larger |
| Context | 131072 | 131072 | Same (after extension) |
| Total params | 72.71B | 70.55B | Qwen 3% larger |

The Convergence

At the 70B scale, Qwen 2.5 and Llama 3.1 have nearly identical architectures. The differences are marginal: Qwen has a slightly larger FFN dimension and vocabulary. The quality differences come almost entirely from training data and post-training.

Vocabulary: 152K Tokens

CJK Optimization

Qwen 2.5 uses a 152K vocabulary, even larger than Llama 3’s 128K. The extra tokens are primarily CJK (Chinese, Japanese, Korean) characters and common subwords.

def vocabulary_comparison():
    """
    Compare tokenization efficiency across vocabularies.
    """
    comparisons = {
        "english_paragraph": {
            "llama2_32k": 147,    # tokens
            "llama3_128k": 124,   # tokens
            "qwen25_152k": 121,   # tokens
        },
        "chinese_paragraph_1000_chars": {
            "llama2_32k": 583,    # Most characters need 2+ tokens
            "llama3_128k": 342,   # Better CJK coverage
            "qwen25_152k": 298,   # Best CJK coverage
        },
        "python_code_100_lines": {
            "llama2_32k": 412,
            "llama3_128k": 356,
            "qwen25_152k": 348,
        },
    }
    return comparisons

Tokenization Efficiency: Tokens per 1,000 Characters (lower = better)

| Tokenizer | Tokens per 1,000 chars | Note |
|---|---|---|
| Qwen 2.5 (Chinese) | 298 | Best CJK compression |
| Llama 3 (Chinese) | 342 | 13% more tokens |
| Llama 2 (Chinese) | 583 | 96% more tokens |
| Qwen 2.5 (English) | 121 | Similar to Llama 3 |
| Llama 3 (English) | 124 | Slightly more tokens |

Tiktoken-Compatible BPE

Qwen 2.5 uses a BPE tokenizer with byte-level fallback, similar to GPT-4’s tiktoken. Tokens are learned from a multilingual corpus with explicit coverage targets for Chinese, English, code, and mathematical notation.
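Byte-level fallback is what makes the tokenizer lossless: any character without a learned token decomposes into its raw UTF-8 bytes, each of which has a dedicated token ID. A toy sketch of the mechanism (the vocabulary and ID layout here are invented for illustration, not Qwen's actual ones):

```python
# Toy illustration of byte-level fallback: strings are encoded against a
# learned vocabulary; anything unseen degrades to one token per UTF-8 byte.
TOY_VOCAB = {"hello": 0, " world": 1}   # pretend these were learned by BPE
BYTE_OFFSET = 2                         # byte tokens occupy ids 2..257

def encode_with_fallback(text: str) -> list[int]:
    tokens = []
    remaining = text
    while remaining:
        # Greedy longest-prefix match against the learned vocabulary.
        for end in range(len(remaining), 0, -1):
            piece = remaining[:end]
            if piece in TOY_VOCAB:
                tokens.append(TOY_VOCAB[piece])
                remaining = remaining[end:]
                break
        else:
            # No learned token covers the next character: fall back to bytes.
            char, remaining = remaining[0], remaining[1:]
            tokens.extend(BYTE_OFFSET + b for b in char.encode("utf-8"))
    return tokens

encode_with_fallback("hello world")   # two learned tokens
encode_with_fallback("hello 你")      # learned token, then byte fallbacks
```

With only 256 reserved byte tokens, coverage of the entire Unicode space is guaranteed; the 152K learned tokens exist purely to make common text cheap.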

# Token type distribution in Qwen 2.5 vocabulary
TOKEN_DISTRIBUTION = {
    "English subwords": 45000,          # ~30%
    "Chinese characters/subwords": 35000,  # ~23%
    "Code tokens": 20000,              # ~13%
    "Multilingual (other)": 25000,     # ~16%
    "Digits and special": 15000,       # ~10%
    "Byte fallback (256)": 256,        # ~0.2%
    "Special tokens": 11808,           # ~8%
}

Training Data: 18T+ Tokens

Data Composition

Qwen 2.5 was trained on over 18 trillion tokens, making it one of the most extensively trained open-weight models of its generation. The data composition emphasizes Chinese and code more heavily than Llama 3's.

def qwen25_training_data():
    """
    Estimated training data composition for Qwen 2.5.
    (Alibaba has not published exact proportions.)
    """
    estimated_composition = {
        "web_english": {
            "proportion": 0.35,
            "tokens_T": 6.3,
            "source": "CommonCrawl filtered, similar to FineWeb",
        },
        "web_chinese": {
            "proportion": 0.20,
            "tokens_T": 3.6,
            "source": "Chinese web corpora, proprietary filtering",
        },
        "code": {
            "proportion": 0.15,
            "tokens_T": 2.7,
            "source": "GitHub, code competition data",
        },
        "math": {
            "proportion": 0.08,
            "tokens_T": 1.4,
            "source": "Mathematical text, textbooks, proofs",
        },
        "multilingual_other": {
            "proportion": 0.10,
            "tokens_T": 1.8,
            "source": "Japanese, Korean, Arabic, European languages",
        },
        "books_academic": {
            "proportion": 0.07,
            "tokens_T": 1.3,
            "source": "Books, academic papers, encyclopedias",
        },
        "synthetic": {
            "proportion": 0.05,
            "tokens_T": 0.9,
            "source": "Synthetic data for math, code, instruction following",
        },
    }
    return estimated_composition

ℹ️ Synthetic Data Emphasis

Qwen 2.5 notably uses synthetic data for math and code training. The Qwen team generated millions of step-by-step mathematical solutions and code problems using earlier Qwen models, filtered for correctness, and included them in the pretraining data. This bootstrapping approach (training on outputs from previous model generations) contributed significantly to Qwen 2.5’s strong math and code performance.
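Alibaba has not published its generation pipeline, but the generate-filter loop described above amounts to rejection sampling against an automatic verifier. A toy sketch, with a fake "earlier model" and an exact-match checker standing in for real solution verification (execution tests for code, symbolic or numeric checks for math):

```python
import random

def noisy_model_answer(a: int, b: int) -> int:
    """Fake 'earlier model': right 80% of the time, off by one otherwise."""
    return a + b if random.random() < 0.8 else a + b + 1

def build_synthetic_set(n: int, seed: int = 0) -> list[dict]:
    """Generate candidate problems, keep only those the verifier accepts."""
    random.seed(seed)
    kept = []
    for _ in range(n):
        a, b = random.randint(10, 99), random.randint(10, 99)
        answer = noisy_model_answer(a, b)
        if answer == a + b:   # verifier: exact-match correctness check
            kept.append({"problem": f"{a} + {b} = ?", "answer": answer})
    return kept

data = build_synthetic_set(1000)
# Roughly 80% of generations survive the correctness filter; every kept
# example is verified-correct, so the synthetic set is clean by construction.
```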

Data Quality Pipeline

def qwen_data_pipeline():
    """
    Qwen's data processing pipeline (inferred from publications).
    """
    stages = [
        {
            "stage": "Deduplication",
            "method": "MinHash + exact dedup",
            "reduction": "~40% of raw data removed",
        },
        {
            "stage": "Language detection",
            "method": "FastText classifier",
            "purpose": "Route to language-specific filters",
        },
        {
            "stage": "Quality filtering",
            "method": "Perplexity-based + classifier ensemble",
            "reduction": "~30% of remaining data removed",
        },
        {
            "stage": "Safety filtering",
            "method": "Toxicity classifier + PII removal",
            "reduction": "~5% removed",
        },
        {
            "stage": "Domain upsampling",
            "method": "Increase weight for code, math, science",
            "effect": "2-3x oversampling of high-value domains",
        },
    ]
    return stages
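Chaining the stage-level reduction estimates gives a rough end-to-end yield for the pipeline (an illustration of the numbers above, not published figures):

```python
# Fraction of data surviving each filtering stage (estimates from above).
survival = {
    "dedup": 1 - 0.40,
    "quality_filter": 1 - 0.30,
    "safety_filter": 1 - 0.05,
}

yield_fraction = 1.0
for stage, frac in survival.items():
    yield_fraction *= frac

# Roughly 40% of raw data survives filtering, so reaching 18T curated
# tokens would require on the order of 45T raw tokens before upsampling.
raw_needed_T = 18 / yield_fraction
```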

GQA Configuration

Variable KV Heads Across Model Sizes

Unlike Llama 3 (which uses 8 KV heads for all model sizes), Qwen 2.5 varies the KV head count based on model size:

def qwen_gqa_rationale():
    """
    Qwen's GQA configuration rationale.
    Smaller models use fewer KV heads (more aggressive compression).
    """
    configs = {
        "0.5B": {"q_heads": 14, "kv_heads": 2, "ratio": 7, "reason": "Tiny model, aggressive KV compression needed"},
        "3B":   {"q_heads": 16, "kv_heads": 2, "ratio": 8, "reason": "Small model, still aggressive"},
        "7B":   {"q_heads": 28, "kv_heads": 4, "ratio": 7, "reason": "Medium model, moderate compression"},
        "14B":  {"q_heads": 40, "kv_heads": 8, "ratio": 5, "reason": "Large model, balanced"},
        "32B":  {"q_heads": 40, "kv_heads": 8, "ratio": 5, "reason": "Same as 14B (deeper, not wider attention)"},
        "72B":  {"q_heads": 64, "kv_heads": 8, "ratio": 8, "reason": "Match Llama 3 configuration"},
    }
    return configs

The reasoning: smaller models target edge and on-device deployment, where KV cache memory is the binding constraint, so they use the fewest KV heads in absolute terms. The query-to-KV ratio itself stays in a narrow band across the family: the 0.5B model uses only 2 KV heads (7:1 ratio), while the 72B uses 8 KV heads (8:1 ratio, matching Llama 3).


KV Cache Size per Token Across Qwen 2.5 Family

| Model | KV Heads | KV Cache/Token/Layer (FP16) | KV Cache at 128K (All Layers) |
|---|---|---|---|
| 0.5B | 2 | 512 bytes | 1.6 GB |
| 3B | 2 | 1024 bytes | 4.8 GB |
| 7B | 4 | 2048 bytes | 7.5 GB |
| 14B | 8 | 4096 bytes | 25.8 GB |
| 32B | 8 | 4096 bytes | 34.4 GB |
| 72B | 8 | 4096 bytes | 42.9 GB |
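KV cache size follows mechanically from the configs: per token and layer, K and V each store num_kv_heads × head_dim values at two bytes in FP16. A small calculator, with the per-model numbers plugged in from the config table earlier in this section:

```python
def kv_cache_bytes(num_kv_heads: int, head_dim: int, num_layers: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Total KV cache for one sequence: K and V tensors across all layers."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    return per_token_per_layer * num_layers * seq_len

# Qwen 2.5 72B at full 128K context (8 KV heads, head_dim 128, 80 layers):
total = kv_cache_bytes(8, 128, 80, 131_072)
print(f"{total / 1e9:.1f} GB")   # ≈ 42.9 GB in FP16
```

At 128K context the 72B's KV cache approaches half the size of the weights themselves, which is why long-context serving usually quantizes the cache to FP8 or INT8.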

SwiGLU FFN and Layer Dimensions

FFN Dimension Choices

Qwen 2.5 uses SwiGLU activation with a slightly different FFN multiplier than Llama 3:

def ffn_analysis():
    """Compare FFN dimensions and parameter ratios."""
    models = {
        "Llama 3.1 70B": {
            "d_model": 8192,
            "d_ff": 28672,
            "ratio": 28672 / 8192,  # 3.5x
            "ffn_params_per_layer": 3 * 8192 * 28672,  # SwiGLU: 3 matrices
        },
        "Qwen 2.5 72B": {
            "d_model": 8192,
            "d_ff": 29568,
            "ratio": 29568 / 8192,  # 3.61x
            "ffn_params_per_layer": 3 * 8192 * 29568,
        },
    }

    # Qwen uses a slightly larger FFN multiplier
    # This means slightly more compute per layer
    # The extra ~3% FFN params account for most of the
    # parameter difference between the two models

    return models

The FFN/attention parameter ratio in Qwen 2.5 72B is roughly 5:1. With GQA shrinking the K and V projections, over 80% of each layer’s parameters sit in the FFN block, a split that is consistent across most modern GQA-based LLMs.
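A quick check of the split from the published dimensions (per-layer weight matrices only, ignoring norms):

```python
d_model, d_ff = 8192, 29568
num_q_heads, num_kv_heads, head_dim = 64, 8, 128

# Attention: Q and O projections at full width, K and V at GQA width.
attn_params = d_model * (2 * num_q_heads * head_dim + 2 * num_kv_heads * head_dim)
# SwiGLU FFN: gate, up, and down projections.
ffn_params = 3 * d_model * d_ff

ratio = ffn_params / attn_params                      # ≈ 4.8
ffn_share = ffn_params / (ffn_params + attn_params)   # ≈ 0.83
```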

Context Extension and RoPE

YaRN-Based Extension

Qwen 2.5 supports 128K context through YaRN (Yet another RoPE extensioN), a technique that rescales each RoPE frequency based on its wavelength relative to the original context length.

import torch

def yarn_rope_scaling(
    base=10000.0,        # illustrative; production Qwen 2.5 configs use a larger rope_theta
    head_dim=128,
    original_ctx=4096,
    target_ctx=131072,
    beta_fast=32,
    beta_slow=1,
):
    """
    YaRN RoPE scaling for context extension.
    Divides frequencies into three regions:
    1. High frequency (short wavelength): no scaling needed
    2. Medium frequency: interpolate between original and scaled
    3. Low frequency (long wavelength): full scaling
    """
    scale_factor = target_ctx / original_ctx  # 32x

    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    wavelengths = 2 * torch.pi / freqs

    scaled_freqs = torch.zeros_like(freqs)
    for i, (f, w) in enumerate(zip(freqs, wavelengths)):
        if w < original_ctx / beta_fast:
            # High frequency: no scaling
            scaled_freqs[i] = f
        elif w > original_ctx / beta_slow:
            # Low frequency: full interpolation scaling
            scaled_freqs[i] = f / scale_factor
        else:
            # Medium frequency: smooth blend that approaches the unscaled
            # frequency as the wavelength shortens (more rotations per window)
            smooth = (original_ctx / w - beta_slow) / (beta_fast - beta_slow)
            smooth = max(0.0, min(1.0, smooth))
            scaled_freqs[i] = (f / scale_factor) * (1 - smooth) + f * smooth

    return scaled_freqs

Continued Pretraining for Long Context

Like Llama 3.1, Qwen 2.5 undergoes continued pretraining on progressively longer sequences after the initial short-context training:

def qwen25_context_extension():
    """Context extension schedule."""
    stages = [
        {"ctx": 4096, "tokens": "~18T", "phase": "Main pretraining"},
        {"ctx": 32768, "tokens": "~200B", "phase": "First extension"},
        {"ctx": 65536, "tokens": "~100B", "phase": "Second extension"},
        {"ctx": 131072, "tokens": "~50B", "phase": "Final extension"},
    ]
    return stages
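Tallying the schedule shows that the extension stages are a small fraction of total training:

```python
# Long-context continued-pretraining budget (estimates from the schedule).
extension_tokens_B = 200 + 100 + 50   # 350B tokens across three stages
main_pretraining_B = 18_000           # ~18T tokens at 4K context

share = extension_tokens_B / (main_pretraining_B + extension_tokens_B)
# Context extension is on the order of 2% of total training tokens.
# (Per-token compute rises with sequence length, so the compute share
# is somewhat higher, but still small.)
```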

Code and Math Specialization

Why Qwen Excels at Code

Qwen 2.5 was explicitly designed to be strong at code generation. The training recipe includes:

  1. 15%+ code in pretraining data: Higher proportion than most models.
  2. Code-specific synthetic data: Generated coding problems and solutions.
  3. Multi-language code coverage: Python, JavaScript, TypeScript, Java, C++, Rust, Go, and more.
  4. Qwen2.5-Coder variant: Specialized code model trained on additional code-focused data.

def code_benchmark_comparison():
    """Code benchmark results."""
    benchmarks = {
        "HumanEval": {
            "Qwen 2.5 72B": 86.6,
            "Llama 3.1 70B": 80.5,
            "DeepSeek V3": 92.7,
            "GPT-4o": 90.2,
        },
        "MBPP": {
            "Qwen 2.5 72B": 88.2,
            "Llama 3.1 70B": 82.4,
            "DeepSeek V3": 90.1,
            "GPT-4o": 88.5,
        },
        "MultiPL-E (avg across languages)": {
            "Qwen 2.5 72B": 78.3,
            "Llama 3.1 70B": 71.2,
            "DeepSeek V3": 82.5,
            "GPT-4o": 79.8,
        },
    }
    return benchmarks

Code and Math Benchmarks

| Benchmark | Qwen 2.5 72B | Llama 3.1 70B | DeepSeek V3 | GPT-4o |
|---|---|---|---|---|
| HumanEval | 86.6 | 80.5 | 92.7 | 90.2 |
| MBPP | 88.2 | 82.4 | 90.1 | 88.5 |
| MATH-500 | 83.1 | 68.0 | 90.2 | 76.6 |
| GSM8K | 91.5 | 95.1 | 97.8 | 95.8 |
| MMLU | 85.3 | 86.0 | 88.5 | 88.7 |

Math Performance

Qwen 2.5 72B scores 83.1 on MATH-500, significantly above Llama 3.1 70B’s 68.0. The gap comes from:

  1. Math-specific synthetic data: Step-by-step solutions generated and verified.
  2. Chain-of-thought training: The SFT data includes detailed mathematical reasoning chains.
  3. Higher proportion of mathematical text in pretraining: Textbooks, arXiv papers, competition problems.

MATH-500 Score Comparison

| Model | MATH-500 accuracy (%) | Note |
|---|---|---|
| DeepSeek V3 (671B MoE) | 90.2 | State of the art |
| Qwen 2.5 72B | 83.1 | Strong for a dense 72B |
| GPT-4o | 76.6 | |
| Llama 3.1 405B | 73.8 | 5.6x larger, still below Qwen 72B |
| Llama 3.1 70B | 68.0 | Notably lower |
Qwen 2.5 72B vs Llama 3.1 405B on Math

On MATH-500, Qwen 2.5 72B (83.1) outperforms Llama 3.1 405B (73.8) by a significant margin despite being 5.6x smaller. This demonstrates that training data quality and composition matter more than raw parameter count for specialized capabilities like mathematical reasoning.

Multilingual Performance

CJK Advantage

Qwen 2.5’s primary differentiator is multilingual performance, especially for Chinese:

def multilingual_benchmarks():
    """
    Multilingual benchmark comparison.
    """
    results = {
        "C-Eval (Chinese)": {
            "Qwen 2.5 72B": 86.1,
            "Llama 3.1 70B": 55.2,
            "DeepSeek V3": 78.0,
        },
        "CMMLU (Chinese)": {
            "Qwen 2.5 72B": 85.3,
            "Llama 3.1 70B": 52.8,
            "DeepSeek V3": 77.4,
        },
        "JLPT (Japanese)": {
            "Qwen 2.5 72B": 78.5,
            "Llama 3.1 70B": 62.3,
            "DeepSeek V3": 71.2,
        },
    }
    return results

Multilingual Benchmark Comparison

| Benchmark | Qwen 2.5 72B | Llama 3.1 70B | Gap |
|---|---|---|---|
| C-Eval (Chinese) | 86.1 | 55.2 | +30.9 points for Qwen |
| CMMLU (Chinese) | 85.3 | 52.8 | +32.5 points for Qwen |
| MMLU (English) | 85.3 | 86.0 | -0.7 points (Llama slightly better) |
| ARC-C (English) | 88.4 | 87.3 | +1.1 points for Qwen |

The 30+ point gap on Chinese benchmarks comes from three factors:

  1. 20% Chinese training data vs Llama’s estimated 5-10%.
  2. 152K vocabulary optimized for CJK compression.
  3. Chinese-specific synthetic data for math and knowledge tasks.

Open-Weight Release Strategy

Model Variants

Alibaba releases Qwen 2.5 in an unusually comprehensive set of variants:

QWEN25_RELEASE_MATRIX = {
    "base_models": ["0.5B", "1.5B", "3B", "7B", "14B", "32B", "72B"],
    "instruct_models": ["0.5B-Instruct", "1.5B-Instruct", "3B-Instruct",
                        "7B-Instruct", "14B-Instruct", "32B-Instruct",
                        "72B-Instruct"],
    "specialized": [
        "Qwen2.5-Coder-7B",
        "Qwen2.5-Coder-32B",
        "Qwen2.5-Math-7B",
        "Qwen2.5-Math-72B",
    ],
    "moe_variants": [
        "Qwen2.5-MoE-A2.7B (14.3B total)",
    ],
    "license": "Apache 2.0 (permissive) for most sizes",
}

This breadth is strategic: by providing models at every scale from mobile (0.5B) to datacenter (72B), Alibaba ensures Qwen becomes the default choice for the Chinese-language AI ecosystem.

Comparison: Qwen 2.5 vs Llama 3 vs DeepSeek V3

Systematic Analysis

def three_way_comparison():
    """Systematic comparison across dimensions."""
    dimensions = {
        "architecture": {
            "Qwen 2.5 72B": "Dense, GQA-8, SwiGLU, RoPE, 152K vocab",
            "Llama 3.1 70B": "Dense, GQA-8, SwiGLU, RoPE, 128K vocab",
            "DeepSeek V3": "MoE (256e), MLA, SwiGLU, RoPE, 128K vocab",
        },
        "training_tokens": {
            "Qwen 2.5 72B": "18T+",
            "Llama 3.1 70B": "15T+",
            "DeepSeek V3": "14.8T",
        },
        "chinese_quality": {
            "Qwen 2.5 72B": "Best",
            "Llama 3.1 70B": "Weakest",
            "DeepSeek V3": "Strong",
        },
        "english_quality": {
            "Qwen 2.5 72B": "Strong",
            "Llama 3.1 70B": "Strong",
            "DeepSeek V3": "Best",
        },
        "math_quality": {
            "Qwen 2.5 72B": "Very strong",
            "Llama 3.1 70B": "Moderate",
            "DeepSeek V3": "Best",
        },
        "serving_simplicity": {
            "Qwen 2.5 72B": "Simple (dense)",
            "Llama 3.1 70B": "Simple (dense)",
            "DeepSeek V3": "Complex (MoE)",
        },
        "memory_requirement": {
            "Qwen 2.5 72B": "~145 GB FP16",
            "Llama 3.1 70B": "~140 GB FP16",
            "DeepSeek V3": "~1340 GB FP16",
        },
    }
    return dimensions

Three-Way Model Comparison

| Dimension | Qwen 2.5 72B | Llama 3.1 70B | DeepSeek V3 |
|---|---|---|---|
| MMLU | 85.3 | 86.0 | 88.5 |
| MATH-500 | 83.1 | 68.0 | 90.2 |
| HumanEval | 86.6 | 80.5 | 92.7 |
| C-Eval (Chinese) | 86.1 | 55.2 | 78.0 |
| Training tokens | 18T+ | 15T+ | 14.8T |
| Memory (FP16) | 145 GB | 140 GB | 1,342 GB |
| Serving GPUs (A100 80GB) | 2 | 2 | 20 |

What Makes Qwen 2.5 Competitive

The Data Advantage

Qwen 2.5’s primary competitive advantage is training data, not architecture. The architecture is standard (nearly identical to Llama 3). The quality comes from:

  1. More training tokens (18T vs 15T): 20% more training data.
  2. Better multilingual data: Proprietary Chinese web corpora that Western labs cannot easily access.
  3. Synthetic data augmentation: Code and math synthetic data generated by earlier Qwen models.
  4. Data quality filtering: Aggressive perplexity-based and classifier-based filtering.

def training_data_impact_analysis():
    """
    Estimate the contribution of different factors to Qwen's performance.
    """
    factors = {
        "architecture": {
            "contribution": "~5%",
            "evidence": "Nearly identical to Llama 3 architecture",
        },
        "training_tokens": {
            "contribution": "~20%",
            "evidence": "18T vs 15T = 20% more data",
        },
        "data_quality_filtering": {
            "contribution": "~25%",
            "evidence": "Better Chinese data, stronger dedup",
        },
        "synthetic_data": {
            "contribution": "~25%",
            "evidence": "Math and code synthetic data (huge impact on MATH benchmark)",
        },
        "post_training": {
            "contribution": "~25%",
            "evidence": "DPO/RLHF quality, instruction following",
        },
    }
    return factors

💡 The Lesson from Qwen 2.5

Qwen 2.5 demonstrates that the architecture wars are largely over. Two labs can build nearly identical architectures and get very different results based on training data. The frontier of LLM quality has shifted from architecture innovation to data engineering — curation, filtering, synthetic generation, and domain balance. Qwen 2.5’s strength in code and math comes from better data in those domains, not from any architectural innovation.

Summary

Qwen 2.5 is a strong open-weight LLM family that demonstrates data-driven quality:

  • Architecture: Standard modern LLM architecture, nearly identical to Llama 3. GQA-8, SwiGLU, RoPE, pre-norm.
  • Vocabulary: 152K tokens with strong CJK optimization, contributing to 13% better Chinese tokenization.
  • Training: 18T+ tokens with emphasis on Chinese, code, and math data including synthetic augmentation.
  • Performance: Matches Llama 3 on English, dramatically exceeds it on Chinese (+30 points on C-Eval), and outperforms on math (+15 points on MATH-500).
  • Serving: Dense architecture at 72B is straightforward to deploy (2 A100 80GB for FP16).
  • Release strategy: Comprehensive model family from 0.5B to 72B with specialized code and math variants.

For Chinese-language applications, Qwen 2.5 is the clear best choice among open-weight models. For English, it trades blows with Llama 3. For math and code, it punches well above its weight class.