Alibaba’s Qwen 2.5 72B beats Llama 3.1 70B on code (LiveCodeBench: 38.2% vs 33.1%) and multilingual tasks (MGSM zh: 75.6% vs 61.9%) while matching it on English. The 72B model was trained on 18+ trillion tokens (20% more data than Llama 3) with a bilingual tokenizer that compresses Chinese at roughly 4.2 bytes per token, versus the ~2.8 typical of English-centric BPE. When Western labs ignore CJK compression, Chinese labs fit the same text in about a third fewer tokens: an effective context window advantage for free.
Architecture Overview
The Qwen 2.5 Family
Qwen 2.5 spans dense models from 0.5B to 72B parameters, plus MoE variants (Qwen 2.5-MoE). The architecture follows the modern consensus with a few distinctive choices.
QWEN25_CONFIGS = {
"0.5B": {
"d_model": 896,
"num_layers": 24,
"num_q_heads": 14,
"num_kv_heads": 2,
"head_dim": 64,
"d_ff": 4864,
"vocab_size": 151936,
"context": 131072,
"total_params": "0.49B",
},
"3B": {
"d_model": 2048,
"num_layers": 36,
"num_q_heads": 16,
"num_kv_heads": 2,
"head_dim": 128,
"d_ff": 11008,
"vocab_size": 151936,
"context": 131072,
"total_params": "3.09B",
},
"7B": {
"d_model": 3584,
"num_layers": 28,
"num_q_heads": 28,
"num_kv_heads": 4,
"head_dim": 128,
"d_ff": 18944,
"vocab_size": 151936,
"context": 131072,
"total_params": "7.62B",
},
"14B": {
"d_model": 5120,
"num_layers": 48,
"num_q_heads": 40,
"num_kv_heads": 8,
"head_dim": 128,
"d_ff": 13824,
"vocab_size": 151936,
"context": 131072,
"total_params": "14.77B",
},
"32B": {
"d_model": 5120,
"num_layers": 64,
"num_q_heads": 40,
"num_kv_heads": 8,
"head_dim": 128,
"d_ff": 27648,
"vocab_size": 152064,
"context": 131072,
"total_params": "32.76B",
},
"72B": {
"d_model": 8192,
"num_layers": 80,
"num_q_heads": 64,
"num_kv_heads": 8,
"head_dim": 128,
"d_ff": 29568,
"vocab_size": 152064,
"context": 131072,
"total_params": "72.71B",
},
}
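The published totals can be sanity-checked from the configs alone. A minimal sketch that counts weight matrices only (Qwen's attention QKV biases and the RMSNorm gains add well under 0.1% at this scale):

```python
def estimate_params(cfg, tied_embeddings=False):
    """Rough parameter count from a config dict: weight matrices only,
    ignoring biases and RMSNorm gains (well under 0.1% at this scale)."""
    d, L = cfg["d_model"], cfg["num_layers"]
    hd = cfg["head_dim"]
    attn = d * cfg["num_q_heads"] * hd        # W_q
    attn += 2 * d * cfg["num_kv_heads"] * hd  # W_k, W_v
    attn += cfg["num_q_heads"] * hd * d       # W_o
    ffn = 3 * d * cfg["d_ff"]                 # SwiGLU: gate, up, down projections
    embed = cfg["vocab_size"] * d
    return L * (attn + ffn) + embed * (1 if tied_embeddings else 2)

qwen72b = {"d_model": 8192, "num_layers": 80, "num_q_heads": 64,
           "num_kv_heads": 8, "head_dim": 128, "d_ff": 29568,
           "vocab_size": 152064}
print(f"{estimate_params(qwen72b) / 1e9:.2f}B")  # 72.70B
```

The estimate lands within rounding of the published 72.71B. The smallest variants reportedly tie input and output embeddings; with `tied_embeddings=True`, the 0.5B config comes out near its published 0.49B.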
Qwen 2.5 Architecture Compared to Llama 3
| Feature | Qwen 2.5 72B | Llama 3.1 70B | Notes |
|---|---|---|---|
| d_model | 8192 | 8192 | Same |
| Layers | 80 | 80 | Same |
| Q heads | 64 | 64 | Same |
| KV heads | 8 | 8 | Same |
| Head dim | 128 | 128 | Same |
| FFN dim | 29568 | 28672 | Qwen slightly larger |
| Vocab size | 152064 | 128256 | Qwen 19% larger |
| Context | 131072 | 131072 | Same (after extension) |
| Total params | 72.71B | 70.55B | Qwen 3% larger |
The Convergence
At the 70B scale, Qwen 2.5 and Llama 3.1 have nearly identical architectures. The differences are marginal: Qwen has a slightly larger FFN dimension and vocabulary. The quality differences come almost entirely from training data and post-training.
Vocabulary: 152K Tokens
CJK Optimization
Qwen 2.5 uses a 152K vocabulary, even larger than Llama 3’s 128K. The extra tokens are primarily CJK (Chinese, Japanese, Korean) characters and common subwords.
def vocabulary_comparison():
"""
Compare tokenization efficiency across vocabularies.
"""
comparisons = {
"english_paragraph": {
"llama2_32k": 147, # tokens
"llama3_128k": 124, # tokens
"qwen25_152k": 121, # tokens
},
"chinese_paragraph_1000_chars": {
"llama2_32k": 583, # Most characters need 2+ tokens
"llama3_128k": 342, # Better CJK coverage
"qwen25_152k": 298, # Best CJK coverage
},
"python_code_100_lines": {
"llama2_32k": 412,
"llama3_128k": 356,
"qwen25_152k": 348,
},
}
return comparisons
Tokenization Efficiency: Tokens per 1000 Characters
(Chart: tokens per 1000 characters for each tokenizer; lower is better.)
Tiktoken-Compatible BPE
Qwen 2.5 uses a BPE tokenizer with byte-level fallback, similar to GPT-4’s tiktoken. Tokens are learned from a multilingual corpus with explicit coverage targets for Chinese, English, code, and mathematical notation.
# Approximate token type distribution in the Qwen 2.5 vocabulary
# (estimated; Alibaba has not published an exact breakdown)
TOKEN_DISTRIBUTION = {
"English subwords": 45000, # ~30%
"Chinese characters/subwords": 35000, # ~23%
"Code tokens": 20000, # ~13%
"Multilingual (other)": 25000, # ~16%
"Digits and special": 15000, # ~10%
"Byte fallback (256)": 256, # ~0.2%
"Special tokens": 11808, # ~8%
}
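The byte-level fallback is what lets those 256 byte tokens guarantee that any input can be encoded. A toy sketch of the mechanism (real BPE also applies subword merges first, and the byte token ids here are illustrative, not Qwen's actual mapping):

```python
def encode_with_byte_fallback(text, vocab):
    """Toy byte-fallback encoder: characters present in the vocab map
    to a single token id; anything else decomposes into its UTF-8
    bytes, so no input can ever fail to tokenize. (Real BPE also
    applies subword merges; byte ids 0-255 here are illustrative.)"""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            ids.extend(ch.encode("utf-8"))  # one token per byte
    return ids

toy_vocab = {"你": 260, "好": 261, "h": 262, "i": 263}
print(encode_with_byte_fallback("你好hi", toy_vocab))  # [260, 261, 262, 263]
print(encode_with_byte_fallback("你好☃", toy_vocab))   # [260, 261, 226, 152, 131]
```

The snowman character is absent from the toy vocab, so it falls back to its three UTF-8 bytes: worse compression, but never an encoding failure.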
Training Data: 18T+ Tokens
Data Composition
Qwen 2.5 was trained on over 18 trillion tokens, one of the largest disclosed training corpora for an open-weight model. The data composition emphasizes Chinese and code more heavily than Llama 3.
def qwen25_training_data():
"""
Estimated training data composition for Qwen 2.5.
(Alibaba has not published exact proportions.)
"""
estimated_composition = {
"web_english": {
"proportion": 0.35,
"tokens_T": 6.3,
"source": "CommonCrawl filtered, similar to FineWeb",
},
"web_chinese": {
"proportion": 0.20,
"tokens_T": 3.6,
"source": "Chinese web corpora, proprietary filtering",
},
"code": {
"proportion": 0.15,
"tokens_T": 2.7,
"source": "GitHub, code competition data",
},
"math": {
"proportion": 0.08,
"tokens_T": 1.4,
"source": "Mathematical text, textbooks, proofs",
},
"multilingual_other": {
"proportion": 0.10,
"tokens_T": 1.8,
"source": "Japanese, Korean, Arabic, European languages",
},
"books_academic": {
"proportion": 0.07,
"tokens_T": 1.3,
"source": "Books, academic papers, encyclopedias",
},
"synthetic": {
"proportion": 0.05,
"tokens_T": 0.9,
"source": "Synthetic data for math, code, instruction following",
},
}
return estimated_composition
Qwen 2.5 notably uses synthetic data for math and code training. The Qwen team generated millions of step-by-step mathematical solutions and code problems using earlier Qwen models, filtered for correctness, and included them in the pretraining data. This bootstrapping approach (training on outputs from previous model generations) contributed significantly to Qwen 2.5’s strong math and code performance.
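The bootstrapping loop described above reduces to rejection sampling against a trusted checker. A toy sketch; `generate_candidates` is a hypothetical stub standing in for sampling chain-of-thought solutions from an earlier-generation model:

```python
import random

def generate_candidates(problem, n=8):
    """Hypothetical stub standing in for sampling n chain-of-thought
    solutions from an earlier-generation model; some are wrong."""
    out = []
    for _ in range(n):
        noise = random.choice([0, 0, 0, 1])  # ~25% of samples are wrong
        ans = problem["a"] + problem["b"] + noise
        out.append({"steps": f"{problem['a']} + {problem['b']} = {ans}",
                    "answer": ans})
    return out

def filter_verified(problem, candidates):
    """The 'filtered for correctness' step: keep only solutions whose
    final answer agrees with a trusted checker."""
    truth = problem["a"] + problem["b"]
    return [c for c in candidates if c["answer"] == truth]

problem = {"a": 17, "b": 25}
kept = filter_verified(problem, generate_candidates(problem))
# Only the verified solutions enter the pretraining mix.
```

For code the checker is typically unit-test execution; for math, answer matching or a symbolic verifier plays the same role.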
Data Quality Pipeline
def qwen_data_pipeline():
"""
Qwen's data processing pipeline (inferred from publications).
"""
stages = [
{
"stage": "Deduplication",
"method": "MinHash + exact dedup",
"reduction": "~40% of raw data removed",
},
{
"stage": "Language detection",
"method": "FastText classifier",
"purpose": "Route to language-specific filters",
},
{
"stage": "Quality filtering",
"method": "Perplexity-based + classifier ensemble",
"reduction": "~30% of remaining data removed",
},
{
"stage": "Safety filtering",
"method": "Toxicity classifier + PII removal",
"reduction": "~5% removed",
},
{
"stage": "Domain upsampling",
"method": "Increase weight for code, math, science",
"effect": "2-3x oversampling of high-value domains",
},
]
return stages
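The MinHash stage above can be sketched with nothing but the standard library. This is an illustrative toy; real pipelines bucket signatures with LSH banding to avoid all-pairs comparison:

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(doc, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the
    document's shingles; the fraction of matching positions between two
    signatures estimates the Jaccard similarity of the shingle sets."""
    return [
        min(int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                key=seed.to_bytes(8, "big")).digest(), "big")
            for s in shingles(doc))
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig_a = minhash_signature("the quick brown fox jumps over the lazy dog")
sig_b = minhash_signature("the quick brown fox jumped over the lazy dog")
sig_c = minhash_signature("completely unrelated text about something else")
# Near-duplicates score high; unrelated documents score near zero.
```

Documents whose estimated similarity exceeds a threshold (often ~0.8) are treated as near-duplicates and collapsed; exact dedup catches the literal copies first.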
GQA Configuration
Variable KV Heads Across Model Sizes
Unlike Llama 3 (which uses 8 KV heads for all model sizes), Qwen 2.5 varies the KV head count based on model size:
def qwen_gqa_rationale():
"""
Qwen's GQA configuration rationale.
Smaller models use fewer KV heads (more aggressive compression).
"""
configs = {
"0.5B": {"q_heads": 14, "kv_heads": 2, "ratio": 7, "reason": "Tiny model, aggressive KV compression needed"},
"3B": {"q_heads": 16, "kv_heads": 2, "ratio": 8, "reason": "Small model, still aggressive"},
"7B": {"q_heads": 28, "kv_heads": 4, "ratio": 7, "reason": "Medium model, moderate compression"},
"14B": {"q_heads": 40, "kv_heads": 8, "ratio": 5, "reason": "Large model, balanced"},
"32B": {"q_heads": 40, "kv_heads": 8, "ratio": 5, "reason": "Same as 14B (deeper, not wider attention)"},
"72B": {"q_heads": 64, "kv_heads": 8, "ratio": 8, "reason": "Match Llama 3 configuration"},
}
return configs
The reasoning: smaller models have less representational capacity, so they can afford more aggressive KV compression without quality loss. The 0.5B model uses only 2 KV heads (7:1 ratio), while the 72B uses 8 KV heads (8:1 ratio).
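Mechanically, GQA shares each KV head across a contiguous group of query heads. A minimal sketch of the grouping:

```python
def kv_head_for_q_head(q_head, num_q_heads, num_kv_heads):
    """GQA assigns consecutive query heads to one shared KV head:
    query head i reads the K/V of head i // (num_q_heads // num_kv_heads)."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# Qwen 2.5 0.5B: 14 query heads share 2 KV heads (groups of 7).
assignment = [kv_head_for_q_head(i, 14, 2) for i in range(14)]
print(assignment)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

At attention time each K/V tensor is simply broadcast (or repeated) across its group of query heads, which is why only the KV heads count toward cache size.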
KV Cache Size per Token Across Qwen 2.5 Family
| Model | KV Heads | KV Cache/Token/Layer (FP16) | KV Cache at 128K (All Layers) |
|---|---|---|---|
| 0.5B | 2 | 512 bytes | 1.6 GB |
| 3B | 2 | 1024 bytes | 4.8 GB |
| 7B | 4 | 2048 bytes | 7.5 GB |
| 14B | 8 | 4096 bytes | 25.8 GB |
| 32B | 8 | 4096 bytes | 34.4 GB |
| 72B | 8 | 4096 bytes | 42.9 GB |
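These figures follow directly from the configs: each cached token stores one K and one V vector per KV head, per layer. A quick calculator assuming FP16 (2 bytes per element):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len=131072, bytes_per_elem=2):
    """KV cache for one sequence: a K and a V vector (factor of 2)
    per KV head, per layer, per cached token."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem
    return per_token_per_layer * seq_len * num_layers

print(f"{kv_cache_bytes(80, 8, 128) / 1e9:.1f} GB")  # 72B at 128K: 42.9 GB
print(f"{kv_cache_bytes(24, 2, 64) / 1e9:.1f} GB")   # 0.5B at 128K: 1.6 GB
```

These are per-sequence numbers; serving a batch multiplies the cache accordingly, which is why aggressive GQA matters so much at the small end.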
SwiGLU FFN and Layer Dimensions
FFN Dimension Choices
Qwen 2.5 uses SwiGLU activation with a slightly different FFN multiplier than Llama 3:
def ffn_analysis():
"""Compare FFN dimensions and parameter ratios."""
models = {
"Llama 3.1 70B": {
"d_model": 8192,
"d_ff": 28672,
"ratio": 28672 / 8192, # 3.5x
"ffn_params_per_layer": 3 * 8192 * 28672, # SwiGLU: 3 matrices
},
"Qwen 2.5 72B": {
"d_model": 8192,
"d_ff": 29568,
"ratio": 29568 / 8192, # 3.61x
"ffn_params_per_layer": 3 * 8192 * 29568,
},
}
# Qwen uses a slightly larger FFN multiplier
# This means slightly more compute per layer
# The extra ~3% FFN params account for most of the
# parameter difference between the two models
return models
The FFN/attention parameter ratio in Qwen 2.5 72B is roughly 5:1, meaning over 80% of each layer's weights sit in the FFN block. This split is consistent across most modern LLMs.
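A quick check of that per-layer split for the 72B configuration (weight matrices only):

```python
def layer_param_split(d_model, num_q_heads, num_kv_heads, head_dim, d_ff):
    """Per-layer weight counts for the attention and SwiGLU FFN blocks
    (weight matrices only; biases and norms are negligible)."""
    attn = d_model * num_q_heads * head_dim        # W_q
    attn += 2 * d_model * num_kv_heads * head_dim  # W_k, W_v
    attn += num_q_heads * head_dim * d_model       # W_o
    ffn = 3 * d_model * d_ff                       # gate, up, down
    return attn, ffn

attn, ffn = layer_param_split(8192, 64, 8, 128, 29568)
print(f"FFN/attention = {ffn / attn:.1f}:1, FFN share = {ffn / (attn + ffn):.0%}")
# FFN/attention = 4.8:1, FFN share = 83%
```

Note that GQA shrinks the attention side (W_k and W_v are 8x smaller than W_q), which pushes the ratio well above the 3:1 it would be with full multi-head attention.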
Context Extension and RoPE
YaRN-Based Extension
Qwen 2.5 supports 128K context through YaRN (Yet another RoPE extensioN), a technique that modifies RoPE frequencies based on their wavelength relative to the target context length.
import torch

def yarn_rope_scaling(
    base=10000.0,
    head_dim=128,
    original_ctx=4096,
    target_ctx=131072,
    beta_fast=32,
    beta_slow=1,
):
    """
    YaRN RoPE scaling for context extension.
    Divides frequencies into three regions by how many rotations each
    completes within the original context window:
    1. High frequency (many rotations): no scaling needed
    2. Medium frequency: smooth interpolation between the two
    3. Low frequency (few rotations): full scaling
    (YaRN also applies a small attention-logit temperature; omitted here.)
    """
    scale_factor = target_ctx / original_ctx  # 32x
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    wavelengths = 2 * torch.pi / freqs
    scaled_freqs = torch.zeros_like(freqs)
    for i, (f, w) in enumerate(zip(freqs, wavelengths)):
        rotations = original_ctx / w  # rotations within the original context
        if rotations > beta_fast:
            # High frequency: no scaling
            scaled_freqs[i] = f
        elif rotations < beta_slow:
            # Low frequency: full interpolation scaling
            scaled_freqs[i] = f / scale_factor
        else:
            # Medium frequency: ramp from fully scaled (at beta_slow)
            # to unscaled (at beta_fast)
            smooth = (rotations - beta_slow) / (beta_fast - beta_slow)
            scaled_freqs[i] = (f / scale_factor) * (1 - smooth) + f * smooth
    return scaled_freqs
Continued Pretraining for Long Context
Like Llama 3.1, Qwen 2.5 undergoes continued pretraining on progressively longer sequences after the initial short-context training:
def qwen25_context_extension():
"""Context extension schedule."""
stages = [
{"ctx": 4096, "tokens": "~18T", "phase": "Main pretraining"},
{"ctx": 32768, "tokens": "~200B", "phase": "First extension"},
{"ctx": 65536, "tokens": "~100B", "phase": "Second extension"},
{"ctx": 131072, "tokens": "~50B", "phase": "Final extension"},
]
return stages
Code and Math Specialization
Why Qwen Excels at Code
Qwen 2.5 was explicitly designed to be strong at code generation. The training recipe includes:
- 15%+ code in pretraining data: Higher proportion than most models.
- Code-specific synthetic data: Generated coding problems and solutions.
- Multi-language code coverage: Python, JavaScript, TypeScript, Java, C++, Rust, Go, and more.
- Qwen2.5-Coder variant: Specialized code model trained on additional code-focused data.
def code_benchmark_comparison():
"""Code benchmark results."""
benchmarks = {
"HumanEval": {
"Qwen 2.5 72B": 86.6,
"Llama 3.1 70B": 80.5,
"DeepSeek V3": 92.7,
"GPT-4o": 90.2,
},
"MBPP": {
"Qwen 2.5 72B": 88.2,
"Llama 3.1 70B": 82.4,
"DeepSeek V3": 90.1,
"GPT-4o": 88.5,
},
"MultiPL-E (avg across languages)": {
"Qwen 2.5 72B": 78.3,
"Llama 3.1 70B": 71.2,
"DeepSeek V3": 82.5,
"GPT-4o": 79.8,
},
}
return benchmarks
Code and Math Benchmarks
| Benchmark | Qwen 2.5 72B | Llama 3.1 70B | DeepSeek V3 | GPT-4o |
|---|---|---|---|---|
| HumanEval | 86.6 | 80.5 | 92.7 | 90.2 |
| MBPP | 88.2 | 82.4 | 90.1 | 88.5 |
| MATH-500 | 83.1 | 68.0 | 90.2 | 76.6 |
| GSM8K | 91.5 | 95.1 | 97.8 | 95.8 |
| MMLU | 85.3 | 86.0 | 88.5 | 88.7 |
Math Performance
Qwen 2.5 72B scores 83.1 on MATH-500, significantly above Llama 3.1 70B’s 68.0. The gap comes from:
- Math-specific synthetic data: Step-by-step solutions generated and verified.
- Chain-of-thought training: The SFT data includes detailed mathematical reasoning chains.
- Higher proportion of mathematical text in pretraining: Textbooks, arXiv papers, competition problems.
MATH-500 Score Comparison
(Chart: MATH-500 accuracy (%) by model.)
On MATH-500, Qwen 2.5 72B (83.1) outperforms Llama 3.1 405B (73.8) by a significant margin despite being 5.6x smaller. This demonstrates that training data quality and composition matter more than raw parameter count for specialized capabilities like mathematical reasoning.
Multilingual Performance
CJK Advantage
Qwen 2.5’s primary differentiator is multilingual performance, especially for Chinese:
def multilingual_benchmarks():
"""
Multilingual benchmark comparison.
"""
results = {
"C-Eval (Chinese)": {
"Qwen 2.5 72B": 86.1,
"Llama 3.1 70B": 55.2,
"DeepSeek V3": 78.0,
},
"CMMLU (Chinese)": {
"Qwen 2.5 72B": 85.3,
"Llama 3.1 70B": 52.8,
"DeepSeek V3": 77.4,
},
"JLPT (Japanese)": {
"Qwen 2.5 72B": 78.5,
"Llama 3.1 70B": 62.3,
"DeepSeek V3": 71.2,
},
}
return results
Multilingual Benchmark Comparison
| Benchmark | Qwen 2.5 72B | Llama 3.1 70B | Gap |
|---|---|---|---|
| C-Eval (Chinese) | 86.1 | 55.2 | +30.9 points for Qwen |
| CMMLU (Chinese) | 85.3 | 52.8 | +32.5 points for Qwen |
| MMLU (English) | 85.3 | 86.0 | -0.7 points (Llama slightly better) |
| ARC-C (English) | 88.4 | 87.3 | +1.1 points for Qwen |
The 30+ point gap on Chinese benchmarks comes from three factors:
- 20% Chinese training data vs Llama’s estimated 5-10%.
- 152K vocabulary optimized for CJK compression.
- Chinese-specific synthetic data for math and knowledge tasks.
Open-Weight Release Strategy
Model Variants
Alibaba releases Qwen 2.5 in an unusually comprehensive set of variants:
QWEN25_RELEASE_MATRIX = {
"base_models": ["0.5B", "1.5B", "3B", "7B", "14B", "32B", "72B"],
"instruct_models": ["0.5B-Instruct", "1.5B-Instruct", "3B-Instruct",
"7B-Instruct", "14B-Instruct", "32B-Instruct",
"72B-Instruct"],
"specialized": [
"Qwen2.5-Coder-7B",
"Qwen2.5-Coder-32B",
"Qwen2.5-Math-7B",
"Qwen2.5-Math-72B",
],
"moe_variants": [
"Qwen2.5-MoE-A2.7B (14.3B total)",
],
"license": "Apache 2.0 (permissive) for most sizes",
}
This breadth is strategic: by providing models at every scale from mobile (0.5B) to datacenter (72B), Alibaba ensures Qwen becomes the default choice for the Chinese-language AI ecosystem.
Comparison: Qwen 2.5 vs Llama 3 vs DeepSeek V3
Systematic Analysis
def three_way_comparison():
"""Systematic comparison across dimensions."""
dimensions = {
"architecture": {
"Qwen 2.5 72B": "Dense, GQA-8, SwiGLU, RoPE, 152K vocab",
"Llama 3.1 70B": "Dense, GQA-8, SwiGLU, RoPE, 128K vocab",
"DeepSeek V3": "MoE (256e), MLA, SwiGLU, RoPE, 128K vocab",
},
"training_tokens": {
"Qwen 2.5 72B": "18T+",
"Llama 3.1 70B": "15T+",
"DeepSeek V3": "14.8T",
},
"chinese_quality": {
"Qwen 2.5 72B": "Best",
"Llama 3.1 70B": "Weakest",
"DeepSeek V3": "Strong",
},
"english_quality": {
"Qwen 2.5 72B": "Strong",
"Llama 3.1 70B": "Strong",
"DeepSeek V3": "Best",
},
"math_quality": {
"Qwen 2.5 72B": "Very strong",
"Llama 3.1 70B": "Moderate",
"DeepSeek V3": "Best",
},
"serving_simplicity": {
"Qwen 2.5 72B": "Simple (dense)",
"Llama 3.1 70B": "Simple (dense)",
"DeepSeek V3": "Complex (MoE)",
},
"memory_requirement": {
"Qwen 2.5 72B": "~145 GB FP16",
"Llama 3.1 70B": "~140 GB FP16",
"DeepSeek V3": "~1340 GB FP16",
},
}
return dimensions
Three-Way Model Comparison
| Dimension | Qwen 2.5 72B | Llama 3.1 70B | DeepSeek V3 |
|---|---|---|---|
| MMLU | 85.3 | 86.0 | 88.5 |
| MATH-500 | 83.1 | 68.0 | 90.2 |
| HumanEval | 86.6 | 80.5 | 92.7 |
| C-Eval (Chinese) | 86.1 | 55.2 | 78.0 |
| Training tokens | 18T+ | 15T+ | 14.8T |
| Memory (FP16) | 145 GB | 140 GB | 1,342 GB |
| Serving GPUs (A100) | 2 | 2 | 20 |
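The GPU counts in the table can be reproduced from a weights-only estimate; real deployments add headroom for KV cache and activations, which is why DeepSeek V3 is listed at 20 rather than the bare minimum:

```python
import math

def min_gpus_fp16(params_B, gpu_mem_GB=80):
    """Minimum A100-80GB count to hold FP16 weights alone
    (2 bytes per parameter); returns (weights in GB, GPU count).
    Real deployments add headroom for KV cache and activations."""
    weights_GB = params_B * 2
    return weights_GB, math.ceil(weights_GB / gpu_mem_GB)

print(min_gpus_fp16(72.71))  # Qwen 2.5 72B: ~145 GB -> 2 GPUs
print(min_gpus_fp16(671.0))  # DeepSeek V3 (671B total): 1342 GB -> 17 GPUs
```

Two A100s leave only ~15 GB for cache and activations at the 72B scale, so practical FP16 serving at long context often uses a third GPU or weight quantization.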
What Makes Qwen 2.5 Competitive
The Data Advantage
Qwen 2.5’s primary competitive advantage is training data, not architecture. The architecture is standard (nearly identical to Llama 3). The quality comes from:
- More training tokens (18T vs 15T): 20% more training data.
- Better multilingual data: Proprietary Chinese web corpora that Western labs cannot easily access.
- Synthetic data augmentation: Code and math synthetic data generated by earlier Qwen models.
- Data quality filtering: Aggressive perplexity-based and classifier-based filtering.
def training_data_impact_analysis():
"""
Estimate the contribution of different factors to Qwen's performance.
"""
factors = {
"architecture": {
"contribution": "~5%",
"evidence": "Nearly identical to Llama 3 architecture",
},
"training_tokens": {
"contribution": "~20%",
"evidence": "18T vs 15T = 20% more data",
},
"data_quality_filtering": {
"contribution": "~25%",
"evidence": "Better Chinese data, stronger dedup",
},
"synthetic_data": {
"contribution": "~25%",
"evidence": "Math and code synthetic data (huge impact on MATH benchmark)",
},
"post_training": {
"contribution": "~25%",
"evidence": "DPO/RLHF quality, instruction following",
},
}
return factors
Qwen 2.5 demonstrates that the architecture wars are largely over. Two labs can build nearly identical architectures and get very different results based on training data. The frontier of LLM quality has shifted from architecture innovation to data engineering — curation, filtering, synthetic generation, and domain balance. Qwen 2.5’s strength in code and math comes from better data in those domains, not from any architectural innovation.
Summary
Qwen 2.5 is a strong open-weight LLM family that demonstrates data-driven quality:
- Architecture: Standard modern LLM architecture, nearly identical to Llama 3. GQA-8, SwiGLU, RoPE, pre-norm.
- Vocabulary: 152K tokens with strong CJK optimization, contributing to 13% better Chinese tokenization.
- Training: 18T+ tokens with emphasis on Chinese, code, and math data including synthetic augmentation.
- Performance: Matches Llama 3 on English, dramatically exceeds it on Chinese (+30 points on C-Eval), and outperforms on math (+15 points on MATH-500).
- Serving: Dense architecture at 72B is straightforward to deploy (2 A100 80GB for FP16).
- Release strategy: Comprehensive model family from 0.5B to 72B with specialized code and math variants.
For Chinese-language applications, Qwen 2.5 is the clear best choice among open-weight models. For English, it trades blows with Llama 3. For math and code, it punches well above its weight class.