Data curation, not architectural novelty, propelled 01.AI's Yi models into the frontier tier. Yi-Lightning reached GPT-4-class performance without inventing new attention mechanisms or MoE routing schemes. Instead, 01.AI built a tokenizer balanced for English-Chinese workloads, filtered training data through a cascade of quality gates that rejected over 90% of candidate documents, and executed a training run at roughly one-tenth the cost of comparable Western efforts. When architecture is commoditized, the dataset becomes the moat.
Yi Model Family Overview
```python
class YiModelFamily:
    """
    Yi model evolution: from dense bilingual to frontier MoE.
    """
    models = {
        'Yi-6B': {
            'params': 6e9,
            'architecture': 'Dense Transformer',
            'context': 4096,
            'vocab_size': 64000,
            'layers': 32,
            'hidden': 4096,
            'heads': 32,
            'kv_heads': 4,  # GQA
            'release': '2023-11',
        },
        'Yi-34B': {
            'params': 34e9,
            'architecture': 'Dense Transformer',
            'context': 4096,  # Extended to 200K post-training
            'vocab_size': 64000,
            'layers': 60,
            'hidden': 7168,
            'heads': 56,
            'kv_heads': 8,
            'release': '2023-11',
        },
        'Yi-1.5-34B': {
            'params': 34e9,
            'architecture': 'Dense Transformer',
            'context': 4096,
            'vocab_size': 64000,
            'layers': 60,
            'hidden': 7168,
            'heads': 56,
            'kv_heads': 8,
            'release': '2024-05',
            'note': '3.6T tokens, improved data quality',
        },
        'Yi-Lightning': {
            'params': '~200B',  # Estimated
            'architecture': 'MoE',
            'context': 16384,
            'vocab_size': 64000,
            'experts': '~60',
            'top_k': '~6',
            'release': '2024-07',
            'note': 'Closed-weight, API only. Competitive with GPT-4o on LMSYS.',
        },
    }
```
Yi Model Family Specifications
| Model | Params | Architecture | Context | Training Tokens | MMLU |
|---|---|---|---|---|---|
| Yi-6B | 6B | Dense | 4K | 3.0T | 63.2 |
| Yi-34B | 34B | Dense | 4K (200K ext) | 3.0T | 76.3 |
| Yi-1.5-9B | 9B | Dense | 4K | 3.6T | 69.1 |
| Yi-1.5-34B | 34B | Dense | 4K | 3.6T | 77.2 |
| Yi-Lightning | ~200B | MoE | 16K | Unknown | ~82* |

\*Estimated (closed-weight model).
Bilingual Tokenizer Design
```python
class YiTokenizerAnalysis:
    """
    Yi's tokenizer is designed for balanced English/Chinese efficiency.
    Standard LLM tokenizers (Llama, GPT) are optimized for English,
    resulting in 2-3x more tokens per Chinese character.
    Yi's tokenizer achieves near-parity.
    """

    def __init__(self):
        self.vocab_size = 64000
        # Yi's vocabulary composition
        self.vocab_composition = {
            'byte_tokens': 256,       # UTF-8 byte fallback
            'english_tokens': 28000,  # English subwords
            'chinese_tokens': 20000,  # Chinese characters and common phrases
            'code_tokens': 8000,      # Programming language tokens
            'multilingual': 4000,     # Other languages
            'special_tokens': 3744,   # Control, padding, etc.
        }

    def compare_efficiency(self):
        """
        Compare tokens-per-character for English vs Chinese text
        (representative measurements, not computed from live tokenizers).
        """
        # English text: "The transformer architecture uses self-attention"
        #   Llama tokenizer: ~9 tokens (good efficiency for English)
        #   Yi tokenizer:    ~9 tokens (comparable for English)
        # Chinese text: "Transformer architecture uses self-attention mechanism" (in Chinese)
        #   Llama tokenizer: ~25 tokens (poor - each char is 2-3 tokens)
        #   Yi tokenizer:    ~12 tokens (good - common chars are single tokens)
        return {
            'english': {
                'llama_tpc': 0.22,  # tokens per character
                'yi_tpc': 0.23,     # slightly worse for English
                'ratio': 1.05,      # Yi uses 5% more tokens for English
            },
            'chinese': {
                'llama_tpc': 0.67,  # tokens per character
                'yi_tpc': 0.35,     # much better for Chinese
                'ratio': 0.52,      # Yi uses 48% fewer tokens for Chinese
            },
            'mixed': {
                'llama_tpc': 0.40,
                'yi_tpc': 0.28,
                'ratio': 0.70,      # 30% savings on mixed content
            },
        }
```
```python
def tokenizer_efficiency_impact():
    """
    Tokenizer efficiency directly impacts:
    1. Training cost: fewer tokens = less compute for same text
    2. Context window utilization: more text per context
    3. Inference cost: fewer decode steps for same output
    """
    # For a 3T token training run on bilingual data (50% EN, 50% ZH)
    en_chars_per_token_llama = 4.5
    zh_chars_per_token_llama = 1.5
    en_chars_per_token_yi = 4.3
    zh_chars_per_token_yi = 2.9

    # Characters covered per token, averaged over the bilingual mix
    llama_coverage = 0.5 * en_chars_per_token_llama + 0.5 * zh_chars_per_token_llama
    yi_coverage = 0.5 * en_chars_per_token_yi + 0.5 * zh_chars_per_token_yi

    efficiency_gain = yi_coverage / llama_coverage - 1
    print(f"Yi covers {efficiency_gain:.1%} more characters per token on bilingual data")
    # Yi covers ~20% more characters per token (3.6 vs 3.0 chars/token)
    # Effectively, 3T Yi tokens cover as much text as ~3.6T Llama tokens
```
Tokenizer Efficiency: Characters per Token
Yi’s bilingual tokenizer processes 93% more Chinese text per token than Llama’s tokenizer (2.9 vs 1.5 characters/token). For bilingual workloads, this translates to 20% more text covered per training token, effectively giving Yi a 20% training efficiency advantage on the same token budget.
Data Engineering Pipeline
```python
class YiDataPipeline:
    """
    Yi's primary innovation is data quality, not architecture.
    Their multi-stage filtering pipeline is the key differentiator.
    (Filter-stage implementations are elided; comments give approximate retention rates.)
    """

    def __init__(self):
        self.stages = [
            'web_crawl_collection',
            'language_identification',
            'deduplication',
            'quality_filtering',
            'toxicity_removal',
            'domain_classification',
            'curriculum_scheduling',
        ]

    def quality_filtering_cascade(self, raw_data):
        """
        Multi-stage quality filtering.
        Each stage has increasing precision at decreasing recall.
        """
        # Stage 1: Heuristic filters
        # - Minimum length (200 chars)
        # - Maximum repetition ratio (< 30%)
        # - Language detection confidence (> 0.8)
        # - Remove boilerplate (headers, footers, navbars)
        after_heuristic = self.heuristic_filter(raw_data)
        # Retains ~40% of raw data

        # Stage 2: Perplexity-based filtering
        # Use a small LM to score document quality;
        # remove high-perplexity documents (gibberish, OCR errors)
        after_perplexity = self.perplexity_filter(after_heuristic)
        # Retains ~70% of stage 1 output

        # Stage 3: Classifier-based quality scoring
        # Train a classifier on human-annotated quality labels;
        # score each document, keep top 50%
        after_classifier = self.classifier_filter(after_perplexity)
        # Retains ~50% of stage 2 output

        # Stage 4: Topic deduplication
        # Cluster documents by semantic similarity;
        # keep most informative document per cluster
        after_dedup = self.semantic_dedup(after_classifier)
        # Retains ~60% of stage 3 output

        # Overall: ~8% of raw data survives all stages
        return after_dedup

    def curriculum_schedule(self, filtered_data, total_tokens=3.6e12):
        """
        Yi uses curriculum learning: easier data first, harder data later.
        """
        phases = [
            {
                'phase': 1,
                'tokens': total_tokens * 0.40,
                'composition': {
                    'web_general': 0.50,
                    'books': 0.20,
                    'code': 0.15,
                    'academic': 0.10,
                    'multilingual': 0.05,
                },
                'difficulty': 'easy (clean web, books)',
            },
            {
                'phase': 2,
                'tokens': total_tokens * 0.35,
                'composition': {
                    'web_general': 0.30,
                    'books': 0.15,
                    'code': 0.20,
                    'academic': 0.20,
                    'math': 0.10,
                    'multilingual': 0.05,
                },
                'difficulty': 'medium (more code, academic)',
            },
            {
                'phase': 3,
                'tokens': total_tokens * 0.25,
                'composition': {
                    'web_curated': 0.20,
                    'code': 0.25,
                    'academic': 0.20,
                    'math': 0.15,
                    'reasoning': 0.10,
                    'multilingual': 0.10,
                },
                'difficulty': 'hard (reasoning, math, diverse)',
            },
        ]
        return phases
```
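The stage-by-stage retention rates quoted in the cascade above compound multiplicatively to the overall ~8% survival figure; a quick sanity check:

```python
from math import prod

# Per-stage retention rates quoted in the filtering cascade above
retention = {
    'heuristic_filter': 0.40,    # length, repetition, language, boilerplate
    'perplexity_filter': 0.70,   # small-LM quality scoring
    'quality_classifier': 0.50,  # keep top half by learned quality score
    'semantic_dedup': 0.60,      # one representative doc per cluster
}

overall = prod(retention.values())
print(f"Overall retention: {overall:.1%} of raw data")  # -> 8.4%
```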
Data Pipeline Impact on Model Quality
| Data Strategy | Training Tokens | MMLU | GSM8K | HumanEval | Notes |
|---|---|---|---|---|---|
| Raw web data | 3T | 61.2 | 32.4 | 18.5 | Baseline (no filtering) |
| Heuristic filtering | 3T | 66.8 | 41.2 | 25.3 | Basic cleanup |
| + Perplexity filter | 3T | 69.4 | 48.7 | 30.1 | Remove noise |
| + Classifier quality | 3T | 73.1 | 55.3 | 35.8 | Human-judged quality |
| + Curriculum schedule | 3T | 74.8 | 58.1 | 38.2 | Ordered difficulty |
| Yi-34B (final) | 3T | 76.3 | 67.4 | 41.1 | Full pipeline |
Data quality filtering accounts for 15 MMLU points (61.2 to 76.3) on the same architecture and token budget. Architecture changes at this scale typically contribute 1-3 MMLU points. Yi’s lesson: data engineering has a larger quality ceiling than architecture engineering for dense transformers.
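Read as a stack of marginal gains, the ablation table above attributes most of the improvement to the early filtering stages; a short script (numbers copied from the table) makes the per-stage deltas explicit:

```python
# MMLU after each cumulative data-pipeline stage (from the ablation table above)
mmlu_by_stage = [
    ('raw web data', 61.2),
    ('heuristic filtering', 66.8),
    ('+ perplexity filter', 69.4),
    ('+ classifier quality', 73.1),
    ('+ curriculum schedule', 74.8),
    ('full pipeline (Yi-34B)', 76.3),
]

prev = mmlu_by_stage[0][1]
for name, score in mmlu_by_stage[1:]:
    print(f"{name:<25} +{score - prev:.1f} MMLU")
    prev = score
print(f"{'total':<25} +{mmlu_by_stage[-1][1] - mmlu_by_stage[0][1]:.1f} MMLU")
```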
Architecture Decisions
```python
def yi_architecture_analysis():
    """
    Yi uses a standard transformer architecture with minor modifications.
    The key choices are conservative and well-established.
    """
    architecture = {
        'attention': 'GQA (Grouped Query Attention)',
        'gqa_ratio': '7:1 on Yi-34B (56 query heads, 8 KV heads); 8:1 on Yi-6B',
        'positional_encoding': 'RoPE (Rotary Position Embeddings)',
        'rope_base': 5_000_000,  # Extended for long context (10,000 at pretraining)
        'ffn': 'SwiGLU',
        'ffn_ratio': 'intermediate = hidden * 8/3 (rounded)',
        'normalization': 'RMSNorm (pre-norm)',
        'activation': 'SiLU',
        'initialization': 'Standard normal, scaled by 1/sqrt(2*n_layers)',
    }
    # Nothing exotic -- the value is in execution, not novelty
    return architecture
```
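One concrete payoff of the conservative GQA choice: the KV cache shrinks by the query-to-KV head ratio. A back-of-envelope sketch for Yi-34B (8 KV heads versus a hypothetical full 56-head cache, head_dim 7168/56 = 128, 60 layers, fp16):

```python
def kv_cache_bytes_per_token(n_kv_heads, head_dim=128, n_layers=60, bytes_per_el=2):
    # K and V each store head_dim values per KV head per layer
    return 2 * n_kv_heads * head_dim * bytes_per_el * n_layers

mha = kv_cache_bytes_per_token(n_kv_heads=56)  # hypothetical MHA cache (all 56 heads)
gqa = kv_cache_bytes_per_token(n_kv_heads=8)   # Yi-34B's actual GQA config

print(f"MHA: {mha // 1024} KiB/token, GQA: {gqa // 1024} KiB/token, "
      f"{mha // gqa}x smaller")  # -> MHA: 1680 KiB/token, GQA: 240 KiB/token, 7x smaller
```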
Yi-Lightning: The MoE Pivot
```python
def yi_lightning_analysis():
    """
    Yi-Lightning (July 2024) marked 01.AI's pivot to MoE:
    a closed-weight, API-only model that briefly ranked near the
    top of LMSYS Arena.
    """
    estimated_config = {
        'total_params': '~200B',
        'active_params': '~30B',
        'experts': '~60',
        'top_k': '~6',
        'layers': '~48',
        'hidden_dim': '~6144',
        'context': 16384,
        'training_tokens': 'Unknown (estimated 5-10T)',
    }

    # Performance context
    arena_results = {
        'yi_lightning': {
            'arena_elo': 1208,  # July 2024 peak
            'vs_gpt4o': 'Slightly below',
            'vs_claude35': 'Competitive',
            'cost': '$0.14/M input tokens',
        },
    }

    # Key observation: Yi-Lightning showed that Chinese labs could
    # compete at the frontier with MoE + data quality, without
    # needing novel architecture (like DeepSeek's MLA).
    return estimated_config, arena_results
```
Yi-Lightning vs Competitors (July 2024)
| Model | LMSYS ELO | MMLU | Estimated Params | API Cost ($/M tokens) |
|---|---|---|---|---|
| GPT-4o | 1248 | 88.7 | ~200B MoE* | $5.00 |
| Claude 3.5 Sonnet | 1253 | 88.3 | ~100B* | $3.00 |
| Yi-Lightning | 1208 | ~82 | ~200B MoE | $0.14 |
| Llama 3.1 405B | 1178 | 87.3 | 405B Dense | Open |
| DeepSeek V2.5 | 1195 | ~80 | 236B MoE | $0.14 |

\*Estimated (undisclosed by the provider).
Yi-Lightning’s API pricing of $0.14/M input tokens undercut GPT-4o’s $5.00/M by roughly 35x while achieving competitive quality. This pricing, combined with similar moves by DeepSeek, established the cost floor that forced Western providers to reduce prices throughout 2024.
Long Context Extension
```python
def yi_context_extension():
    """
    Yi-34B was extended from 4K to 200K context post-training
    using a RoPE frequency scaling approach.
    """
    # Original training: 4K context, RoPE base = 10000
    # Extension: progressively increase RoPE base
    # and fine-tune on long-context data
    extension_stages = [
        {'context': 4096,   'rope_base': 10_000,    'fine_tune_tokens': 0},
        {'context': 32768,  'rope_base': 500_000,   'fine_tune_tokens': 5e9},
        {'context': 131072, 'rope_base': 2_000_000, 'fine_tune_tokens': 3e9},
        {'context': 200000, 'rope_base': 5_000_000, 'fine_tune_tokens': 2e9},
    ]
    # Total extension cost: ~10B tokens of fine-tuning
    # Original training: 3T tokens
    # Extension cost: ~0.3% of original training budget
    return extension_stages
```
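The intuition behind raising the RoPE base: each rotary pair rotates with some wavelength measured in tokens, and raising the base lowers every frequency, so a long sequence sweeps roughly the same angular range the model saw at 4K during pretraining. A sketch of the standard RoPE wavelength formula (illustrative; the staged bases come from the schedule above, the rest is not Yi's exact recipe):

```python
import math

def longest_rope_wavelength(base, head_dim=128):
    # Lowest RoPE frequency: theta = base**(-(d-2)/d); wavelength = 2*pi/theta tokens
    return 2 * math.pi * base ** ((head_dim - 2) / head_dim)

for context, base in [(4_096, 10_000), (32_768, 500_000),
                      (131_072, 2_000_000), (200_000, 5_000_000)]:
    w = longest_rope_wavelength(base)
    print(f"context {context:>7,}  rope_base {base:>9,}  "
          f"longest wavelength ~ {w:,.0f} tokens")
```

Each stage's larger base pushes the longest wavelength far past the new context length, so no position falls outside the rotation range the fine-tuning data then calibrates.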
Yi-34B Needle-in-Haystack Retrieval Accuracy
Serving Characteristics
```python
def yi_serving_analysis():
    """
    Yi model serving performance on various hardware.
    """
    models = {
        'Yi-34B': {
            'fp16_memory_gb': 68,
            'int4_memory_gb': 17,
            'tokens_per_sec_a100_fp16': 28,
            'tokens_per_sec_a100_int4': 45,
            'tokens_per_sec_4090_int4': 15,
            # K+V x 8 KV heads x head_dim 128 x 2 bytes (fp16) x 60 layers
            'kv_cache_per_token_bytes': 2 * 8 * 128 * 2 * 60,
        },
        'Yi-1.5-9B': {
            'fp16_memory_gb': 18,
            'int4_memory_gb': 4.5,
            'tokens_per_sec_a100_fp16': 85,
            'tokens_per_sec_a100_int4': 120,
            'tokens_per_sec_4090_int4': 42,
        },
    }
    # Bilingual serving advantage:
    # Yi represents Chinese text in 48% fewer tokens than Llama-style tokenizers.
    # For a bilingual deployment serving 50% Chinese traffic,
    # Yi serves ~20% more requests per GPU than Llama on the same workload.
    return models
```
Yi Serving Performance
| Model | Quant | Memory | tok/s (A100) | tok/s (RTX 4090) | Best For |
|---|---|---|---|---|---|
| Yi-1.5-9B | INT4 | 4.5 GB | 120 | 42 | Edge bilingual deployment |
| Yi-34B | INT4 | 17 GB | 45 | 15 | Quality-focused bilingual |
| Yi-34B | FP16 | 68 GB | 28 | N/A | Maximum accuracy |
| Yi-34B 200K | INT4 | 17 GB + KV | 32* | N/A | Long-context bilingual |
Yi-1.5-9B at INT4 fits in 4.5 GB — deployable on consumer GPUs and even mobile devices. For bilingual Chinese/English workloads, its tokenizer efficiency means it effectively processes 20% more content per token than Llama-based models, making it the optimal choice for bilingual edge deployment.
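At 200K context the KV cache, not the weights, dominates memory, which is why the 200K row above is marked "17 GB + KV". A worked computation using the per-token figure from the serving analysis (245,760 bytes for Yi-34B's fp16 GQA cache):

```python
# K+V x 8 KV heads x head_dim 128 x 2 bytes (fp16) x 60 layers
kv_bytes_per_token = 2 * 8 * 128 * 2 * 60  # 245,760 bytes

for context in (4_096, 32_768, 200_000):
    gib = kv_bytes_per_token * context / 2**30
    print(f"{context:>7,} tokens -> {gib:5.1f} GiB KV cache")
# At 200K tokens the fp16 KV cache (~45.8 GiB) dwarfs the 17 GB of INT4 weights.
```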
Impact on the Ecosystem
```python
def yi_ecosystem_impact():
    """
    Yi's contributions to the broader LLM ecosystem.
    """
    contributions = {
        'bilingual_tokenizer_standard': {
            'description': 'Yi demonstrated the value of language-balanced vocabularies',
            'adopted_by': 'Qwen, Baichuan, InternLM (all Chinese-English models)',
        },
        'data_quality_benchmark': {
            'description': 'Yi-34B matched Llama 2 70B at half the size through data quality',
            'message': 'Data engineering > model scaling at moderate scale',
        },
        'pricing_disruption': {
            'description': 'Yi-Lightning at $0.14/M tokens undercut Western APIs by ~35x',
            'impact': 'Contributed to the API pricing collapse of 2024',
        },
    }
    return contributions
```
Yi’s contribution to the frontier model landscape is primarily methodological rather than architectural. The model family proves that a standard transformer, combined with a bilingual-optimized tokenizer, rigorous data curation, and curriculum-based training, can reach frontier quality. Yi-Lightning’s brief run near the top of LMSYS Arena validated the data-first approach at MoE scale. The tokenizer design in particular offers a template for any team building bilingual models: invest in balanced vocabulary allocation, and the efficiency gains compound across the entire training and serving pipeline.