English has 4.8 trillion tokens of web text. Yoruba has 8 million, a 600,000x gap for a language with 45 million speakers. The power law is brutal: the top 10 languages hold 85% of all digital text, leaving 7,000+ languages to share the remaining 15%. Training a truly multilingual LLM means confronting this imbalance: either you oversample low-resource languages (risking memorization from limited data) or you accept that your model will be English-dominant with token-level understanding of everything else.
This post addresses the full pipeline for building multilingual training data: quantifying the data distribution, exploiting cross-lingual transfer, using translation as augmentation, building language-specific quality filters, analyzing tokenizer efficiency across scripts, and implementing a multilingual data mixer that balances coverage against quality.
The Data Distribution Problem
Quantifying the Imbalance
Web crawl data follows a power law across languages. Common Crawl statistics from 2024 show the distribution:
Token Count by Language in Common Crawl (Log Scale Approximation)
The gap between English and Yoruba spans nearly six orders of magnitude. No amount of clever filtering can close this gap: the data simply does not exist. This is the fundamental constraint that drives every design decision in multilingual data curation.
Language Tiers
For practical pipeline design, we partition languages into tiers based on available data volume after quality filtering:
Language Tier Classification
| Tier | Available Tokens (Post-Filter) | Languages | Strategy |
|---|---|---|---|
| High-resource | 100B+ | English, Chinese, German, French, Japanese, Russian | Standard curation pipeline |
| Mid-resource | 10B-100B | Korean, Arabic, Thai, Vietnamese, Hindi | Augment with parallel corpora |
| Low-resource | 1B-10B | Swahili, Malay, Tagalog, Amharic | Cross-lingual transfer + translation |
| Ultra-low-resource | Under 1B | Yoruba, Khmer, Lao, Igbo | Heavy translation + few-shot evaluation only |
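These tier boundaries can be encoded directly for use in pipeline routing. A minimal sketch (the function name is my own):

```python
def classify_language_tier(post_filter_tokens: int) -> str:
    """Map a post-filter token count to the tiers in the table above."""
    if post_filter_tokens >= 100_000_000_000:
        return "high-resource"
    if post_filter_tokens >= 10_000_000_000:
        return "mid-resource"
    if post_filter_tokens >= 1_000_000_000:
        return "low-resource"
    return "ultra-low-resource"
```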
Cross-Lingual Transfer
The Mechanism
Cross-lingual transfer is the empirical observation that training a transformer on language A improves performance on language B, even when language B was a small fraction of training data. This occurs because:
- Shared subword overlap: Languages sharing scripts or loanwords have overlapping tokenizations. "Computer" appears in English, German, French, and dozens of other languages.
- Structural alignment: Transformers learn abstract syntactic patterns (subject-verb-object ordering, modifier placement) that partially transfer across languages with similar structure.
- Anchor tokens: Shared entities (proper nouns, numbers, URLs, code) create alignment points that pull representations of different languages into shared subspaces.
The mathematical formulation: if h_l(x) is the hidden representation at layer l for input x, cross-lingual transfer means that for semantically equivalent inputs x_en (English) and x_sw (Swahili), the representations converge in deeper layers:

||h_L(x_en) - h_L(x_sw)|| << ||h_1(x_en) - h_1(x_sw)||

where L is the final layer. This convergence happens even when Swahili is 0.01% of training data, as long as the model has seen enough English to learn the underlying concepts.
Measuring Transfer
We can quantify transfer by training models with varying multilingual data ratios and evaluating on held-out multilingual benchmarks:
def measure_cross_lingual_transfer(
eval_scores_multilingual_only,
eval_scores_with_english,
languages,
):
"""
Compute transfer gain per language.
eval_scores_multilingual_only: dict mapping lang -> accuracy
when trained only on that language's data
eval_scores_with_english: dict mapping lang -> accuracy
when trained on that language + English data
"""
transfer_gains = {}
for lang in languages:
baseline = eval_scores_multilingual_only[lang]
with_transfer = eval_scores_with_english[lang]
# Transfer gain: relative improvement from adding English
if baseline > 0:
gain = (with_transfer - baseline) / baseline
else:
gain = float('inf') if with_transfer > 0 else 0.0
transfer_gains[lang] = {
"baseline_accuracy": baseline,
"with_english_accuracy": with_transfer,
"relative_gain": gain,
}
return transfer_gains
# Example: transfer gains for low-resource languages
results = measure_cross_lingual_transfer(
eval_scores_multilingual_only={
"sw": 0.31, # Swahili alone
"th": 0.45, # Thai alone
"yo": 0.18, # Yoruba alone
},
eval_scores_with_english={
"sw": 0.52, # Swahili + English
"th": 0.61, # Thai + English
"yo": 0.38, # Yoruba + English
},
languages=["sw", "th", "yo"],
)
# Swahili: 67.7% relative gain
# Thai: 35.6% relative gain
# Yoruba: 111.1% relative gain
The pattern is consistent: the lower the resource level of the target language, the greater the relative transfer gain from adding English data. This is because the English data teaches the model general reasoning capabilities that transfer through shared representations.
Transfer Limitations
Cross-lingual transfer has hard limits:
- Script divergence: Languages with unique scripts (Thai, Khmer, Georgian) share fewer subword tokens with English, reducing the anchor points for alignment.
- Morphological divergence: Agglutinative languages (Turkish, Finnish, Swahili) encode information in affixes that have no English equivalent, limiting syntactic transfer.
- Tokenizer penalty: If the tokenizer was trained primarily on English, it fragments non-Latin text into character-level or byte-level tokens, increasing sequence length and degrading performance.
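The fragmentation penalty is visible from UTF-8 alone: a tokenizer that falls back to byte-level tokens pays at least the script's bytes-per-character cost. A small illustration (the helper name is my own):

```python
def bytes_per_char(text: str) -> float:
    """UTF-8 bytes per non-space character: a lower bound on the token
    fertility of a byte-fallback tokenizer for this text."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / max(len(chars), 1)

# ASCII text encodes at 1 byte per character; Thai (U+0E00-U+0E7F) at 3.
# A byte-fallback tokenizer therefore emits at least 3x more tokens per
# character for Thai than for English.
```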
Translation as Data Augmentation
When Translation Helps
For low-resource languages with under 10B tokens of native data, machine translation of high-quality English sources is the single most effective augmentation strategy. The approach:
- Curate a high-quality English subset (textbooks, Wikipedia, curated web)
- Translate into the target language using a strong MT system
- Filter translated output for quality
- Mix translated data with native data
import json
import hashlib
from dataclasses import dataclass
@dataclass
class TranslatedDocument:
source_text: str
translated_text: str
source_lang: str
target_lang: str
translation_score: float
source_hash: str
def build_translation_augmentation_pipeline(
source_documents,
target_lang,
translate_fn,
quality_threshold=0.7,
):
"""
Translate high-quality source documents and filter results.
source_documents: list of dicts with 'text' and 'quality_score'
target_lang: ISO 639-1 code (e.g., 'sw' for Swahili)
translate_fn: callable(text, src, tgt) -> (translated_text, score)
quality_threshold: minimum translation quality score to keep
"""
results = []
seen_hashes = set()
# Sort by quality -- translate best sources first
sorted_docs = sorted(
source_documents,
key=lambda d: d["quality_score"],
reverse=True,
)
for doc in sorted_docs:
source_text = doc["text"]
# Skip very short documents
if len(source_text.split()) < 50:
continue
# Deduplicate sources
source_hash = hashlib.sha256(source_text.encode()).hexdigest()[:16]
if source_hash in seen_hashes:
continue
seen_hashes.add(source_hash)
# Translate
translated_text, score = translate_fn(
source_text, "en", target_lang
)
# Filter low-quality translations
if score < quality_threshold:
continue
        # Length ratio check: translated text should be within
        # 0.3x-3.0x of source length (by word count)
src_words = len(source_text.split())
tgt_words = len(translated_text.split())
ratio = tgt_words / max(src_words, 1)
if ratio < 0.3 or ratio > 3.0:
continue
results.append(TranslatedDocument(
source_text=source_text,
translated_text=translated_text,
source_lang="en",
target_lang=target_lang,
translation_score=score,
source_hash=source_hash,
))
return results
Translation Quality Signals
Not all translations are usable. Common failure modes and their detection:
Translation Failure Modes and Detection
| Failure Mode | Detection Method | Frequency |
|---|---|---|
| Hallucinated content (translator adds text not in source) | Length ratio exceeds 1.5x | 5-15% of outputs |
| Untranslated passages (source language leaks through) | Language ID on output chunks | 3-8% of outputs |
| Repeated phrases (degenerate decoding) | N-gram repetition ratio above 0.3 | 2-5% of outputs |
| Script mixing (Latin characters in non-Latin output) | Script consistency check | 1-3% of outputs |
| Semantic drift (meaning changed) | Round-trip translation BLEU below 0.2 | 8-20% of outputs |
from collections import Counter
def translation_quality_checks(source_text, translated_text, target_lang):
"""
Run quality checks on a translated document.
Returns (passed, reasons) tuple.
"""
reasons = []
# Check 1: Length ratio
src_len = len(source_text.split())
tgt_len = len(translated_text.split())
ratio = tgt_len / max(src_len, 1)
if ratio < 0.3 or ratio > 3.0:
reasons.append(f"length_ratio={ratio:.2f}")
# Check 2: Repetition detection
words = translated_text.split()
if len(words) >= 20:
trigrams = [
" ".join(words[i:i+3]) for i in range(len(words) - 2)
]
counts = Counter(trigrams)
most_common_count = counts.most_common(1)[0][1]
repetition_ratio = most_common_count / len(trigrams)
if repetition_ratio > 0.1:
reasons.append(f"repetition_ratio={repetition_ratio:.3f}")
# Check 3: Source language leakage
# Count words that appear verbatim in source (excluding numbers/proper nouns)
source_words = set(source_text.lower().split())
target_words = translated_text.lower().split()
# Filter out numbers and very short words
overlap = [
w for w in target_words
if w in source_words and len(w) > 3 and not w.isdigit()
]
leakage_ratio = len(overlap) / max(len(target_words), 1)
if leakage_ratio > 0.3:
reasons.append(f"source_leakage={leakage_ratio:.3f}")
# Check 4: Empty or near-empty output
if len(translated_text.strip()) < 10:
reasons.append("empty_output")
passed = len(reasons) == 0
return passed, reasons
Back-Translation for Quality Verification
The strongest quality signal for translations: translate the output back to the source language and measure similarity with the original. High-quality translations survive the round trip; hallucinated or drifted translations do not.
A round-trip BLEU score below 0.15 strongly indicates a bad translation. Scores above 0.35 indicate reliable translations. The gap between 0.15 and 0.35 requires manual inspection or additional signals.
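A minimal sketch of the round-trip score, using a simplified BLEU (geometric mean of n-gram precisions, no brevity penalty); a production pipeline would use a full implementation such as sacrebleu:

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hypothesis tokens against reference tokens."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())
    return overlap / max(sum(hyp_ngrams.values()), 1)

def round_trip_bleu(original, back_translated, max_n=2):
    """Simplified BLEU between the original text and its round-trip
    translation. 1.0 means a perfect round trip; 0.0 means no overlap."""
    ref = original.lower().split()
    hyp = back_translated.lower().split()
    precisions = [ngram_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))
```

The back-translation itself comes from running the MT system in reverse; the score is then compared against the 0.15/0.35 thresholds above.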
Language-Specific Quality Filtering
Per-Language Perplexity Models
A quality filter trained on English data does not transfer to other languages. The perplexity distributions differ because sentence structure, vocabulary richness, and document conventions vary by language. Each language needs its own quality model.
The approach: train a small language model (KenLM or a 100M-parameter transformer) on known-good text for each language, then use its perplexity as a quality signal.
import math
class LanguagePerplexityFilter:
"""
Per-language perplexity-based quality filter.
Uses a small language model per language to score documents.
"""
def __init__(self, language_models, thresholds):
"""
language_models: dict mapping lang_code -> model
Each model has a .score(text) method returning log probability
thresholds: dict mapping lang_code -> (min_ppl, max_ppl)
"""
self.models = language_models
self.thresholds = thresholds
def compute_perplexity(self, text, lang):
"""Compute perplexity of text under the language model."""
if lang not in self.models:
return None
model = self.models[lang]
log_prob = model.score(text)
num_tokens = len(text.split())
if num_tokens == 0:
return float('inf')
# Perplexity = exp(-log_prob / num_tokens)
ppl = math.exp(-log_prob / num_tokens)
return ppl
def filter_document(self, text, lang):
"""
Returns (keep, perplexity) tuple.
Documents with perplexity outside the language-specific
range are filtered out.
"""
ppl = self.compute_perplexity(text, lang)
if ppl is None:
# No model for this language -- pass through
return True, None
min_ppl, max_ppl = self.thresholds.get(
lang, (1.0, 10000.0)
)
# Too low perplexity = repetitive/templated text
# Too high perplexity = garbage/wrong language
keep = min_ppl <= ppl <= max_ppl
return keep, ppl
Language-Specific Thresholds
Perplexity thresholds vary dramatically across languages because of morphological complexity and script properties:
Perplexity Filter Thresholds by Language
| Language | Min PPL | Max PPL | Rationale |
|---|---|---|---|
| English | 10 | 1000 | Well-studied, narrow distribution |
| Chinese | 15 | 2000 | Character-level LM, higher baseline PPL |
| German | 20 | 1500 | Compound words inflate PPL |
| Turkish | 30 | 2500 | Agglutinative morphology, high OOV rate |
| Thai | 25 | 3000 | No spaces between words, segmentation noise |
| Swahili | 35 | 4000 | Limited training data for LM, wide PPL spread |
These thresholds must be calibrated empirically for each language model. The procedure: compute perplexity on a held-out set of known-good documents (e.g., Wikipedia articles), take the 5th and 95th percentiles, and use those as min/max thresholds. Do not copy thresholds between languages; they are not comparable.
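The calibration procedure can be sketched as follows (the function name is my own):

```python
import numpy as np

def calibrate_ppl_thresholds(known_good_ppls, low_pct=5, high_pct=95):
    """Derive (min_ppl, max_ppl) for one language from the perplexities
    of a held-out set of known-good documents."""
    ppls = np.asarray(known_good_ppls, dtype=float)
    return (
        float(np.percentile(ppls, low_pct)),
        float(np.percentile(ppls, high_pct)),
    )
```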
Language-Specific Heuristic Filters
Beyond perplexity, each language has unique noise patterns that require targeted filters:
import re
import unicodedata
LANGUAGE_FILTERS = {
"zh": {
# Chinese-specific: filter documents with too many Latin characters
# (likely code-switched or OCR errors)
"max_latin_ratio": 0.3,
# Minimum Chinese character ratio
"min_cjk_ratio": 0.5,
},
"th": {
# Thai-specific: requires Thai script characters
"min_thai_ratio": 0.6,
# Thai text without spaces -- word count heuristic unreliable
# Use character count instead
"min_char_count": 200,
},
"ar": {
# Arabic-specific: right-to-left script checks
"min_arabic_ratio": 0.5,
# Diacritics ratio can indicate quality (vocalized vs unvocalized)
"max_diacritic_ratio": 0.4,
},
}
def count_script_ratio(text, script_name):
"""Count fraction of characters belonging to a Unicode script."""
total = 0
in_script = 0
for char in text:
if char.isspace():
continue
total += 1
        # unicodedata.name(char, "") returns the default for unnamed
        # characters, so this lookup never raises ValueError
        if unicodedata.name(char, "").startswith(script_name.upper()):
            in_script += 1
return in_script / max(total, 1)
def apply_language_filters(text, lang):
"""Apply language-specific heuristic filters."""
if lang not in LANGUAGE_FILTERS:
return True
config = LANGUAGE_FILTERS[lang]
if lang == "zh":
cjk_count = sum(
1 for c in text if '\u4e00' <= c <= '\u9fff'
)
total_chars = sum(1 for c in text if not c.isspace())
cjk_ratio = cjk_count / max(total_chars, 1)
latin_count = sum(1 for c in text if c.isascii() and c.isalpha())
latin_ratio = latin_count / max(total_chars, 1)
if cjk_ratio < config["min_cjk_ratio"]:
return False
if latin_ratio > config["max_latin_ratio"]:
return False
elif lang == "th":
thai_count = sum(
1 for c in text if '\u0e00' <= c <= '\u0e7f'
)
total_chars = sum(1 for c in text if not c.isspace())
thai_ratio = thai_count / max(total_chars, 1)
if thai_ratio < config["min_thai_ratio"]:
return False
        if total_chars < config["min_char_count"]:
            return False
    elif lang == "ar":
        # Arabic letters: U+0600-U+06FF; harakat (diacritics): U+064B-U+065F
        arabic_count = sum(1 for c in text if '\u0600' <= c <= '\u06ff')
        total_chars = sum(1 for c in text if not c.isspace())
        if arabic_count / max(total_chars, 1) < config["min_arabic_ratio"]:
            return False
        diacritic_count = sum(1 for c in text if '\u064b' <= c <= '\u065f')
        if diacritic_count / max(arabic_count, 1) > config["max_diacritic_ratio"]:
            return False
    return True
Tokenizer Impact on Multilingual Performance
The Fertility Problem
Tokenizer "fertility" measures how many tokens a tokenizer produces per word (or per character) in a given language. A tokenizer trained predominantly on English text has low fertility for English (close to 1.0 token per word) but high fertility for non-Latin scripts.
High fertility is a direct tax on non-English languages:
- Context window waste: A Thai sentence that takes 50 English tokens might consume 150 tokens with a bad tokenizer. The model sees 3x less Thai content per context window.
- Training efficiency: Each gradient update processes fewer Thai words, requiring more steps to learn the same amount.
- Inference cost: Generation is token-by-token. A 3x fertility penalty means 3x the inference cost for the same content.
import re

def measure_tokenizer_fertility(tokenizer, text_samples_by_lang):
"""
Measure tokenizer fertility across languages.
text_samples_by_lang: dict mapping lang -> list of text strings
Returns dict mapping lang -> average fertility
"""
results = {}
for lang, samples in text_samples_by_lang.items():
total_tokens = 0
total_words = 0
for text in samples:
tokens = tokenizer.encode(text)
total_tokens += len(tokens)
# Word count approximation
# For CJK/Thai: count characters as "words"
if lang in ("zh", "ja", "th", "km"):
words = sum(
1 for c in text
if not c.isspace() and not c.isascii()
)
# Add ASCII word count for mixed text
ascii_text = re.sub(r'[^\x00-\x7f]', ' ', text)
words += len(ascii_text.split())
else:
words = len(text.split())
total_words += max(words, 1)
results[lang] = total_tokens / max(total_words, 1)
return results
Empirical Fertility Measurements
Measured on the Llama 2 tokenizer (32K BPE vocabulary, trained mostly on English):
Tokenizer Fertility by Language (Llama 2 Tokenizer)
| Language | Script | Fertility (tokens/word) | Relative Cost vs English |
|---|---|---|---|
| English | Latin | 1.3 | 1.0x |
| German | Latin | 1.8 | 1.4x |
| French | Latin | 1.5 | 1.2x |
| Russian | Cyrillic | 2.8 | 2.2x |
| Chinese | CJK | 2.1 | 1.6x |
| Hindi | Devanagari | 4.2 | 3.2x |
| Thai | Thai script | 5.8 | 4.5x |
| Khmer | Khmer script | 8.1 | 6.2x |
Thai text costs 4.5x more tokens per word than English. For Khmer, it is 6.2x. This is not a minor overhead: it fundamentally limits what the model can learn from limited data in these languages.
Mitigations
Multilingual tokenizer training: Train BPE on a balanced multilingual corpus instead of an English-heavy one. Llama 3 expanded vocabulary to 128K tokens and trained on multilingual data, reducing fertility for non-English scripts by 30-50%.
Script-aware preprocessing: For languages without spaces (Thai, Khmer, Lao, Chinese, Japanese), apply word segmentation before tokenizer training so that BPE can learn word-level tokens.
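A toy illustration of pre-segmentation using greedy longest-match against a word list; real pipelines would use a dedicated segmenter per script, and this helper is my own:

```python
def longest_match_segment(text, vocab):
    """Greedy longest-match word segmentation against a word list. A toy
    stand-in for a real segmenter for spaceless scripts (Thai, Khmer, Lao)."""
    max_len = max(map(len, vocab))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate substring first, shrinking to length 1
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary match: emit a single character and move on
            words.append(text[i])
            i += 1
    return words
```

The segmented output, joined with spaces, is what the BPE trainer sees, so merges form within word boundaries.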
def build_balanced_tokenizer_corpus(
data_sources,
target_tokens_per_lang=10_000_000,
temperature=0.3,
):
"""
    Compute per-language token budgets for a balanced tokenizer-training corpus.
Uses temperature sampling to balance between languages:
- temperature=1.0: proportional to data size (English dominates)
- temperature=0.0: equal tokens per language
- temperature=0.3: moderate upsampling of low-resource languages
data_sources: dict mapping lang -> list of text documents
"""
# Count total tokens per language (approximate)
lang_sizes = {}
for lang, docs in data_sources.items():
total_chars = sum(len(d) for d in docs)
lang_sizes[lang] = total_chars
# Temperature-scaled sampling probabilities
total = sum(v for v in lang_sizes.values())
raw_probs = {
lang: size / total for lang, size in lang_sizes.items()
}
# Apply temperature
scaled = {
lang: prob ** temperature for lang, prob in raw_probs.items()
}
scale_total = sum(v for v in scaled.values())
sample_probs = {
lang: v / scale_total for lang, v in scaled.items()
}
# Compute tokens to sample per language
total_target = target_tokens_per_lang * len(data_sources)
tokens_per_lang = {
lang: int(prob * total_target)
for lang, prob in sample_probs.items()
}
return tokens_per_lang
The Multilingual Data Mixer
Architecture
The mixer must balance multiple competing objectives:
- Maximize coverage across languages
- Prioritize high-quality data over translated data
- Apply language-specific quality filters
- Prevent catastrophic forgetting of low-resource languages during training
import random
import json
import math
from dataclasses import dataclass, field
@dataclass
class LanguageSource:
lang_code: str
native_data_path: str
translated_data_path: str # Empty string if no translations
native_token_count: int
translated_token_count: int
quality_weight: float # 0.0-1.0, based on quality assessment
@dataclass
class MixerConfig:
total_tokens: int # Target total tokens for training
temperature: float # Sampling temperature (0.3-0.7 typical)
max_epochs_native: int # Maximum passes over native data
max_epochs_translated: int # Usually 1 -- translated data degrades on repeat
native_preference: float # 0.0-1.0, weight toward native vs translated
class MultilingualDataMixer:
"""
Produces a training data stream that samples from multiple
languages with configurable balancing.
"""
def __init__(self, sources, config):
self.sources = {s.lang_code: s for s in sources}
self.config = config
self.sampling_weights = self._compute_weights()
def _compute_weights(self):
"""
Compute per-language sampling probabilities using
temperature-scaled balancing.
"""
# Effective token count per language
effective = {}
for lang, src in self.sources.items():
            # Native data can be repeated up to max_epochs_native times;
            # translated data is capped at max_epochs_translated passes
            native = src.native_token_count * self.config.max_epochs_native
            translated = (
                src.translated_token_count
                * self.config.max_epochs_translated
            )
effective[lang] = (native + translated) * src.quality_weight
# Temperature scaling
total = sum(v for v in effective.values())
if total == 0:
n = len(effective)
return {lang: 1.0 / n for lang in effective}
raw_probs = {
lang: count / total for lang, count in effective.items()
}
t = self.config.temperature
scaled = {
lang: prob ** t for lang, prob in raw_probs.items()
}
scale_total = sum(v for v in scaled.values())
return {
lang: v / scale_total for lang, v in scaled.items()
}
def compute_token_budget(self):
"""
Compute how many tokens to sample from each language.
Returns dict mapping lang -> {native_tokens, translated_tokens}
"""
budget = {}
total = self.config.total_tokens
pref = self.config.native_preference
for lang, weight in self.sampling_weights.items():
lang_tokens = int(weight * total)
src = self.sources[lang]
# Split between native and translated
max_native = (
src.native_token_count * self.config.max_epochs_native
)
max_translated = (
src.translated_token_count * self.config.max_epochs_translated
)
# Allocate with preference for native data
native_target = int(lang_tokens * pref)
native_actual = min(native_target, max_native)
remaining = lang_tokens - native_actual
translated_actual = min(remaining, max_translated)
budget[lang] = {
"total_tokens": native_actual + translated_actual,
"native_tokens": native_actual,
"translated_tokens": translated_actual,
"native_epochs": (
native_actual / max(src.native_token_count, 1)
),
"sampling_weight": weight,
}
return budget
def log_budget(self, budget):
"""Print a summary of the token budget."""
print(f"{'Language':<10} {'Weight':<8} {'Native':<12} "
f"{'Translated':<12} {'Epochs':<8} {'Total':<12}")
print("-" * 62)
for lang, info in sorted(
budget.items(),
key=lambda x: x[1]["total_tokens"],
reverse=True,
):
print(
f"{lang:<10} {info['sampling_weight']:<8.4f} "
f"{info['native_tokens']:<12,} "
f"{info['translated_tokens']:<12,} "
f"{info['native_epochs']:<8.2f} "
f"{info['total_tokens']:<12,}"
)
# Usage example
sources = [
LanguageSource("en", "/data/en", "",
4_000_000_000_000, 0, 1.0),
LanguageSource("zh", "/data/zh", "",
800_000_000_000, 0, 0.95),
LanguageSource("de", "/data/de", "",
400_000_000_000, 0, 0.90),
LanguageSource("th", "/data/th", "/data/th_translated",
30_000_000_000, 50_000_000_000, 0.80),
LanguageSource("sw", "/data/sw", "/data/sw_translated",
2_000_000_000, 80_000_000_000, 0.70),
LanguageSource("yo", "/data/yo", "/data/yo_translated",
300_000_000, 40_000_000_000, 0.60),
]
config = MixerConfig(
total_tokens=15_000_000_000_000,
temperature=0.3,
max_epochs_native=4,
max_epochs_translated=1,
native_preference=0.8,
)
mixer = MultilingualDataMixer(sources, config)
budget = mixer.compute_token_budget()
mixer.log_budget(budget)
Temperature Selection
The temperature parameter controls the tradeoff between matching the natural data distribution (high temperature: sampling proportional to data size, so English dominates) and language coverage (low temperature: a more equal distribution across languages):
English Share of Training Data vs Temperature
A sampling temperature around 0.3 gives English roughly 30-35% of training tokens despite English making up over 50% of the available data after filtering. This substantially upsamples low-resource languages.
Aggressive upsampling of low-resource languages means repeating that data multiple times per training run. Empirical results show quality degrades after 4 epochs on the same data: the model begins memorizing specific documents rather than learning generalizable patterns. For ultra-low-resource languages, this limits effective data to roughly 4x the native token count.
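The temperature math can be restated standalone; the token counts below are hypothetical:

```python
def temperature_scale(sizes, temperature):
    """Turn raw per-language token counts into sampling probabilities.
    temperature=1.0 is proportional sampling; temperature=0.0 is uniform."""
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** temperature for lang, n in sizes.items()}
    z = sum(scaled.values())
    return {lang: v / z for lang, v in scaled.items()}

# Hypothetical post-filter token counts, in billions
sizes = {"en": 4800, "th": 30, "yo": 0.3}
proportional = temperature_scale(sizes, 1.0)  # English dominates
balanced = temperature_scale(sizes, 0.3)      # low-resource upsampled
```

Lowering the temperature shifts probability mass from English toward Thai and Yoruba, which is exactly where the epoch-cap concern above bites.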
Parallel Corpora and Alignment
Using Bitext for Transfer
Parallel corpora, the same text in multiple languages, provide the strongest cross-lingual signal. Sources include:
- OPUS: Aggregated parallel data from EU Parliament, UN documents, subtitles (covers 100+ language pairs)
- CCAligned: Parallel web pages extracted from Common Crawl
- WikiMatrix: Parallel sentences extracted from Wikipedia across languages
- NLLB data: Training data released with Meta's No Language Left Behind model
def create_parallel_training_examples(
bitext_pairs,
format_style="interleaved",
):
"""
Convert parallel corpus into training examples.
bitext_pairs: list of (source_text, target_text, src_lang, tgt_lang)
format_style:
'interleaved' - alternating paragraphs in both languages
'translation_task' - explicit translation prompts
'concatenated' - source then target with separator
"""
examples = []
for src_text, tgt_text, src_lang, tgt_lang in bitext_pairs:
if format_style == "interleaved":
# Split into sentences and alternate
src_sents = src_text.split(". ")
tgt_sents = tgt_text.split(". ")
pairs = zip(src_sents, tgt_sents)
text = "\n".join(
f"{s.strip()}. {t.strip()}."
for s, t in pairs
if s.strip() and t.strip()
)
elif format_style == "translation_task":
text = (
f"[{src_lang}] {src_text}\n"
f"[{tgt_lang}] {tgt_text}"
)
elif format_style == "concatenated":
text = f"{src_text}\n---\n{tgt_text}"
else:
raise ValueError(f"Unknown format: {format_style}")
examples.append({
"text": text,
"languages": [src_lang, tgt_lang],
"type": "parallel",
})
return examples
Alignment Verification
Parallel data is noisy. Misaligned pairs (where the source and target are not translations of each other) inject confusion into training. Verification methods:
- Cosine similarity of sentence embeddings: Encode both sides with a multilingual encoder (LaBSE, SONAR). Pairs with cosine similarity below 0.6 are likely misaligned.
- Length ratio: Sentence-level length ratios outside the expected range for a language pair indicate alignment errors.
- Overlap of named entities: If the source mentions βTokyoβ and the target does not, the alignment is suspect.
These signals are typically combined into a weighted alignment score, score = w_sim * cos(e_src, e_tgt) + w_len * s_len + w_ent * s_ent, where the weights sum to 1 and embedding similarity carries the largest weight. Pairs scoring below a calibrated threshold are dropped.
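A sketch of the embedding-similarity check, with embed_fn standing in for a real multilingual encoder such as LaBSE or SONAR:

```python
import numpy as np

def filter_bitext(pairs, embed_fn, min_cosine=0.6):
    """Keep only sentence pairs whose multilingual embeddings are close.
    embed_fn maps a sentence to a vector; 0.6 is the threshold from the
    list above."""
    kept = []
    for src, tgt in pairs:
        a = np.asarray(embed_fn(src), dtype=float)
        b = np.asarray(embed_fn(tgt), dtype=float)
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        if cos >= min_cosine:
            kept.append((src, tgt, cos))
    return kept
```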
End-to-End Pipeline
Putting It All Together
The complete multilingual data curation pipeline:
class MultilingualCurationPipeline:
"""
End-to-end pipeline for multilingual data curation.
Steps:
1. Language identification
2. Per-language heuristic filtering
3. Per-language perplexity filtering
4. Deduplication (per-language and cross-language)
5. Translation augmentation for low-resource
6. Quality scoring and mixing
"""
def __init__(
self,
lang_detector,
perplexity_filter,
deduplicator,
translator,
mixer_config,
):
self.lang_detector = lang_detector
self.perplexity_filter = perplexity_filter
self.deduplicator = deduplicator
self.translator = translator
self.mixer_config = mixer_config
# Statistics tracking
self.stats = {
"total_input": 0,
"lang_id_filtered": 0,
"heuristic_filtered": 0,
"perplexity_filtered": 0,
"dedup_filtered": 0,
"passed": 0,
"by_language": {},
}
def process_document(self, text):
"""Process a single document through the pipeline."""
self.stats["total_input"] += 1
# Step 1: Language identification
lang, confidence = self.lang_detector.detect(text)
if confidence < 0.8:
self.stats["lang_id_filtered"] += 1
return None
# Step 2: Language-specific heuristic filters
if not apply_language_filters(text, lang):
self.stats["heuristic_filtered"] += 1
return None
# Step 3: Perplexity filter
keep, ppl = self.perplexity_filter.filter_document(text, lang)
if not keep:
self.stats["perplexity_filtered"] += 1
return None
# Step 4: Deduplication check
if self.deduplicator.is_duplicate(text, lang):
self.stats["dedup_filtered"] += 1
return None
self.stats["passed"] += 1
if lang not in self.stats["by_language"]:
self.stats["by_language"][lang] = 0
self.stats["by_language"][lang] += 1
return {
"text": text,
"language": lang,
"lang_confidence": confidence,
"perplexity": ppl,
"source": "native",
}
def augment_low_resource(self, english_docs, target_langs):
"""
Translate high-quality English documents into
low-resource target languages.
"""
augmented = {}
for lang in target_langs:
translated = build_translation_augmentation_pipeline(
source_documents=english_docs,
target_lang=lang,
translate_fn=self.translator.translate,
quality_threshold=0.7,
)
augmented[lang] = translated
return augmented
def get_statistics(self):
"""Return pipeline statistics."""
total = self.stats["total_input"]
return {
"total_input": total,
"pass_rate": self.stats["passed"] / max(total, 1),
"filter_breakdown": {
"lang_id": self.stats["lang_id_filtered"] / max(total, 1),
"heuristic": self.stats["heuristic_filtered"] / max(total, 1),
"perplexity": self.stats["perplexity_filtered"] / max(total, 1),
"dedup": self.stats["dedup_filtered"] / max(total, 1),
},
"language_distribution": self.stats["by_language"],
}
Evaluation
Multilingual Benchmarks
The final test: does the curated multilingual data produce a model that performs well across languages?
Model Performance by Language Tier (Hypothetical 7B Model)
| Evaluation | English Only | Naive Multilingual | Curated Multilingual |
|---|---|---|---|
| English MMLU | 64.2 | 61.8 | 63.5 |
| Chinese C-Eval | 38.1 | 52.4 | 56.8 |
| Thai TyDi QA | 12.3 | 28.7 | 41.2 |
| Swahili QA | 8.1 | 15.2 | 29.4 |
| Average (all langs) | 30.7 | 39.5 | 47.7 |
Key findings:
- Curated multilingual training costs under one point of English performance (63.5 vs 64.2 on MMLU) but gains 17 points on average across all languages.
- The biggest gains are in low-resource languages (Swahili: 8.1 to 29.4) due to translation augmentation and cross-lingual transfer.
- Naive multilingual (proportional sampling without quality filtering) underperforms curated multilingual because it includes low-quality translations and noisy data.
Multilingual data curation is not about maximizing the volume of non-English data. It is about maximizing the quality of cross-lingual signal: clean native data, high-quality translations, parallel corpora for alignment, and a tokenizer that does not impose a 5x tax on non-Latin scripts. The mixer temperature and per-language quality filters are the two most impactful design decisions.