Part of the series The Dataset Frontier (6 of 27)

English has 4.8 trillion tokens of web text. Yoruba has 8 million: a 600,000x gap for a language with 45 million speakers. The power law is brutal: the top 10 languages hold 85% of all digital text, leaving 7,000+ languages to share the remaining 15%. Training a truly multilingual LLM means confronting this imbalance: either you oversample low-resource languages (risking memorization from limited data) or you accept that your model will be English-dominant with token-level understanding of everything else.

This post addresses the full pipeline for building multilingual training data: quantifying the data distribution, exploiting cross-lingual transfer, using translation as augmentation, building language-specific quality filters, analyzing tokenizer efficiency across scripts, and implementing a multilingual data mixer that balances coverage against quality.

The Data Distribution Problem

Quantifying the Imbalance

Web crawl data follows a power law across languages. Common Crawl statistics from 2024 show the distribution:

Token Count by Language in Common Crawl (approximate)

Language | Tokens
English | ~4.8T
Chinese | ~900B
German | ~450B
French | ~400B
Japanese | ~350B
Russian | ~300B
Spanish | ~280B
Thai | ~35B
Swahili | ~2B
Yoruba | ~300M

The ratio between English and Yoruba is roughly 16,000:1. No amount of clever filtering can close this gap; the data simply does not exist. This is the fundamental constraint that drives every design decision in multilingual data curation.

Language Tiers

For practical pipeline design, we partition languages into tiers based on available data volume after quality filtering:

📊 Language Tier Classification

Tier | Available Tokens (Post-Filter) | Languages | Strategy
High-resource | 100B+ | English, Chinese, German, French, Japanese, Russian | Standard curation pipeline
Mid-resource | 10B-100B | Korean, Arabic, Thai, Vietnamese, Hindi | Augment with parallel corpora
Low-resource | 1B-10B | Swahili, Malay, Tagalog, Amharic | Cross-lingual transfer + translation
Ultra-low-resource | Under 1B | Yoruba, Khmer, Lao, Igbo | Heavy translation + few-shot evaluation only
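The tier boundaries can be encoded as a small helper. A minimal sketch; the function name and tier labels are illustrative, not from any specific library:

```python
def classify_language_tier(post_filter_tokens: int) -> str:
    """Map a post-filter token count to one of the tiers above."""
    if post_filter_tokens >= 100_000_000_000:   # 100B+
        return "high-resource"
    if post_filter_tokens >= 10_000_000_000:    # 10B-100B
        return "mid-resource"
    if post_filter_tokens >= 1_000_000_000:     # 1B-10B
        return "low-resource"
    return "ultra-low-resource"                 # under 1B
```

For example, `classify_language_tier(35_000_000_000)` places Thai in the mid-resource tier.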

Cross-Lingual Transfer

The Mechanism

Cross-lingual transfer is the empirical observation that training a transformer on language A improves performance on language B, even when language B was a small fraction of training data. This occurs because:

  1. Shared subword overlap: Languages sharing scripts or loanwords have overlapping tokenizations. "Computer" appears in English, German, French, and dozens of other languages.
  2. Structural alignment: Transformers learn abstract syntactic patterns (subject-verb-object ordering, modifier placement) that partially transfer across languages with similar structure.
  3. Anchor tokens: Shared entities (proper nouns, numbers, URLs, code) create alignment points that pull representations of different languages into shared subspaces.

The mathematical formulation: if $h_\ell(x)$ is the hidden representation at layer $\ell$ for input $x$, cross-lingual transfer means that for semantically equivalent inputs $x_{en}$ (English) and $x_{sw}$ (Swahili), the representations converge in deeper layers:

$$\|h_L(x_{en}) - h_L(x_{sw})\| \ll \|h_1(x_{en}) - h_1(x_{sw})\|$$

where $L$ is the final layer. This convergence happens even when Swahili is 0.01% of training data, as long as the model has seen enough English to learn the underlying concepts.
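Given per-layer pooled hidden states for a parallel sentence pair, the convergence can be checked directly. A minimal numpy sketch; the array shapes and function name are assumptions for illustration:

```python
import numpy as np

def layerwise_distance(hidden_a, hidden_b):
    """L2 distance between paired representations at each layer.

    hidden_a, hidden_b: arrays of shape (num_layers, hidden_dim),
    e.g. mean-pooled hidden states for two parallel sentences.
    Cross-lingual convergence shows up as the distance shrinking
    with depth: the last entry well below the first.
    """
    return np.linalg.norm(hidden_a - hidden_b, axis=-1)
```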

Measuring Transfer

We can quantify transfer by training models with varying multilingual data ratios and evaluating on held-out multilingual benchmarks:

import numpy as np

def measure_cross_lingual_transfer(
    eval_scores_multilingual_only,
    eval_scores_with_english,
    languages,
):
    """
    Compute transfer gain per language.

    eval_scores_multilingual_only: dict mapping lang -> accuracy
        when trained only on that language's data
    eval_scores_with_english: dict mapping lang -> accuracy
        when trained on that language + English data
    """
    transfer_gains = {}
    for lang in languages:
        baseline = eval_scores_multilingual_only[lang]
        with_transfer = eval_scores_with_english[lang]
        # Transfer gain: relative improvement from adding English
        if baseline > 0:
            gain = (with_transfer - baseline) / baseline
        else:
            gain = float('inf') if with_transfer > 0 else 0.0
        transfer_gains[lang] = {
            "baseline_accuracy": baseline,
            "with_english_accuracy": with_transfer,
            "relative_gain": gain,
        }
    return transfer_gains

# Example: transfer gains for low-resource languages
results = measure_cross_lingual_transfer(
    eval_scores_multilingual_only={
        "sw": 0.31,  # Swahili alone
        "th": 0.45,  # Thai alone
        "yo": 0.18,  # Yoruba alone
    },
    eval_scores_with_english={
        "sw": 0.52,  # Swahili + English
        "th": 0.61,  # Thai + English
        "yo": 0.38,  # Yoruba + English
    },
    languages=["sw", "th", "yo"],
)
# Swahili: 67.7% relative gain
# Thai: 35.6% relative gain
# Yoruba: 111.1% relative gain

The pattern is consistent: the lower the resource level of the target language, the greater the relative transfer gain from adding English data. This is because the English data teaches the model general reasoning capabilities that transfer through shared representations.

Transfer Limitations

Cross-lingual transfer has hard limits:

  1. Script divergence: Languages with unique scripts (Thai, Khmer, Georgian) share fewer subword tokens with English, reducing the anchor points for alignment.
  2. Morphological divergence: Agglutinative languages (Turkish, Finnish, Swahili) encode information in affixes that have no English equivalent, limiting syntactic transfer.
  3. Tokenizer penalty: If the tokenizer was trained primarily on English, it fragments non-Latin text into character-level or byte-level tokens, increasing sequence length and degrading performance.

Translation as Data Augmentation

When Translation Helps

For low-resource languages with under 10B tokens of native data, machine translation of high-quality English sources is the single most effective augmentation strategy. The approach:

  1. Curate a high-quality English subset (textbooks, Wikipedia, curated web)
  2. Translate into the target language using a strong MT system
  3. Filter translated output for quality
  4. Mix translated data with native data

import hashlib
from dataclasses import dataclass

@dataclass
class TranslatedDocument:
    source_text: str
    translated_text: str
    source_lang: str
    target_lang: str
    translation_score: float
    source_hash: str

def build_translation_augmentation_pipeline(
    source_documents,
    target_lang,
    translate_fn,
    quality_threshold=0.7,
):
    """
    Translate high-quality source documents and filter results.

    source_documents: list of dicts with 'text' and 'quality_score'
    target_lang: ISO 639-1 code (e.g., 'sw' for Swahili)
    translate_fn: callable(text, src, tgt) -> (translated_text, score)
    quality_threshold: minimum translation quality score to keep
    """
    results = []
    seen_hashes = set()

    # Sort by quality -- translate best sources first
    sorted_docs = sorted(
        source_documents,
        key=lambda d: d["quality_score"],
        reverse=True,
    )

    for doc in sorted_docs:
        source_text = doc["text"]

        # Skip very short documents
        if len(source_text.split()) < 50:
            continue

        # Deduplicate sources
        source_hash = hashlib.sha256(source_text.encode()).hexdigest()[:16]
        if source_hash in seen_hashes:
            continue
        seen_hashes.add(source_hash)

        # Translate
        translated_text, score = translate_fn(
            source_text, "en", target_lang
        )

        # Filter low-quality translations
        if score < quality_threshold:
            continue

        # Length ratio check: translated text should be within
        # 0.3x-3.0x of source length (by word count)
        src_words = len(source_text.split())
        tgt_words = len(translated_text.split())
        ratio = tgt_words / max(src_words, 1)
        if ratio < 0.3 or ratio > 3.0:
            continue

        results.append(TranslatedDocument(
            source_text=source_text,
            translated_text=translated_text,
            source_lang="en",
            target_lang=target_lang,
            translation_score=score,
            source_hash=source_hash,
        ))

    return results

Translation Quality Signals

Not all translations are usable. Common failure modes and their detection:

📊 Translation Failure Modes and Detection

Failure Mode | Detection Method | Frequency
Hallucinated content (translator adds text not in source) | Length ratio exceeds 1.5x | 5-15% of outputs
Untranslated passages (source language leaks through) | Language ID on output chunks | 3-8% of outputs
Repeated phrases (degenerate decoding) | N-gram repetition ratio above 0.3 | 2-5% of outputs
Script mixing (Latin characters in non-Latin output) | Script consistency check | 1-3% of outputs
Semantic drift (meaning changed) | Round-trip translation BLEU below 0.2 | 8-20% of outputs

import re
from collections import Counter

def translation_quality_checks(source_text, translated_text, target_lang):
    """
    Run quality checks on a translated document.
    Returns (passed, reasons) tuple.
    """
    reasons = []

    # Check 1: Length ratio
    src_len = len(source_text.split())
    tgt_len = len(translated_text.split())
    ratio = tgt_len / max(src_len, 1)
    if ratio < 0.3 or ratio > 3.0:
        reasons.append(f"length_ratio={ratio:.2f}")

    # Check 2: Repetition detection
    words = translated_text.split()
    if len(words) >= 20:
        trigrams = [
            " ".join(words[i:i+3]) for i in range(len(words) - 2)
        ]
        counts = Counter(trigrams)
        most_common_count = counts.most_common(1)[0][1]
        repetition_ratio = most_common_count / len(trigrams)
        if repetition_ratio > 0.1:
            reasons.append(f"repetition_ratio={repetition_ratio:.3f}")

    # Check 3: Source language leakage
    # Count words that appear verbatim in source (excluding numbers/proper nouns)
    source_words = set(source_text.lower().split())
    target_words = translated_text.lower().split()
    # Filter out numbers and very short words
    overlap = [
        w for w in target_words
        if w in source_words and len(w) > 3 and not w.isdigit()
    ]
    leakage_ratio = len(overlap) / max(len(target_words), 1)
    if leakage_ratio > 0.3:
        reasons.append(f"source_leakage={leakage_ratio:.3f}")

    # Check 4: Empty or near-empty output
    if len(translated_text.strip()) < 10:
        reasons.append("empty_output")

    passed = len(reasons) == 0
    return passed, reasons

Back-Translation for Quality Verification

The strongest quality signal for translations: translate the output back to the source language and measure similarity with the original. High-quality translations survive the round trip; hallucinated or drifted translations do not.

$$\text{quality} = \text{BLEU}(\text{source}, \text{BackTranslate}(\text{Translate}(\text{source})))$$

A round-trip BLEU score below 0.15 strongly indicates a bad translation. Scores above 0.35 indicate reliable translations. The gap between 0.15 and 0.35 requires manual inspection or additional signals.
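These thresholds reduce to a three-way decision. A sketch; the function name is illustrative:

```python
def roundtrip_verdict(roundtrip_bleu, reject_below=0.15, accept_above=0.35):
    """Bucket a round-trip BLEU score into reject/accept/review.

    Scores in the (reject_below, accept_above) gray zone need
    manual inspection or additional quality signals.
    """
    if roundtrip_bleu < reject_below:
        return "reject"
    if roundtrip_bleu > accept_above:
        return "accept"
    return "review"
```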

Language-Specific Quality Filtering

Per-Language Perplexity Models

A quality filter trained on English data does not transfer to other languages. The perplexity distributions differ because sentence structure, vocabulary richness, and document conventions vary by language. Each language needs its own quality model.

The approach: train a small language model (KenLM or a 100M-parameter transformer) on known-good text for each language, then use its perplexity as a quality signal.

import math

class LanguagePerplexityFilter:
    """
    Per-language perplexity-based quality filter.
    Uses a small language model per language to score documents.
    """

    def __init__(self, language_models, thresholds):
        """
        language_models: dict mapping lang_code -> model
            Each model has a .score(text) method returning log probability
        thresholds: dict mapping lang_code -> (min_ppl, max_ppl)
        """
        self.models = language_models
        self.thresholds = thresholds

    def compute_perplexity(self, text, lang):
        """Compute perplexity of text under the language model."""
        if lang not in self.models:
            return None

        model = self.models[lang]
        log_prob = model.score(text)
        num_tokens = len(text.split())

        if num_tokens == 0:
            return float('inf')

        # Perplexity = exp(-log_prob / num_tokens)
        ppl = math.exp(-log_prob / num_tokens)
        return ppl

    def filter_document(self, text, lang):
        """
        Returns (keep, perplexity) tuple.
        Documents with perplexity outside the language-specific
        range are filtered out.
        """
        ppl = self.compute_perplexity(text, lang)

        if ppl is None:
            # No model for this language -- pass through
            return True, None

        min_ppl, max_ppl = self.thresholds.get(
            lang, (1.0, 10000.0)
        )

        # Too low perplexity = repetitive/templated text
        # Too high perplexity = garbage/wrong language
        keep = min_ppl <= ppl <= max_ppl
        return keep, ppl

Language-Specific Thresholds

Perplexity thresholds vary dramatically across languages because of morphological complexity and script properties:

📊 Perplexity Filter Thresholds by Language

Language | Min PPL | Max PPL | Rationale
English | 10 | 1000 | Well-studied, narrow distribution
Chinese | 15 | 2000 | Character-level LM, higher baseline PPL
German | 20 | 1500 | Compound words inflate PPL
Turkish | 30 | 2500 | Agglutinative morphology, high OOV rate
Thai | 25 | 3000 | No spaces between words, segmentation noise
Swahili | 35 | 4000 | Limited training data for LM, wide PPL spread

⚠️ Threshold Calibration

These thresholds must be calibrated empirically for each language model. The procedure: compute perplexity on a held-out set of known-good documents (e.g., Wikipedia articles), take the 5th and 95th percentiles, and use those as min/max thresholds. Do not copy thresholds between languages; they are not comparable.
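The calibration procedure can be sketched directly, assuming `ppl_scores` holds perplexities of known-good documents under the language's own LM (nearest-rank percentiles, so no numpy dependency; the function name is illustrative):

```python
import math

def calibrate_ppl_thresholds(ppl_scores, low_pct=5, high_pct=95):
    """Derive (min_ppl, max_ppl) from a held-out set of trusted docs."""
    scores = sorted(ppl_scores)

    def nearest_rank(p):
        # Nearest-rank percentile: adequate for threshold calibration
        idx = max(0, min(len(scores) - 1,
                         math.ceil(p * len(scores) / 100) - 1))
        return scores[idx]

    return nearest_rank(low_pct), nearest_rank(high_pct)
```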

Language-Specific Heuristic Filters

Beyond perplexity, each language has unique noise patterns that require targeted filters:

import re
import unicodedata

LANGUAGE_FILTERS = {
    "zh": {
        # Chinese-specific: filter documents with too many Latin characters
        # (likely code-switched or OCR errors)
        "max_latin_ratio": 0.3,
        # Minimum Chinese character ratio
        "min_cjk_ratio": 0.5,
    },
    "th": {
        # Thai-specific: requires Thai script characters
        "min_thai_ratio": 0.6,
        # Thai text without spaces -- word count heuristic unreliable
        # Use character count instead
        "min_char_count": 200,
    },
    "ar": {
        # Arabic-specific: right-to-left script checks
        "min_arabic_ratio": 0.5,
        # Diacritics ratio can indicate quality (vocalized vs unvocalized)
        "max_diacritic_ratio": 0.4,
    },
}

def count_script_ratio(text, script_name):
    """Fraction of non-space characters whose Unicode name starts
    with the given script prefix (e.g. 'THAI', 'ARABIC').

    Note: this works for scripts whose character names begin with the
    script name (Thai, Arabic, Cyrillic) but not for CJK ideographs,
    whose names start with 'CJK'; use code-point ranges for those.
    """
    total = 0
    in_script = 0
    for char in text:
        if char.isspace():
            continue
        total += 1
        # unicodedata.name returns the default ("") for unnamed
        # characters, so no exception handling is needed
        if unicodedata.name(char, "").startswith(script_name.upper()):
            in_script += 1
    return in_script / max(total, 1)

def apply_language_filters(text, lang):
    """Apply language-specific heuristic filters."""
    if lang not in LANGUAGE_FILTERS:
        return True

    config = LANGUAGE_FILTERS[lang]

    if lang == "zh":
        cjk_count = sum(
            1 for c in text if '\u4e00' <= c <= '\u9fff'
        )
        total_chars = sum(1 for c in text if not c.isspace())
        cjk_ratio = cjk_count / max(total_chars, 1)

        latin_count = sum(1 for c in text if c.isascii() and c.isalpha())
        latin_ratio = latin_count / max(total_chars, 1)

        if cjk_ratio < config["min_cjk_ratio"]:
            return False
        if latin_ratio > config["max_latin_ratio"]:
            return False

    elif lang == "th":
        thai_count = sum(
            1 for c in text if '\u0e00' <= c <= '\u0e7f'
        )
        total_chars = sum(1 for c in text if not c.isspace())
        thai_ratio = thai_count / max(total_chars, 1)

        if thai_ratio < config["min_thai_ratio"]:
            return False
        if total_chars < config["min_char_count"]:
            return False

    elif lang == "ar":
        # Arabic block U+0600-U+06FF covers the core Arabic script
        arabic_count = sum(
            1 for c in text if '\u0600' <= c <= '\u06ff'
        )
        total_chars = sum(1 for c in text if not c.isspace())
        arabic_ratio = arabic_count / max(total_chars, 1)

        if arabic_ratio < config["min_arabic_ratio"]:
            return False

        # Diacritic (tashkeel) ratio: mostly-vocalized text is
        # unusual for modern web prose (see config note above)
        diacritic_count = sum(
            1 for c in text if '\u064b' <= c <= '\u0652'
        )
        diacritic_ratio = diacritic_count / max(arabic_count, 1)
        if diacritic_ratio > config["max_diacritic_ratio"]:
            return False

    return True

Tokenizer Impact on Multilingual Performance

The Fertility Problem

Tokenizer "fertility" measures how many tokens a tokenizer produces per word (or per character) in a given language. A tokenizer trained predominantly on English text has low fertility for English (close to 1.0 token per word) but high fertility for non-Latin scripts.

$$\text{fertility}(\text{lang}) = \frac{\text{num\_tokens}(\text{text})}{\text{num\_words}(\text{text})}$$

High fertility is a direct tax on non-English languages:

  1. Context window waste: A Thai sentence that takes 50 English tokens might consume 150 tokens with a bad tokenizer. The model sees 3x less Thai content per context window.
  2. Training efficiency: Each gradient update processes fewer Thai words, requiring more steps to learn the same amount.
  3. Inference cost: Generation is token-by-token. A 3x fertility penalty means 3x the inference cost for the same content.

import re

def measure_tokenizer_fertility(tokenizer, text_samples_by_lang):
    """
    Measure tokenizer fertility across languages.

    text_samples_by_lang: dict mapping lang -> list of text strings
    Returns dict mapping lang -> average fertility
    """
    results = {}

    for lang, samples in text_samples_by_lang.items():
        total_tokens = 0
        total_words = 0

        for text in samples:
            tokens = tokenizer.encode(text)
            total_tokens += len(tokens)

            # Word count approximation
            # For CJK/Thai: count characters as "words"
            if lang in ("zh", "ja", "th", "km"):
                words = sum(
                    1 for c in text
                    if not c.isspace() and not c.isascii()
                )
                # Add ASCII word count for mixed text
                ascii_text = re.sub(r'[^\x00-\x7f]', ' ', text)
                words += len(ascii_text.split())
            else:
                words = len(text.split())

            total_words += max(words, 1)

        results[lang] = total_tokens / max(total_words, 1)

    return results

Empirical Fertility Measurements

Measured on the Llama 2 tokenizer (32K BPE vocabulary, trained mostly on English):

📊 Tokenizer Fertility by Language (Llama 2 Tokenizer)

Language | Script | Fertility (tokens/word) | Relative Cost vs English
English | Latin | 1.3 | 1.0x
German | Latin | 1.8 | 1.4x
French | Latin | 1.5 | 1.2x
Russian | Cyrillic | 2.8 | 2.2x
Chinese | CJK | 2.1 | 1.6x
Hindi | Devanagari | 4.2 | 3.2x
Thai | Thai script | 5.8 | 4.5x
Khmer | Khmer script | 8.1 | 6.2x

Thai text costs 4.5x more tokens per word than English. For Khmer, it is 6.2x. This is not a minor overhead; it fundamentally limits what the model can learn from limited data in these languages.

Mitigations

Multilingual tokenizer training: Train BPE on a balanced multilingual corpus instead of an English-heavy one. Llama 3 expanded vocabulary to 128K tokens and trained on multilingual data, reducing fertility for non-English scripts by 30-50%.

Script-aware preprocessing: For languages without spaces (Thai, Khmer, Lao, Chinese, Japanese), apply word segmentation before tokenizer training so that BPE can learn word-level tokens.

def build_balanced_tokenizer_corpus(
    data_sources,
    target_tokens_per_lang=10_000_000,
    temperature=0.3,
):
    """
    Build a balanced corpus for tokenizer training.

    Uses temperature sampling to balance between languages:
    - temperature=1.0: proportional to data size (English dominates)
    - temperature=0.0: equal tokens per language
    - temperature=0.3: moderate upsampling of low-resource languages

    data_sources: dict mapping lang -> list of text documents
    """
    # Count total tokens per language (approximate)
    lang_sizes = {}
    for lang, docs in data_sources.items():
        total_chars = sum(len(d) for d in docs)
        lang_sizes[lang] = total_chars

    # Temperature-scaled sampling probabilities
    total = sum(v for v in lang_sizes.values())
    raw_probs = {
        lang: size / total for lang, size in lang_sizes.items()
    }

    # Apply temperature
    scaled = {
        lang: prob ** temperature for lang, prob in raw_probs.items()
    }
    scale_total = sum(v for v in scaled.values())
    sample_probs = {
        lang: v / scale_total for lang, v in scaled.items()
    }

    # Compute tokens to sample per language
    total_target = target_tokens_per_lang * len(data_sources)
    tokens_per_lang = {
        lang: int(prob * total_target)
        for lang, prob in sample_probs.items()
    }

    return tokens_per_lang

The Multilingual Data Mixer

Architecture

The mixer must balance multiple competing objectives:

  • Maximize coverage across languages
  • Prioritize high-quality data over translated data
  • Apply language-specific quality filters
  • Prevent catastrophic forgetting of low-resource languages during training

from dataclasses import dataclass

@dataclass
class LanguageSource:
    lang_code: str
    native_data_path: str
    translated_data_path: str  # Empty string if no translations
    native_token_count: int
    translated_token_count: int
    quality_weight: float  # 0.0-1.0, based on quality assessment

@dataclass
class MixerConfig:
    total_tokens: int  # Target total tokens for training
    temperature: float  # Sampling temperature (0.3-0.7 typical)
    max_epochs_native: int  # Maximum passes over native data
    max_epochs_translated: int  # Usually 1 -- translated data degrades on repeat
    native_preference: float  # 0.0-1.0, weight toward native vs translated

class MultilingualDataMixer:
    """
    Produces a training data stream that samples from multiple
    languages with configurable balancing.
    """

    def __init__(self, sources, config):
        self.sources = {s.lang_code: s for s in sources}
        self.config = config
        self.sampling_weights = self._compute_weights()

    def _compute_weights(self):
        """
        Compute per-language sampling probabilities using
        temperature-scaled balancing.
        """
        # Effective token count per language
        effective = {}
        for lang, src in self.sources.items():
            native = min(
                src.native_token_count * self.config.max_epochs_native,
                src.native_token_count * 4,  # Cap at 4 epochs
            )
            translated = min(
                src.translated_token_count * self.config.max_epochs_translated,
                src.translated_token_count,  # Usually 1 epoch
            )
            effective[lang] = (native + translated) * src.quality_weight

        # Temperature scaling
        total = sum(v for v in effective.values())
        if total == 0:
            n = len(effective)
            return {lang: 1.0 / n for lang in effective}

        raw_probs = {
            lang: count / total for lang, count in effective.items()
        }

        t = self.config.temperature
        scaled = {
            lang: prob ** t for lang, prob in raw_probs.items()
        }
        scale_total = sum(v for v in scaled.values())

        return {
            lang: v / scale_total for lang, v in scaled.items()
        }

    def compute_token_budget(self):
        """
        Compute how many tokens to sample from each language.
        Returns dict mapping lang -> {native_tokens, translated_tokens}
        """
        budget = {}
        total = self.config.total_tokens
        pref = self.config.native_preference

        for lang, weight in self.sampling_weights.items():
            lang_tokens = int(weight * total)
            src = self.sources[lang]

            # Split between native and translated
            max_native = (
                src.native_token_count * self.config.max_epochs_native
            )
            max_translated = (
                src.translated_token_count * self.config.max_epochs_translated
            )

            # Allocate with preference for native data
            native_target = int(lang_tokens * pref)
            native_actual = min(native_target, max_native)

            remaining = lang_tokens - native_actual
            translated_actual = min(remaining, max_translated)

            budget[lang] = {
                "total_tokens": native_actual + translated_actual,
                "native_tokens": native_actual,
                "translated_tokens": translated_actual,
                "native_epochs": (
                    native_actual / max(src.native_token_count, 1)
                ),
                "sampling_weight": weight,
            }

        return budget

    def log_budget(self, budget):
        """Print a summary of the token budget."""
        print(f"{'Language':<10} {'Weight':<8} {'Native':<12} "
              f"{'Translated':<12} {'Epochs':<8} {'Total':<12}")
        print("-" * 62)

        for lang, info in sorted(
            budget.items(),
            key=lambda x: x[1]["total_tokens"],
            reverse=True,
        ):
            print(
                f"{lang:<10} {info['sampling_weight']:<8.4f} "
                f"{info['native_tokens']:<12,} "
                f"{info['translated_tokens']:<12,} "
                f"{info['native_epochs']:<8.2f} "
                f"{info['total_tokens']:<12,}"
            )

# Usage example
sources = [
    LanguageSource("en", "/data/en", "",
                   4_000_000_000_000, 0, 1.0),
    LanguageSource("zh", "/data/zh", "",
                   800_000_000_000, 0, 0.95),
    LanguageSource("de", "/data/de", "",
                   400_000_000_000, 0, 0.90),
    LanguageSource("th", "/data/th", "/data/th_translated",
                   30_000_000_000, 50_000_000_000, 0.80),
    LanguageSource("sw", "/data/sw", "/data/sw_translated",
                   2_000_000_000, 80_000_000_000, 0.70),
    LanguageSource("yo", "/data/yo", "/data/yo_translated",
                   300_000_000, 40_000_000_000, 0.60),
]

config = MixerConfig(
    total_tokens=15_000_000_000_000,
    temperature=0.3,
    max_epochs_native=4,
    max_epochs_translated=1,
    native_preference=0.8,
)

mixer = MultilingualDataMixer(sources, config)
budget = mixer.compute_token_budget()
mixer.log_budget(budget)

Temperature Selection

The temperature parameter controls the tradeoff between data quality (high temperature, proportional to data size, English dominates) and language coverage (low temperature, more equal distribution):

English Share of Training Data vs Temperature

Temperature | English Share
T=1.0 | 76% (proportional to data size)
T=0.7 | 58%
T=0.5 | 45%
T=0.3 | 32% (Llama 3 range)
T=0.1 | 20%

Llama 3 uses an effective temperature around 0.3, which gives English roughly 30-35% of training tokens despite being over 50% of available data. This substantially upsamples low-resource languages.
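These shares follow directly from temperature-scaled sampling. A sketch that reproduces the computation for any size distribution (the function name and the token counts in the test are illustrative):

```python
def english_share(lang_tokens, temperature):
    """Fraction of the sampled mix that is English.

    lang_tokens: dict mapping lang code -> available token count.
    Sampling weight is (size / total) ** temperature, matching the
    mixer above; temperature=1.0 is proportional sampling, and
    lowering the temperature flattens the distribution.
    """
    total = sum(lang_tokens.values())
    scaled = {lang: (n / total) ** temperature
              for lang, n in lang_tokens.items()}
    return scaled["en"] / sum(scaled.values())
```

With two languages split 80/20, the English share is 0.8 at T=1.0 and moves toward 0.5 as the temperature drops.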

⚑ The Upsampling Tax

Aggressive upsampling of low-resource languages means repeating that data multiple times per training run. Empirical results show quality degrades after 4 epochs on the same data: the model begins memorizing specific documents rather than learning generalizable patterns. For ultra-low-resource languages, this limits effective data to roughly $4 \times \text{native\_tokens} + 1 \times \text{translated\_tokens}$.
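The ceiling implied by this rule of thumb is easy to compute per language. A sketch; the function name is illustrative:

```python
def effective_token_ceiling(native_tokens, translated_tokens,
                            max_native_epochs=4):
    """Upper bound on useful training tokens for one language.

    Native data can repeat up to max_native_epochs times before
    memorization sets in; translated data is seen once.
    """
    return max_native_epochs * native_tokens + translated_tokens
```

For Yoruba with 300M native and 40B translated tokens, the ceiling is 41.2B tokens.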

Parallel Corpora and Alignment

Using Bitext for Transfer

Parallel corpora (the same text in multiple languages) provide the strongest cross-lingual signal. Sources include:

  • OPUS: Aggregated parallel data from EU Parliament, UN documents, subtitles (covers 100+ language pairs)
  • CCAligned: Parallel web pages extracted from Common Crawl
  • WikiMatrix: Parallel sentences extracted from Wikipedia across languages
  • NLLB data: Training data released with Meta’s No Language Left Behind model
def create_parallel_training_examples(
    bitext_pairs,
    format_style="interleaved",
):
    """
    Convert parallel corpus into training examples.

    bitext_pairs: list of (source_text, target_text, src_lang, tgt_lang)
    format_style:
        'interleaved' - alternating paragraphs in both languages
        'translation_task' - explicit translation prompts
        'concatenated' - source then target with separator
    """
    examples = []

    for src_text, tgt_text, src_lang, tgt_lang in bitext_pairs:
        if format_style == "interleaved":
            # Naive split on ". " and pairwise alternation; assumes
            # sentence-level alignment (zip silently drops extras)
            src_sents = src_text.split(". ")
            tgt_sents = tgt_text.split(". ")
            pairs = zip(src_sents, tgt_sents)
            text = "\n".join(
                f"{s.strip()}. {t.strip()}."
                for s, t in pairs
                if s.strip() and t.strip()
            )

        elif format_style == "translation_task":
            text = (
                f"[{src_lang}] {src_text}\n"
                f"[{tgt_lang}] {tgt_text}"
            )

        elif format_style == "concatenated":
            text = f"{src_text}\n---\n{tgt_text}"

        else:
            raise ValueError(f"Unknown format: {format_style}")

        examples.append({
            "text": text,
            "languages": [src_lang, tgt_lang],
            "type": "parallel",
        })

    return examples

Alignment Verification

Parallel data is noisy. Misaligned pairs (where the source and target are not translations of each other) inject confusion into training. Verification methods:

  1. Cosine similarity of sentence embeddings: Encode both sides with a multilingual encoder (LaBSE, SONAR). Pairs with cosine similarity below 0.6 are likely misaligned.
  2. Length ratio: Sentence-level length ratios outside the expected range for a language pair indicate alignment errors.
  3. Overlap of named entities: If the source mentions "Tokyo" and the target does not, the alignment is suspect.

alignment_score = α · cos_sim(e_s, e_t) + β · length_ratio_score + γ · entity_overlap

where α + β + γ = 1 and typical values are α = 0.6, β = 0.2, γ = 0.2.
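A weighted scorer along these lines might look like the sketch below. It assumes sentence embeddings are precomputed (e.g. with LaBSE or SONAR); the linear length-ratio decay and Jaccard entity overlap are illustrative choices for the second and third signals, not canonical definitions:

```python
import numpy as np

def alignment_score(src_emb, tgt_emb, src_len, tgt_len,
                    src_entities, tgt_entities,
                    alpha=0.6, beta=0.2, gamma=0.2,
                    expected_ratio=1.0, tolerance=1.0):
    """alpha*cos_sim + beta*length_ratio_score + gamma*entity_overlap."""
    # Signal 1: cosine similarity of the two sentence embeddings
    cos = float(np.dot(src_emb, tgt_emb) /
                (np.linalg.norm(src_emb) * np.linalg.norm(tgt_emb) + 1e-12))
    # Signal 2: 1.0 when the length ratio matches the pair's expected
    # ratio, decaying linearly to 0 as it drifts beyond the tolerance
    ratio = src_len / max(tgt_len, 1)
    length_score = max(0.0, 1.0 - abs(ratio - expected_ratio) / tolerance)
    # Signal 3: Jaccard overlap of named entities on both sides
    s, t = set(src_entities), set(tgt_entities)
    entity = len(s & t) / len(s | t) if (s | t) else 1.0
    return alpha * cos + beta * length_score + gamma * entity
```

A perfectly aligned pair (identical embeddings, matching lengths and entities) scores 1.0; pairs whose embedding cosine falls below the 0.6 threshold mentioned above would be dropped regardless of the combined score.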

End-to-End Pipeline

Putting It All Together

The complete multilingual data curation pipeline:

class MultilingualCurationPipeline:
    """
    End-to-end pipeline for multilingual data curation.

    Steps:
    1. Language identification
    2. Per-language heuristic filtering
    3. Per-language perplexity filtering
    4. Deduplication (per-language and cross-language)
    5. Translation augmentation for low-resource
    6. Quality scoring and mixing
    """

    def __init__(
        self,
        lang_detector,
        perplexity_filter,
        deduplicator,
        translator,
        mixer_config,
    ):
        self.lang_detector = lang_detector
        self.perplexity_filter = perplexity_filter
        self.deduplicator = deduplicator
        self.translator = translator
        self.mixer_config = mixer_config

        # Statistics tracking
        self.stats = {
            "total_input": 0,
            "lang_id_filtered": 0,
            "heuristic_filtered": 0,
            "perplexity_filtered": 0,
            "dedup_filtered": 0,
            "passed": 0,
            "by_language": {},
        }

    def process_document(self, text):
        """Process a single document through the pipeline."""
        self.stats["total_input"] += 1

        # Step 1: Language identification
        lang, confidence = self.lang_detector.detect(text)
        if confidence < 0.8:
            self.stats["lang_id_filtered"] += 1
            return None

        # Step 2: Language-specific heuristic filters
        if not apply_language_filters(text, lang):
            self.stats["heuristic_filtered"] += 1
            return None

        # Step 3: Perplexity filter
        keep, ppl = self.perplexity_filter.filter_document(text, lang)
        if not keep:
            self.stats["perplexity_filtered"] += 1
            return None

        # Step 4: Deduplication check
        if self.deduplicator.is_duplicate(text, lang):
            self.stats["dedup_filtered"] += 1
            return None

        self.stats["passed"] += 1
        if lang not in self.stats["by_language"]:
            self.stats["by_language"][lang] = 0
        self.stats["by_language"][lang] += 1

        return {
            "text": text,
            "language": lang,
            "lang_confidence": confidence,
            "perplexity": ppl,
            "source": "native",
        }

    def augment_low_resource(self, english_docs, target_langs):
        """
        Translate high-quality English documents into
        low-resource target languages.
        """
        augmented = {}
        for lang in target_langs:
            translated = build_translation_augmentation_pipeline(
                source_documents=english_docs,
                target_lang=lang,
                translate_fn=self.translator.translate,
                quality_threshold=0.7,
            )
            augmented[lang] = translated
        return augmented

    def get_statistics(self):
        """Return pipeline statistics."""
        total = self.stats["total_input"]
        return {
            "total_input": total,
            "pass_rate": self.stats["passed"] / max(total, 1),
            "filter_breakdown": {
                "lang_id": self.stats["lang_id_filtered"] / max(total, 1),
                "heuristic": self.stats["heuristic_filtered"] / max(total, 1),
                "perplexity": self.stats["perplexity_filtered"] / max(total, 1),
                "dedup": self.stats["dedup_filtered"] / max(total, 1),
            },
            "language_distribution": self.stats["by_language"],
        }

Evaluation

Multilingual Benchmarks

The final test: does the curated multilingual data produce a model that performs well across languages?

📊 Model Performance by Language Tier (Hypothetical 7B Model)

  Evaluation            English Only   Naive Multilingual   Curated Multilingual
  English MMLU              64.2             61.8                  63.5
  Chinese C-Eval            38.1             52.4                  56.8
  Thai TyDi QA              12.3             28.7                  41.2
  Swahili QA                 8.1             15.2                  29.4
  Average (all langs)       30.7             39.5                  47.7

Key findings:

  • Curated multilingual training costs approximately 1-2 points of English performance (63.5 vs 64.2) but gains 17 points on average across all languages.
  • The biggest gains are in low-resource languages (Swahili: 8.1 to 29.4) due to translation augmentation and cross-lingual transfer.
  • Naive multilingual (proportional sampling without quality filtering) underperforms curated multilingual because it includes low-quality translations and noisy data.

💡 The Key Insight

Multilingual data curation is not about maximizing the volume of non-English data. It is about maximizing the quality of cross-lingual signal: clean native data, high-quality translations, parallel corpora for alignment, and a tokenizer that does not impose a 5x tax on non-Latin scripts. The mixer temperature and per-language quality filters are the two most impactful design decisions.