Part of series: The Dataset Frontier (12 of 27)

Chinchilla says a 70B model needs 1.4T training tokens for compute-optimal training. Llama 3 trained on 15T tokens — 10.7x over the Chinchilla recommendation — and achieved better quality than any compute-optimal baseline. The discrepancy is economics: Chinchilla optimizes training cost, but inference cost dominates production deployment. Over-training a 70B model by 10x costs 10x more in training but makes every inference request cheaper because the smaller model serves faster than the undertrained 200B alternative. When you serve billions of requests, over-training is the rational choice.

The Chinchilla Scaling Law

The Formulation

The loss of a transformer language model trained with compute budget C follows:

L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E

where N is parameters, D is training tokens, A, B, E are fitted constants, and \alpha \approx 0.34, \beta \approx 0.28 are the scaling exponents. The irreducible loss E represents the entropy of natural language: even a perfect model cannot predict text with zero loss.

The compute budget constraint (using the approximation C \approx 6ND for dense transformers) gives the Chinchilla-optimal allocation:

N^* \propto C^{0.50}, \quad D^* \propto C^{0.50}

This means parameters and data should scale equally with compute. Doubling your compute budget should go to roughly equal increases in model size and training tokens.

The Chinchilla Table

📊

Chinchilla-Optimal Data Requirements

| Model Size | Optimal Tokens (20x) | Compute (FLOPs) | Training Cost Estimate |
|---|---|---|---|
| 1B | 20B | 1.2e20 | $3K |
| 7B | 140B | 5.9e21 | $100K |
| 13B | 260B | 2.0e22 | $400K |
| 70B | 1.4T | 5.9e23 | $10M |
| 175B | 3.5T | 3.7e24 | $60M |
| 540B | 10.8T | 3.5e25 | $500M |
def chinchilla_optimal(compute_budget_flops):
    """
    Compute Chinchilla-optimal model size and token count.

    compute_budget_flops: total compute in FLOPs
    Returns (optimal_params, optimal_tokens)
    """
    # Chinchilla constants (from the paper)
    # C = 6 * N * D
    # Optimal ratio: D = 20 * N
    # Therefore: C = 6 * N * 20N = 120 * N^2
    # N = sqrt(C / 120)

    import math
    optimal_params = math.sqrt(compute_budget_flops / 120)
    optimal_tokens = 20 * optimal_params

    return int(optimal_params), int(optimal_tokens)

# Example: $10M compute budget at ~$2/A100-hour, 312 TFLOPS
# Total FLOPs: ~5.9e23
params, tokens = chinchilla_optimal(5.9e23)
# params ~ 70B, tokens ~ 1.4T

Why Over-Training Works

The Economic Argument

Chinchilla optimizes for compute-optimal training: given a fixed compute budget, what model achieves the lowest loss? But in production, the relevant cost is not training cost alone — it is training cost PLUS inference cost amortized over the model’s lifetime.

\text{Total Cost} = C_{\text{train}} + C_{\text{inference}} \times Q

where Q is the total number of queries the model will serve. For a model deployed at scale, Q is in the billions, and inference cost dominates.

A smaller model that was over-trained (more tokens than Chinchilla-optimal for its size) has:

  • Higher training cost per quality unit (wasteful from Chinchilla’s perspective)
  • Lower inference cost per query (smaller model = fewer FLOPs per token)
  • Better total economics when Q is large
def compute_total_cost(
    model_params,
    training_tokens,
    queries_served,
    avg_tokens_per_query,
    cost_per_training_flop,
    cost_per_inference_flop,
):
    """
    Compute total lifecycle cost: training + inference.

    model_params: number of parameters
    training_tokens: number of tokens trained on
    queries_served: total queries over model lifetime
    avg_tokens_per_query: average tokens per inference query
    cost_per_training_flop: $/FLOP for training
    cost_per_inference_flop: $/FLOP for inference
    """
    # Training cost
    training_flops = 6 * model_params * training_tokens
    training_cost = training_flops * cost_per_training_flop

    # Inference cost (per query)
    # Approximate: 2 * N FLOPs per token generated
    inference_flops_per_query = (
        2 * model_params * avg_tokens_per_query
    )
    total_inference_cost = (
        inference_flops_per_query
        * queries_served
        * cost_per_inference_flop
    )

    return {
        "training_cost": training_cost,
        "inference_cost": total_inference_cost,
        "total_cost": training_cost + total_inference_cost,
        "training_fraction": training_cost / (
            training_cost + total_inference_cost
        ),
    }

# Comparison: Chinchilla-optimal 70B vs over-trained 8B
chinchilla_70b = compute_total_cost(
    model_params=70e9,
    training_tokens=1.4e12,       # Chinchilla-optimal
    queries_served=100e9,         # 100B queries
    avg_tokens_per_query=500,
    cost_per_training_flop=1e-15,
    cost_per_inference_flop=2e-15,
)

overtrained_8b = compute_total_cost(
    model_params=8e9,
    training_tokens=15e12,        # 15T tokens (over-trained)
    queries_served=100e9,
    avg_tokens_per_query=500,
    cost_per_training_flop=1e-15,
    cost_per_inference_flop=2e-15,
)

# The over-trained 8B model costs more to train but
# dramatically less to serve. At 100B queries, it wins.

Empirical Evidence

Model Quality (MMLU) at Different Training Token Counts

| Model (training tokens) | MMLU | Chinchilla multiple |
|---|---|---|
| Llama 2 7B (2T) | 46% | ~14x |
| Llama 3 8B (15T) | 66% | ~94x |
| Mistral 7B (8T) | 60% | ~57x |
| Gemma 2 9B (8T) | 64% | ~44x |
| Qwen 2.5 7B (18T) | 68% | ~129x |

All these models train far beyond Chinchilla-optimal. The pattern is clear: more tokens monotonically improve quality at the 7-9B scale, at least up to 15-18T tokens. The returns diminish but do not plateau.

The Modified Scaling Law

When accounting for inference cost, the optimal token count shifts:

D^*_{\text{deployment}} = D^*_{\text{Chinchilla}} \times \left(\frac{C_{\text{inference}} \times Q}{C_{\text{train}}}\right)^{\gamma}

where \gamma \approx 0.3-0.5 is an empirical exponent. For a model serving billions of queries, D^*_{\text{deployment}} can be 10-200x the Chinchilla optimum.
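A minimal sketch of this adjustment, with an assumed mid-range gamma and made-up cost figures (none of these numbers come from a fitted model):

```python
def deployment_optimal_tokens(
    chinchilla_tokens,
    train_cost_usd,
    inference_cost_per_query_usd,
    queries,
    gamma=0.4,  # assumed mid-range value for the empirical exponent
):
    """Scale the Chinchilla-optimal token count by the ratio of
    lifetime inference cost to training cost, raised to gamma."""
    cost_ratio = (inference_cost_per_query_usd * queries) / train_cost_usd
    return chinchilla_tokens * cost_ratio ** gamma

# Illustrative: 8B model (Chinchilla-optimal ~160B tokens), a $1M
# training run, 10B lifetime queries at $0.001 each.
# cost_ratio = 10, so the multiplier is 10**0.4, roughly 2.5x;
# pushing Q into the hundreds of billions moves the multiplier
# toward the 10-200x range quoted above.
tokens = deployment_optimal_tokens(160e9, 1e6, 1e-3, 10e9)
```

The key driver is the cost ratio: gamma only tempers how aggressively the optimum shifts as lifetime queries grow.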

The Over-Training Heuristic

For production deployment: if the model will serve more than 1000 queries per training dollar, over-training is economically justified. Llama 3's 15T tokens on 8B parameters costs roughly $10M in training compute, but saves over $100M in inference compute over the model's deployment lifetime compared to a Chinchilla-optimal 70B that achieves similar quality.

The Data Wall

What It Is

The data wall is the point at which you run out of high-quality training data. For English web text, the estimated ceiling after quality filtering is approximately 5-10T tokens. For all languages combined, perhaps 15-20T tokens.

📊

Estimated High-Quality Token Supply by Source

| Source | Raw Tokens | Post-Filter Tokens | % of 15T Budget |
|---|---|---|---|
| English web (Common Crawl) | ~50T | ~4T | 27% |
| Non-English web | ~30T | ~3T | 20% |
| Code (GitHub, etc.) | ~5T | ~2T | 13% |
| Books | ~1T | ~0.5T | 3% |
| Academic papers | ~0.5T | ~0.3T | 2% |
| Synthetic data | Unlimited | ~5T | 33% |
| Total | ~87T | ~15T | 100% |

At 15T tokens, a 70B model has consumed virtually all available high-quality natural text. Scaling further requires either: (1) relaxing quality filters to include lower-quality text, (2) repeating data for multiple epochs, or (3) generating synthetic data.

Why the Wall Is Not Just About Volume

The data wall is not only about token count — it is about information diversity. Web text is highly redundant. The millionth article about “how to bake a chocolate cake” adds near-zero marginal information even though it adds tokens. The effective information content of web data follows a log curve:

I_{\text{effective}}(D) \approx K \cdot \log(1 + D/D_0)

where D_0 is a reference scale and K depends on data quality. After a few trillion tokens, each additional trillion adds decreasing marginal information.

import math

def effective_information(tokens, d0=1e11, k=1.0):
    """
    Estimate effective information content of a dataset.

    tokens: number of tokens in dataset
    d0: reference scale (~100B for web text)
    k: quality coefficient (1.0 for filtered web, 1.5 for code)

    Returns information in arbitrary units (useful for comparison).
    """
    return k * math.log(1 + tokens / d0)

# Diminishing returns illustration
for t in [100e9, 500e9, 1e12, 5e12, 15e12]:
    info = effective_information(t)
    marginal = effective_information(t) - effective_information(t * 0.9)
    print(f"{t/1e12:.1f}T tokens: info={info:.2f}, "
          f"marginal(last 10%)={marginal:.3f}")

# Output shows marginal information per token decreasing rapidly

Epoch Counting

How Many Passes Before Quality Degrades

When you run out of unique data, you must re-use existing data for multiple epochs. The question: how many epochs before the model starts memorizing specific documents instead of learning generalizable patterns?

Empirical findings from Muennighoff et al. (2023) and others:

Quality vs Number of Epochs (7B Model, 100B Unique Tokens)

| Epochs | % of 1-epoch quality |
|---|---|
| 1 | 100% (baseline loss) |
| 2 | 97% (3% quality loss) |
| 4 | 93% (7% quality loss) |
| 8 | 85% (15% quality loss) |
| 16 | 72% (28% quality loss) |

The relationship between the number of epochs e and quality degradation follows approximately:

\Delta L(e) \approx c \cdot \log(e)

where c is a constant that depends on data diversity. High-diversity data (web text from many domains) tolerates more epochs than low-diversity data (e.g., a single textbook).

import math

class EpochBudgetCalculator:
    """
    Calculate how many epochs are justified for each data source
    based on its diversity and value.
    """

    def __init__(self, quality_degradation_rate=0.05):
        """
        quality_degradation_rate: quality loss per epoch doubling
        (0.05 = 5% quality loss when going from 1 to 2 epochs)
        """
        self.degradation_rate = quality_degradation_rate

    def quality_factor(self, epochs):
        """
        Compute quality factor for a given number of epochs.
        1.0 = perfect (1 epoch), decreasing with more epochs.
        """
        if epochs <= 1:
            return 1.0
        return max(0.0, 1.0 - self.degradation_rate * math.log2(epochs))

    def effective_tokens(self, unique_tokens, epochs):
        """
        Compute effective tokens accounting for epoch degradation.

        Not all repeated tokens contribute equally -- later epochs
        contribute less due to diminishing returns.
        """
        effective = 0.0
        for e in range(1, epochs + 1):
            # Each epoch contributes quality_factor(e) worth of tokens
            epoch_contribution = unique_tokens * self.quality_factor(e)
            effective += epoch_contribution
        return effective

    def optimal_epochs(self, unique_tokens, target_total_tokens):
        """
        Find optimal number of epochs given unique data and target.

        Returns (epochs, effective_tokens, quality_factor)
        """
        if unique_tokens >= target_total_tokens:
            return 1, unique_tokens, 1.0

        best_epochs = 1
        best_effective = unique_tokens

        for e in range(2, 50):
            effective = self.effective_tokens(unique_tokens, e)
            raw_total = unique_tokens * e

            if raw_total >= target_total_tokens:
                return e, effective, self.quality_factor(e)

            if effective > best_effective:
                best_effective = effective
                best_epochs = e
            else:
                # Passed the point of diminishing returns
                break

        return best_epochs, best_effective, self.quality_factor(best_epochs)

# Example: 500B unique tokens, want 2T total
calc = EpochBudgetCalculator(quality_degradation_rate=0.05)
epochs, effective, quality = calc.optimal_epochs(500e9, 2e12)
# epochs = 4, effective = ~1.9T, quality = 0.90

Source-Specific Epoch Limits

Different data sources tolerate different epoch counts:

📊

Recommended Maximum Epochs by Data Source

| Data Source | Max Epochs | Rationale |
|---|---|---|
| Web text (filtered) | 4 | High diversity, but repetitive topics |
| Code | 4-8 | Structural diversity is high, semantics repeat less |
| Books | 2 | Long documents, strong memorization risk |
| Math (natural) | 2-4 | Limited diversity in problem formats |
| Math (synthetic) | 1 | Already generated specifically, no repeat value |
| Instruction data (SFT) | 2-3 | Small dataset, memorization risk high |
| Translated text | 1 | Translation artifacts amplify on repeat |
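Applied mechanically, these limits cap each source's contribution at unique tokens times its epoch ceiling. A minimal sketch (the dict simply transcribes the limits above, taking the upper bound where a range is given):

```python
# Max recommended epochs per source, upper bound of each range
MAX_EPOCHS = {
    "web": 4,
    "code": 8,
    "books": 2,
    "math_natural": 4,
    "math_synthetic": 1,
    "sft": 3,
    "translated": 1,
}

def usable_tokens(category, unique_tokens):
    """Upper bound on training tokens a source can supply before
    exceeding its recommended epoch limit. Unknown categories
    conservatively default to a single epoch."""
    return unique_tokens * MAX_EPOCHS.get(category, 1)

# Example: 500B unique book tokens can supply at most 1T training tokens
cap = usable_tokens("books", 500e9)
```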

Solutions to the Data Wall

Solution 1: Synthetic Data Generation

When natural data is exhausted, generate more. The key insight: a trained model can generate training data for the next generation of models if the synthetic data covers distributions that the natural data does not.

class SyntheticDataGenerator:
    """
    Generate synthetic training data to extend beyond the data wall.
    """

    def __init__(self, teacher_model, quality_filter):
        """
        teacher_model: the model used to generate synthetic data
        quality_filter: callable(text) -> (keep, score)
        """
        self.teacher = teacher_model
        self.filter = quality_filter

    def generate_math_data(self, num_problems, difficulty_range):
        """
        Generate synthetic math problems with step-by-step solutions.
        Math is the highest-value synthetic data because correctness
        is verifiable.
        """
        results = []
        attempts = 0
        max_attempts = num_problems * 5

        while len(results) < num_problems and attempts < max_attempts:
            attempts += 1

            difficulty = (
                difficulty_range[0]
                + (difficulty_range[1] - difficulty_range[0])
                * (attempts / max_attempts)
            )

            prompt = (
                f"Generate a math problem at difficulty level "
                f"{difficulty:.1f}/10. Include the problem statement "
                f"and a detailed step-by-step solution. The solution "
                f"must be verifiable."
            )

            response = self.teacher.generate(prompt, max_tokens=1000)

            # Verify the solution (for math, we can check correctness)
            keep, score = self.filter(response)
            if keep:
                results.append({
                    "text": response,
                    "type": "synthetic_math",
                    "difficulty": difficulty,
                    "quality_score": score,
                })

        return results

    def generate_code_data(self, num_examples, languages):
        """
        Generate synthetic code with documentation and tests.
        Code quality is partially verifiable (syntax, tests pass).
        """
        results = []

        for lang in languages:
            target = num_examples // len(languages)

            for _ in range(target):
                prompt = (
                    f"Write a complete, well-documented {lang} function "
                    f"that solves a practical programming problem. "
                    f"Include docstring, type annotations, and "
                    f"unit tests."
                )

                response = self.teacher.generate(
                    prompt, max_tokens=1500
                )

                keep, score = self.filter(response)
                if keep:
                    results.append({
                        "text": response,
                        "type": "synthetic_code",
                        "language": lang,
                        "quality_score": score,
                    })

        return results

    def generate_reasoning_chains(self, seed_questions, num_per_seed):
        """
        Generate extended reasoning chains from seed questions.
        """
        results = []

        for question in seed_questions:
            for _ in range(num_per_seed):
                prompt = (
                    f"Question: {question}\n\n"
                    f"Think through this step by step. Show your "
                    f"complete reasoning process, including any "
                    f"mistakes you consider and reject."
                )

                response = self.teacher.generate(
                    prompt, max_tokens=2000
                )

                keep, score = self.filter(response)
                if keep:
                    results.append({
                        "text": f"Question: {question}\n\n{response}",
                        "type": "synthetic_reasoning",
                        "seed_question": question,
                        "quality_score": score,
                    })

        return results

Solution 2: Multilingual Expansion

English data is the most scarce because it is the most consumed. But Chinese, German, French, and dozens of other languages have billions of underutilized tokens. Cross-lingual transfer means that non-English data improves English performance (at a reduced rate).
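One way to fold this into a token budget is to discount non-English tokens by a transfer coefficient. The coefficients below are illustrative assumptions, not measured values:

```python
# Assumed fraction of each token's value that transfers to English
# performance -- illustrative numbers only
TRANSFER = {"en": 1.0, "zh": 0.4, "de": 0.5, "fr": 0.5}

def effective_english_tokens(token_counts, default=0.3):
    """token_counts: dict mapping language -> token count.
    Returns the English-equivalent token count under the assumed
    transfer coefficients."""
    return sum(
        n * TRANSFER.get(lang, default)
        for lang, n in token_counts.items()
    )

mix = {"en": 8e12, "zh": 2e12, "de": 1e12, "fr": 1e12}
# 8T + 0.8T + 0.5T + 0.5T = 9.8T English-equivalent tokens
eff = effective_english_tokens(mix)
```

Even at a steep discount, 4T non-English tokens contribute nearly 2T English-equivalent tokens here, which matters when English supply is capped.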

Solution 3: Code as a Universal Data Source

Code improves reasoning across all domains. The working hypothesis: code teaches the model structured, logical thinking that transfers to natural language reasoning. Training on more code improves math, logic, and instruction-following benchmarks even when the evaluation is in natural language.

MMLU Score vs Code Fraction in Training Data (7B Model)

| Code Fraction | MMLU |
|---|---|
| 0% | 55% |
| 5% | 58% |
| 15% | 62% |
| 25% | 65% |
| 40% | 64% (slight decline) |
| 60% | 60% (too much code) |

The optimal code fraction is around 20-30% of training data. Below that, the model misses the reasoning transfer. Above that, the model becomes code-specialized at the expense of general knowledge.

The Data Budget Calculator

Complete Implementation

import math
from dataclasses import dataclass, field

@dataclass
class DataSource:
    name: str
    available_tokens: int  # Unique tokens available
    quality_score: float  # 0.0-1.0
    max_epochs: int  # Maximum recommended epochs
    cost_per_token: float  # Cost to acquire/generate, in USD
    category: str  # "web", "code", "math", "books", "synthetic"

@dataclass
class TrainingBudget:
    total_training_tokens: int
    compute_budget_flops: int
    monetary_budget_usd: float
    model_params: int

class DataBudgetCalculator:
    """
    Calculate optimal data allocation across sources given
    constraints on budget, compute, and data availability.
    """

    def __init__(self, sources, budget):
        self.sources = {s.name: s for s in sources}
        self.budget = budget

    def chinchilla_ratio(self):
        """Compute the Chinchilla token/parameter ratio."""
        return self.budget.total_training_tokens / self.budget.model_params

    def over_training_factor(self):
        """How many times Chinchilla-optimal are we training."""
        chinchilla_tokens = 20 * self.budget.model_params
        return self.budget.total_training_tokens / chinchilla_tokens

    def compute_allocation(self, target_mix):
        """
        Compute token allocation per source given a target mix.

        target_mix: dict mapping source_name -> target fraction
        (fractions should sum to 1.0)

        Returns allocation accounting for:
        - Available token limits
        - Maximum epoch constraints
        - Quality degradation from multi-epoch
        """
        total = self.budget.total_training_tokens
        allocation = {}

        # First pass: allocate based on target mix
        for source_name, fraction in target_mix.items():
            if source_name not in self.sources:
                continue
            source = self.sources[source_name]

            target_tokens = int(total * fraction)

            # Cap by available tokens * max_epochs
            max_available = source.available_tokens * source.max_epochs
            actual_tokens = min(target_tokens, max_available)

            # Compute epochs needed
            if source.available_tokens > 0:
                epochs = actual_tokens / source.available_tokens
            else:
                epochs = 0

            # Quality degradation
            epoch_calc = EpochBudgetCalculator()
            quality_factor = epoch_calc.quality_factor(max(1, int(epochs)))
            effective_tokens = epoch_calc.effective_tokens(
                source.available_tokens,
                max(1, int(math.ceil(epochs))),
            )

            allocation[source_name] = {
                "target_tokens": target_tokens,
                "actual_tokens": actual_tokens,
                "epochs": epochs,
                "quality_factor": quality_factor,
                "effective_tokens": min(effective_tokens, actual_tokens),
                "cost": actual_tokens * source.cost_per_token,
            }

        # Second pass: redistribute unallocated tokens
        allocated = sum(
            a["actual_tokens"] for a in allocation.values()
        )
        remaining = total - allocated

        if remaining > 0:
            # Distribute to sources that have headroom
            for source_name, alloc in allocation.items():
                source = self.sources[source_name]
                headroom = (
                    source.available_tokens * source.max_epochs
                    - alloc["actual_tokens"]
                )
                if headroom > 0:
                    additional = min(remaining, headroom)
                    alloc["actual_tokens"] += additional
                    remaining -= additional
                    if remaining <= 0:
                        break

        return allocation

    def print_budget_report(self, allocation):
        """Print a formatted budget report."""
        print("=" * 80)
        print(f"DATA BUDGET REPORT")
        print(f"Model: {self.budget.model_params/1e9:.0f}B parameters")
        print(f"Total tokens: {self.budget.total_training_tokens/1e12:.1f}T")
        print(f"Chinchilla ratio: {self.chinchilla_ratio():.0f}x "
              f"(Chinchilla-optimal = 20x)")
        print(f"Over-training factor: {self.over_training_factor():.0f}x")
        print("=" * 80)
        print()
        print(f"{'Source':<20} {'Tokens':<12} {'Epochs':<8} "
              f"{'Quality':<10} {'Effective':<12} {'Cost':<10}")
        print("-" * 72)

        total_cost = 0
        total_effective = 0

        for name, alloc in sorted(
            allocation.items(),
            key=lambda x: x[1]["actual_tokens"],
            reverse=True,
        ):
            tokens_str = f"{alloc['actual_tokens']/1e12:.2f}T"
            effective_str = f"{alloc['effective_tokens']/1e12:.2f}T"
            cost_str = f"${alloc['cost']/1e6:.1f}M"

            print(f"{name:<20} {tokens_str:<12} "
                  f"{alloc['epochs']:<8.1f} "
                  f"{alloc['quality_factor']:<10.2f} "
                  f"{effective_str:<12} {cost_str:<10}")

            total_cost += alloc["cost"]
            total_effective += alloc["effective_tokens"]

        print("-" * 72)
        print(f"{'TOTAL':<20} "
              f"{sum(a['actual_tokens'] for a in allocation.values())/1e12:.2f}T"
              f"{'':>20} "
              f"{total_effective/1e12:.2f}T"
              f"{'':>2} ${total_cost/1e6:.1f}M")

# Usage example: Llama 3 scale training
sources = [
    DataSource("english_web", 4_000_000_000_000, 0.85, 4,
               0.0, "web"),
    DataSource("multilingual_web", 3_000_000_000_000, 0.75, 3,
               0.0, "web"),
    DataSource("code", 2_000_000_000_000, 0.90, 4,
               0.0, "code"),
    DataSource("books", 500_000_000_000, 0.95, 2,
               0.0, "books"),
    DataSource("academic", 300_000_000_000, 0.90, 2,
               0.0, "books"),
    DataSource("synthetic_math", 500_000_000_000, 0.80, 1,
               5e-9, "synthetic"),
    DataSource("synthetic_code", 1_000_000_000_000, 0.75, 1,
               3e-9, "synthetic"),
    DataSource("synthetic_reasoning", 500_000_000_000, 0.70, 1,
               8e-9, "synthetic"),
]

budget = TrainingBudget(
    total_training_tokens=15_000_000_000_000,
    compute_budget_flops=int(6 * 70e9 * 15e12),
    monetary_budget_usd=100_000_000,
    model_params=int(70e9),
)

calculator = DataBudgetCalculator(sources, budget)

target_mix = {
    "english_web": 0.30,
    "multilingual_web": 0.15,
    "code": 0.20,
    "books": 0.05,
    "academic": 0.03,
    "synthetic_math": 0.10,
    "synthetic_code": 0.10,
    "synthetic_reasoning": 0.07,
}

allocation = calculator.compute_allocation(target_mix)
calculator.print_budget_report(allocation)

When to Stop Training

Learning Rate Schedule and Data Exhaustion

The learning rate schedule is typically cosine decay, reaching near-zero at the planned end of training. If you run out of data before the learning rate decays, you have two options:

  1. Extend with epochs: Continue training on repeated data. Quality degrades slowly.
  2. Early stop: End training at the data boundary. Wastes the remaining compute budget.

The right choice depends on the quality degradation rate versus the compute wasted by stopping early:

def should_continue_training(
    unique_tokens_remaining,
    tokens_to_train_remaining,
    current_quality,
    quality_per_epoch_loss=0.03,
):
    """
    Decide whether to continue training with repeated data
    or stop early.

    Returns (continue, reasoning)
    """
    if unique_tokens_remaining >= tokens_to_train_remaining:
        return True, "Sufficient unique data remaining"

    # How many epochs would remaining training require?
    if unique_tokens_remaining <= 0:
        return False, "No data remaining"

    required_epochs = tokens_to_train_remaining / unique_tokens_remaining

    # Quality cost of those epochs
    quality_loss = quality_per_epoch_loss * math.log2(
        max(required_epochs, 1)
    )

    # Compute cost of stopping early
    training_fraction_remaining = (
        tokens_to_train_remaining
        / (unique_tokens_remaining + tokens_to_train_remaining)
    )
    compute_wasted = training_fraction_remaining

    if quality_loss < compute_wasted * 0.5:
        return True, (
            f"Continue: {required_epochs:.1f} epochs, "
            f"quality loss {quality_loss:.3f} is less than "
            f"compute waste {compute_wasted:.3f}"
        )
    else:
        return False, (
            f"Stop: {required_epochs:.1f} epochs would cost "
            f"{quality_loss:.3f} quality, exceeding acceptable loss"
        )
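Worked through for a concrete, made-up case (300B unique tokens left against 900B tokens of remaining schedule), the same arithmetic looks like this:

```python
import math

# 900B tokens of schedule left but only 300B unique -> 3 epochs needed
required_epochs = 900e9 / 300e9

# Quality cost of repeating data (same log2 model as the function above)
quality_loss = 0.03 * math.log2(required_epochs)   # ~0.048

# Fraction of the remaining budget wasted by stopping now
compute_wasted = 900e9 / (300e9 + 900e9)           # 0.75

# 0.048 < 0.5 * 0.75, so the heuristic says continue training
decision = quality_loss < 0.5 * compute_wasted
```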

Monitoring for Data Exhaustion

class DataExhaustionMonitor:
    """
    Monitor training for signs of data exhaustion:
    - Validation loss stops improving
    - Training loss drops faster than validation loss (overfitting)
    - Gradient norms decrease (model memorizing)
    """

    def __init__(self, patience=1000, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_val_loss = float('inf')
        self.steps_without_improvement = 0
        self.history = []

    def update(self, step, train_loss, val_loss, grad_norm):
        """Update monitor with current metrics."""
        self.history.append({
            "step": step,
            "train_loss": train_loss,
            "val_loss": val_loss,
            "grad_norm": grad_norm,
        })

        # Check for improvement
        if val_loss < self.best_val_loss - self.min_delta:
            self.best_val_loss = val_loss
            self.steps_without_improvement = 0
        else:
            self.steps_without_improvement += 1

    def should_stop(self):
        """Check if training should stop due to data exhaustion."""
        if len(self.history) < 100:
            return False, "Insufficient history"

        # Check 1: Patience exceeded
        if self.steps_without_improvement >= self.patience:
            return True, "Validation loss stalled"

        # Check 2: Train/val gap growing (overfitting)
        recent = self.history[-100:]
        train_losses = [h["train_loss"] for h in recent]
        val_losses = [h["val_loss"] for h in recent]

        gap_start = val_losses[0] - train_losses[0]
        gap_end = val_losses[-1] - train_losses[-1]

        if gap_end - gap_start > 0.1:
            return True, (
                f"Overfitting detected: train/val gap grew from "
                f"{gap_start:.3f} to {gap_end:.3f}"
            )

        return False, "Training healthy"

Future Projections

The Token Supply Forecast

📊

Projected Token Supply vs Demand (2024-2027)

| Year | Model Scale | Tokens Needed | Available (Natural) | Gap |
|---|---|---|---|---|
| 2024 | 70B-400B | 15T-30T | ~15T | 0-15T |
| 2025 | 400B-1T | 30T-100T | ~18T | 12T-82T |
| 2026 | 1T-5T | 100T-500T | ~20T | 80T-480T |
| 2027 | 5T+ | 500T+ | ~22T | 478T+ |

The gap between token demand and natural data supply grows exponentially. By 2026, frontier models may require 5-20x more tokens than exist in high-quality natural text. Synthetic data generation is not optional — it is the only path forward.

💡 The Key Insight

The data scaling law has two regimes. Below the Chinchilla optimum, adding more data has high marginal value — each token substantially reduces loss. Above the Chinchilla optimum (the over-training regime), more data still helps but with diminishing returns, and the economic justification shifts from training efficiency to inference cost amortization. The data wall is where even the over-training regime runs out of natural data, and the field must transition to synthetic data generation. The calculator in this post helps navigate all three regimes by quantifying effective tokens, quality degradation from epochs, and the economics of over-training.