Chinchilla says a 70B model needs 1.4T training tokens for compute-optimal training. Llama 3 70B trained on 15T tokens, 10.7x the Chinchilla recommendation, and achieved better quality than any compute-optimal baseline at comparable serving cost. The discrepancy is economics: Chinchilla optimizes training cost, but inference cost dominates production deployment. Over-training a 70B model by 10x costs roughly 10x more to train, but it makes every inference request cheaper, because a 70B model serves far faster than the ~200B model the same compute budget would buy under compute-optimal allocation. When you serve billions of requests, over-training is the rational choice.
The Chinchilla Scaling Law
The Formulation
The loss of a transformer language model with $N$ parameters trained on $D$ tokens follows:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $N$ is parameters, $D$ is training tokens, $A$ and $B$ are fitted constants, and $\alpha \approx 0.34$, $\beta \approx 0.28$ are the scaling exponents. The irreducible loss $E$ represents the entropy of natural language: even a perfect model cannot predict text with zero loss.
The compute budget constraint $C \approx 6ND$ (the standard approximation for dense transformers) gives the Chinchilla-optimal allocation:

$$N_{opt} \propto C^{0.5}, \qquad D_{opt} \propto C^{0.5}, \qquad D_{opt} \approx 20\,N_{opt}$$

This means parameters and data should scale equally with compute: doubling your compute budget should go to roughly equal increases in model size and training tokens.
The Chinchilla Table
Chinchilla-Optimal Data Requirements
| Model Size | Optimal Tokens (20x) | Compute (FLOPs) | Training Cost Estimate |
|---|---|---|---|
| 1B | 20B | 1.2e20 | $3K |
| 7B | 140B | 5.9e21 | $100K |
| 13B | 260B | 2.0e22 | $400K |
| 70B | 1.4T | 5.9e23 | $10M |
| 175B | 3.5T | 3.7e24 | $60M |
| 540B | 10.8T | 3.5e25 | $500M |
def chinchilla_optimal(compute_budget_flops):
"""
Compute Chinchilla-optimal model size and token count.
compute_budget_flops: total compute in FLOPs
Returns (optimal_params, optimal_tokens)
"""
# Chinchilla constants (from the paper)
# C = 6 * N * D
# Optimal ratio: D = 20 * N
# Therefore: C = 6 * N * 20N = 120 * N^2
# N = sqrt(C / 120)
import math
optimal_params = math.sqrt(compute_budget_flops / 120)
optimal_tokens = 20 * optimal_params
return int(optimal_params), int(optimal_tokens)
# Example: $10M compute budget at ~$2/A100-hour, 312 TFLOPS
# Total FLOPs: ~5.9e23
params, tokens = chinchilla_optimal(5.9e23)
# params ~ 70B, tokens ~ 1.4T
Why Over-Training Works
The Economic Argument
Chinchilla optimizes for compute-optimal training: given a fixed compute budget, what model achieves the lowest loss? But in production, the relevant cost is not training cost alone — it is training cost PLUS inference cost amortized over the model’s lifetime.
$$C_{total} = C_{train} + Q \cdot C_{query}$$

where $Q$ is the total number of queries the model will serve. For a model deployed at scale, $Q$ is in the billions, and inference cost dominates.
A smaller model that was over-trained (more tokens than Chinchilla-optimal for its size) has:
- Higher training cost per quality unit (wasteful from Chinchilla’s perspective)
- Lower inference cost per query (smaller model = fewer FLOPs per token)
- Better total economics when $Q$ is large
def compute_total_cost(
model_params,
training_tokens,
queries_served,
avg_tokens_per_query,
cost_per_training_flop,
cost_per_inference_flop,
):
"""
Compute total lifecycle cost: training + inference.
model_params: number of parameters
training_tokens: number of tokens trained on
queries_served: total queries over model lifetime
avg_tokens_per_query: average tokens per inference query
cost_per_training_flop: $/FLOP for training
cost_per_inference_flop: $/FLOP for inference
"""
# Training cost
training_flops = 6 * model_params * training_tokens
training_cost = training_flops * cost_per_training_flop
# Inference cost (per query)
# Approximate: 2 * N FLOPs per token generated
inference_flops_per_query = (
2 * model_params * avg_tokens_per_query
)
total_inference_cost = (
inference_flops_per_query
* queries_served
* cost_per_inference_flop
)
return {
"training_cost": training_cost,
"inference_cost": total_inference_cost,
"total_cost": training_cost + total_inference_cost,
"training_fraction": training_cost / (
training_cost + total_inference_cost
),
}
# Comparison: Chinchilla-optimal 70B vs over-trained 8B
chinchilla_70b = compute_total_cost(
model_params=70e9,
training_tokens=1.4e12, # Chinchilla-optimal
queries_served=100e9, # 100B queries
avg_tokens_per_query=500,
cost_per_training_flop=1e-15,
cost_per_inference_flop=2e-15,
)
overtrained_8b = compute_total_cost(
model_params=8e9,
training_tokens=15e12, # 15T tokens (over-trained)
queries_served=100e9,
avg_tokens_per_query=500,
cost_per_training_flop=1e-15,
cost_per_inference_flop=2e-15,
)
# The over-trained 8B model costs more to train but
# dramatically less to serve. At 100B queries, it wins.
Empirical Evidence
Model Quality (MMLU) at Different Training Token Counts
[Figure: MMLU accuracy (%) vs. training token count for 7-9B models.] All of these models train far beyond Chinchilla-optimal. The pattern is clear: more tokens monotonically improve quality at the 7-9B scale, at least up to 15-18T tokens. The returns diminish but do not plateau.
The Modified Scaling Law
When accounting for inference cost, the optimal token count shifts:

$$D_{opt}^{serve} \propto D_{opt}^{Chinchilla} \cdot Q^{\gamma}$$

where $\gamma$ is an empirical exponent. For a model serving billions of queries, $D_{opt}^{serve}$ can be 10-200x the Chinchilla optimum.
For production deployment: if the model will serve more than roughly 1,000 queries per training dollar, over-training is economically justified. Llama 3's 15T tokens on 8B parameters cost more to train but roughly $100M less in inference compute over the model's deployment lifetime than a Chinchilla-optimal 70B that achieves similar quality.
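As a rough sketch of this shifted optimum, one can scale the Chinchilla token count by a power of the expected query volume. The `gamma` and `q0` values below are illustrative placeholders, not fitted constants:

```python
def inference_aware_tokens(model_params, queries_served,
                           gamma=0.3, q0=1e6):
    """Scale the Chinchilla-optimal token count by a power of the
    expected query volume. gamma and q0 are illustrative assumptions,
    not values fitted from data."""
    chinchilla_tokens = 20 * model_params
    if queries_served <= q0:
        return chinchilla_tokens
    multiplier = (queries_served / q0) ** gamma
    return chinchilla_tokens * multiplier

# 8B model serving 100B queries: the optimum shifts to ~30x Chinchilla
tokens = inference_aware_tokens(8e9, 100e9)
```

With these placeholder values the multiplier lands in the 10-200x range the text describes; in practice the exponent would be fit from serving economics.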
The Data Wall
What Is It
The data wall is the point at which you run out of high-quality training data. For English web text, the estimated ceiling after quality filtering is approximately 5-10T tokens. For all languages combined, perhaps 15-20T tokens.
Estimated High-Quality Token Supply by Source
| Source | Raw Tokens | Post-Filter Tokens | % of 15T Budget |
|---|---|---|---|
| English web (Common Crawl) | ~50T | ~4T | 27% |
| Non-English web | ~30T | ~3T | 20% |
| Code (GitHub, etc.) | ~5T | ~2T | 13% |
| Books | ~1T | ~0.5T | 3% |
| Academic papers | ~0.5T | ~0.3T | 2% |
| Synthetic data | Unlimited | ~5T | 33% |
| Total | ~87T | ~15T | 100% |
At 15T tokens, a 70B model has consumed virtually all available high-quality natural text. Scaling further requires either: (1) relaxing quality filters to include lower-quality text, (2) repeating data for multiple epochs, or (3) generating synthetic data.
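A back-of-envelope helper makes the three escape routes comparable. The capacity numbers below (synthetic cap, filter-relaxation gain) are illustrative assumptions, not measurements:

```python
def data_wall_options(unique_hq_tokens, target_tokens,
                      max_epochs=4, synthetic_cap=5e12,
                      relaxed_filter_gain=0.5):
    """Sketch how much of the gap past the data wall each option
    covers. Capacity numbers are illustrative assumptions.
    relaxed_filter_gain: extra unique tokens (as a fraction of the
    high-quality pool) recovered by loosening quality filters."""
    gap = max(0.0, target_tokens - unique_hq_tokens)
    return {
        "gap": gap,
        "from_relaxed_filters": unique_hq_tokens * relaxed_filter_gain,
        "from_extra_epochs": unique_hq_tokens * (max_epochs - 1),
        "from_synthetic": synthetic_cap,
    }

opts = data_wall_options(unique_hq_tokens=10e12, target_tokens=30e12)
# A 20T gap: extra epochs alone could cover it, at a quality cost
```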
Why the Wall Is Not Just About Volume
The data wall is not only about token count — it is about information diversity. Web text is highly redundant. The millionth article about “how to bake a chocolate cake” adds near-zero marginal information even though it adds tokens. The effective information content of web data follows a log curve:
$$I_{eff}(D) = k \cdot \log\left(1 + \frac{D}{D_0}\right)$$

where $D_0$ is a reference scale and $k$ depends on data quality. After a few trillion tokens, each additional trillion adds decreasing marginal information.
import math
def effective_information(tokens, d0=1e11, k=1.0):
"""
Estimate effective information content of a dataset.
tokens: number of tokens in dataset
d0: reference scale (~100B for web text)
k: quality coefficient (1.0 for filtered web, 1.5 for code)
Returns information in arbitrary units (useful for comparison).
"""
return k * math.log(1 + tokens / d0)
# Diminishing returns illustration
for t in [100e9, 500e9, 1e12, 5e12, 15e12]:
info = effective_information(t)
marginal = effective_information(t) - effective_information(t * 0.9)
print(f"{t/1e12:.1f}T tokens: info={info:.2f}, "
f"marginal(last 10%)={marginal:.3f}")
# Output shows marginal information per token decreasing rapidly
Epoch Counting
How Many Passes Before Quality Degrades
When you run out of unique data, you must re-use existing data for multiple epochs. The question: how many epochs before the model starts memorizing specific documents instead of learning generalizable patterns?
Empirical findings from Muennighoff et al. (2023) and others:
Quality vs Number of Epochs (7B Model, 100B Unique Tokens)
[Figure: quality as a percentage of 1-epoch quality (higher is better) vs. epoch count.] The relationship between epochs and quality degradation follows approximately:

$$Q(E) \approx 1 - c \cdot \log_2 E$$

where $c$ is a constant that depends on data diversity. High-diversity data (web text from many domains) tolerates more epochs than low-diversity data (e.g., a single textbook).
import math
class EpochBudgetCalculator:
"""
Calculate how many epochs are justified for each data source
based on its diversity and value.
"""
def __init__(self, quality_degradation_rate=0.05):
"""
quality_degradation_rate: quality loss per epoch doubling
(0.05 = 5% quality loss when going from 1 to 2 epochs)
"""
self.degradation_rate = quality_degradation_rate
def quality_factor(self, epochs):
"""
Compute quality factor for a given number of epochs.
1.0 = perfect (1 epoch), decreasing with more epochs.
"""
if epochs <= 1:
return 1.0
return max(0.0, 1.0 - self.degradation_rate * math.log2(epochs))
def effective_tokens(self, unique_tokens, epochs):
"""
Compute effective tokens accounting for epoch degradation.
Not all repeated tokens contribute equally -- later epochs
contribute less due to diminishing returns.
"""
effective = 0.0
for e in range(1, epochs + 1):
# Each epoch contributes quality_factor(e) worth of tokens
epoch_contribution = unique_tokens * self.quality_factor(e)
effective += epoch_contribution
return effective
def optimal_epochs(self, unique_tokens, target_total_tokens):
"""
Find optimal number of epochs given unique data and target.
Returns (epochs, effective_tokens, quality_factor)
"""
if unique_tokens >= target_total_tokens:
return 1, unique_tokens, 1.0
best_epochs = 1
best_effective = unique_tokens
for e in range(2, 50):
effective = self.effective_tokens(unique_tokens, e)
raw_total = unique_tokens * e
if raw_total >= target_total_tokens:
return e, effective, self.quality_factor(e)
if effective > best_effective:
best_effective = effective
best_epochs = e
else:
# Passed the point of diminishing returns
break
return best_epochs, best_effective, self.quality_factor(best_epochs)
# Example: 500B unique tokens, want 2T total
calc = EpochBudgetCalculator(quality_degradation_rate=0.05)
epochs, effective, quality = calc.optimal_epochs(500e9, 2e12)
# epochs = 4, effective = ~1.9T, quality = 0.90
Source-Specific Epoch Limits
Different data sources tolerate different epoch counts:
Recommended Maximum Epochs by Data Source
| Data Source | Max Epochs | Rationale |
|---|---|---|
| Web text (filtered) | 4 | High diversity, but repetitive topics |
| Code | 4-8 | Structural diversity is high, semantics repeat less |
| Books | 2 | Long documents, strong memorization risk |
| Math (natural) | 2-4 | Limited diversity in problem formats |
| Math (synthetic) | 1 | Already generated specifically, no repeat value |
| Instruction data (SFT) | 2-3 | Small dataset, memorization risk high |
| Translated text | 1 | Translation artifacts amplify on repeat |
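The table above can be encoded as a simple cap lookup. The keys and caps below mirror the table (taking the upper end of each range); the key names themselves are assumptions:

```python
# Upper-end epoch caps from the table above (illustrative encoding)
MAX_EPOCHS = {
    "web_filtered": 4,
    "code": 8,
    "books": 2,
    "math_natural": 4,
    "math_synthetic": 1,
    "instruction_sft": 3,
    "translated": 1,
}

def max_repeatable_tokens(source, unique_tokens):
    """Largest raw token count a source can contribute before
    exceeding its recommended epoch cap (default: 1 epoch)."""
    return unique_tokens * MAX_EPOCHS.get(source, 1)
```

A data pipeline can apply this cap before mixing sources, so no source silently exceeds its memorization-safe epoch count.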
Solutions to the Data Wall
Solution 1: Synthetic Data Generation
When natural data is exhausted, generate more. The key insight: a trained model can generate training data for the next generation of models if the synthetic data covers distributions that the natural data does not.
class SyntheticDataGenerator:
"""
Generate synthetic training data to extend beyond the data wall.
"""
def __init__(self, teacher_model, quality_filter):
"""
teacher_model: the model used to generate synthetic data
quality_filter: callable(text) -> (keep, score)
"""
self.teacher = teacher_model
self.filter = quality_filter
def generate_math_data(self, num_problems, difficulty_range):
"""
Generate synthetic math problems with step-by-step solutions.
Math is the highest-value synthetic data because correctness
is verifiable.
"""
results = []
attempts = 0
max_attempts = num_problems * 5
while len(results) < num_problems and attempts < max_attempts:
attempts += 1
difficulty = (
difficulty_range[0]
+ (difficulty_range[1] - difficulty_range[0])
* (attempts / max_attempts)
)
prompt = (
f"Generate a math problem at difficulty level "
f"{difficulty:.1f}/10. Include the problem statement "
f"and a detailed step-by-step solution. The solution "
f"must be verifiable."
)
response = self.teacher.generate(prompt, max_tokens=1000)
# Verify the solution (for math, we can check correctness)
keep, score = self.filter(response)
if keep:
results.append({
"text": response,
"type": "synthetic_math",
"difficulty": difficulty,
"quality_score": score,
})
return results
def generate_code_data(self, num_examples, languages):
"""
Generate synthetic code with documentation and tests.
Code quality is partially verifiable (syntax, tests pass).
"""
results = []
for lang in languages:
target = num_examples // len(languages)
for _ in range(target):
prompt = (
f"Write a complete, well-documented {lang} function "
f"that solves a practical programming problem. "
f"Include docstring, type annotations, and "
f"unit tests."
)
response = self.teacher.generate(
prompt, max_tokens=1500
)
keep, score = self.filter(response)
if keep:
results.append({
"text": response,
"type": "synthetic_code",
"language": lang,
"quality_score": score,
})
return results
def generate_reasoning_chains(self, seed_questions, num_per_seed):
"""
Generate extended reasoning chains from seed questions.
"""
results = []
for question in seed_questions:
for _ in range(num_per_seed):
prompt = (
f"Question: {question}\n\n"
f"Think through this step by step. Show your "
f"complete reasoning process, including any "
f"mistakes you consider and reject."
)
response = self.teacher.generate(
prompt, max_tokens=2000
)
keep, score = self.filter(response)
if keep:
results.append({
"text": f"Question: {question}\n\n{response}",
"type": "synthetic_reasoning",
"seed_question": question,
"quality_score": score,
})
return results
Solution 2: Multilingual Expansion
English data is the most scarce because it is the most consumed. But Chinese, German, French, and dozens of other languages have billions of underutilized tokens. Cross-lingual transfer means that non-English data improves English performance (at a reduced rate).
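The reduced-rate transfer can be sketched as a discounted token count. The `transfer_rate` below is an illustrative assumption; measured transfer varies by language pair and task:

```python
def english_equivalent_tokens(tokens_by_lang, transfer_rate=0.3):
    """Count non-English tokens at a reduced rate toward
    English-language capability. transfer_rate is an illustrative
    assumption, not a measured value."""
    total = 0.0
    for lang, tokens in tokens_by_lang.items():
        total += tokens if lang == "en" else tokens * transfer_rate
    return total

mix = {"en": 4e12, "zh": 1e12, "de": 0.5e12, "fr": 0.5e12}
# 4T English + 2T non-English -> 4.6T English-equivalent at rate 0.3
```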
Solution 3: Code as a Universal Data Source
Code improves reasoning across all domains. The working hypothesis: code teaches the model structured, logical thinking that transfers to natural language reasoning. Training on more code improves math, logic, and instruction-following benchmarks even when the evaluation is in natural language.
MMLU Score vs Code Fraction in Training Data (7B Model)
[Figure: MMLU (%) vs. code fraction in training data for a 7B model.] The optimal code fraction is around 20-30% of training data. Below that, the model misses the reasoning transfer. Above that, the model becomes code-specialized at the expense of general knowledge.
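One way to picture this trade-off is an inverted-U around the 20-30% sweet spot. The functional form and width below are illustrative choices, not fitted to benchmark data:

```python
def code_fraction_quality(code_frac, optimal=0.25, width=0.35):
    """Illustrative inverted-U: relative quality peaks near 25% code
    and falls off on either side. Both parameters are assumptions."""
    return max(0.0, 1.0 - ((code_frac - optimal) / width) ** 2)

# Peaks at 25% code; both 0% and ~60% code fall well below the peak
```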
The Data Budget Calculator
Complete Implementation
import math
from dataclasses import dataclass, field
@dataclass
class DataSource:
name: str
available_tokens: int # Unique tokens available
quality_score: float # 0.0-1.0
max_epochs: int # Maximum recommended epochs
cost_per_token: float # Cost to acquire/generate, in USD
category: str # "web", "code", "math", "books", "synthetic"
@dataclass
class TrainingBudget:
total_training_tokens: int
compute_budget_flops: int
monetary_budget_usd: float
model_params: int
class DataBudgetCalculator:
"""
Calculate optimal data allocation across sources given
constraints on budget, compute, and data availability.
"""
def __init__(self, sources, budget):
self.sources = {s.name: s for s in sources}
self.budget = budget
def chinchilla_ratio(self):
"""Compute the Chinchilla token/parameter ratio."""
return self.budget.total_training_tokens / self.budget.model_params
def over_training_factor(self):
"""How many times Chinchilla-optimal are we training."""
chinchilla_tokens = 20 * self.budget.model_params
return self.budget.total_training_tokens / chinchilla_tokens
def compute_allocation(self, target_mix):
"""
Compute token allocation per source given a target mix.
target_mix: dict mapping source_name -> target fraction
(fractions should sum to 1.0)
Returns allocation accounting for:
- Available token limits
- Maximum epoch constraints
- Quality degradation from multi-epoch
"""
total = self.budget.total_training_tokens
allocation = {}
# First pass: allocate based on target mix
for source_name, fraction in target_mix.items():
if source_name not in self.sources:
continue
source = self.sources[source_name]
target_tokens = int(total * fraction)
# Cap by available tokens * max_epochs
max_available = source.available_tokens * source.max_epochs
actual_tokens = min(target_tokens, max_available)
# Compute epochs needed
if source.available_tokens > 0:
epochs = actual_tokens / source.available_tokens
else:
epochs = 0
# Quality degradation
epoch_calc = EpochBudgetCalculator()
quality_factor = epoch_calc.quality_factor(max(1, int(epochs)))
effective_tokens = epoch_calc.effective_tokens(
source.available_tokens,
max(1, int(math.ceil(epochs))),
)
allocation[source_name] = {
"target_tokens": target_tokens,
"actual_tokens": actual_tokens,
"epochs": epochs,
"quality_factor": quality_factor,
"effective_tokens": min(effective_tokens, actual_tokens),
"cost": actual_tokens * source.cost_per_token,
}
# Second pass: redistribute unallocated tokens
allocated = sum(
a["actual_tokens"] for a in allocation.values()
)
remaining = total - allocated
if remaining > 0:
# Distribute to sources that have headroom
for source_name, alloc in allocation.items():
source = self.sources[source_name]
headroom = (
source.available_tokens * source.max_epochs
- alloc["actual_tokens"]
)
if headroom > 0:
additional = min(remaining, headroom)
alloc["actual_tokens"] += additional
remaining -= additional
if remaining <= 0:
break
return allocation
def print_budget_report(self, allocation):
"""Print a formatted budget report."""
print("=" * 80)
print(f"DATA BUDGET REPORT")
print(f"Model: {self.budget.model_params/1e9:.0f}B parameters")
print(f"Total tokens: {self.budget.total_training_tokens/1e12:.1f}T")
print(f"Chinchilla ratio: {self.chinchilla_ratio():.0f}x "
f"(Chinchilla-optimal = 20x)")
print(f"Over-training factor: {self.over_training_factor():.0f}x")
print("=" * 80)
print()
print(f"{'Source':<20} {'Tokens':<12} {'Epochs':<8} "
f"{'Quality':<10} {'Effective':<12} {'Cost':<10}")
print("-" * 72)
total_cost = 0
total_effective = 0
for name, alloc in sorted(
allocation.items(),
key=lambda x: x[1]["actual_tokens"],
reverse=True,
):
tokens_str = f"{alloc['actual_tokens']/1e12:.2f}T"
effective_str = f"{alloc['effective_tokens']/1e12:.2f}T"
cost_str = f"${alloc['cost']/1e6:.1f}M"
print(f"{name:<20} {tokens_str:<12} "
f"{alloc['epochs']:<8.1f} "
f"{alloc['quality_factor']:<10.2f} "
f"{effective_str:<12} {cost_str:<10}")
total_cost += alloc["cost"]
total_effective += alloc["effective_tokens"]
print("-" * 72)
print(f"{'TOTAL':<20} "
f"{sum(a['actual_tokens'] for a in allocation.values())/1e12:.2f}T"
f"{'':>20} "
f"{total_effective/1e12:.2f}T"
f"{'':>2} ${total_cost/1e6:.1f}M")
# Usage example: Llama 3 scale training
sources = [
DataSource("english_web", 4_000_000_000_000, 0.85, 4,
0.0, "web"),
DataSource("multilingual_web", 3_000_000_000_000, 0.75, 3,
0.0, "web"),
DataSource("code", 2_000_000_000_000, 0.90, 4,
0.0, "code"),
DataSource("books", 500_000_000_000, 0.95, 2,
0.0, "books"),
DataSource("academic", 300_000_000_000, 0.90, 2,
0.0, "books"),
DataSource("synthetic_math", 500_000_000_000, 0.80, 1,
5e-9, "synthetic"),
DataSource("synthetic_code", 1_000_000_000_000, 0.75, 1,
3e-9, "synthetic"),
DataSource("synthetic_reasoning", 500_000_000_000, 0.70, 1,
8e-9, "synthetic"),
]
budget = TrainingBudget(
total_training_tokens=15_000_000_000_000,
compute_budget_flops=int(6 * 70e9 * 15e12),
monetary_budget_usd=100_000_000,
model_params=int(70e9),
)
calculator = DataBudgetCalculator(sources, budget)
target_mix = {
"english_web": 0.30,
"multilingual_web": 0.15,
"code": 0.20,
"books": 0.05,
"academic": 0.03,
"synthetic_math": 0.10,
"synthetic_code": 0.10,
"synthetic_reasoning": 0.07,
}
allocation = calculator.compute_allocation(target_mix)
calculator.print_budget_report(allocation)
When to Stop Training
Learning Rate Schedule and Data Exhaustion
The learning rate schedule is typically cosine decay, reaching near-zero at the planned end of training. If you run out of data before the learning rate decays, you have two options:
- Extend with epochs: Continue training on repeated data. Quality degrades slowly.
- Early stop: End training at the data boundary. Wastes the remaining compute budget.
The right choice depends on the quality degradation rate versus the compute wasted by stopping early:
import math

def should_continue_training(
unique_tokens_remaining,
tokens_to_train_remaining,
current_quality,
quality_per_epoch_loss=0.03,
):
"""
Decide whether to continue training with repeated data
or stop early.
Returns (continue, reasoning)
"""
if unique_tokens_remaining >= tokens_to_train_remaining:
return True, "Sufficient unique data remaining"
# How many epochs would remaining training require?
if unique_tokens_remaining <= 0:
return False, "No data remaining"
required_epochs = tokens_to_train_remaining / unique_tokens_remaining
# Quality cost of those epochs
quality_loss = quality_per_epoch_loss * math.log2(
max(required_epochs, 1)
)
# Compute cost of stopping early
training_fraction_remaining = (
tokens_to_train_remaining
/ (unique_tokens_remaining + tokens_to_train_remaining)
)
compute_wasted = training_fraction_remaining
if quality_loss < compute_wasted * 0.5:
return True, (
f"Continue: {required_epochs:.1f} epochs, "
f"quality loss {quality_loss:.3f} is less than "
f"compute waste {compute_wasted:.3f}"
)
else:
return False, (
f"Stop: {required_epochs:.1f} epochs would cost "
f"{quality_loss:.3f} quality, exceeding acceptable loss"
)
Monitoring for Data Exhaustion
class DataExhaustionMonitor:
"""
Monitor training for signs of data exhaustion:
- Validation loss stops improving
- Training loss drops faster than validation loss (overfitting)
- Gradient norms decrease (model memorizing)
"""
def __init__(self, patience=1000, min_delta=0.001):
self.patience = patience
self.min_delta = min_delta
self.best_val_loss = float('inf')
self.steps_without_improvement = 0
self.history = []
def update(self, step, train_loss, val_loss, grad_norm):
"""Update monitor with current metrics."""
self.history.append({
"step": step,
"train_loss": train_loss,
"val_loss": val_loss,
"grad_norm": grad_norm,
})
# Check for improvement
if val_loss < self.best_val_loss - self.min_delta:
self.best_val_loss = val_loss
self.steps_without_improvement = 0
else:
self.steps_without_improvement += 1
def should_stop(self):
"""Check if training should stop due to data exhaustion."""
if len(self.history) < 100:
return False, "Insufficient history"
# Check 1: Patience exceeded
if self.steps_without_improvement >= self.patience:
return True, "Validation loss stalled"
# Check 2: Train/val gap growing (overfitting)
recent = self.history[-100:]
train_losses = [h["train_loss"] for h in recent]
val_losses = [h["val_loss"] for h in recent]
gap_start = val_losses[0] - train_losses[0]
gap_end = val_losses[-1] - train_losses[-1]
if gap_end - gap_start > 0.1:
return True, (
f"Overfitting detected: train/val gap grew from "
f"{gap_start:.3f} to {gap_end:.3f}"
)
return False, "Training healthy"
Future Projections
The Token Supply Forecast
Projected Token Supply vs Demand (2024-2027)
| Year | Model Scale | Tokens Needed | Available (Natural) | Gap |
|---|---|---|---|---|
| 2024 | 70B-400B | 15T-30T | ~15T | 0-15T |
| 2025 | 400B-1T | 30T-100T | ~18T | 12T-82T |
| 2026 | 1T-5T | 100T-500T | ~20T | 80T-480T |
| 2027 | 5T+ | 500T+ | ~22T | 478T+ |
The gap between token demand and natural data supply grows exponentially. By 2026, frontier models may require 5-20x more tokens than exist in high-quality natural text. Synthetic data generation is not optional — it is the only path forward.
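Encoding the projection table makes the shift concrete. The numbers below are illustrative midpoints of the ranges in the table, not independent estimates:

```python
# Midpoint estimates from the projection table above (illustrative)
PROJECTION = {
    2024: {"tokens_needed": 22e12, "natural_supply": 15e12},
    2025: {"tokens_needed": 65e12, "natural_supply": 18e12},
    2026: {"tokens_needed": 300e12, "natural_supply": 20e12},
    2027: {"tokens_needed": 500e12, "natural_supply": 22e12},
}

def synthetic_fraction_required(year):
    """Fraction of the token budget that must be synthetic,
    assuming natural supply is consumed first."""
    p = PROJECTION[year]
    gap = max(0.0, p["tokens_needed"] - p["natural_supply"])
    return gap / p["tokens_needed"]

# By 2026, over 90% of a frontier token budget would be synthetic
```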
The data scaling law has two regimes. Below the Chinchilla optimum, adding more data has high marginal value — each token substantially reduces loss. Above the Chinchilla optimum (the over-training regime), more data still helps but with diminishing returns, and the economic justification shifts from training efficiency to inference cost amortization. The data wall is where even the over-training regime runs out of natural data, and the field must transition to synthetic data generation. The calculator in this post helps navigate all three regimes by quantifying effective tokens, quality degradation from epochs, and the economics of over-training.