Kaplan (2020) said scale the model; Chinchilla (2022) said scale the data too; o1 (2024) said scale inference compute. All three were right, and all three discovered power laws. Quality improves predictably with training FLOPs (Kaplan compute exponent: 0.050), data tokens (Kaplan data exponent: 0.095), and thinking tokens (exponent ~0.15-0.25 on reasoning tasks). The frontier is three-dimensional: you can train a bigger model, train on more data, or generate longer chains of thought. Each axis costs compute, but the costs are spent differently — training compute is amortized across all queries; inference compute is paid per request.
This post covers all three scaling laws in detail, the mathematics of compute-optimal allocation, and the emerging multi-dimensional frontier.
Kaplan Scaling Laws (2020)
Power Law Relationships
Kaplan et al. found that language model cross-entropy loss scales as a power law in model parameters $N$, dataset tokens $D$, and compute $C$ (FLOPs):

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$ are the scaling exponents, and $N_c$, $D_c$, $C_c$ are fitted constants.

The combined scaling law accounts for model size and data simultaneously:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
```python
import numpy as np
from dataclasses import dataclass
from scipy.optimize import minimize


@dataclass
class ScalingLawParams:
    """Parameters for a power-law scaling relationship."""
    alpha: float        # Scaling exponent
    constant: float     # Multiplicative constant
    irreducible: float  # Irreducible loss floor


class KaplanScalingLaw:
    """
    Kaplan et al. (2020) scaling laws.

    Key findings:
    1. Loss scales as a power law of N, D, C
    2. Larger models are more sample-efficient
    3. With fixed compute, Kaplan recommended:
       - Allocate most compute to model size
       - N should scale as C^0.73
       - D should scale as C^0.27
       (This was later corrected by Chinchilla)
    """

    def __init__(self):
        # Kaplan's fitted parameters
        self.alpha_N = 0.076  # Model size exponent
        self.alpha_D = 0.095  # Data size exponent
        self.alpha_C = 0.050  # Compute exponent
        self.N_c = 8.8e13     # Model size constant
        self.D_c = 5.4e13     # Data size constant
        self.C_c = 2.7e28     # Compute constant in FLOPs (~3.1e8 PF-days)
        self.L_inf = 1.69     # Irreducible loss

    def loss_vs_params(self, N):
        """
        Predict loss given model parameters (N).
        Assumes sufficient data (D >> needed).
        """
        return self.L_inf + (self.N_c / N) ** self.alpha_N

    def loss_vs_data(self, D):
        """
        Predict loss given dataset tokens (D).
        Assumes sufficient model size.
        """
        return self.L_inf + (self.D_c / D) ** self.alpha_D

    def loss_vs_compute(self, C):
        """
        Predict loss given compute budget (C in FLOPs).
        Assumes optimal allocation between N and D.
        """
        return self.L_inf + (self.C_c / C) ** self.alpha_C

    def loss_combined(self, N, D):
        """
        Combined scaling law: loss as a function of both N and D.
        Captures the interaction between model size and data.
        """
        term_N = (self.N_c / N) ** (self.alpha_N / self.alpha_D)
        term_D = self.D_c / D
        return self.L_inf + (term_N + term_D) ** self.alpha_D

    def kaplan_optimal_allocation(self, C_total):
        """
        Kaplan's recommended compute allocation: most compute
        goes to model size (N ~ C^0.73, D ~ C^0.27).
        Note: this was later shown to be suboptimal by Chinchilla.
        """
        # Approximate: 6 * N * D = C (FLOPs for transformer training)
        N_optimal = (C_total / 6) ** 0.73
        D_optimal = (C_total / 6) ** 0.27
        return N_optimal, D_optimal

    def predict_from_small_runs(self, small_runs):
        """
        Fit scaling law parameters from small training runs
        and extrapolate to larger scales.

        small_runs: list of (N, D, loss) tuples
        """
        # Fit alpha_N and N_c from runs varying N
        N_values = np.array([r[0] for r in small_runs])
        losses = np.array([r[2] for r in small_runs])
        # Log-log linear fit: log(L - L_inf) = -alpha * log(N) + const
        # First estimate L_inf as slightly below the minimum observed loss
        L_inf_est = min(losses) * 0.9
        log_N = np.log(N_values)
        log_excess_loss = np.log(losses - L_inf_est)
        # Linear regression in log space:
        # slope = -alpha, intercept = alpha * log(N_c)
        slope, intercept = np.polyfit(log_N, log_excess_loss, 1)
        return {
            "alpha_N": -slope,
            "N_c": np.exp(intercept / (-slope)),
            "L_inf": L_inf_est,
        }
```
Kaplan Scaling: Loss vs Model Parameters

(Chart: predicted loss under the Kaplan law vs observed loss for GPT-series models at 10M, 100M, 1B, 10B, 100B, and 1T parameters; data not reproduced here.)
Chinchilla Scaling Laws (2022)
Compute-Optimal Training
Hoffmann et al. (the Chinchilla paper) trained over 400 models ranging from 70M to 16B parameters on different amounts of data. Their finding corrected Kaplan: the compute-optimal allocation scales model size and data equally. A 10x compute increase should yield approximately a 3.2x larger model trained on 3.2x more data (since $N_{\text{opt}} \propto C^{0.5}$, $D_{\text{opt}} \propto C^{0.5}$, and $10^{0.5} \approx 3.16$).
The Chinchilla scaling law:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$, and $E = 1.69$.
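Under the $C \approx 6ND$ approximation, the compute-optimal split also has a closed form: differentiating $L(N, C/6N)$ and setting the derivative to zero gives $N_{\text{opt}} = G\,(C/6)^{\beta/(\alpha+\beta)}$ with $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$. The sketch below (my derivation from the fitted loss, not a formula quoted from the paper) checks the closed form against a brute-force search:

```python
import numpy as np

# Chinchilla fitted constants (from the text above)
A, B, alpha, beta, E = 406.4, 410.7, 0.34, 0.28, 1.69

def loss(N, D):
    return A / N**alpha + B / D**beta + E

def closed_form_optimal_N(C):
    # Minimize loss(N, C/(6N)); setting d/dN = 0 gives
    # N_opt = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta))
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    return G * (C / 6) ** (beta / (alpha + beta))

C = 5.76e23  # roughly Chinchilla 70B's training budget in FLOPs
N_star = closed_form_optimal_N(C)

# Brute-force check over a dense log-spaced grid of model sizes
grid = np.exp(np.linspace(np.log(1e6), np.log(C / 6), 20000))
N_grid = grid[np.argmin([loss(N, C / (6 * N)) for N in grid])]

assert abs(np.log(N_star / N_grid)) < 0.01  # agree to within ~1%
```

Note that with these constants the exponent $\beta/(\alpha+\beta) \approx 0.45$, close to (but not exactly) the 0.5 of the equal-scaling rule of thumb.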
```python
class ChinchillaScalingLaw:
    """
    Hoffmann et al. (2022) Chinchilla scaling laws.

    Key correction to Kaplan:
    - Kaplan: scale N faster than D (N ~ C^0.73, D ~ C^0.27)
    - Chinchilla: scale N and D equally (N ~ C^0.50, D ~ C^0.50)

    The practical implication:
    - GPT-3 175B trained on 300B tokens was UNDERTRAINED
    - Chinchilla 70B trained on 1.4T tokens matched GPT-3
      with 4x less compute at inference

    Chinchilla rule of thumb:
        Optimal D = 20 * N (tokens = 20x parameters)
    """

    def __init__(self):
        # Fitted parameters from the Chinchilla paper
        self.A = 406.4
        self.B = 410.7
        self.alpha = 0.34
        self.beta = 0.28
        self.E = 1.69  # Irreducible entropy

    def loss(self, N, D):
        """Predict loss for model size N and data tokens D."""
        return self.A / (N ** self.alpha) + self.B / (D ** self.beta) + self.E

    def compute_optimal(self, C_total):
        """
        Find the compute-optimal model size and data size
        for a given compute budget.

        C = 6 * N * D (approximate FLOPs for a transformer)
        Minimize L(N, D) subject to 6 * N * D = C_total.
        """
        def objective(log_N):
            N = np.exp(log_N)
            D = C_total / (6 * N)
            if D <= 0:
                return 1e10
            return self.loss(N, D)

        # Grid search over log-space
        log_N_range = np.linspace(np.log(1e6), np.log(C_total / 6), 1000)
        losses = [objective(ln) for ln in log_N_range]
        best_idx = np.argmin(losses)
        N_optimal = np.exp(log_N_range[best_idx])
        D_optimal = C_total / (6 * N_optimal)
        return {
            "N_optimal": int(N_optimal),
            "D_optimal": int(D_optimal),
            "tokens_per_param": D_optimal / N_optimal,
            "predicted_loss": self.loss(N_optimal, D_optimal),
            "compute_budget": C_total,
        }

    def analyze_model(self, name, N, D, C=None):
        """Analyze whether a model was trained compute-optimally."""
        if C is None:
            C = 6 * N * D
        optimal = self.compute_optimal(C)
        actual_loss = self.loss(N, D)
        optimal_loss = optimal["predicted_loss"]
        tokens_per_param = D / N
        optimal_ratio = optimal["tokens_per_param"]
        return {
            "model": name,
            "actual_N": N,
            "actual_D": D,
            "actual_tokens_per_param": round(tokens_per_param, 1),
            "optimal_N": optimal["N_optimal"],
            "optimal_D": optimal["D_optimal"],
            "optimal_tokens_per_param": round(optimal_ratio, 1),
            "actual_loss": round(actual_loss, 4),
            "optimal_loss": round(optimal_loss, 4),
            "loss_gap": round(actual_loss - optimal_loss, 4),
            "diagnosis": (
                "undertrained" if tokens_per_param < optimal_ratio * 0.5
                else "overtrained" if tokens_per_param > optimal_ratio * 2
                else "near-optimal"
            ),
        }
```
Chinchilla Analysis of Major Models
| Model | Parameters | Tokens | Tokens/Param | Chinchilla Optimal Tokens/Param | Diagnosis |
|---|---|---|---|---|---|
| GPT-3 175B | 175B | 300B | 1.7x | 20x | Severely undertrained |
| Chinchilla 70B | 70B | 1.4T | 20x | 20x | Compute-optimal |
| Llama 2 70B | 70B | 2T | 28.6x | 20x | Slightly overtrained |
| Llama 3.1 8B | 8B | 15T | 1875x | 20x | Heavily overtrained |
| Llama 3.1 405B | 405B | 15T | 37x | 20x | Moderately overtrained |
| Mistral 7B | 7B | 8T+ | 1143x+ | 20x | Heavily overtrained |
Chinchilla-optimal training minimizes total training compute for a target loss. But it does not minimize inference cost. A 70B model is cheaper to serve than a 175B model. Modern practice intentionally overtrains smaller models (Llama 3.1 8B uses 1875x tokens-per-parameter vs Chinchilla’s 20x) because the extra training compute is spent once, while inference savings accrue over every query.
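The arithmetic behind that choice fits in a few lines. A back-of-the-envelope sketch, using the $6ND$ training and $2N$-per-token inference approximations from this post; the token counts echo the table above, the 1,000-tokens-per-query figure is an assumption, and the comparison is cost-only (it ignores any quality gap between an 8B and a 70B model):

```python
# Compare an overtrained 8B model against a Chinchilla-optimal 70B
# as query volume grows. Training ~ 6*N*D FLOPs; serving ~ 2*N FLOPs/token.

def train_flops(N, D):
    return 6 * N * D

def serve_flops(N, tokens):
    return 2 * N * tokens

small = {"N": 8e9,  "D": 15e12}   # Llama-3.1-8B-style: heavily overtrained
large = {"N": 70e9, "D": 1.4e12}  # Chinchilla-style: compute-optimal

# The small model actually costs MORE to train (7.2e23 vs 5.88e23 FLOPs)...
assert train_flops(**small) > train_flops(**large)

# ...but far less to serve. Find the query volume where total cost
# crosses over, assuming (illustratively) 1,000 generated tokens/query.
tokens_per_query = 1000

def total(cfg, n_queries):
    return train_flops(**cfg) + n_queries * serve_flops(cfg["N"], tokens_per_query)

# Training gap / per-query serving gap:
breakeven = (train_flops(**small) - train_flops(**large)) / (
    serve_flops(large["N"], tokens_per_query)
    - serve_flops(small["N"], tokens_per_query)
)
# ~1e9 queries: past about a billion queries, the overtrained
# small model is cheaper in total.
assert total(small, 2 * breakeven) < total(large, 2 * breakeven)
```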
Inference-Time Scaling (2024)
Quality Scales with Thinking
The third scaling law, discovered through chain-of-thought reasoning and formalized by work on OpenAI’s o1 and DeepSeek-R1, shows that model quality improves predictably with the number of inference-time tokens. Instead of making the model bigger (more parameters at training), make it think longer (more tokens at inference).
$$\text{Error}(T) = \frac{A}{T^{\gamma}} + E_{\min}$$

where $T$ is the number of “thinking” tokens generated, $E_{\min}$ is the irreducible error floor for the task, and $\gamma \approx 0.15$-$0.25$ depending on the task.
```python
class InferenceTimeScaling:
    """
    Inference-time scaling law.

    Given a fixed model, quality improves as the model generates
    more reasoning tokens before producing the final answer.

    This creates a new tradeoff: instead of training a larger
    model, use a smaller model but let it think longer.
    """

    def __init__(self):
        # Fitted parameters (task-dependent, illustrative)
        self.task_params = {
            "math": {"A": 15.0, "gamma": 0.25, "E": 0.05},
            "coding": {"A": 12.0, "gamma": 0.20, "E": 0.08},
            "reasoning": {"A": 18.0, "gamma": 0.22, "E": 0.10},
            "general": {"A": 8.0, "gamma": 0.15, "E": 0.15},
        }

    def error_rate(self, thinking_tokens, task="math"):
        """
        Predict error rate given the number of thinking tokens.
        More thinking tokens = lower error rate.
        """
        params = self.task_params.get(task, self.task_params["general"])
        return params["A"] / (thinking_tokens ** params["gamma"]) + params["E"]

    def compute_cost(self, model_params, thinking_tokens):
        """
        FLOPs cost of inference with T thinking tokens.
        Cost per token ~ 2 * N (for a transformer), total ~ 2 * N * T.
        """
        return 2 * model_params * thinking_tokens

    def optimal_thinking_budget(self, model_params, target_error, task="math"):
        """
        Find the minimum thinking tokens needed to achieve
        a target error rate.
        """
        params = self.task_params.get(task, self.task_params["general"])
        if target_error <= params["E"]:
            # Below the irreducible error: unachievable at any budget
            return {"achievable": False, "target_error": target_error}
        # error = A / T^gamma + E  =>  T = (A / (error - E))^(1/gamma)
        T = (params["A"] / (target_error - params["E"])) ** (1.0 / params["gamma"])
        return {
            "thinking_tokens": int(T),
            "inference_flops": self.compute_cost(model_params, int(T)),
            "target_error": target_error,
            "achievable": True,
        }

    def compare_scaling_strategies(self, base_model_N, base_thinking_T,
                                   compute_multiplier, task="math"):
        """
        Given extra compute, compare two strategies:
        1. Scale model: a larger model with the same thinking time
        2. Scale thinking: the same model with more thinking time
        Which reduces error more for the same compute?
        """
        base_cost = self.compute_cost(base_model_N, base_thinking_T)
        target_cost = base_cost * compute_multiplier

        # Strategy 1: larger model, same thinking time
        # New model size: N' = target_cost / (2 * T)
        new_N = target_cost / (2 * base_thinking_T)
        # Kaplan: loss ~ N^(-0.076), so the old/new loss ratio
        # (improvement factor, > 1) is (new_N / base_N)^0.076
        model_scale_factor = (new_N / base_model_N) ** 0.076

        # Strategy 2: same model, more thinking time
        new_T = target_cost / (2 * base_model_N)
        error_more_thinking = self.error_rate(new_T, task)
        error_base = self.error_rate(base_thinking_T, task)

        return {
            "compute_multiplier": compute_multiplier,
            "strategy_1_larger_model": {
                "new_N": int(new_N),
                "thinking_T": base_thinking_T,
                "error_reduction_factor": round(model_scale_factor, 3),
            },
            "strategy_2_more_thinking": {
                "model_N": base_model_N,
                "new_T": int(new_T),
                "base_error": round(error_base, 4),
                "new_error": round(error_more_thinking, 4),
                # Old/new error ratio, > 1 means improvement
                # (same convention as strategy 1)
                "error_reduction_factor": round(
                    error_base / max(error_more_thinking, 1e-8), 3
                ),
            },
        }
```
Inference-Time Scaling: Error Rate vs Thinking Tokens

(Chart: error rate for 8B, 70B, and 405B models at 100 to 20K thinking tokens; data not reproduced here.)
The Multi-Dimensional Frontier
Three Dimensions of Scaling
You can improve model quality by spending compute in three ways:
- Training parameters ($N$): Make the model larger. One-time cost. Affects all queries equally.
- Training data ($D$): Train on more tokens. One-time cost. Improves base knowledge.
- Inference compute ($T$): Generate more thinking tokens per query. Per-query cost. Improves reasoning quality.
```python
class MultiDimensionalScaling:
    """
    Multi-dimensional scaling law that combines training-time
    and inference-time scaling.

    Given a total compute budget split between training and
    inference, find the optimal allocation.

    Total cost = training_cost + inference_cost_per_query * n_queries
    Training cost            = 6 * N * D
    Inference cost per query = 2 * N * T
    """

    def __init__(self):
        self.chinchilla = ChinchillaScalingLaw()
        self.inference = InferenceTimeScaling()

    def optimal_allocation(self, total_budget, n_queries, task="math"):
        """
        Find the optimal allocation of total compute between
        training (N, D) and inference (T).

        total_budget = 6*N*D + n_queries * 2*N*T
        Minimize the error rate on the given task.
        """
        best_config = None
        best_error = float("inf")

        # Search over model sizes (log-spaced)
        for log_N in np.linspace(np.log(1e8), np.log(1e12), 100):
            N = np.exp(log_N)
            # For each N, search over the training/inference split
            for train_fraction in np.linspace(0.1, 0.9, 50):
                train_budget = total_budget * train_fraction
                inference_budget = total_budget * (1 - train_fraction)
                # Training: 6*N*D = train_budget
                D = train_budget / (6 * N)
                if D < N:  # Require at least N training tokens
                    continue
                # Inference: n_queries * 2*N*T = inference_budget
                T = inference_budget / (n_queries * 2 * N)
                if T < 10:  # Minimum thinking tokens
                    continue
                # Training determines base model quality...
                train_loss = self.chinchilla.loss(N, D)
                # ...and inference-time thinking reduces error further
                inference_error = self.inference.error_rate(T, task)
                # Simplified combined model: lower training loss
                # enables better inference-time scaling
                combined_error = (
                    train_loss * inference_error / self.chinchilla.E
                )
                if combined_error < best_error:
                    best_error = combined_error
                    best_config = {
                        "N": int(N),
                        "D": int(D),
                        "T": int(T),
                        "train_fraction": round(train_fraction, 2),
                        "train_budget": train_budget,
                        "inference_budget": inference_budget,
                        "tokens_per_param": round(D / N, 1),
                        "combined_error": round(combined_error, 4),
                    }
        return best_config

    def frontier_surface(self, budget_range, query_counts):
        """
        Compute the error-compute frontier for different deployment
        scenarios (varying query counts). Returns the optimal
        configuration and achievable error for each
        (budget, n_queries) pair.
        """
        frontier = []
        for budget in budget_range:
            for n_queries in query_counts:
                config = self.optimal_allocation(budget, n_queries)
                if config:
                    frontier.append({
                        "budget": budget,
                        "n_queries": n_queries,
                        "optimal_N": config["N"],
                        "optimal_D": config["D"],
                        "optimal_T": config["T"],
                        "error": config["combined_error"],
                        "train_fraction": config["train_fraction"],
                    })
        return frontier
```
The Key Insight: Query Count Determines Strategy
```python
class DeploymentStrategyAdvisor:
    """
    Advise on the optimal scaling strategy for a deployment scenario.

    Low-volume (research):    invest in training, use a large model
    High-volume (production): invest in inference optimization
    """

    def advise(self, total_budget, expected_queries,
               latency_requirement_ms=None):
        """Recommend a scaling strategy for a deployment scenario."""
        mds = MultiDimensionalScaling()
        config = mds.optimal_allocation(total_budget, expected_queries)
        if config is None:
            return {"error": "Budget too small for any configuration"}

        # Fraction of total spend that goes to inference
        per_query_cost = 2 * config["N"] * config["T"]
        total_inference_cost = per_query_cost * expected_queries
        inference_fraction = total_inference_cost / (total_budget + 1e-8)

        # Strategy classification
        if config["train_fraction"] > 0.7:
            strategy = "training-dominant"
            explanation = (
                "Low query count. Invest in the largest, best-trained "
                "model. Inference compute per query is relatively cheap."
            )
        elif config["train_fraction"] < 0.3:
            strategy = "inference-dominant"
            explanation = (
                "High query count. Use a smaller model and invest in "
                "inference-time scaling (more thinking tokens per "
                "query). Training is cheap relative to total "
                "inference spend."
            )
        else:
            strategy = "balanced"
            explanation = (
                "Moderate query count. Balance model size against "
                "inference-time thinking. Neither dominates the "
                "total cost."
            )

        # Latency check (rough estimate: 20 ms per thinking token on H100)
        latency_ok = True
        if latency_requirement_ms is not None:
            estimated_latency = config["T"] * 20
            latency_ok = estimated_latency <= latency_requirement_ms

        return {
            "strategy": strategy,
            "explanation": explanation,
            "recommended_N": config["N"],
            "recommended_D": config["D"],
            "recommended_T": config["T"],
            "train_fraction": config["train_fraction"],
            "inference_fraction": round(inference_fraction, 2),
            "estimated_latency_ms": config["T"] * 20,
            "latency_ok": latency_ok,
        }
```
Optimal Strategy by Deployment Scenario
| Scenario | Total Budget | Queries | Optimal N | Optimal T | Strategy |
|---|---|---|---|---|---|
| Research prototype | $100K | 10K | 70B | 5000 | Training-dominant (large model) |
| Startup MVP | $1M | 1M | 8B | 2000 | Balanced |
| Production API | $10M | 100M | 8B | 500 | Inference-dominant (small+fast) |
| Enterprise search | $50M | 1B | 3B | 200 | Inference-dominant (tiny+fast) |
| Frontier research | $100M | 100K | 405B | 10000 | Training-dominant (biggest model) |
Building a Scaling Law Predictor
Fitting Scaling Laws from Small Experiments
The most practical application of scaling laws is predicting the performance of a large training run from small-scale experiments. Train 5-10 small models (10M to 1B parameters), fit the scaling law, and extrapolate to predict the loss of a 70B model before spending $2M on training it.
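As a minimal illustration of that workflow, the sketch below fits a log-log line to synthetic losses generated from a known power law. (Real runs are noisier, and the irreducible loss would itself need estimating; here it is assumed known so the fit can be checked exactly.)

```python
import numpy as np

# Synthetic "small runs": losses drawn from a known power law
# L(N) = L_inf + (N_c / N)^alpha, so the recovered fit can be verified.
rng = np.random.default_rng(0)
alpha_true, N_c_true, L_inf = 0.076, 8.8e13, 1.69
N_runs = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = L_inf + (N_c_true / N_runs) ** alpha_true
losses *= 1 + rng.normal(0, 1e-3, size=losses.shape)  # small measurement noise

# Fit: log(L - L_inf) = -alpha * log(N) + alpha * log(N_c)
slope, intercept = np.polyfit(np.log(N_runs), np.log(losses - L_inf), 1)
alpha_fit = -slope
N_c_fit = np.exp(intercept / alpha_fit)

# Extrapolate almost two orders of magnitude beyond the largest run
pred_70B = L_inf + (N_c_fit / 70e9) ** alpha_fit
true_70B = L_inf + (N_c_true / 70e9) ** alpha_true
assert abs(alpha_fit - alpha_true) < 0.01
assert abs(pred_70B - true_70B) < 0.05
```

The class below does the same thing for the full two-variable Chinchilla form, plus bootstrap confidence intervals.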
```python
class ScalingLawPredictor:
    """
    Predict large-scale training outcomes from small experiments.

    Workflow:
    1. Train 5-10 small models at different (N, D) points
    2. Fit the Chinchilla loss function to these points
    3. Extrapolate to predict loss at the target (N, D)
    4. Estimate a confidence interval via bootstrap

    This saves millions of dollars by preventing failed
    training runs.
    """

    def __init__(self):
        self.observed_points = []
        self.fitted_params = None

    def add_observation(self, N, D, loss):
        """Add a training run observation."""
        self.observed_points.append({
            "N": N, "D": D, "loss": loss, "C": 6 * N * D,
        })

    def fit(self):
        """
        Fit scaling law parameters to the observed data by
        least squares on relative error.

        Model: L(N, D) = A/N^alpha + B/D^beta + E
        Parameters to fit: A, alpha, B, beta, E
        """
        if len(self.observed_points) < 5:
            raise ValueError(
                "Need at least 5 observations to fit a scaling "
                "law reliably"
            )
        # Initial guess (Chinchilla values)
        x0 = np.array([
            np.log(406.4),  # log(A)
            0.34,           # alpha
            np.log(410.7),  # log(B)
            0.28,           # beta
            1.69,           # E
        ])

        def loss_fn(params):
            A, alpha = np.exp(params[0]), params[1]
            B, beta = np.exp(params[2]), params[3]
            E = params[4]
            total_error = 0.0
            for obs in self.observed_points:
                predicted = (A / (obs["N"] ** alpha)
                             + B / (obs["D"] ** beta) + E)
                actual = obs["loss"]
                # Relative error (better behaved for log-scale data)
                total_error += ((predicted - actual) / actual) ** 2
            return total_error

        result = minimize(
            loss_fn, x0,
            method="Nelder-Mead",
            options={"maxiter": 10000, "xatol": 1e-8},
        )
        self.fitted_params = {
            "A": np.exp(result.x[0]),
            "alpha": result.x[1],
            "B": np.exp(result.x[2]),
            "beta": result.x[3],
            "E": result.x[4],
            "fit_error": result.fun,
        }
        return self.fitted_params

    def predict(self, N, D):
        """Predict loss for a given (N, D) configuration."""
        if self.fitted_params is None:
            raise ValueError("Must call fit() first")
        p = self.fitted_params
        return (p["A"] / (N ** p["alpha"])
                + p["B"] / (D ** p["beta"])
                + p["E"])

    def predict_compute_optimal(self, C_target):
        """
        Predict the compute-optimal configuration and expected
        loss for a target compute budget.
        """
        if self.fitted_params is None:
            raise ValueError("Must call fit() first")
        best_loss = float("inf")
        best_config = None
        for log_N in np.linspace(np.log(1e6), np.log(C_target / 6), 500):
            N = np.exp(log_N)
            D = C_target / (6 * N)
            loss = self.predict(N, D)
            if loss < best_loss:
                best_loss = loss
                best_config = {
                    "N": int(N),
                    "D": int(D),
                    "predicted_loss": round(loss, 4),
                    "tokens_per_param": round(D / N, 1),
                }
        return best_config

    def bootstrap_confidence(self, N, D, n_bootstrap=200):
        """
        Estimate a prediction confidence interval via bootstrap:
        resample observations, refit, and predict.
        """
        import random
        predictions = []
        n_obs = len(self.observed_points)
        for _ in range(n_bootstrap):
            # Resample with replacement
            resampled = random.choices(self.observed_points, k=n_obs)
            predictor = ScalingLawPredictor()
            for obs in resampled:
                predictor.add_observation(obs["N"], obs["D"], obs["loss"])
            try:
                predictor.fit()
                predictions.append(predictor.predict(N, D))
            except Exception:
                continue
        if not predictions:
            return None
        predictions.sort()
        n = len(predictions)
        return {
            "mean": round(np.mean(predictions), 4),
            "std": round(np.std(predictions), 4),
            "lower_95": round(predictions[int(0.025 * n)], 4),
            "upper_95": round(predictions[int(0.975 * n)], 4),
            "n_successful_fits": n,
        }
```
Practical Scaling Law Application
Decision-Making Framework
```python
class TrainingDecisionFramework:
    """
    Use scaling laws to make training decisions:
    1. Should I train a bigger model or train longer?
    2. What loss can I expect for my budget?
    3. How much compute do I need to reach a target loss?
    """

    def __init__(self, predictor):
        self.predictor = predictor

    def budget_to_loss(self, compute_budget):
        """Given a compute budget, predict the best achievable loss."""
        return self.predictor.predict_compute_optimal(compute_budget)

    def loss_to_budget(self, target_loss, precision=0.01):
        """Given a target loss, estimate the required compute."""
        # Binary search over compute budgets
        low = 1e18   # ~$1 of H100 time (see _compute_to_cost)
        high = 1e25  # ~$14M of H100 time
        for _ in range(100):
            mid = (low + high) / 2
            config = self.predictor.predict_compute_optimal(mid)
            if config is None:
                low = mid
                continue
            if config["predicted_loss"] > target_loss:
                low = mid
            else:
                high = mid
            if (high - low) / mid < precision:
                break
        return {
            "target_loss": target_loss,
            "required_compute": mid,
            "estimated_cost_h100": self._compute_to_cost(mid),
            "optimal_config": self.predictor.predict_compute_optimal(mid),
        }

    def should_scale_model_or_data(self, current_N, current_D,
                                   additional_budget):
        """
        Given the current training run, should you increase N or D
        with additional compute?
        """
        current_loss = self.predictor.predict(current_N, current_D)
        current_C = 6 * current_N * current_D

        # Option A: increase model size, keep data fixed
        new_C_a = current_C + additional_budget
        new_N_a = new_C_a / (6 * current_D)
        loss_a = self.predictor.predict(new_N_a, current_D)

        # Option B: keep model size, increase data
        new_D_b = (current_C + additional_budget) / (6 * current_N)
        loss_b = self.predictor.predict(current_N, new_D_b)

        # Option C: retrain with the optimal split
        optimal = self.predictor.predict_compute_optimal(
            current_C + additional_budget
        )

        return {
            "current_loss": round(current_loss, 4),
            "option_a_bigger_model": {
                "new_N": int(new_N_a),
                "new_D": int(current_D),
                "predicted_loss": round(loss_a, 4),
                "improvement": round(current_loss - loss_a, 4),
            },
            "option_b_more_data": {
                "new_N": int(current_N),
                "new_D": int(new_D_b),
                "predicted_loss": round(loss_b, 4),
                "improvement": round(current_loss - loss_b, 4),
            },
            "option_c_optimal_retrain": {
                "new_N": optimal["N"] if optimal else 0,
                "new_D": optimal["D"] if optimal else 0,
                "predicted_loss": (optimal["predicted_loss"]
                                   if optimal else 0),
                "improvement": round(
                    current_loss - (optimal["predicted_loss"]
                                    if optimal else current_loss), 4
                ),
            },
            "recommendation": self._recommend(
                loss_a, loss_b,
                optimal["predicted_loss"] if optimal else 1e10,
            ),
        }

    def _recommend(self, loss_a, loss_b, loss_c):
        """Recommend the best option."""
        best = min(loss_a, loss_b, loss_c)
        if best == loss_c:
            return "Retrain from scratch with optimal allocation"
        elif best == loss_a:
            return "Scale model size (keep data constant)"
        else:
            return "Train longer on more data (keep model constant)"

    def _compute_to_cost(self, flops):
        """
        Estimate dollar cost from FLOPs.
        H100 at ~$2/hour, ~1e15 FLOP/s peak at ~40% MFU.
        """
        h100_flops_per_second = 1e15 * 0.4
        h100_cost_per_second = 2.0 / 3600
        seconds = flops / h100_flops_per_second
        return round(seconds * h100_cost_per_second, 2)
```
Predicted Loss vs Compute Budget (Fitted Scaling Law)

(Chart: compute-optimal loss from 1e19 to 1e25 FLOPs vs observed points for GPT-3, Chinchilla, and Llama; data not reproduced here.)
Key Takeaways
Scaling laws are among the most reliable empirical findings in deep learning. They let you predict the outcome of a multi-million-dollar training run from small-scale experiments.
The three laws and their implications:
- Kaplan (2020): Loss scales as a power law of $N$, $D$, and $C$. Established that scaling is predictable. However, the recommended allocation (scale $N$ much faster than $D$) was wrong.
- Chinchilla (2022): Corrected Kaplan. Compute-optimal training scales $N$ and $D$ equally: $N \propto C^{0.5}$, $D \propto C^{0.5}$. Showed that GPT-3 was severely undertrained (300B tokens for 175B parameters, vs roughly 3.5T tokens at the 20-tokens-per-parameter optimum). Post-Chinchilla, the industry shifted to training smaller models on much more data (Llama 3.1 8B on 15T tokens).
- Inference-time scaling (2024): A fixed model improves predictably with more thinking tokens, with exponent $\gamma \approx 0.15$-$0.25$ (task-dependent). At $\gamma$ in that range, spending 10x more inference compute cuts the excess error above the task floor by roughly 30-45%.
- The multi-dimensional frontier: Quality depends on three compute dimensions: training parameters ($N$), training data ($D$), and inference tokens ($T$). The optimal allocation depends on query volume. Low-volume deployments should invest in large models (training-dominant); high-volume deployments should use smaller models with inference-time scaling (inference-dominant).
- Practical prediction: Fit scaling law parameters from 5-10 small training runs (costing around $10K total), then predict the loss of a $10M run with a 95% confidence interval of approximately 0.02-0.05 loss units. That accuracy is sufficient for go/no-go decisions on large training runs.
The fundamental equation: total compute for a deployment serving $n$ queries is

$$C_{\text{total}} = 6ND + n \cdot 2NT$$

The first term is training (spent once); the second is inference (spent per query). As $n$ grows, inference dominates, and the optimal strategy shifts toward smaller $N$ with larger $T$. The crossover point, where training cost equals total inference cost, is at $n^* = 3D/T$ queries.
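The crossover follows from setting the two terms equal: $6ND = n \cdot 2NT$ gives $n^* = 3D/T$, with $N$ cancelling out. A quick numeric check, using assumed (illustrative) values:

```python
# Crossover query count where cumulative inference compute equals
# training compute: 6*N*D = n * 2*N*T  =>  n_star = 3*D/T.
N = 8e9    # parameters (cancels out of the crossover)
D = 15e12  # training tokens
T = 1000   # thinking tokens per query

n_star = 3 * D / T
assert n_star == 4.5e10  # 45 billion queries

# Verify against the cost model directly
train = 6 * N * D
inference = n_star * 2 * N * T
assert abs(train - inference) / train < 1e-9
```

Note that $n^*$ scales with $D$: heavily overtrained models push the crossover out, which is another way of seeing why high-volume deployments favor small, overtrained models.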