Part of the Frontier Research 2025-2026 series (22 of 30)

Kaplan (2020) said scale the model; Chinchilla (2022) said scale the data too; o1 (2024) said scale inference compute. All three were right, and all three discovered power laws. Quality improves predictably with training FLOPs (Kaplan compute exponent ≈ 0.050), data tokens (Kaplan data exponent ≈ 0.095), and thinking tokens (exponent ≈ 0.12-0.17 on reasoning tasks for o1-style models). The frontier is three-dimensional: you can train a bigger model, train on more data, or generate longer chains of thought. Each axis costs compute, but the costs are spent differently: training compute is amortized across all queries; inference compute is paid per request.

This post covers all three scaling laws in detail, the mathematics of compute-optimal allocation, and the emerging multi-dimensional frontier.

Kaplan Scaling Laws (2020)

Power Law Relationships

Kaplan et al. found that language model cross-entropy loss L scales as a power law in model parameters N, dataset tokens D, and compute FLOPs C:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}

L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}

L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

where \alpha_N \approx 0.076, \alpha_D \approx 0.095, and \alpha_C \approx 0.050 are the scaling exponents, and N_c, D_c, C_c are fitted constants.

The combined scaling law accounts for all three simultaneously:

L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}

import numpy as np
from dataclasses import dataclass
from scipy.optimize import minimize

@dataclass
class ScalingLawParams:
    """Parameters for a power-law scaling relationship."""
    alpha: float      # Scaling exponent
    constant: float   # Multiplicative constant
    irreducible: float  # Irreducible loss floor

class KaplanScalingLaw:
    """
    Kaplan et al. (2020) scaling laws.

    Key findings:
    1. Loss scales as power law of N, D, C
    2. Larger models are more sample-efficient
    3. With fixed compute, Kaplan recommended:
       - Allocate most compute to model size
       - N should scale as C^0.73
       - D should scale as C^0.27
       (This was later corrected by Chinchilla)
    """

    def __init__(self):
        # Kaplan's fitted parameters
        self.alpha_N = 0.076  # Model size exponent
        self.alpha_D = 0.095  # Data size exponent
        self.alpha_C = 0.050  # Compute exponent
        self.N_c = 8.8e13     # Model size constant
        self.D_c = 5.4e13     # Data size constant
        self.L_inf = 1.69     # Loss floor (Chinchilla-style; Kaplan's
                              # original fits absorb this into the power law)

    def loss_vs_params(self, N):
        """
        Predict loss given model parameters (N).
        Assumes sufficient data (D >> needed).
        """
        return self.L_inf + (self.N_c / N) ** self.alpha_N

    def loss_vs_data(self, D):
        """
        Predict loss given dataset tokens (D).
        Assumes sufficient model size.
        """
        return self.L_inf + (self.D_c / D) ** self.alpha_D

    def loss_vs_compute(self, C):
        """
        Predict loss given compute budget (C in FLOPs).
        Assumes optimal allocation between N and D.
        """
        # Kaplan's fitted compute constant: 3.1e8 PF-days,
        # converted to FLOPs (1 PF-day ~ 8.64e19 FLOPs)
        C_c = 3.1e8 * 8.64e19
        return self.L_inf + (C_c / C) ** self.alpha_C

    def loss_combined(self, N, D):
        """
        Combined scaling law: loss as function of both N and D.
        Captures the interaction between model size and data.
        """
        term_N = (self.N_c / N) ** (self.alpha_N / self.alpha_D)
        term_D = self.D_c / D
        return self.L_inf + (term_N + term_D) ** self.alpha_D

    def kaplan_optimal_allocation(self, C_total):
        """
        Kaplan's recommended compute allocation.
        Most compute goes to model size:
        N proportional to C^0.73, D proportional to C^0.27

        Note: This was later shown to be suboptimal
        by Chinchilla.
        """
        # Approximate: 6*N*D = C (FLOPs for transformer training).
        # Proportionality constants omitted; only the exponents
        # matter, and N_optimal * D_optimal = C_total / 6 holds.
        N_optimal = (C_total / 6) ** 0.73
        D_optimal = (C_total / 6) ** 0.27
        return N_optimal, D_optimal

    def predict_from_small_runs(self, small_runs):
        """
        Fit scaling law parameters from small training runs
        and extrapolate to larger scales.

        small_runs: list of (N, D, loss) tuples
        """
        # Fit alpha_N and N_c from runs varying N
        N_values = np.array([r[0] for r in small_runs])
        losses = np.array([r[2] for r in small_runs])

        # Log-log linear fit: log(L - L_inf) = -alpha * log(N) + const
        # Estimate L_inf as slightly below the minimum observed
        # loss so that all excess losses stay positive
        L_inf_est = min(losses) * 0.9

        log_N = np.log(N_values)
        log_excess_loss = np.log(losses - L_inf_est)

        # Linear regression in log space
        slope, intercept = np.polyfit(log_N, log_excess_loss, 1)

        return {
            "alpha_N": -slope,
            "N_c": np.exp(intercept / (-slope)),
            "L_inf": L_inf_est,
        }
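To make the exponent concrete, here is a standalone check of the model-size law (a sketch using only the Kaplan constants above; loss in nats, pure power-law form without a separate floor):

```python
# Quick check of the Kaplan model-size power law L(N) = (N_c / N)^0.076.
# Standalone sketch: constants from Kaplan et al. (2020); loss in nats.
N_c, alpha_N = 8.8e13, 0.076

def kaplan_loss(N: float) -> float:
    """Pure power-law loss (no separate irreducible floor)."""
    return (N_c / N) ** alpha_N

# Predicted loss at 1B parameters
loss_1b = kaplan_loss(1e9)            # ~2.38 nats

# Every 10x increase in N shrinks the loss by a constant factor:
decade_factor = 10 ** (-alpha_N)      # ~0.84, i.e. ~16% lower per decade

print(f"{loss_1b:.2f} {decade_factor:.3f}")
```

With an exponent this small, each order of magnitude of parameters buys only a ~16% multiplicative reduction in loss, which is why training-compute scaling alone is so expensive.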

Kaplan Scaling: Loss vs Model Parameters

| Metric | 10M | 100M | 1B | 10B | 100B | 1T |
|---|---|---|---|---|---|---|
| Predicted Loss (Kaplan) | 3.4 | 3.1 | 2.85 | 2.65 | 2.5 | 2.38 |
| Observed Loss (GPT-series) | 3.38 | 3.08 | 2.83 | 2.62 | 2.48 | |

Chinchilla Scaling Laws (2022)

Compute-Optimal Training

Hoffmann et al. (Chinchilla paper) trained over 400 models ranging from 70M to 16B parameters on different amounts of data. Their finding corrected Kaplan: the compute-optimal allocation scales model size and data equally. A 10x compute increase should yield approximately a 3.2x larger model trained on 3.2x more data (since C \approx 6ND, and 3.2 \times 3.2 \approx 10).

The Chinchilla scaling law:

L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E

where \alpha \approx 0.34, \beta \approx 0.28, A \approx 406.4, B \approx 410.7, and E \approx 1.69.

class ChinchillaScalingLaw:
    """
    Hoffmann et al. (2022) Chinchilla scaling laws.

    Key correction to Kaplan:
    - Kaplan: scale N faster than D (N ~ C^0.73, D ~ C^0.27)
    - Chinchilla: scale N and D equally (N ~ C^0.50, D ~ C^0.50)

    The practical implication:
    - GPT-3 175B trained on 300B tokens was UNDERTRAINED
    - Chinchilla 70B trained on 1.4T tokens matched GPT-3
      with 4x less compute at inference

    Chinchilla rule of thumb:
    Optimal D ~= 20 * N (tokens ~= 20x parameters).
    (The 20x rule comes from the paper's IsoFLOP analyses;
    minimizing the parametric loss below with these fitted
    constants actually gives a noticeably higher ratio, a
    known discrepancy in the published fit.)
    """

    def __init__(self):
        # Fitted parameters from the Chinchilla paper
        self.A = 406.4
        self.B = 410.7
        self.alpha = 0.34
        self.beta = 0.28
        self.E = 1.69  # Irreducible entropy

    def loss(self, N, D):
        """
        Predict loss for model size N and data tokens D.
        """
        return self.A / (N ** self.alpha) + self.B / (D ** self.beta) + self.E

    def compute_optimal(self, C_total):
        """
        Find the compute-optimal model size and data size
        for a given compute budget.

        C = 6 * N * D (approximate FLOPs for a transformer)

        Minimize L(N, D) subject to 6*N*D = C_total.
        """
        def objective(log_N):
            N = np.exp(log_N)
            D = C_total / (6 * N)
            if D <= 0:
                return 1e10
            return self.loss(N, D)

        # Search over log-space
        log_N_range = np.linspace(
            np.log(1e6), np.log(C_total / 6), 1000
        )
        losses = [objective(ln) for ln in log_N_range]
        best_idx = np.argmin(losses)

        N_optimal = np.exp(log_N_range[best_idx])
        D_optimal = C_total / (6 * N_optimal)

        return {
            "N_optimal": int(N_optimal),
            "D_optimal": int(D_optimal),
            "tokens_per_param": D_optimal / N_optimal,
            "predicted_loss": self.loss(N_optimal, D_optimal),
            "compute_budget": C_total,
        }

    def analyze_model(self, name, N, D, C=None):
        """
        Analyze whether a model was trained compute-optimally.
        """
        if C is None:
            C = 6 * N * D

        optimal = self.compute_optimal(C)
        actual_loss = self.loss(N, D)
        optimal_loss = optimal["predicted_loss"]

        tokens_per_param = D / N
        optimal_ratio = optimal["tokens_per_param"]

        return {
            "model": name,
            "actual_N": N,
            "actual_D": D,
            "actual_tokens_per_param": round(tokens_per_param, 1),
            "optimal_N": optimal["N_optimal"],
            "optimal_D": optimal["D_optimal"],
            "optimal_tokens_per_param": round(optimal_ratio, 1),
            "actual_loss": round(actual_loss, 4),
            "optimal_loss": round(optimal_loss, 4),
            "loss_gap": round(actual_loss - optimal_loss, 4),
            "diagnosis": (
                "undertrained" if tokens_per_param < optimal_ratio * 0.5
                else "overtrained" if tokens_per_param > optimal_ratio * 2
                else "near-optimal"
            ),
        }
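The numerical search in compute_optimal also has a closed-form counterpart: minimizing A/N^\alpha + B/D^\beta + E subject to C = 6ND (via the Lagrange condition) gives N_opt = G (C/6)^{\beta/(\alpha+\beta)} with G = (\alpha A / (\beta B))^{1/(\alpha+\beta)}. The standalone sketch below evaluates it, and illustrates a known quirk of the published constants:

```python
# Closed-form compute-optimal allocation for the Chinchilla loss
# L(N, D) = A/N^alpha + B/D^beta + E, minimized subject to C = 6*N*D.
A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28

a = beta / (alpha + beta)                       # N_opt ~ C^a, a ~ 0.45
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))

def optimal_allocation(C: float) -> tuple:
    """Return (N_opt, D_opt) for a training budget C in FLOPs."""
    N_opt = G * (C / 6) ** a
    D_opt = (C / 6) / N_opt                     # from the constraint
    return N_opt, D_opt

# Evaluate at Chinchilla's own budget: 70B params x 1.4T tokens
C_chinchilla = 6 * 70e9 * 1.4e12
N_opt, D_opt = optimal_allocation(C_chinchilla)
print(f"N_opt ~ {N_opt:.2e}, tokens/param ~ {D_opt / N_opt:.0f}")
```

Running this gives roughly N_opt ≈ 33B and ~90 tokens per parameter rather than 70B and 20x: the published parametric constants are not fully consistent with the paper's IsoFLOP-derived rule of thumb, a discrepancy documented in later replication work.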

Chinchilla Analysis of Major Models

| Model | Parameters | Tokens | Tokens/Param | Chinchilla Optimal Tokens/Param | Diagnosis |
|---|---|---|---|---|---|
| GPT-3 175B | 175B | 300B | 1.7x | 20x | Severely undertrained |
| Chinchilla 70B | 70B | 1.4T | 20x | 20x | Compute-optimal |
| Llama 2 70B | 70B | 2T | 28.6x | 20x | Slightly overtrained |
| Llama 3.1 8B | 8B | 15T | 1875x | 20x | Heavily overtrained |
| Llama 3.1 405B | 405B | 15T | 37x | 20x | Moderately overtrained |
| Mistral 7B | 7B | 8T+ | 1143x+ | 20x | Heavily overtrained |
Note: Post-Chinchilla models intentionally overtrain (more tokens per parameter) because inference cost depends on N, not D. A smaller model trained longer is cheaper to serve.
Note

Chinchilla-optimal training minimizes total training compute for a target loss. But it does not minimize inference cost. A 70B model is cheaper to serve than a 175B model. Modern practice intentionally overtrains smaller models (Llama 3.1 8B uses 1875x tokens-per-parameter vs Chinchilla’s 20x) because the extra training compute is spent once, while inference savings accrue over every query.

Inference-Time Scaling (2024)

Quality Scales with Thinking

The third scaling law, discovered through chain-of-thought reasoning and formalized by work on OpenAI’s o1 and DeepSeek-R1, shows that model quality improves predictably with the number of inference-time tokens. Instead of making the model bigger (more parameters at training), make it think longer (more tokens at inference).

L_{\text{inference}}(T) = \frac{A'}{T^{\gamma}} + E'

where T is the number of “thinking” tokens generated and \gamma \approx 0.1-0.3 depending on the task.

class InferenceTimeScaling:
    """
    Inference-time scaling law.

    Given a fixed model, quality improves as the model
    generates more reasoning tokens before producing
    the final answer.

    This creates a new tradeoff: instead of training a
    larger model, use a smaller model but let it think longer.
    """

    def __init__(self):
        # Illustrative parameters (task-dependent placeholders,
        # not fitted to real data)
        self.task_params = {
            "math": {"A": 15.0, "gamma": 0.25, "E": 0.05},
            "coding": {"A": 12.0, "gamma": 0.20, "E": 0.08},
            "reasoning": {"A": 18.0, "gamma": 0.22, "E": 0.10},
            "general": {"A": 8.0, "gamma": 0.15, "E": 0.15},
        }

    def error_rate(self, thinking_tokens, task="math"):
        """
        Predict error rate given number of thinking tokens.
        More thinking tokens = lower error rate.
        """
        params = self.task_params.get(
            task, self.task_params["general"]
        )
        return (
            params["A"] / (thinking_tokens ** params["gamma"])
            + params["E"]
        )

    def compute_cost(self, model_params, thinking_tokens):
        """
        Compute the FLOPs cost of inference with
        T thinking tokens.

        Cost per token ~ 2 * N (for a transformer)
        Total cost ~ 2 * N * T
        """
        flops_per_token = 2 * model_params
        total_flops = flops_per_token * thinking_tokens
        return total_flops

    def optimal_thinking_budget(self, model_params, target_error,
                                 task="math"):
        """
        Find the minimum thinking tokens needed to achieve
        a target error rate.
        """
        params = self.task_params.get(
            task, self.task_params["general"]
        )

        if target_error <= params["E"]:
            # Target below the irreducible floor: unreachable
            return {
                "thinking_tokens": None,
                "inference_flops": None,
                "target_error": target_error,
                "achievable": False,
            }

        # error = A / T^gamma + E
        # T^gamma = A / (error - E)
        # T = (A / (error - E))^(1/gamma)
        T = (params["A"] / (target_error - params["E"])) ** (
            1.0 / params["gamma"]
        )

        return {
            "thinking_tokens": int(T),
            "inference_flops": self.compute_cost(model_params, int(T)),
            "target_error": target_error,
            "achievable": True,
        }

    def compare_scaling_strategies(self, base_model_N,
                                     base_thinking_T,
                                     compute_multiplier,
                                     task="math"):
        """
        Given extra compute, compare two strategies:
        1. Scale model: use a larger model with same thinking time
        2. Scale thinking: use the same model with more thinking time

        Which reduces error more for the same compute?
        """
        base_cost = self.compute_cost(base_model_N, base_thinking_T)
        target_cost = base_cost * compute_multiplier

        # Strategy 1: Larger model, same thinking time
        # New model size: N' = target_cost / (2 * T)
        new_N = target_cost / (2 * base_thinking_T)
        # Approximate how loss changes with N
        # Using Kaplan: loss ~ N^(-0.076), so the improvement
        # factor (>1 means better) is (new_N / base_N)^(+0.076)
        model_scale_factor = (new_N / base_model_N) ** 0.076

        # Strategy 2: Same model, more thinking time
        new_T = target_cost / (2 * base_model_N)
        # Error from inference scaling law
        error_more_thinking = self.error_rate(new_T, task)
        error_base = self.error_rate(base_thinking_T, task)

        return {
            "compute_multiplier": compute_multiplier,
            "strategy_1_larger_model": {
                "new_N": int(new_N),
                "thinking_T": base_thinking_T,
                "error_reduction_factor": round(model_scale_factor, 3),
            },
            "strategy_2_more_thinking": {
                "model_N": base_model_N,
                "new_T": int(new_T),
                "base_error": round(error_base, 4),
                "new_error": round(error_more_thinking, 4),
                "error_reduction_factor": round(
                    error_more_thinking / max(error_base, 1e-8), 3
                ),
            },
        }
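The shape of this law is easy to internalize: because error decays as T^{-\gamma}, scaling the thinking budget by k multiplies the excess error (above the floor E) by a fixed factor k^{-\gamma}. A standalone sketch using the illustrative math parameters above (A = 15, \gamma = 0.25, E = 0.05; placeholders, not real fits):

```python
# Excess error above the floor decays as a power of thinking tokens T:
#   error(T) - E = A / T^gamma
# so scaling T by k multiplies the excess error by k^(-gamma).
A, gamma, E = 15.0, 0.25, 0.05   # illustrative "math" params from above

def excess_error(T: float) -> float:
    return A / T ** gamma

# Constant multiplicative improvement per doubling / per decade of T
per_doubling = 2 ** (-gamma)     # ~0.84: each doubling cuts excess ~16%
per_decade = 10 ** (-gamma)      # ~0.56: each 10x cuts excess ~44%

assert abs(excess_error(2000) / excess_error(1000) - per_doubling) < 1e-12
print(f"{per_doubling:.3f} {per_decade:.3f}")
```

With \gamma \approx 0.25, thinking-token scaling is steeper than Kaplan's parameter scaling (\alpha_N \approx 0.076): a 10x longer chain of thought cuts excess error by roughly 44%, versus roughly 16% for a 10x larger model.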

Inference-Time Scaling: Error Rate vs Thinking Tokens

| Thinking tokens | 100 | 500 | 1K | 2K | 5K | 10K | 20K |
|---|---|---|---|---|---|---|---|
| 8B model (error %) | 45 | 32 | 25 | 20 | 15 | 12 | 10 |
| 70B model (error %) | 28 | 18 | 13 | 10 | 7 | 5.5 | 4.5 |
| 405B model (error %) | 18 | 11 | 8 | 6 | 4 | 3 | 2.5 |

The Multi-Dimensional Frontier

Three Dimensions of Scaling

You can improve model quality by spending compute in three ways:

  1. Training parameters (N): Make the model larger. One-time cost. Affects all queries equally.
  2. Training data (D): Train on more tokens. One-time cost. Improves base knowledge.
  3. Inference compute (T): Generate more thinking tokens per query. Per-query cost. Improves reasoning quality.
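The cost structure of these three axes can be made concrete with the accounting used throughout this post (6ND training FLOPs, 2NT per query). A standalone sketch with hypothetical round numbers (an 8B model, 15T training tokens, 1K thinking tokens per query):

```python
# Cost accounting for the three scaling axes (hypothetical round numbers).
# Training FLOPs ~ 6*N*D (one-time); inference FLOPs ~ 2*N*T (per query).
N = 8e9      # parameters: one-time training cost AND per-query cost
D = 15e12    # training tokens: one-time cost only
T = 1_000    # thinking tokens per query: paid on every request

train_flops = 6 * N * D            # 7.2e23 FLOPs, spent once
per_query_flops = 2 * N * T        # 1.6e13 FLOPs, spent per request

# Break-even: query count at which total inference compute
# equals the entire training budget
break_even_queries = train_flops / per_query_flops   # 4.5e10 queries

print(f"{train_flops:.1e} {per_query_flops:.1e} {break_even_queries:.1e}")
```

At these numbers, total inference compute only overtakes training after roughly 45 billion queries, which is why high-volume deployments push toward smaller N (cheaper per query) while low-volume research pushes toward larger N.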
class MultiDimensionalScaling:
    """
    Multi-dimensional scaling law that combines training-time
    and inference-time scaling.

    Given a total compute budget split between training
    and inference, find the optimal allocation.

    Total cost = training_cost + inference_cost * n_queries
    Training cost = 6 * N * D
    Inference cost per query = 2 * N * T
    """

    def __init__(self):
        self.chinchilla = ChinchillaScalingLaw()
        self.inference = InferenceTimeScaling()

    def optimal_allocation(self, total_budget, n_queries,
                            task="math"):
        """
        Find optimal allocation of total compute between
        training (N, D) and inference (T).

        total_budget = 6*N*D + n_queries * 2*N*T

        We want to minimize the error rate on the given task.
        """
        best_config = None
        best_error = float("inf")

        # Search over model sizes (log-spaced)
        for log_N in np.linspace(np.log(1e8), np.log(1e12), 100):
            N = np.exp(log_N)

            # For each N, find optimal split between D and T
            for train_fraction in np.linspace(0.1, 0.9, 50):
                train_budget = total_budget * train_fraction
                inference_budget = total_budget * (1 - train_fraction)

                # Training: 6*N*D = train_budget
                D = train_budget / (6 * N)
                if D < N:  # Need at least N tokens
                    continue

                # Inference: n_queries * 2*N*T = inference_budget
                T = inference_budget / (n_queries * 2 * N)
                if T < 10:  # Minimum thinking tokens
                    continue

                # Compute error (combine training loss + inference)
                # Training determines base model quality
                train_loss = self.chinchilla.loss(N, D)

                # Inference-time thinking reduces error further
                inference_error = self.inference.error_rate(T, task)

                # Combined error: base quality * inference improvement
                # Simplified model: lower training loss enables
                # better inference-time scaling
                combined_error = (
                    train_loss * inference_error / self.chinchilla.E
                )

                if combined_error < best_error:
                    best_error = combined_error
                    best_config = {
                        "N": int(N),
                        "D": int(D),
                        "T": int(T),
                        "train_fraction": round(train_fraction, 2),
                        "train_budget": train_budget,
                        "inference_budget": inference_budget,
                        "tokens_per_param": round(D / N, 1),
                        "combined_error": round(combined_error, 4),
                    }

        return best_config

    def frontier_surface(self, budget_range, query_counts):
        """
        Compute the error-compute frontier for different
        deployment scenarios (varying query counts).

        Returns the optimal configuration and achievable
        error for each (budget, n_queries) pair.
        """
        frontier = []

        for budget in budget_range:
            for n_queries in query_counts:
                config = self.optimal_allocation(
                    budget, n_queries
                )
                if config:
                    frontier.append({
                        "budget": budget,
                        "n_queries": n_queries,
                        "optimal_N": config["N"],
                        "optimal_D": config["D"],
                        "optimal_T": config["T"],
                        "error": config["combined_error"],
                        "train_fraction": config["train_fraction"],
                    })

        return frontier

The Key Insight: Query Count Determines Strategy

class DeploymentStrategyAdvisor:
    """
    Advise on the optimal scaling strategy based on
    deployment scenario.

    Low-volume (research): invest in training, use large model
    High-volume (production): invest in inference optimization
    """

    def advise(self, total_budget, expected_queries,
               latency_requirement_ms=None):
        """
        Recommend scaling strategy for a deployment scenario.
        """
        mds = MultiDimensionalScaling()
        config = mds.optimal_allocation(
            total_budget, expected_queries
        )

        if config is None:
            return {"error": "Budget too small for any configuration"}

        # Compute inference cost fraction
        per_query_cost = 2 * config["N"] * config["T"]
        total_inference_cost = per_query_cost * expected_queries
        inference_fraction = (total_inference_cost
                              / (total_budget + 1e-8))

        # Strategy classification
        if config["train_fraction"] > 0.7:
            strategy = "training-dominant"
            explanation = (
                "Low query count. Invest in the largest, "
                "best-trained model. Inference compute per query "
                "is relatively cheap."
            )
        elif config["train_fraction"] < 0.3:
            strategy = "inference-dominant"
            explanation = (
                "High query count. Use a smaller model and "
                "invest in inference-time scaling (more thinking "
                "tokens per query). Training is cheap relative "
                "to total inference spend."
            )
        else:
            strategy = "balanced"
            explanation = (
                "Moderate query count. Balance between model "
                "size and inference-time thinking. Neither "
                "dominates the total cost."
            )

        # Latency check
        latency_ok = True
        if latency_requirement_ms is not None:
            # Rough estimate: 20ms per thinking token on H100
            estimated_latency = config["T"] * 20
            latency_ok = estimated_latency <= latency_requirement_ms

        return {
            "strategy": strategy,
            "explanation": explanation,
            "recommended_N": config["N"],
            "recommended_D": config["D"],
            "recommended_T": config["T"],
            "train_fraction": config["train_fraction"],
            "inference_fraction": round(inference_fraction, 2),
            "estimated_latency_ms": config["T"] * 20,
            "latency_ok": latency_ok,
        }

Optimal Strategy by Deployment Scenario

| Scenario | Total Budget | Queries | Optimal N | Optimal T | Strategy |
|---|---|---|---|---|---|
| Research prototype | $100K | 10K | 70B | 5000 | Training-dominant (large model) |
| Startup MVP | $1M | 1M | 8B | 2000 | Balanced |
| Production API | $10M | 100M | 8B | 500 | Inference-dominant (small+fast) |
| Enterprise search | $50M | 1B | 3B | 200 | Inference-dominant (tiny+fast) |
| Frontier research | $100M | 100K | 405B | 10000 | Training-dominant (biggest model) |
Note: Optimal configurations assume tasks that benefit from inference-time thinking (math, reasoning, coding). For simple tasks like classification, inference scaling has lower gamma and larger models are preferred.
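The advisor's latency check can also be inverted: given a latency budget, the same rough 20 ms-per-thinking-token figure (an assumed estimate, hardware- and batching-dependent in practice) caps the usable thinking budget. A minimal sketch:

```python
# Invert the advisor's rough latency model: ~20 ms per thinking token
# (an assumed figure; real latency depends on hardware and batching).
MS_PER_THINKING_TOKEN = 20

def max_thinking_tokens(latency_budget_ms: float) -> int:
    """Largest thinking budget T that fits the latency requirement."""
    return int(latency_budget_ms // MS_PER_THINKING_TOKEN)

# An interactive 2-second budget permits only ~100 thinking tokens;
# a 60-second batch job permits ~3000.
print(max_thinking_tokens(2_000), max_thinking_tokens(60_000))
```

This is why the inference-dominant rows in the table pair small T with small N: latency, not just total FLOPs, bounds how much thinking a production query can afford.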

Building a Scaling Law Predictor

Fitting Scaling Laws from Small Experiments

The most practical application of scaling laws is predicting the performance of a large training run from small-scale experiments. Train 5-10 small models (10M to 1B parameters), fit the scaling law, and extrapolate to predict the loss of a 70B model before spending $2M on training it.

class ScalingLawPredictor:
    """
    Predict large-scale training outcomes from small experiments.

    Workflow:
    1. Train 5-10 small models at different (N, D) points
    2. Fit the Chinchilla loss function to these points
    3. Extrapolate to predict loss at target (N, D)
    4. Estimate confidence interval via bootstrap

    This saves millions of dollars by preventing
    failed training runs.
    """

    def __init__(self):
        self.observed_points = []
        self.fitted_params = None

    def add_observation(self, N, D, loss):
        """Add a training run observation."""
        self.observed_points.append({
            "N": N, "D": D, "loss": loss,
            "C": 6 * N * D,
        })

    def fit(self):
        """
        Fit scaling law parameters to observed data.

        Model: L(N, D) = A/N^alpha + B/D^beta + E

        Parameters to fit: A, alpha, B, beta, E
        """
        if len(self.observed_points) < 5:
            raise ValueError(
                "Need at least 5 observations to fit "
                "scaling law reliably"
            )

        # Initial guess (Chinchilla values)
        x0 = np.array([
            np.log(406.4),  # log(A)
            0.34,            # alpha
            np.log(410.7),  # log(B)
            0.28,            # beta
            1.69,            # E
        ])

        def loss_fn(params):
            A = np.exp(params[0])
            alpha = params[1]
            B = np.exp(params[2])
            beta = params[3]
            E = params[4]

            total_error = 0
            for obs in self.observed_points:
                predicted = A / (obs["N"] ** alpha) + B / (obs["D"] ** beta) + E
                actual = obs["loss"]
                # Relative error (better for log-scale data)
                total_error += ((predicted - actual) / actual) ** 2

            return total_error

        result = minimize(
            loss_fn, x0,
            method="Nelder-Mead",
            options={"maxiter": 10000, "xatol": 1e-8},
        )

        self.fitted_params = {
            "A": np.exp(result.x[0]),
            "alpha": result.x[1],
            "B": np.exp(result.x[2]),
            "beta": result.x[3],
            "E": result.x[4],
            "fit_error": result.fun,
        }

        return self.fitted_params

    def predict(self, N, D):
        """Predict loss for a given (N, D) configuration."""
        if self.fitted_params is None:
            raise ValueError("Must call fit() first")

        p = self.fitted_params
        loss = (p["A"] / (N ** p["alpha"])
                + p["B"] / (D ** p["beta"])
                + p["E"])
        return loss

    def predict_compute_optimal(self, C_target):
        """
        Predict the compute-optimal configuration
        and expected loss for a target compute budget.
        """
        if self.fitted_params is None:
            raise ValueError("Must call fit() first")

        p = self.fitted_params
        best_loss = float("inf")
        best_config = None

        for log_N in np.linspace(np.log(1e6), np.log(C_target / 6), 500):
            N = np.exp(log_N)
            D = C_target / (6 * N)
            loss = self.predict(N, D)

            if loss < best_loss:
                best_loss = loss
                best_config = {
                    "N": int(N),
                    "D": int(D),
                    "predicted_loss": round(loss, 4),
                    "tokens_per_param": round(D / N, 1),
                }

        return best_config

    def bootstrap_confidence(self, N, D, n_bootstrap=200):
        """
        Estimate prediction confidence interval via bootstrap.
        Resample observations, refit, and predict.
        """
        import random

        predictions = []
        n_obs = len(self.observed_points)

        for _ in range(n_bootstrap):
            # Resample with replacement
            resampled = random.choices(
                self.observed_points, k=n_obs
            )

            # Refit
            predictor = ScalingLawPredictor()
            for obs in resampled:
                predictor.add_observation(
                    obs["N"], obs["D"], obs["loss"]
                )

            try:
                predictor.fit()
                pred = predictor.predict(N, D)
                predictions.append(pred)
            except Exception:
                continue

        if not predictions:
            return None

        predictions.sort()
        n = len(predictions)
        return {
            "mean": round(np.mean(predictions), 4),
            "std": round(np.std(predictions), 4),
            "lower_95": round(predictions[int(0.025 * n)], 4),
            "upper_95": round(predictions[int(0.975 * n)], 4),
            "n_successful_fits": n,
        }
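The fitting workflow is easiest to see on a single axis. If model size is held fixed (and large) across the small runs, the N-term is a constant absorbed into the loss floor, the data term is a pure power law, and \beta falls out of a log-log linear fit. A standalone sketch on synthetic, noiseless data generated from the Chinchilla constants (real runs add noise and need the full 5-parameter fit above):

```python
import numpy as np

# Recover the data exponent beta from synthetic small runs.
# Ground truth: Chinchilla-style loss with a fixed, large model size,
# so the N-term is a constant absorbed into the loss floor.
A, alpha, B, beta, E = 406.4, 0.34, 410.7, 0.28, 1.69
N_fixed = 1e12
floor = A / N_fixed ** alpha + E                 # constant part of the loss

D_runs = np.array([1e9, 1e10, 1e11, 1e12])       # small-run token counts
losses = B / D_runs ** beta + floor              # noiseless observations

# log(L - floor) = log(B) - beta * log(D): linear in log-log space
slope, intercept = np.polyfit(np.log(D_runs), np.log(losses - floor), 1)
beta_hat, B_hat = -slope, np.exp(intercept)

print(f"beta_hat={beta_hat:.4f} B_hat={B_hat:.1f}")  # recovers 0.28, ~410.7
```

On real data the floor is unknown and must be estimated jointly (as in predict_from_small_runs earlier), which is exactly what makes extrapolation error bars, like the bootstrap interval, necessary.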

Practical Scaling Law Application

Decision-Making Framework

class TrainingDecisionFramework:
    """
    Use scaling laws to make training decisions:
    1. Should I train a bigger model or train longer?
    2. What loss can I expect for my budget?
    3. How much compute to reach a target loss?
    """

    def __init__(self, predictor):
        self.predictor = predictor

    def budget_to_loss(self, compute_budget):
        """Given a compute budget, predict the best achievable loss."""
        return self.predictor.predict_compute_optimal(compute_budget)

    def loss_to_budget(self, target_loss, precision=0.01):
        """Given a target loss, estimate the required compute."""
        # Binary search over compute budgets
        low = 1e18   # Lower bound (trivially cheap)
        high = 1e25  # Upper bound (frontier scale, roughly $100M on H100s)

        for _ in range(100):
            mid = (low + high) / 2
            config = self.predictor.predict_compute_optimal(mid)

            if config is None:
                low = mid
                continue

            if config["predicted_loss"] > target_loss:
                low = mid
            else:
                high = mid

            if (high - low) / mid < precision:
                break

        return {
            "target_loss": target_loss,
            "required_compute": mid,
            "estimated_cost_h100": self._compute_to_cost(mid),
            "optimal_config": self.predictor.predict_compute_optimal(mid),
        }

    def should_scale_model_or_data(self, current_N, current_D,
                                     additional_budget):
        """
        Given current training, should you increase N or D
        with additional compute?
        """
        current_loss = self.predictor.predict(current_N, current_D)
        current_C = 6 * current_N * current_D

        # Option A: increase model size, keep data
        new_C_a = current_C + additional_budget
        new_N_a = new_C_a / (6 * current_D)
        loss_a = self.predictor.predict(new_N_a, current_D)

        # Option B: keep model size, increase data
        new_D_b = (current_C + additional_budget) / (6 * current_N)
        loss_b = self.predictor.predict(current_N, new_D_b)

        # Option C: optimal split
        optimal = self.predictor.predict_compute_optimal(
            current_C + additional_budget
        )

        return {
            "current_loss": round(current_loss, 4),
            "option_a_bigger_model": {
                "new_N": int(new_N_a),
                "new_D": int(current_D),
                "predicted_loss": round(loss_a, 4),
                "improvement": round(current_loss - loss_a, 4),
            },
            "option_b_more_data": {
                "new_N": int(current_N),
                "new_D": int(new_D_b),
                "predicted_loss": round(loss_b, 4),
                "improvement": round(current_loss - loss_b, 4),
            },
            "option_c_optimal_retrain": {
                "new_N": optimal["N"] if optimal else 0,
                "new_D": optimal["D"] if optimal else 0,
                "predicted_loss": (optimal["predicted_loss"]
                                   if optimal else 0),
                "improvement": round(
                    current_loss - (optimal["predicted_loss"]
                                    if optimal else current_loss), 4
                ),
            },
            "recommendation": self._recommend(loss_a, loss_b,
                                               optimal["predicted_loss"]
                                               if optimal else 1e10),
        }

    def _recommend(self, loss_a, loss_b, loss_c):
        """Recommend the best option."""
        best = min(loss_a, loss_b, loss_c)
        if best == loss_c:
            return "Retrain from scratch with optimal allocation"
        elif best == loss_a:
            return "Scale model size (keep data constant)"
        else:
            return "Train longer on more data (keep model constant)"

    def _compute_to_cost(self, flops):
        """
        Estimate dollar cost from FLOPs.
        Assumes an H100 at ~$2/hour with ~1e15 peak half-precision
        FLOP/s, sustained at ~40% MFU.
        """
        h100_flops_per_second = 1e15 * 0.4
        h100_cost_per_second = 2.0 / 3600
        seconds = flops / h100_flops_per_second
        cost = seconds * h100_cost_per_second
        return round(cost, 2)
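
To make the framework concrete, here is a hypothetical predictor it could be wired to: a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β. The coefficients below are the published Hoffmann et al. (2022) fit, not values fitted in this article, so its predictions will differ from the table below; in practice you would swap in your own fitted parameters.

```python
import numpy as np

class ChinchillaPredictor:
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta.

    Coefficients are the published Chinchilla (Hoffmann et al., 2022)
    fit; swap in your own fitted values in practice.
    """
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28

    def predict(self, N, D):
        return self.E + self.A / N**self.alpha + self.B / D**self.beta

    def predict_compute_optimal(self, C):
        # Sweep N in log space; D is then fixed by C = 6*N*D
        Ns = np.logspace(6, 13, 500)
        Ds = C / (6 * Ns)
        losses = self.predict(Ns, Ds)
        i = int(np.argmin(losses))
        return {"N": float(Ns[i]), "D": float(Ds[i]),
                "predicted_loss": float(losses[i])}

predictor = ChinchillaPredictor()
opt = predictor.predict_compute_optimal(1e23)
print(f"N ≈ {opt['N']:.2e}, D ≈ {opt['D']:.2e}, "
      f"loss ≈ {opt['predicted_loss']:.3f}")
```

An instance provides exactly the `predict` and `predict_compute_optimal` methods `TrainingDecisionFramework` calls, so it can be passed straight to the constructor. Note the optimum lands at roughly D ≈ 15-20·N, consistent with the Chinchilla rule of thumb.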

Predicted Loss vs Compute Budget (Fitted Scaling Law)

Compute (FLOPs)                        1e19   1e20   1e21   1e22   1e23   1e24   1e25
Compute-optimal loss                   3.20   2.95   2.72   2.52   2.35   2.20   2.08
Observed (GPT-3, Chinchilla, Llama)       -      -      -      -   2.75   2.54   2.38

Key Takeaways

Scaling laws are the most reliable empirical finding in deep learning. They enable predicting the outcome of multi-million dollar training runs from small-scale experiments.

The three laws, plus two practical corollaries:

  1. Kaplan (2020): Loss scales as a power law of N, D, and C. Established that scaling is predictable. However, its recommended allocation (scale N much faster than D) was wrong.

  2. Chinchilla (2022): Corrected Kaplan. Compute-optimal training scales N and D equally: D ≈ 20·N. Showed that GPT-3 was severely undertrained (300B tokens for 175B params vs the optimal ~3.5T tokens). Post-Chinchilla, the industry shifted to training smaller models on much more data (Llama 3.1 8B on 15T tokens).

  3. Inference-time scaling (2024): A fixed model improves predictably with more thinking tokens. The scaling exponent γ ≈ 0.1-0.3 (task-dependent), so error falls as T^(-γ): spending 10x more inference compute multiplies error by 10^(-γ), a 20-50% reduction on reasoning-heavy tasks.

  4. The multi-dimensional frontier: Quality depends on three compute dimensions: training parameters (N), training data (D), and inference tokens (T). The optimal allocation depends on query volume: low-volume deployments should invest in large models (training-dominant); high-volume deployments should use smaller models with inference-time scaling (inference-dominant).

  5. Practical prediction: Fit scaling law parameters from 5-10 small training runs (costing $1K-$10K total), then predict the loss of a $2M-$10M run with a 95% confidence interval of roughly 0.02-0.05 loss units. This accuracy is sufficient for go/no-go decisions on large training runs.
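
The inference-scaling exponent in point 3 translates directly into error numbers: if error scales as T^(-γ), a k-fold increase in thinking tokens multiplies error by k^(-γ). A quick check at the endpoints of the quoted γ range:

```python
def error_multiplier(k, gamma):
    # Error ∝ T^(-gamma): k-fold more thinking tokens -> error * k^(-gamma)
    return k ** -gamma

for gamma in (0.1, 0.3):
    m = error_multiplier(10, gamma)
    print(f"gamma={gamma}: 10x tokens -> error x{m:.2f} "
          f"({1 - m:.0%} reduction)")
```

So γ = 0.3 halves the error for each 10x in thinking tokens, while γ = 0.1 yields only about a 20% reduction, which is why the value of inference-time scaling is so task-dependent.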

The fundamental equation: total compute for a deployment serving Q queries is C_total = 6ND + 2NTQ. The first term is training (spent once); the second is inference (spent per query). As Q grows, inference dominates and the optimal strategy shifts toward smaller N with larger T. The crossover point, where training compute equals cumulative inference compute, is at Q* = 3D/T queries.
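
The crossover is easy to sanity-check numerically: setting 6ND = 2NTQ gives Q* = 3D/T, with N cancelling out. A sketch with illustrative numbers (an 8B model, 15T training tokens, ~2K thinking tokens per query; figures chosen for illustration only):

```python
def crossover_queries(D, T):
    # 6*N*D == 2*N*T*Q  =>  Q* = 3*D / T  (N cancels)
    return 3 * D / T

N, D, T = 8e9, 15e12, 2048
q_star = crossover_queries(D, T)
print(f"Q* ≈ {q_star:.2e} queries")
print(f"training FLOPs:        {6 * N * D:.2e}")
print(f"inference FLOPs at Q*: {2 * N * T * q_star:.2e}")
```

At these settings the crossover sits around 2e10 queries; beyond that, cumulative inference compute exceeds the one-time training cost and the inference-dominant regime takes over.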