For a decade, the LLM scaling playbook was simple: more training compute yields better models. Chinchilla (2022) formalized this into a law: $L(N, D) \propto N^{-\alpha} + D^{-\beta}$, where $N$ is parameters and $D$ is training tokens. Invest in training, reap quality at inference.

In 2024-2025, a second scaling axis emerged: inference-time compute. Models like o1, DeepSeek-R1, and QwQ show that generating more “thinking” tokens at inference improves answer quality — sometimes by 30%+ on reasoning benchmarks. This changes the economics of AI fundamentally: you can now trade inference cost for quality, not just training cost.

The Two-Axis Scaling Framework

Σ Theorem: Two-Axis Scaling

Quality scales with both training compute and inference compute:

$$Q(C_{\text{train}}, T) = Q_0 \cdot C_{\text{train}}^{\alpha} \cdot (1 + \gamma \ln T)$$

where $C_{\text{train}}$ is training FLOPs, $T$ is thinking tokens at inference, $\alpha \approx 0.05$, and $\gamma \approx 0.02$ to $0.04$ (task-dependent).

The critical insight: $\alpha$ and $\gamma$ are independent scaling exponents. You can improve quality by training more OR by thinking more. For a fixed total compute budget (training + inference amortized over queries), there is an optimal split.
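Under this model, a deficit on one axis can be traded against the other. A minimal numeric sketch, using illustrative constants drawn from the ranges above ($Q_0$, $\alpha$, $\gamma$ here are assumptions, not measurements):

```python
from math import log

def quality(c_train, think_tokens, q0=1.0, alpha=0.05, gamma=0.03):
    """Two-axis quality model: Q = Q0 * C_train^alpha * (1 + gamma * ln(T))."""
    boost = 1 + gamma * log(think_tokens) if think_tokens > 1 else 1.0
    return q0 * c_train ** alpha * boost

# 10x less training compute costs ~11% quality at direct generation...
base = quality(1e24, 1)
print(quality(1e23, 1) / base)      # ~0.89
# ...but a few thousand thinking tokens more than close the gap
# under these illustrative constants
print(quality(1e23, 4000) / base)
```

The point of the sketch is the trade itself: the second ratio exceeds the first because the inference axis compensates for the smaller training budget.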

Quality vs Thinking Tokens (AIME 2024 Math, DeepSeek-R1)

| Thinking Tokens | Accuracy |
|---|---|
| 0 (direct, no thinking) | 28% |
| 500 | 45% |
| 2,000 | 62% |
| 8,000 | 71% |
| 32,000 (diminishing returns) | 76% |

The curve is logarithmic: the first 500 thinking tokens provide the largest quality jump. Each subsequent doubling adds less. This shapes the compute-optimal allocation strategy.
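The logarithmic shape can be checked against the curve above with a least-squares fit of $Q = Q_0 + \gamma \ln T$ over the four nonzero-thinking points. A rough sketch (note this fitted $\gamma$ is in accuracy per natural-log token, not the theorem's multiplicative $\gamma$):

```python
from math import log

# (thinking tokens, accuracy) from the AIME 2024 curve, excluding T = 0
points = [(500, 0.45), (2000, 0.62), (8000, 0.71), (32000, 0.76)]

# Ordinary least squares for accuracy = q0 + gamma * ln(T)
xs = [log(t) for t, _ in points]
ys = [q for _, q in points]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
gamma = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
q0 = y_mean - gamma * x_mean

# ~0.074 accuracy per nat, i.e. roughly 5 points per doubling
print(f"gamma = {gamma:.3f}, q0 = {q0:.3f}")
```

The fit also exposes the tail behavior: the measured 8K-to-32K gain (+5 points) undershoots the fitted prediction, so returns diminish even faster than logarithmic at the extreme.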

How Reasoning Models Think

Standard LLMs generate answers directly: prompt in, answer out. Reasoning models insert an intermediate “thinking” phase:

Standard:  prompt -> [model] -> answer
Reasoning: prompt -> [model] -> thinking_tokens -> answer

The thinking tokens are a chain-of-thought (CoT) — the model’s internal working. For a math problem: “First, let me identify the variables. We have x = 3 and y = 5. The equation is x^2 + y^2 = z. So z = 9 + 25 = 34.”

Why does this help? Two complementary theories:

  1. Effective circuit depth: A transformer has fixed depth (e.g., 80 layers). Complex reasoning may require more sequential computation steps than 80 layers can provide. Thinking tokens allow the model to “unroll” its computation across multiple forward passes — each thinking token adds 80 more layers of processing.

  2. Working memory: The model’s hidden state is fixed-size ($d_{\text{model}}$ dimensions). Thinking tokens externalize intermediate results into the token sequence (effectively the KV cache), providing unbounded working memory.

ℹ️ Thinking Tokens Are Not Free

Each thinking token costs the same as an output token: one full model forward pass. A reasoning trace of 10,000 tokens costs 10,000 forward passes — the same as generating a 10,000-word essay. The cost multiplier is directly proportional to the number of thinking tokens.

DeepSeek-R1: Training a Reasoning Model

DeepSeek-R1 pioneered open-source reasoning models. The training recipe has three stages:

Stage 1: Supervised Fine-Tuning (SFT)

Start from a strong base model (DeepSeek-V3). Fine-tune on high-quality reasoning traces: (problem, chain-of-thought, answer) triples. The CoT traces are generated by the model itself or by stronger models.

Stage 2: Reinforcement Learning with GRPO

Group Relative Policy Optimization (GRPO) is DeepSeek’s alternative to PPO (Proximal Policy Optimization). The key difference: no critic network.

For each problem, generate $K$ candidate solutions. Compute a reward for each (correctness, format compliance). Then optimize the policy relative to the group:

```python
import torch

def grpo_loss(model, problems, K=8, clip_eps=0.2, kl_coeff=0.01):
    """Group Relative Policy Optimization loss."""
    total_loss = 0.0

    for problem in problems:
        # Generate K completions and score each one
        # (reward_fn checks correctness and format compliance)
        completions = [model.generate(problem) for _ in range(K)]
        rewards = [reward_fn(problem, c) for c in completions]

        # Group-relative advantages: normalize each reward against the group
        mean_reward = sum(rewards) / K
        std_reward = (sum((r - mean_reward) ** 2 for r in rewards) / K) ** 0.5
        advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]

        # Clipped policy-gradient update (PPO-style objective, no critic)
        for completion, advantage in zip(completions, advantages):
            log_prob = model.log_prob(completion)
            old_log_prob = log_prob.detach()  # no separate old-policy network needed
            ratio = (log_prob - old_log_prob).exp()
            clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
            pg_loss = -torch.min(ratio * advantage, clipped * advantage)
            kl_penalty = kl_coeff * (log_prob - old_log_prob)
            total_loss += pg_loss + kl_penalty

    return total_loss / len(problems)
```

Why GRPO Over PPO

PPO requires a separate critic (value function) network — typically the same size as the policy. For a 671B model, that doubles the memory requirement. GRPO eliminates the critic by using within-group comparisons as the baseline. This halves the memory cost and simplifies the training infrastructure.

Stage 3: Rejection Sampling + SFT

After RL, generate many solutions per problem. Keep only those that are both correct AND use clean reasoning (no gibberish steps, proper formatting). Fine-tune on this filtered set. This “distills” the RL policy into a cleaner SFT model.
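The filtering step can be sketched as a simple loop. Here `generate`, `is_correct`, and `is_clean` are caller-supplied stand-ins (hypothetical names) for the model's sampler, the answer checker, and the format/gibberish filter:

```python
def build_rejection_sft_set(generate, is_correct, is_clean, problems, n_samples=16):
    """Rejection sampling: sample many solutions per problem and keep only
    those that are both correct and cleanly formatted. The surviving
    (problem, solution) pairs become the final-stage SFT dataset."""
    dataset = []
    for problem in problems:
        for _ in range(n_samples):
            solution = generate(problem)
            if is_correct(problem, solution) and is_clean(solution):
                dataset.append((problem, solution))
    return dataset
```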

📊 DeepSeek-R1 Training Pipeline Results

| Stage | AIME 2024 Accuracy | Method |
|---|---|---|
| Base model (V3) | 28% | Direct generation |
| After SFT on CoT data | 52% | Supervised fine-tuning |
| After GRPO RL | 71% | Reinforcement learning |
| After rejection sampling + SFT | 76% | Distilled RL policy |

Compute-Optimal Inference Allocation

For a given query, how many thinking tokens are optimal?

Define: difficulty $d$ (estimated from the query), quality gained per thinking token $g(T, d)$, and cost per token $c$. The optimal thinking budget:

$$T^*(d) = \arg\max_T \left[ V \cdot Q(T, d) - T \cdot c \right]$$

where $V$ is the value of a correct answer and $Q(T, d)$ is the probability of correctness at $T$ thinking tokens for difficulty $d$.

With the logarithmic quality model $Q(T, d) \approx Q_0(d) + \gamma(d) \ln T$, the first-order condition (marginal quality value $V \cdot \gamma(d)/T$ equals marginal cost $c$) gives:

$$T^*(d) = \frac{V \cdot \gamma(d)}{c}$$
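A sketch of the closed form with illustrative numbers (the dollar values are assumptions, not measurements):

```python
def optimal_thinking_budget(value, gamma, cost_per_token):
    """T* = V * gamma / c: think until the marginal quality gain
    V * gamma / T drops to the marginal token cost c."""
    return value * gamma / cost_per_token

# A correct answer worth $1, gamma = 0.03, ~$6 per 1M generated tokens
print(optimal_thinking_budget(1.0, 0.03, 6e-6))  # ~5000 thinking tokens
```

The linearity in $V$ is the practical takeaway: a 10x more valuable answer justifies a 10x larger thinking budget at the same token price.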

📊 Optimal Thinking Tokens by Query Difficulty

| Difficulty | Example | Optimal Tokens | Quality Gain | Cost Multiplier |
|---|---|---|---|---|
| Trivial | What is 2+2? | 0 (direct answer) | 0% | 1x |
| Easy | Summarize this paragraph | 100-500 | +5-10% | 2-3x |
| Medium | Write a SQL query for X | 500-2000 | +15-25% | 5-10x |
| Hard | AIME competition math | 2000-10000 | +30-50% | 20-50x |
| Very hard | Research-level proof | 10000-50000 | +40-60% | 50-200x |

Adaptive Thinking Budgets

Production systems cannot afford to let every query think for 50,000 tokens. The model must decide how much to think:

```python
def adaptive_generate(model, prompt, max_think_tokens=10000, entropy_threshold=0.95):
    """Generate with an adaptive thinking budget.

    Thinking stops when the model emits ANSWER_TOKEN or when next-token
    entropy drops below entropy_threshold (low entropy = high confidence).
    """
    tokens = tokenize(prompt)
    thinking_tokens = []

    for step in range(max_think_tokens):
        next_token = model.generate_one(tokens + thinking_tokens)

        # The model signals it is done thinking
        if next_token == ANSWER_TOKEN:
            break

        # Early exit once confident, after a minimum of 100 thinking steps
        logit_entropy = compute_entropy(model.last_logits)
        if logit_entropy < entropy_threshold and step > 100:
            break

        thinking_tokens.append(next_token)

    # Generate the final answer conditioned on the thinking trace
    answer = model.generate(tokens + thinking_tokens + [ANSWER_TOKEN])
    return answer, len(thinking_tokens)
```

Systems Implications

Reasoning models change every aspect of serving infrastructure.

KV Cache Explosion

A standard query generates 100-500 output tokens. A reasoning query generates 1,000-50,000 thinking tokens PLUS the answer. The KV cache grows proportionally:

📊 KV Cache Impact of Reasoning (Llama 70B)

| Query Type | Total Tokens | KV Cache | Concurrent Queries (80 GB) |
|---|---|---|---|
| Standard (500 output) | 1,500 | 490 MB | ~130 |
| Light reasoning (2K think) | 3,500 | 1.14 GB | ~56 |
| Heavy reasoning (10K think) | 11,500 | 3.76 GB | ~17 |
| Research reasoning (50K think) | 51,500 | 16.8 GB | ~3 |

Note: KV cache = 2 x 80 layers x 8 KV heads x 128 head dim x total_tokens x 2 bytes (FP16). 80 GB GPU with ~65 GB usable for KV cache.

At 50K thinking tokens, a single query consumes 16.8 GB of KV cache — a single H100 can only serve 3 such queries concurrently. This is a fundamental shift from serving 130 standard queries.
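These sizes follow directly from the footnote's formula. A small calculator, assuming Llama-70B-like shapes (80 layers, 8 KV heads, head dim 128, FP16):

```python
def kv_cache_bytes(total_tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * total_tokens * dtype_bytes

USABLE = 65e9  # ~65 GB of an 80 GB GPU left for KV cache
for name, tokens in [("standard", 1500), ("light", 3500),
                     ("heavy", 11500), ("research", 51500)]:
    size = kv_cache_bytes(tokens)
    print(f"{name:9s} {size / 1e9:5.2f} GB  ~{int(USABLE // size)} concurrent")
```

At these shapes each token costs about 0.33 MB of cache, so concurrency falls roughly linearly as thinking length grows.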

Attention Cost

Attention is $O(n^2)$. At 50K tokens of total context: $50{,}000^2 = 2.5$ billion attention scores per layer. Even with FlashAttention, this is 50x more compute than a standard query. Reasoning queries are both memory-intensive (KV cache) and compute-intensive (quadratic attention).

Scheduling Challenges

Standard queries have predictable decode length (100-500 tokens). Reasoning queries vary from 200 to 50,000 tokens — a 250x range. This makes batch scheduling much harder:

  • Resource prediction: The scheduler cannot predict how much KV cache a reasoning query will need
  • Preemption risk: A reasoning query that unexpectedly generates 50K tokens may force preemption of other queries
  • Batch heterogeneity: Mixing standard and reasoning queries in one batch leads to poor utilization (the short queries finish and their slots sit empty while the long query continues)

Cost Impact

Cost per Query: Standard vs Reasoning (H100, Llama 70B)

| Query Type | Cost per Query |
|---|---|
| Standard (500 tok out) | $0.003 (0.3 cents) |
| Light reasoning (2K) | $0.012 (1.2 cents) |
| Heavy reasoning (10K) | $0.06 (6 cents) |
| Research (50K) | $0.30 (30 cents) |

A 50K-token reasoning query costs 100x more than a standard query. This is why reasoning models are typically reserved for high-value tasks (code generation, math, complex analysis) where the quality improvement justifies the cost.
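The per-query figures above are consistent with a flat price per generated token of about $6 per million tokens (an illustrative rate inferred from those figures, not a quoted price):

```python
def query_cost(generated_tokens, price_per_token=6e-6):
    """Per-query cost at a flat price per generated token (assumed ~$6/1M)."""
    return generated_tokens * price_per_token

print(query_cost(500))    # standard query
print(query_cost(50000))  # 100x the standard query
```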

Verification: Checking Reasoning Quality

How do you know if the model’s thinking led to a correct answer?

Majority Voting

Generate $N$ independent reasoning traces. Take the most common answer. If 6 of 8 traces produce “34”, that’s the answer.

Success probability with $N$ samples and per-sample accuracy $p$:

$$P_{\text{majority}}(N, p) = \sum_{k=\lceil N/2 \rceil}^{N} \binom{N}{k} p^k (1-p)^{N-k}$$

For $p = 0.7$ and $N = 8$: $P_{\text{majority}} \approx 0.94$. Much better than $p = 0.7$ alone.
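The binomial sum is easy to verify directly (note the $\lceil N/2 \rceil$ lower limit counts an exact tie as a win when $N$ is even, matching the formula as written):

```python
from math import comb, ceil

def p_majority(n, p):
    """Probability that at least ceil(n/2) of n independent samples are correct."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(ceil(n / 2), n + 1))

print(round(p_majority(8, 0.7), 3))  # 0.942
```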

Process Reward Models (PRMs)

Score each step in the reasoning chain, not just the final answer. A PRM assigns a score to each thinking step. The product of step scores gives the overall chain quality:

$$\text{PRM\_score}(\text{chain}) = \prod_{i=1}^{S} \text{step\_score}(s_i)$$

Select the chain with the highest PRM score. This is more effective than majority voting because it evaluates reasoning quality, not just answer consensus.
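Best-of-$N$ selection with a PRM is then a max over chains. A sketch, where `step_scorer` is a hypothetical callable mapping a reasoning step to a score in (0, 1], and log-scores are summed to avoid underflow on long chains:

```python
from math import log

def select_best_chain(chains, step_scorer):
    """Return the chain with the highest product of per-step PRM scores.
    Comparing summed logs is equivalent to comparing the products."""
    return max(chains, key=lambda chain: sum(log(step_scorer(s)) for s in chain))
```

A single weak step (score near zero) tanks the whole chain, which is the intended behavior: one bad reasoning step invalidates the conclusion.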

📊 Verification Methods Compared (AIME 2024)

| Method | N Samples | Accuracy | Cost Multiplier |
|---|---|---|---|
| Single sample | 1 | 71% | 1x |
| Majority voting (N=4) | 4 | 82% | 4x |
| Majority voting (N=8) | 8 | 87% | 8x |
| PRM best-of-4 | 4 | 85% | 4.2x |
| PRM best-of-8 | 8 | 91% | 8.5x |

Note: PRM adds ~5% overhead for scoring. At equal sample count, PRM outperforms majority voting by 3-4 percentage points.

💡 When Verification Is Worth It

Verification multiplies inference cost by $N$. It is worth it when: (a) the value of a correct answer far exceeds the cost (medical diagnosis, legal analysis, competition math), and (b) the per-sample accuracy is in the 50-85% range (below 50%, even majority voting fails; above 85%, a single sample is usually sufficient).

What This Means for the Field

The reasoning scaling law introduces a new degree of freedom in AI system design:

  1. Smaller models + more thinking can match larger models on reasoning tasks. A 7B model with 10K thinking tokens can outperform a 70B model with direct generation on some math/code tasks.

  2. Inference cost becomes the dominant expense for reasoning-heavy workloads. Training cost is amortized over billions of queries, but inference cost is paid per query.

  3. Serving infrastructure must adapt: variable-length generation, KV cache for 50K+ token sequences, scheduling that handles 250x variance in query cost.

  4. New optimization targets: instead of minimizing per-token latency, minimize cost-per-correct-answer. Sometimes generating 8 cheap samples and voting is cheaper than generating 1 expensive long chain.

The inference-time scaling law is not a replacement for training-time scaling — it is a complement. The future of AI scaling is two-dimensional: train strong base models AND think harder at inference. The optimal point on this frontier depends on the specific task, cost constraints, and latency requirements.