For a decade, the LLM scaling playbook was simple: more training compute yields better models. Chinchilla (2022) formalized this into a law: $L(N, D) \propto N^{-\alpha} + D^{-\beta}$, where $N$ is parameters and $D$ is training tokens. Invest in training, reap quality at inference.

In 2024-2025, a second scaling axis emerged: inference-time compute. Models like o1, DeepSeek-R1, and QwQ show that generating more “thinking” tokens at inference improves answer quality — sometimes by 30%+ on reasoning benchmarks. This changes the economics of AI fundamentally: you can now trade inference cost for quality, not just training cost.

The Two-Axis Scaling Framework

Σ Theorem: Two-Axis Scaling

Quality scales with both training compute and inference compute:

$$Q(C_{\text{train}}, T) = Q_0 \cdot C_{\text{train}}^{\alpha} \cdot (1 + \gamma \ln T)$$

where $C_{\text{train}}$ is training FLOPs, $T$ is thinking tokens at inference, $\alpha \approx 0.05$, and $\gamma \approx 0.02$ to $0.04$ (task-dependent).

The critical insight: $\alpha$ and $\gamma$ are independent scaling exponents. You can improve quality by training more OR by thinking more. For a fixed total compute budget (training + inference amortized over queries), there is an optimal split.
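Under this model, a deficit on one axis can be traded against the other. A minimal numeric sketch, using illustrative constants drawn from the ranges above ($Q_0$, $\alpha$, $\gamma$ here are assumptions, not measurements):

```python
from math import log

def quality(c_train, think_tokens, q0=1.0, alpha=0.05, gamma=0.03):
    """Two-axis quality model: Q = Q0 * C_train^alpha * (1 + gamma * ln(T))."""
    boost = 1 + gamma * log(think_tokens) if think_tokens > 1 else 1.0
    return q0 * c_train ** alpha * boost

# 10x less training compute costs ~11% quality at direct generation...
base = quality(1e24, 1)
print(quality(1e23, 1) / base)      # ~0.89
# ...but a few thousand thinking tokens more than close the gap
# under these illustrative constants
print(quality(1e23, 4000) / base)
```

The point of the sketch is the trade itself: the second ratio exceeds the first because the inference axis compensates for the smaller training budget.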

Quality vs Thinking Tokens (AIME 2024 Math, DeepSeek-R1)

| Thinking Tokens | Accuracy |
|---|---|
| 0 (direct, no thinking) | 28% |
| 500 | 45% |
| 2,000 | 62% |
| 8,000 | 71% |
| 32,000 (diminishing returns) | 76% |

The curve is logarithmic: the first 500 thinking tokens provide the largest quality jump. Each subsequent doubling adds less. This shapes the compute-optimal allocation strategy.
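The logarithmic shape can be checked against the curve above with a least-squares fit of $Q = Q_0 + \gamma \ln T$ over the four nonzero-thinking points. A rough sketch (note this fitted $\gamma$ is in accuracy per natural-log token, not the theorem's multiplicative $\gamma$):

```python
from math import log

# (thinking tokens, accuracy) from the AIME 2024 curve, excluding T = 0
points = [(500, 0.45), (2000, 0.62), (8000, 0.71), (32000, 0.76)]

# Ordinary least squares for accuracy = q0 + gamma * ln(T)
xs = [log(t) for t, _ in points]
ys = [q for _, q in points]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
gamma = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
q0 = y_mean - gamma * x_mean

# ~0.074 accuracy per nat, i.e. roughly 5 points per doubling
print(f"gamma = {gamma:.3f}, q0 = {q0:.3f}")
```

The fit also exposes the tail behavior: the measured 8K-to-32K gain (+5 points) undershoots the fitted prediction, so returns diminish even faster than logarithmic at the extreme.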

How Reasoning Models Think

Standard LLMs generate answers directly: prompt in, answer out. Reasoning models insert an intermediate “thinking” phase:

Standard:  prompt -> [model] -> answer
Reasoning: prompt -> [model] -> thinking_tokens -> answer

The thinking tokens are a chain-of-thought (CoT) — the model’s internal working. For a math problem: “First, let me identify the variables. We have x = 3 and y = 5. The equation is x^2 + y^2 = z. So z = 9 + 25 = 34.”

Why does this help? Two complementary theories:

  1. Effective circuit depth: A transformer has fixed depth (e.g., 80 layers). Complex reasoning may require more sequential computation steps than 80 layers can provide. Thinking tokens allow the model to “unroll” its computation across multiple forward passes — each thinking token adds 80 more layers of processing.

  2. Working memory: The model’s hidden state is fixed-size ($d_{\text{model}}$ dimensions). Thinking tokens externalize intermediate results into the token sequence (effectively the KV cache), providing unbounded working memory.

ℹ️ Thinking Tokens Are Not Free

Each thinking token costs the same as an output token: one full model forward pass. A reasoning trace of 10,000 tokens costs 10,000 forward passes — the same as generating a 10,000-word essay. The cost multiplier is directly proportional to the number of thinking tokens.

DeepSeek-R1: Training a Reasoning Model

DeepSeek-R1 pioneered open-source reasoning models. The training recipe has three stages:

Stage 1: Supervised Fine-Tuning (SFT)

Start from a strong base model (DeepSeek-V3). Fine-tune on high-quality reasoning traces: (problem, chain-of-thought, answer) triples. The CoT traces are generated by the model itself or by stronger models.

Stage 2: Reinforcement Learning with GRPO

Group Relative Policy Optimization (GRPO) is DeepSeek’s alternative to PPO (Proximal Policy Optimization). The key difference: no critic network.

For each problem, generate $K$ candidate solutions. Compute a reward for each (correctness, format compliance). Then optimize the policy relative to the group:

```python
import torch

def grpo_loss(model, problems, K=8, clip_eps=0.2, kl_coeff=0.01):
    """Group Relative Policy Optimization loss."""
    total_loss = 0.0

    for problem in problems:
        # Generate K completions and score each one
        # (reward_fn checks correctness and format compliance)
        completions = [model.generate(problem) for _ in range(K)]
        rewards = [reward_fn(problem, c) for c in completions]

        # Group-relative advantages: normalize each reward against the group
        mean_reward = sum(rewards) / K
        std_reward = (sum((r - mean_reward) ** 2 for r in rewards) / K) ** 0.5
        advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]

        # Clipped policy-gradient update (PPO-style objective, no critic)
        for completion, advantage in zip(completions, advantages):
            log_prob = model.log_prob(completion)
            old_log_prob = log_prob.detach()  # no separate old-policy network needed
            ratio = (log_prob - old_log_prob).exp()
            clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
            pg_loss = -torch.min(ratio * advantage, clipped * advantage)
            kl_penalty = kl_coeff * (log_prob - old_log_prob)
            total_loss += pg_loss + kl_penalty

    return total_loss / len(problems)
```

Why GRPO Over PPO

PPO requires a separate critic (value function) network — typically the same size as the policy. For a 671B model, that doubles the memory requirement. GRPO eliminates the critic by using within-group comparisons as the baseline. This halves the memory cost and simplifies the training infrastructure.

Stage 3: Rejection Sampling + SFT

After RL, generate many solutions per problem. Keep only those that are both correct AND use clean reasoning (no gibberish steps, proper formatting). Fine-tune on this filtered set. This “distills” the RL policy into a cleaner SFT model.
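The filtering step can be sketched as a simple loop. Here `generate`, `is_correct`, and `is_clean` are caller-supplied stand-ins (hypothetical names) for the model's sampler, the answer checker, and the format/gibberish filter:

```python
def build_rejection_sft_set(generate, is_correct, is_clean, problems, n_samples=16):
    """Rejection sampling: sample many solutions per problem and keep only
    those that are both correct and cleanly formatted. The surviving
    (problem, solution) pairs become the final-stage SFT dataset."""
    dataset = []
    for problem in problems:
        for _ in range(n_samples):
            solution = generate(problem)
            if is_correct(problem, solution) and is_clean(solution):
                dataset.append((problem, solution))
    return dataset
```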

📊 DeepSeek-R1 Training Pipeline Results

| Stage | AIME 2024 Accuracy | Method |
|---|---|---|
| Base model (V3) | 28% | Direct generation |
| After SFT on CoT data | 52% | Supervised fine-tuning |
| After GRPO RL | 71% | Reinforcement learning |
| After rejection sampling + SFT | 76% | Distilled RL policy |

Compute-Optimal Inference Allocation

For a given query, how many thinking tokens are optimal?

Define: difficulty $d$ (estimated from the query), quality gained per thinking token $g(T, d)$, and cost per token $c$. The optimal thinking budget:

$$T^*(d) = \arg\max_T \left[ V \cdot Q(T, d) - T \cdot c \right]$$

where $V$ is the value of a correct answer and $Q(T, d)$ is the probability of correctness at $T$ thinking tokens for difficulty $d$.

With the logarithmic quality model $Q(T, d) \approx Q_0(d) + \gamma(d) \ln T$, the first-order condition (marginal quality value $V \cdot \gamma(d)/T$ equals marginal cost $c$) gives:

$$T^*(d) = \frac{V \cdot \gamma(d)}{c}$$
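A sketch of the closed form with illustrative numbers (the dollar values are assumptions, not measurements):

```python
def optimal_thinking_budget(value, gamma, cost_per_token):
    """T* = V * gamma / c: think until the marginal quality gain
    V * gamma / T drops to the marginal token cost c."""
    return value * gamma / cost_per_token

# A correct answer worth $1, gamma = 0.03, ~$6 per 1M generated tokens
print(optimal_thinking_budget(1.0, 0.03, 6e-6))  # ~5000 thinking tokens
```

The linearity in $V$ is the practical takeaway: a 10x more valuable answer justifies a 10x larger thinking budget at the same token price.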

📊 Optimal Thinking Tokens by Query Difficulty

| Difficulty | Example | Optimal Tokens | Quality Gain | Cost Multiplier |
|---|---|---|---|---|
| Trivial | What is 2+2? | 0 (direct answer) | 0% | 1x |
| Easy | Summarize this paragraph | 100-500 | +5-10% | 2-3x |
| Medium | Write a SQL query for X | 500-2000 | +15-25% | 5-10x |
| Hard | AIME competition math | 2000-10000 | +30-50% | 20-50x |
| Very hard | Research-level proof | 10000-50000 | +40-60% | 50-200x |

Adaptive Thinking Budgets

Production systems cannot afford to let every query think for 50,000 tokens. The model must decide how much to think:

```python
def adaptive_generate(model, prompt, max_think_tokens=10000, entropy_threshold=0.95):
    """Generate with an adaptive thinking budget.

    Thinking stops when the model emits ANSWER_TOKEN or when next-token
    entropy drops below entropy_threshold (low entropy = high confidence).
    """
    tokens = tokenize(prompt)
    thinking_tokens = []

    for step in range(max_think_tokens):
        next_token = model.generate_one(tokens + thinking_tokens)

        # The model signals it is done thinking
        if next_token == ANSWER_TOKEN:
            break

        # Early exit once confident, after a minimum of 100 thinking steps
        logit_entropy = compute_entropy(model.last_logits)
        if logit_entropy < entropy_threshold and step > 100:
            break

        thinking_tokens.append(next_token)

    # Generate the final answer conditioned on the thinking trace
    answer = model.generate(tokens + thinking_tokens + [ANSWER_TOKEN])
    return answer, len(thinking_tokens)
```

Systems Implications

Reasoning models change every aspect of serving infrastructure.

KV Cache Explosion

A standard query generates 100-500 output tokens. A reasoning query generates 1,000-50,000 thinking tokens PLUS the answer. The KV cache grows proportionally:

📊 KV Cache Impact of Reasoning (Llama 70B)

| Query Type | Total Tokens | KV Cache | Concurrent Queries (80 GB) |
|---|---|---|---|
| Standard (500 output) | 1,500 | 490 MB | ~130 |
| Light reasoning (2K think) | 3,500 | 1.14 GB | ~56 |
| Heavy reasoning (10K think) | 11,500 | 3.76 GB | ~17 |
| Research reasoning (50K think) | 51,500 | 16.8 GB | ~3 |

Note: KV cache = 2 x 80 layers x 8 KV heads x 128 head dim x total_tokens x 2 bytes (FP16). 80 GB GPU with ~65 GB usable for KV cache.

At 50K thinking tokens, a single query consumes 16.8 GB of KV cache — a single H100 can only serve 3 such queries concurrently. This is a fundamental shift from serving 130 standard queries.
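These sizes follow directly from the footnote's formula. A small calculator, assuming Llama-70B-like shapes (80 layers, 8 KV heads, head dim 128, FP16):

```python
def kv_cache_bytes(total_tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * layers * kv_heads * head_dim * total_tokens * dtype_bytes

USABLE = 65e9  # ~65 GB of an 80 GB GPU left for KV cache
for name, tokens in [("standard", 1500), ("light", 3500),
                     ("heavy", 11500), ("research", 51500)]:
    size = kv_cache_bytes(tokens)
    print(f"{name:9s} {size / 1e9:5.2f} GB  ~{int(USABLE // size)} concurrent")
```

At these shapes each token costs about 0.33 MB of cache, so concurrency falls roughly linearly as thinking length grows.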

Attention Cost

Attention is $O(n^2)$. At 50K tokens of total context: $50{,}000^2 = 2.5$ billion attention scores per layer. Even with FlashAttention, this is 50x more compute than a standard query. Reasoning queries are both memory-intensive (KV cache) and compute-intensive (quadratic attention).

Scheduling Challenges

Standard queries have predictable decode length (100-500 tokens). Reasoning queries vary from 200 to 50,000 tokens — a 250x range. This makes batch scheduling much harder:

  • Resource prediction: The scheduler cannot predict how much KV cache a reasoning query will need
  • Preemption risk: A reasoning query that unexpectedly generates 50K tokens may force preemption of other queries
  • Batch heterogeneity: Mixing standard and reasoning queries in one batch leads to poor utilization (the short queries finish and their slots sit empty while the long query continues)

Cost Impact

Cost per Query: Standard vs Reasoning (H100, Llama 70B)

| Query Type | Cost per Query |
|---|---|
| Standard (500 tok out) | $0.003 (0.3 cents) |
| Light reasoning (2K) | $0.012 (1.2 cents) |
| Heavy reasoning (10K) | $0.06 (6 cents) |
| Research (50K) | $0.30 (30 cents) |

A 50K-token reasoning query costs 100x more than a standard query. This is why reasoning models are typically reserved for high-value tasks (code generation, math, complex analysis) where the quality improvement justifies the cost.
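The per-query figures above are consistent with a flat price per generated token of about $6 per million tokens (an illustrative rate inferred from those figures, not a quoted price):

```python
def query_cost(generated_tokens, price_per_token=6e-6):
    """Per-query cost at a flat price per generated token (assumed ~$6/1M)."""
    return generated_tokens * price_per_token

print(query_cost(500))    # standard query
print(query_cost(50000))  # 100x the standard query
```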

Verification: Checking Reasoning Quality

How do you know if the model’s thinking led to a correct answer?

Majority Voting

Generate $N$ independent reasoning traces. Take the most common answer. If 6 of 8 traces produce “34”, that’s the answer.

Success probability with $N$ samples and per-sample accuracy $p$:

$$P_{\text{majority}}(N, p) = \sum_{k=\lceil N/2 \rceil}^{N} \binom{N}{k} p^k (1-p)^{N-k}$$

For $p = 0.7$ and $N = 8$: $P_{\text{majority}} \approx 0.94$. Much better than $p = 0.7$ alone.
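The binomial sum is easy to verify directly (note the $\lceil N/2 \rceil$ lower limit counts an exact tie as a win when $N$ is even, matching the formula as written):

```python
from math import comb, ceil

def p_majority(n, p):
    """Probability that at least ceil(n/2) of n independent samples are correct."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(ceil(n / 2), n + 1))

print(round(p_majority(8, 0.7), 3))  # 0.942
```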

Process Reward Models (PRMs)

Score each step in the reasoning chain, not just the final answer. A PRM assigns a score to each thinking step. The product of step scores gives the overall chain quality:

$$\text{PRM\_score}(\text{chain}) = \prod_{i=1}^{S} \text{step\_score}(s_i)$$

Select the chain with the highest PRM score. This is more effective than majority voting because it evaluates reasoning quality, not just answer consensus.
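Best-of-$N$ selection with a PRM is then a max over chains. A sketch, where `step_scorer` is a hypothetical callable mapping a reasoning step to a score in (0, 1], and log-scores are summed to avoid underflow on long chains:

```python
from math import log

def select_best_chain(chains, step_scorer):
    """Return the chain with the highest product of per-step PRM scores.
    Comparing summed logs is equivalent to comparing the products."""
    return max(chains, key=lambda chain: sum(log(step_scorer(s)) for s in chain))
```

A single weak step (score near zero) tanks the whole chain, which is the intended behavior: one bad reasoning step invalidates the conclusion.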

📊 Verification Methods Compared (AIME 2024)

| Method | N Samples | Accuracy | Cost Multiplier |
|---|---|---|---|
| Single sample | 1 | 71% | 1x |
| Majority voting (N=4) | 4 | 82% | 4x |
| Majority voting (N=8) | 8 | 87% | 8x |
| PRM best-of-4 | 4 | 85% | 4.2x |
| PRM best-of-8 | 8 | 91% | 8.5x |

Note: PRM adds ~5% overhead for scoring. At equal sample count, PRM outperforms majority voting by 3-4 percentage points.

💡 When Verification Is Worth It

Verification multiplies inference cost by $N$. It is worth it when: (a) the value of a correct answer far exceeds the cost (medical diagnosis, legal analysis, competition math), and (b) the per-sample accuracy is in the 50-85% range (below 50%, even majority voting fails; above 85%, a single sample is usually sufficient).

What This Means for the Field

The reasoning scaling law introduces a new degree of freedom in AI system design:

  1. Smaller models + more thinking can match larger models on reasoning tasks. A 7B model with 10K thinking tokens can outperform a 70B model with direct generation on some math/code tasks.

  2. Inference cost becomes the dominant expense for reasoning-heavy workloads. Training cost is amortized over billions of queries, but inference cost is paid per query.

  3. Serving infrastructure must adapt: variable-length generation, KV cache for 50K+ token sequences, scheduling that handles 250x variance in query cost.

  4. New optimization targets: instead of minimizing per-token latency, minimize cost-per-correct-answer. Sometimes generating 8 cheap samples and voting is cheaper than generating 1 expensive long chain.

The inference-time scaling law is not a replacement for training-time scaling — it is a complement. The future of AI scaling is two-dimensional: train strong base models AND think harder at inference. The optimal point on this frontier depends on the specific task, cost constraints, and latency requirements.