For a decade, the LLM scaling playbook was simple: more training compute yields better models. Chinchilla (2022) formalized this into a law: $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, where $N$ is parameters and $D$ is training tokens. Invest in training, reap quality at inference.
In 2024-2025, a second scaling axis emerged: inference-time compute. Models like o1, DeepSeek-R1, and QwQ show that generating more “thinking” tokens at inference improves answer quality — sometimes by 30%+ on reasoning benchmarks. This changes the economics of AI fundamentally: you can now trade inference cost for quality, not just training cost.
The Two-Axis Scaling Framework
Quality scales with both training compute and inference compute:
$$Q(C, T) = Q_0 + \alpha \log C + \beta \log T$$

where $C$ is training FLOPs, $T$ is thinking tokens at inference, and $\alpha$, $\beta$ are task-dependent scaling exponents.
The critical insight: $\alpha$ and $\beta$ are independent scaling exponents. You can improve quality by training more OR by thinking more. For a fixed total compute budget (training + inference amortized over queries), there is an optimal split.
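The trade between the two axes can be sketched numerically. This is a minimal illustration of the logarithmic quality model; the coefficients below are assumptions for demonstration, not fitted values:

```python
import math

def quality(train_flops, think_tokens, q0=20.0, alpha=2.0, beta=5.0):
    # Illustrative two-axis model: quality rises logarithmically with both
    # training compute and thinking tokens (coefficients are placeholders).
    return q0 + alpha * math.log10(train_flops) + beta * math.log10(1 + think_tokens)

base = quality(1e24, 1_000)
more_training = quality(1e25, 1_000)   # 10x training compute: +alpha per decade
more_thinking = quality(1e24, 10_000)  # 10x thinking tokens: +beta per decade
```

For a reasoning-heavy task (large `beta`), a decade of extra thinking buys more quality than a decade of extra training compute, which is the core of the two-axis argument.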
Figure: Quality vs thinking tokens on AIME 2024 math (DeepSeek-R1); y-axis in % accuracy.

The curve is logarithmic: the first 500 thinking tokens provide the largest quality jump. Each subsequent doubling adds less. This shapes the compute-optimal allocation strategy.
How Reasoning Models Think
Standard LLMs generate answers directly: prompt in, answer out. Reasoning models insert an intermediate “thinking” phase:
Standard: prompt -> [model] -> answer
Reasoning: prompt -> [model] -> thinking_tokens -> answer
The thinking tokens are a chain-of-thought (CoT) — the model’s internal working. For a math problem: “First, let me identify the variables. We have x = 3 and y = 5. The equation is x^2 + y^2 = z. So z = 9 + 25 = 34.”
Why does this help? Two complementary theories:
- Effective circuit depth: A transformer has fixed depth (e.g., 80 layers). Complex reasoning may require more sequential computation steps than 80 layers can provide. Thinking tokens allow the model to "unroll" its computation across multiple forward passes: each thinking token adds 80 more layers of processing.
- Working memory: The model's hidden state is fixed-size ($d_{\text{model}}$ dimensions). Thinking tokens externalize intermediate results into the token sequence (effectively the KV cache), providing unbounded working memory.
Each thinking token costs the same as an output token: one full model forward pass. A reasoning trace of 10,000 tokens costs 10,000 forward passes — the same as generating a 10,000-word essay. The cost multiplier is directly proportional to the number of thinking tokens.
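Since thinking tokens are priced like any other generated token, the cost multiplier is just a token ratio. A minimal sketch, with `usd_per_token` an illustrative placeholder rather than a real price:

```python
def generation_cost(think_tokens, answer_tokens, usd_per_token=1e-5):
    # Each thinking token is one full forward pass, same as an answer token,
    # so cost is simply proportional to total generated tokens.
    return (think_tokens + answer_tokens) * usd_per_token

standard = generation_cost(0, 500)
reasoning = generation_cost(10_000, 500)
print(reasoning / standard)  # 21x for this example
```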
DeepSeek-R1: Training a Reasoning Model
DeepSeek-R1 pioneered open-source reasoning models. The training recipe has three stages:
Stage 1: Supervised Fine-Tuning (SFT)
Start from a strong base model (DeepSeek-V3). Fine-tune on high-quality reasoning traces: (problem, chain-of-thought, answer) triples. The CoT traces are generated by the model itself or by stronger models.
Stage 2: Reinforcement Learning with GRPO
Group Relative Policy Optimization (GRPO) is DeepSeek’s alternative to PPO (Proximal Policy Optimization). The key difference: no critic network.
For each problem, generate $K$ candidate solutions. Compute a reward for each (correctness, format compliance). Then optimize the policy relative to the group:
```python
import torch

def grpo_loss(model, ref_model, problems, K=8, clip_eps=0.2, kl_coeff=0.01):
    """Group Relative Policy Optimization loss (sketch)."""
    total_loss = 0.0
    for problem in problems:
        # Generate K completions and score each one
        completions = [model.generate(problem) for _ in range(K)]
        rewards = [reward_fn(problem, c) for c in completions]
        # Group-relative advantages: normalize rewards within the group
        mean_reward = sum(rewards) / K
        std_reward = (sum((r - mean_reward) ** 2 for r in rewards) / K) ** 0.5
        advantages = [(r - mean_reward) / (std_reward + 1e-8) for r in rewards]
        # Clipped policy-gradient objective per completion
        for completion, advantage in zip(completions, advantages):
            log_prob = model.log_prob(completion)
            # One gradient step per batch: the sampling policy is the current
            # policy, so the "old" log-probs are just a detached (frozen) copy
            old_log_prob = log_prob.detach()
            ratio = (log_prob - old_log_prob).exp()
            clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
            pg_loss = -torch.min(ratio * advantage, clipped * advantage)
            # KL penalty keeps the policy close to a frozen reference model
            kl_penalty = kl_coeff * (log_prob - ref_model.log_prob(completion))
            total_loss = total_loss + pg_loss + kl_penalty
    return total_loss / len(problems)
```
PPO requires a separate critic (value function) network — typically the same size as the policy. For a 671B model, that doubles the memory requirement. GRPO eliminates the critic by using within-group comparisons as the baseline. This halves the memory cost and simplifies the training infrastructure.
Stage 3: Rejection Sampling + SFT
After RL, generate many solutions per problem. Keep only those that are both correct AND use clean reasoning (no gibberish steps, proper formatting). Fine-tune on this filtered set. This “distills” the RL policy into a cleaner SFT model.
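Stage 3 can be sketched as a filter over sampled completions. Here `generate`, `is_correct`, and `is_clean` are hypothetical stand-ins for the real sampler, answer checker, and format checker:

```python
def rejection_sample(problems, generate, is_correct, is_clean, n_samples=16):
    # Keep only completions that are both correct and cleanly formatted,
    # then reuse the survivors as (problem, completion) SFT pairs.
    sft_data = []
    for problem in problems:
        for _ in range(n_samples):
            completion = generate(problem)
            if is_correct(problem, completion) and is_clean(completion):
                sft_data.append((problem, completion))
    return sft_data
```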
DeepSeek-R1 Training Pipeline Results
| Stage | AIME 2024 Accuracy | Method |
|---|---|---|
| Base model (V3) | 28% | Direct generation |
| After SFT on CoT data | 52% | Supervised fine-tuning |
| After GRPO RL | 71% | Reinforcement learning |
| After rejection sampling + SFT | 76% | Distilled RL policy |
Compute-Optimal Inference Allocation
For a given query, how many thinking tokens are optimal?
Define: difficulty $d$ (estimated from the query), quality gained per thinking token, and cost per token $c$. The optimal thinking budget:

$$T^{*}(d) = \arg\max_{T} \left[ V \cdot P(\text{correct} \mid T, d) - c \, T \right]$$

where $V$ is the value of a correct answer and $P(\text{correct} \mid T, d)$ is the probability of correctness at $T$ thinking tokens for difficulty $d$.

With the logarithmic quality model $P(\text{correct} \mid T, d) = P_0(d) + k(d) \log(1 + T)$:
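A minimal sketch of this trade-off. Under the logarithmic model, setting the derivative of $V \cdot P(T) - cT$ to zero gives a closed form, $T^* = V k(d)/c - 1$; all numeric values below are illustrative:

```python
def optimal_thinking_tokens(value, k, cost_per_token):
    # Maximize value * (P0 + k*log(1+T)) - cost*T over T:
    # d/dT = value*k/(1+T) - cost = 0  =>  T* = value*k/cost - 1
    return max(0.0, value * k / cost_per_token - 1)

easy = optimal_thinking_tokens(value=0.01, k=0.02, cost_per_token=1e-5)  # 19 tokens
hard = optimal_thinking_tokens(value=1.0, k=0.05, cost_per_token=1e-5)   # 4999 tokens
```

Higher-value answers and steeper difficulty slopes $k(d)$ both push the optimum toward more thinking, matching the table that follows.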
Optimal Thinking Tokens by Query Difficulty
| Difficulty | Example | Optimal Tokens | Quality Gain | Cost Multiplier |
|---|---|---|---|---|
| Trivial | What is 2+2? | 0 (direct answer) | 0% | 1x |
| Easy | Summarize this paragraph | 100-500 | +5-10% | 2-3x |
| Medium | Write a SQL query for X | 500-2000 | +15-25% | 5-10x |
| Hard | AIME competition math | 2000-10000 | +30-50% | 20-50x |
| Very hard | Research-level proof | 10000-50000 | +40-60% | 50-200x |
Adaptive Thinking Budgets
Production systems cannot afford to let every query think for 50,000 tokens. The model must decide how much to think:
```python
def adaptive_generate(model, prompt, max_think_tokens=10_000,
                      entropy_threshold=0.95, min_think_tokens=100):
    """Generate with an adaptive thinking budget."""
    tokens = tokenize(prompt)
    thinking_tokens = []
    for step in range(max_think_tokens):
        next_token = model.generate_one(tokens + thinking_tokens)
        # The model signals the end of its thinking phase explicitly
        if next_token == ANSWER_TOKEN:
            break
        # Early exit: low entropy over the next-token logits means the model
        # is already confident, so further thinking is unlikely to help
        logit_entropy = compute_entropy(model.last_logits)
        if logit_entropy < entropy_threshold and step > min_think_tokens:
            break
        thinking_tokens.append(next_token)
    # Generate the final answer conditioned on the thinking trace
    answer = model.generate(tokens + thinking_tokens + [ANSWER_TOKEN])
    return answer, len(thinking_tokens)
```
Systems Implications
Reasoning models change every aspect of serving infrastructure.
KV Cache Explosion
A standard query generates 100-500 output tokens. A reasoning query generates 1,000-50,000 thinking tokens PLUS the answer. The KV cache grows proportionally:
KV Cache Impact of Reasoning (Llama 70B)
| Query Type | Total Tokens | KV Cache | Concurrent Queries (80GB) |
|---|---|---|---|
| Standard (500 output) | 1500 total | 490 MB | ~130 |
| Light reasoning (2K think) | 3500 total | 1.14 GB | ~56 |
| Heavy reasoning (10K think) | 11500 total | 3.76 GB | ~17 |
| Research reasoning (50K think) | 51500 total | 16.8 GB | ~3 |
At 50K thinking tokens, a single query consumes 16.8 GB of KV cache — a single H100 can only serve 3 such queries concurrently. This is a fundamental shift from serving 130 standard queries.
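The table's figures can be reproduced from the model shapes. A sketch assuming Llama-70B-style grouped-query attention (80 layers, 8 KV heads, head dimension 128, fp16):

```python
def kv_cache_bytes(tokens, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Per token, each layer stores one K and one V vector per KV head:
    # 2 * 80 * 8 * 128 * 2 bytes = 320 KiB per token for these shapes.
    return tokens * 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

print(kv_cache_bytes(1_500) / 1e9)   # ≈ 0.49 GB (standard query)
print(kv_cache_bytes(51_500) / 1e9)  # ≈ 16.9 GB (research reasoning)
```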
Attention Cost
Attention is $O(n^2)$ in context length. At $n = 50{,}000$ tokens of total context: $n^2 = 2.5$ billion attention-score operations per layer. Even with FlashAttention, this is 50x more compute than a standard query. Reasoning queries are both memory-intensive (KV cache) and compute-intensive (quadratic attention).
Scheduling Challenges
Standard queries have predictable decode length (100-500 tokens). Reasoning queries vary from 200 to 50,000 tokens — a 250x range. This makes batch scheduling much harder:
- Resource prediction: The scheduler cannot predict how much KV cache a reasoning query will need
- Preemption risk: A reasoning query that unexpectedly generates 50K tokens may force preemption of other queries
- Batch heterogeneity: Mixing standard and reasoning queries in one batch leads to poor utilization (the short queries finish and their slots sit empty while the long query continues)
Cost Impact
Figure: Cost per query, standard vs reasoning (H100, Llama 70B); y-axis in USD per query.

A 50K-token reasoning query costs 100x more than a standard query. This is why reasoning models are typically reserved for high-value tasks (code generation, math, complex analysis) where the quality improvement justifies the cost.
Verification: Checking Reasoning Quality
How do you know if the model’s thinking led to a correct answer?
Majority Voting
Generate $N$ independent reasoning traces. Take the most common answer. If 6 of 8 traces produce "34", that's the answer.

Success probability with $N$ samples and per-sample accuracy $p$, requiring a strict majority of correct samples:

$$P(\text{majority correct}) = \sum_{k=\lfloor N/2 \rfloor + 1}^{N} \binom{N}{k} p^k (1-p)^{N-k}$$

For $N = 8$ and $p = 0.7$: $P \approx 0.81$. Much better than $p = 0.7$ alone.
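The binomial formula can be checked directly. This sketch treats samples as independent and requires a strict majority; in practice wrong answers rarely agree with each other, so measured voting accuracy is often higher than this bound:

```python
from math import comb

def majority_vote_accuracy(n, p):
    # P(strict majority of n independent samples is correct),
    # each sample correct with probability p
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

majority_vote_accuracy(8, 0.7)  # ≈ 0.81
```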
Process Reward Models (PRMs)
Score each step in the reasoning chain, not just the final answer. A PRM assigns a score $s_i \in [0, 1]$ to each thinking step $i$. The product of step scores gives the overall chain quality:

$$S_{\text{chain}} = \prod_{i} s_i$$

Select the chain with the highest PRM score. This is more effective than majority voting because it evaluates reasoning quality, not just answer consensus.
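A sketch of best-of-$N$ selection with a PRM, where `prm_score` is a stand-in for a learned step scorer. Computing the product in log space avoids underflow on long chains:

```python
import math

def select_best_chain(chains, prm_score):
    """Pick the chain whose product of per-step PRM scores is highest."""
    def chain_log_score(chain):
        # log of the product of step scores = sum of log step scores;
        # the floor at 1e-12 guards against log(0)
        return sum(math.log(max(prm_score(step), 1e-12)) for step in chain)
    return max(chains, key=chain_log_score)
```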
Verification Methods Compared (AIME 2024)
| Method | N samples | Accuracy | Cost Multiplier |
|---|---|---|---|
| Single sample | 1 | 71% | 1x |
| Majority voting N=4 | 4 | 82% | 4x |
| Majority voting N=8 | 8 | 87% | 8x |
| PRM best-of-4 | 4 | 85% | 4.2x |
| PRM best-of-8 | 8 | 91% | 8.5x |
Verification multiplies inference cost by $N$ (slightly more for PRM scoring, which adds the reward-model passes). It is worth it when: (a) the value of a correct answer far exceeds the cost (medical diagnosis, legal analysis, competition math), and (b) the per-sample accuracy is in the 50-85% range (below 50%, even majority voting fails; above 85%, a single sample is usually sufficient).
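Condition (a) can be made concrete with a back-of-envelope expected-value comparison, using the accuracy numbers from the table above and an arbitrary unit sample cost (the answer values are hypothetical):

```python
def expected_net_value(answer_value, accuracy, total_cost):
    # Verification pays off when the accuracy gain times the answer's value
    # exceeds the extra sampling cost.
    return answer_value * accuracy - total_cost

# High-value task (answer worth 100 cost units): 8-way voting wins
single = expected_net_value(100, 0.71, 1)  # ≈ 70
vote8 = expected_net_value(100, 0.87, 8)   # ≈ 79
# Low-value task (answer worth 5 units): the single sample wins instead
```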
What This Means for the Field
The reasoning scaling law introduces a new degree of freedom in AI system design:
- Smaller models + more thinking can match larger models on reasoning tasks. A 7B model with 10K thinking tokens can outperform a 70B model with direct generation on some math/code tasks.
- Inference cost becomes the dominant expense for reasoning-heavy workloads. Training cost is amortized over billions of queries, but inference cost is paid per query.
- Serving infrastructure must adapt: variable-length generation, KV cache for 50K+ token sequences, and scheduling that handles 250x variance in query cost.
- New optimization targets: instead of minimizing per-token latency, minimize cost-per-correct-answer. Sometimes generating 8 cheap samples and voting is cheaper than generating 1 expensive long chain.
The inference-time scaling law is not a replacement for training-time scaling — it is a complement. The future of AI scaling is two-dimensional: train strong base models AND think harder at inference. The optimal point on this frontier depends on the specific task, cost constraints, and latency requirements.