OpenAI o1 spends 30 seconds generating hidden reasoning tokens before answering a hard math problem. The result: 83.3% on AIME 2024, versus GPT-4o’s 13.4%. The cost: 10-100x more inference FLOPs per request. This is the test-time compute paradigm — instead of training a 10x larger model, you let the existing model think 10x longer. For specialized reasoning domains (competition math, formal proofs, complex planning), o1 proves that inference scaling beats parameter scaling. The serving implications are brutal: each o1 request consumes the compute budget of 50 GPT-4o requests.
## The Test-Time Compute Paradigm
```python
class TestTimeComputeAnalysis:
    """
    Traditional scaling law: quality = f(training_compute)
    o1 paradigm:             quality = f(training_compute, inference_compute)

    The key insight: for reasoning tasks, it is more efficient to spend
    compute at inference time (thinking) than to train a larger model.
    """

    def compute_scaling_comparison(self):
        """
        Compare two ways to improve accuracy on the MATH benchmark:
        1. Train a 10x larger model (more training compute)
        2. Use the same model with 10x more inference tokens (more thinking)
        """
        # Training scaling: accuracy improves ~logarithmically with compute.
        # From GPT-4 to a hypothetical successor with 10x training compute.
        training_accuracy_gpt4 = 52.9  # MATH benchmark
        training_accuracy_10x = 58.0   # Estimated at 10x training compute
        training_cost_10x = 10.0       # Relative one-time cost

        # Inference scaling: accuracy improves with thinking tokens.
        # o1 uses the SAME base model, just generates more tokens.
        o1_accuracy_low_compute = 60.0  # ~100 thinking tokens
        o1_accuracy_medium = 78.0       # ~1,000 thinking tokens
        o1_accuracy_high = 83.3         # ~10,000 thinking tokens

        # Per-query cost comparison:
        # GPT-4: ~500 output tokens, standard inference.
        # o1 (medium): ~1,500 tokens total (1,000 thinking + 500 output).
        # o1 uses ~3x more tokens per query but gains ~25 MATH points.
        results = {
            'training_scaling': {
                'accuracy_gain': training_accuracy_10x - training_accuracy_gpt4,
                'relative_cost': 'Training: 10x (one-time)',
            },
            'inference_scaling': {
                'accuracy_gain': o1_accuracy_medium - training_accuracy_gpt4,
                'relative_cost': 'Inference: 3x (per-query)',
            },
        }
        return results
```
### Scaling Approaches for MATH Benchmark
| Approach | Accuracy | Relative Cost | Cost Type | Improvement over GPT-4 |
|---|---|---|---|---|
| GPT-4 (baseline) | 52.9% | 1x | Per-query | - |
| 10x training compute | ~58% | 10x | One-time | +5.1 pts |
| o1 (low thinking) | 60.0% | 1.5x | Per-query | +7.1 pts |
| o1 (medium thinking) | 78.0% | 3x | Per-query | +25.1 pts |
| o1 (high thinking) | 83.3% | 10x | Per-query | +30.4 pts |
| o1 (maximum thinking) | 94.8% | ~50x | Per-query | +41.9 pts |
o1 achieves 25 more MATH points than GPT-4 at 3x per-query cost. Getting equivalent accuracy through training scaling alone would require orders of magnitude more training compute. This establishes test-time compute as a more efficient scaling axis for reasoning tasks.
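One way to sanity-check that claim is a break-even calculation: a one-time training premium is amortized across all future queries, while a thinking premium recurs on every query. A rough sketch, where the dollar figures and cost multipliers are illustrative assumptions rather than published numbers:

```python
def breakeven_queries(training_cost_multiple: float,
                      base_training_cost: float,
                      base_query_cost: float,
                      inference_cost_multiple: float) -> float:
    """Number of queries at which paying a one-time extra training cost
    becomes cheaper than paying an inference premium on every query."""
    extra_training = (training_cost_multiple - 1) * base_training_cost
    extra_per_query = (inference_cost_multiple - 1) * base_query_cost
    return extra_training / extra_per_query

# Illustrative assumptions: $100M base training run, $0.03 per query,
# 10x training scaling vs 3x per-query inference scaling.
n = breakeven_queries(10, 100e6, 0.03, 3.0)
print(f"Break-even at ~{n:.2e} queries")
```

Under these assumptions the 10x training run only pays for itself after roughly 15 billion queries, and even then it buys ~5 accuracy points where medium thinking buys ~25.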
## Internal Chain-of-Thought Architecture
```python
class O1ReasoningArchitecture:
    """
    Estimated o1 architecture based on published behavior and API responses.

    Key components:
    1. Base LLM (likely GPT-4 class, possibly GPT-4o)
    2. Reasoning policy: trained to generate productive thinking tokens
    3. Compute budget controller: allocates thinking based on difficulty
    4. Summary generator: condenses thinking into the final answer
    """

    def __init__(self, base_model, reasoning_policy):
        self.base_model = base_model
        self.reasoning_policy = reasoning_policy

    def generate_with_reasoning(self, prompt, max_thinking_tokens=8192):
        """
        Step 1: Generate internal chain-of-thought.
        Step 2: Condense it into the final answer.
        """
        # The prompt is augmented with a reasoning instruction.
        reasoning_prompt = self._construct_reasoning_prompt(prompt)

        # Generate thinking tokens (hidden from the user).
        thinking_tokens = []
        current_state = reasoning_prompt
        for step in range(max_thinking_tokens):
            # The model generates one thinking token at a time.
            next_token, confidence = self.reasoning_policy.generate_step(
                current_state
            )
            thinking_tokens.append(next_token)
            current_state = current_state + [next_token]
            # Early stopping: if the model is confident in its answer.
            if self._should_stop_thinking(thinking_tokens, confidence):
                break

        # Generate the final answer conditioned on the thinking.
        final_answer = self.base_model.generate(
            prompt,
            context=thinking_tokens,  # Thinking informs the answer
            max_tokens=4096,
        )
        return {
            'thinking_tokens': len(thinking_tokens),  # Hidden from the user
            'answer': final_answer,                   # Visible
            'total_tokens': len(thinking_tokens) + len(final_answer),
        }

    def _should_stop_thinking(self, thinking_tokens, confidence):
        """
        Heuristic for when to stop thinking:
        - High confidence in the current answer
        - Thinking has converged (repeating patterns)
        - Budget exhausted (handled by the loop bound above)
        """
        if confidence > 0.95:
            return True
        if len(thinking_tokens) > 100:
            # Check for convergence between the two most recent windows.
            recent = thinking_tokens[-50:]
            earlier = thinking_tokens[-100:-50]
            if self._semantic_similarity(recent, earlier) > 0.9:
                return True
        return False
```
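The semantic-similarity check above is left abstract. A minimal stand-in is token-overlap Jaccard similarity; this is an assumption for illustration, since o1's actual convergence test is not public:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Token-overlap similarity in [0, 1]; a crude proxy for
    semantic similarity between two spans of thinking tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def should_stop_thinking(thinking_tokens, confidence,
                         conf_threshold=0.95, window=50):
    """Stop when the model is confident or recent thinking repeats itself."""
    if confidence > conf_threshold:
        return True
    if len(thinking_tokens) >= 2 * window:
        recent = thinking_tokens[-window:]
        earlier = thinking_tokens[-2 * window:-window]
        if jaccard_similarity(recent, earlier) > 0.9:
            return True
    return False

# A looping chain of thought triggers the convergence stop:
print(should_stop_thinking(["step", "a", "b"] * 40, confidence=0.5))  # True
```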
## Thinking Token Patterns
```python
def analyze_thinking_patterns():
    """
    Based on o1 API responses (which report thinking token counts),
    we can analyze how the model allocates compute.
    """
    # Observed thinking token counts by problem type.
    problem_types = {
        'simple_factual': {
            'example': 'What is the capital of France?',
            'avg_thinking_tokens': 12,
            'accuracy': 0.99,
            'note': 'Barely thinks — answer is cached/trivial',
        },
        'moderate_reasoning': {
            'example': 'Solve 2x + 5 = 13',
            'avg_thinking_tokens': 85,
            'accuracy': 0.98,
            'note': 'Brief verification chain',
        },
        'complex_math': {
            'example': 'MATH competition Level 5 problem',
            'avg_thinking_tokens': 2400,
            'accuracy': 0.83,
            'note': 'Extended reasoning with backtracking',
        },
        'coding_hard': {
            'example': 'Implement red-black tree deletion',
            'avg_thinking_tokens': 4200,
            'accuracy': 0.72,
            'note': 'Plan, implement, verify, revise',
        },
        'research_level': {
            'example': 'Prove a novel mathematical theorem',
            'avg_thinking_tokens': 12000,
            'accuracy': 0.45,
            'note': 'Multiple approaches, dead ends, retries',
        },
    }
    for ptype, info in problem_types.items():
        cost_ratio = info['avg_thinking_tokens'] / 500  # vs standard GPT-4 response
        print(f"{ptype:25s}: ~{info['avg_thinking_tokens']:>5d} thinking tokens, "
              f"acc={info['accuracy']:.0%}, cost={cost_ratio:.1f}x standard")
```
*Figure: Average Thinking Tokens by Problem Difficulty*
## Training the Reasoning Policy
```python
class ReasoningPolicyTraining:
    """
    Estimated training approach for o1's reasoning capability, based on
    published research (STaR, "Let's Verify Step by Step", etc.).
    Helper methods (`check_correct`, `evaluate_step`, the base model's
    generation methods, and `ProcessRewardModel`) are placeholders
    for this sketch.
    """

    def __init__(self, base_model):
        self.base_model = base_model

    def star_training(self, problems, solutions):
        """
        Self-Taught Reasoner (STaR) approach:
        1. Generate rationales for training problems
        2. Keep rationales that lead to correct answers
        3. Fine-tune on (problem, correct_rationale, answer) triples
        4. Repeat
        """
        for iteration in range(10):
            rationale_dataset = []
            for problem, solution in zip(problems, solutions):
                # Generate multiple candidate rationales.
                rationales = self.base_model.generate_rationales(
                    problem, num_samples=16
                )
                for rationale in rationales:
                    # Keep only rationales that lead to a correct answer.
                    answer = self.base_model.generate_answer(
                        problem, rationale=rationale
                    )
                    if self.check_correct(answer, solution):
                        rationale_dataset.append({
                            'problem': problem,
                            'rationale': rationale,
                            'answer': answer,
                        })
            # Fine-tune on the correct rationales and repeat.
            self.base_model.fine_tune(rationale_dataset)
            print(f"Iteration {iteration}: {len(rationale_dataset)} correct rationales")

    def process_reward_model(self, problems, solutions):
        """
        Process Reward Model (PRM): train a reward model that evaluates
        individual reasoning steps, not just final answers. This enables
        step-level beam search during inference, where the model can
        backtrack from bad reasoning paths.
        """
        # Collect step-level annotations.
        step_annotations = []
        for problem, solution in zip(problems, solutions):
            # Generate full reasoning chains, then label each step prefix.
            chains = self.base_model.generate_chains(problem, num_chains=32)
            for chain in chains:
                steps = chain.split('\n')
                for step_idx, step in enumerate(steps):
                    # Evaluate whether this step is correct/productive.
                    is_correct = self.evaluate_step(
                        problem, steps[:step_idx + 1], solution
                    )
                    step_annotations.append({
                        'problem': problem,
                        'steps_so_far': steps[:step_idx + 1],
                        'is_correct': is_correct,
                    })
        # Train the PRM on the step annotations.
        self.prm = ProcessRewardModel()
        self.prm.train(step_annotations)
        return self.prm
```
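Once trained, a PRM enables step-level beam search at inference time: expand each partial chain, score candidate steps, keep the best few, and let bad paths die off. A minimal sketch; `propose_steps` and `score_step` are stand-in callables for the generator and PRM, not o1's actual interfaces:

```python
import heapq

def prm_beam_search(problem, propose_steps, score_step,
                    beam_width=4, max_depth=6):
    """Step-level beam search: expand each partial reasoning chain,
    score every candidate step with a process reward model, and keep
    the top `beam_width` chains. Returns (score, steps) of the best chain."""
    beams = [(0.0, [])]  # (cumulative PRM score, steps so far)
    for _ in range(max_depth):
        candidates = []
        for score, steps in beams:
            for step in propose_steps(problem, steps):
                new_steps = steps + [step]
                candidates.append((score + score_step(problem, new_steps),
                                   new_steps))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])

# Toy usage: "reasoning" over digits, rewarding chains that count upward.
propose = lambda problem, steps: ["1", "2", "3"]
score = lambda problem, steps: (
    1.0 if steps == [str(i + 1) for i in range(len(steps))] else -1.0
)
best_score, best_chain = prm_beam_search("count", propose, score,
                                         beam_width=2, max_depth=3)
print(best_chain)  # ['1', '2', '3']
```

The key difference from answer-level best-of-N sampling is that scoring happens per step, so a chain that goes wrong early is pruned before the full budget is spent on it.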
## Serving Infrastructure Implications
```python
def o1_serving_analysis():
    """
    o1's reasoning model fundamentally changes serving economics.
    """
    # Standard GPT-4 serving.
    gpt4_metrics = {
        'avg_input_tokens': 500,
        'avg_output_tokens': 500,
        'total_tokens_per_query': 1000,
        'latency_p50_ms': 5000,
        'cost_per_query': 0.03,
        'throughput_queries_per_gpu': 20,
    }
    # o1 serving.
    o1_metrics = {
        'avg_input_tokens': 500,
        'avg_thinking_tokens': 2000,     # Hidden from the user, but billed at the output-token rate
        'avg_output_tokens': 500,
        'total_tokens_per_query': 3000,  # 3x more generation
        'latency_p50_ms': 15000,         # 3x longer (thinking)
        'cost_per_query': 0.09,          # ~3x cost
        'throughput_queries_per_gpu': 7, # ~3x fewer concurrent queries
        # But for reasoning tasks:
        'accuracy_improvement': '+30%',  # On MATH
        'value_per_correct_answer': 'Much higher for enterprise',
    }
    # Key infrastructure challenges for o1-style models:
    challenges = {
        'long_generation': {
            'issue': 'Thinking generates 1K-10K+ tokens before the answer',
            'impact': 'GPU occupied 3-20x longer per query',
            'mitigation': 'Speculative decoding, dedicated thinking clusters',
        },
        'variable_compute': {
            'issue': 'Simple queries: 12 thinking tokens. Hard: 12,000',
            'impact': 'Massive variance in per-query latency and cost',
            'mitigation': 'Adaptive batching, compute budget prediction',
        },
        'kv_cache_pressure': {
            'issue': 'Thinking tokens fill the KV cache before the answer starts',
            'impact': '2K thinking tokens = 2K extra KV cache entries',
            'mitigation': 'KV cache compression, thinking token pruning',
        },
    }
    return o1_metrics, challenges
```
### o1 Serving Impact
| Metric | GPT-4 | o1 (easy query) | o1 (hard query) | Impact |
|---|---|---|---|---|
| Thinking tokens | 0 | 50-200 | 2,000-10,000 | Variable |
| Total generation | 500 tok | 700 tok | 10,500 tok | 21x variance |
| Latency | 5s | 6s | 45s | 9x variance |
| KV cache per query | 2 MB | 3 MB | 40 MB | 20x pressure |
| GPU-seconds per query | 0.5 | 0.7 | 5.0 | 10x cost variance |
| Throughput (queries/GPU/s) | 2.0 | 1.4 | 0.2 | 10x reduction on hard |
o1’s variable compute creates a bimodal serving workload. Easy queries (90% of traffic) take 5-7 seconds. Hard queries (10% of traffic) take 30-60 seconds. This means head-of-line blocking in traditional serving queues — a single hard query blocks the GPU for the time of 10 easy queries. Preemptive scheduling or dedicated “thinking” GPU pools are necessary for production deployment.
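The head-of-line effect is easy to demonstrate with a toy single-GPU queue simulation. The service times below are illustrative, chosen to match the latency figures above; this is a sketch, not a production scheduler:

```python
def avg_wait_fifo(service_times):
    """Average wait before service starts, in a single FIFO queue
    served by one GPU."""
    t, waits = 0.0, []
    for s in service_times:
        waits.append(t)  # This job waits for everything ahead of it
        t += s
    return sum(waits) / len(waits)

# 9 easy queries (6 s each) arrive just behind 1 hard query (45 s).
jobs = [45.0] + [6.0] * 9
fifo = avg_wait_fifo(jobs)

# Dedicated pools: hard queries go to a separate "thinking" GPU.
easy_pool = avg_wait_fifo([6.0] * 9)   # easy-query GPU
hard_pool = avg_wait_fifo([45.0])      # thinking GPU
pooled = (easy_pool * 9 + hard_pool * 1) / 10

print(f"Shared FIFO avg wait: {fifo:.1f} s")
print(f"Split pools avg wait: {pooled:.1f} s")
```

With identical total GPU-seconds, the shared FIFO queue roughly triples the average wait (about 62 s versus 22 s under these assumptions), which is why dedicated thinking pools or preemptive scheduling pay off.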
## Accuracy vs Thinking Budget
```python
def accuracy_vs_thinking_budget():
    """
    The relationship between thinking tokens and accuracy follows
    a roughly logarithmic curve with diminishing returns.
    """
    # Estimated data from o1 API experiments.
    budgets = [0, 50, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]
    math_accuracy = [52.9, 58.2, 65.4, 72.1, 78.0, 83.3, 88.1, 91.2, 93.5, 94.8]
    coding_accuracy = [67.0, 70.2, 74.8, 78.5, 82.3, 85.1, 88.4, 89.8, 90.5, 91.2]

    # Key observations:
    # 1. The first ~200 thinking tokens give the largest accuracy boost.
    # 2. Beyond ~5,000 tokens, returns are strongly diminishing.
    # 3. Different tasks have different saturation points.

    # For a cost-optimal system:
    # - Classify query difficulty first
    # - Allocate a thinking budget proportional to difficulty
    # - Stop thinking once confidence exceeds a threshold
    optimal_budgets = {
        'simple': 50,       # Quick verification
        'moderate': 500,    # Standard reasoning
        'complex': 2000,    # Extended analysis
        'research': 10000,  # Deep exploration (only if the user pays)
    }
    return optimal_budgets
```
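Given a measured accuracy-vs-budget curve like the one above, a budget controller can stop at the point where the marginal accuracy per extra thinking token falls below a cost threshold. A sketch over the estimated data points; the threshold value is an assumed tunable, not a published parameter:

```python
def optimal_budget(budgets, accuracies, min_gain_per_1k_tokens=1.0):
    """Walk the accuracy curve and return the last budget whose marginal
    accuracy gain per 1,000 extra thinking tokens clears the threshold."""
    best = budgets[0]
    points = list(zip(budgets, accuracies))
    for (b0, a0), (b1, a1) in zip(points, points[1:]):
        gain_per_1k = (a1 - a0) / ((b1 - b0) / 1000)
        if gain_per_1k < min_gain_per_1k_tokens:
            break
        best = b1
    return best

# Estimated MATH curve from above.
budgets = [0, 50, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]
math_acc = [52.9, 58.2, 65.4, 72.1, 78.0, 83.3, 88.1, 91.2, 93.5, 94.8]
print(optimal_budget(budgets, math_acc, min_gain_per_1k_tokens=1.0))  # 5000
```

With a threshold of 1 accuracy point per 1,000 tokens, the controller stops at a 5,000-token budget: the step from 5,000 to 10,000 tokens buys only ~0.6 points per 1,000 tokens.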
*Figure: MATH Accuracy vs Thinking Token Budget*
## Impact on the Field
### Reasoning Models Comparison (Early 2025)
| Model | MATH | GPQA Diamond | Codeforces | Approach |
|---|---|---|---|---|
| GPT-4o | 76.6 | 53.6 | 11% | Standard (no reasoning) |
| o1-preview | 83.3 | 73.3 | 62% | Internal CoT |
| o1 (full) | 94.8 | 78.0 | 89% | Internal CoT (high budget) |
| DeepSeek R1 | 79.8 | 71.5 | 52% | Open-weight reasoning |
| Claude 3.5 Sonnet | 78.3 | 65.0 | N/A | Standard + tools |
| o3 (estimated) | >95% | >80% | >90% | Next-gen reasoning |
o1 demonstrated that test-time compute scaling is a viable and efficient alternative to training-time scaling for reasoning tasks. The model generates thousands of internal thinking tokens, effectively running a search process over possible reasoning chains, and selects the best path to the answer. The serving implications are significant: variable per-query compute, 3-20x longer generation times, and bimodal latency distributions require new scheduling and infrastructure approaches. The paradigm has since been adopted by DeepSeek R1, QwQ, and other labs, establishing reasoning-time compute as a standard axis of model capability alongside model size and training data.