OpenAI o1 spends 30 seconds generating hidden reasoning tokens before answering a hard math problem. The result: 83.3% on AIME 2024, versus GPT-4o’s 13.4%. The cost: 10-100x more inference FLOPs per request. This is the test-time compute paradigm — instead of training a 10x larger model, you let the existing model think 10x longer. For specialized reasoning domains (competition math, formal proofs, complex planning), o1 proves that inference scaling beats parameter scaling. The serving implications are brutal: each o1 request consumes the compute budget of 50 GPT-4o requests.
## The Test-Time Compute Paradigm
```python
class TestTimeComputeAnalysis:
    """
    Traditional scaling law: quality = f(training_compute)
    o1 paradigm:             quality = f(training_compute, inference_compute)

    The key insight: for reasoning tasks, it is more efficient to spend
    compute at inference time (thinking) than to train a larger model.
    """

    def compute_scaling_comparison(self):
        """
        Compare two ways to improve accuracy on the MATH benchmark:
        1. Train a 10x larger model (more training compute)
        2. Use the same model with 10x more inference tokens (more thinking)
        """
        # Training scaling: accuracy improves ~logarithmically with compute.
        # From GPT-4 to a hypothetical successor with 10x training compute.
        training_accuracy_gpt4 = 52.9  # MATH benchmark
        training_accuracy_10x = 58.0   # Estimated at 10x training compute
        training_cost_10x = 10.0       # Relative one-time cost

        # Inference scaling: accuracy improves with thinking tokens.
        # o1 uses the SAME base model, just generates more tokens.
        o1_accuracy_low_compute = 60.0  # ~100 thinking tokens
        o1_accuracy_medium = 78.0       # ~1,000 thinking tokens
        o1_accuracy_high = 83.3         # ~10,000 thinking tokens

        # Per-query cost comparison:
        # GPT-4: ~500 output tokens, standard inference.
        # o1 (medium): ~1,500 tokens total (1,000 thinking + 500 output).
        # o1 uses ~3x more tokens per query but gains ~25 MATH points.
        results = {
            'training_scaling': {
                'accuracy_gain': training_accuracy_10x - training_accuracy_gpt4,
                'relative_cost': 'Training: 10x (one-time)',
            },
            'inference_scaling': {
                'accuracy_gain': o1_accuracy_medium - training_accuracy_gpt4,
                'relative_cost': 'Inference: 3x (per-query)',
            },
        }
        return results
```
### Scaling Approaches for MATH Benchmark
| Approach | Accuracy | Relative Cost | Cost Type | Improvement over GPT-4 |
|---|---|---|---|---|
| GPT-4 (baseline) | 52.9% | 1x | Per-query | - |
| 10x training compute | ~58% | 10x | One-time | +5.1 pts |
| o1 (low thinking) | 60.0% | 1.5x | Per-query | +7.1 pts |
| o1 (medium thinking) | 78.0% | 3x | Per-query | +25.1 pts |
| o1 (high thinking) | 83.3% | 10x | Per-query | +30.4 pts |
| o1 (maximum thinking) | 94.8% | ~50x | Per-query | +41.9 pts |
o1 achieves 25 more MATH points than GPT-4 at 3x per-query cost. Getting equivalent accuracy through training scaling alone would require orders of magnitude more training compute. This establishes test-time compute as a more efficient scaling axis for reasoning tasks.
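One way to sanity-check that claim is a break-even calculation: a one-time training premium is amortized across all future queries, while a thinking premium recurs on every query. A rough sketch, where the dollar figures and cost multipliers are illustrative assumptions rather than published numbers:

```python
def breakeven_queries(training_cost_multiple: float,
                      base_training_cost: float,
                      base_query_cost: float,
                      inference_cost_multiple: float) -> float:
    """Number of queries at which paying a one-time extra training cost
    becomes cheaper than paying an inference premium on every query."""
    extra_training = (training_cost_multiple - 1) * base_training_cost
    extra_per_query = (inference_cost_multiple - 1) * base_query_cost
    return extra_training / extra_per_query

# Illustrative assumptions: $100M base training run, $0.03 per query,
# 10x training scaling vs 3x per-query inference scaling.
n = breakeven_queries(10, 100e6, 0.03, 3.0)
print(f"Break-even at ~{n:.2e} queries")
```

Under these assumptions the 10x training run only pays for itself after roughly 15 billion queries, and even then it buys ~5 accuracy points where medium thinking buys ~25.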
## Internal Chain-of-Thought Architecture
```python
class O1ReasoningArchitecture:
    """
    Estimated o1 architecture based on published behavior and API responses.

    Key components:
    1. Base LLM (likely GPT-4 class, possibly GPT-4o)
    2. Reasoning policy: trained to generate productive thinking tokens
    3. Compute budget controller: allocates thinking based on difficulty
    4. Summary generator: condenses thinking into the final answer
    """

    def __init__(self, base_model, reasoning_policy):
        self.base_model = base_model
        self.reasoning_policy = reasoning_policy

    def generate_with_reasoning(self, prompt, max_thinking_tokens=8192):
        """
        Step 1: Generate internal chain-of-thought.
        Step 2: Condense it into the final answer.
        """
        # The prompt is augmented with a reasoning instruction.
        reasoning_prompt = self._construct_reasoning_prompt(prompt)

        # Generate thinking tokens (hidden from the user).
        thinking_tokens = []
        current_state = reasoning_prompt
        for step in range(max_thinking_tokens):
            # The model generates one thinking token at a time.
            next_token, confidence = self.reasoning_policy.generate_step(
                current_state
            )
            thinking_tokens.append(next_token)
            current_state = current_state + [next_token]
            # Early stopping: if the model is confident in its answer.
            if self._should_stop_thinking(thinking_tokens, confidence):
                break

        # Generate the final answer conditioned on the thinking.
        final_answer = self.base_model.generate(
            prompt,
            context=thinking_tokens,  # Thinking informs the answer
            max_tokens=4096,
        )
        return {
            'thinking_tokens': len(thinking_tokens),  # Hidden from the user
            'answer': final_answer,                   # Visible
            'total_tokens': len(thinking_tokens) + len(final_answer),
        }

    def _should_stop_thinking(self, thinking_tokens, confidence):
        """
        Heuristic for when to stop thinking:
        - High confidence in the current answer
        - Thinking has converged (repeating patterns)
        - Budget exhausted (handled by the loop bound above)
        """
        if confidence > 0.95:
            return True
        if len(thinking_tokens) > 100:
            # Check for convergence between the two most recent windows.
            recent = thinking_tokens[-50:]
            earlier = thinking_tokens[-100:-50]
            if self._semantic_similarity(recent, earlier) > 0.9:
                return True
        return False
```
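The semantic-similarity check above is left abstract. A minimal stand-in is token-overlap Jaccard similarity; this is an assumption for illustration, since o1's actual convergence test is not public:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Token-overlap similarity in [0, 1]; a crude proxy for
    semantic similarity between two spans of thinking tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def should_stop_thinking(thinking_tokens, confidence,
                         conf_threshold=0.95, window=50):
    """Stop when the model is confident or recent thinking repeats itself."""
    if confidence > conf_threshold:
        return True
    if len(thinking_tokens) >= 2 * window:
        recent = thinking_tokens[-window:]
        earlier = thinking_tokens[-2 * window:-window]
        if jaccard_similarity(recent, earlier) > 0.9:
            return True
    return False

# A looping chain of thought triggers the convergence stop:
print(should_stop_thinking(["step", "a", "b"] * 40, confidence=0.5))  # True
```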
## Thinking Token Patterns
```python
def analyze_thinking_patterns():
    """
    Based on o1 API responses (which report thinking token counts),
    we can analyze how the model allocates compute.
    """
    # Observed thinking token counts by problem type.
    problem_types = {
        'simple_factual': {
            'example': 'What is the capital of France?',
            'avg_thinking_tokens': 12,
            'accuracy': 0.99,
            'note': 'Barely thinks — answer is cached/trivial',
        },
        'moderate_reasoning': {
            'example': 'Solve 2x + 5 = 13',
            'avg_thinking_tokens': 85,
            'accuracy': 0.98,
            'note': 'Brief verification chain',
        },
        'complex_math': {
            'example': 'MATH competition Level 5 problem',
            'avg_thinking_tokens': 2400,
            'accuracy': 0.83,
            'note': 'Extended reasoning with backtracking',
        },
        'coding_hard': {
            'example': 'Implement red-black tree deletion',
            'avg_thinking_tokens': 4200,
            'accuracy': 0.72,
            'note': 'Plan, implement, verify, revise',
        },
        'research_level': {
            'example': 'Prove a novel mathematical theorem',
            'avg_thinking_tokens': 12000,
            'accuracy': 0.45,
            'note': 'Multiple approaches, dead ends, retries',
        },
    }
    for ptype, info in problem_types.items():
        cost_ratio = info['avg_thinking_tokens'] / 500  # vs standard GPT-4 response
        print(f"{ptype:25s}: ~{info['avg_thinking_tokens']:>5d} thinking tokens, "
              f"acc={info['accuracy']:.0%}, cost={cost_ratio:.1f}x standard")
```
*Figure: Average Thinking Tokens by Problem Difficulty*
## Training the Reasoning Policy
```python
class ReasoningPolicyTraining:
    """
    Estimated training approach for o1's reasoning capability, based on
    published research (STaR, "Let's Verify Step by Step", etc.).
    Helper methods (`check_correct`, `evaluate_step`, the base model's
    generation methods, and `ProcessRewardModel`) are placeholders
    for this sketch.
    """

    def __init__(self, base_model):
        self.base_model = base_model

    def star_training(self, problems, solutions):
        """
        Self-Taught Reasoner (STaR) approach:
        1. Generate rationales for training problems
        2. Keep rationales that lead to correct answers
        3. Fine-tune on (problem, correct_rationale, answer) triples
        4. Repeat
        """
        for iteration in range(10):
            rationale_dataset = []
            for problem, solution in zip(problems, solutions):
                # Generate multiple candidate rationales.
                rationales = self.base_model.generate_rationales(
                    problem, num_samples=16
                )
                for rationale in rationales:
                    # Keep only rationales that lead to a correct answer.
                    answer = self.base_model.generate_answer(
                        problem, rationale=rationale
                    )
                    if self.check_correct(answer, solution):
                        rationale_dataset.append({
                            'problem': problem,
                            'rationale': rationale,
                            'answer': answer,
                        })
            # Fine-tune on the correct rationales and repeat.
            self.base_model.fine_tune(rationale_dataset)
            print(f"Iteration {iteration}: {len(rationale_dataset)} correct rationales")

    def process_reward_model(self, problems, solutions):
        """
        Process Reward Model (PRM): train a reward model that evaluates
        individual reasoning steps, not just final answers. This enables
        step-level beam search during inference, where the model can
        backtrack from bad reasoning paths.
        """
        # Collect step-level annotations.
        step_annotations = []
        for problem, solution in zip(problems, solutions):
            # Generate full reasoning chains, then label each step prefix.
            chains = self.base_model.generate_chains(problem, num_chains=32)
            for chain in chains:
                steps = chain.split('\n')
                for step_idx, step in enumerate(steps):
                    # Evaluate whether this step is correct/productive.
                    is_correct = self.evaluate_step(
                        problem, steps[:step_idx + 1], solution
                    )
                    step_annotations.append({
                        'problem': problem,
                        'steps_so_far': steps[:step_idx + 1],
                        'is_correct': is_correct,
                    })
        # Train the PRM on the step annotations.
        self.prm = ProcessRewardModel()
        self.prm.train(step_annotations)
        return self.prm
```
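Once trained, a PRM enables step-level beam search at inference time: expand each partial chain, score candidate steps, keep the best few, and let bad paths die off. A minimal sketch; `propose_steps` and `score_step` are stand-in callables for the generator and PRM, not o1's actual interfaces:

```python
import heapq

def prm_beam_search(problem, propose_steps, score_step,
                    beam_width=4, max_depth=6):
    """Step-level beam search: expand each partial reasoning chain,
    score every candidate step with a process reward model, and keep
    the top `beam_width` chains. Returns (score, steps) of the best chain."""
    beams = [(0.0, [])]  # (cumulative PRM score, steps so far)
    for _ in range(max_depth):
        candidates = []
        for score, steps in beams:
            for step in propose_steps(problem, steps):
                new_steps = steps + [step]
                candidates.append((score + score_step(problem, new_steps),
                                   new_steps))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda b: b[0])

# Toy usage: "reasoning" over digits, rewarding chains that count upward.
propose = lambda problem, steps: ["1", "2", "3"]
score = lambda problem, steps: (
    1.0 if steps == [str(i + 1) for i in range(len(steps))] else -1.0
)
best_score, best_chain = prm_beam_search("count", propose, score,
                                         beam_width=2, max_depth=3)
print(best_chain)  # ['1', '2', '3']
```

The key difference from answer-level best-of-N sampling is that scoring happens per step, so a chain that goes wrong early is pruned before the full budget is spent on it.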
## Serving Infrastructure Implications
```python
def o1_serving_analysis():
    """
    o1's reasoning model fundamentally changes serving economics.
    """
    # Standard GPT-4 serving.
    gpt4_metrics = {
        'avg_input_tokens': 500,
        'avg_output_tokens': 500,
        'total_tokens_per_query': 1000,
        'latency_p50_ms': 5000,
        'cost_per_query': 0.03,
        'throughput_queries_per_gpu': 20,
    }
    # o1 serving.
    o1_metrics = {
        'avg_input_tokens': 500,
        'avg_thinking_tokens': 2000,     # Hidden from the user, but billed at the output-token rate
        'avg_output_tokens': 500,
        'total_tokens_per_query': 3000,  # 3x more generation
        'latency_p50_ms': 15000,         # 3x longer (thinking)
        'cost_per_query': 0.09,          # ~3x cost
        'throughput_queries_per_gpu': 7, # ~3x fewer concurrent queries
        # But for reasoning tasks:
        'accuracy_improvement': '+30%',  # On MATH
        'value_per_correct_answer': 'Much higher for enterprise',
    }
    # Key infrastructure challenges for o1-style models:
    challenges = {
        'long_generation': {
            'issue': 'Thinking generates 1K-10K+ tokens before the answer',
            'impact': 'GPU occupied 3-20x longer per query',
            'mitigation': 'Speculative decoding, dedicated thinking clusters',
        },
        'variable_compute': {
            'issue': 'Simple queries: 12 thinking tokens. Hard: 12,000',
            'impact': 'Massive variance in per-query latency and cost',
            'mitigation': 'Adaptive batching, compute budget prediction',
        },
        'kv_cache_pressure': {
            'issue': 'Thinking tokens fill the KV cache before the answer starts',
            'impact': '2K thinking tokens = 2K extra KV cache entries',
            'mitigation': 'KV cache compression, thinking token pruning',
        },
    }
    return o1_metrics, challenges
```
### o1 Serving Impact
| Metric | GPT-4 | o1 (easy query) | o1 (hard query) | Impact |
|---|---|---|---|---|
| Thinking tokens | 0 | 50-200 | 2,000-10,000 | Variable |
| Total generation | 500 tok | 700 tok | 10,500 tok | 21x variance |
| Latency | 5s | 6s | 45s | 9x variance |
| KV cache per query | 2 MB | 3 MB | 40 MB | 20x pressure |
| GPU-seconds per query | 0.5 | 0.7 | 5.0 | 10x cost variance |
| Throughput (queries/GPU/s) | 2.0 | 1.4 | 0.2 | 10x reduction on hard |
o1’s variable compute creates a bimodal serving workload. Easy queries (90% of traffic) take 5-7 seconds. Hard queries (10% of traffic) take 30-60 seconds. This means head-of-line blocking in traditional serving queues — a single hard query blocks the GPU for the time of 10 easy queries. Preemptive scheduling or dedicated “thinking” GPU pools are necessary for production deployment.
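The head-of-line effect is easy to demonstrate with a toy single-GPU queue simulation. The service times below are illustrative, chosen to match the latency figures above; this is a sketch, not a production scheduler:

```python
def avg_wait_fifo(service_times):
    """Average wait before service starts, in a single FIFO queue
    served by one GPU."""
    t, waits = 0.0, []
    for s in service_times:
        waits.append(t)  # This job waits for everything ahead of it
        t += s
    return sum(waits) / len(waits)

# 9 easy queries (6 s each) arrive just behind 1 hard query (45 s).
jobs = [45.0] + [6.0] * 9
fifo = avg_wait_fifo(jobs)

# Dedicated pools: hard queries go to a separate "thinking" GPU.
easy_pool = avg_wait_fifo([6.0] * 9)   # easy-query GPU
hard_pool = avg_wait_fifo([45.0])      # thinking GPU
pooled = (easy_pool * 9 + hard_pool * 1) / 10

print(f"Shared FIFO avg wait: {fifo:.1f} s")
print(f"Split pools avg wait: {pooled:.1f} s")
```

With identical total GPU-seconds, the shared FIFO queue roughly triples the average wait (about 62 s versus 22 s under these assumptions), which is why dedicated thinking pools or preemptive scheduling pay off.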
## Accuracy vs Thinking Budget
```python
def accuracy_vs_thinking_budget():
    """
    The relationship between thinking tokens and accuracy follows
    a roughly logarithmic curve with diminishing returns.
    """
    # Estimated data from o1 API experiments.
    budgets = [0, 50, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]
    math_accuracy = [52.9, 58.2, 65.4, 72.1, 78.0, 83.3, 88.1, 91.2, 93.5, 94.8]
    coding_accuracy = [67.0, 70.2, 74.8, 78.5, 82.3, 85.1, 88.4, 89.8, 90.5, 91.2]

    # Key observations:
    # 1. The first ~200 thinking tokens give the largest accuracy boost.
    # 2. Beyond ~5,000 tokens, returns are strongly diminishing.
    # 3. Different tasks have different saturation points.

    # For a cost-optimal system:
    # - Classify query difficulty first
    # - Allocate a thinking budget proportional to difficulty
    # - Stop thinking once confidence exceeds a threshold
    optimal_budgets = {
        'simple': 50,       # Quick verification
        'moderate': 500,    # Standard reasoning
        'complex': 2000,    # Extended analysis
        'research': 10000,  # Deep exploration (only if the user pays)
    }
    return optimal_budgets
```
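Given a measured accuracy-vs-budget curve like the one above, a budget controller can stop at the point where the marginal accuracy per extra thinking token falls below a cost threshold. A sketch over the estimated data points; the threshold value is an assumed tunable, not a published parameter:

```python
def optimal_budget(budgets, accuracies, min_gain_per_1k_tokens=1.0):
    """Walk the accuracy curve and return the last budget whose marginal
    accuracy gain per 1,000 extra thinking tokens clears the threshold."""
    best = budgets[0]
    points = list(zip(budgets, accuracies))
    for (b0, a0), (b1, a1) in zip(points, points[1:]):
        gain_per_1k = (a1 - a0) / ((b1 - b0) / 1000)
        if gain_per_1k < min_gain_per_1k_tokens:
            break
        best = b1
    return best

# Estimated MATH curve from above.
budgets = [0, 50, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000]
math_acc = [52.9, 58.2, 65.4, 72.1, 78.0, 83.3, 88.1, 91.2, 93.5, 94.8]
print(optimal_budget(budgets, math_acc, min_gain_per_1k_tokens=1.0))  # 5000
```

With a threshold of 1 accuracy point per 1,000 tokens, the controller stops at a 5,000-token budget: the step from 5,000 to 10,000 tokens buys only ~0.6 points per 1,000 tokens.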
*Figure: MATH Accuracy vs Thinking Token Budget*
## Impact on the Field
### Reasoning Models Comparison (Early 2025)
| Model | MATH | GPQA Diamond | Codeforces | Approach |
|---|---|---|---|---|
| GPT-4o | 76.6 | 53.6 | 11% | Standard (no reasoning) |
| o1-preview | 83.3 | 73.3 | 62% | Internal CoT |
| o1 (full) | 94.8 | 78.0 | 89% | Internal CoT (high budget) |
| DeepSeek R1 | 79.8 | 71.5 | 52% | Open-weight reasoning |
| Claude 3.5 Sonnet | 78.3 | 65.0 | N/A | Standard + tools |
| o3 (estimated) | >95% | >80% | >90% | Next-gen reasoning |
o1 demonstrated that test-time compute scaling is a viable and efficient alternative to training-time scaling for reasoning tasks. The model generates thousands of internal thinking tokens, effectively running a search process over possible reasoning chains, and selects the best path to the answer. The serving implications are significant: variable per-query compute, 3-20x longer generation times, and bimodal latency distributions require new scheduling and infrastructure approaches. The paradigm has since been adopted by DeepSeek R1, QwQ, and other labs, establishing reasoning-time compute as a standard axis of model capability alongside model size and training data.