MMLU is contaminated in every major training dataset. GPT-4 scores 86.4%, but retesting on newly-written MMLU-style questions drops performance to 78.2% — an 8.2 point overshoot from memorization. HumanEval’s 164 problems are so widely circulated that DeepSeek V3 scores 90.2% despite the test being “held out” since 2021. Benchmarks are not neutral measurements; they are adversarial games where training data increasingly includes the test set, and reported numbers overestimate generalization by 5-15 points.
MMLU: Massive Multitask Language Understanding
What It Is
def mmlu_structure():
"""
MMLU (Hendrycks et al., 2021) is a multiple-choice knowledge benchmark.
57 subjects across STEM, humanities, social sciences, and professional fields.
~15,908 questions total.
"""
subjects = {
'stem': [
'abstract_algebra', 'astronomy', 'college_biology',
'college_chemistry', 'college_computer_science',
'college_mathematics', 'college_physics',
'computer_security', 'conceptual_physics',
'electrical_engineering', 'elementary_mathematics',
'high_school_biology', 'high_school_chemistry',
'high_school_computer_science', 'high_school_mathematics',
'high_school_physics', 'high_school_statistics',
'machine_learning',
],
'humanities': [
'formal_logic', 'high_school_european_history',
'high_school_us_history', 'high_school_world_history',
'international_law', 'jurisprudence', 'logical_fallacies',
'moral_disputes', 'moral_scenarios', 'philosophy',
'prehistory', 'professional_law', 'world_religions',
],
'social_sciences': [
'econometrics', 'high_school_geography',
'high_school_government_and_politics',
'high_school_macroeconomics', 'high_school_microeconomics',
'high_school_psychology', 'human_sexuality',
'professional_psychology', 'public_relations',
'security_studies', 'sociology', 'us_foreign_policy',
],
'other': [
'anatomy', 'business_ethics', 'clinical_knowledge',
'college_medicine', 'global_facts', 'human_aging',
'management', 'marketing', 'medical_genetics',
'miscellaneous', 'nutrition', 'professional_accounting',
'professional_medicine', 'virology',
],
}
# Question format: multiple choice with 4 options (A, B, C, D)
example = {
'subject': 'college_mathematics',
'question': 'Let A be the set of all ordered pairs of integers (m, n) '
'such that 7m + 12n = 22. What is the greatest negative '
'number in the set B = {m + n : (m, n) in A}?',
'options': ['A) -5', 'B) -4', 'C) -3', 'D) -2'],
'answer': 'B',
}
return subjects, example
def mmlu_evaluation_protocol():
"""
MMLU evaluation: two common protocols that give DIFFERENT scores.
"""
protocols = {
'5_shot': {
'description': '5 examples provided before the question as context',
'format': 'Q: ... A: B\nQ: ... A: D\n...\nQ: {test question} A:',
'scoring': 'Compare model\'s next token probabilities for A, B, C, D',
'note': 'Most commonly reported in papers',
},
'0_shot': {
'description': 'No examples, just the question',
'format': 'Q: {test question}\nA:',
'scoring': 'Same probability comparison',
'note': 'Some papers (especially chat models) report this',
},
'cot_0_shot': {
'description': 'Ask model to think step-by-step before answering',
'format': 'Q: {question}\nLet\'s think step by step:',
'scoring': 'Extract answer from generated text',
'note': 'Higher scores for reasoning models, but harder to parse',
},
}
# CRITICAL: different protocols give different scores
# Example: GPT-4o 5-shot = 88.7%, GPT-4o 0-shot = 86.5%, GPT-4o CoT = 90.2%
# Papers cherry-pick the protocol that gives the best number
return protocols
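The probability-comparison scoring shared by the 5-shot and 0-shot protocols can be sketched as follows. Here `get_option_logprobs` is a hypothetical stand-in for a real model call that returns next-token log-probabilities for the four option letters; the rest is just an argmax and an accuracy count.

```python
def score_mmlu(items, get_option_logprobs):
    """Score MMLU-style items by comparing next-token log-probabilities
    for the option letters A-D (the 5-shot/0-shot protocols above).

    get_option_logprobs(prompt) -> {'A': lp, 'B': lp, 'C': lp, 'D': lp}
    is a placeholder for a real model call.
    """
    correct = 0
    for item in items:
        logprobs = get_option_logprobs(item['prompt'])
        prediction = max(logprobs, key=logprobs.get)  # argmax over A-D
        correct += prediction == item['answer']
    return correct / len(items)

# Toy usage with a fake "model" that always prefers option B:
fake = lambda prompt: {'A': -2.0, 'B': -0.1, 'C': -3.0, 'D': -2.5}
items = [{'prompt': 'Q: ... A:', 'answer': 'B'},
         {'prompt': 'Q: ... A:', 'answer': 'D'}]
print(score_mmlu(items, fake))  # 0.5
```

Note that CoT scoring cannot use this shortcut: it must parse the answer letter out of generated text, which is one reason the protocols produce different numbers.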
What MMLU Measures vs What It Misses
def mmlu_measures():
"""What MMLU actually tests."""
measures = {
'factual_recall': {
'weight': 'High — many questions test memorized facts',
'example': 'In what year was the Peace of Westphalia signed?',
'can_be_gamed': 'Yes — train on the source textbooks',
},
'exam_taking': {
'weight': 'High — it is literally a multiple-choice exam',
'example': 'Process of elimination among 4 options',
'can_be_gamed': 'Yes — models learn MC test-taking strategies',
},
'domain_breadth': {
'weight': 'High — 57 subjects cover broad knowledge',
'genuine_value': 'Tests whether model has been exposed to diverse domains',
'limitation': 'Each subject has only ~100-300 questions',
},
'reasoning': {
'weight': 'Low to moderate — some math/logic questions require reasoning',
'limitation': 'Most questions can be answered from memorization alone',
},
}
misses = {
'open_ended_generation': 'MMLU never tests free-form writing',
'multi_step_reasoning': 'Questions are single-turn, no chain of thought',
'real_world_application': 'Knowing facts != applying them',
'uncertainty_calibration': 'No partial credit, no "I don\'t know" option',
'temporal_knowledge': 'Many questions have outdated answers',
'multilingual': 'English only',
}
return measures, misses
MMLU Score Interpretation Guide
| Score Range | Interpretation | Models in Range |
|---|---|---|
| 25% | Random chance (4 options) | Untrained model |
| 40-50% | Basic knowledge, poor reasoning | Small models (1-3B) |
| 50-65% | Moderate knowledge, some domains strong | 7B models (Mistral, Llama 3 8B) |
| 65-80% | Strong broad knowledge | 70B models, GPT-3.5 |
| 80-90% | Expert-level across most domains | GPT-4, Claude 3.5, DeepSeek V3 |
| 90%+ | Near-saturation, benchmark ceiling | GPT-4o (with CoT), o1 |
HumanEval: Code Generation
What It Is
def humaneval_structure():
"""
HumanEval (Chen et al., 2021, OpenAI) is a code generation benchmark.
164 hand-written Python programming problems.
"""
# Example problem (simplified)
example = {
'task_id': 'HumanEval/0',
'prompt': '''from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each
other than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""''',
'canonical_solution': ''' for idx, elem in enumerate(numbers):
for idx2, elem2 in enumerate(numbers):
if idx != idx2:
distance = abs(elem - elem2)
if distance < threshold:
return True
return False''',
'test': '''
def check(candidate):
assert candidate([1.0, 2.0, 3.0], 0.5) == False
assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False''',
}
statistics = {
'total_problems': 164,
'language': 'Python only',
'difficulty_distribution': {
'easy': '~40% (basic string/list manipulation)',
'medium': '~45% (moderate algorithms)',
'hard': '~15% (complex logic, edge cases)',
},
'average_solution_length': '~10 lines',
'test_cases_per_problem': '~7 (range: 3-20)',
}
return example, statistics
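The evaluation loop itself is simple: a HumanEval-style harness executes the prompt plus the model's completion, then runs the benchmark's `check()` function against the resulting callable. The sketch below captures the pass/fail logic; real harnesses sandbox the `exec` call and add timeouts, which are omitted here. The toy problem at the bottom is made up, not from HumanEval.

```python
def run_humaneval_problem(prompt, completion, test_code, entry_point):
    """Execute prompt + completion, then run the benchmark's check()
    against the resulting function. Returns True iff all asserts pass.
    WARNING: real harnesses sandbox this exec; this sketch does not.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)        # define the function
        exec(test_code, namespace)                  # define check()
        namespace['check'](namespace[entry_point])  # run the asserts
        return True
    except Exception:
        return False

# Toy usage with a tiny invented problem:
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = "def check(candidate):\n    assert candidate(1, 2) == 3\n"
print(run_humaneval_problem(prompt, completion, test_code, 'add'))  # True
```

A problem counts as passed only if every assertion in its `check()` holds, which is why a single edge-case assert can flip 0.6% of the final score.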
def humaneval_scoring():
"""
Scoring: pass@k metric.
Generate k samples, pass@k = fraction with at least 1 correct.
"""
def compute_pass_at_k(n, c, k):
"""
n: number of samples generated
c: number of correct samples
k: k in pass@k
"""
from math import comb
if n - c < k:
return 1.0
return 1.0 - comb(n - c, k) / comb(n, k)
# Most papers report pass@1 (single attempt)
# pass@1 is more meaningful than pass@10 for practical use
# (you want the model to get it right on the first try)
return compute_pass_at_k
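A quick sanity check of the estimator above: for k = 1 it reduces to the raw success rate c/n, and larger k inflates the number sharply, which is why "pass@1" and "pass@10" are not comparable.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator from Chen et al. (2021): the probability that
    # a random size-k subset of the n samples contains >= 1 correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 50, 1))         # 0.25 (= c/n when k = 1)
print(pass_at_k(200, 50, 10) > 0.9)  # True: 10 draws almost always hit one
```

The same model, same samples, same problems: 25% pass@1 but well over 90% pass@10.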
What HumanEval Misses
def humaneval_limitations():
"""
HumanEval's limitations are severe enough that it should NOT
be the sole measure of coding ability.
"""
limitations = {
'too_small': {
'issue': '164 problems is statistically insufficient',
'impact': '1 problem = 0.6% of the score. Random luck matters.',
'example': 'A model that gets 1 extra problem right jumps 0.6%',
},
'python_only': {
'issue': 'Only tests Python — no C, Java, JavaScript, Rust, etc.',
'impact': 'Models optimized for Python score artificially high',
},
'simple_problems': {
'issue': 'Most problems are interview-easy level',
'impact': 'Ceiling effect — frontier models all score 80%+',
'example': 'Problems like "reverse a string" or "check if palindrome"',
},
'no_context': {
'issue': 'Each problem is self-contained — no existing codebase',
'impact': 'Does not test ability to work with large codebases, '
'read documentation, or debug existing code',
},
'no_design': {
'issue': 'No testing of software design, architecture, or API design',
'impact': 'Coding is more than implementing functions',
},
'contamination': {
'issue': 'Published since 2021 — likely in training data of most models',
'impact': 'Models may memorize solutions rather than solve problems',
'evidence': 'Some models produce exact canonical solutions',
},
}
return limitations
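The 'too_small' point can be quantified. Treating each problem as a Bernoulli trial, the standard error of an accuracy estimate is sqrt(p(1-p)/n), so HumanEval's 164 problems give roughly a two-point one-sigma band near 90% accuracy, while a benchmark the size of MMLU's ~14,000-question test set shrinks that to a fraction of a point.

```python
from math import sqrt

def accuracy_standard_error(p, n):
    # Binomial standard error of a benchmark accuracy estimate:
    # sqrt(p * (1 - p) / n), with p the true pass rate, n the problem count.
    return sqrt(p * (1 - p) / n)

# HumanEval: 164 problems at ~90% accuracy
print(round(accuracy_standard_error(0.90, 164), 4))    # 0.0234
# MMLU-scale test set (~14,042 questions) at the same accuracy
print(round(accuracy_standard_error(0.90, 14042), 4))  # 0.0025
```

A 2-point gap between two models on HumanEval is therefore within one standard error, i.e. statistically meaningless.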
MATH: Competition Mathematics
Structure and Difficulty
def math_benchmark_structure():
"""
MATH (Hendrycks et al., 2021): 12,500 competition math problems.
MATH-500: a representative subset of 500 problems for efficiency.
"""
categories = {
'algebra': {
'fraction': 0.22,
'topics': 'Equations, polynomials, sequences, inequalities',
'difficulty_range': 'AMC 8 to AIME',
},
'counting_and_probability': {
'fraction': 0.12,
'topics': 'Combinatorics, probability, expected value',
'difficulty_range': 'AMC 10 to AIME',
},
'geometry': {
'fraction': 0.10,
'topics': 'Euclidean geometry, coordinate geometry, 3D geometry',
'difficulty_range': 'AMC 10 to AIME',
},
'intermediate_algebra': {
'fraction': 0.18,
'topics': 'Complex numbers, logarithms, advanced polynomials',
'difficulty_range': 'AMC 12 to AIME',
},
'number_theory': {
'fraction': 0.15,
'topics': 'Divisibility, modular arithmetic, prime numbers',
'difficulty_range': 'AMC 10 to AIME',
},
'prealgebra': {
'fraction': 0.13,
'topics': 'Fractions, ratios, basic operations',
'difficulty_range': 'AMC 8',
},
'precalculus': {
'fraction': 0.10,
'topics': 'Trigonometry, vectors, matrices',
'difficulty_range': 'AMC 12 to AIME',
},
}
difficulty_levels = {
1: {'description': 'AMC 8 level', 'fraction': 0.20},
2: {'description': 'AMC 10 easy', 'fraction': 0.20},
3: {'description': 'AMC 10 hard / AMC 12 easy', 'fraction': 0.20},
4: {'description': 'AMC 12 hard / AIME easy', 'fraction': 0.20},
5: {'description': 'AIME medium to hard', 'fraction': 0.20},
}
example = {
'problem': 'How many of the integers between 1 and 1000, inclusive, '
'can be expressed as the difference of the squares of two '
'nonnegative integers?',
'answer': '750',
'level': 5,
'category': 'number_theory',
'solution': (
'A number n can be expressed as a^2 - b^2 = (a+b)(a-b) if and '
'only if n is not congruent to 2 mod 4, because a+b and a-b have '
'the same parity, so their product is either odd or divisible by 4. '
'Numbers congruent to 2 mod 4 in [1,1000]: 2, 6, 10, ..., 998, '
'which is 250 numbers. '
'So 1000 - 250 = 750.'
),
}
return categories, difficulty_levels, example
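The worked example above can be verified by brute force. This sketch counts the integers in [1, 1000] expressible as a difference of two squares of nonnegative integers and compares the count against the mod-4 closed form.

```python
def difference_of_squares_count(limit):
    """Count n in [1, limit] with n = a^2 - b^2 for nonnegative
    integers a, b. For each b, increase a until a^2 - b^2 > limit."""
    expressible = set()
    for b in range(0, limit + 1):
        a = b
        while a * a - b * b <= limit:
            n = a * a - b * b
            if n >= 1:
                expressible.add(n)
            a += 1
    return len(expressible)

print(difference_of_squares_count(1000))  # 750

# Closed form: every n except n congruent to 2 (mod 4) is expressible.
print(sum(1 for n in range(1, 1001) if n % 4 != 2))  # 750
```

The outer range up to `limit` is safe because any representation n = (a+b)(a-b) with n <= limit forces b <= (n-1)/2.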
What MATH Measures
def math_analysis():
"""
MATH tests genuine mathematical reasoning — one of the harder
benchmarks to game through memorization.
"""
measures = {
'multi_step_reasoning': {
'weight': 'Very high — most problems require 3-10 reasoning steps',
'genuine': True,
'note': 'This is the benchmark\'s primary value',
},
'symbolic_manipulation': {
'weight': 'High — algebra, simplification, equation solving',
'genuine': True,
},
'problem_decomposition': {
'weight': 'High — must break complex problems into subproblems',
'genuine': True,
},
'mathematical_knowledge': {
'weight': 'Moderate — need to know theorems, formulas, techniques',
'can_be_memorized': True,
},
}
misses = {
'proof_writing': 'MATH asks for numerical answers, not proofs',
'mathematical_creativity': 'Problems have known solution paths',
'applied_mathematics': 'Pure math — no physics, engineering, or stats applications',
'open_ended_exploration': 'Each problem has exactly one correct answer',
}
contamination_risk = {
'level': 'Moderate',
'reason': 'Problems are from published competitions (AMC, AIME). '
'These are widely available online.',
'mitigation': 'Some labs use MATH-500 (subset) or create new problems.',
}
return measures, misses, contamination_risk
MATH-500 Accuracy Over Time
MATH is approaching saturation. DeepSeek R1 scores 97.3%, leaving only ~14 problems wrong out of 500. Once benchmark scores exceed 95%, the remaining errors are often due to edge cases, ambiguous problem statements, or parsing issues rather than genuine mathematical inability. New, harder benchmarks are needed.
SWE-bench: Real Software Engineering
What It Is
def swe_bench_structure():
"""
SWE-bench (Jimenez et al., 2024): real GitHub issues from popular
Python repositories. The model must produce a patch that fixes the issue.
"""
benchmark_info = {
'original': {
'size': 2294,
'source': '12 popular Python repos',
'task': 'Given a GitHub issue description, produce a git diff '
'that resolves the issue',
'evaluation': 'Run the repository test suite on the patched code',
},
'swe_bench_lite': {
'size': 300,
'source': 'Subset of SWE-bench (simpler issues)',
'purpose': 'Faster evaluation, still meaningful',
},
'swe_bench_verified': {
'size': 500,
'source': 'Human-verified subset with clearer specifications',
'purpose': 'Reduce false negatives from ambiguous issues',
},
}
repositories = [
'django/django',
'scikit-learn/scikit-learn',
'matplotlib/matplotlib',
'sympy/sympy',
'astropy/astropy',
'pytest-dev/pytest',
'sphinx-doc/sphinx',
'pylint-dev/pylint',
'psf/requests',
'pallets/flask',
'mwaskom/seaborn',
'pydata/xarray',
]
# Example task (simplified)
example = {
'repo': 'django/django',
'issue_title': 'QuerySet.defer() doesn\'t clear deferred fields when chaining',
'issue_body': 'When calling .defer("field1").defer("field2"), only field2 '
'is deferred. Expected: both fields should be deferred.',
'expected_output': 'A git diff that fixes the QuerySet.defer() method',
'test_command': 'python -m pytest tests/defer/tests.py',
}
return benchmark_info, repositories, example
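SWE-bench's pass criterion can be written as a pure function: an instance counts as resolved only if every previously failing test (the dataset's FAIL_TO_PASS set) now passes and no previously passing test (PASS_TO_PASS) regresses. The field names follow the released dataset; the surrounding harness (applying the git diff, running the test command) is omitted from this sketch.

```python
def is_resolved(fail_to_pass_results, pass_to_pass_results):
    """SWE-bench resolution rule (sketch).

    fail_to_pass_results: {test_name: passed} for tests that failed
        before the patch and must pass after it (FAIL_TO_PASS).
    pass_to_pass_results: {test_name: passed} for tests that must
        keep passing, i.e. no regressions (PASS_TO_PASS).
    """
    return (all(fail_to_pass_results.values())
            and all(pass_to_pass_results.values()))

# Patch fixes the bug and breaks nothing:
print(is_resolved({'test_defer_chaining': True},
                  {'test_defer_basic': True, 'test_only': True}))   # True
# Patch fixes the bug but causes a regression:
print(is_resolved({'test_defer_chaining': True},
                  {'test_defer_basic': False, 'test_only': True}))  # False
```

The all-or-nothing rule is strict by design: a patch that fixes the issue while breaking an unrelated test scores zero.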
What SWE-bench Measures
def swe_bench_measures():
"""
SWE-bench is the most realistic coding benchmark available.
"""
measures = {
'real_codebase_navigation': {
'weight': 'Very high — must find relevant files in large repos',
'genuine': True,
'note': 'Django has 500K+ lines of code. Finding the bug '
'requires understanding the codebase structure.',
},
'bug_understanding': {
'weight': 'High — must comprehend the issue from natural language',
'genuine': True,
},
'patch_generation': {
'weight': 'High — must produce a correct, minimal patch',
'genuine': True,
'note': 'Not just writing new code, but modifying existing code',
},
'test_awareness': {
'weight': 'Moderate — patch must pass the test suite',
'genuine': True,
},
}
misses = {
'python_only': 'All repos are Python — no C++, Rust, Go, etc.',
'no_new_features': 'All tasks are bug fixes, no feature development',
'no_code_review': 'No testing of collaboration, review, or documentation',
'agent_scaffolding_matters': 'Results heavily depend on the scaffolding '
'(how the model is called, retrieval, etc.)',
}
return measures, misses
SWE-bench Verified Scores (2024-2025)
| Model/Agent | SWE-bench Verified | Method | Date |
|---|---|---|---|
| GPT-4 (raw) | 1.7% | Direct prompting | 2024-03 |
| Claude 3.5 Sonnet (raw) | 3.2% | Direct prompting | 2024-06 |
| SWE-Agent + GPT-4 | 12.5% | Agent with retrieval | 2024-04 |
| Claude 3.5 Sonnet (agent) | 33.4% | Anthropic's agent framework | 2024-10 |
| OpenAI o1 (agent) | 41.0% | Agent framework | 2024-12 |
| DeepSeek R1 (agent) | 49.2% | Agent framework | 2025-01 |
SWE-bench scores are heavily influenced by the agent scaffolding (file retrieval, multi-step reasoning, error recovery), not just the raw model: Claude 3.5 Sonnet scores 3.2% with direct prompting but 33.4% inside an agent framework. Reported numbers therefore measure the model-plus-scaffolding system, so always check which scaffolding was used before comparing scores.
GSM-8K: Grade School Math
def gsm8k_analysis():
"""
GSM-8K (Cobbe et al., 2021): 8,792 grade-school math word problems.
"""
structure = {
'size': 8792,
'difficulty': 'Grade school (ages 8-14)',
'steps': '2-8 reasoning steps per problem',
'operations': 'Addition, subtraction, multiplication, division',
'format': 'Natural language word problem -> numerical answer',
}
example = {
'problem': "Janet's ducks lay 16 eggs per day. "
'She eats three for breakfast and bakes muffins with four. '
'She sells the remainder at the farmers market for $2 per egg. '
'How much does she make per day?',
'solution': '16 - 3 - 4 = 9 eggs sold. 9 * $2 = $18.',
'answer': 18,
}
# GSM-8K is largely solved
current_scores = {
'GPT-4o': 95.8,
'DeepSeek V3': 91.6,
'Llama 3.1 405B': 96.8,
'DeepSeek R1': 97.3,
'o1': 95.8,
}
# The problem: GSM-8K is too easy for frontier models
# Most errors are parsing issues, not math errors
return structure, example, current_scores
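GSM-8K is typically scored by extracting the final number from the generated text and comparing it to the gold answer, which is exactly where the parsing errors mentioned above creep in. A minimal sketch (real harnesses handle the '#### answer' marker, units, and formatting variants more carefully):

```python
import re

def extract_final_number(text):
    """Return the last number in the text as a float, or None.
    GSM-8K solutions conventionally end with '#### <answer>'."""
    matches = re.findall(r'-?\d[\d,]*\.?\d*', text)
    if not matches:
        return None
    return float(matches[-1].replace(',', ''))

def gsm8k_correct(generated, gold):
    predicted = extract_final_number(generated)
    return predicted is not None and abs(predicted - gold) < 1e-6

print(gsm8k_correct('16 - 3 - 4 = 9 eggs sold. 9 * $2 = $18. #### 18', 18))  # True
print(gsm8k_correct('She makes $16 per day.', 18))                           # False
```

A correct chain of thought that ends with a stray trailing number ("...so the answer is 18 dollars per day, or 126 per week") is marked wrong by this extractor, illustrating why residual errors on saturated benchmarks are often parsing artifacts.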
Contamination: The Elephant in the Room
How Contamination Happens
def contamination_analysis():
"""
Benchmark contamination: training data contains benchmark questions.
This inflates scores without improving genuine capability.
"""
contamination_sources = {
'direct_inclusion': {
'how': 'Benchmark questions appear verbatim in web crawl data',
'example': 'MMLU questions posted on forums, Reddit, educational sites',
'detection': 'N-gram overlap between training data and benchmark',
'prevalence': 'High for MMLU, HumanEval — they are widely shared online',
},
'paraphrase_inclusion': {
'how': 'Slightly reworded versions of benchmark questions in training data',
'example': 'A blog post that discusses an MMLU question and its answer',
'detection': 'Difficult — paraphrases evade n-gram detection',
'prevalence': 'Very high — any discussion of benchmarks is contamination',
},
'synthetic_data_leakage': {
'how': 'Teacher model (GPT-4) generates training data that contains '
'patterns from benchmarks the teacher was evaluated on',
'example': 'GPT-4 generates a "textbook" that includes problems '
'similar to MATH competition problems',
'detection': 'Nearly impossible to detect',
'prevalence': 'Unknown but likely significant',
},
}
contamination_evidence = {
'humaneval': {
'evidence': 'Some models produce the EXACT canonical solution, '
'including variable names. This is strong evidence of memorization.',
'severity': 'High',
},
'mmlu': {
'evidence': 'Models score disproportionately well on questions that '
'appear frequently on educational websites vs rare questions.',
'severity': 'Moderate to high',
},
'math': {
'evidence': 'AMC/AIME problems are published and widely discussed. '
'Models perform better on older problems (more time to be '
'included in training data) than newer ones.',
'severity': 'Moderate',
},
'swe_bench': {
'evidence': 'Lower contamination risk because tasks are derived from '
'specific GitHub issues at specific commits. However, the '
'fixes were merged and are in the commit history.',
'severity': 'Low to moderate',
},
}
return contamination_sources, contamination_evidence
def decontamination_methods():
"""
How labs attempt to decontaminate training data.
"""
methods = {
'n_gram_filtering': {
'method': 'Remove training examples with N-gram overlap above threshold',
'typical_n': '13-gram or 8-gram',
'effectiveness': 'Catches verbatim copies, misses paraphrases',
},
'embedding_similarity': {
'method': 'Remove training examples with high semantic similarity '
'to benchmark questions (using sentence embeddings)',
'effectiveness': 'Catches paraphrases, but may remove legitimate '
'educational content on the same topic',
},
'canary_strings': {
'method': 'Embed unique identifiers in benchmarks to detect '
'if they appear in training data',
'effectiveness': 'Only works for future benchmarks, not existing ones',
},
'held_out_evaluation': {
'method': 'Create new benchmark questions that have never been '
'published and evaluate on those',
'effectiveness': 'Best approach — but expensive to create and '
'becomes contaminated once published',
},
}
return methods
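The n-gram filtering method can be sketched with word-level shingles. The `n=8` default follows the smaller threshold mentioned above; real pipelines hash the n-grams and normalize whitespace, punctuation, and case before comparing, which this sketch only gestures at with `lower()`.

```python
def word_ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(training_doc, benchmark_questions, n=8):
    """Flag a training document if it shares any word n-gram with any
    benchmark question. Catches verbatim copies; misses paraphrases."""
    bench_grams = set()
    for q in benchmark_questions:
        bench_grams |= word_ngrams(q, n)
    return bool(word_ngrams(training_doc, n) & bench_grams)

question = ('Let A be the set of all ordered pairs of integers (m, n) '
            'such that 7m + 12n = 22.')
copied = 'A forum post: ' + question
paraphrase = 'Consider integer pairs solving 7m + 12n = 22.'
print(is_contaminated(copied, [question]))      # True
print(is_contaminated(paraphrase, [question]))  # False
```

The second call demonstrates the stated weakness: the paraphrase leaks the same problem but shares no 8-gram, so n-gram filtering lets it through.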
Better Alternatives
Emerging Benchmarks
def better_benchmarks():
"""
Newer benchmarks designed to address the limitations of MMLU/HumanEval/MATH.
"""
alternatives = {
'GPQA': {
'full_name': 'Graduate-Level Google-Proof Q&A',
'size': 448,
'format': 'Multiple choice (expert-level science)',
'advantage': 'Questions require PhD-level domain expertise. '
'Even domain experts score ~65%. Much harder to contaminate.',
'limitation': 'Small (448 questions). May saturate eventually.',
'current_frontier': '~70-78% (o1)',
},
'MMLU_Pro': {
'full_name': 'MMLU-Pro (harder MMLU)',
'size': 12032,
'format': 'Multiple choice with 10 options (not 4)',
'advantage': 'Harder questions, more options reduce guessing luck. '
'Less contaminated (newer benchmark).',
'limitation': 'Still multiple choice.',
'current_frontier': '~72-80%',
},
'LiveCodeBench': {
'full_name': 'LiveCodeBench',
'format': 'Continuously updated code problems from LeetCode/CodeForces',
'advantage': 'New problems added regularly — cannot be in training data. '
'Timestamps allow measuring performance over time.',
'limitation': 'Algorithm-focused — not real-world software engineering.',
'current_frontier': '~55-70%',
},
'AIME_2025': {
'full_name': 'AIME 2025 (competition math)',
'size': 30,
'format': 'Open-ended numerical answer',
'advantage': 'New every year — guaranteed zero contamination. '
'Genuinely hard (even for strong models).',
'limitation': 'Only 30 problems. High variance.',
'current_frontier': '~70-85% for reasoning models',
},
'Chatbot_Arena': {
'full_name': 'LMSYS Chatbot Arena (Elo ratings)',
'format': 'Human preference: users compare two model outputs',
'advantage': 'Most realistic evaluation — real users, real tasks. '
'Cannot be gamed through training data.',
'limitation': 'Biased toward fluency/style over accuracy. '
'Expensive to run at scale.',
'current_frontier': 'GPT-4o and Claude 3.5 Sonnet ~1270 Elo',
},
}
return alternatives
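Chatbot Arena ratings come from pairwise human votes. The leaderboard actually fits a Bradley-Terry model offline, but the online Elo update below conveys the idea: the expected score follows a logistic curve in the rating gap, and each vote moves both ratings by equal and opposite amounts.

```python
def elo_update(rating_a, rating_b, winner, k=32):
    """One Elo update after a single pairwise comparison.
    winner: 'a', 'b', or 'tie'. k is the update step size."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {'a': 1.0, 'b': 0.0, 'tie': 0.5}[winner]
    delta = k * (score_a - expected_a)
    # Zero-sum: A gains exactly what B loses.
    return rating_a + delta, rating_b - delta

r_a, r_b = elo_update(1270.0, 1250.0, 'a')
print(r_a > 1270.0, r_b < 1250.0)  # True True
```

A useful property for reading the leaderboard: a 400-point gap means the stronger model is expected to win about 10 times out of 11, so the ~20-point gaps between frontier models correspond to near-coin-flip preferences.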
How to Read Benchmark Numbers
def benchmark_reading_guide():
"""
Practical guide for interpreting benchmark claims in papers.
"""
checklist = {
'check_evaluation_protocol': {
'what': 'Is it 0-shot, 5-shot, or CoT? What prompt template?',
'why': 'Different protocols give different scores. '
'Papers choose the protocol that gives the best number.',
'example': 'MMLU 5-shot vs 0-shot can differ by 2-5%.',
},
'check_sample_count': {
'what': 'How many samples were generated per problem?',
'why': 'pass@1 with temperature=0 vs pass@1 with temperature=0.8 '
'and majority voting give very different results.',
'example': 'HumanEval pass@1 (greedy) vs pass@1 (best of 100) '
'can differ by 10-20%.',
},
'check_contamination_analysis': {
'what': 'Did the paper analyze training data overlap?',
'why': 'Without this, scores may be inflated.',
'example': 'Many papers skip contamination analysis entirely.',
},
'compare_on_same_protocol': {
'what': 'Are you comparing apples to apples?',
'why': 'Model A on 5-shot MMLU vs Model B on 0-shot MMLU is meaningless.',
'example': 'Use LMSYS Arena or a common evaluation framework.',
},
'check_benchmark_date': {
'what': 'When was the benchmark published?',
'why': 'Older benchmarks are more contaminated.',
'example': 'HumanEval (2021) is likely in most training sets. '
'AIME 2025 (Jan 2025) is guaranteed fresh.',
},
'look_at_multiple_benchmarks': {
'what': 'Does the model perform consistently across benchmarks?',
'why': 'A model that scores well on MMLU but poorly on GPQA '
'may have memorized MMLU.',
'example': 'DeepSeek V3 scores consistently across all benchmarks.',
},
}
return checklist
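The sample-count caveat in the checklist can be made concrete: majority voting over several sampled answers (self-consistency) often beats a single greedy sample, so "pass@1" without sampling details is ambiguous. A minimal sketch of the voting step:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among sampled generations.
    Ties break toward the earliest-seen answer, since Counter
    preserves insertion order for equal counts."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled answers to the same math problem:
samples = ['18', '18', '16', '18', '15']
print(majority_vote(samples))  # 18
```

A paper reporting "pass@1 = 96%" with temperature 0.8 and 64-way majority voting is describing a very different, and more expensive, system than one reporting greedy single-sample accuracy.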
Benchmark Reliability Assessment
| Benchmark | Contamination Risk | Ceiling Effect | Measures Real Ability | Recommended |
|---|---|---|---|---|
| MMLU | High | Approaching | Moderate (factual recall) | Use MMLU-Pro instead |
| HumanEval | Very high | At ceiling | Low (too simple) | Use LiveCodeBench |
| MATH | Moderate | Approaching | High (genuine reasoning) | Use AIME 2025 |
| GSM-8K | High | At ceiling | Low (too easy) | Deprecated |
| SWE-bench | Low | Far from ceiling | Very high (realistic) | Recommended |
| GPQA | Low | Far from ceiling | High (expert knowledge) | Recommended |
| Chatbot Arena | None | N/A | High (real preferences) | Gold standard |
The benchmark landscape is shifting from static, potentially contaminated tests (MMLU, HumanEval) toward dynamic, harder, and more realistic evaluations (SWE-bench, LiveCodeBench, Chatbot Arena). When evaluating a model, look at the newer benchmarks first. If a paper only reports MMLU and HumanEval, ask why they are not showing SWE-bench or GPQA. The most informative evaluations are those designed to resist contamination and test genuine capability rather than memorization.