MMLU is contaminated in every major training dataset. GPT-4 scores 86.4%, but retesting on newly-written MMLU-style questions drops performance to 78.2% — an 8.2 point overshoot from memorization. HumanEval’s 164 problems are so widely circulated that DeepSeek V3 scores 90.2% despite the test being “held out” since 2021. Benchmarks are not neutral measurements; they are adversarial games where training data increasingly includes the test set, and reported numbers overestimate generalization by 5-15 points.
MMLU: Massive Multitask Language Understanding
What It Is
def mmlu_structure():
"""
MMLU (Hendrycks et al., 2021) is a multiple-choice knowledge benchmark.
57 subjects across STEM, humanities, social sciences, and professional fields.
~15,908 questions total.
"""
subjects = {
'stem': [
'abstract_algebra', 'astronomy', 'college_biology',
'college_chemistry', 'college_computer_science',
'college_mathematics', 'college_physics',
'computer_security', 'conceptual_physics',
'electrical_engineering', 'elementary_mathematics',
'high_school_biology', 'high_school_chemistry',
'high_school_computer_science', 'high_school_mathematics',
'high_school_physics', 'high_school_statistics',
'machine_learning',
],
'humanities': [
'formal_logic', 'high_school_european_history',
'high_school_us_history', 'high_school_world_history',
'international_law', 'jurisprudence', 'logical_fallacies',
'moral_disputes', 'moral_scenarios', 'philosophy',
'prehistory', 'professional_law', 'world_religions',
],
'social_sciences': [
'econometrics', 'high_school_geography',
'high_school_government_and_politics',
'high_school_macroeconomics', 'high_school_microeconomics',
'high_school_psychology', 'human_sexuality',
'professional_psychology', 'public_relations',
'security_studies', 'sociology', 'us_foreign_policy',
],
'other': [
'anatomy', 'business_ethics', 'clinical_knowledge',
'college_medicine', 'global_facts', 'human_aging',
'management', 'marketing', 'medical_genetics',
'miscellaneous', 'nutrition', 'professional_accounting',
'professional_medicine', 'virology',
],
}
# Question format: multiple choice with 4 options (A, B, C, D)
example = {
'subject': 'college_mathematics',
'question': 'Let A be the set of all ordered pairs of integers (m, n) '
'such that 7m + 12n = 22. What is the greatest negative '
'number in the set B = {m + n : (m, n) in A}?',
'options': ['A) -5', 'B) -4', 'C) -3', 'D) -2'],
'answer': 'B',
}
return subjects, example
def mmlu_evaluation_protocol():
"""
MMLU evaluation: two common protocols that give DIFFERENT scores.
"""
protocols = {
'5_shot': {
'description': '5 examples provided before the question as context',
'format': 'Q: ... A: B\nQ: ... A: D\n...\nQ: {test question} A:',
'scoring': 'Compare model\'s next token probabilities for A, B, C, D',
'note': 'Most commonly reported in papers',
},
'0_shot': {
'description': 'No examples, just the question',
'format': 'Q: {test question}\nA:',
'scoring': 'Same probability comparison',
'note': 'Some papers (especially chat models) report this',
},
'cot_0_shot': {
'description': 'Ask model to think step-by-step before answering',
'format': 'Q: {question}\nLet\'s think step by step:',
'scoring': 'Extract answer from generated text',
'note': 'Higher scores for reasoning models, but harder to parse',
},
}
# CRITICAL: different protocols give different scores
# Example: GPT-4o 5-shot = 88.7%, GPT-4o 0-shot = 86.5%, GPT-4o CoT = 90.2%
# Papers cherry-pick the protocol that gives the best number
return protocols
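The probability-comparison scoring shared by the 5-shot and 0-shot protocols can be sketched as follows. Here `get_option_logprobs` is a hypothetical stand-in for a real model call that returns next-token log-probabilities for the four option letters; the rest is just an argmax and an accuracy count.

```python
def score_mmlu(items, get_option_logprobs):
    """Score MMLU-style items by comparing next-token log-probabilities
    for the option letters A-D (the 5-shot/0-shot protocols above).

    get_option_logprobs(prompt) -> {'A': lp, 'B': lp, 'C': lp, 'D': lp}
    is a placeholder for a real model call.
    """
    correct = 0
    for item in items:
        logprobs = get_option_logprobs(item['prompt'])
        prediction = max(logprobs, key=logprobs.get)  # argmax over A-D
        correct += prediction == item['answer']
    return correct / len(items)

# Toy usage with a fake "model" that always prefers option B:
fake = lambda prompt: {'A': -2.0, 'B': -0.1, 'C': -3.0, 'D': -2.5}
items = [{'prompt': 'Q: ... A:', 'answer': 'B'},
         {'prompt': 'Q: ... A:', 'answer': 'D'}]
print(score_mmlu(items, fake))  # 0.5
```

Note that CoT scoring cannot use this shortcut: it must parse the answer letter out of generated text, which is one reason the protocols produce different numbers.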
What MMLU Measures vs What It Misses
def mmlu_measures():
"""What MMLU actually tests."""
measures = {
'factual_recall': {
'weight': 'High — many questions test memorized facts',
'example': 'In what year was the Peace of Westphalia signed?',
'can_be_gamed': 'Yes — train on the source textbooks',
},
'exam_taking': {
'weight': 'High — it is literally a multiple-choice exam',
'example': 'Process of elimination among 4 options',
'can_be_gamed': 'Yes — models learn MC test-taking strategies',
},
'domain_breadth': {
'weight': 'High — 57 subjects cover broad knowledge',
'genuine_value': 'Tests whether model has been exposed to diverse domains',
'limitation': 'Each subject has only ~100-300 questions',
},
'reasoning': {
'weight': 'Low to moderate — some math/logic questions require reasoning',
'limitation': 'Most questions can be answered from memorization alone',
},
}
misses = {
'open_ended_generation': 'MMLU never tests free-form writing',
'multi_step_reasoning': 'Questions are single-turn, no chain of thought',
'real_world_application': 'Knowing facts != applying them',
'uncertainty_calibration': 'No partial credit, no "I don\'t know" option',
'temporal_knowledge': 'Many questions have outdated answers',
'multilingual': 'English only',
}
return measures, misses
MMLU Score Interpretation Guide
| Score Range | Interpretation | Models in Range |
|---|---|---|
| 25% | Random chance (4 options) | Untrained model |
| 40-50% | Basic knowledge, poor reasoning | Small models (1-3B) |
| 50-65% | Moderate knowledge, some domains strong | 7B models (Mistral, Llama 3 8B) |
| 65-80% | Strong broad knowledge | 70B models, GPT-3.5 |
| 80-90% | Expert-level across most domains | GPT-4, Claude 3.5, DeepSeek V3 |
| 90%+ | Near-saturation, benchmark ceiling | GPT-4o (with CoT), o1 |
HumanEval: Code Generation
What It Is
def humaneval_structure():
"""
HumanEval (Chen et al., 2021, OpenAI) is a code generation benchmark.
164 hand-written Python programming problems.
"""
# Example problem (simplified)
example = {
'task_id': 'HumanEval/0',
'prompt': '''from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
""" Check if in given list of numbers, are any two numbers closer to each
other than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
"""''',
'canonical_solution': ''' for idx, elem in enumerate(numbers):
for idx2, elem2 in enumerate(numbers):
if idx != idx2:
distance = abs(elem - elem2)
if distance < threshold:
return True
return False''',
'test': '''
def check(candidate):
assert candidate([1.0, 2.0, 3.0], 0.5) == False
assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False''',
}
statistics = {
'total_problems': 164,
'language': 'Python only',
'difficulty_distribution': {
'easy': '~40% (basic string/list manipulation)',
'medium': '~45% (moderate algorithms)',
'hard': '~15% (complex logic, edge cases)',
},
'average_solution_length': '~10 lines',
'test_cases_per_problem': '~7 (range: 3-20)',
}
return example, statistics
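The evaluation loop itself is simple: a HumanEval-style harness executes the prompt plus the model's completion, then runs the benchmark's `check()` function against the resulting callable. The sketch below captures the pass/fail logic; real harnesses sandbox the `exec` call and add timeouts, which are omitted here. The toy problem at the bottom is made up, not from HumanEval.

```python
def run_humaneval_problem(prompt, completion, test_code, entry_point):
    """Execute prompt + completion, then run the benchmark's check()
    against the resulting function. Returns True iff all asserts pass.
    WARNING: real harnesses sandbox this exec; this sketch does not.
    """
    namespace = {}
    try:
        exec(prompt + completion, namespace)        # define the function
        exec(test_code, namespace)                  # define check()
        namespace['check'](namespace[entry_point])  # run the asserts
        return True
    except Exception:
        return False

# Toy usage with a tiny invented problem:
prompt = "def add(a, b):\n"
completion = "    return a + b\n"
test_code = "def check(candidate):\n    assert candidate(1, 2) == 3\n"
print(run_humaneval_problem(prompt, completion, test_code, 'add'))  # True
```

A problem counts as passed only if every assertion in its `check()` holds, which is why a single edge-case assert can flip 0.6% of the final score.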
def humaneval_scoring():
"""
Scoring: pass@k metric.
Generate k samples, pass@k = fraction with at least 1 correct.
"""
def compute_pass_at_k(n, c, k):
"""
n: number of samples generated
c: number of correct samples
k: k in pass@k
"""
from math import comb
if n - c < k:
return 1.0
return 1.0 - comb(n - c, k) / comb(n, k)
# Most papers report pass@1 (single attempt)
# pass@1 is more meaningful than pass@10 for practical use
# (you want the model to get it right on the first try)
return compute_pass_at_k
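A quick sanity check of the estimator above: for k = 1 it reduces to the raw success rate c/n, and larger k inflates the number sharply, which is why "pass@1" and "pass@10" are not comparable.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator from Chen et al. (2021): the probability that
    # a random size-k subset of the n samples contains >= 1 correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(200, 50, 1))         # 0.25 (= c/n when k = 1)
print(pass_at_k(200, 50, 10) > 0.9)  # True: 10 draws almost always hit one
```

The same model, same samples, same problems: 25% pass@1 but well over 90% pass@10.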
What HumanEval Misses
def humaneval_limitations():
"""
HumanEval's limitations are severe enough that it should NOT
be the sole measure of coding ability.
"""
limitations = {
'too_small': {
'issue': '164 problems is statistically insufficient',
'impact': '1 problem = 0.6% of the score. Random luck matters.',
'example': 'A model that gets 1 extra problem right jumps 0.6%',
},
'python_only': {
'issue': 'Only tests Python — no C, Java, JavaScript, Rust, etc.',
'impact': 'Models optimized for Python score artificially high',
},
'simple_problems': {
'issue': 'Most problems are interview-easy level',
'impact': 'Ceiling effect — frontier models all score 80%+',
'example': 'Problems like "reverse a string" or "check if palindrome"',
},
'no_context': {
'issue': 'Each problem is self-contained — no existing codebase',
'impact': 'Does not test ability to work with large codebases, '
'read documentation, or debug existing code',
},
'no_design': {
'issue': 'No testing of software design, architecture, or API design',
'impact': 'Coding is more than implementing functions',
},
'contamination': {
'issue': 'Published since 2021 — likely in training data of most models',
'impact': 'Models may memorize solutions rather than solve problems',
'evidence': 'Some models produce exact canonical solutions',
},
}
return limitations
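The 'too_small' point can be quantified. Treating each problem as a Bernoulli trial, the standard error of an accuracy estimate is sqrt(p(1-p)/n), so HumanEval's 164 problems give roughly a two-point one-sigma band near 90% accuracy, while a benchmark the size of MMLU's ~14,000-question test set shrinks that to a fraction of a point.

```python
from math import sqrt

def accuracy_standard_error(p, n):
    # Binomial standard error of a benchmark accuracy estimate:
    # sqrt(p * (1 - p) / n), with p the true pass rate, n the problem count.
    return sqrt(p * (1 - p) / n)

# HumanEval: 164 problems at ~90% accuracy
print(round(accuracy_standard_error(0.90, 164), 4))    # 0.0234
# MMLU-scale test set (~14,042 questions) at the same accuracy
print(round(accuracy_standard_error(0.90, 14042), 4))  # 0.0025
```

A 2-point gap between two models on HumanEval is therefore within one standard error, i.e. statistically meaningless.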
MATH: Competition Mathematics
Structure and Difficulty
def math_benchmark_structure():
"""
MATH (Hendrycks et al., 2021): 12,500 competition math problems.
MATH-500: a representative subset of 500 problems for efficiency.
"""
categories = {
'algebra': {
'fraction': 0.22,
'topics': 'Equations, polynomials, sequences, inequalities',
'difficulty_range': 'AMC 8 to AIME',
},
'counting_and_probability': {
'fraction': 0.12,
'topics': 'Combinatorics, probability, expected value',
'difficulty_range': 'AMC 10 to AIME',
},
'geometry': {
'fraction': 0.10,
'topics': 'Euclidean geometry, coordinate geometry, 3D geometry',
'difficulty_range': 'AMC 10 to AIME',
},
'intermediate_algebra': {
'fraction': 0.18,
'topics': 'Complex numbers, logarithms, advanced polynomials',
'difficulty_range': 'AMC 12 to AIME',
},
'number_theory': {
'fraction': 0.15,
'topics': 'Divisibility, modular arithmetic, prime numbers',
'difficulty_range': 'AMC 10 to AIME',
},
'prealgebra': {
'fraction': 0.13,
'topics': 'Fractions, ratios, basic operations',
'difficulty_range': 'AMC 8',
},
'precalculus': {
'fraction': 0.10,
'topics': 'Trigonometry, vectors, matrices',
'difficulty_range': 'AMC 12 to AIME',
},
}
difficulty_levels = {
1: {'description': 'AMC 8 level', 'fraction': 0.20},
2: {'description': 'AMC 10 easy', 'fraction': 0.20},
3: {'description': 'AMC 10 hard / AMC 12 easy', 'fraction': 0.20},
4: {'description': 'AMC 12 hard / AIME easy', 'fraction': 0.20},
5: {'description': 'AIME medium to hard', 'fraction': 0.20},
}
example = {
'problem': 'How many of the integers between 1 and 1000, inclusive, '
'can be expressed as the difference of the squares of two '
'nonnegative integers?',
'answer': '750',
'level': 5,
'category': 'number_theory',
'solution': (
'A number n can be expressed as a^2 - b^2 = (a+b)(a-b) if and '
'only if n is not congruent to 2 mod 4, because a+b and a-b have '
'the same parity, so their product is either odd or divisible by 4. '
'Numbers congruent to 2 mod 4 in [1,1000]: 2, 6, 10, ..., 998, '
'which is 250 numbers. '
'So 1000 - 250 = 750.'
),
}
return categories, difficulty_levels, example
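The worked example above can be verified by brute force. This sketch counts the integers in [1, 1000] expressible as a difference of two squares of nonnegative integers and compares the count against the mod-4 closed form.

```python
def difference_of_squares_count(limit):
    """Count n in [1, limit] with n = a^2 - b^2 for nonnegative
    integers a, b. For each b, increase a until a^2 - b^2 > limit."""
    expressible = set()
    for b in range(0, limit + 1):
        a = b
        while a * a - b * b <= limit:
            n = a * a - b * b
            if n >= 1:
                expressible.add(n)
            a += 1
    return len(expressible)

print(difference_of_squares_count(1000))  # 750

# Closed form: every n except n congruent to 2 (mod 4) is expressible.
print(sum(1 for n in range(1, 1001) if n % 4 != 2))  # 750
```

The outer range up to `limit` is safe because any representation n = (a+b)(a-b) with n <= limit forces b <= (n-1)/2.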
What MATH Measures
def math_analysis():
"""
MATH tests genuine mathematical reasoning — one of the harder
benchmarks to game through memorization.
"""
measures = {
'multi_step_reasoning': {
'weight': 'Very high — most problems require 3-10 reasoning steps',
'genuine': True,
'note': 'This is the benchmark\'s primary value',
},
'symbolic_manipulation': {
'weight': 'High — algebra, simplification, equation solving',
'genuine': True,
},
'problem_decomposition': {
'weight': 'High — must break complex problems into subproblems',
'genuine': True,
},
'mathematical_knowledge': {
'weight': 'Moderate — need to know theorems, formulas, techniques',
'can_be_memorized': True,
},
}
misses = {
'proof_writing': 'MATH asks for numerical answers, not proofs',
'mathematical_creativity': 'Problems have known solution paths',
'applied_mathematics': 'Pure math — no physics, engineering, or stats applications',
'open_ended_exploration': 'Each problem has exactly one correct answer',
}
contamination_risk = {
'level': 'Moderate',
'reason': 'Problems are from published competitions (AMC, AIME). '
'These are widely available online.',
'mitigation': 'Some labs use MATH-500 (subset) or create new problems.',
}
return measures, misses, contamination_risk
MATH-500 Accuracy Over Time
MATH is approaching saturation. DeepSeek R1 scores 97.3%, leaving only ~14 problems wrong out of 500. Once benchmark scores exceed 95%, the remaining errors are often due to edge cases, ambiguous problem statements, or parsing issues rather than genuine mathematical inability. New, harder benchmarks are needed.
SWE-bench: Real Software Engineering
What It Is
def swe_bench_structure():
"""
SWE-bench (Jimenez et al., 2024): real GitHub issues from popular
Python repositories. The model must produce a patch that fixes the issue.
"""
benchmark_info = {
'original': {
'size': 2294,
'source': '12 popular Python repos',
'task': 'Given a GitHub issue description, produce a git diff '
'that resolves the issue',
'evaluation': 'Run the repository test suite on the patched code',
},
'swe_bench_lite': {
'size': 300,
'source': 'Subset of SWE-bench (simpler issues)',
'purpose': 'Faster evaluation, still meaningful',
},
'swe_bench_verified': {
'size': 500,
'source': 'Human-verified subset with clearer specifications',
'purpose': 'Reduce false negatives from ambiguous issues',
},
}
repositories = [
'django/django',
'scikit-learn/scikit-learn',
'matplotlib/matplotlib',
'sympy/sympy',
'astropy/astropy',
'pytest-dev/pytest',
'sphinx-doc/sphinx',
'pylint-dev/pylint',
'psf/requests',
'pallets/flask',
'mwaskom/seaborn',
'pydata/xarray',
]
# Example task (simplified)
example = {
'repo': 'django/django',
'issue_title': 'QuerySet.defer() doesn\'t clear deferred fields when chaining',
'issue_body': 'When calling .defer("field1").defer("field2"), only field2 '
'is deferred. Expected: both fields should be deferred.',
'expected_output': 'A git diff that fixes the QuerySet.defer() method',
'test_command': 'python -m pytest tests/defer/tests.py',
}
return benchmark_info, repositories, example
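SWE-bench's pass criterion can be written as a pure function: an instance counts as resolved only if every previously failing test (the dataset's FAIL_TO_PASS set) now passes and no previously passing test (PASS_TO_PASS) regresses. The field names follow the released dataset; the surrounding harness (applying the git diff, running the test command) is omitted from this sketch.

```python
def is_resolved(fail_to_pass_results, pass_to_pass_results):
    """SWE-bench resolution rule (sketch).

    fail_to_pass_results: {test_name: passed} for tests that failed
        before the patch and must pass after it (FAIL_TO_PASS).
    pass_to_pass_results: {test_name: passed} for tests that must
        keep passing, i.e. no regressions (PASS_TO_PASS).
    """
    return (all(fail_to_pass_results.values())
            and all(pass_to_pass_results.values()))

# Patch fixes the bug and breaks nothing:
print(is_resolved({'test_defer_chaining': True},
                  {'test_defer_basic': True, 'test_only': True}))   # True
# Patch fixes the bug but causes a regression:
print(is_resolved({'test_defer_chaining': True},
                  {'test_defer_basic': False, 'test_only': True}))  # False
```

The all-or-nothing rule is strict by design: a patch that fixes the issue while breaking an unrelated test scores zero.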
What SWE-bench Measures
def swe_bench_measures():
"""
SWE-bench is the most realistic coding benchmark available.
"""
measures = {
'real_codebase_navigation': {
'weight': 'Very high — must find relevant files in large repos',
'genuine': True,
'note': 'Django has 500K+ lines of code. Finding the bug '
'requires understanding the codebase structure.',
},
'bug_understanding': {
'weight': 'High — must comprehend the issue from natural language',
'genuine': True,
},
'patch_generation': {
'weight': 'High — must produce a correct, minimal patch',
'genuine': True,
'note': 'Not just writing new code, but modifying existing code',
},
'test_awareness': {
'weight': 'Moderate — patch must pass the test suite',
'genuine': True,
},
}
misses = {
'python_only': 'All repos are Python — no C++, Rust, Go, etc.',
'no_new_features': 'All tasks are bug fixes, no feature development',
'no_code_review': 'No testing of collaboration, review, or documentation',
'agent_scaffolding_matters': 'Results heavily depend on the scaffolding '
'(how the model is called, retrieval, etc.)',
}
return measures, misses
SWE-bench Verified Scores (2024-2025)
| Model/Agent | SWE-bench Verified | Method | Date |
|---|---|---|---|
| GPT-4 (raw) | 1.7% | Direct prompting | 2024-03 |
| Claude 3.5 Sonnet (raw) | 3.2% | Direct prompting | 2024-06 |
| SWE-Agent + GPT-4 | 12.5% | Agent with retrieval | 2024-04 |
| Claude 3.5 Sonnet (agent) | 33.4% | Anthropic's agent framework | 2024-10 |
| OpenAI o1 (agent) | 41.0% | Agent framework | 2024-12 |
| DeepSeek R1 (agent) | 49.2% | Agent framework | 2025-01 |
SWE-bench scores are heavily influenced by the agent scaffolding (file retrieval, multi-step reasoning, error recovery), not just the raw model: Claude 3.5 Sonnet scores 3.2% with direct prompting but 33.4% inside an agent framework. Reported numbers therefore measure the model-plus-scaffolding system, so always check which scaffolding was used before comparing scores.
GSM-8K: Grade School Math
def gsm8k_analysis():
"""
GSM-8K (Cobbe et al., 2021): 8,792 grade-school math word problems.
"""
structure = {
'size': 8792,
'difficulty': 'Grade school (ages 8-14)',
'steps': '2-8 reasoning steps per problem',
'operations': 'Addition, subtraction, multiplication, division',
'format': 'Natural language word problem -> numerical answer',
}
example = {
'problem': "Janet's ducks lay 16 eggs per day. "
'She eats three for breakfast and bakes muffins with four. '
'She sells the remainder at the farmers market for $2 per egg. '
'How much does she make per day?',
'solution': '16 - 3 - 4 = 9 eggs sold. 9 * $2 = $18.',
'answer': 18,
}
# GSM-8K is largely solved
current_scores = {
'GPT-4o': 95.8,
'DeepSeek V3': 91.6,
'Llama 3.1 405B': 96.8,
'DeepSeek R1': 97.3,
'o1': 95.8,
}
# The problem: GSM-8K is too easy for frontier models
# Most errors are parsing issues, not math errors
return structure, example, current_scores
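GSM-8K is typically scored by extracting the final number from the generated text and comparing it to the gold answer, which is exactly where the parsing errors mentioned above creep in. A minimal sketch (real harnesses handle the '#### answer' marker, units, and formatting variants more carefully):

```python
import re

def extract_final_number(text):
    """Return the last number in the text as a float, or None.
    GSM-8K solutions conventionally end with '#### <answer>'."""
    matches = re.findall(r'-?\d[\d,]*\.?\d*', text)
    if not matches:
        return None
    return float(matches[-1].replace(',', ''))

def gsm8k_correct(generated, gold):
    predicted = extract_final_number(generated)
    return predicted is not None and abs(predicted - gold) < 1e-6

print(gsm8k_correct('16 - 3 - 4 = 9 eggs sold. 9 * $2 = $18. #### 18', 18))  # True
print(gsm8k_correct('She makes $16 per day.', 18))                           # False
```

A correct chain of thought that ends with a stray trailing number ("...so the answer is 18 dollars per day, or 126 per week") is marked wrong by this extractor, illustrating why residual errors on saturated benchmarks are often parsing artifacts.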
Contamination: The Elephant in the Room
How Contamination Happens
def contamination_analysis():
"""
Benchmark contamination: training data contains benchmark questions.
This inflates scores without improving genuine capability.
"""
contamination_sources = {
'direct_inclusion': {
'how': 'Benchmark questions appear verbatim in web crawl data',
'example': 'MMLU questions posted on forums, Reddit, educational sites',
'detection': 'N-gram overlap between training data and benchmark',
'prevalence': 'High for MMLU, HumanEval — they are widely shared online',
},
'paraphrase_inclusion': {
'how': 'Slightly reworded versions of benchmark questions in training data',
'example': 'A blog post that discusses an MMLU question and its answer',
'detection': 'Difficult — paraphrases evade n-gram detection',
'prevalence': 'Very high — any discussion of benchmarks is contamination',
},
'synthetic_data_leakage': {
'how': 'Teacher model (GPT-4) generates training data that contains '
'patterns from benchmarks the teacher was evaluated on',
'example': 'GPT-4 generates a "textbook" that includes problems '
'similar to MATH competition problems',
'detection': 'Nearly impossible to detect',
'prevalence': 'Unknown but likely significant',
},
}
contamination_evidence = {
'humaneval': {
'evidence': 'Some models produce the EXACT canonical solution, '
'including variable names. This is strong evidence of memorization.',
'severity': 'High',
},
'mmlu': {
'evidence': 'Models score disproportionately well on questions that '
'appear frequently on educational websites vs rare questions.',
'severity': 'Moderate to high',
},
'math': {
'evidence': 'AMC/AIME problems are published and widely discussed. '
'Models perform better on older problems (more time to be '
'included in training data) than newer ones.',
'severity': 'Moderate',
},
'swe_bench': {
'evidence': 'Lower contamination risk because tasks are derived from '
'specific GitHub issues at specific commits. However, the '
'fixes were merged and are in the commit history.',
'severity': 'Low to moderate',
},
}
return contamination_sources, contamination_evidence
def decontamination_methods():
"""
How labs attempt to decontaminate training data.
"""
methods = {
'n_gram_filtering': {
'method': 'Remove training examples with N-gram overlap above threshold',
'typical_n': '13-gram or 8-gram',
'effectiveness': 'Catches verbatim copies, misses paraphrases',
},
'embedding_similarity': {
'method': 'Remove training examples with high semantic similarity '
'to benchmark questions (using sentence embeddings)',
'effectiveness': 'Catches paraphrases, but may remove legitimate '
'educational content on the same topic',
},
'canary_strings': {
'method': 'Embed unique identifiers in benchmarks to detect '
'if they appear in training data',
'effectiveness': 'Only works for future benchmarks, not existing ones',
},
'held_out_evaluation': {
'method': 'Create new benchmark questions that have never been '
'published and evaluate on those',
'effectiveness': 'Best approach — but expensive to create and '
'becomes contaminated once published',
},
}
return methods
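The n-gram filtering method can be sketched with word-level shingles. The `n=8` default follows the smaller threshold mentioned above; real pipelines hash the n-grams and normalize whitespace, punctuation, and case before comparing, which this sketch only gestures at with `lower()`.

```python
def word_ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(training_doc, benchmark_questions, n=8):
    """Flag a training document if it shares any word n-gram with any
    benchmark question. Catches verbatim copies; misses paraphrases."""
    bench_grams = set()
    for q in benchmark_questions:
        bench_grams |= word_ngrams(q, n)
    return bool(word_ngrams(training_doc, n) & bench_grams)

question = ('Let A be the set of all ordered pairs of integers (m, n) '
            'such that 7m + 12n = 22.')
copied = 'A forum post: ' + question
paraphrase = 'Consider integer pairs solving 7m + 12n = 22.'
print(is_contaminated(copied, [question]))      # True
print(is_contaminated(paraphrase, [question]))  # False
```

The second call demonstrates the stated weakness: the paraphrase leaks the same problem but shares no 8-gram, so n-gram filtering lets it through.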
Better Alternatives
Emerging Benchmarks
def better_benchmarks():
"""
Newer benchmarks designed to address the limitations of MMLU/HumanEval/MATH.
"""
alternatives = {
'GPQA': {
'full_name': 'Graduate-Level Google-Proof Q&A',
'size': 448,
'format': 'Multiple choice (expert-level science)',
'advantage': 'Questions require PhD-level domain expertise. '
'Even domain experts score ~65%. Much harder to contaminate.',
'limitation': 'Small (448 questions). May saturate eventually.',
'current_frontier': '~70-78% (o1)',
},
'MMLU_Pro': {
'full_name': 'MMLU-Pro (harder MMLU)',
'size': 12032,
'format': 'Multiple choice with 10 options (not 4)',
'advantage': 'Harder questions, more options reduce guessing luck. '
'Less contaminated (newer benchmark).',
'limitation': 'Still multiple choice.',
'current_frontier': '~72-80%',
},
'LiveCodeBench': {
'full_name': 'LiveCodeBench',
'format': 'Continuously updated code problems from LeetCode/CodeForces',
'advantage': 'New problems added regularly — cannot be in training data. '
'Timestamps allow measuring performance over time.',
'limitation': 'Algorithm-focused — not real-world software engineering.',
'current_frontier': '~55-70%',
},
'AIME_2025': {
'full_name': 'AIME 2025 (competition math)',
'size': 30,
'format': 'Open-ended numerical answer',
'advantage': 'New every year — guaranteed zero contamination. '
'Genuinely hard (even for strong models).',
'limitation': 'Only 30 problems. High variance.',
'current_frontier': '~70-85% for reasoning models',
},
'Chatbot_Arena': {
'full_name': 'LMSYS Chatbot Arena (Elo ratings)',
'format': 'Human preference: users compare two model outputs',
'advantage': 'Most realistic evaluation — real users, real tasks. '
'Cannot be gamed through training data.',
'limitation': 'Biased toward fluency/style over accuracy. '
'Expensive to run at scale.',
'current_frontier': 'GPT-4o and Claude 3.5 Sonnet ~1270 Elo',
},
}
return alternatives
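Chatbot Arena ratings come from pairwise human votes. The leaderboard actually fits a Bradley-Terry model offline, but the online Elo update below conveys the idea: the expected score follows a logistic curve in the rating gap, and each vote moves both ratings by equal and opposite amounts.

```python
def elo_update(rating_a, rating_b, winner, k=32):
    """One Elo update after a single pairwise comparison.
    winner: 'a', 'b', or 'tie'. k is the update step size."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {'a': 1.0, 'b': 0.0, 'tie': 0.5}[winner]
    delta = k * (score_a - expected_a)
    # Zero-sum: A gains exactly what B loses.
    return rating_a + delta, rating_b - delta

r_a, r_b = elo_update(1270.0, 1250.0, 'a')
print(r_a > 1270.0, r_b < 1250.0)  # True True
```

A useful property for reading the leaderboard: a 400-point gap means the stronger model is expected to win about 10 times out of 11, so the ~20-point gaps between frontier models correspond to near-coin-flip preferences.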
How to Read Benchmark Numbers
def benchmark_reading_guide():
"""
Practical guide for interpreting benchmark claims in papers.
"""
checklist = {
'check_evaluation_protocol': {
'what': 'Is it 0-shot, 5-shot, or CoT? What prompt template?',
'why': 'Different protocols give different scores. '
'Papers choose the protocol that gives the best number.',
'example': 'MMLU 5-shot vs 0-shot can differ by 2-5%.',
},
'check_sample_count': {
'what': 'How many samples were generated per problem?',
'why': 'pass@1 with temperature=0 vs pass@1 with temperature=0.8 '
'and majority voting give very different results.',
'example': 'HumanEval pass@1 (greedy) vs pass@1 (best of 100) '
'can differ by 10-20%.',
},
'check_contamination_analysis': {
'what': 'Did the paper analyze training data overlap?',
'why': 'Without this, scores may be inflated.',
'example': 'Many papers skip contamination analysis entirely.',
},
'compare_on_same_protocol': {
'what': 'Are you comparing apples to apples?',
'why': 'Model A on 5-shot MMLU vs Model B on 0-shot MMLU is meaningless.',
'example': 'Use LMSYS Arena or a common evaluation framework.',
},
'check_benchmark_date': {
'what': 'When was the benchmark published?',
'why': 'Older benchmarks are more contaminated.',
'example': 'HumanEval (2021) is likely in most training sets. '
'AIME 2025 (Jan 2025) is guaranteed fresh.',
},
'look_at_multiple_benchmarks': {
'what': 'Does the model perform consistently across benchmarks?',
'why': 'A model that scores well on MMLU but poorly on GPQA '
'may have memorized MMLU.',
'example': 'DeepSeek V3 scores consistently across all benchmarks.',
},
}
return checklist
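The sample-count caveat in the checklist can be made concrete: majority voting over several sampled answers (self-consistency) often beats a single greedy sample, so "pass@1" without sampling details is ambiguous. A minimal sketch of the voting step:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among sampled generations.
    Ties break toward the earliest-seen answer, since Counter
    preserves insertion order for equal counts."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled answers to the same math problem:
samples = ['18', '18', '16', '18', '15']
print(majority_vote(samples))  # 18
```

A paper reporting "pass@1 = 96%" with temperature 0.8 and 64-way majority voting is describing a very different, and more expensive, system than one reporting greedy single-sample accuracy.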
Benchmark Reliability Assessment
| Benchmark | Contamination Risk | Ceiling Effect | Measures Real Ability | Recommended |
|---|---|---|---|---|
| MMLU | High | Approaching | Moderate (factual recall) | Use MMLU-Pro instead |
| HumanEval | Very high | At ceiling | Low (too simple) | Use LiveCodeBench |
| MATH | Moderate | Approaching | High (genuine reasoning) | Use AIME 2025 |
| GSM-8K | High | At ceiling | Low (too easy) | Deprecated |
| SWE-bench | Low | Far from ceiling | Very high (realistic) | Recommended |
| GPQA | Low | Far from ceiling | High (expert knowledge) | Recommended |
| Chatbot Arena | None | N/A | High (real preferences) | Gold standard |
The benchmark landscape is shifting from static, potentially contaminated tests (MMLU, HumanEval) toward dynamic, harder, and more realistic evaluations (SWE-bench, LiveCodeBench, Chatbot Arena). When evaluating a model, look at the newer benchmarks first. If a paper only reports MMLU and HumanEval, ask why they are not showing SWE-bench or GPQA. The most informative evaluations are those designed to resist contamination and test genuine capability rather than memorization.