Every model claims state-of-the-art results on benchmarks. But benchmarks are increasingly gamed — training data leaks into evaluation sets, models memorize answers rather than understanding concepts, and the metrics themselves measure narrow capabilities that don’t reflect real-world usefulness. This post covers how to build evaluations that actually work.
## The Contamination Problem
When evaluation data appears in training data, the model memorizes answers rather than demonstrating capability. Detection methods:
```python
import hashlib

def get_ngrams(text, n=8):
    """Return word-level n-grams, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def detect_contamination(eval_examples, training_data_hashes):
    """Check if eval examples appear in training data.

    training_data_hashes maps each training document's SHA-256 hash
    to that document's set of word-level 8-grams.
    """
    contaminated = []
    for example in eval_examples:
        # Exact match: hash the normalized question text
        question_hash = hashlib.sha256(
            example["question"].lower().strip().encode()
        ).hexdigest()
        if question_hash in training_data_hashes:
            contaminated.append(example)
            continue
        # Near match: check n-gram overlap against each training document
        question_ngrams = set(get_ngrams(example["question"], n=8))
        if not question_ngrams:  # question shorter than 8 words
            continue
        for training_ngrams in training_data_hashes.values():
            overlap = len(question_ngrams & training_ngrams) / len(question_ngrams)
            if overlap > 0.8:
                contaminated.append(example)
                break
    return contaminated
```
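The index passed to the detector maps each training document's hash to its n-gram set. A minimal sketch of building that index (the helper name `build_training_index` is my own; `get_ngrams` is repeated so the sketch runs standalone):

```python
import hashlib

def get_ngrams(text, n=8):
    """Return word-level n-grams, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_training_index(training_docs, n=8):
    """Map each document's SHA-256 hash to its set of word n-grams."""
    index = {}
    for doc in training_docs:
        doc_hash = hashlib.sha256(doc.lower().strip().encode()).hexdigest()
        index[doc_hash] = set(get_ngrams(doc, n=n))
    return index
```

In practice this index is built once over the training corpus and reused across evaluation runs, since hashing and n-gramming the training data is the expensive part.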
**Contamination Detection Results (Representative)**
| Benchmark | Estimated Contamination | Impact on Scores | Detection Method |
|---|---|---|---|
| MMLU | 3-8% of questions | +2-5% score inflation | N-gram overlap |
| HumanEval | 15-25% (widely available) | +5-15% inflation | Exact code match |
| GSM8K | 10-20% | +3-8% inflation | Question fingerprinting |
| Custom eval (never published) | 0% | No inflation (ground truth) | N/A |
Published benchmarks are contaminated to varying degrees. The only reliable evaluation: create a custom dataset from fresh examples that have never been published online. This is expensive (human annotation) but gives true signal. Companies like Anthropic and OpenAI maintain private evaluation sets for exactly this reason.
## Building a Custom Evaluation
### Step 1: Define Capability Dimensions
```python
CAPABILITY_DIMENSIONS = {
    "factual_recall": {
        "description": "Can the model recall specific facts?",
        "example_type": "question-answer",
        "scoring": "exact_match",
    },
    "reasoning": {
        "description": "Can the model solve multi-step problems?",
        "example_type": "problem-solution",
        "scoring": "answer_extraction + exact_match",
    },
    "code_generation": {
        "description": "Can the model write correct code?",
        "example_type": "spec-to-code",
        "scoring": "execution_pass_rate",
    },
    "instruction_following": {
        "description": "Does the model follow complex instructions?",
        "example_type": "constraint-satisfaction",
        "scoring": "constraint_check_pass_rate",
    },
}
```
### Step 2: Generate Examples
For each dimension, create 100-500 examples with ground truth:
```python
import random

def create_reasoning_example():
    """Generate a multi-step reasoning problem with a verifiable answer."""
    # Use templates with fresh random values to avoid contamination
    a = random.randint(50, 99)
    b = random.randint(10, a - 1)  # ensure Alice never ends up negative
    c = random.randint(2, 9)
    question = (f"Alice has {a} apples. She gives {b} to Bob. "
                f"Then she buys {c} times as many as she has left. "
                f"How many apples does Alice have now?")
    remaining = a - b
    answer = remaining + remaining * c
    return {"question": question, "answer": str(answer), "dimension": "reasoning"}
```
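Grouping generated examples by dimension produces the `eval_dataset` structure that Step 3 consumes. A sketch, assuming one generator function per dimension (the template is repeated inline so the sketch runs standalone; `build_eval_dataset` is my own name):

```python
import random

def make_reasoning_example():
    # Same templated problem as above, duplicated for self-containment
    a = random.randint(50, 99)
    b = random.randint(10, a - 1)
    c = random.randint(2, 9)
    question = (f"Alice has {a} apples. She gives {b} to Bob. "
                f"Then she buys {c} times as many as she has left. "
                f"How many apples does Alice have now?")
    remaining = a - b
    return {"question": question, "answer": str(remaining + remaining * c),
            "dimension": "reasoning"}

def build_eval_dataset(generators, examples_per_dimension=200):
    """Map each dimension name to a list of freshly generated examples."""
    return {
        dimension: [gen() for _ in range(examples_per_dimension)]
        for dimension, gen in generators.items()
    }

eval_dataset = build_eval_dataset({"reasoning": make_reasoning_example})
```

Because the values are regenerated each run, the dataset can be rebuilt from scratch whenever there is any suspicion it has leaked.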
### Step 3: Run Evaluation
```python
import re

def evaluate_model(model, eval_dataset):
    """Run evaluation and compute per-dimension accuracy."""
    results = {}
    for dimension, examples in eval_dataset.items():
        correct = 0
        for example in examples:
            response = model.generate(example["question"], max_tokens=512)
            extracted_answer = extract_answer(response)
            if score(extracted_answer, example["answer"],
                     example.get("scoring", "exact_match")):
                correct += 1
        results[dimension] = correct / len(examples)
    return results

def extract_answer(response):
    """Extract the final answer from a model response."""
    # Look for patterns like "The answer is X", "therefore X",
    # or a trailing number
    patterns = [
        r"(?:the answer is|answer:)\s*(\S+)",
        r"(?:therefore|so|thus)[,:]?\s*(\S+)",
        r"(\d+)\s*$",  # last number in the response
    ]
    for pattern in patterns:
        match = re.search(pattern, response.lower())
        if match:
            return match.group(1)
    words = response.strip().split()
    return words[-1] if words else ""  # last word as a fallback
```
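The `score` helper used in `evaluate_model` is left undefined above. A minimal sketch that dispatches on the scoring key, implementing only the exact-match modes (execution and constraint scoring need a sandboxed runner and are out of scope here):

```python
def score(predicted, expected, method="exact_match"):
    """Score one prediction; assumes answers are already extracted strings."""
    if predicted is None:
        return False
    if method in ("exact_match", "answer_extraction + exact_match"):
        # Normalize whitespace and case before comparing
        return predicted.strip().lower() == expected.strip().lower()
    raise NotImplementedError(
        f"scoring method {method!r} needs an execution sandbox "
        "or constraint checker"
    )
```

Raising on unimplemented methods is deliberate: silently returning `False` for a whole dimension would look like a 0% capability score rather than a harness gap.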
For most capability dimensions, 500 examples give a 95% confidence interval within roughly ±3-4% of the observed score. Below 100 examples, variance is too high for meaningful comparison; above 1,000, the marginal precision gain is not worth the evaluation cost.
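These sample-size rules of thumb follow from the binomial confidence interval. Under the normal approximation, the 95% half-width is 1.96·sqrt(p(1−p)/n); at n=500 this is about ±4.4% in the worst case (observed accuracy p=0.5), narrowing to about ±3.5% at p=0.8:

```python
import math

def ci_half_width(p, n, z=1.96):
    """95% confidence half-width for a binomial proportion (normal approx.)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 500, 1000):
    # p = 0.5 is the worst case: it maximizes p * (1 - p)
    print(n, round(ci_half_width(0.5, n), 3))
```

The interval shrinks with the square root of n, which is why going from 500 to 1,000 examples buys comparatively little precision.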
## Reviewer Agent Validation
Challenge: Implement a contamination detector that takes an evaluation question and a set of training document hashes, returning True if the question is likely contaminated (more than 80% 8-gram overlap with any training document).
Expected:
```python
def is_contaminated(question, training_ngram_sets, n=8, threshold=0.8):
    words = question.lower().split()
    question_ngrams = set(
        " ".join(words[i:i+n]) for i in range(len(words) - n + 1)
    )
    if not question_ngrams:
        return False
    for training_ngrams in training_ngram_sets:
        overlap = len(question_ngrams & training_ngrams) / len(question_ngrams)
        if overlap > threshold:
            return True
    return False
```