Every model claims state-of-the-art results on benchmarks. But benchmarks are increasingly gamed — training data leaks into evaluation sets, models memorize answers rather than understanding concepts, and the metrics themselves measure narrow capabilities that don’t reflect real-world usefulness. This post covers how to build evaluations that actually work.

The Contamination Problem

When evaluation data appears in training data, the model memorizes answers rather than demonstrating capability. Detection methods:

import hashlib

def get_ngrams(text, n=8):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def detect_contamination(eval_examples, training_hashes, training_ngram_sets):
    """Check if eval examples appear in training data.

    training_hashes: set of SHA-256 digests of normalized training documents.
    training_ngram_sets: list of word n-gram sets, one per training document.
    """
    contaminated = []
    for example in eval_examples:
        # Exact match: hash the normalized question text
        question_hash = hashlib.sha256(
            example["question"].lower().strip().encode()
        ).hexdigest()
        if question_hash in training_hashes:
            contaminated.append(example)
            continue

        # Near match: check n-gram overlap against each training document
        question_ngrams = get_ngrams(example["question"], n=8)
        if not question_ngrams:
            continue  # question shorter than n words: no overlap signal
        for training_ngrams in training_ngram_sets:
            overlap = len(question_ngrams & training_ngrams) / len(question_ngrams)
            if overlap > 0.8:
                contaminated.append(example)
                break

    return contaminated
📊 Contamination Detection Results (Representative)

| Benchmark | Estimated Contamination | Impact on Scores | Detection Method |
|---|---|---|---|
| MMLU | 3-8% of questions | +2-5% score inflation | N-gram overlap |
| HumanEval | 15-25% (widely available) | +5-15% inflation | Exact code match |
| GSM8K | 10-20% | +3-8% inflation | Question fingerprinting |
| Custom eval (never published) | 0% | No inflation (ground truth) | N/A |

Note: Contamination rates vary by model and training data source; models trained on broad web crawls tend to show higher contamination.
⚠️ The Only Trustworthy Benchmark Is One You Built Yourself

Published benchmarks are contaminated to varying degrees. The only reliable evaluation: create a custom dataset from fresh examples that have never been published online. This is expensive (human annotation) but gives true signal. Companies like Anthropic and OpenAI maintain private evaluation sets for exactly this reason.

Building a Custom Evaluation

Step 1: Define Capability Dimensions

CAPABILITY_DIMENSIONS = {
    "factual_recall": {
        "description": "Can the model recall specific facts?",
        "example_type": "question-answer",
        "scoring": "exact_match",
    },
    "reasoning": {
        "description": "Can the model solve multi-step problems?",
        "example_type": "problem-solution",
        "scoring": "answer_extraction + exact_match",
    },
    "code_generation": {
        "description": "Can the model write correct code?",
        "example_type": "spec-to-code",
        "scoring": "execution_pass_rate",
    },
    "instruction_following": {
        "description": "Does the model follow complex instructions?",
        "example_type": "constraint-satisfaction",
        "scoring": "constraint_check_pass_rate",
    },
}
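Before examples enter the eval set, it is worth checking each one against this registry so that no example carries an unregistered dimension or is missing its ground truth. A minimal sketch (`validate_example` and the trimmed dimension set are illustrative, not part of the registry above):

```python
# Trimmed copy of the registered dimension names, for illustration
DIMENSIONS = {"factual_recall", "reasoning", "code_generation", "instruction_following"}

def validate_example(example):
    """Reject examples with an unregistered dimension or missing ground truth."""
    if example.get("dimension") not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {example.get('dimension')!r}")
    if not example.get("question") or "answer" not in example:
        raise ValueError("example needs a question and a ground-truth answer")
    return True
```

Failing fast here is cheaper than discovering mislabeled examples after an evaluation run.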

Step 2: Generate Examples

For each dimension, create 100-500 examples with ground truth:

import random

def create_reasoning_example():
    """Generate a multi-step reasoning problem with a verifiable answer."""
    # Use randomized templates to avoid contamination; keep b < a so the
    # intermediate apple count stays positive.
    a = random.randint(20, 99)
    b = random.randint(10, a - 1)
    c = random.randint(2, 9)
    question = (f"Alice has {a} apples. She gives {b} to Bob. "
                f"Then she buys {c} times as many as she has left. "
                f"How many apples does Alice have now?")
    remaining = a - b
    answer = remaining + remaining * c
    return {"question": question, "answer": str(answer), "dimension": "reasoning"}
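Because templates draw random values, repeated calls can produce duplicate questions, so the generation loop should deduplicate. A sketch under that assumption (`make_example` is a hypothetical stand-in for a template generator like the one above; the 200-example target is arbitrary):

```python
import random

def make_example():
    # Hypothetical stand-in for a template generator
    a, b = random.randint(20, 99), random.randint(2, 9)
    return {"question": f"What is {a} * {b}?", "answer": str(a * b), "dimension": "reasoning"}

def build_dataset(generator, target=200, max_tries=10_000):
    """Draw from a template generator until `target` unique questions are collected."""
    seen, examples = set(), []
    for _ in range(max_tries):
        ex = generator()
        if ex["question"] in seen:
            continue  # templates can collide; keep only unique questions
        seen.add(ex["question"])
        examples.append(ex)
        if len(examples) >= target:
            break
    return examples
```

The `max_tries` cap prevents an infinite loop when a template's value space is smaller than the requested dataset size.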

Step 3: Run Evaluation

def evaluate_model(model, eval_dataset):
    """Run the evaluation and compute per-dimension accuracy."""
    results = {}
    for dimension, examples in eval_dataset.items():
        if not examples:
            continue  # avoid division by zero on an empty dimension
        correct = 0
        for example in examples:
            response = model.generate(example["question"], max_tokens=512)
            extracted_answer = extract_answer(response)
            # score() dispatches on the example's scoring method
            # (exact match, execution pass rate, constraint checks, ...)
            if score(extracted_answer, example["answer"], example.get("scoring", "exact_match")):
                correct += 1
        results[dimension] = correct / len(examples)
    return results
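The `score` helper called above is never defined; a minimal sketch of the dispatcher, assuming the scoring names from the capability registry (only the string-comparison paths are filled in, since execution- and constraint-based scoring depend on a sandbox and checkers not shown here):

```python
def score(predicted, expected, method="exact_match"):
    """Return True if the predicted answer passes under the named scoring method."""
    if method in ("exact_match", "answer_extraction + exact_match"):
        # Extraction already happened upstream; compare normalized strings
        return predicted.strip().lower() == expected.strip().lower()
    # execution_pass_rate and constraint_check_pass_rate need external infrastructure
    raise NotImplementedError(f"scoring method not implemented in this sketch: {method}")
```

Keeping scoring behind one dispatcher means new dimensions only need a new method name, not changes to the evaluation loop.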

import re

def extract_answer(response):
    """Extract the final answer from a model response."""
    # Look for explicit markers like "The answer is X", then fall back to
    # the last number or last word in the response.
    patterns = [
        r"(?:the answer is|answer:)\s*(\S+)",
        r"(?:therefore|so|thus)[,:]?\s*(\S+)",
        r"(\d+)\s*$",  # last number in the response
    ]
    for pattern in patterns:
        match = re.search(pattern, response.lower())
        if match:
            return match.group(1).strip(".,;:!?")  # drop trailing punctuation
    words = response.strip().split()
    return words[-1].strip(".,;:!?") if words else ""  # guard against empty responses
💡 The 500-Example Sweet Spot

For most capability dimensions, 500 examples provide statistically meaningful results: a 95% confidence interval within roughly ±4% even in the worst case, and tighter at accuracies far from 50%. Below 100 examples, variance is too high for meaningful comparison; above 1,000, the marginal precision gain is rarely worth the evaluation cost.
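These interval widths follow from the normal approximation for a binomial proportion, and are easy to check directly:

```python
import math

def ci_halfwidth(accuracy, n, z=1.96):
    """Half-width of a 95% normal-approximation CI for accuracy measured on n examples."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)
```

At accuracy 0.5 (the worst case) the half-width is about ±9.8% for n=100, ±4.4% for n=500, and ±3.1% for n=1000, which is why going past 1,000 examples buys little extra precision.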

Reviewer Agent Validation

Challenge: Implement a contamination detector that takes an evaluation question and a set of training document hashes, returning True if the question is likely contaminated (more than 80% 8-gram overlap with any training document).

Expected:

def is_contaminated(question, training_ngram_sets, n=8, threshold=0.8):
    """Return True if the question's word n-grams overlap any training document above the threshold."""
    words = question.lower().split()
    question_ngrams = set(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    if not question_ngrams:
        return False  # question shorter than n words: no overlap signal
    for training_ngrams in training_ngram_sets:
        overlap = len(question_ngrams & training_ngrams) / len(question_ngrams)
        if overlap > threshold:
            return True
    return False
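To drive this detector at scale, the training corpus is preprocessed once into per-document n-gram sets and reused across every eval question. A sketch of that step (the helper name is illustrative):

```python
def build_ngram_sets(documents, n=8):
    """Precompute one word n-gram set per training document."""
    ngram_sets = []
    for doc in documents:
        words = doc.lower().split()
        ngram_sets.append({
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        })
    return ngram_sets
```

Keeping one set per document (rather than one global set) is what lets the detector attribute an overlap to a specific training document.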