Every model claims state-of-the-art results on benchmarks. But benchmarks are increasingly gamed — training data leaks into evaluation sets, models memorize answers rather than understanding concepts, and the metrics themselves measure narrow capabilities that don’t reflect real-world usefulness. This post covers how to build evaluations that actually work.
## The Contamination Problem
When evaluation data appears in training data, the model memorizes answers rather than demonstrating capability. Detection methods:
```python
import hashlib

def get_ngrams(text, n=8):
    """Return word-level n-grams, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def detect_contamination(eval_examples, training_data_hashes):
    """Check if eval examples appear in training data.

    training_data_hashes maps each training document's SHA-256 hash
    to that document's set of word-level 8-grams.
    """
    contaminated = []
    for example in eval_examples:
        # Exact match: hash the normalized question text
        question_hash = hashlib.sha256(
            example["question"].lower().strip().encode()
        ).hexdigest()
        if question_hash in training_data_hashes:
            contaminated.append(example)
            continue
        # Near match: check n-gram overlap against each training document
        question_ngrams = set(get_ngrams(example["question"], n=8))
        if not question_ngrams:  # question shorter than 8 words
            continue
        for training_ngrams in training_data_hashes.values():
            overlap = len(question_ngrams & training_ngrams) / len(question_ngrams)
            if overlap > 0.8:
                contaminated.append(example)
                break
    return contaminated
```
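The index passed to the detector maps each training document's hash to its n-gram set. A minimal sketch of building that index (the helper name `build_training_index` is my own; `get_ngrams` is repeated so the sketch runs standalone):

```python
import hashlib

def get_ngrams(text, n=8):
    """Return word-level n-grams, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_training_index(training_docs, n=8):
    """Map each document's SHA-256 hash to its set of word n-grams."""
    index = {}
    for doc in training_docs:
        doc_hash = hashlib.sha256(doc.lower().strip().encode()).hexdigest()
        index[doc_hash] = set(get_ngrams(doc, n=n))
    return index
```

In practice this index is built once over the training corpus and reused across evaluation runs, since hashing and n-gramming the training data is the expensive part.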
**Contamination Detection Results (Representative)**
| Benchmark | Estimated Contamination | Impact on Scores | Detection Method |
|---|---|---|---|
| MMLU | 3-8% of questions | +2-5% score inflation | N-gram overlap |
| HumanEval | 15-25% (widely available) | +5-15% inflation | Exact code match |
| GSM8K | 10-20% | +3-8% inflation | Question fingerprinting |
| Custom eval (never published) | 0% | No inflation (ground truth) | N/A |
Published benchmarks are contaminated to varying degrees. The only reliable evaluation: create a custom dataset from fresh examples that have never been published online. This is expensive (human annotation) but gives true signal. Companies like Anthropic and OpenAI maintain private evaluation sets for exactly this reason.
## Building a Custom Evaluation
### Step 1: Define Capability Dimensions
```python
CAPABILITY_DIMENSIONS = {
    "factual_recall": {
        "description": "Can the model recall specific facts?",
        "example_type": "question-answer",
        "scoring": "exact_match",
    },
    "reasoning": {
        "description": "Can the model solve multi-step problems?",
        "example_type": "problem-solution",
        "scoring": "answer_extraction + exact_match",
    },
    "code_generation": {
        "description": "Can the model write correct code?",
        "example_type": "spec-to-code",
        "scoring": "execution_pass_rate",
    },
    "instruction_following": {
        "description": "Does the model follow complex instructions?",
        "example_type": "constraint-satisfaction",
        "scoring": "constraint_check_pass_rate",
    },
}
```
### Step 2: Generate Examples
For each dimension, create 100-500 examples with ground truth:
```python
import random

def create_reasoning_example():
    """Generate a multi-step reasoning problem with a verifiable answer."""
    # Use templates with fresh random values to avoid contamination
    a = random.randint(50, 99)
    b = random.randint(10, a - 1)  # ensure Alice never ends up negative
    c = random.randint(2, 9)
    question = (f"Alice has {a} apples. She gives {b} to Bob. "
                f"Then she buys {c} times as many as she has left. "
                f"How many apples does Alice have now?")
    remaining = a - b
    answer = remaining + remaining * c
    return {"question": question, "answer": str(answer), "dimension": "reasoning"}
```
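Grouping generated examples by dimension produces the `eval_dataset` structure that Step 3 consumes. A sketch, assuming one generator function per dimension (the template is repeated inline so the sketch runs standalone; `build_eval_dataset` is my own name):

```python
import random

def make_reasoning_example():
    # Same templated problem as above, duplicated for self-containment
    a = random.randint(50, 99)
    b = random.randint(10, a - 1)
    c = random.randint(2, 9)
    question = (f"Alice has {a} apples. She gives {b} to Bob. "
                f"Then she buys {c} times as many as she has left. "
                f"How many apples does Alice have now?")
    remaining = a - b
    return {"question": question, "answer": str(remaining + remaining * c),
            "dimension": "reasoning"}

def build_eval_dataset(generators, examples_per_dimension=200):
    """Map each dimension name to a list of freshly generated examples."""
    return {
        dimension: [gen() for _ in range(examples_per_dimension)]
        for dimension, gen in generators.items()
    }

eval_dataset = build_eval_dataset({"reasoning": make_reasoning_example})
```

Because the values are regenerated each run, the dataset can be rebuilt from scratch whenever there is any suspicion it has leaked.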
### Step 3: Run Evaluation
```python
import re

def evaluate_model(model, eval_dataset):
    """Run evaluation and compute per-dimension accuracy."""
    results = {}
    for dimension, examples in eval_dataset.items():
        correct = 0
        for example in examples:
            response = model.generate(example["question"], max_tokens=512)
            extracted_answer = extract_answer(response)
            if score(extracted_answer, example["answer"],
                     example.get("scoring", "exact_match")):
                correct += 1
        results[dimension] = correct / len(examples)
    return results

def extract_answer(response):
    """Extract the final answer from a model response."""
    # Look for patterns like "The answer is X", "therefore X",
    # or a trailing number
    patterns = [
        r"(?:the answer is|answer:)\s*(\S+)",
        r"(?:therefore|so|thus)[,:]?\s*(\S+)",
        r"(\d+)\s*$",  # last number in the response
    ]
    for pattern in patterns:
        match = re.search(pattern, response.lower())
        if match:
            return match.group(1)
    words = response.strip().split()
    return words[-1] if words else ""  # last word as a fallback
```
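The `score` helper used in `evaluate_model` is left undefined above. A minimal sketch that dispatches on the scoring key, implementing only the exact-match modes (execution and constraint scoring need a sandboxed runner and are out of scope here):

```python
def score(predicted, expected, method="exact_match"):
    """Score one prediction; assumes answers are already extracted strings."""
    if predicted is None:
        return False
    if method in ("exact_match", "answer_extraction + exact_match"):
        # Normalize whitespace and case before comparing
        return predicted.strip().lower() == expected.strip().lower()
    raise NotImplementedError(
        f"scoring method {method!r} needs an execution sandbox "
        "or constraint checker"
    )
```

Raising on unimplemented methods is deliberate: silently returning `False` for a whole dimension would look like a 0% capability score rather than a harness gap.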
For most capability dimensions, 500 examples give a 95% confidence interval within roughly ±3-4% of the observed score. Below 100 examples, variance is too high for meaningful comparison; above 1,000, the marginal precision gain is not worth the evaluation cost.
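These sample-size rules of thumb follow from the binomial confidence interval. Under the normal approximation, the 95% half-width is 1.96·sqrt(p(1−p)/n); at n=500 this is about ±4.4% in the worst case (observed accuracy p=0.5), narrowing to about ±3.5% at p=0.8:

```python
import math

def ci_half_width(p, n, z=1.96):
    """95% confidence half-width for a binomial proportion (normal approx.)."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 500, 1000):
    # p = 0.5 is the worst case: it maximizes p * (1 - p)
    print(n, round(ci_half_width(0.5, n), 3))
```

The interval shrinks with the square root of n, which is why going from 500 to 1,000 examples buys comparatively little precision.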
## Reviewer Agent Validation
Challenge: Implement a contamination detector that takes an evaluation question and a set of training document hashes, returning True if the question is likely contaminated (more than 80% 8-gram overlap with any training document).
Expected:
```python
def is_contaminated(question, training_ngram_sets, n=8, threshold=0.8):
    words = question.lower().split()
    question_ngrams = set(
        " ".join(words[i:i+n]) for i in range(len(words) - n + 1)
    )
    if not question_ngrams:
        return False
    for training_ngrams in training_ngram_sets:
        overlap = len(question_ngrams & training_ngrams) / len(question_ngrams)
        if overlap > threshold:
            return True
    return False
```