OpenAI’s PRM800K dataset labels 800,000 math solution steps as correct or incorrect. Training a process reward model on this data improved best-of-N accuracy on a representative MATH test subset from 72.4% (outcome reward model) to 78.2% (process reward model): a 5.8 point gain from teaching the model to identify exactly which step in a derivation went wrong. For math and code, verifiable ground truth unlocks reward model training at a scale that general-purpose human annotation cannot reach: you can generate and check 1 million math solutions for the cost of annotating 10,000 chat responses.
This post covers the construction of reward model training data for math and code domains: outcome reward models (ORMs), process reward models (PRMs), synthetic data generation from execution feedback, and the challenge of reward hacking.
## Outcome Reward Models for Math

### Ground Truth from Final Answers
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import re
class MathVerdict(Enum):
CORRECT = "correct"
INCORRECT = "incorrect"
UNPARSEABLE = "unparseable"
EQUIVALENT = "equivalent"
@dataclass
class MathRewardSample:
"""Single training sample for a math outcome reward model."""
problem: str
solution: str
final_answer: str
ground_truth: str
verdict: MathVerdict
reward: float
difficulty: str
source: str
class MathAnswerExtractor:
"""
Extract and normalize final answers from math solutions.
Math models output answers in various formats:
- LaTeX: \\boxed{42}, $\\frac{3}{4}$
- Plain text: "The answer is 42"
- Numerical: 42, 42.0, 4.2e1
- Symbolic: 3/4, sqrt(2), pi/4
Normalization is critical: '0.75' and '3/4' and
'\\frac{3}{4}' must all be recognized as equivalent.
"""
BOXED_PATTERN = re.compile(
r"\\boxed\{([^}]+)\}"
)
ANSWER_IS_PATTERN = re.compile(
r"(?:the\s+)?answer\s+is[:\s]+(.+?)(?:\.|$)",
re.IGNORECASE,
)
THEREFORE_PATTERN = re.compile(
r"therefore[,:\s]+(.+?)(?:\.|$)",
re.IGNORECASE,
)
def extract(self, solution_text):
"""
Extract the final answer from a solution.
Priority order:
1. \\boxed{...} (most explicit)
2. "The answer is ..." pattern
3. "Therefore, ..." pattern
4. Last numerical expression
"""
# Try boxed first
match = self.BOXED_PATTERN.search(solution_text)
if match:
return self._normalize(match.group(1))
# Try "answer is" pattern
match = self.ANSWER_IS_PATTERN.search(solution_text)
if match:
return self._normalize(match.group(1))
# Try "therefore" pattern
match = self.THEREFORE_PATTERN.search(solution_text)
if match:
return self._normalize(match.group(1))
# Fallback: last number in text
numbers = re.findall(
r"-?\d+(?:\.\d+)?(?:/\d+)?", solution_text
)
if numbers:
return self._normalize(numbers[-1])
return None
def _normalize(self, answer_str):
"""
Normalize a math answer to a canonical form.
Handles fractions, decimals, LaTeX, and
symbolic expressions.
"""
answer_str = answer_str.strip()
# Remove LaTeX wrappers
answer_str = answer_str.replace("$", "")
answer_str = answer_str.replace("\\", "")
# Normalize fractions
frac_match = re.match(
r"frac\{(\d+)\}\{(\d+)\}", answer_str
)
if frac_match:
num = int(frac_match.group(1))
den = int(frac_match.group(2))
return str(num / den)
slash_match = re.match(r"(-?\d+)/(\d+)", answer_str)
if slash_match:
num = int(slash_match.group(1))
den = int(slash_match.group(2))
return str(num / den)
# Try to evaluate as float
try:
val = float(answer_str)
return str(val)
except ValueError:
return answer_str.lower().strip()
def check_equivalence(self, predicted, ground_truth):
"""
Check if two math answers are equivalent.
Uses numerical comparison with tolerance for
floating point, and symbolic comparison for
exact expressions.
"""
pred_norm = self._normalize(str(predicted))
gt_norm = self._normalize(str(ground_truth))
# Direct string match
if pred_norm == gt_norm:
return MathVerdict.CORRECT
# Numerical comparison with tolerance
try:
pred_val = float(pred_norm)
gt_val = float(gt_norm)
if abs(pred_val - gt_val) < 1e-6:
return MathVerdict.CORRECT
if gt_val != 0 and abs(
(pred_val - gt_val) / gt_val
) < 1e-6:
return MathVerdict.CORRECT
except (ValueError, ZeroDivisionError):
pass
return MathVerdict.INCORRECT
Answer equivalence checking is the single largest source of noise in math reward data. The expressions sqrt(2)/2 and 1/sqrt(2) are equivalent, but string comparison fails. Symbolic math libraries (SymPy) help but are slow and do not handle all edge cases. Manual audits of math reward datasets typically find 3-8% mislabeled samples due to equivalence checking failures.
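A cheap first line of defense before reaching for SymPy is exact rational comparison with the standard library (a minimal sketch; `exact_equivalent` is an illustrative helper, not part of the extractor above):

```python
from fractions import Fraction

def exact_equivalent(a: str, b: str) -> bool:
    """Compare two numeric answer strings as exact rationals, so
    '3/4', '0.75', and '75/100' all match without float round-off.
    Falls back to case-insensitive string comparison for
    non-numeric answers."""
    try:
        # Fraction parses both 'p/q' and decimal strings exactly
        return Fraction(a.strip()) == Fraction(b.strip())
    except (ValueError, ZeroDivisionError):
        return a.strip().lower() == b.strip().lower()
```

Unlike the 1e-6 float tolerance in `check_equivalence`, exact rational comparison cannot mislabel 1/3 vs 0.3333 as equivalent, though it still needs a symbolic backend for answers like sqrt(2).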
### Building the ORM Dataset
class MathORMDatasetBuilder:
"""
Build an Outcome Reward Model dataset from
math problems with known ground truth.
For each problem, sample N solutions from the model,
check each against ground truth, and label as
correct/incorrect. This produces binary reward labels
without human annotation.
"""
def __init__(self, model, extractor, n_samples=64):
self.model = model
self.extractor = extractor
self.n_samples = n_samples
def build_dataset(self, problems):
"""
Generate reward training data from math problems.
For each problem:
1. Sample N solutions at temperature > 0
2. Extract final answer from each
3. Compare to ground truth
4. Label as positive (correct) or negative (incorrect)
"""
dataset = []
for problem in problems:
solutions = self.model.generate(
problem["question"],
n=self.n_samples,
temperature=0.8,
max_tokens=2048,
)
correct_count = 0
for solution in solutions:
answer = self.extractor.extract(solution)
if answer is None:
verdict = MathVerdict.UNPARSEABLE
reward = -0.5
else:
verdict = self.extractor.check_equivalence(
answer, problem["answer"]
)
reward = (
1.0
if verdict == MathVerdict.CORRECT
else -1.0
)
if verdict == MathVerdict.CORRECT:
correct_count += 1
dataset.append(
MathRewardSample(
problem=problem["question"],
solution=solution,
final_answer=answer or "",
ground_truth=problem["answer"],
verdict=verdict,
reward=reward,
difficulty=problem.get(
"difficulty", "unknown"
),
source=problem.get("source", "unknown"),
)
)
# Track pass@N for difficulty calibration
pass_rate = correct_count / self.n_samples
self._update_difficulty_stats(
problem, pass_rate
)
return dataset
def _update_difficulty_stats(self, problem, pass_rate):
"""Track pass rates for difficulty-based sampling."""
pass
def balance_dataset(self, dataset):
"""
Balance positive and negative examples.
Math ORM datasets are typically imbalanced:
easy problems produce 90%+ correct solutions,
hard problems produce 5-10% correct.
Strategies:
1. Downsample easy-correct to match hard-correct
2. Upsample hard-correct with augmentation
3. Difficulty-weighted sampling
"""
by_difficulty = {}
for sample in dataset:
diff = sample.difficulty
if diff not in by_difficulty:
by_difficulty[diff] = {
"correct": [], "incorrect": []
}
key = (
"correct"
if sample.verdict == MathVerdict.CORRECT
else "incorrect"
)
by_difficulty[diff][key].append(sample)
balanced = []
        import random

        for diff, samples in by_difficulty.items():
            n_correct = len(samples["correct"])
            n_incorrect = len(samples["incorrect"])
            if n_correct == 0 or n_incorrect == 0:
                continue
            # Target a 1:1 correct/incorrect ratio per difficulty
            target = min(n_correct, n_incorrect)
            balanced.extend(
                random.sample(samples["correct"], target)
            )
            balanced.extend(
                random.sample(samples["incorrect"], target)
            )
return balanced
### ORM Dataset Statistics by Source
| Dataset Source | Problems | Solutions per Problem | Avg Pass Rate | Total Reward Pairs | Answer Parse Rate |
|---|---|---|---|---|---|
| GSM8K | 7,473 | 64 | 68% | 478K | 97% |
| MATH | 5,000 | 64 | 22% | 320K | 92% |
| AIME | 240 | 256 | 4% | 61K | 88% |
| Olympiad | 500 | 256 | 2% | 128K | 85% |
| Synthetic (GPT-4 generated) | 50,000 | 32 | 45% | 1.6M | 95% |
## Process Reward Models for Math

### Step-Level Annotation
@dataclass
class StepAnnotation:
"""Annotation for a single reasoning step."""
step_index: int
step_text: str
label: str # "correct", "incorrect", "neutral"
confidence: float
error_type: Optional[str] = None
@dataclass
class PRMTrainingSample:
"""Complete PRM training sample with step-level labels."""
problem: str
solution_steps: list
step_labels: list
first_error_step: Optional[int]
final_answer_correct: bool
class MathPRMAnnotator:
"""
Annotate math solutions at the step level for PRM training.
Three annotation strategies:
1. Human annotation (gold standard, expensive)
2. Monte Carlo estimation (sample completions from each step)
3. Automated verification (symbolic execution per step)
"""
def __init__(self, model, extractor, mc_samples=32):
self.model = model
self.extractor = extractor
self.mc_samples = mc_samples
def annotate_monte_carlo(self, problem, solution_steps,
ground_truth):
"""
Monte Carlo PRM annotation (PRM800K method).
For each step i:
1. Take steps 1..i as prefix
2. Sample K completions from the model
3. Check if each completion reaches correct answer
4. Step i's label = fraction of completions that
reach correct answer
If step i has high completion rate but step i+1 has
low rate, step i+1 likely introduced an error.
"""
step_scores = []
prefix = problem + "\n"
for i, step in enumerate(solution_steps):
prefix += step + "\n"
# Sample completions from this prefix
completions = self.model.generate(
prefix,
n=self.mc_samples,
temperature=0.8,
max_tokens=1024,
)
correct_count = 0
for completion in completions:
full_solution = prefix + completion
answer = self.extractor.extract(full_solution)
if answer is not None:
verdict = self.extractor.check_equivalence(
answer, ground_truth
)
if verdict == MathVerdict.CORRECT:
correct_count += 1
score = correct_count / self.mc_samples
step_scores.append(score)
# Convert scores to labels
step_labels = []
first_error = None
for i, score in enumerate(step_scores):
if score > 0.5:
step_labels.append(
StepAnnotation(
step_index=i,
step_text=solution_steps[i],
label="correct",
confidence=score,
)
)
elif score > 0.1:
step_labels.append(
StepAnnotation(
step_index=i,
step_text=solution_steps[i],
label="neutral",
confidence=score,
)
)
else:
if first_error is None:
first_error = i
step_labels.append(
StepAnnotation(
step_index=i,
step_text=solution_steps[i],
label="incorrect",
confidence=1.0 - score,
)
)
return PRMTrainingSample(
problem=problem,
solution_steps=solution_steps,
step_labels=step_labels,
first_error_step=first_error,
final_answer_correct=(
step_scores[-1] > 0.5
if step_scores
else False
),
)
def split_into_steps(self, solution_text):
"""
Split a solution into reasoning steps.
Heuristics:
- Split on newlines
- Split on "Step N:" patterns
- Split on sentence boundaries after equations
- Merge very short lines with previous step
"""
lines = solution_text.strip().split("\n")
steps = []
current_step = ""
for line in lines:
line = line.strip()
if not line:
continue
# Check if this starts a new step
is_new_step = (
re.match(r"^(?:Step\s+\d|\\item|\d+\.)", line)
or (len(current_step) > 100 and line[0].isupper())
)
if is_new_step and current_step:
steps.append(current_step)
current_step = line
else:
current_step += " " + line if current_step else line
if current_step:
steps.append(current_step)
# Merge very short steps
merged = []
for step in steps:
if merged and len(step.split()) < 5:
merged[-1] += " " + step
else:
merged.append(step)
return merged
The Monte Carlo PRM annotation method requires K × S sampled completions per solution, where K is the number of completions per step and S is the number of steps; counting every generated token as a forward pass, this runs to thousands or tens of thousands of forward passes per solution. At scale, this makes PRM annotation 100-1000x more expensive than ORM annotation.
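The cost arithmetic can be sketched as a small helper (illustrative; the per-completion token count is an assumption, not a figure from the source):

```python
def mc_prm_cost(n_steps: int, k_completions: int,
                avg_completion_tokens: int = 500) -> dict:
    """Estimate Monte Carlo PRM annotation cost for one solution:
    K completions are sampled from each of the S step prefixes,
    and every generated token costs one model forward pass."""
    generations = n_steps * k_completions
    return {
        "generations": generations,
        "forward_passes": generations * avg_completion_tokens,
    }
```

With K = 32 completions over a 10-step solution, that is already 320 sampled completions to label one training example, versus a single generation for ORM data.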
## Code Correctness Reward Models

### Unit Test Signal
import subprocess
import tempfile
import os
@dataclass
class CodeRewardSample:
"""Training sample for code correctness reward model."""
problem: str
code: str
language: str
test_results: dict
reward: float
error_type: Optional[str]
execution_time_ms: float
class CodeExecutionEnvironment:
"""
Sandboxed code execution for reward signal.
Runs generated code against unit tests in an isolated
environment. Captures pass/fail, error messages,
execution time, and memory usage.
"""
def __init__(self, timeout_s=10, memory_limit_mb=256):
self.timeout_s = timeout_s
self.memory_limit_mb = memory_limit_mb
def execute_python(self, code, test_code):
"""
Execute Python code with test cases.
Returns detailed results including:
- Number of tests passed/failed
- Error messages for failures
- Execution time
- Whether code compiled successfully
"""
full_code = code + "\n\n" + test_code
with tempfile.NamedTemporaryFile(
mode="w", suffix=".py", delete=False
) as f:
f.write(full_code)
tmp_path = f.name
try:
result = subprocess.run(
["python3", tmp_path],
capture_output=True,
text=True,
timeout=self.timeout_s,
)
return {
"compiled": True,
"returncode": result.returncode,
"stdout": result.stdout[:1000],
"stderr": result.stderr[:1000],
"passed": result.returncode == 0,
"error_type": (
self._classify_error(result.stderr)
if result.returncode != 0
else None
),
}
except subprocess.TimeoutExpired:
return {
"compiled": True,
"returncode": -1,
"stdout": "",
"stderr": "Timeout",
"passed": False,
"error_type": "timeout",
}
except Exception as e:
return {
"compiled": False,
"returncode": -1,
"stdout": "",
"stderr": str(e),
"passed": False,
"error_type": "execution_error",
}
finally:
os.unlink(tmp_path)
def _classify_error(self, stderr):
"""Classify the error type from stderr."""
if "SyntaxError" in stderr:
return "syntax_error"
if "NameError" in stderr:
return "name_error"
if "TypeError" in stderr:
return "type_error"
if "IndexError" in stderr:
return "index_error"
if "AssertionError" in stderr:
return "assertion_error"
if "RuntimeError" in stderr:
return "runtime_error"
return "other_error"
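One gap worth noting: `memory_limit_mb` is stored by the environment above but never enforced. On Linux, a `preexec_fn` that sets an address-space rlimit in the child closes that gap (a sketch under a POSIX/Linux assumption; `make_memory_limiter` is an illustrative helper, not part of the class above):

```python
import resource
import subprocess
import sys

def make_memory_limiter(limit_mb: int):
    """Return a preexec_fn that caps the child's address space,
    so runaway allocations in untrusted code fail inside the
    sandbox instead of exhausting the host."""
    def set_limits():
        limit_bytes = limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS,
                           (limit_bytes, limit_bytes))
    return set_limits

# Usage sketch: run a child under a 512 MB address-space cap
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    capture_output=True, text=True, timeout=10,
    preexec_fn=make_memory_limiter(512),
)
```

For stronger isolation (network, filesystem), production pipelines typically run untrusted code inside containers or gVisor-style sandboxes rather than relying on rlimits alone.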
class CodeRewardDatasetBuilder:
"""
Build code reward dataset from problems with test suites.
Sources:
- HumanEval / MBPP (with existing tests)
- LeetCode problems (scrape test cases)
- Competitive programming (with judges)
- Synthetic problems with generated tests
"""
def __init__(self, model, executor, n_samples=50):
self.model = model
self.executor = executor
self.n_samples = n_samples
def build_from_problems(self, problems):
"""
Generate reward data from coding problems.
For each problem, sample N solutions, run tests,
and create reward labels from pass/fail.
"""
dataset = []
for problem in problems:
solutions = self.model.generate(
problem["prompt"],
n=self.n_samples,
temperature=0.8,
max_tokens=2048,
)
for solution in solutions:
result = self.executor.execute_python(
solution, problem["tests"]
)
# Graded reward based on test results
if result["passed"]:
reward = 1.0
elif result["compiled"]:
# Partial credit for compiling but failing
reward = -0.3
else:
# No credit for syntax errors
reward = -1.0
dataset.append(
CodeRewardSample(
problem=problem["prompt"],
code=solution,
language="python",
test_results=result,
reward=reward,
error_type=result.get("error_type"),
execution_time_ms=0.0,
)
)
return dataset
def generate_additional_tests(self, problem, solution):
"""
Generate additional test cases to increase
reward signal reliability.
A solution might pass the provided tests by
accident. Generating more tests (especially
edge cases) reduces false positives.
"""
test_gen_prompt = (
f"Given this programming problem:\n"
f"{problem['prompt']}\n\n"
f"And this reference solution:\n"
f"{problem.get('reference_solution', '')}\n\n"
f"Generate 10 additional test cases including:\n"
f"- Edge cases (empty input, single element)\n"
f"- Large inputs\n"
f"- Negative numbers\n"
f"- Boundary conditions\n"
f"Format as Python assert statements."
)
additional_tests = self.model.generate(
test_gen_prompt,
n=1,
temperature=0.3,
max_tokens=1024,
)[0]
return additional_tests
### Code Reward Signal Quality by Test Count
| Tests per Problem | False Positive Rate | False Negative Rate | Reward Accuracy | Avg Execution Time (ms) |
|---|---|---|---|---|
| 1-3 (HumanEval default) | 12% | 3% | 85% | 50 |
| 5-10 (augmented) | 5% | 4% | 91% | 120 |
| 10-20 (comprehensive) | 2% | 5% | 93% | 250 |
| 20-50 (fuzzing) | 0.5% | 7% | 93% | 600 |
| Property-based (Hypothesis) | 0.3% | 8% | 92% | 1200 |
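The property-based row can be approximated without external dependencies by randomized differential testing against a trusted reference solution (a sketch; in practice Hypothesis would replace the hand-rolled input generator, and `fuzz_against_reference` is an illustrative helper):

```python
import random

def fuzz_against_reference(candidate, reference,
                           n_cases: int = 200, seed: int = 0):
    """Differential testing: run candidate and reference on random
    list inputs and return the first counterexample, or None if
    they agree everywhere. Catches hardcoded solutions that pass
    small fixed test suites."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        xs = [rng.randint(-100, 100)
              for _ in range(rng.randint(0, 20))]
        if candidate(list(xs)) != reference(list(xs)):
            return xs  # counterexample found
    return None
```

Because inputs are drawn fresh rather than fixed, a solution that merely memorizes the provided test cases fails almost immediately, which is exactly the false-positive mode the table's fuzzing rows reduce.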
## Synthetic Reward Data Generation

### Execution-Guided Reward Data
class SyntheticRewardDataGenerator:
"""
Generate synthetic reward training data using
execution feedback loops.
Strategy: generate problems, generate solutions,
execute solutions to get ground truth, use execution
results as reward labels. No human annotation needed.
"""
def __init__(self, problem_generator, solver,
executor, verifier):
self.problem_generator = problem_generator
self.solver = solver
self.executor = executor
self.verifier = verifier
def generate_math_reward_data(self, n_problems, n_solutions):
"""
Synthetic math reward data pipeline:
1. Generate problems with known solutions
2. Solve each problem N times
3. Verify each solution against known answer
4. Label (problem, solution) pairs
"""
dataset = []
# Generate problems with reference solutions
problems = self.problem_generator.generate_math(
n=n_problems,
difficulty_distribution={
"easy": 0.2,
"medium": 0.4,
"hard": 0.3,
"olympiad": 0.1,
},
)
for problem in problems:
# Verify the generated problem has a valid answer
ref_answer = problem.get("reference_answer")
if ref_answer is None:
continue
# Sample solutions
solutions = self.solver.generate(
problem["question"],
n=n_solutions,
temperature=0.9,
)
for solution in solutions:
answer = self.verifier.extract_answer(solution)
verdict = self.verifier.check(
answer, ref_answer
)
dataset.append({
"problem": problem["question"],
"solution": solution,
"reward": (
1.0 if verdict == "correct" else -1.0
),
"answer_extracted": answer,
"reference_answer": ref_answer,
"difficulty": problem["difficulty"],
"synthetic": True,
})
return dataset
def generate_code_reward_data(self, n_problems,
n_solutions):
"""
Synthetic code reward data pipeline:
1. Generate problem descriptions
2. Generate reference solutions and test suites
3. Verify reference solution passes all tests
4. Generate N candidate solutions
5. Run each against test suite
"""
dataset = []
problems = self.problem_generator.generate_code(
n=n_problems,
topics=[
"arrays", "strings", "trees",
"graphs", "dynamic_programming",
"sorting", "searching",
],
)
for problem in problems:
# Verify reference solution
ref_result = self.executor.execute_python(
problem["reference_solution"],
problem["tests"],
)
if not ref_result["passed"]:
# Skip problems where reference fails
continue
# Generate candidate solutions
solutions = self.solver.generate(
problem["prompt"],
n=n_solutions,
temperature=0.8,
)
for solution in solutions:
result = self.executor.execute_python(
solution, problem["tests"]
)
dataset.append({
"problem": problem["prompt"],
"solution": solution,
"reward": 1.0 if result["passed"] else -1.0,
"error_type": result.get("error_type"),
"synthetic": True,
})
return dataset
## Reward Hacking Detection

### Identifying and Mitigating Reward Exploitation
class RewardHackingDetector:
"""
Detect reward hacking: when the policy exploits
artifacts in the reward model rather than improving
actual quality.
Common reward hacking patterns in math/code:
- Verbose solutions that pad length (length bias)
- Solutions that copy the question (repetition exploit)
- Code that hardcodes test case outputs
- Math solutions that state the answer without proof
"""
def __init__(self, reward_model):
self.reward_model = reward_model
self.baseline_stats = {}
def detect_length_hacking(self, samples):
"""
Check if reward correlates with length independent
of quality.
If reward increases with solution length even among
incorrect solutions, the reward model has a length bias.
"""
import numpy as np
correct_lengths = []
correct_rewards = []
incorrect_lengths = []
incorrect_rewards = []
for sample in samples:
length = len(sample["solution"].split())
reward = self.reward_model.score(
sample["problem"], sample["solution"]
)
if sample["reward"] > 0:
correct_lengths.append(length)
correct_rewards.append(reward)
else:
incorrect_lengths.append(length)
incorrect_rewards.append(reward)
# Correlation within incorrect solutions
if len(incorrect_lengths) > 10:
corr = np.corrcoef(
incorrect_lengths, incorrect_rewards
)[0, 1]
if corr > 0.3:
return {
"hacking_type": "length_bias",
"correlation": float(corr),
"severity": "high" if corr > 0.5 else "medium",
"recommendation": (
"Add length penalty to reward or "
"train with length-controlled pairs"
),
}
return None
def detect_hardcoding(self, code_samples):
"""
Detect code that hardcodes test case outputs
rather than implementing the algorithm.
Patterns:
- if/elif chains matching exact test inputs
- Dictionary mapping inputs to outputs
- No loops or logic, only return statements
"""
detections = []
for sample in code_samples:
code = sample["solution"]
# Count if/elif chains
if_count = code.count("if ") + code.count("elif ")
line_count = len(code.strip().split("\n"))
# High if/elif density suggests hardcoding
if line_count > 0 and if_count / line_count > 0.4:
detections.append({
"sample": sample,
"pattern": "if_chain",
"density": if_count / line_count,
})
# Check for direct output mapping
if "return {" in code and code.count(":") > 5:
detections.append({
"sample": sample,
"pattern": "dict_mapping",
})
return detections
def detect_answer_copying(self, math_samples):
"""
Detect solutions that state the answer without
reasoning. The model may learn that asserting
'\\boxed{42}' early gets high reward.
"""
detections = []
for sample in math_samples:
solution = sample["solution"]
steps = solution.strip().split("\n")
# Check if answer appears in first 2 lines
for step in steps[:2]:
if "\\boxed{" in step or "answer is" in step.lower():
total_lines = len([
s for s in steps if s.strip()
])
if total_lines < 5:
detections.append({
"sample": sample,
"pattern": "early_answer",
"reasoning_steps": total_lines,
})
return detections
Reward hacking is the most insidious failure mode in RLHF for math/code. A model trained with a reward model that has a length bias will produce increasingly verbose solutions that score well on the reward model but are worse by human evaluation. In code, hardcoding test outputs produces solutions that pass all tests but do not generalize. Regular auditing with held-out test cases and human evaluation is essential.
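For the length-bias mode specifically, a common mitigation is an explicit length penalty applied on top of the reward model score (a sketch; `target_len` and `alpha` are illustrative values that would need tuning against held-out human evaluations):

```python
def length_penalized_reward(raw_reward: float, n_tokens: int,
                            target_len: int = 512,
                            alpha: float = 0.001) -> float:
    """Subtract a linear penalty for tokens beyond target_len,
    removing the incentive to pad solutions; rewards for
    solutions within the target length are untouched."""
    excess = max(0, n_tokens - target_len)
    return raw_reward - alpha * excess
```

Training with length-controlled preference pairs (comparing solutions of similar length) attacks the same bias at the data level rather than patching it at inference time.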
## Reward Model Training Pipeline

### Putting It All Together
class RewardModelTrainer:
"""
Train a reward model from the constructed dataset.
Architecture: Base LLM + value head (linear layer
projecting last hidden state to scalar reward).
Training: Bradley-Terry loss on preference pairs
or binary cross-entropy on (solution, label) pairs.
"""
def __init__(self, base_model, config):
self.base_model = base_model
self.learning_rate = config.get("lr", 1e-5)
self.epochs = config.get("epochs", 3)
self.batch_size = config.get("batch_size", 16)
self.loss_type = config.get("loss_type", "bce")
def prepare_training_data(self, orm_data, prm_data=None):
"""
Merge ORM and PRM data into a unified training set.
ORM data: (problem, solution, reward) triples
PRM data: (problem, step, step_label) triples
For a combined model, we train on both outcome-level
and step-level predictions.
"""
training_samples = []
# ORM samples
for sample in orm_data:
training_samples.append({
"input": (
f"Problem: {sample.problem}\n"
f"Solution: {sample.solution}"
),
"label": 1.0 if sample.reward > 0 else 0.0,
"weight": 1.0,
"type": "outcome",
})
# PRM samples (if available)
if prm_data:
for sample in prm_data:
for step_label in sample.step_labels:
prefix = "\n".join(
sample.solution_steps[
:step_label.step_index + 1
]
)
training_samples.append({
"input": (
f"Problem: {sample.problem}\n"
f"Solution so far: {prefix}"
),
"label": (
1.0
if step_label.label == "correct"
else 0.0
),
"weight": step_label.confidence,
"type": "process",
})
return training_samples
def compute_loss(self, predictions, labels, weights):
"""
Compute weighted binary cross-entropy loss.
loss = -w * [y * log(p) + (1-y) * log(1-p)]
where w is the sample weight (confidence),
y is the label, and p is the predicted reward.
"""
import numpy as np
epsilon = 1e-7
predictions = np.clip(predictions, epsilon, 1 - epsilon)
bce = -(
labels * np.log(predictions)
+ (1 - labels) * np.log(1 - predictions)
)
weighted_loss = bce * weights
return np.mean(weighted_loss)
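For preference-pair data, the Bradley-Terry alternative named in the class docstring reduces to a logistic loss on the score margin (a minimal sketch):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)): near
    zero when the chosen solution scores far above the rejected
    one, log(2) when the model cannot tell them apart."""
    margin = r_chosen - r_rejected
    # Numerically stable -log(sigmoid(margin)) for either sign
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Verified math/code labels make the BCE path above natural, while Bradley-Terry remains the standard choice when only relative preferences between two solutions are available.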
### Reward Model Accuracy by Data Source
*(Table omitted: reward model accuracy for ORM (math, execution-verified), PRM (math, MC-annotated), Code ORM (test-verified), and human preference (chat) across training-set sizes from 10 to 1000; cell values are not recoverable from the source.)*
## Key Takeaways
Reward model training data for math and code benefits from a unique advantage: verifiable ground truth. Unlike general chat where quality is subjective, math solutions are either correct or incorrect, and code either passes tests or fails.
The critical decisions:
- Answer equivalence checking is hard: For math ORM data, the answer extractor and equivalence checker determine data quality. Symbolic normalization catches most cases but misses 3-8% of equivalent answers. Investing in robust equivalence checking (SymPy integration, multiple normalization passes) directly improves reward model accuracy.
- Process reward models outperform outcome reward models: PRM data (step-level labels) produces reward models that are 2-5% more accurate than ORM data (final-answer-only labels) at the same data scale. The cost is 100-1000x more expensive annotation via Monte Carlo completion sampling.
- More tests mean better code reward data: False positive rate drops from 12% with 1-3 tests to 0.5% with 20-50 tests. Property-based testing (Hypothesis) achieves the lowest false positive rate but increases execution time 20x.
- Synthetic reward data works: Execution-verified synthetic data (generate problems, generate solutions, verify by execution) produces reward models within 2-3% of human-annotated data quality while eliminating annotation cost entirely.
- Reward hacking is a constant threat: Length bias, hardcoding, and early-answer patterns are the three most common reward hacking modes. Regular auditing with held-out tests and human spot-checks is not optional; it is a required component of any reward model training pipeline.