OpenAI’s PRM800K dataset labels 800,000 math solution steps as correct or incorrect. Training a process reward model on this data improved best-of-N accuracy on a representative MATH test subset from 72.4% (outcome reward model) to 78.2% (process reward model): a 5.8 point gain from teaching the model to identify exactly which step in a derivation went wrong. For math and code, verifiable ground truth unlocks reward model training at a scale that general-purpose human annotation cannot reach: you can generate and check 1 million math solutions for the cost of annotating 10,000 chat responses.
This post covers the construction of reward model training data for math and code domains: outcome reward models (ORMs), process reward models (PRMs), synthetic data generation from execution feedback, and the challenge of reward hacking.
## Outcome Reward Models for Math

### Ground Truth from Final Answers
from dataclasses import dataclass
from enum import Enum
from typing import Optional
import re
class MathVerdict(Enum):
CORRECT = "correct"
INCORRECT = "incorrect"
UNPARSEABLE = "unparseable"
EQUIVALENT = "equivalent"
@dataclass
class MathRewardSample:
"""Single training sample for a math outcome reward model."""
problem: str
solution: str
final_answer: str
ground_truth: str
verdict: MathVerdict
reward: float
difficulty: str
source: str
class MathAnswerExtractor:
"""
Extract and normalize final answers from math solutions.
Math models output answers in various formats:
- LaTeX: \\boxed{42}, $\\frac{3}{4}$
- Plain text: "The answer is 42"
- Numerical: 42, 42.0, 4.2e1
- Symbolic: 3/4, sqrt(2), pi/4
Normalization is critical: '0.75' and '3/4' and
'\\frac{3}{4}' must all be recognized as equivalent.
"""
BOXED_PATTERN = re.compile(
r"\\boxed\{([^}]+)\}"
)
ANSWER_IS_PATTERN = re.compile(
r"(?:the\s+)?answer\s+is[:\s]+(.+?)(?:\.|$)",
re.IGNORECASE,
)
THEREFORE_PATTERN = re.compile(
r"therefore[,:\s]+(.+?)(?:\.|$)",
re.IGNORECASE,
)
def extract(self, solution_text):
"""
Extract the final answer from a solution.
Priority order:
1. \\boxed{...} (most explicit)
2. "The answer is ..." pattern
3. "Therefore, ..." pattern
4. Last numerical expression
"""
# Try boxed first
match = self.BOXED_PATTERN.search(solution_text)
if match:
return self._normalize(match.group(1))
# Try "answer is" pattern
match = self.ANSWER_IS_PATTERN.search(solution_text)
if match:
return self._normalize(match.group(1))
# Try "therefore" pattern
match = self.THEREFORE_PATTERN.search(solution_text)
if match:
return self._normalize(match.group(1))
# Fallback: last number in text
numbers = re.findall(
r"-?\d+(?:\.\d+)?(?:/\d+)?", solution_text
)
if numbers:
return self._normalize(numbers[-1])
return None
def _normalize(self, answer_str):
"""
Normalize a math answer to a canonical form.
Handles fractions, decimals, LaTeX, and
symbolic expressions.
"""
answer_str = answer_str.strip()
# Remove LaTeX wrappers
answer_str = answer_str.replace("$", "")
answer_str = answer_str.replace("\\", "")
# Normalize fractions
frac_match = re.match(
r"frac\{(\d+)\}\{(\d+)\}", answer_str
)
if frac_match:
num = int(frac_match.group(1))
den = int(frac_match.group(2))
return str(num / den)
slash_match = re.match(r"(-?\d+)/(\d+)", answer_str)
if slash_match:
num = int(slash_match.group(1))
den = int(slash_match.group(2))
return str(num / den)
# Try to evaluate as float
try:
val = float(answer_str)
return str(val)
except ValueError:
return answer_str.lower().strip()
def check_equivalence(self, predicted, ground_truth):
"""
Check if two math answers are equivalent.
Uses numerical comparison with tolerance for
floating point, and symbolic comparison for
exact expressions.
"""
pred_norm = self._normalize(str(predicted))
gt_norm = self._normalize(str(ground_truth))
# Direct string match
if pred_norm == gt_norm:
return MathVerdict.CORRECT
# Numerical comparison with tolerance
try:
pred_val = float(pred_norm)
gt_val = float(gt_norm)
if abs(pred_val - gt_val) < 1e-6:
return MathVerdict.CORRECT
if gt_val != 0 and abs(
(pred_val - gt_val) / gt_val
) < 1e-6:
return MathVerdict.CORRECT
except (ValueError, ZeroDivisionError):
pass
return MathVerdict.INCORRECT
Answer equivalence checking is the single largest source of noise in math reward data. The expressions sqrt(2)/2 and 1/sqrt(2) are equivalent, but string comparison fails. Symbolic math libraries (SymPy) help but are slow and do not handle all edge cases. Manual audits of math reward datasets typically find 3-8% mislabeled samples due to equivalence checking failures.
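A cheap first line of defense before reaching for SymPy is exact rational comparison with the standard library (a minimal sketch; `exact_equivalent` is an illustrative helper, not part of the extractor above):

```python
from fractions import Fraction

def exact_equivalent(a: str, b: str) -> bool:
    """Compare two numeric answer strings as exact rationals, so
    '3/4', '0.75', and '75/100' all match without float round-off.
    Falls back to case-insensitive string comparison for
    non-numeric answers."""
    try:
        # Fraction parses both 'p/q' and decimal strings exactly
        return Fraction(a.strip()) == Fraction(b.strip())
    except (ValueError, ZeroDivisionError):
        return a.strip().lower() == b.strip().lower()
```

Unlike the 1e-6 float tolerance in `check_equivalence`, exact rational comparison cannot mislabel 1/3 vs 0.3333 as equivalent, though it still needs a symbolic backend for answers like sqrt(2).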
### Building the ORM Dataset
class MathORMDatasetBuilder:
"""
Build an Outcome Reward Model dataset from
math problems with known ground truth.
For each problem, sample N solutions from the model,
check each against ground truth, and label as
correct/incorrect. This produces binary reward labels
without human annotation.
"""
def __init__(self, model, extractor, n_samples=64):
self.model = model
self.extractor = extractor
self.n_samples = n_samples
def build_dataset(self, problems):
"""
Generate reward training data from math problems.
For each problem:
1. Sample N solutions at temperature > 0
2. Extract final answer from each
3. Compare to ground truth
4. Label as positive (correct) or negative (incorrect)
"""
dataset = []
for problem in problems:
solutions = self.model.generate(
problem["question"],
n=self.n_samples,
temperature=0.8,
max_tokens=2048,
)
correct_count = 0
for solution in solutions:
answer = self.extractor.extract(solution)
if answer is None:
verdict = MathVerdict.UNPARSEABLE
reward = -0.5
else:
verdict = self.extractor.check_equivalence(
answer, problem["answer"]
)
reward = (
1.0
if verdict == MathVerdict.CORRECT
else -1.0
)
if verdict == MathVerdict.CORRECT:
correct_count += 1
dataset.append(
MathRewardSample(
problem=problem["question"],
solution=solution,
final_answer=answer or "",
ground_truth=problem["answer"],
verdict=verdict,
reward=reward,
difficulty=problem.get(
"difficulty", "unknown"
),
source=problem.get("source", "unknown"),
)
)
# Track pass@N for difficulty calibration
pass_rate = correct_count / self.n_samples
self._update_difficulty_stats(
problem, pass_rate
)
return dataset
def _update_difficulty_stats(self, problem, pass_rate):
"""Track pass rates for difficulty-based sampling."""
pass
def balance_dataset(self, dataset):
"""
Balance positive and negative examples.
Math ORM datasets are typically imbalanced:
easy problems produce 90%+ correct solutions,
hard problems produce 5-10% correct.
Strategies:
1. Downsample easy-correct to match hard-correct
2. Upsample hard-correct with augmentation
3. Difficulty-weighted sampling
"""
by_difficulty = {}
for sample in dataset:
diff = sample.difficulty
if diff not in by_difficulty:
by_difficulty[diff] = {
"correct": [], "incorrect": []
}
key = (
"correct"
if sample.verdict == MathVerdict.CORRECT
else "incorrect"
)
by_difficulty[diff][key].append(sample)
balanced = []
        import random

        for diff, samples in by_difficulty.items():
            n_correct = len(samples["correct"])
            n_incorrect = len(samples["incorrect"])
            if n_correct == 0 or n_incorrect == 0:
                continue
            # Target a 1:1 correct/incorrect ratio per difficulty
            target = min(n_correct, n_incorrect)
            balanced.extend(
                random.sample(samples["correct"], target)
            )
            balanced.extend(
                random.sample(samples["incorrect"], target)
            )
return balanced
### ORM Dataset Statistics by Source
| Dataset Source | Problems | Solutions per Problem | Avg Pass Rate | Total Reward Pairs | Answer Parse Rate |
|---|---|---|---|---|---|
| GSM8K | 7,473 | 64 | 68% | 478K | 97% |
| MATH | 5,000 | 64 | 22% | 320K | 92% |
| AIME | 240 | 256 | 4% | 61K | 88% |
| Olympiad | 500 | 256 | 2% | 128K | 85% |
| Synthetic (GPT-4 generated) | 50,000 | 32 | 45% | 1.6M | 95% |
## Process Reward Models for Math

### Step-Level Annotation
@dataclass
class StepAnnotation:
"""Annotation for a single reasoning step."""
step_index: int
step_text: str
label: str # "correct", "incorrect", "neutral"
confidence: float
error_type: Optional[str] = None
@dataclass
class PRMTrainingSample:
"""Complete PRM training sample with step-level labels."""
problem: str
solution_steps: list
step_labels: list
first_error_step: Optional[int]
final_answer_correct: bool
class MathPRMAnnotator:
"""
Annotate math solutions at the step level for PRM training.
Three annotation strategies:
1. Human annotation (gold standard, expensive)
2. Monte Carlo estimation (sample completions from each step)
3. Automated verification (symbolic execution per step)
"""
def __init__(self, model, extractor, mc_samples=32):
self.model = model
self.extractor = extractor
self.mc_samples = mc_samples
def annotate_monte_carlo(self, problem, solution_steps,
ground_truth):
"""
Monte Carlo PRM annotation (PRM800K method).
For each step i:
1. Take steps 1..i as prefix
2. Sample K completions from the model
3. Check if each completion reaches correct answer
4. Step i's label = fraction of completions that
reach correct answer
If step i has high completion rate but step i+1 has
low rate, step i+1 likely introduced an error.
"""
step_scores = []
prefix = problem + "\n"
for i, step in enumerate(solution_steps):
prefix += step + "\n"
# Sample completions from this prefix
completions = self.model.generate(
prefix,
n=self.mc_samples,
temperature=0.8,
max_tokens=1024,
)
correct_count = 0
for completion in completions:
full_solution = prefix + completion
answer = self.extractor.extract(full_solution)
if answer is not None:
verdict = self.extractor.check_equivalence(
answer, ground_truth
)
if verdict == MathVerdict.CORRECT:
correct_count += 1
score = correct_count / self.mc_samples
step_scores.append(score)
# Convert scores to labels
step_labels = []
first_error = None
for i, score in enumerate(step_scores):
if score > 0.5:
step_labels.append(
StepAnnotation(
step_index=i,
step_text=solution_steps[i],
label="correct",
confidence=score,
)
)
elif score > 0.1:
step_labels.append(
StepAnnotation(
step_index=i,
step_text=solution_steps[i],
label="neutral",
confidence=score,
)
)
else:
if first_error is None:
first_error = i
step_labels.append(
StepAnnotation(
step_index=i,
step_text=solution_steps[i],
label="incorrect",
confidence=1.0 - score,
)
)
return PRMTrainingSample(
problem=problem,
solution_steps=solution_steps,
step_labels=step_labels,
first_error_step=first_error,
final_answer_correct=(
step_scores[-1] > 0.5
if step_scores
else False
),
)
def split_into_steps(self, solution_text):
"""
Split a solution into reasoning steps.
Heuristics:
- Split on newlines
- Split on "Step N:" patterns
- Split on sentence boundaries after equations
- Merge very short lines with previous step
"""
lines = solution_text.strip().split("\n")
steps = []
current_step = ""
for line in lines:
line = line.strip()
if not line:
continue
# Check if this starts a new step
is_new_step = (
re.match(r"^(?:Step\s+\d|\\item|\d+\.)", line)
or (len(current_step) > 100 and line[0].isupper())
)
if is_new_step and current_step:
steps.append(current_step)
current_step = line
else:
current_step += " " + line if current_step else line
if current_step:
steps.append(current_step)
# Merge very short steps
merged = []
for step in steps:
if merged and len(step.split()) < 5:
merged[-1] += " " + step
else:
merged.append(step)
return merged
The Monte Carlo PRM annotation method requires K × S sampled completions per solution, where K is the number of completions per step and S is the number of steps; counting every generated token as a forward pass, this runs to thousands or tens of thousands of forward passes per solution. At scale, this makes PRM annotation 100-1000x more expensive than ORM annotation.
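The cost arithmetic can be sketched as a small helper (illustrative; the per-completion token count is an assumption, not a figure from the source):

```python
def mc_prm_cost(n_steps: int, k_completions: int,
                avg_completion_tokens: int = 500) -> dict:
    """Estimate Monte Carlo PRM annotation cost for one solution:
    K completions are sampled from each of the S step prefixes,
    and every generated token costs one model forward pass."""
    generations = n_steps * k_completions
    return {
        "generations": generations,
        "forward_passes": generations * avg_completion_tokens,
    }
```

With K = 32 completions over a 10-step solution, that is already 320 sampled completions to label one training example, versus a single generation for ORM data.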
## Code Correctness Reward Models

### Unit Test Signal
import subprocess
import tempfile
import os
@dataclass
class CodeRewardSample:
"""Training sample for code correctness reward model."""
problem: str
code: str
language: str
test_results: dict
reward: float
error_type: Optional[str]
execution_time_ms: float
class CodeExecutionEnvironment:
"""
Sandboxed code execution for reward signal.
Runs generated code against unit tests in an isolated
environment. Captures pass/fail, error messages,
execution time, and memory usage.
"""
def __init__(self, timeout_s=10, memory_limit_mb=256):
self.timeout_s = timeout_s
self.memory_limit_mb = memory_limit_mb
def execute_python(self, code, test_code):
"""
Execute Python code with test cases.
Returns detailed results including:
- Number of tests passed/failed
- Error messages for failures
- Execution time
- Whether code compiled successfully
"""
full_code = code + "\n\n" + test_code
with tempfile.NamedTemporaryFile(
mode="w", suffix=".py", delete=False
) as f:
f.write(full_code)
tmp_path = f.name
try:
result = subprocess.run(
["python3", tmp_path],
capture_output=True,
text=True,
timeout=self.timeout_s,
)
return {
"compiled": True,
"returncode": result.returncode,
"stdout": result.stdout[:1000],
"stderr": result.stderr[:1000],
"passed": result.returncode == 0,
"error_type": (
self._classify_error(result.stderr)
if result.returncode != 0
else None
),
}
except subprocess.TimeoutExpired:
return {
"compiled": True,
"returncode": -1,
"stdout": "",
"stderr": "Timeout",
"passed": False,
"error_type": "timeout",
}
except Exception as e:
return {
"compiled": False,
"returncode": -1,
"stdout": "",
"stderr": str(e),
"passed": False,
"error_type": "execution_error",
}
finally:
os.unlink(tmp_path)
def _classify_error(self, stderr):
"""Classify the error type from stderr."""
if "SyntaxError" in stderr:
return "syntax_error"
if "NameError" in stderr:
return "name_error"
if "TypeError" in stderr:
return "type_error"
if "IndexError" in stderr:
return "index_error"
if "AssertionError" in stderr:
return "assertion_error"
if "RuntimeError" in stderr:
return "runtime_error"
return "other_error"
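One gap worth noting: `memory_limit_mb` is stored by the environment above but never enforced. On Linux, a `preexec_fn` that sets an address-space rlimit in the child closes that gap (a sketch under a POSIX/Linux assumption; `make_memory_limiter` is an illustrative helper, not part of the class above):

```python
import resource
import subprocess
import sys

def make_memory_limiter(limit_mb: int):
    """Return a preexec_fn that caps the child's address space,
    so runaway allocations in untrusted code fail inside the
    sandbox instead of exhausting the host."""
    def set_limits():
        limit_bytes = limit_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS,
                           (limit_bytes, limit_bytes))
    return set_limits

# Usage sketch: run a child under a 512 MB address-space cap
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    capture_output=True, text=True, timeout=10,
    preexec_fn=make_memory_limiter(512),
)
```

For stronger isolation (network, filesystem), production pipelines typically run untrusted code inside containers or gVisor-style sandboxes rather than relying on rlimits alone.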
class CodeRewardDatasetBuilder:
"""
Build code reward dataset from problems with test suites.
Sources:
- HumanEval / MBPP (with existing tests)
- LeetCode problems (scrape test cases)
- Competitive programming (with judges)
- Synthetic problems with generated tests
"""
def __init__(self, model, executor, n_samples=50):
self.model = model
self.executor = executor
self.n_samples = n_samples
def build_from_problems(self, problems):
"""
Generate reward data from coding problems.
For each problem, sample N solutions, run tests,
and create reward labels from pass/fail.
"""
dataset = []
for problem in problems:
solutions = self.model.generate(
problem["prompt"],
n=self.n_samples,
temperature=0.8,
max_tokens=2048,
)
for solution in solutions:
result = self.executor.execute_python(
solution, problem["tests"]
)
# Graded reward based on test results
if result["passed"]:
reward = 1.0
elif result["compiled"]:
# Partial credit for compiling but failing
reward = -0.3
else:
# No credit for syntax errors
reward = -1.0
dataset.append(
CodeRewardSample(
problem=problem["prompt"],
code=solution,
language="python",
test_results=result,
reward=reward,
error_type=result.get("error_type"),
execution_time_ms=0.0,
)
)
return dataset
def generate_additional_tests(self, problem, solution):
"""
Generate additional test cases to increase
reward signal reliability.
A solution might pass the provided tests by
accident. Generating more tests (especially
edge cases) reduces false positives.
"""
test_gen_prompt = (
f"Given this programming problem:\n"
f"{problem['prompt']}\n\n"
f"And this reference solution:\n"
f"{problem.get('reference_solution', '')}\n\n"
f"Generate 10 additional test cases including:\n"
f"- Edge cases (empty input, single element)\n"
f"- Large inputs\n"
f"- Negative numbers\n"
f"- Boundary conditions\n"
f"Format as Python assert statements."
)
additional_tests = self.model.generate(
test_gen_prompt,
n=1,
temperature=0.3,
max_tokens=1024,
)[0]
return additional_tests
### Code Reward Signal Quality by Test Count
| Tests per Problem | False Positive Rate | False Negative Rate | Reward Accuracy | Avg Execution Time (ms) |
|---|---|---|---|---|
| 1-3 (HumanEval default) | 12% | 3% | 85% | 50 |
| 5-10 (augmented) | 5% | 4% | 91% | 120 |
| 10-20 (comprehensive) | 2% | 5% | 93% | 250 |
| 20-50 (fuzzing) | 0.5% | 7% | 93% | 600 |
| Property-based (Hypothesis) | 0.3% | 8% | 92% | 1200 |
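The property-based row can be approximated without external dependencies by randomized differential testing against a trusted reference solution (a sketch; in practice Hypothesis would replace the hand-rolled input generator, and `fuzz_against_reference` is an illustrative helper):

```python
import random

def fuzz_against_reference(candidate, reference,
                           n_cases: int = 200, seed: int = 0):
    """Differential testing: run candidate and reference on random
    list inputs and return the first counterexample, or None if
    they agree everywhere. Catches hardcoded solutions that pass
    small fixed test suites."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        xs = [rng.randint(-100, 100)
              for _ in range(rng.randint(0, 20))]
        if candidate(list(xs)) != reference(list(xs)):
            return xs  # counterexample found
    return None
```

Because inputs are drawn fresh rather than fixed, a solution that merely memorizes the provided test cases fails almost immediately, which is exactly the false-positive mode the table's fuzzing rows reduce.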
## Synthetic Reward Data Generation

### Execution-Guided Reward Data
class SyntheticRewardDataGenerator:
"""
Generate synthetic reward training data using
execution feedback loops.
Strategy: generate problems, generate solutions,
execute solutions to get ground truth, use execution
results as reward labels. No human annotation needed.
"""
def __init__(self, problem_generator, solver,
executor, verifier):
self.problem_generator = problem_generator
self.solver = solver
self.executor = executor
self.verifier = verifier
def generate_math_reward_data(self, n_problems, n_solutions):
"""
Synthetic math reward data pipeline:
1. Generate problems with known solutions
2. Solve each problem N times
3. Verify each solution against known answer
4. Label (problem, solution) pairs
"""
dataset = []
# Generate problems with reference solutions
problems = self.problem_generator.generate_math(
n=n_problems,
difficulty_distribution={
"easy": 0.2,
"medium": 0.4,
"hard": 0.3,
"olympiad": 0.1,
},
)
for problem in problems:
# Verify the generated problem has a valid answer
ref_answer = problem.get("reference_answer")
if ref_answer is None:
continue
# Sample solutions
solutions = self.solver.generate(
problem["question"],
n=n_solutions,
temperature=0.9,
)
for solution in solutions:
answer = self.verifier.extract_answer(solution)
verdict = self.verifier.check(
answer, ref_answer
)
dataset.append({
"problem": problem["question"],
"solution": solution,
"reward": (
1.0 if verdict == "correct" else -1.0
),
"answer_extracted": answer,
"reference_answer": ref_answer,
"difficulty": problem["difficulty"],
"synthetic": True,
})
return dataset
def generate_code_reward_data(self, n_problems,
n_solutions):
"""
Synthetic code reward data pipeline:
1. Generate problem descriptions
2. Generate reference solutions and test suites
3. Verify reference solution passes all tests
4. Generate N candidate solutions
5. Run each against test suite
"""
dataset = []
problems = self.problem_generator.generate_code(
n=n_problems,
topics=[
"arrays", "strings", "trees",
"graphs", "dynamic_programming",
"sorting", "searching",
],
)
for problem in problems:
# Verify reference solution
ref_result = self.executor.execute_python(
problem["reference_solution"],
problem["tests"],
)
if not ref_result["passed"]:
# Skip problems where reference fails
continue
# Generate candidate solutions
solutions = self.solver.generate(
problem["prompt"],
n=n_solutions,
temperature=0.8,
)
for solution in solutions:
result = self.executor.execute_python(
solution, problem["tests"]
)
dataset.append({
"problem": problem["prompt"],
"solution": solution,
"reward": 1.0 if result["passed"] else -1.0,
"error_type": result.get("error_type"),
"synthetic": True,
})
return dataset
## Reward Hacking Detection

### Identifying and Mitigating Reward Exploitation
class RewardHackingDetector:
"""
Detect reward hacking: when the policy exploits
artifacts in the reward model rather than improving
actual quality.
Common reward hacking patterns in math/code:
- Verbose solutions that pad length (length bias)
- Solutions that copy the question (repetition exploit)
- Code that hardcodes test case outputs
- Math solutions that state the answer without proof
"""
def __init__(self, reward_model):
self.reward_model = reward_model
self.baseline_stats = {}
def detect_length_hacking(self, samples):
"""
Check if reward correlates with length independent
of quality.
If reward increases with solution length even among
incorrect solutions, the reward model has a length bias.
"""
import numpy as np
correct_lengths = []
correct_rewards = []
incorrect_lengths = []
incorrect_rewards = []
for sample in samples:
length = len(sample["solution"].split())
reward = self.reward_model.score(
sample["problem"], sample["solution"]
)
if sample["reward"] > 0:
correct_lengths.append(length)
correct_rewards.append(reward)
else:
incorrect_lengths.append(length)
incorrect_rewards.append(reward)
# Correlation within incorrect solutions
if len(incorrect_lengths) > 10:
corr = np.corrcoef(
incorrect_lengths, incorrect_rewards
)[0, 1]
if corr > 0.3:
return {
"hacking_type": "length_bias",
"correlation": float(corr),
"severity": "high" if corr > 0.5 else "medium",
"recommendation": (
"Add length penalty to reward or "
"train with length-controlled pairs"
),
}
return None
def detect_hardcoding(self, code_samples):
"""
Detect code that hardcodes test case outputs
rather than implementing the algorithm.
Patterns:
- if/elif chains matching exact test inputs
- Dictionary mapping inputs to outputs
- No loops or logic, only return statements
"""
detections = []
for sample in code_samples:
code = sample["solution"]
# Count if/elif chains
if_count = code.count("if ") + code.count("elif ")
line_count = len(code.strip().split("\n"))
# High if/elif density suggests hardcoding
if line_count > 0 and if_count / line_count > 0.4:
detections.append({
"sample": sample,
"pattern": "if_chain",
"density": if_count / line_count,
})
# Check for direct output mapping
if "return {" in code and code.count(":") > 5:
detections.append({
"sample": sample,
"pattern": "dict_mapping",
})
return detections
def detect_answer_copying(self, math_samples):
"""
Detect solutions that state the answer without
reasoning. The model may learn that asserting
'\\boxed{42}' early gets high reward.
"""
detections = []
for sample in math_samples:
solution = sample["solution"]
steps = solution.strip().split("\n")
# Check if answer appears in first 2 lines
for step in steps[:2]:
if "\\boxed{" in step or "answer is" in step.lower():
total_lines = len([
s for s in steps if s.strip()
])
if total_lines < 5:
detections.append({
"sample": sample,
"pattern": "early_answer",
"reasoning_steps": total_lines,
})
return detections
Reward hacking is the most insidious failure mode in RLHF for math/code. A model trained with a reward model that has a length bias will produce increasingly verbose solutions that score well on the reward model but are worse by human evaluation. In code, hardcoding test outputs produces solutions that pass all tests but do not generalize. Regular auditing with held-out test cases and human evaluation is essential.
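For the length-bias mode specifically, a common mitigation is an explicit length penalty applied on top of the reward model score (a sketch; `target_len` and `alpha` are illustrative values that would need tuning against held-out human evaluations):

```python
def length_penalized_reward(raw_reward: float, n_tokens: int,
                            target_len: int = 512,
                            alpha: float = 0.001) -> float:
    """Subtract a linear penalty for tokens beyond target_len,
    removing the incentive to pad solutions; rewards for
    solutions within the target length are untouched."""
    excess = max(0, n_tokens - target_len)
    return raw_reward - alpha * excess
```

Training with length-controlled preference pairs (comparing solutions of similar length) attacks the same bias at the data level rather than patching it at inference time.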
## Reward Model Training Pipeline

### Putting It All Together
class RewardModelTrainer:
"""
Train a reward model from the constructed dataset.
Architecture: Base LLM + value head (linear layer
projecting last hidden state to scalar reward).
Training: Bradley-Terry loss on preference pairs
or binary cross-entropy on (solution, label) pairs.
"""
def __init__(self, base_model, config):
self.base_model = base_model
self.learning_rate = config.get("lr", 1e-5)
self.epochs = config.get("epochs", 3)
self.batch_size = config.get("batch_size", 16)
self.loss_type = config.get("loss_type", "bce")
def prepare_training_data(self, orm_data, prm_data=None):
"""
Merge ORM and PRM data into a unified training set.
ORM data: (problem, solution, reward) triples
PRM data: (problem, step, step_label) triples
For a combined model, we train on both outcome-level
and step-level predictions.
"""
training_samples = []
# ORM samples
for sample in orm_data:
training_samples.append({
"input": (
f"Problem: {sample.problem}\n"
f"Solution: {sample.solution}"
),
"label": 1.0 if sample.reward > 0 else 0.0,
"weight": 1.0,
"type": "outcome",
})
# PRM samples (if available)
if prm_data:
for sample in prm_data:
for step_label in sample.step_labels:
prefix = "\n".join(
sample.solution_steps[
:step_label.step_index + 1
]
)
training_samples.append({
"input": (
f"Problem: {sample.problem}\n"
f"Solution so far: {prefix}"
),
"label": (
1.0
if step_label.label == "correct"
else 0.0
),
"weight": step_label.confidence,
"type": "process",
})
return training_samples
def compute_loss(self, predictions, labels, weights):
"""
Compute weighted binary cross-entropy loss.
loss = -w * [y * log(p) + (1-y) * log(1-p)]
where w is the sample weight (confidence),
y is the label, and p is the predicted reward.
"""
import numpy as np
epsilon = 1e-7
predictions = np.clip(predictions, epsilon, 1 - epsilon)
bce = -(
labels * np.log(predictions)
+ (1 - labels) * np.log(1 - predictions)
)
weighted_loss = bce * weights
return np.mean(weighted_loss)
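For preference-pair data, the Bradley-Terry alternative named in the class docstring reduces to a logistic loss on the score margin (a minimal sketch):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)): near
    zero when the chosen solution scores far above the rejected
    one, log(2) when the model cannot tell them apart."""
    margin = r_chosen - r_rejected
    # Numerically stable -log(sigmoid(margin)) for either sign
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Verified math/code labels make the BCE path above natural, while Bradley-Terry remains the standard choice when only relative preferences between two solutions are available.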
### Reward Model Accuracy by Data Source
*(Table omitted: reward model accuracy for ORM (math, execution-verified), PRM (math, MC-annotated), Code ORM (test-verified), and human preference (chat) across training-set sizes from 10 to 1000; cell values are not recoverable from the source.)*
## Key Takeaways
Reward model training data for math and code benefits from a unique advantage: verifiable ground truth. Unlike general chat where quality is subjective, math solutions are either correct or incorrect, and code either passes tests or fails.
The critical decisions:
- Answer equivalence checking is hard: For math ORM data, the answer extractor and equivalence checker determine data quality. Symbolic normalization catches most cases but misses 3-8% of equivalent answers. Investing in robust equivalence checking (SymPy integration, multiple normalization passes) directly improves reward model accuracy.
- Process reward models outperform outcome reward models: PRM data (step-level labels) produces reward models that are 2-5% more accurate than ORM data (final-answer-only labels) at the same data scale. The cost is 100-1000x more expensive annotation via Monte Carlo completion sampling.
- More tests mean better code reward data: False positive rate drops from 12% with 1-3 tests to 0.5% with 20-50 tests. Property-based testing (Hypothesis) achieves the lowest false positive rate but increases execution time 20x.
- Synthetic reward data works: Execution-verified synthetic data (generate problems, generate solutions, verify by execution) produces reward models within 2-3% of human-annotated data quality while eliminating annotation cost entirely.
- Reward hacking is a constant threat: Length bias, hardcoding, and early-answer patterns are the three most common reward hacking modes. Regular auditing with held-out tests and human spot-checks is not optional; it is a required component of any reward model training pipeline.