DPO, RLHF, KTO, and ORPO all require preference data: examples of “this response is better than that response” or “this response is good/bad.” The quality of this data determines the quality of alignment. This post covers how to build preference datasets — from human annotation through AI-assisted labeling to quality control.
The Data Format
Every alignment method needs one of two formats:
Paired preferences (for DPO, ORPO):
{
  "prompt": "Explain quantum entanglement",
  "chosen": "Quantum entanglement is a phenomenon where...",
  "rejected": "Quantum entanglement means particles are connected by invisible strings..."
}
Unpaired labels (for KTO):
{
  "prompt": "Explain quantum entanglement",
  "response": "Quantum entanglement is a phenomenon where...",
  "label": "good"
}
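The two formats convert cleanly in one direction: every paired example yields one "good" and one "bad" unpaired label (the reverse requires regrouping responses by prompt). A minimal sketch, assuming the JSON records above are loaded as Python dicts:

```python
def pairs_to_kto(paired_examples):
    """Convert paired DPO-style records into unpaired KTO-style labels.

    Each (chosen, rejected) pair yields two examples: the chosen response
    labeled "good" and the rejected response labeled "bad".
    """
    unpaired = []
    for ex in paired_examples:
        unpaired.append({"prompt": ex["prompt"], "response": ex["chosen"], "label": "good"})
        unpaired.append({"prompt": ex["prompt"], "response": ex["rejected"], "label": "bad"})
    return unpaired
```

This means any paired dataset can be reused for KTO, which is useful when you already have DPO data and want to try both methods.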
Human Annotation Pipeline
Step 1: Generate Response Pairs
For each prompt, generate 2-4 responses using the model being trained (or a mix of models):
def generate_response_pairs(model, prompts, responses_per_prompt=4):
    """Generate multiple responses per prompt for annotation."""
    dataset = []
    for prompt in prompts:
        responses = []
        for _ in range(responses_per_prompt):
            response = model.generate(
                prompt,
                temperature=1.0,  # High temp for diversity
                max_tokens=1024,
            )
            responses.append(response)
        dataset.append({"prompt": prompt, "responses": responses})
    return dataset
Step 2: Human Annotation
Present annotators with response pairs and ask: “Which response is better?” Annotation guidelines must specify:
- Helpfulness: Does the response answer the question?
- Correctness: Are the facts right?
- Safety: Does the response refuse harmful requests appropriately?
- Format: Is the response well-structured?
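A concrete annotation record might store a per-criterion verdict alongside metadata for the quality checks described later. This schema is hypothetical (the field names are illustrative, not a standard), but the `annotation_time_seconds` field matches what the quality-control checks below expect:

```python
# Hypothetical schema for a single completed annotation.
# Field names are illustrative assumptions, not a standard format.
annotation = {
    "prompt": "Explain quantum entanglement",
    "chosen": "Quantum entanglement is a phenomenon where...",
    "rejected": "Quantum entanglement means particles are connected by...",
    "criteria": {  # Per-criterion verdict from the guidelines above
        "helpfulness": "A",
        "correctness": "A",
        "safety": "TIE",
        "format": "A",
    },
    "annotator_id": "ann_042",
    "annotation_time_seconds": 38,
}
```

Recording per-criterion verdicts rather than a single overall choice makes disagreements easier to diagnose: two annotators may agree on correctness but split on format.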
Human Annotation Cost and Quality
| Method | Cost per Example | Quality (Agreement Rate) | Speed |
|---|---|---|---|
| Expert annotators | USD 2-5 | 85-90% inter-annotator agreement | 20-50 examples/hour |
| Crowd workers (MTurk) | USD 0.10-0.50 | 65-75% agreement | 100-200 examples/hour |
| AI-assisted (human verifies AI labels) | USD 0.02-0.10 | 80-85% agreement | 500-1000 examples/hour |
| Pure AI labeling (judge model) | USD 0.001-0.01 | 75-80% vs human labels | 10K+ examples/hour |
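The agreement rates in the table can be measured directly once two annotators have labeled the same pairs. A minimal sketch of raw agreement and chance-corrected Cohen's kappa, where `labels_1` and `labels_2` are hypothetical lists of "A"/"B"/"TIE" verdicts over the same items:

```python
from collections import Counter

def agreement_rate(labels_1, labels_2):
    """Raw inter-annotator agreement: fraction of items labeled identically."""
    assert len(labels_1) == len(labels_2)
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)

def cohens_kappa(labels_1, labels_2):
    """Agreement corrected for chance, for two annotators on the same items."""
    n = len(labels_1)
    p_observed = agreement_rate(labels_1, labels_2)
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement: probability both annotators independently pick the
    # same label, given each annotator's label frequencies.
    p_chance = sum((counts_1[k] / n) * (counts_2[k] / n)
                   for k in set(counts_1) | set(counts_2))
    return (p_observed - p_chance) / (1 - p_chance)
```

Raw agreement is what the table reports; kappa is worth tracking as well, since three-way labels with a skewed distribution can show inflated raw agreement purely by chance.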
Step 3: AI-Assisted Labeling
Use a strong judge model (GPT-4, Claude) to generate initial preference labels, then have humans verify only the uncertain cases:
def ai_assisted_labeling(judge_model, prompt, response_a, response_b):
    """Use AI judge with human verification for uncertain cases."""
    judge_prompt = f"""Compare these two responses to the user query.

User: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider helpfulness, correctness, and safety.
Output exactly one of: "A", "B", or "TIE".
Also output your confidence: "HIGH" or "LOW"."""
    judgment = judge_model.generate(judge_prompt)
    choice = parse_choice(judgment)          # "A", "B", or "TIE"
    confidence = parse_confidence(judgment)  # "HIGH" or "LOW"
    if confidence == "LOW":
        return None  # Route to human annotator
    if choice == "TIE":
        return None  # Ties produce no usable preference pair
    return {"chosen": response_a if choice == "A" else response_b,
            "rejected": response_b if choice == "A" else response_a}
In practice, a strong AI judge labels roughly 80% of examples with HIGH confidence, and on those it agrees closely with expert human annotators; only the remaining ~20% need human review. This reduces annotation cost by 5-10x while keeping quality within 2-3% of full human annotation.
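The cost reduction follows from simple arithmetic: every example pays the cheap AI-judgment cost, and only the low-confidence fraction also pays the human-review cost. A sketch with illustrative numbers drawn from the cost table above (the specific per-example prices are assumptions for the example):

```python
def blended_cost_per_example(ai_cost, human_cost, human_fraction):
    """Expected annotation cost per example for an AI-assisted pipeline.

    Every example gets an AI judgment; only a fraction (the low-confidence
    cases) is additionally routed to a human annotator.
    """
    return ai_cost + human_fraction * human_cost

# Illustrative: USD 0.005 per AI judgment, USD 3.50 per expert label,
# 20% of examples routed to humans.
cost = blended_cost_per_example(0.005, 3.50, 0.20)  # 0.705 USD/example
```

At these assumed prices the blended cost is roughly 5x below full expert annotation, consistent with the 5-10x range above; routing fewer cases to humans pushes it toward the high end.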
Quality Control
Detecting Low-Quality Annotations
def detect_low_quality(annotations):
    """Flag annotations that may be unreliable."""
    issues = []
    for ann in annotations:
        # Check 1: Response pair too similar (hard to distinguish)
        similarity = compute_similarity(ann["chosen"], ann["rejected"])
        if similarity > 0.95:
            issues.append(("too_similar", ann))

        # Check 2: Annotation time too fast (annotator may be rushing)
        if ann.get("annotation_time_seconds", 999) < 5:
            issues.append(("too_fast", ann))

        # Check 3: Chosen response is objectively wrong (fact-check)
        if contains_obvious_errors(ann["chosen"]):
            issues.append(("chosen_has_errors", ann))
    return issues
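The `compute_similarity` helper is left abstract above. One minimal stand-in, using only the standard library, is character-level sequence matching; embedding cosine similarity is a common alternative when near-paraphrases need to be caught:

```python
import difflib

def compute_similarity(text_a, text_b):
    """Character-level similarity in [0, 1].

    A simple stand-in for the similarity check in detect_low_quality;
    difflib's ratio is 2*M/T, where M is the number of matching characters
    and T is the total length of both strings.
    """
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()
```

This catches pairs that differ only by a few words, which is exactly the "too similar to rank" failure mode; it will not catch semantically identical responses with different wording.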
Filtering with Reward Model
After initial annotation, train a reward model on the data and use it to verify annotations:
def verify_with_reward_model(rm, annotations, threshold=0.1):
    """Use reward model to flag annotations where RM disagrees."""
    verified = []
    flagged = []
    for ann in annotations:
        chosen_score = rm.score(ann["prompt"], ann["chosen"])
        rejected_score = rm.score(ann["prompt"], ann["rejected"])
        margin = chosen_score - rejected_score
        if margin < threshold:
            flagged.append(ann)  # RM does not clearly prefer chosen
        else:
            verified.append(ann)
    return verified, flagged
Dataset Size Requirements
Preference Dataset Sizes Used by Frontier Models
| Model/Dataset | Pairs | Method | Quality Level |
|---|---|---|---|
| Anthropic HH-RLHF | 170K | Human annotation | Expert |
| OpenAI (estimated) | 1-10M | Human + AI mix | Expert + AI |
| UltraFeedback | 64K | AI-generated (GPT-4 judge) | AI-only |
| Nectar | 183K | AI-generated (multiple judges) | AI-only |
| Typical open-source DPO | 10K-50K | Mixed | Varies |
For DPO alignment of a 7B model, 10K high-quality preference pairs are enough for measurable improvement, and 50K pairs is the sweet spot for most open-source models. Beyond 100K, returns diminish unless the additional data covers domains or edge cases absent from the first 50K.
Putting It Together
As a closing exercise: implement a function that takes a list of (prompt, response_a, response_b) triples and uses a judge model to generate DPO training data in the correct format.
A reference solution:
def create_dpo_dataset(judge_model, triples):
    dataset = []
    for prompt, response_a, response_b in triples:
        judgment = judge_model.compare(prompt, response_a, response_b)
        if judgment == "A":
            dataset.append({"prompt": prompt, "chosen": response_a, "rejected": response_b})
        elif judgment == "B":
            dataset.append({"prompt": prompt, "chosen": response_b, "rejected": response_a})
        # TIE judgments are skipped
    return dataset