DPO, RLHF, KTO, and ORPO all require preference data: examples of “this response is better than that response” or “this response is good/bad.” The quality of this data determines the quality of alignment. This post covers how to build preference datasets — from human annotation through AI-assisted labeling to quality control.
The Data Format
Every alignment method needs one of two formats:
Paired preferences (for DPO, ORPO):
{
  "prompt": "Explain quantum entanglement",
  "chosen": "Quantum entanglement is a phenomenon where...",
  "rejected": "Quantum entanglement means particles are connected by invisible strings..."
}
Unpaired labels (for KTO):
{
  "prompt": "Explain quantum entanglement",
  "response": "Quantum entanglement is a phenomenon where...",
  "label": "good"
}
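The two formats convert cleanly in one direction: every paired example yields one "good" and one "bad" unpaired label (the reverse requires regrouping responses by prompt). A minimal sketch, assuming the JSON records above are loaded as Python dicts:

```python
def pairs_to_kto(paired_examples):
    """Convert paired DPO-style records into unpaired KTO-style labels.

    Each (chosen, rejected) pair yields two examples: the chosen response
    labeled "good" and the rejected response labeled "bad".
    """
    unpaired = []
    for ex in paired_examples:
        unpaired.append({"prompt": ex["prompt"], "response": ex["chosen"], "label": "good"})
        unpaired.append({"prompt": ex["prompt"], "response": ex["rejected"], "label": "bad"})
    return unpaired
```

This means any paired dataset can be reused for KTO, which is useful when you already have DPO data and want to try both methods.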
Human Annotation Pipeline
Step 1: Generate Response Pairs
For each prompt, generate 2-4 responses using the model being trained (or a mix of models):
def generate_response_pairs(model, prompts, responses_per_prompt=4):
    """Generate multiple responses per prompt for annotation."""
    dataset = []
    for prompt in prompts:
        responses = []
        for _ in range(responses_per_prompt):
            response = model.generate(
                prompt,
                temperature=1.0,  # High temp for diversity
                max_tokens=1024,
            )
            responses.append(response)
        dataset.append({"prompt": prompt, "responses": responses})
    return dataset
Step 2: Human Annotation
Present annotators with response pairs and ask: “Which response is better?” Annotation guidelines must specify:
- Helpfulness: Does the response answer the question?
- Correctness: Are the facts right?
- Safety: Does the response refuse harmful requests appropriately?
- Format: Is the response well-structured?
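A concrete annotation record might store a per-criterion verdict alongside metadata for the quality checks described later. This schema is hypothetical (the field names are illustrative, not a standard), but the `annotation_time_seconds` field matches what the quality-control checks below expect:

```python
# Hypothetical schema for a single completed annotation.
# Field names are illustrative assumptions, not a standard format.
annotation = {
    "prompt": "Explain quantum entanglement",
    "chosen": "Quantum entanglement is a phenomenon where...",
    "rejected": "Quantum entanglement means particles are connected by...",
    "criteria": {  # Per-criterion verdict from the guidelines above
        "helpfulness": "A",
        "correctness": "A",
        "safety": "TIE",
        "format": "A",
    },
    "annotator_id": "ann_042",
    "annotation_time_seconds": 38,
}
```

Recording per-criterion verdicts rather than a single overall choice makes disagreements easier to diagnose: two annotators may agree on correctness but split on format.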
Human Annotation Cost and Quality
| Method | Cost per Example | Quality (Agreement Rate) | Speed |
|---|---|---|---|
| Expert annotators | USD 2-5 | 85-90% inter-annotator agreement | 20-50 examples/hour |
| Crowd workers (MTurk) | USD 0.10-0.50 | 65-75% agreement | 100-200 examples/hour |
| AI-assisted (human verifies AI labels) | USD 0.02-0.10 | 80-85% agreement | 500-1000 examples/hour |
| Pure AI labeling (judge model) | USD 0.001-0.01 | 75-80% vs human labels | 10K+ examples/hour |
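The agreement rates in the table can be measured directly once two annotators have labeled the same pairs. A minimal sketch of raw agreement and chance-corrected Cohen's kappa, where `labels_1` and `labels_2` are hypothetical lists of "A"/"B"/"TIE" verdicts over the same items:

```python
from collections import Counter

def agreement_rate(labels_1, labels_2):
    """Raw inter-annotator agreement: fraction of items labeled identically."""
    assert len(labels_1) == len(labels_2)
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)

def cohens_kappa(labels_1, labels_2):
    """Agreement corrected for chance, for two annotators on the same items."""
    n = len(labels_1)
    p_observed = agreement_rate(labels_1, labels_2)
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement: probability both annotators independently pick the
    # same label, given each annotator's label frequencies.
    p_chance = sum((counts_1[k] / n) * (counts_2[k] / n)
                   for k in set(counts_1) | set(counts_2))
    return (p_observed - p_chance) / (1 - p_chance)
```

Raw agreement is what the table reports; kappa is worth tracking as well, since three-way labels with a skewed distribution can show inflated raw agreement purely by chance.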
Step 3: AI-Assisted Labeling
Use a strong judge model (GPT-4, Claude) to generate initial preference labels, then have humans verify only the uncertain cases:
def ai_assisted_labeling(judge_model, prompt, response_a, response_b):
    """Use AI judge with human verification for uncertain cases."""
    judge_prompt = f"""Compare these two responses to the user query.

User: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider helpfulness, correctness, and safety.
Output exactly one of: "A", "B", or "TIE".
Also output your confidence: "HIGH" or "LOW"."""
    judgment = judge_model.generate(judge_prompt)
    choice = parse_choice(judgment)          # "A", "B", or "TIE"
    confidence = parse_confidence(judgment)  # "HIGH" or "LOW"
    if confidence == "LOW":
        return None  # Route to human annotator
    if choice == "TIE":
        return None  # Ties produce no usable preference pair
    return {"chosen": response_a if choice == "A" else response_b,
            "rejected": response_b if choice == "A" else response_a}
In practice, a strong AI judge labels roughly 80% of examples with HIGH confidence, and on those it agrees closely with expert human annotators; only the remaining ~20% need human review. This reduces annotation cost by 5-10x while keeping quality within 2-3% of full human annotation.
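The cost reduction follows from simple arithmetic: every example pays the cheap AI-judgment cost, and only the low-confidence fraction also pays the human-review cost. A sketch with illustrative numbers drawn from the cost table above (the specific per-example prices are assumptions for the example):

```python
def blended_cost_per_example(ai_cost, human_cost, human_fraction):
    """Expected annotation cost per example for an AI-assisted pipeline.

    Every example gets an AI judgment; only a fraction (the low-confidence
    cases) is additionally routed to a human annotator.
    """
    return ai_cost + human_fraction * human_cost

# Illustrative: USD 0.005 per AI judgment, USD 3.50 per expert label,
# 20% of examples routed to humans.
cost = blended_cost_per_example(0.005, 3.50, 0.20)  # 0.705 USD/example
```

At these assumed prices the blended cost is roughly 5x below full expert annotation, consistent with the 5-10x range above; routing fewer cases to humans pushes it toward the high end.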
Quality Control
Detecting Low-Quality Annotations
def detect_low_quality(annotations):
    """Flag annotations that may be unreliable."""
    issues = []
    for ann in annotations:
        # Check 1: Response pair too similar (hard to distinguish)
        similarity = compute_similarity(ann["chosen"], ann["rejected"])
        if similarity > 0.95:
            issues.append(("too_similar", ann))

        # Check 2: Annotation time too fast (annotator may be rushing)
        if ann.get("annotation_time_seconds", 999) < 5:
            issues.append(("too_fast", ann))

        # Check 3: Chosen response is objectively wrong (fact-check)
        if contains_obvious_errors(ann["chosen"]):
            issues.append(("chosen_has_errors", ann))
    return issues
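The `compute_similarity` helper is left abstract above. One minimal stand-in, using only the standard library, is character-level sequence matching; embedding cosine similarity is a common alternative when near-paraphrases need to be caught:

```python
import difflib

def compute_similarity(text_a, text_b):
    """Character-level similarity in [0, 1].

    A simple stand-in for the similarity check in detect_low_quality;
    difflib's ratio is 2*M/T, where M is the number of matching characters
    and T is the total length of both strings.
    """
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()
```

This catches pairs that differ only by a few words, which is exactly the "too similar to rank" failure mode; it will not catch semantically identical responses with different wording.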
Filtering with Reward Model
After initial annotation, train a reward model on the data and use it to verify annotations:
def verify_with_reward_model(rm, annotations, threshold=0.1):
    """Use reward model to flag annotations where RM disagrees."""
    verified = []
    flagged = []
    for ann in annotations:
        chosen_score = rm.score(ann["prompt"], ann["chosen"])
        rejected_score = rm.score(ann["prompt"], ann["rejected"])
        margin = chosen_score - rejected_score
        if margin < threshold:
            flagged.append(ann)  # RM does not clearly prefer chosen
        else:
            verified.append(ann)
    return verified, flagged
Dataset Size Requirements
Preference Dataset Sizes Used by Frontier Models
| Model/Dataset | Pairs | Method | Quality Level |
|---|---|---|---|
| Anthropic HH-RLHF | 170K | Human annotation | Expert |
| OpenAI (estimated) | 1-10M | Human + AI mix | Expert + AI |
| UltraFeedback | 64K | AI-generated (GPT-4 judge) | AI-only |
| Nectar | 183K | AI-generated (multiple judges) | AI-only |
| Typical open-source DPO | 10K-50K | Mixed | Varies |
For DPO alignment of a 7B model, 10K high-quality preference pairs are enough for measurable improvement, and 50K pairs is the sweet spot for most open-source models. Beyond 100K, returns diminish unless the additional data covers domains or edge cases absent from the first 50K.
Putting It Together
As a closing exercise: implement a function that takes a list of (prompt, response_a, response_b) triples and uses a judge model to generate DPO training data in the correct format.
A reference solution:
def create_dpo_dataset(judge_model, triples):
    dataset = []
    for prompt, response_a, response_b in triples:
        judgment = judge_model.compare(prompt, response_a, response_b)
        if judgment == "A":
            dataset.append({"prompt": prompt, "chosen": response_a, "rejected": response_b})
        elif judgment == "B":
            dataset.append({"prompt": prompt, "chosen": response_b, "rejected": response_a})
        # TIE judgments are skipped
    return dataset