Training data is the single largest determinant of LLM quality. The model architecture matters. The optimizer matters. But the data matters more. And the data problem has two dimensions: quantity (you need billions of tokens for pretraining, millions of examples for instruction tuning) and quality (one bad example can teach the model a bad habit that persists across thousands of good examples). Human-generated data scores high on quality but fails catastrophically on quantity and cost. Synthetic data inverts this tradeoff: a strong model generates the training examples, a filtering pipeline enforces quality, and the result is a dataset that a weaker model can learn from at a fraction of the cost.
This post covers the full pipeline: why synthetic data works from a distillation perspective, the Magpie technique for generating instruction-response pairs from a model’s own latent distribution, NVIDIA’s Nemotron-4 reward-model-guided synthesis, the multi-stage quality filtering pipeline that separates signal from noise, domain-specific synthesis strategies, and a complete implementation exercise that you can run today.
1. Why Synthetic Data: The Economics and the Theory
The Cost Problem
Instruction-tuning data requires paired examples: an instruction (what the user asks) and a response (what the model should produce). Generating these by hand means hiring domain experts, writing prompts, writing gold-standard answers, and reviewing everything. The numbers are stark:
Cost Per Training Example by Source
| Source | Cost/Example | Throughput | Quality Control | Diversity |
|---|---|---|---|---|
| Expert annotation (PhD-level) | $25-50 | 5-20/day/person | High (human review) | Low (annotator bias) |
| Crowd annotation (MTurk) | $1-5 | 50-200/day/person | Medium (inter-annotator agreement) | Medium |
| Synthetic (GPT-4 class) | $0.005-0.02 | 10K-100K/day/API | Requires filtering pipeline | High (controllable) |
| Synthetic (open model, local) | $0.001-0.005 | 1K-50K/day/GPU | Requires filtering pipeline | High (controllable) |
At $30 per expert-annotated example, building a 100K-example instruction-tuning dataset costs $3M and takes months. At $0.01 per synthetic example, the same dataset costs $1K and takes days. Even if you filter out 90% of synthetic examples as low-quality, the economics still win by two orders of magnitude.
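The arithmetic behind that comparison is easy to check. Here is a minimal sketch, assuming the per-example prices quoted above and treating retention as the fraction of generated examples that survive filtering; `dataset_cost` is an illustrative helper, not part of any library.

```python
def dataset_cost(n_final: int, cost_per_example: float, retention: float = 1.0) -> float:
    """Generation cost to end up with n_final examples after filtering."""
    if not 0 < retention <= 1:
        raise ValueError("retention must be in (0, 1]")
    n_generated = n_final / retention  # examples you must generate up front
    return n_generated * cost_per_example


expert = dataset_cost(100_000, 30.00)                    # no filtering assumed
synthetic = dataset_cost(100_000, 0.01)                  # keep everything
synthetic_filtered = dataset_cost(100_000, 0.01, 0.10)   # keep 1 in 10

print(f"expert:             ${expert:,.0f}")
print(f"synthetic:          ${synthetic:,.0f}")
print(f"synthetic filtered: ${synthetic_filtered:,.0f}")
print(f"advantage after filtering: {expert / synthetic_filtered:,.0f}x")
```

Even with a 10x overgeneration factor, the synthetic route stays a few hundred times cheaper under these assumed prices.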
The Distillation Argument
Synthetic data generation is a form of knowledge distillation. A strong “teacher” model with parameters $\theta_T$ generates outputs that capture its learned distribution. A weaker “student” model with parameters $\theta_S$ trains on these outputs. The training objective for the student on synthetic data is:
$$\mathcal{L}(\theta_S) = -\mathbb{E}_{x \sim \mathcal{D},\; y \sim p_{\theta_T}(\cdot \mid x)}\left[\log p_{\theta_S}(y \mid x)\right]$$
where $x$ is the instruction and $y$ is the teacher-generated response. The student learns to mimic the teacher’s output distribution, not just memorize individual examples. This is strictly more informative than training on the raw text alone because the teacher’s responses embed reasoning patterns, formatting conventions, and factual knowledge that the student would otherwise need to learn from scratch.
The empirical evidence is strong. Alpaca (Stanford, 2023) trained LLaMA-7B on 52K instruction-response pairs generated by text-davinci-003 for under $500. The resulting model matched the original Davinci model on many benchmarks despite having 25x fewer parameters. WizardLM (2023) used a synthetic “Evol-Instruct” pipeline to generate increasingly complex instructions and achieved state-of-the-art instruction-following at the 7B scale.
When Synthetic Data Fails
Synthetic data is not free from failure modes. The critical ones:
- Model collapse: If the student trains on data from a teacher that was itself trained on synthetic data from a previous generation, quality degrades. Each generation amplifies distributional artifacts. After 5-10 generations of recursive synthesis, outputs become repetitive and lose diversity. The fix: always use a teacher model trained on real data, never on its own synthetic outputs.
- Systematic blind spots: The teacher model has its own failure modes. If GPT-4 consistently gets a certain class of math problems wrong, every synthetic example for that class will contain the same error. The student learns the error as if it were ground truth. The fix: reward model scoring and domain-specific validation.
- Distribution mismatch: Synthetic instructions may not match the distribution of real user queries. A model prompted to generate “diverse instructions” tends to produce academically-flavored questions, not the messy, ambiguous, multi-turn queries that real users submit. The fix: seed the generation with real user query distributions (anonymized logs, public datasets).
Synthetic data has no inherent quality guarantee. Without filtering, a synthetic dataset is worse than a curated human dataset of the same size. The entire value proposition depends on generating 10-100x more examples than you need and filtering aggressively. The pipeline IS the product.
2. The Magpie Technique: Mining Instructions from Model Internals
The Core Insight
Magpie (Xu et al., 2024) starts from a simple observation: instruction-tuned LLMs have already internalized a distribution over instructions. When you send an empty user turn to a chat model (literally just the system prompt followed by the user turn header, with no user content), the model will generate text that looks like a plausible user instruction. It does this because the chat template conditions the model to expect a user message, and without one, the model fills in the gap from its training distribution.
This is not a prompt hack. It is sampling from the model’s learned prior over the instruction space. The instructions are diverse, natural, and span the full range of tasks the model was trained on — because they literally come from the same distribution.
The Method
The Magpie pipeline has three stages:
Stage 1: Instruction Generation. Feed the model a chat template with an empty user turn and let it generate the instruction.
For Llama-3-Instruct, the template looks like this:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
The model then generates text as if completing a user message. It might produce: “Can you explain the difference between TCP and UDP, focusing on when you’d choose one over the other for a real-time gaming application?” This is not a canned response — it is a sample from the model’s instruction prior.
Stage 2: Response Generation. Take the generated instruction, format it as a proper chat message, and let the same model (or a stronger one) generate the response.
Stage 3: Quality Filtering. Score each instruction-response pair and keep only the top fraction.
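Stage 1 needs a raw text-completion endpoint, because a chat endpoint always closes the user turn for you. The sketch below assumes an OpenAI-compatible server (for example a local vLLM instance) that exposes `completions.create`; `build_pre_query_template` and `sample_instruction` are illustrative names, and the template follows the Llama-3 layout shown above with the blank line after each header written out explicitly.

```python
from typing import Any


def build_pre_query_template(system_prompt: str) -> str:
    """Llama-3 chat template cut off right after the (empty) user header."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
    )


def sample_instruction(
    client: Any,  # assumed: an OpenAI-style client exposing completions.create
    model: str,
    system_prompt: str = "You are a helpful AI assistant.",
) -> str:
    """Let the model write the user message it expects to see."""
    resp = client.completions.create(
        model=model,
        prompt=build_pre_query_template(system_prompt),
        temperature=1.0,
        max_tokens=256,
        stop=["<|eot_id|>"],  # end of the sampled user turn
    )
    return resp.choices[0].text.strip()
```

Some servers prepend the beginning-of-text token on their own; if yours does, drop it from the template to avoid sending it twice.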
Implementation
Here is the complete pipeline for generating Magpie-style instruction-response pairs:
import json
import asyncio
from dataclasses import dataclass, asdict

from openai import AsyncOpenAI


@dataclass
class SyntheticExample:
    instruction: str
    response: str
    instruction_length: int
    response_length: int
    generation_model: str


SYSTEM_PROMPT = "You are a helpful AI assistant."

# True Magpie sampling sends the raw chat template, cut off at the empty
# user turn, to a text-completion endpoint. The chat completions API cannot
# express an open user turn, so this version approximates it with an
# explicit, topic-free generation prompt.
INSTRUCTION_GEN_PROMPT = (
    "Generate a single, specific user instruction or question. "
    "The instruction should be self-contained, require a substantive "
    "response, and cover any topic. Output ONLY the instruction text, "
    "nothing else."
)


async def generate_instruction(
    client: AsyncOpenAI,
    model: str,
    temperature: float = 1.0,
) -> str:
    """Generate a single instruction (prompt-based approximation of Magpie)."""
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": INSTRUCTION_GEN_PROMPT},
        ],
        temperature=temperature,
        max_tokens=256,
    )
    # content can be None (e.g. on a filtered completion); treat it as empty
    return (response.choices[0].message.content or "").strip()


async def generate_response(
    client: AsyncOpenAI,
    model: str,
    instruction: str,
    temperature: float = 0.7,
) -> str:
    """Generate a response to the given instruction."""
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        temperature=temperature,
        max_tokens=2048,
    )
    return (response.choices[0].message.content or "").strip()


async def generate_batch(
    client: AsyncOpenAI,
    model: str,
    n: int,
    concurrency: int = 50,
) -> list[SyntheticExample]:
    """Generate n instruction-response pairs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def generate_one() -> SyntheticExample | None:
        async with semaphore:
            try:
                instruction = await generate_instruction(client, model)
                if len(instruction) < 10 or len(instruction) > 1000:
                    return None  # Length filter on instructions
                response = await generate_response(
                    client, model, instruction
                )
                if len(response) < 50:
                    return None  # Reject trivially short responses
                return SyntheticExample(
                    instruction=instruction,
                    response=response,
                    instruction_length=len(instruction),
                    response_length=len(response),
                    generation_model=model,
                )
            except Exception:
                # A failed request just drops this pair; the batch continues
                return None

    tasks = [generate_one() for _ in range(n)]
    completed = await asyncio.gather(*tasks)
    return [r for r in completed if r is not None]
The key design decisions:
- Temperature 1.0 for instructions: Maximizes diversity. Lower temperatures produce repetitive instructions that cluster around common patterns.
- Temperature 0.7 for responses: Balances quality and diversity. Too high and responses become incoherent. Too low and they become formulaic.
- Concurrency control: API rate limits and model serving throughput cap the effective parallelism. The semaphore prevents overloading.
- Inline length filters: Reject obviously bad examples immediately rather than wasting downstream filtering compute.
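The concurrency bound can be checked in isolation. This is a small sketch with a stand-in coroutine in place of the API call (the `run_bounded` helper and the 10 ms sleep are illustrative), recording the peak number of jobs in flight:

```python
import asyncio


async def run_bounded(n_tasks: int, concurrency: int) -> int:
    """Run n_tasks stand-in jobs and return the peak number in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def job() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # at most `concurrency` jobs pass this point
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for an API round trip
            in_flight -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_bounded(n_tasks=20, concurrency=5))
print(f"peak concurrent jobs: {peak}")
```

All 20 coroutines are created at once, but only five ever hold the semaphore at the same time, which is the behavior you want when a rate limit sits behind the call.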
The Magpie Numbers
The original Magpie paper reports the following pipeline statistics for Llama-3-8B-Instruct:
Magpie Pipeline Statistics (Llama-3-8B-Instruct)
| Stage | Examples Remaining | Stage Retention | Cumulative Retention |
|---|---|---|---|
| Raw generation | 3,000,000 | 100% | 100% |
| Length filter (instruction) | 2,700,000 | 90% | 90% |
| Length filter (response) | 2,400,000 | 89% | 80% |
| Deduplication | 1,800,000 | 75% | 60% |
| Perplexity filter | 1,200,000 | 67% | 40% |
| Reward model filter (top quartile) | 300,000 | 25% | 10% |
The 10:1 raw-to-final ratio is typical. You generate far more than you need and filter aggressively. The 300K surviving examples from Magpie-Air matched or exceeded datasets that were 10x larger but unfiltered on the AlpacaEval 2.0 and Arena-Hard benchmarks.
With a local Llama-3-70B-Instruct on 8x A100 GPUs using vLLM, you can generate approximately 5,000 instruction-response pairs per hour. The full 3M raw pipeline takes about 25 days on a single 8-GPU node. With 4 nodes, about a week. API-based generation with GPT-4 at 100 RPM takes roughly 500 hours for 3M pairs if each pair is a single request, or about 1,000 hours when the instruction and the response are separate calls, and costs scale linearly with API pricing.
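These wall-clock estimates are simple rate calculations. A quick sketch, using the throughput figures quoted above as assumptions (the helper names are illustrative):

```python
def hours_local(n_pairs: int, pairs_per_hour: float, nodes: int = 1) -> float:
    """Wall-clock hours for local generation across identical nodes."""
    return n_pairs / (pairs_per_hour * nodes)


def hours_api(n_pairs: int, requests_per_minute: float, requests_per_pair: int = 1) -> float:
    """Wall-clock hours when the API rate limit is the bottleneck."""
    return n_pairs * requests_per_pair / requests_per_minute / 60


one_node = hours_local(3_000_000, 5_000)
four_nodes = hours_local(3_000_000, 5_000, nodes=4)
api_single = hours_api(3_000_000, 100)
api_two_calls = hours_api(3_000_000, 100, requests_per_pair=2)

print(f"1 node:  {one_node:.0f} h ({one_node / 24:.1f} days)")
print(f"4 nodes: {four_nodes:.0f} h ({four_nodes / 24:.2f} days)")
print(f"API, 1 request per pair:  {api_single:.0f} h")
print(f"API, 2 requests per pair: {api_two_calls:.0f} h")
```

Note that a pipeline issuing one call for the instruction and another for the response doubles the request count, and with it the rate-limited wall-clock time.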
Why Magpie Works Better Than Prompt-Based Generation
Previous approaches like Self-Instruct (Wang et al., 2023) used explicit prompts: “Generate a creative writing task” or “Generate a coding question about recursion.” These approaches suffer from two problems:
- Prompt bias: The generated instructions are constrained by the prompt categories you enumerate. If you forget to include “data visualization” as a category, you get zero data visualization instructions.
- Distribution mismatch: Hand-crafted category lists do not match the actual distribution of user queries. You end up over-representing some topics and under-representing others.
Magpie avoids both problems because the instructions come from the model’s own training distribution, which was shaped by millions of real user interactions during RLHF. The instruction diversity is not designed — it is emergent.
3. Nemotron-4: Reward-Model-Guided Synthesis at NVIDIA Scale
The Architecture
NVIDIA’s approach to synthetic data generation, published with the Nemotron-4 340B family (Adler et al., 2024), differs from Magpie in a fundamental way: instead of relying on post-hoc filtering, it uses a dedicated reward model as an online quality signal during generation. The reward model scores every generated response across multiple dimensions, and only responses that pass all thresholds become training data.
The system has three components:
- Generator: Nemotron-4-340B-Instruct generates candidate responses.
- Reward Model: Nemotron-4-340B-Reward scores each response on 5 dimensions.
- Filter: A multi-dimensional threshold gate that only passes examples scoring above the minimum on every dimension simultaneously.
The Five Reward Dimensions
The Nemotron-4 reward model outputs five scalar scores per response, each on roughly a 0-4 scale (the Likert range used by the HelpSteer annotation scheme it was trained on):
Nemotron-4 Reward Model Scoring Dimensions
| Dimension | Measures | Typical Threshold | Reject If |
|---|---|---|---|
| Helpfulness | Does the response address the instruction? | 3.5 | Off-topic, refusals, partial answers |
| Correctness | Are the facts and reasoning accurate? | 3.5 | Factual errors, logical fallacies |
| Coherence | Is the response well-structured and readable? | 3.0 | Rambling, contradictions, repetition |
| Complexity | Does the response handle nuance appropriately? | 2.5 | Oversimplified answers to complex questions |
| Verbosity | Is the length appropriate? | 2.0 | Excessively long or short for the task |
The multi-dimensional scoring is critical. A response can be helpful but incorrect (confidently wrong answers), correct but incoherent (technically right but impossible to follow), or coherent but unhelpful (well-written refusal). Single-score reward models conflate these failure modes. The 5-dimension approach lets you set independent thresholds per dimension and diagnose exactly why examples fail.
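To make the diagnostic benefit concrete, here is a minimal sketch of a gate that reports which dimensions an example failed instead of returning a bare pass/fail. The thresholds are the "typical" values from the table above; the function name and the example scores are made up for illustration.

```python
THRESHOLDS = {
    "helpfulness": 3.5,
    "correctness": 3.5,
    "coherence": 3.0,
    "complexity": 2.5,
    "verbosity": 2.0,
}


def failed_dimensions(
    scores: dict[str, float],
    thresholds: dict[str, float] = THRESHOLDS,
) -> list[str]:
    """Return the dimensions scoring below threshold (empty list = pass)."""
    return [
        dim
        for dim, minimum in thresholds.items()
        # A missing score counts as a failure on that dimension
        if scores.get(dim, float("-inf")) < minimum
    ]


# Hypothetical scores for a fluent but factually wrong answer
confident_but_wrong = {
    "helpfulness": 3.8,
    "correctness": 2.1,
    "coherence": 3.6,
    "complexity": 2.9,
    "verbosity": 2.4,
}
print(failed_dimensions(confident_but_wrong))
```

Aggregating these lists over a whole batch shows which dimension is responsible for most rejections, which a single scalar score cannot tell you.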
The Reward Model API Call
Here is how to score a single example using the Nemotron reward model:
import requests
from dataclasses import dataclass


@dataclass
class RewardScores:
    helpfulness: float
    correctness: float
    coherence: float
    complexity: float
    verbosity: float

    def passes_threshold(
        self,
        min_helpfulness: float = 3.5,
        min_correctness: float = 3.5,
        min_coherence: float = 3.0,
        min_complexity: float = 2.5,
        min_verbosity: float = 2.0,
    ) -> bool:
        """True only if every dimension clears its own minimum."""
        return (
            self.helpfulness >= min_helpfulness
            and self.correctness >= min_correctness
            and self.coherence >= min_coherence
            and self.complexity >= min_complexity
            and self.verbosity >= min_verbosity
        )


def score_with_nemotron(
    instruction: str,
    response: str,
    api_url: str = "http://localhost:8000/v1/chat/completions",
    model: str = "nemotron-4-340b-reward",
) -> RewardScores:
    """Score an instruction-response pair using Nemotron-4 reward model."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ],
    }
    resp = requests.post(api_url, json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()
    # Assumes the response layout of NVIDIA's hosted endpoint: one logprobs
    # entry per attribute, with the attribute name in "token" and the score
    # in "logprob". Verify this against your own deployment.
    entries = data["choices"][0]["logprobs"]["content"]
    scores = {entry["token"]: float(entry["logprob"]) for entry in entries}
    return RewardScores(
        helpfulness=scores["helpfulness"],
        correctness=scores["correctness"],
        coherence=scores["coherence"],
        complexity=scores["complexity"],
        verbosity=scores["verbosity"],
    )
For local deployment, the reward model runs on 4-8 A100 GPUs using a standard vLLM or TensorRT-LLM serving setup. Throughput is approximately 200-500 scoring calls per second on 8x A100, making it feasible to score millions of examples in hours.
The Filtering Economics
The Nemotron approach inverts the economics of quality filtering. Instead of generating cheap examples and discarding most of them, it generates expensive examples (using a 340B model) and filters precisely:
Pass Rate by Reward Dimension (Nemotron-4 Pipeline) (chart; values are % of generated examples)
Only 38% of examples from a 340B model pass all five gates simultaneously. That retention rate is higher than Magpie’s 10%, but the pipeline starts from a higher-quality generator and every surviving example has cleared five independent thresholds. The result: HelpSteer2, a dataset of approximately 10,000 examples that improved a Llama-3-70B model by 3-5% across MT-Bench, AlpacaEval, and Arena-Hard.
The HelpSteer2 dataset contains only ~10,000 examples. This is not a typo. High-quality, multi-dimensionally-scored synthetic data is so information-dense that 10K examples can shift a 70B model’s behavior measurably. Compare this to early synthetic datasets (Alpaca: 52K, Dolly: 15K) that used no reward model filtering. Quality beats quantity when quality is rigorously measured.
Nemotron vs. Magpie: When to Use Which
The two approaches target different regimes:
Magpie vs. Nemotron: Approach Comparison
| Dimension | Magpie | Nemotron-4 |
|---|---|---|
| Generator model size | 8B-70B (any instruct model) | 340B (frontier scale) |
| Instruction source | Model's own prior distribution | External prompts or seeds |
| Quality signal | Post-hoc filtering (perplexity + reward) | Online reward model (5 dimensions) |
| Typical output size | 100K-1M filtered examples | 5K-20K filtered examples |
| Best for | Broad instruction tuning | Targeted quality improvement |
| Compute cost | Low-medium | High (340B generator + 340B reward) |
Use Magpie when you need a large, diverse instruction-tuning dataset and have a limited compute budget. Use Nemotron-style reward filtering when you need a small, extremely high-quality dataset for targeted improvement on specific benchmarks.
4. The Quality Filtering Pipeline
Raw synthetic data is noisy. The filtering pipeline is what transforms “output from an LLM API” into “training-ready dataset.” Each stage removes a different class of failure. The order matters — early stages are cheap and remove obvious garbage, later stages are expensive and handle subtle quality issues.
Stage 1: Format Validation
Before any quality assessment, reject examples that fail structural requirements:
import re
from dataclasses import dataclass


@dataclass
class FormatCheckResult:
    passed: bool
    reason: str


def check_format(instruction: str, response: str) -> FormatCheckResult:
    """Reject structurally invalid examples."""
    instruction = instruction.strip()
    response = response.strip()

    # Instruction checks
    if len(instruction) < 10:
        return FormatCheckResult(False, "instruction_too_short")
    if len(instruction) > 2000:
        return FormatCheckResult(False, "instruction_too_long")
    if response.startswith(instruction):
        return FormatCheckResult(False, "response_copies_instruction")

    # Response checks
    if len(response) < 50:
        return FormatCheckResult(False, "response_too_short")
    if len(response) > 16000:
        return FormatCheckResult(False, "response_too_long")

    # Detect repetition: same sentence repeated 3+ times
    sentences = re.split(r'[.!?]+', response)
    sentence_counts: dict[str, int] = {}
    for s in sentences:
        s_clean = s.strip().lower()
        if len(s_clean) > 20:
            sentence_counts[s_clean] = sentence_counts.get(s_clean, 0) + 1
    if any(count >= 3 for count in sentence_counts.values()):
        return FormatCheckResult(False, "excessive_repetition")

    # Detect refusals: a refusal phrase in a short response
    refusal_patterns = [
        r"i cannot",
        r"i can't",
        r"i'm unable to",
        r"as an ai",
        r"i don't have the ability",
    ]
    response_lower = response.lower()
    for pattern in refusal_patterns:
        if re.search(pattern, response_lower) and len(response) < 200:
            return FormatCheckResult(False, "likely_refusal")

    return FormatCheckResult(True, "ok")
This stage is essentially free — string operations on text. It typically removes 5-15% of raw examples.
Stage 2: Deduplication with MinHash LSH
Duplicate and near-duplicate examples waste training compute and skew the data distribution. MinHash Locality-Sensitive Hashing (LSH) detects near-duplicates efficiently at scale.
The algorithm:
- Convert each example to a set of character n-grams (shingles).
- Compute a fixed number of independent hash functions (num_perm, 128 in the code below) over the shingle set to produce a MinHash signature.
- Use LSH banding to identify candidate pairs with high Jaccard similarity.
- For each cluster of near-duplicates, keep one representative.
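The quantity that a MinHash signature estimates is plain Jaccard similarity between shingle sets. A minimal exact version, fine for a handful of strings but quadratic if applied pairwise across a dataset, might look like this (the function names are illustrative):

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """All overlapping character k-grams of text (empty if text is shorter than k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| over character shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two texts with no shingles are treated as identical
    return len(sa & sb) / len(sa | sb)


# "abcdefg" -> {abcde, bcdef, cdefg}; "abcdefh" -> {abcde, bcdef, cdefh}
# 2 shared shingles out of 4 distinct ones
print(jaccard("abcdefg", "abcdefh"))
```

MinHash replaces the explicit set intersection with a short fixed-size signature whose agreement rate approximates this value, and LSH banding avoids comparing every pair.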
from datasketch import MinHash, MinHashLSH


def build_dedup_index(
    examples: list[dict],
    threshold: float = 0.7,
    num_perm: int = 128,
    shingle_size: int = 5,
) -> list[dict]:
    """Remove near-duplicate examples using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    minhashes: dict[int, MinHash] = {}

    # Build index
    for i, ex in enumerate(examples):
        text = ex["instruction"] + " " + ex["response"]
        m = MinHash(num_perm=num_perm)
        # Create character-level shingles
        for j in range(len(text) - shingle_size + 1):
            shingle = text[j:j + shingle_size]
            m.update(shingle.encode("utf-8"))
        minhashes[i] = m
        # Keys are unique list positions, so insert never sees a repeated key
        lsh.insert(str(i), m)

    # Find unique representatives: the first member of each cluster wins
    seen: set[int] = set()
    unique: list[dict] = []
    for i, ex in enumerate(examples):
        if i in seen:
            continue
        # Query for near-duplicates (the result includes i itself)
        neighbors = lsh.query(minhashes[i])
        for n in neighbors:
            seen.add(int(n))
        unique.append(ex)
    return unique
The Jaccard threshold of 0.7 is standard. Examples sharing 70%+ of their character 5-grams are considered duplicates. This catches paraphrases (“Explain X” vs “Can you explain X?”) and template-generated responses that differ only in minor details.
Complexity: index building is roughly linear in the number of examples, and each query takes near-constant expected time. For 3M examples, this runs in under 10 minutes on a single CPU.
Stage 3: Perplexity Filtering
Perplexity measures how “surprised” a reference model is by a given text. It serves as a proxy for two distinct quality signals:
- Very low perplexity (PPL < 5): The text is highly predictable. This indicates repetitive, formulaic, or templated content. “The answer is the answer is the answer” has a perplexity close to 1, the minimum possible value.
- Very high perplexity (PPL > 100): The text is unpredictable. This indicates incoherent, garbled, or code-mixed content that does not form valid natural language.
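For reference, perplexity is the exponential of the mean negative log-likelihood per token. A small sketch that computes it from a list of per-token natural-log probabilities (the helper is illustrative and independent of any model library):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Every token predicted with probability 0.5: as uncertain as a coin flip
coin_flip = perplexity([math.log(0.5)] * 4)

# Every token predicted with probability 1.0: perfectly predictable text
certain = perplexity([0.0] * 3)

print(coin_flip, certain)
```

A model that assigns probability 1 to every token reaches the floor of 1, which is why fully repetitive text sits at the low end of the range rather than at zero.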
The target range depends on your reference model. For a GPT-2-medium reference:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class PerplexityFilter:
    def __init__(
        self,
        model_name: str = "gpt2-medium",
        device: str = "cuda",
        min_ppl: float = 5.0,
        max_ppl: float = 100.0,
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name
        ).to(device).eval()
        self.device = device
        self.min_ppl = min_ppl
        self.max_ppl = max_ppl

    @torch.no_grad()
    def compute_perplexity(self, text: str) -> float:
        # Texts longer than the 1024-token context are truncated
        encodings = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
        ).to(self.device)
        input_ids = encodings.input_ids
        outputs = self.model(**encodings, labels=input_ids)
        # outputs.loss is the mean negative log-likelihood
        return torch.exp(outputs.loss).item()

    def filter(self, text: str) -> bool:
        """Returns True if the text passes the perplexity filter."""
        ppl = self.compute_perplexity(text)
        return self.min_ppl <= ppl <= self.max_ppl
Use a smaller model (GPT-2, Pythia-410M) as the perplexity reference, not the same model that generated the data. If you compute perplexity using the generator model, every generated example will have low perplexity by definition — the model assigns high probability to its own outputs. A smaller, independently-trained model provides an unbiased quality signal.
Stage 4: Reward Model Scoring
After format validation, deduplication, and perplexity filtering have removed the obvious failures, the remaining examples need semantic quality assessment. A reward model scores each example on a continuous scale. This is the most expensive filtering stage but also the most impactful.
If you do not have access to Nemotron-4-340B-Reward, you can use open-source alternatives:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch


class RewardModelScorer:
    def __init__(
        self,
        model_name: str = "OpenAssistant/reward-model-deberta-v3-large-v2",
        device: str = "cuda",
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name
        ).to(device).eval()
        self.device = device

    @torch.no_grad()
    def score(self, instruction: str, response: str) -> float:
        """Score an instruction-response pair. Higher = better."""
        # This model's card encodes question and answer as a text pair, not
        # as a single "Human: ... Assistant: ..." string. Other reward models
        # may expect a different input format.
        inputs = self.tokenizer(
            instruction,
            response,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(self.device)
        outputs = self.model(**inputs)
        return outputs.logits[0, 0].item()

    @torch.no_grad()
    def score_batch(
        self,
        pairs: list[tuple[str, str]],
        batch_size: int = 16,
    ) -> list[float]:
        """Score a batch of instruction-response pairs."""
        scores: list[float] = []
        for i in range(0, len(pairs), batch_size):
            batch = pairs[i:i + batch_size]
            instructions = [inst for inst, _ in batch]
            responses = [resp for _, resp in batch]
            inputs = self.tokenizer(
                instructions,
                responses,
                return_tensors="pt",
                truncation=True,
                max_length=2048,
                padding=True,
            ).to(self.device)
            outputs = self.model(**inputs)
            scores.extend(outputs.logits[:, 0].tolist())
        return scores
With a DeBERTa-v3-large reward model on a single A100, you get approximately 500-1000 scores per second. For the 1.2M examples that survive the perplexity filter, scoring takes about 20-40 minutes.
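Once every example has a score, the "keep the top fraction" step is a sort and a cut. A minimal sketch, assuming scores are comparable across examples and that original order should be preserved among the survivors (`top_fraction` is an illustrative helper):

```python
def top_fraction(examples: list, scores: list[float], fraction: float = 0.25) -> list:
    """Keep the highest-scoring fraction of examples, in their original order."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if not examples:
        return []
    n_keep = max(1, int(len(examples) * fraction))
    # Rank positions by score, highest first, and take the first n_keep
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [examples[i] for i in keep]


examples = list("abcdefgh")
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
print(top_fraction(examples, scores, fraction=0.25))
```

A relative cut like this adapts to whatever scale the reward model uses, whereas an absolute threshold has to be recalibrated for each scorer.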
Stage 5: Difficulty Calibration
A dataset consisting entirely of easy examples teaches the model nothing new. A dataset consisting entirely of hard examples causes training instability. The optimal training distribution includes a mix:
import numpy as np


def calibrate_difficulty(
    examples: list[dict],
    target_distribution: dict[str, float] | None = None,
) -> list[dict]:
    """Select examples to match a target difficulty distribution.

    Difficulty is estimated from instruction length, response length,
    and reward model score (harder = lower reward, longer response).
    """
    if target_distribution is None:
        target_distribution = {
            "easy": 0.20,
            "medium": 0.50,
            "hard": 0.30,
        }

    # Compute difficulty score per example
    for ex in examples:
        # Heuristic: combine instruction complexity and response length
        inst_words = len(ex["instruction"].split())
        resp_words = len(ex["response"].split())
        reward = ex.get("reward_score", 0.0)
        # Normalize to 0-1 scale
        complexity = min(inst_words / 100, 1.0)
        length_factor = min(resp_words / 500, 1.0)
        # Assumes reward_score is already scaled to [0, 1]; values outside
        # that range are clipped. A lower reward counts as harder.
        reward_term = 1 - min(max(reward, 0), 1)
        ex["difficulty_score"] = (
            0.4 * complexity + 0.4 * length_factor + 0.2 * reward_term
        )

    # Bin into difficulty buckets by tercile
    scores = [ex["difficulty_score"] for ex in examples]
    p33, p66 = np.percentile(scores, [33, 66])
    buckets: dict[str, list[dict]] = {"easy": [], "medium": [], "hard": []}
    for ex in examples:
        if ex["difficulty_score"] <= p33:
            buckets["easy"].append(ex)
        elif ex["difficulty_score"] <= p66:
            buckets["medium"].append(ex)
        else:
            buckets["hard"].append(ex)

    # Largest total for which every bucket can still supply its share
    target_total = min(
        int(len(buckets[k]) / target_distribution[k])
        for k in buckets
        if target_distribution[k] > 0
    )

    # Sample from each bucket according to target distribution
    calibrated: list[dict] = []
    for difficulty, fraction in target_distribution.items():
        n_select = int(target_total * fraction)
        bucket = buckets[difficulty]
        if len(bucket) <= n_select:
            calibrated.extend(bucket)
        else:
            indices = np.random.choice(
                len(bucket), size=n_select, replace=False
            )
            calibrated.extend([bucket[i] for i in indices])
    return calibrated
The 20/50/30 easy/medium/hard split is a good starting point. Some practitioners prefer 15/50/35 to push harder on challenging examples, especially for math and coding domains.
Full Pipeline Summary
Filtering Pipeline: Examples Surviving Each Stage (chart; values in thousands of examples)
From 3M raw examples to 250K training-ready examples: a 12:1 compression ratio. Every stage is necessary. Skip deduplication and your model overfits to repeated patterns. Skip perplexity filtering and you train on garbage. Skip reward model scoring and you cannot distinguish mediocre from excellent.
5. Domain-Specific Synthesis
Generic instruction-response synthesis works for broad instruction tuning. But domain-specific applications — math reasoning, code generation, medical QA — require specialized generation and validation strategies.
Math Reasoning Traces
Math training data requires not just the final answer but the complete reasoning chain. Each step must be logically valid, and the final answer must be numerically correct.
MATH_SYSTEM_PROMPT = """You are a math tutor. For each problem:
1. State the problem clearly
2. Show every reasoning step with explicit calculations
3. Box the final numerical answer as \\boxed{answer}
Every step must follow logically from the previous one.
Show intermediate calculations explicitly."""


def generate_math_example(
    client, model: str, difficulty: str = "medium"
) -> dict | None:
    """Generate a math problem with step-by-step solution."""
    difficulty_prompts = {
        "easy": "Generate an algebra problem suitable for a high school student.",
        "medium": "Generate a calculus or probability problem at the undergraduate level.",
        "hard": "Generate a competition-level math problem involving combinatorics, number theory, or analysis.",
    }

    # Step 1: Generate the problem
    problem_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": MATH_SYSTEM_PROMPT},
            {"role": "user", "content": difficulty_prompts[difficulty]},
        ],
        temperature=0.9,
        max_tokens=512,
    )
    problem = problem_resp.choices[0].message.content.strip()

    # Step 2: Generate the solution
    solution_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": MATH_SYSTEM_PROMPT},
            {"role": "user", "content": f"Solve this problem step by step:\n\n{problem}"},
        ],
        temperature=0.3,  # Low temperature for accuracy
        max_tokens=2048,
    )
    solution = solution_resp.choices[0].message.content.strip()

    # Step 3: Verify the answer with a second pass
    verify_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a math verifier. Check the solution for errors. Respond with CORRECT or INCORRECT followed by a brief explanation."},
            {"role": "user", "content": f"Problem:\n{problem}\n\nSolution:\n{solution}"},
        ],
        temperature=0.1,
        max_tokens=256,
    )
    verification = verify_resp.choices[0].message.content.strip()
    # "INCORRECT" contains "CORRECT", so a substring test would accept
    # rejected solutions. Check how the verdict starts instead.
    if not verification.upper().startswith("CORRECT"):
        return None  # Reject unverified solutions

    return {
        "instruction": problem,
        "response": solution,
        "domain": "math",
        "difficulty": difficulty,
        "verified": True,
    }
The verification step is critical. Without it, approximately 15-30% of generated math solutions contain errors — a wrong sign, an arithmetic mistake, a skipped step. The self-verification pass catches about 60-70% of these errors. For production datasets, adding a second independent model as a verifier (cross-verification) catches an additional 10-15%.
Even with self-verification and cross-verification, approximately 5-10% of generated math solutions contain subtle errors that pass all automated checks. For high-stakes math training data, there is no substitute for a final human review pass on a sampled subset. The goal of automated verification is to reduce the human review burden from 100% to 5-10% of examples.
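One practical detail when reading the verifier's output: the string "INCORRECT" contains "CORRECT", so a plain substring check accepts every rejected solution. A small sketch of a stricter parser, assuming the verifier was told to lead with one of the two words (`parse_verdict` is an illustrative helper):

```python
def parse_verdict(verification: str) -> bool:
    """True only if the verifier's first word is CORRECT."""
    words = verification.strip().split()
    if not words:
        return False  # an empty reply is treated as a rejection
    # Drop punctuation or markdown emphasis around the first word
    first_word = words[0].strip(".,:;!*").upper()
    return first_word == "CORRECT"


# The substring check gets the rejection case wrong
naive_accepts_rejection = "CORRECT" in "INCORRECT: sign error in step 2".upper()

print(naive_accepts_rejection)
print(parse_verdict("INCORRECT: sign error in step 2"))
print(parse_verdict("CORRECT. All steps check out."))
```

Anything that does not clearly start with CORRECT is rejected, which errs on the side of discarding an example rather than training on an unverified one.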
Code Generation Exercises
Code synthesis requires a function specification (docstring + type signature) and a correct implementation. The unique challenge: code can be verified by execution.
CODE_SYSTEM_PROMPT = """Generate a Python programming exercise with:
1. A function signature with type hints
2. A detailed docstring explaining the task, inputs, outputs, and edge cases
3. A reference implementation
4. At least 5 test cases using assert statements
The function should be self-contained (no imports beyond stdlib)."""


def validate_code_example(example: dict) -> bool:
    """Execute the generated code and verify tests pass."""
    code = example["response"]
    try:
        # A fresh namespace keeps names from leaking between examples, but
        # exec() is NOT a sandbox: generated code runs with full privileges
        # and no time limit. Use a separate process or container for
        # anything you do not trust.
        namespace: dict = {}
        exec(code, namespace)  # noqa: S102
        return True
    except Exception:
        # Covers SyntaxError, NameError, TypeError, AssertionError, etc.
        return False
Execution-based validation is the gold standard for code data. If the tests pass, the implementation is (at minimum) consistent with the specification. The false-positive rate is low — incorrect code rarely passes well-designed tests by accident.
The execution filter rejects 20-40% of generated code examples, depending on the difficulty level. Hard algorithmic problems (dynamic programming, graph algorithms) have rejection rates above 50%.
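Because generated code is untrusted and may loop forever, a safer variant of the execution filter runs each candidate in a separate interpreter with a timeout. The following is a minimal sketch under that assumption; `run_in_subprocess` and its `timeout_s` parameter are hypothetical names. It provides process isolation only, not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_subprocess(code: str, timeout_s: float = 5.0) -> bool:
    """Return True if `code` (implementation plus asserts) exits cleanly."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                # -I runs Python in isolated mode (ignores env vars and user site-packages)
                [sys.executable, "-I", str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions are rejected
            return False
        # A failed assert or any uncaught exception gives a non-zero exit code
        return result.returncode == 0
```

A crash or hang in the candidate cannot take down the pipeline process, which matters when validating tens of thousands of examples. For stronger guarantees (filesystem and network restrictions), a container or dedicated sandbox would still be needed.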
Medical QA with Grounding
Medical domain synthesis requires grounding in verified sources to prevent hallucination. The approach: retrieve relevant PubMed abstracts, then generate QA pairs conditioned on the retrieved evidence.
def generate_medical_qa(
    client,
    model: str,
    pubmed_abstracts: list[str],
) -> dict | None:
    """Generate a medical QA pair grounded in PubMed evidence."""
    # Use the first three abstracts as grounding context
    sources = pubmed_abstracts[:3]
    context = "\n\n".join(sources)
    # The QUESTION:/ANSWER: labels are an illustrative convention,
    # added so the generated text can be split into its two parts.
    prompt = f"""Based ONLY on the following medical literature, generate:
1. A clinical question that a medical professional might ask
2. A detailed answer citing specific findings from the provided abstracts
3. Mark any claim not directly supported by the abstracts as [UNVERIFIED]
Format the output as "QUESTION: ..." followed by "ANSWER: ...".
Literature:
{context}"""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a medical education assistant. Only make claims supported by the provided literature."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.5,
        max_tokens=1024,
    )
    text = response.choices[0].message.content.strip()
    # Split the generation into question and answer
    question, sep, answer = text.partition("ANSWER:")
    question = question.replace("QUESTION:", "", 1).strip()
    answer = answer.strip()
    if not sep or not question or not answer:
        return None  # Reject output that ignores the requested format
    # Reject if too many unverified claims
    unverified_count = answer.lower().count("[unverified]")
    if unverified_count > 2:
        return None
    return {
        "instruction": question,
        "response": answer,
        "domain": "medical",
        "grounding_sources": sources,
        "unverified_claims": unverified_count,
    }
The grounding approach reduces hallucination rates from 30-40% (ungrounded generation) to 5-10% (grounded generation). The [UNVERIFIED] tagging provides an additional self-monitoring signal — the model is more likely to flag uncertain claims when explicitly prompted to do so.
Domain Quality Comparison
Synthetic Data Quality by Domain
| Domain | Raw Accuracy | Post-Filter Accuracy | Verification Method | Typical Filter Rate |
|---|---|---|---|---|
| General chat | 70-80% | 90-95% | Reward model | 60-75% rejected |
| Math reasoning | 55-70% | 85-90% | Self-verify + cross-verify | 30-45% rejected |
| Code generation | 50-65% | 95-99% | Execution-based testing | 35-50% rejected |
| Medical QA | 60-75% | 88-93% | Grounding + expert review | 25-40% rejected |
Code generation achieves the highest post-filter accuracy because it has the strongest verification signal: either the code runs and passes tests, or it does not. Math reasoning shows the next-largest gain from filtering (roughly 25 points at the midpoints of the ranges above) because numerical answers can be checked. General chat and medical QA rely on softer quality signals (reward models, grounding consistency), so filtering improves them less even though their raw accuracy starts higher.
6. Implementation Exercise: The Reviewer Agent Pipeline
Let us build a complete end-to-end synthetic data pipeline. This implementation generates instruction-response pairs, scores them with a reward function, filters to the top-10%, and outputs a training-ready JSONL file.
Design
The pipeline has four components:
- Generator: Produces instruction-response pairs via an LLM API.
- Scorer: Assigns a quality score to each pair using a lightweight reward heuristic plus an optional reward model.
- Filter: Selects the top-k% by score.
- Formatter: Writes the filtered examples as JSONL in the ChatML format expected by training frameworks.
Complete Implementation
"""
Synthetic Data Pipeline: Generate, Score, Filter, Format.
Usage:
python synth_pipeline.py \
--model gpt-4o-mini \
--num-examples 1000 \
--top-k-pct 10 \
--output training_data.jsonl
Requirements:
pip install openai tiktoken
"""
import argparse
import asyncio
import json
import math
import sys
from dataclasses import dataclass, asdict, field
from pathlib import Path
import tiktoken
from openai import AsyncOpenAI
# ── Data structures ──────────────────────────────────────────────
@dataclass
class Example:
    instruction: str
    response: str
    scores: dict = field(default_factory=dict)
    total_score: float = 0.0


@dataclass
class PipelineStats:
    generated: int = 0
    format_passed: int = 0
    scored: int = 0
    selected: int = 0
# ── Generation ───────────────────────────────────────────────────
SYSTEM_PROMPT = "You are a helpful, knowledgeable AI assistant."
INSTRUCTION_PROMPT = (
    "Generate one specific, self-contained user question or instruction "
    "that requires a detailed, substantive response. Cover any topic: "
    "science, coding, math, writing, analysis, planning, etc. "
    "Output ONLY the instruction text."
)


async def generate_pair(
    client: AsyncOpenAI,
    model: str,
    semaphore: asyncio.Semaphore,
) -> Example | None:
    async with semaphore:
        try:
            # Generate instruction
            instr_resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": INSTRUCTION_PROMPT},
                ],
                temperature=1.0,
                max_tokens=256,
            )
            instruction = instr_resp.choices[0].message.content.strip()
            # Generate response
            resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": instruction},
                ],
                temperature=0.7,
                max_tokens=2048,
            )
            response = resp.choices[0].message.content.strip()
            return Example(instruction=instruction, response=response)
        except Exception as e:
            print(f"Generation error: {e}", file=sys.stderr)
            return None
# ── Format validation ────────────────────────────────────────────
def passes_format_check(ex: Example) -> bool:
    if len(ex.instruction) < 10 or len(ex.instruction) > 2000:
        return False
    if len(ex.response) < 50 or len(ex.response) > 16000:
        return False
    # Reject if response starts by copying the instruction
    if ex.response.lower().startswith(ex.instruction.lower()[:40]):
        return False
    return True
# ── Scoring ──────────────────────────────────────────────────────
def score_example(ex: Example, enc: tiktoken.Encoding) -> Example:
    """Score an example using heuristic quality signals.

    Scores (each 0-1):
    - length_score: longer, substantive responses score higher
    - structure_score: presence of paragraphs, lists, or code
    - specificity_score: ratio of unique tokens to total tokens
    """
    resp = ex.response
    resp_tokens = enc.encode(resp)
    n_tokens = len(resp_tokens)

    # Length: peak reward around 200-600 tokens
    if n_tokens < 30:
        length_score = 0.0
    elif n_tokens < 200:
        length_score = n_tokens / 200
    elif n_tokens <= 600:
        length_score = 1.0
    else:
        length_score = max(0.5, 1.0 - (n_tokens - 600) / 2000)

    # Structure: reward paragraphs, lists, code blocks
    has_paragraphs = resp.count("\n\n") >= 2
    has_list = any(
        resp.count(marker) >= 2
        for marker in ["- ", "* ", "1.", "2.", "3."]
    )
    has_code = "```" in resp
    structure_score = (
        0.4 * has_paragraphs + 0.3 * has_list + 0.3 * has_code
    )

    # Specificity: unique token ratio (penalizes repetition)
    unique_tokens = len(set(resp_tokens))
    specificity_score = min(unique_tokens / max(n_tokens, 1), 1.0)

    # Weighted total
    total = (
        0.35 * length_score
        + 0.30 * structure_score
        + 0.35 * specificity_score
    )
    ex.scores = {
        "length": round(length_score, 3),
        "structure": round(structure_score, 3),
        "specificity": round(specificity_score, 3),
    }
    ex.total_score = round(total, 3)
    return ex
# ── Filtering ────────────────────────────────────────────────────
def select_top_k(
    examples: list[Example], top_k_pct: float
) -> list[Example]:
    """Select the top-k% of examples by total score."""
    examples.sort(key=lambda e: e.total_score, reverse=True)
    n_select = max(1, math.ceil(len(examples) * top_k_pct / 100))
    return examples[:n_select]
# ── Formatting ───────────────────────────────────────────────────
def to_chatml_jsonl(examples: list[Example]) -> list[str]:
    """Format examples as ChatML JSONL for training."""
    lines = []
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex.instruction},
                {"role": "assistant", "content": ex.response},
            ],
            "metadata": {
                "scores": ex.scores,
                "total_score": ex.total_score,
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
# ── Main pipeline ────────────────────────────────────────────────
async def run_pipeline(
    model: str,
    num_examples: int,
    top_k_pct: float,
    output_path: Path,
    concurrency: int = 30,
):
    client = AsyncOpenAI()
    semaphore = asyncio.Semaphore(concurrency)
    enc = tiktoken.encoding_for_model("gpt-4o")
    stats = PipelineStats()

    print(f"Generating {num_examples} examples with {model}...")
    tasks = [
        generate_pair(client, model, semaphore)
        for _ in range(num_examples)
    ]
    raw = await asyncio.gather(*tasks)
    examples = [e for e in raw if e is not None]
    stats.generated = len(examples)
    print(f" Generated: {stats.generated}")

    # Format filter
    examples = [e for e in examples if passes_format_check(e)]
    stats.format_passed = len(examples)
    print(f" After format check: {stats.format_passed}")

    # Score
    examples = [score_example(e, enc) for e in examples]
    stats.scored = len(examples)

    # Filter top-k%
    selected = select_top_k(examples, top_k_pct)
    stats.selected = len(selected)
    print(f" After top-{top_k_pct}% filter: {stats.selected}")

    # Write JSONL
    lines = to_chatml_jsonl(selected)
    output_path.write_text("\n".join(lines), encoding="utf-8")
    print(f" Written to: {output_path}")
    print(f"\nPipeline stats: {asdict(stats)}")


def main():
    parser = argparse.ArgumentParser(
        description="Synthetic data generation pipeline"
    )
    parser.add_argument("--model", default="gpt-4o-mini")
    parser.add_argument("--num-examples", type=int, default=1000)
    parser.add_argument("--top-k-pct", type=float, default=10.0)
    parser.add_argument("--output", default="training_data.jsonl")
    parser.add_argument("--concurrency", type=int, default=30)
    args = parser.parse_args()
    asyncio.run(
        run_pipeline(
            model=args.model,
            num_examples=args.num_examples,
            top_k_pct=args.top_k_pct,
            output_path=Path(args.output),
            concurrency=args.concurrency,
        )
    )


if __name__ == "__main__":
    main()
Walking Through the Pipeline
Let us trace the data flow for a concrete run with --num-examples 1000 --top-k-pct 10:
Step 1: Generation. The pipeline launches 1,000 generation tasks, with the semaphore capping concurrency at 30. Each task makes two API calls: one to generate an instruction, then one to generate the response. With gpt-4o-mini at $0.15/1M input tokens and $0.60/1M output tokens, and an average of ~100 input tokens and ~400 output tokens per call, the total cost is approximately:
$$1000 \times 2 \times \frac{100 \times 0.15 + 400 \times 0.60}{10^6} \approx \$0.51$$
That is $0.51 for 1,000 raw examples. Some will fail (API errors, timeouts), leaving approximately 950 successful pairs.
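The same estimate can be reproduced in code to try other volumes or prices. This sketch assumes the token averages apply to each API call and that every pair takes two calls (instruction, then response), which is what makes the $0.51 figure work out. The function name and its parameters are hypothetical.

```python
def estimate_generation_cost(
    num_pairs: int,
    calls_per_pair: int = 2,
    input_tokens_per_call: int = 100,
    output_tokens_per_call: int = 400,
    input_price_per_m: float = 0.15,
    output_price_per_m: float = 0.60,
) -> float:
    """Estimated API spend in USD for `num_pairs` instruction-response pairs."""
    cost_per_call = (
        input_tokens_per_call * input_price_per_m
        + output_tokens_per_call * output_price_per_m
    ) / 1_000_000
    return num_pairs * calls_per_pair * cost_per_call


print(f"${estimate_generation_cost(1_000):.2f}")    # prints $0.51
print(f"${estimate_generation_cost(100_000):.2f}")  # prints $51.00
```

Output tokens dominate the bill: at these prices the 400 output tokens account for 240 of every 255 cost units per call, so a tighter `max_tokens` on the response is the main lever.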
Step 2: Format check. Rejects examples with instructions shorter than 10 characters, responses shorter than 50 characters, or responses that copy the instruction. Typically removes 5-10%, leaving ~870 examples.
Step 3: Scoring. Each example receives three sub-scores:
- Length score: A response with 350 tokens scores 1.0 (in the sweet spot). A response with 80 tokens scores 0.4. A response with 1,200 tokens scores 0.7 (slight penalty for excessive length).
- Structure score: A response with multiple paragraphs, a bulleted list, and a code block scores 1.0. A single-paragraph, unstructured response scores 0.0.
- Specificity score: A response where 85% of tokens are unique scores 0.85. A repetitive response where only 40% of tokens are unique scores 0.40.
The weighted total ($0.35 \cdot \text{length} + 0.30 \cdot \text{structure} + 0.35 \cdot \text{specificity}$) produces a score between 0 and 1.
Step 4: Top-k selection. Sort by total score, take the top 10%. From 870 scored examples, select 87.
Step 5: JSONL formatting. Each selected example is written as a single JSON line in ChatML format:
{
  "messages": [
    {"role": "system", "content": "You are a helpful, knowledgeable AI assistant."},
    {"role": "user", "content": "Explain how B-trees maintain balance during insertion..."},
    {"role": "assistant", "content": "B-trees maintain balance through a split-and-promote mechanism..."}
  ],
  "metadata": {
    "scores": {"length": 0.95, "structure": 0.7, "specificity": 0.88},
    "total_score": 0.851
  }
}
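This format follows the messages layout accepted by OpenAI fine-tuning, Axolotl, and most training frameworks that read chat-style JSONL. On disk each record occupies a single line; it is pretty-printed here for readability. Some trainers may expect only the messages key, in which case drop the metadata field before training.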
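Before handing the file to a trainer, it can help to check every line against the record layout shown above. The helpers below are a minimal sketch with hypothetical names (`validate_record`, `strip_metadata`). They check only the three-message layout this pipeline writes, not any particular framework's schema.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_record(line: str) -> bool:
    """Check that one JSONL line has the system/user/assistant layout."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


def strip_metadata(line: str) -> str:
    """Return the same record without the extra `metadata` field."""
    record = json.loads(line)
    record.pop("metadata", None)
    return json.dumps(record, ensure_ascii=False)
```

Running `validate_record` over the output file catches truncated writes and empty responses early, and `strip_metadata` produces a messages-only record for trainers that reject extra keys.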
Extending the Pipeline
The heuristic scorer is a starting point. For production use, replace it with or augment it by:
- A reward model: Replace score_example with a call to a DeBERTa-based reward model (see Section 4). This changes the scoring from heuristic to learned.
- MinHash deduplication: Add the dedup stage between format checking and scoring. This removes near-duplicates that would otherwise inflate the top-k% with repetitive content.
- Domain tagging: Classify each example by domain (math, code, general, creative) and ensure the final selection includes a balanced mix.
- Multi-turn expansion: After selecting high-quality single-turn examples, use the same model to extend them into multi-turn conversations, creating a richer training signal.
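As one illustration of these extensions, the deduplication stage can be sketched with the standard library alone. This is a simplified stand-in, not a production implementation: it compares every signature against every kept signature (quadratic cost) instead of using an LSH index. The shingle size, the 64 hash functions, and all function names are assumptions. The 0.7 threshold matches the one used elsewhere in this post.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """One minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of matching positions approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(texts: list[str], threshold: float = 0.7) -> list[int]:
    """Indices of texts kept after greedy near-duplicate removal."""
    kept: list[int] = []
    kept_sigs: list[list[int]] = []
    for idx, text in enumerate(texts):
        sig = minhash_signature(text)
        # Keep the text only if it is not too similar to anything already kept
        if all(estimated_jaccard(sig, other) <= threshold for other in kept_sigs):
            kept.append(idx)
            kept_sigs.append(sig)
    return kept
```

In the pipeline above, this would run on the instruction text after `passes_format_check` and before `score_example`. At the million-example scale, the pairwise loop would need to be replaced by an LSH index.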
Running this pipeline with 1,000 examples using gpt-4o-mini costs under $1.00 in API fees. Scaling to 100,000 raw examples costs roughly $50. The top-10% filter then yields up to 10,000 training-ready examples for approximately $50, or about $0.005 per example: a roughly 5,000x cost reduction compared to expert annotation at $25 per example.
Putting It All Together: The Production Pipeline
A production synthetic data pipeline combines everything from this post into a staged system. The data structures and the filtering logic above are not academic exercises — they are the core components of how datasets like OpenHermes, Magpie-Air, and HelpSteer2 are actually built.
The complete flow:
- Instruction generation (Magpie or prompt-based) produces millions of raw instructions.
- Response generation (strong teacher model) generates responses to each instruction.
- Format validation (string checks, length filters) removes structural failures.
- Deduplication (MinHash LSH at Jaccard > 0.7) removes near-duplicates.
- Perplexity filtering (GPT-2 reference, reject outside 5-100 range) removes incoherent and repetitive content.
- Reward model scoring (Nemotron 5-dimension or single-score) ranks by quality.
- Difficulty calibration (20/50/30 easy/medium/hard) ensures training distribution balance.
- Domain-specific validation (execution for code, verification for math, grounding for medical) applies the strongest available correctness signal.
- JSONL export in ChatML format, ready for training.
[Chart: Compute Cost per Pipeline Stage (1M Raw Examples), in USD]
The total cost for a 1M-raw-to-100K-filtered pipeline using API-based generation: approximately $2,500-$3,000. Using local open-source models (Llama-3-70B on rented A100s at $2/GPU-hour), the generation cost drops to approximately $400-$600 for the same volume, bringing the total pipeline under $700.
For $700 you get 100K high-quality, deduplicated, reward-filtered, difficulty-calibrated instruction-response pairs. The equivalent human annotation cost at $10 per example: $1,000,000.
That is the synthetic data value proposition. Not that the data is free — it is not. But the cost curve has shifted by three orders of magnitude, and the quality, when the pipeline is done right, is competitive with human annotation for most instruction-tuning tasks.