Training data is the single largest determinant of LLM quality. The model architecture matters. The optimizer matters. But the data matters more. And the data problem has two dimensions: quantity (you need billions of tokens for pretraining, millions of examples for instruction tuning) and quality (one bad example can teach the model a bad habit that persists across thousands of good examples). Human-generated data scores high on quality but fails catastrophically on quantity and cost. Synthetic data inverts this tradeoff: a strong model generates the training examples, a filtering pipeline enforces quality, and the result is a dataset that a weaker model can learn from at a fraction of the cost.

This post covers the full pipeline: why synthetic data works from a distillation perspective, the Magpie technique for generating instruction-response pairs from a model’s own latent distribution, NVIDIA’s Nemotron-4 reward-model-guided synthesis, the multi-stage quality filtering pipeline that separates signal from noise, domain-specific synthesis strategies, and a complete implementation exercise that you can run today.


1. Why Synthetic Data: The Economics and the Theory

The Cost Problem

Instruction-tuning data requires paired examples: an instruction (what the user asks) and a response (what the model should produce). Generating these by hand means hiring domain experts, writing prompts, writing gold-standard answers, and reviewing everything. The numbers are stark:

Cost Per Training Example by Source

| Source | Cost/Example | Throughput | Quality Control | Diversity |
|---|---|---|---|---|
| Expert annotation (PhD-level) | $25-50 | 5-20/day/person | High (human review) | Low (annotator bias) |
| Crowd annotation (MTurk) | $1-5 | 50-200/day/person | Medium (inter-annotator agreement) | Medium |
| Synthetic (GPT-4 class) | $0.005-0.02 | 10K-100K/day/API | Requires filtering pipeline | High (controllable) |
| Synthetic (open model, local) | $0.001-0.005 | 1K-50K/day/GPU | Requires filtering pipeline | High (controllable) |

Note: Expert annotation costs assume specialized domains (medicine, law, math). Synthetic costs include API fees or GPU amortization.

At $30 per expert-annotated example, building a 100K-example instruction-tuning dataset costs $3M and takes months. At $0.01 per synthetic example, the same dataset costs $1K and takes days. Even if you filter out 90% of synthetic examples as low-quality, the economics still win by two orders of magnitude.
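The arithmetic behind this comparison can be checked in a few lines. This is a minimal sketch using the figures quoted above; the function name `dataset_cost` and its `keep_rate` parameter are illustrative, not part of any library.

```python
def dataset_cost(final_examples: int, cost_per_example: float, keep_rate: float = 1.0) -> float:
    """Cost of ending up with `final_examples` when only `keep_rate` of raw examples survive filtering."""
    raw_examples = final_examples / keep_rate
    return raw_examples * cost_per_example

expert = dataset_cost(100_000, 30.0)                          # human data, no filtering
synthetic = dataset_cost(100_000, 0.01)                       # synthetic, no filtering
synthetic_filtered = dataset_cost(100_000, 0.01, keep_rate=0.10)  # 90% filtered out

print(f"expert:                      ${expert:,.0f}")
print(f"synthetic:                   ${synthetic:,.0f}")
print(f"synthetic, 90% filtered out: ${synthetic_filtered:,.0f}")
print(f"expert / filtered synthetic: {expert / synthetic_filtered:,.0f}x")
```

Filtering out 90% of the raw examples multiplies the synthetic cost by ten, which still leaves a gap of more than two orders of magnitude.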

The Distillation Argument

Synthetic data generation is a form of knowledge distillation. A strong “teacher” model $M_T$ with parameters $\theta_T$ generates outputs that capture its learned distribution. A weaker “student” model $M_S$ with parameters $\theta_S$ trains on these outputs. The training objective for the student on synthetic data is:

$$\mathcal{L}_{\text{synth}} = -\mathbb{E}_{(x, y) \sim M_T} \left[ \sum_{t=1}^{|y|} \log P_{M_S}(y_t \mid y_{<t}, x) \right]$$

where $x$ is the instruction and $y$ is the teacher-generated response. The student learns to mimic the teacher’s output distribution, not just memorize individual examples. This is strictly more informative than training on the raw text alone because the teacher’s responses embed reasoning patterns, formatting conventions, and factual knowledge that the student would otherwise need to learn from scratch.

The empirical evidence is strong. Alpaca (Stanford, 2023) fine-tuned LLaMA-7B on 52K instruction-response pairs generated by text-davinci-003 for under $500 in API costs. In the authors' blind pairwise evaluation, the resulting 7B model behaved comparably to text-davinci-003, a much larger model. WizardLM (2023) used a synthetic “Evol-Instruct” pipeline to generate increasingly complex instructions and reported strong instruction-following results for a model of its size.

When Synthetic Data Fails

Synthetic data is not free from failure modes. The critical ones:

  1. Model collapse: If the student trains on data from a teacher that was itself trained on synthetic data from a previous generation, quality degrades. Each generation amplifies distributional artifacts. After 5-10 generations of recursive synthesis, outputs become repetitive and lose diversity. The fix: always use a teacher model trained on real data, never on its own synthetic outputs.

  2. Systematic blind spots: The teacher model has its own failure modes. If GPT-4 consistently gets a certain class of math problems wrong, every synthetic example for that class will contain the same error. The student learns the error as if it were ground truth. The fix: reward model scoring and domain-specific validation.

  3. Distribution mismatch: Synthetic instructions may not match the distribution of real user queries. A model prompted to generate “diverse instructions” tends to produce academically-flavored questions, not the messy, ambiguous, multi-turn queries that real users submit. The fix: seed the generation with real user query distributions (anonymized logs, public datasets).

⚠️ The Quality Floor

Synthetic data has no inherent quality guarantee. Without filtering, a synthetic dataset is worse than a curated human dataset of the same size. The entire value proposition depends on generating 10-100x more examples than you need and filtering aggressively. The pipeline IS the product.


2. The Magpie Technique: Mining Instructions from Model Internals

The Core Insight

Magpie (Xu et al., 2024) starts from a simple observation: instruction-tuned LLMs have already internalized a distribution over instructions. When you send a chat model a prompt that stops right where the user message should begin (just the system prompt and the opening header of the user turn), the model will generate text that looks like a plausible user instruction. It does this because the chat template conditions the model to expect a user message, and without one, the model fills in the gap from its training distribution.

This is not a prompt hack. It is sampling from the model’s learned prior over the instruction space. The instructions are diverse, natural, and span the full range of tasks the model was trained on — because they literally come from the same distribution.

The Method

The Magpie pipeline has three stages:

Stage 1: Instruction Generation. Feed the model a chat template with an empty user turn and let it generate the instruction.

For Llama-3-Instruct, the template looks like this:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

The model then generates text as if completing a user message. It might produce: “Can you explain the difference between TCP and UDP, focusing on when you’d choose one over the other for a real-time gaming application?” This is not a canned response — it is a sample from the model’s instruction prior.
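To make Stage 1 concrete, the sketch below builds the pre-query prompt for the Llama-3-Instruct template shown above and sends it to a raw text-completions endpoint. It assumes an OpenAI-compatible server (for example vLLM) whose client exposes `completions.create`; the function names, the stop token and the sampling defaults are illustrative assumptions, not the paper's exact settings.

```python
def build_prequery_prompt(system_prompt: str = "You are a helpful AI assistant.") -> str:
    """Llama-3-Instruct chat template, cut off right after the user-turn header."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
    )

def sample_instruction(client, model: str, temperature: float = 1.0) -> str:
    """Let the model write the user turn itself.

    `client` is assumed to be an OpenAI-style client pointed at a server
    that serves raw (non-chat) completions for `model`.
    """
    completion = client.completions.create(
        model=model,
        prompt=build_prequery_prompt(),
        temperature=temperature,
        max_tokens=256,
        stop=["<|eot_id|>"],  # the model closes the user turn with this token
    )
    return completion.choices[0].text.strip()
```

Some servers prepend the beginning-of-text token automatically; in that case the literal `<|begin_of_text|>` should be dropped from the prompt to avoid duplicating it.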

Stage 2: Response Generation. Take the generated instruction, format it as a proper chat message, and let the same model (or a stronger one) generate the response.

Stage 3: Quality Filtering. Score each instruction-response pair and keep only the top fraction.

Implementation

Here is the complete pipeline for generating Magpie-style instruction-response pairs:

import json
import asyncio
from openai import AsyncOpenAI
from dataclasses import dataclass, asdict

@dataclass
class SyntheticExample:
    instruction: str
    response: str
    instruction_length: int
    response_length: int
    generation_model: str

SYSTEM_PROMPT = "You are a helpful AI assistant."

# NOTE: true Magpie sampling feeds the raw chat template, cut off right
# after the user-turn header, to a raw completions endpoint. The Chat
# Completions API always closes the user turn and opens the assistant turn,
# so it cannot express that prompt. This version approximates the idea with
# an explicit instruction-generation prompt instead.
INSTRUCTION_GEN_PROMPT = (
    "Generate a single, specific user instruction or question. "
    "The instruction should be self-contained, require a substantive "
    "response, and cover any topic. Output ONLY the instruction text, "
    "nothing else."
)

async def generate_instruction(
    client: AsyncOpenAI,
    model: str,
    temperature: float = 1.0,
) -> str:
    """Generate a single instruction via an explicit prompt (Magpie approximation)."""
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": INSTRUCTION_GEN_PROMPT},
        ],
        temperature=temperature,
        max_tokens=256,
    )
    return response.choices[0].message.content.strip()

async def generate_response(
    client: AsyncOpenAI,
    model: str,
    instruction: str,
    temperature: float = 0.7,
) -> str:
    """Generate a response to the given instruction."""
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
        temperature=temperature,
        max_tokens=2048,
    )
    return response.choices[0].message.content.strip()

async def generate_batch(
    client: AsyncOpenAI,
    model: str,
    n: int,
    concurrency: int = 50,
) -> list[SyntheticExample]:
    """Generate n instruction-response pairs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)
    results: list[SyntheticExample] = []

    async def generate_one() -> SyntheticExample | None:
        async with semaphore:
            try:
                instruction = await generate_instruction(client, model)
                if len(instruction) < 10 or len(instruction) > 1000:
                    return None  # Length filter on instructions
                response = await generate_response(
                    client, model, instruction
                )
                if len(response) < 50:
                    return None  # Reject trivially short responses
                return SyntheticExample(
                    instruction=instruction,
                    response=response,
                    instruction_length=len(instruction),
                    response_length=len(response),
                    generation_model=model,
                )
            except Exception:
                return None

    tasks = [generate_one() for _ in range(n)]
    completed = await asyncio.gather(*tasks)
    results = [r for r in completed if r is not None]
    return results

The key design decisions:

  • Temperature 1.0 for instructions: Maximizes diversity. Lower temperatures produce repetitive instructions that cluster around common patterns.
  • Temperature 0.7 for responses: Balances quality and diversity. Too high and responses become incoherent. Too low and they become formulaic.
  • Concurrency control: API rate limits and model serving throughput cap the effective parallelism. The semaphore prevents overloading.
  • Inline length filters: Reject obviously bad examples immediately rather than wasting downstream filtering compute.

The Magpie Numbers

The original Magpie paper reports the following pipeline statistics for Llama-3-8B-Instruct:

Magpie Pipeline Statistics (Llama-3-8B-Instruct)

| Stage | Examples | Retention Rate | Cumulative |
|---|---|---|---|
| Raw generation | 3,000,000 | 100% | 3,000,000 |
| Length filter (instruction) | 2,700,000 | 90% | 2,700,000 |
| Length filter (response) | 2,400,000 | 89% | 2,400,000 |
| Deduplication | 1,800,000 | 75% | 1,800,000 |
| Perplexity filter | 1,200,000 | 67% | 1,200,000 |
| Reward model filter (top quartile) | 300,000 | 25% | 300,000 |

Note: From Xu et al. 2024. The final 300K Magpie-Air dataset achieves performance competitive with 10x larger unfiltered datasets.

The 10:1 raw-to-final ratio is typical. You generate far more than you need and filter aggressively. The 300K surviving examples from Magpie-Air matched or exceeded datasets that were 10x larger but unfiltered on the AlpacaEval 2.0 and Arena-Hard benchmarks.

Throughput Estimate

With a local Llama-3-70B-Instruct on 8x A100 GPUs using vLLM, you can generate approximately 5,000 instruction-response pairs per hour. At that rate the full 3M raw pipeline takes about 25 days on a single 8-GPU node, or about a week on 4 nodes. API-based generation with GPT-4 at 100 RPM takes roughly 500 hours for 3M pairs if each pair is a single request, and about 1,000 hours if the instruction and the response are separate calls as in the code above. API cost grows linearly with the number of examples.
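These estimates follow from simple rate arithmetic. A minimal sketch, with hypothetical helper names, that reproduces them and shows how the API figure depends on the number of requests per pair:

```python
def days_for_local(pairs: int, pairs_per_hour: float, nodes: int = 1) -> float:
    """Wall-clock days to generate `pairs` at a fixed per-node throughput."""
    return pairs / (pairs_per_hour * nodes) / 24

def hours_for_api(pairs: int, requests_per_minute: float, requests_per_pair: int) -> float:
    """Wall-clock hours when the rate limit, not the model, is the bottleneck."""
    return pairs * requests_per_pair / requests_per_minute / 60

one_node = days_for_local(3_000_000, 5_000)
four_nodes = days_for_local(3_000_000, 5_000, nodes=4)
api_single_call = hours_for_api(3_000_000, 100, requests_per_pair=1)
api_two_calls = hours_for_api(3_000_000, 100, requests_per_pair=2)

print(one_node, four_nodes, api_single_call, api_two_calls)
```

The two-call case corresponds to a pipeline that issues one request for the instruction and a second one for the response.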

Why Magpie Works Better Than Prompt-Based Generation

Previous approaches like Self-Instruct (Wang et al., 2023) used explicit prompts: “Generate a creative writing task” or “Generate a coding question about recursion.” These approaches suffer from two problems:

  1. Prompt bias: The generated instructions are constrained by the prompt categories you enumerate. If you forget to include “data visualization” as a category, you get zero data visualization instructions.
  2. Distribution mismatch: Hand-crafted category lists do not match the actual distribution of user queries. You end up over-representing some topics and under-representing others.

Magpie avoids both problems because the instructions come from the model’s own learned distribution over user turns, which was shaped by the instruction data used in its post-training. The instruction diversity is not designed; it is emergent.


3. Nemotron-4: Reward-Model-Guided Synthesis at NVIDIA Scale

The Architecture

NVIDIA’s approach to synthetic data generation, published with the Nemotron-4 340B family (Adler et al., 2024), differs from Magpie in a fundamental way: instead of relying on post-hoc filtering, it uses a dedicated reward model as an online quality signal during generation. The reward model scores every generated response across multiple dimensions, and only responses that pass all thresholds become training data.

The system has three components:

  1. Generator: Nemotron-4-340B-Instruct generates candidate responses.
  2. Reward Model: Nemotron-4-340B-Reward scores each response on 5 dimensions.
  3. Filter: A multi-dimensional threshold gate that only passes examples scoring above the minimum on every dimension simultaneously.

The Five Reward Dimensions

The Nemotron-4 reward model outputs five scalar scores per response, each on a 0-4 scale (the HelpSteer attribute scale):

Nemotron-4 Reward Model Scoring Dimensions

| Dimension | Measures | Typical Threshold | Reject If |
|---|---|---|---|
| Helpfulness | Does the response address the instruction? | 3.5 | Off-topic, refusals, partial answers |
| Correctness | Are the facts and reasoning accurate? | 3.5 | Factual errors, logical fallacies |
| Coherence | Is the response well-structured and readable? | 3.0 | Rambling, contradictions, repetition |
| Complexity | Does the response handle nuance appropriately? | 2.5 | Oversimplified answers to complex questions |
| Verbosity | Is the length appropriate? | 2.0 | Excessively long or short for the task |

Note: Thresholds from the HelpSteer2 paper. Helpfulness and correctness are the hardest gates, and most rejections happen here.

The multi-dimensional scoring is critical. A response can be helpful but incorrect (confidently wrong answers), correct but incoherent (technically right but impossible to follow), or coherent but unhelpful (well-written refusal). Single-score reward models conflate these failure modes. The 5-dimension approach lets you set independent thresholds per dimension and diagnose exactly why examples fail.
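The diagnostic value of per-dimension thresholds can be illustrated with a small sketch. The thresholds are the ones listed in the table above; the function name and the example scores are hypothetical.

```python
THRESHOLDS = {
    "helpfulness": 3.5,
    "correctness": 3.5,
    "coherence": 3.0,
    "complexity": 2.5,
    "verbosity": 2.0,
}

def failed_dimensions(scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the dimensions on which a response falls below its minimum."""
    return [dim for dim, minimum in thresholds.items() if scores[dim] < minimum]

# Hypothetical scores for a fluent, on-topic, but factually wrong answer.
confident_but_wrong = {
    "helpfulness": 3.8,
    "correctness": 2.1,
    "coherence": 3.6,
    "complexity": 2.9,
    "verbosity": 2.4,
}
print(failed_dimensions(confident_but_wrong))
```

A single averaged score would hide the reason for rejection; the per-dimension list points directly at correctness.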

The Reward Model API Call

Here is how to score a single example using the Nemotron reward model:

import requests
from dataclasses import dataclass

@dataclass
class RewardScores:
    helpfulness: float
    correctness: float
    coherence: float
    complexity: float
    verbosity: float

    def passes_threshold(
        self,
        min_helpfulness: float = 3.5,
        min_correctness: float = 3.5,
        min_coherence: float = 3.0,
        min_complexity: float = 2.5,
        min_verbosity: float = 2.0,
    ) -> bool:
        return (
            self.helpfulness >= min_helpfulness
            and self.correctness >= min_correctness
            and self.coherence >= min_coherence
            and self.complexity >= min_complexity
            and self.verbosity >= min_verbosity
        )

def score_with_nemotron(
    instruction: str,
    response: str,
    api_url: str = "http://localhost:8000/v1/chat/completions",
    model: str = "nemotron-4-340b-reward",
) -> RewardScores:
    """Score an instruction-response pair using Nemotron-4 reward model."""
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ],
    }
    resp = requests.post(api_url, json=payload, timeout=60)
    resp.raise_for_status()
    data = resp.json()

    # Assumption: as on NVIDIA's hosted endpoint, each logprobs entry carries
    # the attribute name in "token" and its score in "logprob". Verify this
    # against the response format of your own deployment.
    entries = data["choices"][0]["logprobs"]["content"]
    scores = {entry["token"]: float(entry["logprob"]) for entry in entries}
    return RewardScores(
        helpfulness=scores["helpfulness"],
        correctness=scores["correctness"],
        coherence=scores["coherence"],
        complexity=scores["complexity"],
        verbosity=scores["verbosity"],
    )

For local deployment, the reward model runs on 4-8 A100 GPUs using a standard vLLM or TensorRT-LLM serving setup. Throughput is approximately 200-500 scoring calls per second on 8x A100, making it feasible to score millions of examples in hours.

The Filtering Economics

The Nemotron approach inverts the economics of quality filtering. Instead of generating cheap examples and filtering them at great expense, it generates expensive examples (using a 340B model) but filters precisely:

Pass Rate by Reward Dimension (Nemotron-4 Pipeline)

| Gate | Pass rate (% of generated examples) |
|---|---|
| Coherence (>3.0) | 92% |
| Verbosity (>2.0) | 88% |
| Complexity (>2.5) | 78% |
| Correctness (>3.5) | 64% |
| Helpfulness (>3.5) | 61% |
| ALL dimensions pass | 38% |

Only 38% of examples from a 340B model pass all five gates simultaneously. That retains a larger share than Magpie’s 10%, but each gate is a targeted, per-dimension check applied to the output of a higher-quality generator. The result: HelpSteer2, a dataset of approximately 10,000 examples that improved a Llama-3-70B model by 3-5% across MT-Bench, AlpacaEval, and Arena-Hard.
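One way to read the 38% figure is to compare it with what statistically independent gates would produce. The sketch below multiplies the per-dimension pass rates quoted above; treating the gates as independent is an assumption made only for this comparison.

```python
import math

pass_rates = {
    "coherence": 0.92,
    "verbosity": 0.88,
    "complexity": 0.78,
    "correctness": 0.64,
    "helpfulness": 0.61,
}

# Joint pass rate if failures on different dimensions were unrelated.
independent_joint = math.prod(pass_rates.values())
observed_joint = 0.38

print(f"independent gates would pass: {independent_joint:.1%}")
print(f"reported joint pass rate:     {observed_joint:.1%}")
```

Independent gates would pass roughly a quarter of the examples. The reported rate is higher, which suggests the failures overlap: a response that fails on helpfulness tends to fail on correctness as well.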

ℹ️ 10K Examples, Not 10M

The HelpSteer2 dataset contains only ~10,000 examples. This is not a typo. High-quality, multi-dimensionally-scored synthetic data is so information-dense that 10K examples can shift a 70B model’s behavior measurably. Compare this to early synthetic datasets (Alpaca: 52K, Dolly: 15K) that used no reward model filtering. Quality beats quantity when quality is rigorously measured.

Nemotron vs. Magpie: When to Use Which

The two approaches target different regimes:

Magpie vs. Nemotron: Approach Comparison

| Dimension | Magpie | Nemotron-4 |
|---|---|---|
| Generator model size | 8B-70B (any instruct model) | 340B (frontier scale) |
| Instruction source | Model's own prior distribution | External prompts or seeds |
| Quality signal | Post-hoc filtering (perplexity + reward) | Online reward model (5 dimensions) |
| Typical output size | 100K-1M filtered examples | 5K-20K filtered examples |
| Best for | Broad instruction tuning | Targeted quality improvement |
| Compute cost | Low-medium | High (340B generator + 340B reward) |

Use Magpie when you need a large, diverse instruction-tuning dataset and have a limited compute budget. Use Nemotron-style reward filtering when you need a small, extremely high-quality dataset for targeted improvement on specific benchmarks.


4. The Quality Filtering Pipeline

Raw synthetic data is noisy. The filtering pipeline is what transforms “output from an LLM API” into “training-ready dataset.” Each stage removes a different class of failure. The order matters — early stages are cheap and remove obvious garbage, later stages are expensive and handle subtle quality issues.

Stage 1: Format Validation

Before any quality assessment, reject examples that fail structural requirements:

import re
from dataclasses import dataclass

@dataclass
class FormatCheckResult:
    passed: bool
    reason: str

def check_format(instruction: str, response: str) -> FormatCheckResult:
    """Reject structurally invalid examples."""
    # Instruction checks
    if len(instruction.strip()) < 10:
        return FormatCheckResult(False, "instruction_too_short")
    if len(instruction.strip()) > 2000:
        return FormatCheckResult(False, "instruction_too_long")
    if instruction.strip() == response.strip()[:len(instruction.strip())]:
        return FormatCheckResult(False, "response_copies_instruction")

    # Response checks
    if len(response.strip()) < 50:
        return FormatCheckResult(False, "response_too_short")
    if len(response.strip()) > 16000:
        return FormatCheckResult(False, "response_too_long")

    # Detect repetition: same sentence repeated 3+ times
    sentences = re.split(r'[.!?]+', response)
    sentence_counts: dict[str, int] = {}
    for s in sentences:
        s_clean = s.strip().lower()
        if len(s_clean) > 20:
            sentence_counts[s_clean] = sentence_counts.get(s_clean, 0) + 1
    if any(count >= 3 for count in sentence_counts.values()):
        return FormatCheckResult(False, "excessive_repetition")

    # Detect refusals
    refusal_patterns = [
        r"i cannot",
        r"i can't",
        r"i'm unable to",
        r"as an ai",
        r"i don't have the ability",
    ]
    response_lower = response.lower()
    for pattern in refusal_patterns:
        if re.search(pattern, response_lower) and len(response) < 200:
            return FormatCheckResult(False, "likely_refusal")

    return FormatCheckResult(True, "ok")

This stage is essentially free — string operations on text. It typically removes 5-15% of raw examples.

Stage 2: Deduplication with MinHash LSH

Duplicate and near-duplicate examples waste training compute and skew the data distribution. MinHash Locality-Sensitive Hashing (LSH) detects near-duplicates efficiently at scale.

The algorithm:

  1. Convert each example to a set of character n-grams (shingles).
  2. Compute $k$ hash functions over the shingle set to produce a MinHash signature.
  3. Use LSH banding to identify candidate pairs with high Jaccard similarity.
  4. For each cluster of near-duplicates, keep one representative.

from datasketch import MinHash, MinHashLSH

def build_dedup_index(
    examples: list[dict],
    threshold: float = 0.7,
    num_perm: int = 128,
    shingle_size: int = 5,
) -> list[dict]:
    """Remove near-duplicate examples using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    minhashes: dict[int, MinHash] = {}

    # Build index
    for i, ex in enumerate(examples):
        text = ex["instruction"] + " " + ex["response"]
        m = MinHash(num_perm=num_perm)
        # Create character-level shingles
        for j in range(len(text) - shingle_size + 1):
            shingle = text[j:j + shingle_size]
            m.update(shingle.encode("utf-8"))
        minhashes[i] = m
        # insert() only raises ValueError on a repeated key; the keys here
        # are unique indices, so no exception handling is needed.
        lsh.insert(str(i), m)

    # Find unique representatives
    seen: set[int] = set()
    unique: list[dict] = []
    for i, ex in enumerate(examples):
        if i in seen:
            continue
        # Query for near-duplicates
        neighbors = lsh.query(minhashes[i])
        for n in neighbors:
            seen.add(int(n))
        unique.append(ex)

    return unique

The Jaccard threshold of 0.7 is standard. Examples sharing 70%+ of their character 5-grams are considered duplicates. This catches paraphrases (“Explain X” vs “Can you explain X?”) and template-generated responses that differ only in minor details.

Complexity: $O(n)$ index building, $O(n)$ querying. For 3M examples, this runs in under 10 minutes on a single CPU.
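MinHash estimates the Jaccard similarity between shingle sets. For spot-checking a handful of suspected duplicates, the exact value can be computed directly. The helper names below are illustrative, and this brute-force comparison is quadratic in the number of examples, so it is only a debugging aid, not a replacement for LSH.

```python
def shingles(text: str, size: int = 5) -> set[str]:
    """All overlapping character n-grams of length `size`."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def jaccard(a: str, b: str, size: int = 5) -> float:
    """Exact Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(sa & sb) / len(sa | sb)

# "abcdef" -> {"abcde", "bcdef"}; "abcdeg" -> {"abcde", "bcdeg"}
# One shared shingle out of three distinct ones.
print(jaccard("abcdef", "abcdeg"))
```

A pair is treated as a near-duplicate when this value reaches the LSH threshold (0.7 in the code above).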

Stage 3: Perplexity Filtering

Perplexity measures how “surprised” a reference model is by a given text. It serves as a proxy for two distinct quality signals:

  • Very low perplexity ($< 5$): The text is highly predictable. This indicates repetitive, formulaic, or templated content. A looping string such as “The answer is the answer is the answer” drifts toward the minimum perplexity of 1 as the repetition continues.
  • Very high perplexity ($> 100$): The text is unpredictable. This indicates incoherent, garbled, or code-mixed content that does not form valid natural language.

The target range depends on your reference model. For a GPT-2-medium reference:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class PerplexityFilter:
    def __init__(
        self,
        model_name: str = "gpt2-medium",
        device: str = "cuda",
        min_ppl: float = 5.0,
        max_ppl: float = 100.0,
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name
        ).to(device).eval()
        self.device = device
        self.min_ppl = min_ppl
        self.max_ppl = max_ppl

    @torch.no_grad()
    def compute_perplexity(self, text: str) -> float:
        encodings = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=1024,
        ).to(self.device)
        input_ids = encodings.input_ids
        outputs = self.model(**encodings, labels=input_ids)
        # outputs.loss is the mean negative log-likelihood
        return torch.exp(outputs.loss).item()

    def filter(self, text: str) -> bool:
        """Returns True if the text passes the perplexity filter."""
        ppl = self.compute_perplexity(text)
        return self.min_ppl <= ppl <= self.max_ppl

💡 Choosing the Reference Model

Use a smaller model (GPT-2, Pythia-410M) as the perplexity reference, not the same model that generated the data. If you compute perplexity using the generator model, every generated example will have low perplexity by definition — the model assigns high probability to its own outputs. A smaller, independently-trained model provides an unbiased quality signal.
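For intuition about the scale, perplexity is the exponential of the mean negative log-likelihood per token, so its lowest possible value is 1. The sketch below computes it from a list of per-token log-probabilities; the function name and the toy inputs are illustrative.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Every token predicted with probability 0.5: like a fair coin flip per token.
coin_flip = [math.log(0.5)] * 4
# Every token predicted with probability 0.99: almost fully predictable text.
near_certain = [math.log(0.99)] * 4

print(perplexity(coin_flip))
print(perplexity(near_certain))
```

A perplexity of $k$ can be read as the model being, on average, as uncertain as if it were choosing uniformly among $k$ tokens.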

Stage 4: Reward Model Scoring

After format validation, deduplication, and perplexity filtering have removed the obvious failures, the remaining examples need semantic quality assessment. A reward model scores each example on a continuous scale. This is the most expensive filtering stage but also the most impactful.

If you do not have access to Nemotron-4-340B-Reward, you can use open-source alternatives:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

class RewardModelScorer:
    def __init__(
        self,
        model_name: str = "OpenAssistant/reward-model-deberta-v3-large-v2",
        device: str = "cuda",
    ):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name
        ).to(device).eval()
        self.device = device

    @torch.no_grad()
    def score(self, instruction: str, response: str) -> float:
        """Score an instruction-response pair. Higher = better."""
        # This reward model's card passes (question, answer) as a sentence
        # pair, so the two texts go in as separate tokenizer arguments
        # rather than as one "Human: ... Assistant: ..." string.
        inputs = self.tokenizer(
            instruction,
            response,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(self.device)
        outputs = self.model(**inputs)
        return outputs.logits[0, 0].item()

    @torch.no_grad()
    def score_batch(
        self,
        pairs: list[tuple[str, str]],
        batch_size: int = 16,
    ) -> list[float]:
        """Score a batch of instruction-response pairs."""
        scores: list[float] = []
        for i in range(0, len(pairs), batch_size):
            batch = pairs[i:i + batch_size]
            inputs = self.tokenizer(
                [inst for inst, _ in batch],
                [resp for _, resp in batch],
                return_tensors="pt",
                truncation=True,
                max_length=2048,
                padding=True,
            ).to(self.device)
            outputs = self.model(**inputs)
            scores.extend(outputs.logits[:, 0].tolist())
        return scores

With a DeBERTa-v3-large reward model on a single A100, you get approximately 500-1000 scores per second. For the 1.2M examples that survive perplexity filtering, scoring takes about 20-40 minutes.
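Once every example has a score, the filter itself is a ranking step. A minimal sketch of a top-fraction cut, assuming each example is a dict that stores its score under a `reward_score` key (the function name is hypothetical):

```python
def keep_top_fraction(
    examples: list[dict],
    fraction: float = 0.25,
    key: str = "reward_score",
) -> list[dict]:
    """Keep the highest-scoring `fraction` of examples, best first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not examples:
        return []
    ranked = sorted(examples, key=lambda ex: ex[key], reverse=True)
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]

# Toy scores; real scores come from the reward model.
toy = [{"reward_score": s} for s in [0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]]
kept = keep_top_fraction(toy)
print([ex["reward_score"] for ex in kept])
```

A relative cut like this adapts to whatever score range the reward model produces, which matters for models whose outputs are raw, unbounded logits. An absolute threshold is the alternative when scores are calibrated.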

Stage 5: Difficulty Calibration

A dataset consisting entirely of easy examples teaches the model nothing new. A dataset consisting entirely of hard examples causes training instability. The optimal training distribution includes a mix:

import numpy as np

def calibrate_difficulty(
    examples: list[dict],
    target_distribution: dict[str, float] | None = None,
) -> list[dict]:
    """Select examples to match a target difficulty distribution.

    Difficulty is estimated from instruction length, response length,
    and reward model score (harder = lower reward, longer response).
    """
    if target_distribution is None:
        target_distribution = {
            "easy": 0.20,
            "medium": 0.50,
            "hard": 0.30,
        }

    # Compute difficulty score per example
    for ex in examples:
        # Heuristic: combine instruction complexity and response length
        inst_words = len(ex["instruction"].split())
        resp_words = len(ex["response"].split())
        reward = ex.get("reward_score", 0.0)

        # Normalize to 0-1 scale
        complexity = min(inst_words / 100, 1.0)
        length_factor = min(resp_words / 500, 1.0)
        # Lower reward on hard questions is expected, don't penalize
        difficulty_score = 0.4 * complexity + 0.4 * length_factor + 0.2 * (1 - min(max(reward, 0), 1))
        ex["difficulty_score"] = difficulty_score

    # Bin into difficulty buckets
    scores = [ex["difficulty_score"] for ex in examples]
    p33, p66 = np.percentile(scores, [33, 66])

    buckets: dict[str, list[dict]] = {"easy": [], "medium": [], "hard": []}
    for ex in examples:
        if ex["difficulty_score"] <= p33:
            buckets["easy"].append(ex)
        elif ex["difficulty_score"] <= p66:
            buckets["medium"].append(ex)
        else:
            buckets["hard"].append(ex)

    # Sample from each bucket according to target distribution.
    # target_total is the largest dataset size for which every bucket can
    # still supply its target share without running out of examples.
    target_total = min(
        int(len(buckets[k]) / target_distribution[k])
        for k in buckets
        if target_distribution[k] > 0
    )

    calibrated: list[dict] = []
    for difficulty, fraction in target_distribution.items():
        n_select = int(target_total * fraction)
        bucket = buckets[difficulty]
        if len(bucket) <= n_select:
            calibrated.extend(bucket)
        else:
            indices = np.random.choice(
                len(bucket), size=n_select, replace=False
            )
            calibrated.extend([bucket[i] for i in indices])

    return calibrated

The 20/50/30 easy/medium/hard split is a good starting point. Some practitioners prefer 15/50/35 to push harder on challenging examples, especially for math and coding domains.
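The target split also determines how much data survives calibration. The sketch below mirrors the `target_total` computation from the function above for equally sized buckets, which is what a 33rd/66th percentile split produces. The helper name and the bucket sizes are illustrative.

```python
def max_calibrated_size(bucket_sizes: dict[str, int], target: dict[str, float]) -> int:
    """Largest total size at which no bucket runs out of examples."""
    return min(
        int(bucket_sizes[k] / target[k])
        for k in bucket_sizes
        if target[k] > 0
    )

bucket_sizes = {"easy": 30_000, "medium": 30_000, "hard": 30_000}
target = {"easy": 0.20, "medium": 0.50, "hard": 0.30}

size = max_calibrated_size(bucket_sizes, target)
print(size, size / sum(bucket_sizes.values()))
```

With equal buckets and a 50% medium share, the medium bucket is the binding constraint: it is used in full, while the easy and hard buckets are subsampled. Roughly two thirds of the input survives.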

Full Pipeline Summary

Filtering Pipeline: Examples Surviving Each Stage

| Stage | Pass rate | Surviving examples |
|---|---|---|
| Raw generated | n/a | 3,000K |
| Format validation | 90% | 2,700K |
| Deduplication | 67% | 1,800K |
| Perplexity filter | 67% | 1,200K |
| Reward model (top-25%) | 25% | 300K |
| Difficulty calibrated (final dataset) | n/a | 250K |

From 3M raw examples to 250K training-ready examples: a 12:1 compression ratio. Every stage is necessary. Skip deduplication and your model overfits to repeated patterns. Skip perplexity filtering and you train on garbage. Skip reward model scoring and you cannot distinguish mediocre from excellent.
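Chaining the per-example stages, cheapest first, and recording the survivor count after each one makes it easy to see where examples are lost. This is a minimal sketch: `run_pipeline` and the toy predicates are hypothetical stand-ins for the real stages.

```python
from typing import Callable

Stage = tuple[str, Callable[[dict], bool]]

def run_pipeline(
    examples: list[dict],
    stages: list[Stage],
) -> tuple[list[dict], list[tuple[str, int]]]:
    """Apply per-example filters in order and record how many survive each."""
    survivors = examples
    counts = [("raw", len(examples))]
    for name, keep in stages:
        survivors = [ex for ex in survivors if keep(ex)]
        counts.append((name, len(survivors)))
    return survivors, counts

# Toy predicates standing in for the real stages, cheapest first.
toy_stages: list[Stage] = [
    ("format", lambda ex: len(ex["response"]) >= 50),
    ("reward", lambda ex: ex["reward_score"] >= 0.5),
]
toy_data = [
    {"response": "x" * 80, "reward_score": 0.9},
    {"response": "too short", "reward_score": 0.9},
    {"response": "y" * 80, "reward_score": 0.1},
]
kept, counts = run_pipeline(toy_data, toy_stages)
print(counts)
```

Stages that operate on the whole set, such as deduplication and difficulty calibration, do not fit the per-example predicate signature and would need a list-in, list-out interface instead.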


5. Domain-Specific Synthesis

Generic instruction-response synthesis works for broad instruction tuning. But domain-specific applications — math reasoning, code generation, medical QA — require specialized generation and validation strategies.

Math Reasoning Traces

Math training data requires not just the final answer but the complete reasoning chain. Each step must be logically valid, and the final answer must be numerically correct.
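Numeric correctness can be checked mechanically when the solution marks its final answer. The sketch below assumes answers are wrapped in `\boxed{...}` and compares the extracted value against a reference, such as a known answer or one taken from an independently sampled solution. Both helper names are hypothetical.

```python
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars: list[str] = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed

def answers_match(solution: str, reference: str) -> bool:
    """Compare the boxed answer to a reference, numerically when possible."""
    answer = extract_boxed(solution)
    if answer is None:
        return False
    try:
        return abs(float(answer) - float(reference)) < 1e-9
    except ValueError:
        return answer == reference.strip()

print(extract_boxed("Therefore x = \\boxed{42}."))
print(answers_match("The total is \\boxed{3.0}", "3"))
```

String comparison is only a fallback: symbolic answers that are equal but written differently would need a computer algebra check, which is outside this sketch.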

MATH_SYSTEM_PROMPT = """You are a math tutor. For each problem:
1. State the problem clearly
2. Show every reasoning step with explicit calculations
3. Box the final numerical answer as \\boxed{answer}

Every step must follow logically from the previous one.
Show intermediate calculations explicitly."""

def generate_math_example(
    client, model: str, difficulty: str = "medium"
) -> dict | None:
    """Generate a math problem with step-by-step solution."""
    difficulty_prompts = {
        "easy": "Generate an algebra problem suitable for a high school student.",
        "medium": "Generate a calculus or probability problem at the undergraduate level.",
        "hard": "Generate a competition-level math problem involving combinatorics, number theory, or analysis.",
    }

    # Step 1: Generate the problem
    problem_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": MATH_SYSTEM_PROMPT},
            {"role": "user", "content": difficulty_prompts[difficulty]},
        ],
        temperature=0.9,
        max_tokens=512,
    )
    problem = problem_resp.choices[0].message.content.strip()

    # Step 2: Generate the solution
    solution_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": MATH_SYSTEM_PROMPT},
            {"role": "user", "content": f"Solve this problem step by step:\n\n{problem}"},
        ],
        temperature=0.3,  # Low temperature for accuracy
        max_tokens=2048,
    )
    solution = solution_resp.choices[0].message.content.strip()

    # Step 3: Verify the answer with a second pass
    verify_resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a math verifier. Check the solution for errors. Respond with CORRECT or INCORRECT followed by a brief explanation."},
            {"role": "user", "content": f"Problem:\n{problem}\n\nSolution:\n{solution}"},
        ],
        temperature=0.1,
        max_tokens=256,
    )
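    verification = verify_resp.choices[0].message.content.strip()

    # "INCORRECT" contains "CORRECT", so a substring test would accept both
    # verdicts; check the leading word instead
    if not verification.upper().startswith("CORRECT"):
        return None  # Reject unverified solutions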

    return {
        "instruction": problem,
        "response": solution,
        "domain": "math",
        "difficulty": difficulty,
        "verified": True,
    }

The verification step is critical. Without it, approximately 15-30% of generated math solutions contain errors — a wrong sign, an arithmetic mistake, a skipped step. The self-verification pass catches about 60-70% of these errors. For production datasets, adding a second independent model as a verifier (cross-verification) catches an additional 10-15%.

⚠️ Math Verification Is Not Solved

Even with self-verification and cross-verification, approximately 5-10% of generated math solutions contain subtle errors that pass all automated checks. For high-stakes math training data, there is no substitute for a final human review pass on a sampled subset. The goal of automated verification is to reduce the human review burden from 100% to 5-10% of examples.

Code Generation Exercises

Code synthesis requires a function specification (docstring + type signature) and a correct implementation. The unique challenge: code can be verified by execution.

CODE_SYSTEM_PROMPT = """Generate a Python programming exercise with:
1. A function signature with type hints
2. A detailed docstring explaining the task, inputs, outputs, and edge cases
3. A reference implementation
4. At least 5 test cases using assert statements

The function should be self-contained (no imports beyond stdlib)."""

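def validate_code_example(example: dict) -> bool:
    """Execute the generated code and verify that its assert-based tests pass.

    Assumes example["response"] is plain Python source with no Markdown fences.
    """
    code = example["response"]
    try:
        # A fresh namespace keeps names from leaking into this module, but it
        # is NOT a sandbox: run untrusted code in a separate process or container
        namespace: dict = {}
        exec(code, namespace)  # noqa: S102
        return True
    except Exception:
        # Covers SyntaxError, NameError, TypeError, AssertionError, and the rest
        return False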

Execution-based validation is the gold standard for code data. If the tests pass, the implementation is (at minimum) consistent with the specification. The false-positive rate is low — incorrect code rarely passes well-designed tests by accident.

The execution filter rejects 20-40% of generated code examples, depending on the difficulty level. Hard algorithmic problems (dynamic programming, graph algorithms) have rejection rates above 50%.
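
Running generated code with `exec` inside the pipeline process means an infinite loop or a call to `sys.exit` can take the whole pipeline down. A minimal sketch of a safer variant runs each candidate in a child interpreter with a timeout. `run_in_subprocess` is a hypothetical helper and the 5-second default is an arbitrary assumption. This gives process isolation only, not a security sandbox; genuinely untrusted code still belongs in a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_subprocess(code: str, timeout_s: float = 5.0) -> bool:
    """Return True if `code` runs to completion with exit status 0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                # -I runs Python in isolated mode (ignores user site-packages
                # and PYTHON* environment variables).
                [sys.executable, "-I", str(script)],
                cwd=tmp,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child is killed on timeout; treat it as a failed example.
            return False
        # A failing assert raises AssertionError, which exits with status 1.
        return result.returncode == 0


print(run_in_subprocess("assert sorted([3, 1, 2]) == [1, 2, 3]"))  # True
print(run_in_subprocess("assert 1 + 1 == 3"))                      # False
```

Because the exit status is the only signal, an example that defines a function but includes no assert statements still passes. Checking that the code contains at least one `assert` before running it closes that gap.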

Medical QA with Grounding

Medical domain synthesis requires grounding in verified sources to prevent hallucination. The approach: retrieve relevant PubMed abstracts, then generate QA pairs conditioned on the retrieved evidence.

def generate_medical_qa(
    client,
    model: str,
    pubmed_abstracts: list[str],
) -> dict | None:
    """Generate a medical QA pair grounded in PubMed evidence."""
    # Select 2-3 related abstracts as grounding context
    context = "\n\n".join(pubmed_abstracts[:3])

    prompt = f"""Based ONLY on the following medical literature, generate:
1. A clinical question that a medical professional might ask
2. A detailed answer citing specific findings from the provided abstracts
3. Mark any claim not directly supported by the abstracts as [UNVERIFIED]

Format the output exactly as:
QUESTION: <the clinical question>
ANSWER: <the detailed answer>

Literature:
{context}"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a medical education assistant. Only make claims supported by the provided literature."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.5,
        max_tokens=1024,
    )
    text = response.choices[0].message.content.strip()

    # Split the output into question and answer; reject malformed generations
    _, _, rest = text.partition("QUESTION:")
    question, sep, answer = rest.partition("ANSWER:")
    question, answer = question.strip(), answer.strip()
    if not sep or not question or not answer:
        return None

    # Reject if too many unverified claims
    unverified_count = answer.lower().count("[unverified]")
    if unverified_count > 2:
        return None

    return {
        "instruction": question,
        "response": answer,
        "domain": "medical",
        "grounding_sources": pubmed_abstracts[:3],
        "unverified_claims": unverified_count,
    }

The grounding approach reduces hallucination rates from 30-40% (ungrounded generation) to 5-10% (grounded generation). The [UNVERIFIED] tagging provides an additional self-monitoring signal — the model is more likely to flag uncertain claims when explicitly prompted to do so.

Domain Quality Comparison

📊 Synthetic Data Quality by Domain

| Domain | Raw Accuracy | Post-Filter Accuracy | Verification Method | Typical Filter Rate |
|---|---|---|---|---|
| General chat | 70-80% | 90-95% | Reward model | 60-75% rejected |
| Math reasoning | 55-70% | 85-90% | Self-verify + cross-verify | 30-45% rejected |
| Code generation | 50-65% | 95-99% | Execution-based testing | 35-50% rejected |
| Medical QA | 60-75% | 88-93% | Grounding + expert review | 25-40% rejected |

Note: Raw accuracy = fraction of generated examples that are correct before filtering. Post-filter accuracy = fraction correct after full pipeline. Code achieves the highest post-filter accuracy because execution provides a binary correctness signal.

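Code generation achieves the highest post-filter accuracy because it has the strongest verification signal: either the code runs and passes tests, or it does not. Math reasoning has a partial hard signal, since final numerical answers can be checked, but errors in intermediate steps can survive verification, which is consistent with its lower post-filter figure in the table. General chat and medical QA rely on softer quality signals (reward models, grounding consistency), so the residual noise in those domains is harder to detect and measure.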


6. Implementation Exercise: The Reviewer Agent Pipeline

Let us build a complete end-to-end synthetic data pipeline. This implementation generates instruction-response pairs, scores them with a reward function, filters to the top-10%, and outputs a training-ready JSONL file.

Design

The pipeline has four components:

  1. Generator: Produces instruction-response pairs via an LLM API.
  2. Scorer: Assigns a quality score to each pair using a lightweight reward heuristic; a learned reward model can be swapped in later.
  3. Filter: Selects the top-k% by score.
  4. Formatter: Writes the filtered examples as JSONL in the ChatML format expected by training frameworks.

Complete Implementation

"""
Synthetic Data Pipeline: Generate, Score, Filter, Format.

Usage:
    python synth_pipeline.py \
        --model gpt-4o-mini \
        --num-examples 1000 \
        --top-k-pct 10 \
        --output training_data.jsonl

Requirements:
    pip install openai tiktoken
"""

import argparse
import asyncio
import json
import math
import sys
from dataclasses import dataclass, asdict, field
from pathlib import Path

import tiktoken
from openai import AsyncOpenAI

# ── Data structures ──────────────────────────────────────────────

@dataclass
class Example:
    instruction: str
    response: str
    scores: dict = field(default_factory=dict)
    total_score: float = 0.0

@dataclass
class PipelineStats:
    generated: int = 0
    format_passed: int = 0
    scored: int = 0
    selected: int = 0

# ── Generation ───────────────────────────────────────────────────

SYSTEM_PROMPT = "You are a helpful, knowledgeable AI assistant."

INSTRUCTION_PROMPT = (
    "Generate one specific, self-contained user question or instruction "
    "that requires a detailed, substantive response. Cover any topic: "
    "science, coding, math, writing, analysis, planning, etc. "
    "Output ONLY the instruction text."
)

async def generate_pair(
    client: AsyncOpenAI,
    model: str,
    semaphore: asyncio.Semaphore,
) -> Example | None:
    async with semaphore:
        try:
            # Generate instruction
            instr_resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": INSTRUCTION_PROMPT},
                ],
                temperature=1.0,
                max_tokens=256,
            )
            instruction = instr_resp.choices[0].message.content.strip()

            # Generate response
            resp = await client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": instruction},
                ],
                temperature=0.7,
                max_tokens=2048,
            )
            response = resp.choices[0].message.content.strip()
            return Example(instruction=instruction, response=response)
        except Exception as e:
            print(f"Generation error: {e}", file=sys.stderr)
            return None

# ── Format validation ────────────────────────────────────────────

def passes_format_check(ex: Example) -> bool:
    if len(ex.instruction) < 10 or len(ex.instruction) > 2000:
        return False
    if len(ex.response) < 50 or len(ex.response) > 16000:
        return False
    # Reject if response starts by copying the instruction
    if ex.response.lower().startswith(ex.instruction.lower()[:40]):
        return False
    return True

# ── Scoring ──────────────────────────────────────────────────────

def score_example(ex: Example, enc: tiktoken.Encoding) -> Example:
    """Score an example using heuristic quality signals.

    Scores (each 0-1):
      - length_score: longer, substantive responses score higher
      - structure_score: presence of paragraphs, lists, or code
      - specificity_score: ratio of unique tokens to total tokens
    """
    resp = ex.response
    resp_tokens = enc.encode(resp)
    n_tokens = len(resp_tokens)

    # Length: peak reward around 200-600 tokens
    if n_tokens < 30:
        length_score = 0.0
    elif n_tokens < 200:
        length_score = n_tokens / 200
    elif n_tokens <= 600:
        length_score = 1.0
    else:
        length_score = max(0.5, 1.0 - (n_tokens - 600) / 2000)

    # Structure: reward paragraphs, lists, code blocks
    has_paragraphs = resp.count("\n\n") >= 2
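    # Count lines that begin with a bullet ("- ", "* ") or a number ("1. "),
    # so a numbered list whose markers each appear once is still detected
    def is_list_item(line: str) -> bool:
        stripped = line.lstrip()
        number, dot, _ = stripped.partition(". ")
        return stripped.startswith(("- ", "* ")) or (
            bool(dot) and number.isdigit()
        )

    has_list = sum(is_list_item(ln) for ln in resp.splitlines()) >= 2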
    has_code = "```" in resp
    structure_score = (
        0.4 * has_paragraphs + 0.3 * has_list + 0.3 * has_code
    )

    # Specificity: unique token ratio (penalizes repetition)
    unique_tokens = len(set(resp_tokens))
    specificity_score = min(unique_tokens / max(n_tokens, 1), 1.0)

    # Weighted total
    total = (
        0.35 * length_score
        + 0.30 * structure_score
        + 0.35 * specificity_score
    )

    ex.scores = {
        "length": round(length_score, 3),
        "structure": round(structure_score, 3),
        "specificity": round(specificity_score, 3),
    }
    ex.total_score = round(total, 3)
    return ex

# ── Filtering ────────────────────────────────────────────────────

def select_top_k(
    examples: list[Example], top_k_pct: float
) -> list[Example]:
    """Select the top-k% of examples by total score."""
    # sorted() returns a new list, leaving the caller's list order untouched
    ranked = sorted(examples, key=lambda e: e.total_score, reverse=True)
    n_select = max(1, math.ceil(len(ranked) * top_k_pct / 100))
    return ranked[:n_select]

# ── Formatting ───────────────────────────────────────────────────

def to_chatml_jsonl(examples: list[Example]) -> list[str]:
    """Format examples as ChatML JSONL for training."""
    lines = []
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex.instruction},
                {"role": "assistant", "content": ex.response},
            ],
            "metadata": {
                "scores": ex.scores,
                "total_score": ex.total_score,
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines

# ── Main pipeline ────────────────────────────────────────────────

async def run_pipeline(
    model: str,
    num_examples: int,
    top_k_pct: float,
    output_path: Path,
    concurrency: int = 30,
):
    client = AsyncOpenAI()
    semaphore = asyncio.Semaphore(concurrency)
    enc = tiktoken.encoding_for_model("gpt-4o")
    stats = PipelineStats()

    print(f"Generating {num_examples} examples with {model}...")
    tasks = [
        generate_pair(client, model, semaphore)
        for _ in range(num_examples)
    ]
    raw = await asyncio.gather(*tasks)
    examples = [e for e in raw if e is not None]
    stats.generated = len(examples)
    print(f"  Generated: {stats.generated}")

    # Format filter
    examples = [e for e in examples if passes_format_check(e)]
    stats.format_passed = len(examples)
    print(f"  After format check: {stats.format_passed}")

    # Score
    examples = [score_example(e, enc) for e in examples]
    stats.scored = len(examples)

    # Filter top-k%
    selected = select_top_k(examples, top_k_pct)
    stats.selected = len(selected)
    print(f"  After top-{top_k_pct}% filter: {stats.selected}")

    # Write JSONL
    lines = to_chatml_jsonl(selected)
    # End the file with a newline so every record is a complete JSONL line
    output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    print(f"  Written to: {output_path}")
    print(f"\nPipeline stats: {asdict(stats)}")

def main():
    parser = argparse.ArgumentParser(
        description="Synthetic data generation pipeline"
    )
    parser.add_argument("--model", default="gpt-4o-mini")
    parser.add_argument("--num-examples", type=int, default=1000)
    parser.add_argument("--top-k-pct", type=float, default=10.0)
    parser.add_argument("--output", default="training_data.jsonl")
    parser.add_argument("--concurrency", type=int, default=30)
    args = parser.parse_args()

    asyncio.run(
        run_pipeline(
            model=args.model,
            num_examples=args.num_examples,
            top_k_pct=args.top_k_pct,
            output_path=Path(args.output),
            concurrency=args.concurrency,
        )
    )

if __name__ == "__main__":
    main()

Walking Through the Pipeline

Let us trace the data flow for a concrete run with --num-examples 1000 --top-k-pct 10:

Step 1: Generation. The pipeline launches 1,000 generation tasks, with the semaphore allowing at most 30 to run at a time. Each task makes two API calls: one to generate an instruction, then one to generate a response. With gpt-4o-mini at $0.15/1M input tokens and $0.60/1M output tokens, and an average of ~100 input tokens and ~400 output tokens per call, the total cost is approximately:

$$C = 1000 \times \left(\frac{100 \times 0.15}{10^6} + \frac{400 \times 0.60}{10^6}\right) \times 2 \approx \$0.51$$

That is $0.51 for 1,000 raw examples. Some will fail (API errors, timeouts), leaving approximately 950 successful pairs.
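
The same arithmetic as a small function, which makes it easy to re-estimate the cost for other models or token budgets. The defaults are the prices and token counts quoted above; `generation_cost_usd` is a hypothetical helper name, and actual prices should be checked against the provider's current price list.

```python
def generation_cost_usd(
    num_pairs: int,
    input_tokens_per_call: int = 100,
    output_tokens_per_call: int = 400,
    input_price_per_m: float = 0.15,   # USD per 1M input tokens
    output_price_per_m: float = 0.60,  # USD per 1M output tokens
    calls_per_pair: int = 2,           # one instruction call + one response call
) -> float:
    """Estimate API cost for generating `num_pairs` instruction-response pairs."""
    per_call = (
        input_tokens_per_call * input_price_per_m
        + output_tokens_per_call * output_price_per_m
    ) / 1_000_000
    return num_pairs * calls_per_pair * per_call


print(f"${generation_cost_usd(1_000):.2f}")    # $0.51
print(f"${generation_cost_usd(100_000):.2f}")  # $51.00
```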

Step 2: Format check. Rejects examples with instructions shorter than 10 characters, responses shorter than 50 characters, or responses that copy the instruction. Typically removes 5-10%, leaving ~870 examples.

Step 3: Scoring. Each example receives three sub-scores:

  • Length score: A response with 350 tokens scores 1.0 (in the sweet spot). A response with 80 tokens scores 0.4. A response with 1,200 tokens scores 0.7 (slight penalty for excessive length).
  • Structure score: A response with multiple paragraphs, a bulleted list, and a code block scores 1.0. A single-paragraph, unstructured response scores 0.0.
  • Specificity score: A response where 85% of tokens are unique scores 0.85. A repetitive response where only 40% of tokens are unique scores 0.40.

The weighted total (0.35×length+0.30×structure+0.35×specificity0.35 \times \text{length} + 0.30 \times \text{structure} + 0.35 \times \text{specificity}) produces a score between 0 and 1.

Step 4: Top-k selection. Sort by total score, take the top 10%. From 870 scored examples, select 87.

Step 5: JSONL formatting. Each selected example is written as a single JSON line in ChatML format (shown here pretty-printed for readability):

{
  "messages": [
    {"role": "system", "content": "You are a helpful, knowledgeable AI assistant."},
    {"role": "user", "content": "Explain how B-trees maintain balance during insertion..."},
    {"role": "assistant", "content": "B-trees maintain balance through a split-and-promote mechanism..."}
  ],
  "metadata": {
    "scores": {"length": 0.95, "structure": 0.7, "specificity": 0.89},
    "total_score": 0.854
  }
}

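The messages array follows the chat format used by OpenAI fine-tuning, Axolotl, and most training frameworks that accept ChatML-style JSONL. Some loaders validate keys strictly, so you may need to drop the metadata field before training.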

Extending the Pipeline

The heuristic scorer is a starting point. For production use, replace or augment it with:

  1. A reward model: Replace score_example with a call to a DeBERTa-based reward model (see Section 4). This changes the scoring from heuristic to learned.
  2. MinHash deduplication: Add the dedup stage between format checking and scoring. This removes near-duplicates that would otherwise inflate the top-k% with repetitive content.
  3. Domain tagging: Classify each example by domain (math, code, general, creative) and ensure the final selection includes a balanced mix.
  4. Multi-turn expansion: After selecting high-quality single-turn examples, use the same model to extend them into multi-turn conversations, creating a richer training signal.
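
As one illustration of the extensions above, domain balancing can be sketched as a per-domain top-n selection. This assumes each example has already been tagged with a `domain` key by some classifier (not shown) and carries a `total_score`. `select_balanced` is a hypothetical helper and only one of several reasonable balancing strategies.

```python
from collections import defaultdict


def select_balanced(examples: list[dict], per_domain: int) -> list[dict]:
    """Keep the highest-scoring `per_domain` examples from each domain."""
    by_domain: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)

    selected: list[dict] = []
    for domain in sorted(by_domain):  # Sorted for a deterministic output order
        ranked = sorted(
            by_domain[domain], key=lambda e: e["total_score"], reverse=True
        )
        selected.extend(ranked[:per_domain])
    return selected


# Toy records; in practice these would come from the scoring stage.
pool = [
    {"domain": "code", "total_score": 0.91},
    {"domain": "code", "total_score": 0.88},
    {"domain": "code", "total_score": 0.85},
    {"domain": "math", "total_score": 0.62},
    {"domain": "general", "total_score": 0.70},
]
balanced = select_balanced(pool, per_domain=1)
print([(ex["domain"], ex["total_score"]) for ex in balanced])
```

A global top-k over this pool would return only code examples. The per-domain cap trades some average score for coverage, which is the point of balancing.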

Cost Summary for the Exercise

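Running this pipeline with 1,000 examples using gpt-4o-mini costs under $1.00 in API fees. Scaling to 100,000 examples costs about $51 at the same per-pair rate. The top-10% filter then yields up to 10,000 training-ready examples (somewhat fewer once generation failures and format rejections are subtracted) for approximately $50. Expert annotation at $25 per example would cost $250,000 for the same 10,000 examples, so the reduction is roughly 5,000x.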


Putting It All Together: The Production Pipeline

A production synthetic data pipeline combines everything from this post into a staged system. The data structures and the filtering logic above are not academic exercises — they are the core components of how datasets like OpenHermes, Magpie-Air, and HelpSteer2 are actually built.

The complete flow:

  1. Instruction generation (Magpie or prompt-based) produces millions of raw instructions.
  2. Response generation (strong teacher model) generates responses to each instruction.
  3. Format validation (string checks, length filters) removes structural failures.
  4. Deduplication (MinHash LSH at Jaccard > 0.7) removes near-duplicates.
  5. Perplexity filtering (GPT-2 reference, reject outside 5-100 range) removes incoherent and repetitive content.
  6. Reward model scoring (Nemotron 5-dimension or single-score) ranks by quality.
  7. Difficulty calibration (20/50/30 easy/medium/hard) ensures training distribution balance.
  8. Domain-specific validation (execution for code, verification for math, grounding for medical) applies the strongest available correctness signal.
  9. JSONL export in ChatML format, ready for training.
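
The staged structure can be sketched as a list of named filter functions applied in order, with the survivor count recorded after each one. The two stages below are toy stand-ins (a length check and exact-match dedup in place of MinHash), and `run_stages` is a hypothetical helper, so treat this as an outline of the control flow rather than a production implementation.

```python
from typing import Callable

Stage = tuple[str, Callable[[list[dict]], list[dict]]]


def run_stages(
    examples: list[dict], stages: list[Stage]
) -> tuple[list[dict], list[tuple[str, int]]]:
    """Apply filtering stages in order and record survivors after each one."""
    survivors = [("raw", len(examples))]
    for name, stage in stages:
        examples = stage(examples)
        survivors.append((name, len(examples)))
    return examples, survivors


def format_validation(batch: list[dict]) -> list[dict]:
    # Toy structural check: drop very short responses.
    return [ex for ex in batch if len(ex["response"]) >= 50]


def exact_dedup(batch: list[dict]) -> list[dict]:
    # Toy stand-in for MinHash LSH: drop case-insensitive exact duplicates.
    seen: set[str] = set()
    kept = []
    for ex in batch:
        key = ex["instruction"].strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


raw = [
    {"instruction": "Explain B-trees", "response": "x" * 60},
    {"instruction": "explain b-trees ", "response": "y" * 60},
    {"instruction": "Define entropy", "response": "too short"},
]
final, survivors = run_stages(
    raw, [("format", format_validation), ("dedup", exact_dedup)]
)
print(survivors)  # [('raw', 3), ('format', 2), ('dedup', 1)]
```

Logging the survivor counts per stage is what makes a funnel like the one in Section 4 visible. A stage that suddenly rejects far more or far less than usual is often the first sign of a generation problem upstream.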

Compute Cost per Pipeline Stage (1M Raw Examples)

| Stage | Cost (USD) | Compute |
|---|---|---|
| Instruction generation | $500 | API |
| Response generation | $2,000 | API |
| Format validation | $1 | CPU |
| MinHash dedup | $5 | CPU |
| Perplexity filter | $20 | 1 GPU-hour |
| Reward model scoring | $50 | 2 GPU-hours |
| Difficulty calibration | $1 | CPU |

The total cost for a 1M-raw-to-100K-filtered pipeline using API-based generation: approximately $2,500-$3,000. Using local open-source models (Llama-3-70B on rented A100s at $2/GPU-hour), the generation cost drops to approximately $400-$600 for the same volume, bringing the total pipeline under $700.

For $700 you get 100K high-quality, deduplicated, reward-filtered, difficulty-calibrated instruction-response pairs. The equivalent human annotation cost at $10 per example: $1,000,000.
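
A quick check of these totals, using the per-stage figures quoted in this section. The $700 local-generation total and the $10 per example annotation rate are taken from the text as given, not measured.

```python
# Stage costs for 1M raw examples (USD), copied from the figures in this section.
stage_costs_usd = {
    "instruction_gen_api": 500,
    "response_gen_api": 2_000,
    "format_validation": 1,
    "minhash_dedup": 5,
    "perplexity_filter": 20,
    "reward_model_scoring": 50,
    "difficulty_calibration": 1,
}

api_total = sum(stage_costs_usd.values())
non_generation = api_total - 500 - 2_000  # Everything except the two API stages
human_total = 100_000 * 10                # 100K examples at $10 each

print(api_total)                 # 2577
print(non_generation)            # 77
print(round(human_total / 700))  # 1429
```

Generation dominates the bill: the five filtering stages together cost $77, so the choice between API and local generation decides almost the entire budget.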

That is the synthetic data value proposition. Not that the data is free — it is not. But the cost curve has shifted by three orders of magnitude, and the quality, when the pipeline is done right, is competitive with human annotation for most instruction-tuning tasks.