Part of Series The Dataset Frontier 7 of 27

Llama 2 Base completes "What is 2+2?" with "What is 3+3? What is 4+4?", continuing the pattern rather than answering. After instruction tuning on 27,540 examples, Llama 2 Chat responds with "4." The transformation costs $2,754 in annotation (at $0.10/example) and 8 GPU-hours of fine-tuning. Instruction tuning is the cheapest, highest-ROI stage of LLM development: roughly 30K examples turn a next-token predictor into a usable assistant, and the quality ceiling is determined entirely by dataset diversity and correctness.

Anatomy of an Instruction-Response Pair

The Format

Every SFT training example is a conversation: a list of turns, each tagged with a role (system, user, or assistant). A single-turn example looks like this:

{
  "conversations": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Explain the difference between TCP and UDP."
    },
    {
      "role": "assistant",
      "content": "TCP (Transmission Control Protocol) provides reliable, ordered delivery of data. It establishes a connection via a three-way handshake, ensures packets arrive in order, retransmits lost packets, and implements flow control. UDP (User Datagram Protocol) provides unreliable, unordered delivery. It sends packets without establishing a connection and does not retransmit lost packets. TCP is used for web browsing, email, and file transfer where data integrity matters. UDP is used for video streaming, gaming, and DNS lookups where low latency matters more than guaranteed delivery."
    }
  ]
}

The model is trained to predict only the assistant tokens. System and user tokens appear in the input but their loss is masked: the model sees them but is not penalized for failing to predict them.

Loss Masking

The training loss computation:

$$\mathcal{L} = -\frac{1}{|T_a|} \sum_{t \in T_a} \log P(x_t \mid x_{<t})$$

where $T_a$ is the set of token positions that correspond to assistant responses. User and system tokens are part of the conditioning context $x_{<t}$ but not part of the loss summation. This is critical: without loss masking, the model learns to "roleplay" as the user, generating questions instead of answers.

def create_loss_mask(token_ids, role_boundaries):
    """
    Create a loss mask that only includes assistant tokens.

    token_ids: list of token IDs for the full conversation
    role_boundaries: list of (start_idx, end_idx, role) tuples
    """
    mask = [0] * len(token_ids)

    for start, end, role in role_boundaries:
        if role == "assistant":
            for i in range(start, min(end, len(mask))):
                mask[i] = 1

    return mask

# Example: conversation with 100 tokens total
# System: tokens 0-15
# User: tokens 16-40
# Assistant: tokens 41-99
mask = create_loss_mask(
    token_ids=list(range(100)),
    role_boundaries=[
        (0, 16, "system"),
        (16, 41, "user"),
        (41, 100, "assistant"),
    ],
)
# mask = [0]*41 + [1]*59
# Only the 59 assistant tokens contribute to training loss
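
To make the formula concrete, here is a minimal sketch in plain Python of how the mask enters the loss. The function name `masked_nll` and the toy log-probabilities are assumptions made for illustration; a real training loop would compute the same average over tensors inside its framework.

```python
import math

def masked_nll(token_log_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask == 1.

    token_log_probs: log P(x_t | x_<t) for every position in the sequence
    loss_mask: 1 for assistant tokens, 0 for system/user tokens
    (Both names are hypothetical, chosen for this sketch.)
    """
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    n_assistant = sum(loss_mask)
    if n_assistant == 0:
        return 0.0  # nothing to supervise in this sequence
    total = sum(lp for lp, m in zip(token_log_probs, loss_mask) if m)
    return -total / n_assistant

# Toy sequence: 3 prompt tokens (masked out) and 2 assistant tokens.
log_probs = [math.log(0.01), math.log(0.02), math.log(0.05),
             math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = masked_nll(log_probs, mask)
# Only the last two positions count: -(ln 0.5 + ln 0.25) / 2 = 1.5 * ln 2
```

The three prompt positions have very low probability, yet they do not affect the result: the model is never penalized for failing to predict the user's text.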

Data Sources

ShareGPT

ShareGPT is a collection of real user conversations with ChatGPT, scraped from the sharegpt.com website where users voluntarily shared their interactions. Characteristics:

  • Volume: ~90K conversations, ~700K turns
  • Strengths: Real user queries reflect actual usage patterns. Multi-turn conversations teach the model to maintain context. Covers a wide distribution of tasks (coding, writing, analysis, roleplay, math).
  • Weaknesses: Responses are from a specific model version (GPT-3.5/GPT-4), creating a ceiling on response quality. Contains some unsafe content. No systematic quality control.
  • License: Gray area, since the data was user-submitted without clear redistribution rights.

OpenAssistant

A crowdsourced dataset where human annotators wrote both instructions and responses, with quality ratings from other annotators.

  • Volume: ~160K messages across ~66K conversation trees
  • Strengths: Human-written responses (not model-generated). Quality ratings per message. Multi-turn trees with branching (multiple response alternatives). Explicit safety annotations.
  • Weaknesses: Annotator pool was heavily English-speaking and tech-oriented. Response quality varies widely. Smaller effective size after filtering to high-quality subsets.
  • License: Apache 2.0 (fully permissive).

Dolly (Databricks)

Corporate effort: Databricks employees wrote 15K instruction-response pairs.

  • Volume: 15K pairs
  • Strengths: Clean license (cc-by-sa-3.0). Consistent formatting. Covers standard NLP tasks (QA, summarization, classification, extraction).
  • Weaknesses: Small scale. Limited task diversity. Corporate writing style may not match user expectations.

Alpaca (Stanford)

Synthetic dataset: 52K instructions generated by OpenAI's text-davinci-003 (a GPT-3.5-series model) from 175 seed tasks, with responses generated by the same model.

  • Volume: 52K pairs
  • Strengths: Cheap to produce. Consistent format. Easy to filter and clean.
  • Weaknesses: Entirely synthetic, with model-generated instructions and responses. Lower diversity than human data. Quality ceiling bounded by the teacher model.

Instruction Data Source Comparison

| Source | Size | Multi-Turn | Human Written | License | Quality Variance |
|---|---|---|---|---|---|
| ShareGPT | ~90K convos | Yes | Instructions only | Unclear | High |
| OpenAssistant | ~66K trees | Yes | Both sides | Apache 2.0 | Very High |
| Dolly | 15K pairs | No | Both sides | CC-BY-SA | Low |
| Alpaca | 52K pairs | No | Neither | Research only | Medium |
| WildChat | 1M convos | Yes | Instructions only | Apache 2.0 | Very High |

Quality Metrics for Instruction Data

Why Quality Matters More Than Quantity

The LIMA paper (Zhou et al., 2023) demonstrated that fine-tuning Llama-65B on only 1,000 carefully curated examples produced a model competitive with models trained on 50K+ examples. The conclusion: instruction data quality dominates quantity. The 99th percentile of a 50K dataset teaches more than the median of a 500K dataset.

Model Quality vs SFT Dataset Size (Holding Quality Fixed)

| SFT dataset | Win rate vs GPT-3.5 |
|---|---|
| 1K (LIMA-quality) | 72% |
| 10K (curated) | 78% |
| 50K (mixed quality) | 75% |
| 200K (unfiltered) | 68% |
| 500K (noisy) | 62% |

The curve peaks around 10K-50K curated examples and then declines as noisy data is added. This is because low-quality instruction pairs teach the model bad habits: short, lazy responses; incorrect information stated confidently; responses that ignore parts of the instruction.

The Quality Scorer

A complete implementation that scores instruction-response pairs on multiple dimensions:

import re
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class QualityScore:
    instruction_complexity: float  # 0-1
    response_completeness: float  # 0-1
    response_specificity: float  # 0-1
    format_quality: float  # 0-1
    diversity_signal: float  # 0-1
    overall: float  # Weighted combination

class InstructionQualityScorer:
    """
    Score instruction-response pairs on multiple quality dimensions.
    """

    def __init__(self):
        self.seen_instructions = []
        self.instruction_embeddings = []

    def score_instruction_complexity(self, instruction):
        """
        Higher complexity = better training signal.
        Simple factoid questions teach less than multi-step reasoning tasks.
        """
        score = 0.0
        words = instruction.split()
        word_count = len(words)

        # Length component: very short instructions are low quality
        if word_count < 5:
            score += 0.1
        elif word_count < 15:
            score += 0.3
        elif word_count < 50:
            score += 0.6
        else:
            score += 0.8

        # Multi-step detection: instructions that require multiple
        # actions score higher
        step_indicators = [
            "and then", "after that", "next", "first", "second",
            "finally", "also", "additionally", "step",
        ]
        step_count = sum(
            1 for indicator in step_indicators
            if indicator in instruction.lower()
        )
        score += min(step_count * 0.05, 0.15)

        # Constraint detection: instructions with constraints
        # (format, length, style) are more complex
        constraint_words = [
            "format", "exactly", "must", "should not", "avoid",
            "only", "between", "at most", "at least", "without",
        ]
        constraint_count = sum(
            1 for w in constraint_words
            if w in instruction.lower()
        )
        score += min(constraint_count * 0.03, 0.1)

        return min(score, 1.0)

    def score_response_completeness(self, instruction, response):
        """
        Does the response address all parts of the instruction?
        """
        # Extract question marks -- each represents a sub-question
        questions = instruction.count("?")
        if questions == 0:
            questions = 1  # Imperative instruction

        # Simple heuristic: response should have substantial content
        response_words = len(response.split())

        # Minimum viable response length
        if response_words < 20:
            return 0.2

        # Length relative to instruction complexity
        ratio = response_words / max(len(instruction.split()), 1)
        if ratio < 1.0:
            length_score = 0.3
        elif ratio < 3.0:
            length_score = 0.5
        elif ratio < 10.0:
            length_score = 0.8
        else:
            length_score = 0.7  # Penalty for excessively long responses

        # Check for response structure (paragraphs, lists, code blocks)
        has_structure = (
            response.count("\n\n") >= 1
            or response.count("- ") >= 2
            or response.count("```") >= 2
            or response.count("1.") >= 1
        )
        structure_bonus = 0.1 if has_structure else 0.0

        return min(length_score + structure_bonus, 1.0)

    def score_response_specificity(self, response):
        """
        Penalize vague, hedge-filled responses.
        Reward concrete, specific information.
        """
        words = response.lower().split()
        word_count = len(words)
        if word_count == 0:
            return 0.0

        # Vague/hedge phrases that indicate low-quality responses
        hedge_phrases = [
            "it depends", "there are many", "in general",
            "it is important to note", "as an ai",
            "i cannot", "i'm not sure", "it varies",
            "there are several", "various factors",
        ]
        hedge_count = sum(
            1 for phrase in hedge_phrases
            if phrase in response.lower()
        )

        # Specific content indicators
        has_numbers = bool(re.search(r'\d+\.?\d*', response))
        has_code = "```" in response
        has_examples = "example" in response.lower() or "e.g." in response.lower()
        has_citations = bool(
            re.search(r'\([A-Z][a-z]+ et al', response)
        )

        specificity = 0.5
        specificity -= hedge_count * 0.08
        specificity += 0.1 if has_numbers else 0
        specificity += 0.15 if has_code else 0
        specificity += 0.1 if has_examples else 0
        specificity += 0.1 if has_citations else 0

        return max(0.0, min(specificity, 1.0))

    def score_format_quality(self, response):
        """
        Check formatting: markdown structure, code block validity,
        consistent list formatting.
        """
        score = 0.5

        # Code block matching
        backtick_count = response.count("```")
        if backtick_count % 2 != 0:
            score -= 0.2  # Unmatched code blocks

        # Consistent list formatting
        lines = response.split("\n")
        list_styles = set()
        for line in lines:
            stripped = line.strip()
            if stripped.startswith("- "):
                list_styles.add("dash")
            elif stripped.startswith("* "):
                list_styles.add("star")
            elif re.match(r'^\d+\. ', stripped):
                list_styles.add("numbered")

        if len(list_styles) > 1:
            score -= 0.1  # Mixed list styles

        # Paragraph structure
        paragraphs = [p.strip() for p in response.split("\n\n") if p.strip()]
        if len(paragraphs) >= 2:
            score += 0.2  # Well-structured response
        if len(paragraphs) >= 4:
            score += 0.1  # Detailed response

        # Headers (markdown)
        header_count = len(re.findall(r'^#{1,4} ', response, re.MULTILINE))
        if header_count >= 2:
            score += 0.1

        return max(0.0, min(score, 1.0))

    def score_diversity(self, instruction, existing_instructions):
        """
        Penalize instructions that are near-duplicates of existing ones.
        Uses simple word overlap as a proxy.
        """
        if not existing_instructions:
            return 1.0

        instruction_words = set(instruction.lower().split())

        max_overlap = 0.0
        for existing in existing_instructions[-1000:]:
            existing_words = set(existing.lower().split())
            if not instruction_words or not existing_words:
                continue
            overlap = len(
                instruction_words & existing_words
            ) / len(instruction_words | existing_words)
            max_overlap = max(max_overlap, overlap)

        # High overlap with existing = low diversity
        return 1.0 - max_overlap

    def score(self, instruction, response, existing_instructions=None):
        """
        Compute overall quality score for an instruction-response pair.
        """
        complexity = self.score_instruction_complexity(instruction)
        completeness = self.score_response_completeness(
            instruction, response
        )
        specificity = self.score_response_specificity(response)
        formatting = self.score_format_quality(response)
        diversity = self.score_diversity(
            instruction, existing_instructions or []
        )

        # Weighted combination
        overall = (
            0.20 * complexity
            + 0.25 * completeness
            + 0.25 * specificity
            + 0.15 * formatting
            + 0.15 * diversity
        )

        return QualityScore(
            instruction_complexity=complexity,
            response_completeness=completeness,
            response_specificity=specificity,
            format_quality=formatting,
            diversity_signal=diversity,
            overall=overall,
        )

Using the Scorer to Filter Data

def filter_instruction_dataset(
    dataset,
    scorer,
    quality_threshold=0.55,
    max_examples=50000,
):
    """
    Filter an instruction dataset to keep only high-quality pairs.

    dataset: list of dicts with 'instruction' and 'response' keys
    scorer: InstructionQualityScorer instance
    quality_threshold: minimum overall score to keep
    max_examples: maximum number of examples to keep
    """
    scored = []
    existing_instructions = []

    for item in dataset:
        instruction = item["instruction"]
        response = item["response"]

        score = scorer.score(
            instruction, response, existing_instructions
        )

        if score.overall >= quality_threshold:
            scored.append({
                "instruction": instruction,
                "response": response,
                "quality_score": score.overall,
                "scores": {
                    "complexity": score.instruction_complexity,
                    "completeness": score.response_completeness,
                    "specificity": score.response_specificity,
                    "format": score.format_quality,
                    "diversity": score.diversity_signal,
                },
            })
            existing_instructions.append(instruction)

    # Sort by quality and take top N
    scored.sort(key=lambda x: x["quality_score"], reverse=True)
    return scored[:max_examples]

# Example usage
scorer = InstructionQualityScorer()

# Score a single example
result = scorer.score(
    instruction="Explain how TCP congestion control works, including slow start, congestion avoidance, and fast retransmit. Include a concrete example with window sizes.",
    response="TCP congestion control manages network throughput through three phases...",
)
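# With a full, specific response, result.overall might be around 0.72 (good quality).
# The truncated placeholder above is under 20 words, so as written it scores about 0.52.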
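
The diversity signal is the least obvious of the five dimensions, so a small worked example may help. The sketch below reimplements only the word-set Jaccard similarity that `score_diversity` relies on; the helper name `jaccard` and the sample instructions are assumptions for illustration.

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two strings (0.0 to 1.0)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

seen = "Explain how TCP congestion control works"
near_duplicate = "Explain how TCP flow control works"
unrelated = "Write a haiku about autumn rain"

# 5 shared words out of 7 distinct words: similarity 5/7,
# so the diversity score would be 1 - 5/7, roughly 0.29.
near_sim = jaccard(seen, near_duplicate)

# No shared words: similarity 0.0, diversity score 1.0.
far_sim = jaccard(seen, unrelated)
```

Word overlap is a cheap proxy: it catches templated near-duplicates but misses paraphrases that share few words, which is where an embedding-based comparison would do better.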

Decontamination

The Problem

Evaluation benchmarks (MMLU, HumanEval, GSM8K, HellaSwag) appear on the web. If these benchmark questions or their answers appear in the SFT training data, the model memorizes them, and evaluation scores become meaningless. This is not hypothetical: multiple high-profile models have been caught with contaminated evaluation scores.

Detection Methods

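The detector below implements two detection methods, in order of increasing sophistication: exact matching after canonicalization, and fuzzy matching on word-level 8-gram overlap: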

import hashlib
import re
from collections import Counter  # used by check_ngram_overlap

class ContaminationDetector:
    """
    Detect and remove evaluation benchmark questions
    from instruction training data.
    """

    def __init__(self, benchmark_questions):
        """
        benchmark_questions: list of strings -- the evaluation
        questions to protect against contamination
        """
        # Exact match index
        self.exact_hashes = set()
        for q in benchmark_questions:
            canonical = self._canonicalize(q)
            h = hashlib.sha256(canonical.encode()).hexdigest()
            self.exact_hashes.add(h)

        # N-gram index for fuzzy matching
        self.benchmark_ngrams = {}
        for i, q in enumerate(benchmark_questions):
            for ngram in self._extract_ngrams(q, n=8):
                if ngram not in self.benchmark_ngrams:
                    self.benchmark_ngrams[ngram] = set()
                self.benchmark_ngrams[ngram].add(i)

        self.benchmark_questions = benchmark_questions

    def _canonicalize(self, text):
        """Normalize text for comparison."""
        text = text.lower().strip()
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[^\w\s]', '', text)
        return text

    def _extract_ngrams(self, text, n=8):
        """Extract word-level n-grams."""
        words = self._canonicalize(text).split()
        return [
            " ".join(words[i:i+n])
            for i in range(len(words) - n + 1)
        ]

    def check_exact_match(self, text):
        """Check for exact match after canonicalization."""
        canonical = self._canonicalize(text)
        h = hashlib.sha256(canonical.encode()).hexdigest()
        return h in self.exact_hashes

    def check_ngram_overlap(self, text, threshold=0.5):
        """
        Check for n-gram overlap with benchmark questions.
        Returns (is_contaminated, matched_question_index, overlap_ratio)
        """
        text_ngrams = self._extract_ngrams(text, n=8)
        if not text_ngrams:
            return False, -1, 0.0

        # Count matches per benchmark question
        match_counts = Counter()
        for ngram in text_ngrams:
            if ngram in self.benchmark_ngrams:
                for idx in self.benchmark_ngrams[ngram]:
                    match_counts[idx] += 1

        if not match_counts:
            return False, -1, 0.0

        best_idx, best_count = match_counts.most_common(1)[0]
        best_question = self.benchmark_questions[best_idx]
        best_ngrams = len(self._extract_ngrams(best_question, n=8))

        overlap = best_count / max(best_ngrams, 1)

        return overlap >= threshold, best_idx, overlap

    def is_contaminated(self, text):
        """
        Check if text is contaminated with benchmark data.
        Returns True if any detection method flags the text.
        """
        # Fast exact match check first
        if self.check_exact_match(text):
            return True

        # Fuzzy n-gram overlap check
        contaminated, _, _ = self.check_ngram_overlap(text)
        return contaminated
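
To see what the n-gram check actually measures, here is a small standalone sketch. It is a simplification rather than the class above: it uses sets instead of an inverted index, and the helper names and sample texts are hypothetical.

```python
import re

def word_ngrams(text, n=8):
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(candidate, benchmark, n=8):
    """Fraction of the benchmark's n-grams that also occur in the candidate."""
    bench = word_ngrams(benchmark, n)
    if not bench:
        return 0.0  # benchmark shorter than n words: no n-grams to match
    return len(bench & word_ngrams(candidate, n)) / len(bench)

# Hypothetical benchmark item and two training instructions.
benchmark = ("A train leaves the station at 3 pm traveling at 60 miles "
             "per hour. How far has it gone by 5 pm?")
leaked = ("Solve this: a train leaves the station at 3 pm traveling at 60 "
          "miles per hour. How far has it gone by 5 pm? Show your work.")
unrelated = "Explain the difference between TCP and UDP in two short paragraphs."

leaked_ratio = overlap_ratio(leaked, benchmark)    # every benchmark 8-gram reappears
clean_ratio = overlap_ratio(unrelated, benchmark)  # no 8-gram in common
```

The leaked instruction would fail an exact-match check because of the added prefix and suffix, but the n-gram ratio still flags it. Note the blind spot: benchmark questions shorter than eight words produce no 8-grams, so only the exact-match path can catch them.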

Decontamination Pipeline

def decontaminate_dataset(
    sft_dataset,
    benchmark_sets,
):
    """
    Remove contaminated examples from SFT dataset.

    sft_dataset: list of instruction-response dicts
    benchmark_sets: dict mapping benchmark_name -> list of questions
    """
    # Build detector with all benchmark questions
    all_questions = []
    for name, questions in benchmark_sets.items():
        all_questions.extend(questions)

    detector = ContaminationDetector(all_questions)

    clean = []
    contaminated = []

    for item in sft_dataset:
        # Check both instruction and response for contamination
        instruction_dirty = detector.is_contaminated(
            item["instruction"]
        )
        response_dirty = detector.is_contaminated(
            item["response"]
        )

        if instruction_dirty or response_dirty:
            contaminated.append(item)
        else:
            clean.append(item)

    print(f"Total: {len(sft_dataset)}")
    print(f"Clean: {len(clean)}")
    # Guard against an empty dataset when computing the percentage
    pct = 100 * len(contaminated) / max(len(sft_dataset), 1)
    print(f"Contaminated: {len(contaminated)} ({pct:.1f}%)")

    return clean, contaminated

Decontamination Is Not Optional

Assume that any SFT dataset published on Hugging Face contains some degree of benchmark contamination: the MMLU test set, GSM8K questions, and HumanEval problems are widely distributed across the web. Running decontamination before training is not merely a best practice; it is a requirement for trustworthy evaluation. Models trained without decontamination may appear to outperform their actual capability by 2-10 points on contaminated benchmarks.

Chat Templates

The Format Problem

Different models use different chat template formats. The same conversation formatted for Llama 3 vs ChatML vs Vicuna produces different token sequences. The model must see a consistent format during training and inference.

CHAT_TEMPLATES = {
    "llama3": {
        "system_start": "<|start_header_id|>system<|end_header_id|>\n\n",
        "system_end": "<|eot_id|>",
        "user_start": "<|start_header_id|>user<|end_header_id|>\n\n",
        "user_end": "<|eot_id|>",
        "assistant_start": "<|start_header_id|>assistant<|end_header_id|>\n\n",
        "assistant_end": "<|eot_id|>",
        "bos": "<|begin_of_text|>",
    },
    "chatml": {
        "system_start": "<|im_start|>system\n",
        "system_end": "<|im_end|>\n",
        "user_start": "<|im_start|>user\n",
        "user_end": "<|im_end|>\n",
        "assistant_start": "<|im_start|>assistant\n",
        "assistant_end": "<|im_end|>\n",
        "bos": "",
    },
}

def format_conversation(conversation, template_name):
    """
    Format a conversation using a specific chat template.

    conversation: list of {"role": str, "content": str}
    template_name: "llama3" or "chatml"
    """
    template = CHAT_TEMPLATES[template_name]
    parts = [template["bos"]]

    for turn in conversation:
        role = turn["role"]
        content = turn["content"]

        if role == "system":
            parts.append(template["system_start"])
            parts.append(content)
            parts.append(template["system_end"])
        elif role == "user":
            parts.append(template["user_start"])
            parts.append(content)
            parts.append(template["user_end"])
        elif role == "assistant":
            parts.append(template["assistant_start"])
            parts.append(content)
            parts.append(template["assistant_end"])

    return "".join(parts)
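
Consistency between training and inference has one practical consequence: at inference time the prompt should end with the opening assistant header, so the model continues exactly where its training targets began. The sketch below illustrates this for ChatML only; the function `render_chatml` and its `add_generation_prompt` flag are names assumed for this example.

```python
# Minimal ChatML markers, copied from the template table for this sketch.
IM_START, IM_END = "<|im_start|>", "<|im_end|>\n"

def render_chatml(conversation, add_generation_prompt=False):
    """Render turns as ChatML text.

    add_generation_prompt (a name assumed for this sketch) appends the
    opening assistant header so the model continues with its reply.
    """
    parts = [
        f"{IM_START}{turn['role']}\n{turn['content']}{IM_END}"
        for turn in conversation
    ]
    if add_generation_prompt:
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]
prompt = render_chatml(conversation, add_generation_prompt=True)
print(prompt)
```

The printed prompt ends with `<|im_start|>assistant` followed by a newline. If that header were omitted, or formatted differently from training, the model would be asked to continue a token sequence it never saw during SFT.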

Multi-Turn Handling

Multi-turn conversations require careful handling of context accumulation:

def prepare_multiturn_for_training(conversation, tokenizer, template_name):
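    """
    Prepare a multi-turn conversation for SFT training.
    Returns token_ids and loss_mask.

    Assumes tokenizer.encode(text) returns a list of token IDs and does not
    add special tokens (such as BOS) on its own; otherwise every call below
    would inject extra tokens into the sequence.
    """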
    template = CHAT_TEMPLATES[template_name]
    all_token_ids = []
    loss_mask = []

    # BOS token
    bos_tokens = tokenizer.encode(template["bos"])
    all_token_ids.extend(bos_tokens)
    loss_mask.extend([0] * len(bos_tokens))

    for turn in conversation:
        role = turn["role"]
        content = turn["content"]

        # Format this turn
        if role == "system":
            prefix = template["system_start"]
            suffix = template["system_end"]
        elif role == "user":
            prefix = template["user_start"]
            suffix = template["user_end"]
        elif role == "assistant":
            prefix = template["assistant_start"]
            suffix = template["assistant_end"]
        else:
            continue

        # Tokenize prefix (no loss)
        prefix_tokens = tokenizer.encode(prefix)
        all_token_ids.extend(prefix_tokens)
        loss_mask.extend([0] * len(prefix_tokens))

        # Tokenize content
        content_tokens = tokenizer.encode(content)
        all_token_ids.extend(content_tokens)
        # Only compute loss on assistant content
        if role == "assistant":
            loss_mask.extend([1] * len(content_tokens))
        else:
            loss_mask.extend([0] * len(content_tokens))

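        # Tokenize suffix. For assistant turns the end-of-turn marker is
        # supervised too, so the model learns when to stop generating.
        suffix_tokens = tokenizer.encode(suffix)
        all_token_ids.extend(suffix_tokens)
        loss_mask.extend(
            [1 if role == "assistant" else 0] * len(suffix_tokens)
        )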

    return all_token_ids, loss_mask
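
A toy run makes the multi-turn mask easier to inspect. The sketch below is a simplified stand-in, not the function above: it assumes a whitespace "tokenizer" and a single role marker per turn, both invented for illustration.

```python
def toy_encode(text):
    """Stand-in tokenizer for this sketch: one token per whitespace-separated word."""
    return text.split()

def mask_for_turns(conversation):
    """Return (tokens, mask) with mask == 1 only on assistant content."""
    tokens, mask = [], []
    for turn in conversation:
        header = [f"<{turn['role']}>"]  # role marker, never supervised
        content = toy_encode(turn["content"])
        supervised = 1 if turn["role"] == "assistant" else 0
        tokens += header + content
        mask += [0] * len(header) + [supervised] * len(content)
    return tokens, mask

conversation = [
    {"role": "user", "content": "What is TCP?"},
    {"role": "assistant", "content": "A reliable transport protocol."},
    {"role": "user", "content": "And UDP?"},
    {"role": "assistant", "content": "An unreliable one."},
]
tokens, mask = mask_for_turns(conversation)
for token, flag in zip(tokens, mask):
    print(flag, token)
```

Both assistant turns are supervised in a single pass over one sequence, and the second answer is conditioned on the whole earlier exchange. This is why a multi-turn conversation does not need to be split into one training example per turn.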

Category Distribution

Task Diversity

A good SFT dataset covers a wide range of task categories. Over-representation of any single category biases the model.

Ideal SFT Task Distribution (Approximate Targets)

| Category | Share of SFT dataset | Covers |
|---|---|---|
| Reasoning/Analysis | 20% | multi-step logic |
| Coding | 18% | generation + debugging |
| Creative Writing | 12% | stories, poems, essays |
| Math | 12% | algebra through calculus |
| Factual QA | 10% | knowledge questions |
| Summarization | 8% | compress documents |
| Instruction Following | 8% | format constraints |
| Conversation | 7% | multi-turn chat |
| Safety/Refusal | 5% | harmful request handling |

Measuring Distribution

import math
from collections import Counter

def analyze_category_distribution(dataset, classifier_fn):
    """
    Analyze the task category distribution of an SFT dataset.

    classifier_fn: callable(instruction) -> category string
    """
    categories = Counter()

    for item in dataset:
        category = classifier_fn(item["instruction"])
        categories[category] += 1

    total = sum(categories.values())
    distribution = {
        cat: count / total
        for cat, count in categories.most_common()
    }

    # Compute entropy as diversity measure
    # Higher entropy = more diverse distribution
    entropy = -sum(
        p * math.log2(p) for p in distribution.values() if p > 0
    )
    max_entropy = math.log2(len(distribution))
    normalized_entropy = entropy / max_entropy if max_entropy > 0 else 0

    return {
        "distribution": distribution,
        "entropy": entropy,
        "normalized_entropy": normalized_entropy,  # 0-1 scale
        "num_categories": len(distribution),
    }
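
As a reference point for reading `normalized_entropy`, the sketch below applies the same formula to the approximate target shares listed earlier and to a deliberately skewed mix. The category keys and the skewed distribution are assumptions for illustration.

```python
import math

# Approximate target shares from the task distribution above.
target = {
    "reasoning": 0.20, "coding": 0.18, "creative_writing": 0.12,
    "math": 0.12, "factual_qa": 0.10, "summarization": 0.08,
    "instruction_following": 0.08, "conversation": 0.07, "safety": 0.05,
}

def normalized_entropy(distribution):
    """Shannon entropy in bits divided by log2(number of categories)."""
    probs = [p for p in distribution.values() if p > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(probs))

balanced = normalized_entropy(target)  # about 0.96
# A hypothetical coding-heavy dataset for contrast.
skewed = normalized_entropy({"coding": 0.85, "math": 0.10, "factual_qa": 0.05})  # about 0.47
```

Normalized entropy is relative to the number of categories present, so a dataset with only three evenly split categories would also score near 1.0. It is worth reading the value alongside `num_categories` rather than on its own.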

Response Length Distribution

The Length Problem

SFT datasets often bias toward either very short responses (Alpaca) or very long responses (ShareGPT). The ideal distribution matches real usage: most queries deserve 100-500 word responses, some require brief answers (under 50 words), and a few require detailed explanations (over 1000 words).

Response Length Distribution by Source

| Source | Median Words | p10 Words | p90 Words | Skew |
|---|---|---|---|---|
| Alpaca | 68 | 22 | 180 | Short-biased |
| ShareGPT | 312 | 45 | 1250 | Long-biased |
| OpenAssistant | 185 | 30 | 620 | Moderate |
| Dolly | 95 | 18 | 320 | Short-biased |
| Ideal target | 150 | 25 | 800 | Balanced |

Models trained on short-response data learn to give terse, incomplete answers. Models trained on long-response data learn to pad responses with unnecessary preamble and repetition. The target: match the response length to the complexity of the instruction.

def length_quality_score(instruction, response):
    """
    Score whether the response length is appropriate for the
    instruction complexity.
    """
    instruction_words = len(instruction.split())
    response_words = len(response.split())

    # Expected response length based on instruction complexity
    if instruction_words < 10:
        # Simple question -- response should be concise
        expected_range = (20, 200)
    elif instruction_words < 30:
        # Moderate question
        expected_range = (50, 500)
    elif instruction_words < 60:
        # Complex question
        expected_range = (100, 1000)
    else:
        # Very complex / multi-part question
        expected_range = (200, 2000)

    min_expected, max_expected = expected_range

    if response_words < min_expected:
        # Too short
        return response_words / min_expected
    elif response_words > max_expected:
        # Too long -- mild penalty
        return max(0.5, max_expected / response_words)
    else:
        # In range
        return 1.0
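
To compare your own dataset against the per-source length statistics, the three summary numbers can be computed with the standard library. This is a minimal sketch; the function name `length_profile` and the synthetic responses are assumptions for illustration.

```python
import statistics

def length_profile(responses):
    """Return (p10, median, p90) of response lengths in words."""
    lengths = [len(r.split()) for r in responses]
    if len(lengths) < 2:
        raise ValueError("need at least two responses")
    deciles = statistics.quantiles(lengths, n=10)  # 9 cut points: p10 .. p90
    return deciles[0], statistics.median(lengths), deciles[-1]

# Synthetic responses of 10, 20, ..., 200 words, purely for illustration.
responses = [" ".join(["word"] * n) for n in range(10, 201, 10)]
p10, median, p90 = length_profile(responses)
```

Whitespace word counts are only an approximation of token counts, but they are enough to spot a dataset whose median sits far from the target range.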

Safety Filtering

Handling Harmful Instructions

SFT data must include examples of the model refusing harmful requests. But it must also avoid teaching the model to refuse benign requests. The balance:

  • 3-5% of SFT data should be "refusal" examples where the model declines harmful requests with brief, specific explanations
  • These should cover categories: violence, illegal activity, privacy violations, manipulation
  • Refusals should be concise, not long lectures about AI safety

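
The 3-5% target is a share of the final dataset, so the number of refusal examples to add depends on how many other examples there are. A small sketch of that arithmetic follows; the function name and the 48,000-example figure are hypothetical.

```python
import math

def refusals_needed(n_other, target_fraction):
    """Refusal examples to add so they form target_fraction of the final set.

    Solves r / (n_other + r) = target_fraction for r, rounded up.
    """
    if not 0 <= target_fraction < 1:
        raise ValueError("target_fraction must be in [0, 1)")
    return math.ceil(target_fraction * n_other / (1 - target_fraction))

n_other = 48_000  # hypothetical count of non-refusal examples
low = refusals_needed(n_other, 0.03)   # 1485 examples for a 3% share
high = refusals_needed(n_other, 0.05)  # 2527 examples for a 5% share
```
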
SAFETY_CATEGORIES = [
    "violence", "illegal_activity", "privacy_violation",
    "manipulation", "self_harm", "csam", "weapons",
    "fraud", "harassment",
]

def create_refusal_example(harmful_instruction, category):
    """
    Create a training example where the model appropriately
    refuses a harmful instruction.
    """
    # Short, specific refusal -- not a lecture
    refusal_templates = {
        "violence": "I cannot provide instructions for causing physical harm to people.",
        "illegal_activity": "I cannot help with illegal activities.",
        "privacy_violation": "I cannot help with accessing private information without authorization.",
        "manipulation": "I cannot help with manipulating or deceiving people.",
        "self_harm": "I cannot provide information that could facilitate self-harm. If you are struggling, please contact a crisis helpline.",
    }

    refusal = refusal_templates.get(
        category,
        "I cannot assist with this request.",
    )

    return {
        "conversations": [
            {"role": "user", "content": harmful_instruction},
            {"role": "assistant", "content": refusal},
        ],
        "category": "safety_refusal",
        "safety_category": category,
    }

Over-Refusal

A common failure mode: the model refuses benign requests because its safety training is too aggressive. "How do I kill a process in Linux?" should not trigger a refusal. The SFT safety data should include borderline examples with correct non-refusal responses to prevent over-refusal.
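
One way to audit for this, sketched under stated assumptions: collect benign instructions that contain alarming-sounding words and measure how often their responses open with a refusal. The prefix list `REFUSAL_PREFIXES` and the sample data are invented for illustration, and a prefix match is only a rough heuristic.

```python
# Assumed list of refusal openers; extend it for your own data.
REFUSAL_PREFIXES = ("i cannot", "i can't", "i'm sorry", "i am unable")

def refusal_rate(examples):
    """Fraction of examples whose response opens with a refusal phrase."""
    if not examples:
        return 0.0
    refused = sum(
        1 for ex in examples
        if ex["response"].strip().lower().startswith(REFUSAL_PREFIXES)
    )
    return refused / len(examples)

# Two candidate responses to the same benign, borderline-sounding instruction.
borderline = [
    {"instruction": "How do I kill a process in Linux?",
     "response": "Run `kill <PID>`; if the process ignores SIGTERM, use `kill -9 <PID>`."},
    {"instruction": "How do I kill a process in Linux?",
     "response": "I cannot help with requests involving harm."},
]
rate = refusal_rate(borderline)  # one of the two responses is a refusal
```

On a set of benign borderline instructions the rate should be close to zero; any refusals found there are candidates for removal or rewriting before training.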

Putting It All Together

The Complete SFT Data Pipeline

def build_sft_dataset(
    raw_sources,
    benchmark_questions,
    target_size=50000,
    quality_threshold=0.55,
):
    """
    End-to-end SFT dataset construction pipeline.
    """
    scorer = InstructionQualityScorer()

    # Step 1: Combine all sources
    all_examples = []
    for source_name, examples in raw_sources.items():
        for ex in examples:
            ex["source"] = source_name
            all_examples.append(ex)

    print(f"Raw examples: {len(all_examples)}")

    # Step 2: Decontaminate
    detector = ContaminationDetector(benchmark_questions)
    clean_examples = [
        ex for ex in all_examples
        if not detector.is_contaminated(ex["instruction"])
        and not detector.is_contaminated(ex["response"])
    ]
    print(f"After decontamination: {len(clean_examples)}")

    # Step 3: Quality scoring
    scored = []
    existing_instructions = []
    for ex in clean_examples:
        score = scorer.score(
            ex["instruction"],
            ex["response"],
            existing_instructions,
        )
        if score.overall >= quality_threshold:
            ex["quality_score"] = score.overall
            scored.append(ex)
            existing_instructions.append(ex["instruction"])

    print(f"After quality filter: {len(scored)}")

    # Step 4: Sort by quality and select top N
    scored.sort(key=lambda x: x["quality_score"], reverse=True)
    selected = scored[:target_size]

    print(f"Final dataset: {len(selected)} examples")
    if selected:  # avoid dividing by zero when nothing passes the filters
        mean_quality = sum(e["quality_score"] for e in selected) / len(selected)
        print(f"Mean quality: {mean_quality:.3f}")

    return selected

The Key Insight

Instruction tuning data quality is not about sophisticated filtering algorithms. It is about two simple principles: (1) the response must completely and specifically address the instruction, and (2) the dataset must cover a diverse range of tasks. A 10K dataset with these properties outperforms a 500K dataset without them. The quality scorer, decontamination pipeline, and category analysis exist to enforce these principles at scale.