Llama 2 Base completes "What is 2+2?" with "What is 3+3? What is 4+4?", continuing the pattern instead of answering. After instruction tuning on 27,540 examples, Llama 2 Chat responds with "4." The transformation costs about $2,754 for the data (27,540 examples at $0.10/example) and 8 GPU-hours of fine-tuning. Instruction tuning is the cheapest, highest-ROI stage of LLM development: roughly 30K examples turn a next-token predictor into a usable assistant, and the quality ceiling is determined entirely by dataset diversity and correctness.
Anatomy of an Instruction-Response Pair
The Format
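Every SFT training example is a conversation with roles. The example below has a single user turn; multi-turn examples append further user and assistant turns to the same list: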
{
  "conversations": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Explain the difference between TCP and UDP."
    },
    {
      "role": "assistant",
      "content": "TCP (Transmission Control Protocol) provides reliable, ordered delivery of data. It establishes a connection via a three-way handshake, ensures packets arrive in order, retransmits lost packets, and implements flow control. UDP (User Datagram Protocol) provides unreliable, unordered delivery. It sends packets without establishing a connection and does not retransmit lost packets. TCP is used for web browsing, email, and file transfer where data integrity matters. UDP is used for video streaming, gaming, and DNS lookups where low latency matters more than guaranteed delivery."
    }
  ]
}
The model is trained to predict only the assistant tokens. System and user tokens appear in the input but their loss is masked: the model sees them but is not penalized for failing to predict them.
Loss Masking
The training loss is computed as:
$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \mathcal{A}} \log p_\theta(x_t \mid x_{<t})$$
where $\mathcal{A}$ is the set of token positions that correspond to assistant responses. User and system tokens are part of the conditioning context $x_{<t}$ but not part of the loss summation. This is critical: without loss masking, the model learns to "roleplay" as the user, generating questions instead of answers.
def create_loss_mask(token_ids, role_boundaries):
    """
    Create a loss mask that only includes assistant tokens.
    token_ids: list of token IDs for the full conversation
    role_boundaries: list of (start_idx, end_idx, role) tuples
    """
    mask = [0] * len(token_ids)
    for start, end, role in role_boundaries:
        if role == "assistant":
            for i in range(start, min(end, len(mask))):
                mask[i] = 1
    return mask

# Example: conversation with 100 tokens total
# System: tokens 0-15
# User: tokens 16-40
# Assistant: tokens 41-99
mask = create_loss_mask(
    token_ids=list(range(100)),
    role_boundaries=[
        (0, 16, "system"),
        (16, 41, "user"),
        (41, 100, "assistant"),
    ],
)
# mask = [0]*41 + [1]*59
# Only the 59 assistant tokens contribute to training loss
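To make the connection between the mask and the loss concrete, here is a minimal pure-Python sketch of a masked negative log-likelihood. The function name `masked_nll` and the per-token log-probabilities are hypothetical and for illustration only; a real training loop would take log-probabilities from the model and typically compute this on tensors.

```python
import math

def masked_nll(token_log_probs, loss_mask):
    """Return (summed, mean) negative log-likelihood over positions where mask == 1."""
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("log-probs and mask must have the same length")
    # Only positions with mask == 1 (assistant tokens) enter the sum.
    total = -sum(lp for lp, m in zip(token_log_probs, loss_mask) if m == 1)
    count = sum(loss_mask)
    mean = total / count if count else 0.0
    return total, mean

# Hypothetical log-probabilities the model assigned to each target token.
log_probs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.125, 0.5)]
mask = [0, 0, 1, 1, 1]  # the first two tokens are prompt tokens, the rest assistant tokens
total, mean = masked_nll(log_probs, mask)
```

The two prompt positions are ignored no matter how poorly the model predicts them; only the three assistant positions determine the loss.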
Data Sources
ShareGPT
ShareGPT is a collection of real user conversations with ChatGPT, scraped from the sharegpt.com website where users voluntarily shared their interactions. Characteristics:
- Volume: ~90K conversations, ~700K turns
- Strengths: Real user queries reflect actual usage patterns. Multi-turn conversations teach the model to maintain context. Covers a wide distribution of tasks (coding, writing, analysis, roleplay, math).
- Weaknesses: Responses are from a specific model version (GPT-3.5/GPT-4), creating a ceiling on response quality. Contains some unsafe content. No systematic quality control.
- License: Gray area. User-submitted data without clear redistribution rights.
OpenAssistant
A crowdsourced dataset where human annotators wrote both instructions and responses, with quality ratings from other annotators.
- Volume: ~160K messages across ~66K conversation trees
- Strengths: Human-written responses (not model-generated). Quality ratings per message. Multi-turn trees with branching (multiple response alternatives). Explicit safety annotations.
- Weaknesses: Annotator pool was heavily English-speaking and tech-oriented. Response quality varies widely. Smaller effective size after filtering to high-quality subsets.
- License: Apache 2.0 (fully permissive).
Dolly (Databricks)
Corporate effort: Databricks employees wrote 15K instruction-response pairs.
- Volume: 15K pairs
- Strengths: Clean license (cc-by-sa-3.0). Consistent formatting. Covers standard NLP tasks (QA, summarization, classification, extraction).
- Weaknesses: Small scale. Limited task diversity. Corporate writing style may not match user expectations.
Alpaca (Stanford)
Synthetic dataset: 52K instructions generated by GPT-3.5 from 175 seed tasks, with responses also generated by GPT-3.5.
- Volume: 52K pairs
- Strengths: Cheap to produce. Consistent format. Easy to filter and clean.
- Weaknesses: Entirely synthetic, with model-generated instructions and responses. Lower diversity than human data. Quality ceiling bounded by the teacher model.
Instruction Data Source Comparison
| Source | Size | Multi-Turn | Human Written | License | Quality Variance |
|---|---|---|---|---|---|
| ShareGPT | ~90K convos | Yes | Instructions only | Unclear | High |
| OpenAssistant | ~66K trees | Yes | Both sides | Apache 2.0 | Very High |
| Dolly | 15K pairs | No | Both sides | CC-BY-SA | Low |
| Alpaca | 52K pairs | No | Neither | Research only | Medium |
| WildChat | 1M convos | Yes | Instructions only | ODC-BY | Very High |
Quality Metrics for Instruction Data
Why Quality Matters More Than Quantity
The LIMA paper (Zhou et al., 2023) demonstrated that fine-tuning Llama-65B on only 1,000 carefully curated examples produced a model competitive with models trained on 50K+ examples. The conclusion: instruction data quality dominates quantity. The 99th percentile of a 50K dataset teaches more than the median of a 500K dataset.
[Figure: Model Quality vs SFT Dataset Size (Holding Quality Fixed); y-axis: % win rate vs GPT-3.5]
The curve peaks around 10K-50K curated examples and then declines as noisy data is added. This is because low-quality instruction pairs teach the model bad habits: short, lazy responses; incorrect information stated confidently; responses that ignore parts of the instruction.
The Quality Scorer
A complete implementation that scores instruction-response pairs on multiple dimensions:
import re
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class QualityScore:
    instruction_complexity: float  # 0-1
    response_completeness: float   # 0-1
    response_specificity: float    # 0-1
    format_quality: float          # 0-1
    diversity_signal: float        # 0-1
    overall: float                 # Weighted combination

class InstructionQualityScorer:
    """
    Score instruction-response pairs on multiple quality dimensions.
    """
    def __init__(self):
        self.seen_instructions = []
        self.instruction_embeddings = []

    def score_instruction_complexity(self, instruction):
        """
        Higher complexity = better training signal.
        Simple factoid questions teach less than multi-step reasoning tasks.
        """
        score = 0.0
        words = instruction.split()
        word_count = len(words)
        # Length component: very short instructions are low quality
        if word_count < 5:
            score += 0.1
        elif word_count < 15:
            score += 0.3
        elif word_count < 50:
            score += 0.6
        else:
            score += 0.8
        # Multi-step detection: instructions that require multiple
        # actions score higher
        step_indicators = [
            "and then", "after that", "next", "first", "second",
            "finally", "also", "additionally", "step",
        ]
        step_count = sum(
            1 for indicator in step_indicators
            if indicator in instruction.lower()
        )
        score += min(step_count * 0.05, 0.15)
        # Constraint detection: instructions with constraints
        # (format, length, style) are more complex
        constraint_words = [
            "format", "exactly", "must", "should not", "avoid",
            "only", "between", "at most", "at least", "without",
        ]
        constraint_count = sum(
            1 for w in constraint_words
            if w in instruction.lower()
        )
        score += min(constraint_count * 0.03, 0.1)
        return min(score, 1.0)

    def score_response_completeness(self, instruction, response):
        """
        Does the response address all parts of the instruction?
        """
        # Extract question marks -- each represents a sub-question
        questions = instruction.count("?")
        if questions == 0:
            questions = 1  # Imperative instruction
        # Simple heuristic: response should have substantial content
        response_words = len(response.split())
        # Minimum viable response length
        if response_words < 20:
            return 0.2
        # Length relative to instruction complexity
        ratio = response_words / max(len(instruction.split()), 1)
        if ratio < 1.0:
            length_score = 0.3
        elif ratio < 3.0:
            length_score = 0.5
        elif ratio < 10.0:
            length_score = 0.8
        else:
            length_score = 0.7  # Penalty for excessively long responses
        # Check for response structure (paragraphs, lists, code blocks)
        has_structure = (
            response.count("\n\n") >= 1
            or response.count("- ") >= 2
            or response.count("```") >= 2
            or response.count("1.") >= 1
        )
        structure_bonus = 0.1 if has_structure else 0.0
        return min(length_score + structure_bonus, 1.0)

    def score_response_specificity(self, response):
        """
        Penalize vague, hedge-filled responses.
        Reward concrete, specific information.
        """
        words = response.lower().split()
        word_count = len(words)
        if word_count == 0:
            return 0.0
        # Vague/hedge phrases that indicate low-quality responses
        hedge_phrases = [
            "it depends", "there are many", "in general",
            "it is important to note", "as an ai",
            "i cannot", "i'm not sure", "it varies",
            "there are several", "various factors",
        ]
        hedge_count = sum(
            1 for phrase in hedge_phrases
            if phrase in response.lower()
        )
        # Specific content indicators
        has_numbers = bool(re.search(r'\d+\.?\d*', response))
        has_code = "```" in response
        has_examples = "example" in response.lower() or "e.g." in response.lower()
        has_citations = bool(
            re.search(r'\([A-Z][a-z]+ et al', response)
        )
        specificity = 0.5
        specificity -= hedge_count * 0.08
        specificity += 0.1 if has_numbers else 0
        specificity += 0.15 if has_code else 0
        specificity += 0.1 if has_examples else 0
        specificity += 0.1 if has_citations else 0
        return max(0.0, min(specificity, 1.0))

    def score_format_quality(self, response):
        """
        Check formatting: markdown structure, code block validity,
        consistent list formatting.
        """
        score = 0.5
        # Code block matching
        backtick_count = response.count("```")
        if backtick_count % 2 != 0:
            score -= 0.2  # Unmatched code blocks
        # Consistent list formatting
        lines = response.split("\n")
        list_styles = set()
        for line in lines:
            stripped = line.strip()
            if stripped.startswith("- "):
                list_styles.add("dash")
            elif stripped.startswith("* "):
                list_styles.add("star")
            elif re.match(r'^\d+\. ', stripped):
                list_styles.add("numbered")
        if len(list_styles) > 1:
            score -= 0.1  # Mixed list styles
        # Paragraph structure
        paragraphs = [p.strip() for p in response.split("\n\n") if p.strip()]
        if len(paragraphs) >= 2:
            score += 0.2  # Well-structured response
        if len(paragraphs) >= 4:
            score += 0.1  # Detailed response
        # Headers (markdown)
        header_count = len(re.findall(r'^#{1,4} ', response, re.MULTILINE))
        if header_count >= 2:
            score += 0.1
        return max(0.0, min(score, 1.0))

    def score_diversity(self, instruction, existing_instructions):
        """
        Penalize instructions that are near-duplicates of existing ones.
        Uses simple word overlap as a proxy.
        """
        if not existing_instructions:
            return 1.0
        instruction_words = set(instruction.lower().split())
        max_overlap = 0.0
        for existing in existing_instructions[-1000:]:
            existing_words = set(existing.lower().split())
            if not instruction_words or not existing_words:
                continue
            overlap = len(
                instruction_words & existing_words
            ) / len(instruction_words | existing_words)
            max_overlap = max(max_overlap, overlap)
        # High overlap with existing = low diversity
        return 1.0 - max_overlap

    def score(self, instruction, response, existing_instructions=None):
        """
        Compute overall quality score for an instruction-response pair.
        """
        complexity = self.score_instruction_complexity(instruction)
        completeness = self.score_response_completeness(
            instruction, response
        )
        specificity = self.score_response_specificity(response)
        formatting = self.score_format_quality(response)
        diversity = self.score_diversity(
            instruction, existing_instructions or []
        )
        # Weighted combination
        overall = (
            0.20 * complexity
            + 0.25 * completeness
            + 0.25 * specificity
            + 0.15 * formatting
            + 0.15 * diversity
        )
        return QualityScore(
            instruction_complexity=complexity,
            response_completeness=completeness,
            response_specificity=specificity,
            format_quality=formatting,
            diversity_signal=diversity,
            overall=overall,
        )
Using the Scorer to Filter Data
def filter_instruction_dataset(
    dataset,
    scorer,
    quality_threshold=0.55,
    max_examples=50000,
):
    """
    Filter an instruction dataset to keep only high-quality pairs.
    dataset: list of dicts with 'instruction' and 'response' keys
    scorer: InstructionQualityScorer instance
    quality_threshold: minimum overall score to keep
    max_examples: maximum number of examples to keep
    """
    scored = []
    existing_instructions = []
    for item in dataset:
        instruction = item["instruction"]
        response = item["response"]
        score = scorer.score(
            instruction, response, existing_instructions
        )
        if score.overall >= quality_threshold:
            scored.append({
                "instruction": instruction,
                "response": response,
                "quality_score": score.overall,
                "scores": {
                    "complexity": score.instruction_complexity,
                    "completeness": score.response_completeness,
                    "specificity": score.response_specificity,
                    "format": score.format_quality,
                    "diversity": score.diversity_signal,
                },
            })
            # Diversity is measured against the instructions kept so far
            existing_instructions.append(instruction)
    # Sort by quality and take top N
    scored.sort(key=lambda x: x["quality_score"], reverse=True)
    return scored[:max_examples]

# Example usage
scorer = InstructionQualityScorer()
# Score a single example
result = scorer.score(
    instruction="Explain how TCP congestion control works, including slow start, congestion avoidance, and fast retransmit. Include a concrete example with window sizes.",
    response="TCP congestion control manages network throughput through three phases...",
)
# With a complete response, result.overall might be around 0.72 -- good quality.
# The truncated placeholder response above scores lower because it has
# fewer than 20 words.
Decontamination
The Problem
Evaluation benchmarks (MMLU, HumanEval, GSM8K, HellaSwag) appear on the web. If these benchmark questions or their answers appear in the SFT training data, the model memorizes them, and evaluation scores become meaningless. This is not hypothetical: multiple high-profile models have been caught with contaminated evaluation scores.
Detection Methods
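The detector below implements two detection methods, in order of increasing sophistication: exact matching after canonicalization, then fuzzy matching on word-level 8-gram overlap.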
import hashlib
import re
from collections import Counter

class ContaminationDetector:
    """
    Detect and remove evaluation benchmark questions
    from instruction training data.
    """
    def __init__(self, benchmark_questions):
        """
        benchmark_questions: list of strings -- the evaluation
        questions to protect against contamination
        """
        # Exact match index
        self.exact_hashes = set()
        for q in benchmark_questions:
            canonical = self._canonicalize(q)
            h = hashlib.sha256(canonical.encode()).hexdigest()
            self.exact_hashes.add(h)
        # N-gram index for fuzzy matching
        self.benchmark_ngrams = {}
        for i, q in enumerate(benchmark_questions):
            for ngram in self._extract_ngrams(q, n=8):
                if ngram not in self.benchmark_ngrams:
                    self.benchmark_ngrams[ngram] = set()
                self.benchmark_ngrams[ngram].add(i)
        self.benchmark_questions = benchmark_questions

    def _canonicalize(self, text):
        """Normalize text for comparison."""
        text = text.lower().strip()
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[^\w\s]', '', text)
        return text

    def _extract_ngrams(self, text, n=8):
        """Extract word-level n-grams."""
        words = self._canonicalize(text).split()
        return [
            " ".join(words[i:i+n])
            for i in range(len(words) - n + 1)
        ]

    def check_exact_match(self, text):
        """Check for exact match after canonicalization."""
        canonical = self._canonicalize(text)
        h = hashlib.sha256(canonical.encode()).hexdigest()
        return h in self.exact_hashes

    def check_ngram_overlap(self, text, threshold=0.5):
        """
        Check for n-gram overlap with benchmark questions.
        Returns (is_contaminated, matched_question_index, overlap_ratio)
        """
        text_ngrams = self._extract_ngrams(text, n=8)
        if not text_ngrams:
            return False, -1, 0.0
        # Count matches per benchmark question
        match_counts = Counter()
        for ngram in text_ngrams:
            if ngram in self.benchmark_ngrams:
                for idx in self.benchmark_ngrams[ngram]:
                    match_counts[idx] += 1
        if not match_counts:
            return False, -1, 0.0
        best_idx, best_count = match_counts.most_common(1)[0]
        best_question = self.benchmark_questions[best_idx]
        best_ngrams = len(self._extract_ngrams(best_question, n=8))
        overlap = best_count / max(best_ngrams, 1)
        return overlap >= threshold, best_idx, overlap

    def is_contaminated(self, text):
        """
        Check if text is contaminated with benchmark data.
        Returns True if any detection method flags the text.
        """
        # Fast exact match check first
        if self.check_exact_match(text):
            return True
        # Fuzzy n-gram overlap check
        contaminated, _, _ = self.check_ngram_overlap(text)
        return contaminated
Decontamination Pipeline
def decontaminate_dataset(
    sft_dataset,
    benchmark_sets,
):
    """
    Remove contaminated examples from SFT dataset.
    sft_dataset: list of instruction-response dicts
    benchmark_sets: dict mapping benchmark_name -> list of questions
    """
    # Build detector with all benchmark questions
    all_questions = []
    for name, questions in benchmark_sets.items():
        all_questions.extend(questions)
    detector = ContaminationDetector(all_questions)
    clean = []
    contaminated = []
    for item in sft_dataset:
        # Check both instruction and response for contamination
        instruction_dirty = detector.is_contaminated(
            item["instruction"]
        )
        response_dirty = detector.is_contaminated(
            item["response"]
        )
        if instruction_dirty or response_dirty:
            contaminated.append(item)
        else:
            clean.append(item)
    # Guard against division by zero on an empty dataset
    pct = 100 * len(contaminated) / max(len(sft_dataset), 1)
    print(f"Total: {len(sft_dataset)}")
    print(f"Clean: {len(clean)}")
    print(f"Contaminated: {len(contaminated)} ({pct:.1f}%)")
    return clean, contaminated
Assume that any public SFT dataset, including those published on Hugging Face, contains some degree of benchmark contamination. The MMLU test set, GSM8K questions, and HumanEval problems are widely distributed across the web. Running decontamination before training is not merely a best practice; it is a requirement for trustworthy evaluation. Models trained without decontamination may appear to outperform their actual capability by 2-10 points on contaminated benchmarks.
Chat Templates
The Format Problem
Different models use different chat template formats. The same conversation formatted for Llama 3 vs ChatML vs Vicuna produces different token sequences. The model must see a consistent format during training and inference.
CHAT_TEMPLATES = {
    "llama3": {
        "system_start": "<|start_header_id|>system<|end_header_id|>\n\n",
        "system_end": "<|eot_id|>",
        "user_start": "<|start_header_id|>user<|end_header_id|>\n\n",
        "user_end": "<|eot_id|>",
        "assistant_start": "<|start_header_id|>assistant<|end_header_id|>\n\n",
        "assistant_end": "<|eot_id|>",
        "bos": "<|begin_of_text|>",
    },
    "chatml": {
        "system_start": "<|im_start|>system\n",
        "system_end": "<|im_end|>\n",
        "user_start": "<|im_start|>user\n",
        "user_end": "<|im_end|>\n",
        "assistant_start": "<|im_start|>assistant\n",
        "assistant_end": "<|im_end|>\n",
        "bos": "",
    },
}

def format_conversation(conversation, template_name):
    """
    Format a conversation using a specific chat template.
    conversation: list of {"role": str, "content": str}
    template_name: "llama3" or "chatml"
    """
    template = CHAT_TEMPLATES[template_name]
    parts = [template["bos"]]
    for turn in conversation:
        role = turn["role"]
        content = turn["content"]
        if role == "system":
            parts.append(template["system_start"])
            parts.append(content)
            parts.append(template["system_end"])
        elif role == "user":
            parts.append(template["user_start"])
            parts.append(content)
            parts.append(template["user_end"])
        elif role == "assistant":
            parts.append(template["assistant_start"])
            parts.append(content)
            parts.append(template["assistant_end"])
    return "".join(parts)
Multi-Turn Handling
Multi-turn conversations require careful handling of context accumulation:
def prepare_multiturn_for_training(conversation, tokenizer, template_name):
    """
    Prepare a multi-turn conversation for SFT training.
    Returns token_ids and loss_mask.

    Assumes tokenizer.encode(text) returns only the IDs for `text` and
    does not add BOS/EOS on its own (with Hugging Face tokenizers,
    pass add_special_tokens=False).
    """
    template = CHAT_TEMPLATES[template_name]
    all_token_ids = []
    loss_mask = []
    # BOS token (some templates, e.g. ChatML, have none)
    if template["bos"]:
        bos_tokens = tokenizer.encode(template["bos"])
        all_token_ids.extend(bos_tokens)
        loss_mask.extend([0] * len(bos_tokens))
    for turn in conversation:
        role = turn["role"]
        content = turn["content"]
        # Format this turn
        if role == "system":
            prefix = template["system_start"]
            suffix = template["system_end"]
        elif role == "user":
            prefix = template["user_start"]
            suffix = template["user_end"]
        elif role == "assistant":
            prefix = template["assistant_start"]
            suffix = template["assistant_end"]
        else:
            continue
        is_assistant = role == "assistant"
        # Tokenize prefix (no loss)
        prefix_tokens = tokenizer.encode(prefix)
        all_token_ids.extend(prefix_tokens)
        loss_mask.extend([0] * len(prefix_tokens))
        # Tokenize content; only compute loss on assistant content
        content_tokens = tokenizer.encode(content)
        all_token_ids.extend(content_tokens)
        loss_mask.extend([1 if is_assistant else 0] * len(content_tokens))
        # Tokenize suffix. For assistant turns the end-of-turn marker is
        # included in the loss; otherwise the model never learns to stop.
        suffix_tokens = tokenizer.encode(suffix)
        all_token_ids.extend(suffix_tokens)
        loss_mask.extend([1 if is_assistant else 0] * len(suffix_tokens))
    return all_token_ids, loss_mask
Category Distribution
Task Diversity
A good SFT dataset covers a wide range of task categories. Over-representation of any single category biases the model.
[Figure: Ideal SFT Task Distribution (Approximate Targets), in % of SFT dataset]
Measuring Distribution
import math
from collections import Counter

def analyze_category_distribution(dataset, classifier_fn):
    """
    Analyze the task category distribution of an SFT dataset.
    classifier_fn: callable(instruction) -> category string
    """
    categories = Counter()
    for item in dataset:
        category = classifier_fn(item["instruction"])
        categories[category] += 1
    total = sum(categories.values())
    distribution = {
        cat: count / total
        for cat, count in categories.most_common()
    }
    # Compute entropy as diversity measure
    # Higher entropy = more diverse distribution
    entropy = -sum(
        p * math.log2(p) for p in distribution.values() if p > 0
    )
    # log2(0) is undefined, so treat an empty dataset as zero entropy
    max_entropy = math.log2(len(distribution)) if distribution else 0.0
    normalized_entropy = entropy / max_entropy if max_entropy > 0 else 0
    return {
        "distribution": distribution,
        "entropy": entropy,
        "normalized_entropy": normalized_entropy,  # 0-1 scale
        "num_categories": len(distribution),
    }
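As a worked example of the entropy measure, the sketch below compares a balanced and a skewed category mix. The helper `normalized_entropy` and the category labels are made up for illustration; it recomputes the same quantity as the function above from a flat list of labels.

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to the 0-1 range."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # zero or one category carries no diversity
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(counts))

# Hypothetical category labels for two 100-example datasets.
balanced = ["coding"] * 25 + ["writing"] * 25 + ["math"] * 25 + ["qa"] * 25
skewed = ["coding"] * 85 + ["writing"] * 5 + ["math"] * 5 + ["qa"] * 5

balanced_score = normalized_entropy(balanced)
skewed_score = normalized_entropy(skewed)
```

Both datasets have four categories, but the balanced mix reaches the maximum normalized entropy while the coding-heavy mix scores well under half of it, which is the kind of over-representation this metric is meant to flag.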
Response Length Distribution
The Length Problem
SFT datasets often bias toward either very short responses (Alpaca) or very long responses (ShareGPT). The ideal distribution matches real usage: most queries deserve 100-500 word responses, some require brief answers (under 50 words), and a few require detailed explanations (over 1000 words).
Response Length Distribution by Source
| Source | Median Words | p10 Words | p90 Words | Skew |
|---|---|---|---|---|
| Alpaca | 68 | 22 | 180 | Short-biased |
| ShareGPT | 312 | 45 | 1250 | Long-biased |
| OpenAssistant | 185 | 30 | 620 | Moderate |
| Dolly | 95 | 18 | 320 | Short-biased |
| Ideal target | 150 | 25 | 800 | Balanced |
Models trained on short-response data learn to give terse, incomplete answers. Models trained on long-response data learn to pad responses with unnecessary preamble and repetition. The target: match the response length to the complexity of the instruction.
def length_quality_score(instruction, response):
    """
    Score whether the response length is appropriate for the
    instruction complexity.
    """
    instruction_words = len(instruction.split())
    response_words = len(response.split())
    # Expected response length based on instruction complexity
    if instruction_words < 10:
        # Simple question -- response should be concise
        expected_range = (20, 200)
    elif instruction_words < 30:
        # Moderate question
        expected_range = (50, 500)
    elif instruction_words < 60:
        # Complex question
        expected_range = (100, 1000)
    else:
        # Very complex / multi-part question
        expected_range = (200, 2000)
    min_expected, max_expected = expected_range
    if response_words < min_expected:
        # Too short
        return response_words / min_expected
    elif response_words > max_expected:
        # Too long -- mild penalty
        return max(0.5, max_expected / response_words)
    else:
        # In range
        return 1.0
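To compare a dataset against the median, p10, and p90 columns in the table above, those statistics have to be measured first. The following is a small sketch using the standard library; the function name `response_length_stats` is hypothetical, and word counts use whitespace splitting, which is only a rough proxy for token length.

```python
import statistics

def response_length_stats(responses):
    """Return p10, median, and p90 of response lengths measured in words."""
    lengths = sorted(len(r.split()) for r in responses)
    if len(lengths) < 2:
        raise ValueError("need at least two responses to compute percentiles")
    # quantiles(n=10) returns the nine cut points p10, p20, ..., p90
    deciles = statistics.quantiles(lengths, n=10, method="inclusive")
    return {
        "p10": deciles[0],
        "median": statistics.median(lengths),
        "p90": deciles[8],
    }

# Synthetic responses with 1 to 11 words each, for illustration only.
sample_responses = [" ".join(["word"] * k) for k in range(1, 12)]
stats = response_length_stats(sample_responses)
```

Running this per source makes it easy to see whether a mix leans short (low median and p90) or long (high p90) before deciding how to rebalance it.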
Safety Filtering
Handling Harmful Instructions
SFT data must include examples of the model refusing harmful requests. But it must also avoid teaching the model to refuse benign requests. The balance:
- 3-5% of SFT data should be "refusal" examples where the model declines harmful requests with brief, specific explanations
- These should cover categories: violence, illegal activity, privacy violations, manipulation
- Refusals should be concise, not long lectures about AI safety
SAFETY_CATEGORIES = [
    "violence", "illegal_activity", "privacy_violation",
    "manipulation", "self_harm", "csam", "weapons",
    "fraud", "harassment",
]

def create_refusal_example(harmful_instruction, category):
    """
    Create a training example where the model appropriately
    refuses a harmful instruction.
    """
    # Short, specific refusal -- not a lecture
    refusal_templates = {
        "violence": "I cannot provide instructions for causing physical harm to people.",
        "illegal_activity": "I cannot help with illegal activities.",
        "privacy_violation": "I cannot help with accessing private information without authorization.",
        "manipulation": "I cannot help with manipulating or deceiving people.",
        "self_harm": "I cannot provide information that could facilitate self-harm. If you are struggling, please contact a crisis helpline.",
    }
    refusal = refusal_templates.get(
        category,
        "I cannot assist with this request.",
    )
    return {
        "conversations": [
            {"role": "user", "content": harmful_instruction},
            {"role": "assistant", "content": refusal},
        ],
        "category": "safety_refusal",
        "safety_category": category,
    }
A common failure mode: the model refuses benign requests because its safety training is too aggressive. "How do I kill a process in Linux?" should not trigger a refusal. The SFT safety data should include borderline examples with correct non-refusal responses to prevent over-refusal.
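One way to keep the refusal share inside the 3-5% band is to measure it before training. The sketch below assumes each example carries the `"category": "safety_refusal"` tag used by `create_refusal_example`; the function names and the returned status strings are hypothetical.

```python
def refusal_share(dataset):
    """Fraction of examples tagged as safety refusals."""
    if not dataset:
        return 0.0
    refusals = sum(1 for ex in dataset if ex.get("category") == "safety_refusal")
    return refusals / len(dataset)

def check_refusal_balance(dataset, low=0.03, high=0.05):
    """Classify the refusal share against the target band [low, high]."""
    share = refusal_share(dataset)
    if share < low:
        return "too_few_refusals"   # risk: model complies with harmful requests
    if share > high:
        return "too_many_refusals"  # risk: model over-refuses benign requests
    return "ok"

# Toy dataset: 4 refusals out of 100 examples.
toy_dataset = [{"category": "safety_refusal"}] * 4 + [{"category": "coding"}] * 96
status = check_refusal_balance(toy_dataset)
```

This check only covers the ratio. Borderline-but-benign examples such as the Linux question still need to be added and answered normally, since they are tagged as ordinary task data rather than refusals.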
Putting It All Together
The Complete SFT Data Pipeline
def build_sft_dataset(
    raw_sources,
    benchmark_questions,
    target_size=50000,
    quality_threshold=0.55,
):
    """
    End-to-end SFT dataset construction pipeline.
    raw_sources: dict mapping source_name -> list of example dicts
    with 'instruction' and 'response' keys
    """
    scorer = InstructionQualityScorer()
    # Step 1: Combine all sources
    all_examples = []
    for source_name, examples in raw_sources.items():
        for ex in examples:
            ex["source"] = source_name
            all_examples.append(ex)
    print(f"Raw examples: {len(all_examples)}")
    # Step 2: Decontaminate
    detector = ContaminationDetector(benchmark_questions)
    clean_examples = [
        ex for ex in all_examples
        if not detector.is_contaminated(ex["instruction"])
        and not detector.is_contaminated(ex["response"])
    ]
    print(f"After decontamination: {len(clean_examples)}")
    # Step 3: Quality scoring
    scored = []
    existing_instructions = []
    for ex in clean_examples:
        score = scorer.score(
            ex["instruction"],
            ex["response"],
            existing_instructions,
        )
        if score.overall >= quality_threshold:
            ex["quality_score"] = score.overall
            scored.append(ex)
            existing_instructions.append(ex["instruction"])
    print(f"After quality filter: {len(scored)}")
    # Step 4: Sort by quality and select top N
    scored.sort(key=lambda x: x["quality_score"], reverse=True)
    selected = scored[:target_size]
    print(f"Final dataset: {len(selected)} examples")
    # Guard against division by zero when nothing passes the filters
    if selected:
        mean_quality = sum(e["quality_score"] for e in selected) / len(selected)
        print(f"Mean quality: {mean_quality:.3f}")
    return selected
Instruction tuning data quality is not about sophisticated filtering algorithms. It is about two simple principles: (1) the response must completely and specifically address the instruction, and (2) the dataset must cover a diverse range of tasks. A 10K dataset with these properties outperforms a 500K dataset without them. The quality scorer, decontamination pipeline, and category analysis exist to enforce these principles at scale.