Phi-3-mini (3.8B parameters) scores roughly 69% on MMLU — rivaling GPT-3.5 (70.0%) and edging past Llama 2 70B (68.9%). The difference is data quality: Phi-3 was trained on synthetic data generated by GPT-4, filtered through quality classifiers, and deduplicated at the semantic level. Small models with great data beat large models with mediocre data. The implication: edge deployment of near-frontier intelligence is no longer compute-constrained; it is data-constrained.
Model Specifications
class SmallModelSpecs:
    models = {
        'Phi-3-mini-4K': {
            'params': 3.8e9,
            'layers': 32,
            'hidden': 3072,
            'heads': 32,
            'kv_heads': 32,  # MHA (no GQA)
            'intermediate': 8192,
            'context': 4096,
            'vocab_size': 32064,
            'training_tokens': '3.3T',
            'architecture': 'Dense transformer, MHA, RoPE',
            'special': 'Synthetic data from GPT-4',
        },
        'Phi-3.5-mini': {
            'params': 3.8e9,
            'layers': 32,
            'hidden': 3072,
            'heads': 32,
            'kv_heads': 32,
            'intermediate': 8192,
            'context': 128000,
            'vocab_size': 32064,
            'training_tokens': '3.4T',
            'architecture': 'Dense transformer, MHA, LongRoPE',
            'special': 'Long context via LongRoPE',
        },
        'Gemma-2-2B': {
            'params': 2.6e9,
            'layers': 26,
            'hidden': 2304,
            'heads': 8,
            'kv_heads': 4,  # GQA
            'intermediate': 9216,
            'context': 8192,
            'vocab_size': 256000,
            'training_tokens': '2T',
            'architecture': 'Dense transformer, GQA, sliding+global attention',
            'special': 'Knowledge distillation from Gemma 27B',
        },
        'Gemma-2-9B': {
            'params': 9.2e9,
            'layers': 42,
            'hidden': 3584,
            'heads': 16,
            'kv_heads': 8,
            'intermediate': 14336,
            'context': 8192,
            'vocab_size': 256000,
            'training_tokens': '8T',
            'architecture': 'Dense transformer, GQA, sliding+global attention',
            'special': 'Distilled from Gemma 27B',
        },
        'Llama-3.2-1B': {
            'params': 1.24e9,
            'layers': 16,
            'hidden': 2048,
            'heads': 32,
            'kv_heads': 8,
            'intermediate': 8192,
            'context': 131072,
            'vocab_size': 128256,
            'training_tokens': '9T',
            'architecture': 'Dense transformer, GQA, RoPE',
            'special': 'Distilled + pruned from Llama 3.1 8B',
        },
        'Llama-3.2-3B': {
            'params': 3.21e9,
            'layers': 28,
            'hidden': 3072,
            'heads': 24,
            'kv_heads': 8,
            'intermediate': 8192,
            'context': 131072,
            'vocab_size': 128256,
            'training_tokens': '9T',
            'architecture': 'Dense transformer, GQA, RoPE',
            'special': 'Distilled + pruned from Llama 3.1 8B',
        },
    }
Small Model Specifications
| Model | Params | Layers | Hidden | GQA | Context | Training Tokens |
|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 16 | 2048 | 4:1 | 131K | 9T |
| Gemma 2 2B | 2.6B | 26 | 2304 | 2:1 | 8K | 2T |
| Phi-3 mini | 3.8B | 32 | 3072 | None (MHA) | 4K | 3.3T |
| Llama 3.2 3B | 3.21B | 28 | 3072 | 3:1 | 131K | 9T |
| Gemma 2 9B | 9.2B | 42 | 3584 | 2:1 | 8K | 8T |
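The listed parameter counts can be sanity-checked from the architecture dimensions. Below is a rough estimator (a sketch: it ignores norm weights and biases, and assumes a gated SwiGLU-style MLP and tied input/output embeddings, as Llama 3.2 uses):

```python
def estimate_params(layers, hidden, heads, kv_heads, intermediate, vocab_size,
                    tied_embeddings=True, gated_mlp=True):
    """Rough decoder-only transformer parameter count (norms/biases ignored)."""
    head_dim = hidden // heads
    # Attention: Q and O projections are hidden x hidden; K and V shrink with GQA.
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
    # A gated (SwiGLU-style) MLP has gate, up, and down projections.
    mlp = (3 if gated_mlp else 2) * hidden * intermediate
    # Tied embeddings share one vocab_size x hidden matrix for input and output.
    embed = vocab_size * hidden * (1 if tied_embeddings else 2)
    return layers * (attn + mlp) + embed

# Llama 3.2 1B from the table above
print(estimate_params(16, 2048, 32, 8, 8192, 128256))  # ~1.236e9, close to 1.24B
```

Plugging in the Llama 3.2 3B row reproduces its 3.21B count almost exactly, which suggests the table's dimensions are internally consistent.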
Data Distillation Strategies
Microsoft Phi: Synthetic Data
import numpy as np

class PhiSyntheticDataPipeline:
    """
    Phi's primary innovation: large-scale synthetic data from GPT-4.
    Instead of distilling through logit matching, Phi uses the
    teacher model to GENERATE training data.

    `teacher_model` below is any object exposing a .generate(prompt, ...) method.
    """

    def generate_textbook_data(self, topics, teacher_model):
        """Generate textbook-quality explanations for each topic."""
        synthetic_data = []
        for topic in topics:
            prompt = f"""Write a detailed textbook chapter about: {topic}
Include:
- Clear definitions
- Step-by-step explanations
- Worked examples
- Practice problems with solutions
Use precise technical language. Target graduate-level understanding."""
            chapter = teacher_model.generate(prompt, max_tokens=8000)
            synthetic_data.append({
                'text': chapter,
                'source': 'synthetic_textbook',
                'topic': topic,
                'quality': 'high',
            })
        return synthetic_data

    def generate_exercise_data(self, domain, teacher_model, num_exercises=100000):
        """
        Generate diverse exercises with solutions.
        The key: vary difficulty, format, and domain systematically.
        """
        exercises = []
        difficulties = ['basic', 'intermediate', 'advanced', 'olympiad']
        formats = ['multiple_choice', 'short_answer', 'proof', 'code']
        for _ in range(num_exercises):
            difficulty = np.random.choice(difficulties, p=[0.3, 0.35, 0.25, 0.1])
            fmt = np.random.choice(formats, p=[0.2, 0.3, 0.2, 0.3])
            prompt = f"Generate a {difficulty} {domain} problem in {fmt} format with solution."
            exercise = teacher_model.generate(prompt)
            # Verify the solution with a second teacher pass (crucial quality control)
            verification = teacher_model.generate(
                f"Verify this solution is correct:\n{exercise}"
            )
            if 'correct' in verification.lower():
                exercises.append(exercise)
        return exercises
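The intro also mentions semantic-level deduplication of the synthetic corpus. As a toy illustration of the idea (this is a sketch: token-set Jaccard similarity stands in for the neural-embedding cosine similarity and approximate nearest-neighbor search a production pipeline would use, and the threshold is arbitrary):

```python
def near_duplicates(docs, threshold=0.7):
    """Greedy semantic-ish dedup sketch: drop any document whose token-set
    Jaccard similarity to an already-kept document exceeds the threshold."""
    kept, kept_tokens = [], []
    for doc in docs:
        tokens = set(doc.lower().split())
        is_dup = any(
            len(tokens & seen) / max(len(tokens | seen), 1) >= threshold
            for seen in kept_tokens
        )
        if not is_dup:
            kept.append(doc)
            kept_tokens.append(tokens)
    return kept

docs = [
    "the derivative of x squared is two x",
    "the derivative of x squared is 2x",   # near-duplicate, dropped
    "integrate x to get x squared over two",
]
print(near_duplicates(docs))  # keeps the first and third documents
```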
Google Gemma: Logit Distillation
import torch
import torch.nn.functional as F

class GemmaDistillation:
    """
    Gemma 2 uses traditional knowledge distillation:
    train the small model to match the larger model's output distribution.
    """

    def distill(self, teacher_27b, student_2b, data_loader,
                temperature=4.0, alpha=0.5, num_steps=500000):
        """
        Standard KD with high temperature for softer targets.
        Gemma uses temperature=4.0 (higher than the typical 2.0).
        """
        optimizer = torch.optim.AdamW(student_2b.parameters(), lr=1e-4)
        for step, batch in enumerate(data_loader):
            if step >= num_steps:
                break
            with torch.no_grad():
                teacher_logits = teacher_27b(batch['input_ids']).logits
            student_logits = student_2b(batch['input_ids']).logits
            # KL divergence with temperature scaling; the T*T factor keeps
            # gradient magnitudes comparable across temperatures
            T = temperature
            kl_loss = F.kl_div(
                F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction='batchmean'
            ) * (T * T)
            # Hard-label cross-entropy loss
            ce_loss = F.cross_entropy(
                student_logits.view(-1, student_logits.size(-1)),
                batch['labels'].view(-1),
                ignore_index=-100
            )
            loss = alpha * kl_loss + (1 - alpha) * ce_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Meta Llama 3.2: Pruning + Distillation
class LlamaPruneAndDistill:
    """
    Llama 3.2 1B/3B are created by pruning Llama 3.1 8B,
    then distilling to recover quality.
    """

    def structured_pruning(self, model_8b, target_params):
        """
        Remove entire layers and reduce dimensions.
        8B -> 3B: reduce layers from 32 to 28, heads from 32 to 24.
        8B -> 1B: reduce layers from 32 to 16, hidden from 4096 to 2048.
        """
        # Importance scoring for layers
        layer_importance = []
        for layer_idx in range(model_8b.config.num_hidden_layers):
            # Measure the layer's contribution via angular distance
            importance = self.compute_layer_importance(model_8b, layer_idx)
            layer_importance.append((layer_idx, importance))
        # Remove the least important layers
        layer_importance.sort(key=lambda x: x[1])
        target_layers = 28 if target_params > 2e9 else 16
        layers_to_remove = [
            idx for idx, _ in layer_importance[:32 - target_layers]
        ]
        # Create the pruned model
        pruned = self.remove_layers(model_8b, layers_to_remove)
        # Reduce the hidden dimension if needed (for the 1B target)
        if target_params < 2e9:
            pruned = self.reduce_dimensions(pruned, target_hidden=2048)
        return pruned

    def post_pruning_distillation(self, teacher_8b, pruned_student, data, tokens=9e12):
        """
        After pruning, distill from the original 8B to recover quality.
        Meta reports this takes 9T tokens of distillation.
        """
        # Standard distillation + continued pretraining.
        # The pruned model retains much of the teacher's knowledge,
        # so it converges faster than training from scratch.
        pass  # sketch only; the KD loop mirrors GemmaDistillation.distill above
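The `compute_layer_importance` hook is left abstract above. One concrete realization, used in recent layer-pruning work, scores a layer by the angular distance between its input and output hidden states: a layer that barely rotates the residual stream is cheap to remove. A minimal sketch on dummy activations (real usage would capture hidden states from a calibration set):

```python
import torch

def layer_importance(h_in, h_out, eps=1e-8):
    """Mean angular distance between a layer's input and output hidden
    states; near zero means the layer is close to a pass-through."""
    cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
    return torch.arccos(cos.clamp(-1 + eps, 1 - eps)).mean().item()

torch.manual_seed(0)
h = torch.randn(2, 16, 512)                       # (batch, seq, hidden)
near_identity = h + 0.01 * torch.randn_like(h)    # layer that barely acts
big_change = torch.randn_like(h)                  # layer that transforms heavily

# A near-pass-through layer scores far lower than a transformative one
print(layer_importance(h, near_identity), layer_importance(h, big_change))
```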
Architecture Differences
def compare_attention_designs():
    """Each model family uses a different attention design."""
    designs = {
        'Phi-3': {
            'type': 'Standard MHA',
            'qkv_heads': '32Q, 32K, 32V',
            'kv_cache_per_token_per_layer': 32 * 96 * 2 * 2,  # 12,288 B (head_dim 96, K+V, fp16)
            'pro': 'Maximum attention quality',
            'con': 'Largest KV cache',
        },
        'Gemma-2': {
            'type': 'GQA + sliding/global interleaving',
            'qkv_heads': '8Q, 4K, 4V (2B); 16Q, 8K, 8V (9B)',
            'kv_cache_per_token_per_layer': 4 * 256 * 2 * 2,  # 4,096 B (2B, head_dim 256)
            'pro': 'Alternating local/global attention saves memory',
            'con': 'Sliding-window layers miss long-range context for some tokens',
            'detail': 'Layers alternate between sliding-window (4096) and global attention',
        },
        'Llama-3.2': {
            'type': 'GQA',
            'qkv_heads': '24Q, 8K, 8V (3B); 32Q, 8K, 8V (1B)',
            'kv_cache_per_token_per_layer': 8 * 128 * 2 * 2,  # 4,096 B (3B, head_dim 128)
            'pro': '3-4x KV cache reduction vs. MHA, 128K context',
            'con': 'Fewer KV heads may limit quality',
        },
    }
    return designs
KV Cache Memory per Token per Layer
| Model | KV Heads | Head Dim | KV Bytes/Token/Layer (fp16) | 32K-Context Total |
|---|---|---|---|---|
| Phi-3 mini (MHA) | 32 | 96 | 12,288 | 12.9 GB |
| Gemma 2 2B (GQA) | 4 | 256 | 4,096 | 3.5 GB |
| Llama 3.2 3B (GQA) | 8 | 128 | 4,096 | 3.8 GB |
| Llama 3.2 1B (GQA) | 8 | 64 | 2,048 | 1.1 GB |
Phi-3 uses full MHA (32 KV heads), giving it the largest KV cache but potentially the best attention quality. Gemma 2 and Llama 3.2 use GQA to cut the KV cache by 2-4x relative to full MHA, which is critical for edge deployment where memory is limited. Gemma's sliding+global interleaving further reduces the effective KV cache, since the sliding-window layers only keep keys and values for the most recent 4,096 tokens.
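The per-token figures above follow from a one-line formula; a sketch (assuming an fp16 cache and ignoring sliding-window savings):

```python
def kv_cache_bytes(kv_heads, head_dim, layers, context_len, dtype_bytes=2):
    """KV cache size: keys + values for every layer and every cached token."""
    per_token_per_layer = 2 * kv_heads * head_dim * dtype_bytes  # K and V
    return per_token_per_layer, per_token_per_layer * layers * context_len

per_tok, total = kv_cache_bytes(kv_heads=8, head_dim=128, layers=28,
                                context_len=32768)  # Llama 3.2 3B
print(per_tok, total / 1e9)  # → 4096 3.758096384
```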
Quality Comparison
Small Model Quality Comparison
| Model | Params | MMLU | HumanEval | GSM8K | ARC-C | HellaSwag |
|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 49.3 | 33.5 | 44.4 | 59.4 | 69.4 |
| Gemma 2 2B | 2.6B | 53.2 | 36.1 | 51.8 | 68.4 | 74.1 |
| Phi-3 mini | 3.8B | 69.7 | 58.5 | 82.5 | 78.7 | 80.1 |
| Llama 3.2 3B | 3.21B | 63.4 | 48.7 | 72.3 | 74.2 | 77.8 |
| Gemma 2 9B | 9.2B | 71.3 | 54.3 | 79.1 | 81.2 | 83.5 |
| Llama 3.1 8B | 8.0B | 68.4 | 62.2 | 79.6 | 79.5 | 82.1 |
Phi-3 mini at 3.8B parameters achieves MMLU 69.7 — competitive with Llama 3.1 8B (68.4) at less than half the size. This demonstrates the power of synthetic data distillation: GPT-4-generated training data allows a 3.8B model to compete with an 8B model trained on 15T tokens of web data. However, Phi-3’s advantage shrinks on coding tasks (HumanEval), suggesting synthetic data is more effective for knowledge tasks than code generation.
Edge Deployment Characteristics
def edge_deployment_analysis():
    """
    These models target edge/mobile deployment.
    Key metrics: memory footprint, tokens/watt, latency on mobile chips.
    """
    devices = {
        'iPhone 16 Pro': {
            'npu_tops': 38,
            'ram_gb': 8,
            'available_for_model': 4,  # iOS reserves ~4GB
        },
        'Pixel 9 Pro': {
            'npu_tops': 45,
            'ram_gb': 16,
            'available_for_model': 8,
        },
        'M4 MacBook': {
            'npu_tops': 38,
            'gpu_tops': 53,
            'ram_gb': 24,
            'available_for_model': 16,
        },
    }
    models = {
        'Llama 3.2 1B INT4': {'memory_gb': 0.7, 'tokens_per_sec_npu': 30},
        'Gemma 2 2B INT4': {'memory_gb': 1.5, 'tokens_per_sec_npu': 18},
        'Phi-3 mini INT4': {'memory_gb': 2.1, 'tokens_per_sec_npu': 12},
        'Llama 3.2 3B INT4': {'memory_gb': 1.8, 'tokens_per_sec_npu': 15},
    }
    for device_name, device in devices.items():
        print(f"\n{device_name} ({device['available_for_model']}GB available):")
        for model_name, model in models.items():
            fits = model['memory_gb'] < device['available_for_model']
            print(f"  {model_name}: {'Fits' if fits else 'Too large'} "
                  f"({model['memory_gb']:.1f}GB)")
INT4 Memory Footprint (model weights only)
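The INT4 footprints quoted above can be approximated from parameter count alone: 4 bits per weight plus quantization overhead (the 0.5 extra bits per weight assumed here is illustrative; actual overhead depends on group size and which layers stay in higher precision):

```python
def int4_footprint_gb(params, extra_bits=0.5):
    """Approximate INT4 model size: 4 bits/weight + quantization-scale overhead."""
    return params * (4 + extra_bits) / 8 / 1e9

for name, params in [('Llama 3.2 1B', 1.24e9), ('Gemma 2 2B', 2.6e9),
                     ('Llama 3.2 3B', 3.21e9), ('Phi-3 mini', 3.8e9)]:
    print(f"{name}: ~{int4_footprint_gb(params):.1f} GB")
# → ~0.7, ~1.5, ~1.8, ~2.1 GB, matching the figures used above
```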
Key Takeaways
Distillation Strategy Comparison
| Lab | Strategy | Tokens | Strength | Weakness |
|---|---|---|---|---|
| Microsoft (Phi) | Synthetic data generation | 3.3T | Highest quality/param at 3-4B | Relies on GPT-4 API |
| Google (Gemma) | Logit distillation + arch innovation | 2-8T | Sliding window, large vocab | Lower quality than Phi at 2B |
| Meta (Llama) | Pruning + distillation | 9T | Most training tokens, 128K context | Architecture less optimized |
Quality Per Compute Dollar
def cost_efficiency_analysis():
    """
    Compare quality-per-dollar across small distilled models.
    For edge deployment, the relevant metric is quality achievable
    within a given memory and power budget.
    """
    # On a $500 edge device (8GB RAM, ~5W NPU)
    deployable_models = {
        'Llama 3.2 1B INT4': {
            'memory_gb': 0.7,
            'fits_8gb': True,
            'mmlu': 49.3,
            'tokens_per_watt': 6.0,
        },
        'Gemma 2 2B INT4': {
            'memory_gb': 1.5,
            'fits_8gb': True,
            'mmlu': 53.2,
            'tokens_per_watt': 3.6,
        },
        'Llama 3.2 3B INT4': {
            'memory_gb': 1.8,
            'fits_8gb': True,
            'mmlu': 63.4,
            'tokens_per_watt': 3.0,
        },
        'Phi-3 mini INT4': {
            'memory_gb': 2.1,
            'fits_8gb': True,
            'mmlu': 69.7,
            'tokens_per_watt': 2.4,
        },
    }
    return deployable_models
MMLU Quality per GB of Model Memory
Llama 3.2 1B has the highest quality-per-gigabyte ratio at 70.4 MMLU points per GB of INT4 model memory. However, Phi-3 mini has the highest absolute quality (69.7 MMLU) among models that fit in 4GB. The optimal choice depends on whether the constraint is total memory (choose Llama 1B) or minimum quality threshold (choose Phi-3 mini if quality above 65 MMLU is required).
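The quality-per-gigabyte figures follow directly from the MMLU scores and INT4 footprints already listed; a sketch:

```python
models = {
    'Llama 3.2 1B INT4': (49.3, 0.7),
    'Gemma 2 2B INT4': (53.2, 1.5),
    'Llama 3.2 3B INT4': (63.4, 1.8),
    'Phi-3 mini INT4': (69.7, 2.1),
}
for name, (mmlu, mem_gb) in models.items():
    print(f"{name}: {mmlu / mem_gb:.1f} MMLU points per GB")
# Llama 3.2 1B leads at ~70.4 points/GB; Phi-3 mini trails at ~33.2
```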
Each lab’s approach to small models reflects their broader philosophy: Microsoft leverages its GPT-4 API access for synthetic data generation, Google innovates on architecture (sliding window attention) and trains a large teacher in-house, and Meta scales training data and open-sources aggressively. All three demonstrate that data quality and distillation technique matter more than raw parameter count at the 1B-7B scale. Phi-3’s synthetic data approach achieves the highest quality per parameter, but Llama 3.2’s 128K context window and Gemma 2’s architectural innovations address deployment needs that quality benchmarks alone do not capture.