Phi-3-mini (3.8B parameters) scores roughly 69% on MMLU — rivaling GPT-3.5 (70.0%) and edging past Llama 2 70B (68.9%). The difference is data quality: Phi-3 was trained on synthetic data generated by GPT-4, filtered through quality classifiers, and deduplicated at the semantic level. Small models with great data beat large models with mediocre data. The implication: edge deployment of near-frontier intelligence is no longer compute-constrained; it is data-constrained.
Model Specifications
class SmallModelSpecs:
    models = {
        'Phi-3-mini-4K': {
            'params': 3.8e9,
            'layers': 32,
            'hidden': 3072,
            'heads': 32,
            'kv_heads': 32,  # MHA (no GQA)
            'intermediate': 8192,
            'context': 4096,
            'vocab_size': 32064,
            'training_tokens': '3.3T',
            'architecture': 'Dense transformer, MHA, RoPE',
            'special': 'Synthetic data from GPT-4',
        },
        'Phi-3.5-mini': {
            'params': 3.8e9,
            'layers': 32,
            'hidden': 3072,
            'heads': 32,
            'kv_heads': 32,
            'intermediate': 8192,
            'context': 128000,
            'vocab_size': 32064,
            'training_tokens': '3.4T',
            'architecture': 'Dense transformer, MHA, LongRoPE',
            'special': 'Long context via LongRoPE',
        },
        'Gemma-2-2B': {
            'params': 2.6e9,
            'layers': 26,
            'hidden': 2304,
            'heads': 8,
            'kv_heads': 4,  # GQA
            'intermediate': 9216,
            'context': 8192,
            'vocab_size': 256000,
            'training_tokens': '2T',
            'architecture': 'Dense transformer, GQA, sliding+global attention',
            'special': 'Knowledge distillation from Gemma 27B',
        },
        'Gemma-2-9B': {
            'params': 9.2e9,
            'layers': 42,
            'hidden': 3584,
            'heads': 16,
            'kv_heads': 8,
            'intermediate': 14336,
            'context': 8192,
            'vocab_size': 256000,
            'training_tokens': '8T',
            'architecture': 'Dense transformer, GQA, sliding+global attention',
            'special': 'Distilled from Gemma 27B',
        },
        'Llama-3.2-1B': {
            'params': 1.24e9,
            'layers': 16,
            'hidden': 2048,
            'heads': 32,
            'kv_heads': 8,
            'intermediate': 8192,
            'context': 131072,
            'vocab_size': 128256,
            'training_tokens': '9T',
            'architecture': 'Dense transformer, GQA, RoPE',
            'special': 'Distilled + pruned from Llama 3.1 8B',
        },
        'Llama-3.2-3B': {
            'params': 3.21e9,
            'layers': 28,
            'hidden': 3072,
            'heads': 24,
            'kv_heads': 8,
            'intermediate': 8192,
            'context': 131072,
            'vocab_size': 128256,
            'training_tokens': '9T',
            'architecture': 'Dense transformer, GQA, RoPE',
            'special': 'Distilled + pruned from Llama 3.1 8B',
        },
    }
Small Model Specifications
| Model | Params | Layers | Hidden | GQA | Context | Training Tokens |
|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 16 | 2048 | 4:1 | 131K | 9T |
| Gemma 2 2B | 2.6B | 26 | 2304 | 2:1 | 8K | 2T |
| Phi-3 mini | 3.8B | 32 | 3072 | None (MHA) | 4K | 3.3T |
| Llama 3.2 3B | 3.21B | 28 | 3072 | 3:1 | 131K | 9T |
| Gemma 2 9B | 9.2B | 42 | 3584 | 2:1 | 8K | 8T |
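The listed parameter counts can be sanity-checked from the architecture dimensions. Below is a rough estimator (a sketch: it ignores norm weights and biases, and assumes a gated SwiGLU-style MLP and tied input/output embeddings, as Llama 3.2 uses):

```python
def estimate_params(layers, hidden, heads, kv_heads, intermediate, vocab_size,
                    tied_embeddings=True, gated_mlp=True):
    """Rough decoder-only transformer parameter count (norms/biases ignored)."""
    head_dim = hidden // heads
    # Attention: Q and O projections are hidden x hidden; K and V shrink with GQA.
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
    # A gated (SwiGLU-style) MLP has gate, up, and down projections.
    mlp = (3 if gated_mlp else 2) * hidden * intermediate
    # Tied embeddings share one vocab_size x hidden matrix for input and output.
    embed = vocab_size * hidden * (1 if tied_embeddings else 2)
    return layers * (attn + mlp) + embed

# Llama 3.2 1B from the table above
print(estimate_params(16, 2048, 32, 8, 8192, 128256))  # ~1.236e9, close to 1.24B
```

Plugging in the Llama 3.2 3B row reproduces its 3.21B count almost exactly, which suggests the table's dimensions are internally consistent.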
Data Distillation Strategies
Microsoft Phi: Synthetic Data
import numpy as np

class PhiSyntheticDataPipeline:
    """
    Phi's primary innovation: large-scale synthetic data from GPT-4.
    Instead of distilling through logit matching, Phi uses the
    teacher model to GENERATE training data.

    `teacher_model` below is any object exposing a .generate(prompt, ...) method.
    """

    def generate_textbook_data(self, topics, teacher_model):
        """Generate textbook-quality explanations for each topic."""
        synthetic_data = []
        for topic in topics:
            prompt = f"""Write a detailed textbook chapter about: {topic}
Include:
- Clear definitions
- Step-by-step explanations
- Worked examples
- Practice problems with solutions
Use precise technical language. Target graduate-level understanding."""
            chapter = teacher_model.generate(prompt, max_tokens=8000)
            synthetic_data.append({
                'text': chapter,
                'source': 'synthetic_textbook',
                'topic': topic,
                'quality': 'high',
            })
        return synthetic_data

    def generate_exercise_data(self, domain, teacher_model, num_exercises=100000):
        """
        Generate diverse exercises with solutions.
        The key: vary difficulty, format, and domain systematically.
        """
        exercises = []
        difficulties = ['basic', 'intermediate', 'advanced', 'olympiad']
        formats = ['multiple_choice', 'short_answer', 'proof', 'code']
        for _ in range(num_exercises):
            difficulty = np.random.choice(difficulties, p=[0.3, 0.35, 0.25, 0.1])
            fmt = np.random.choice(formats, p=[0.2, 0.3, 0.2, 0.3])
            prompt = f"Generate a {difficulty} {domain} problem in {fmt} format with solution."
            exercise = teacher_model.generate(prompt)
            # Verify the solution with a second teacher pass (crucial quality control)
            verification = teacher_model.generate(
                f"Verify this solution is correct:\n{exercise}"
            )
            if 'correct' in verification.lower():
                exercises.append(exercise)
        return exercises
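The intro also mentions semantic-level deduplication of the synthetic corpus. As a toy illustration of the idea (this is a sketch: token-set Jaccard similarity stands in for the neural-embedding cosine similarity and approximate nearest-neighbor search a production pipeline would use, and the threshold is arbitrary):

```python
def near_duplicates(docs, threshold=0.7):
    """Greedy semantic-ish dedup sketch: drop any document whose token-set
    Jaccard similarity to an already-kept document exceeds the threshold."""
    kept, kept_tokens = [], []
    for doc in docs:
        tokens = set(doc.lower().split())
        is_dup = any(
            len(tokens & seen) / max(len(tokens | seen), 1) >= threshold
            for seen in kept_tokens
        )
        if not is_dup:
            kept.append(doc)
            kept_tokens.append(tokens)
    return kept

docs = [
    "the derivative of x squared is two x",
    "the derivative of x squared is 2x",   # near-duplicate, dropped
    "integrate x to get x squared over two",
]
print(near_duplicates(docs))  # keeps the first and third documents
```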
Google Gemma: Logit Distillation
import torch
import torch.nn.functional as F

class GemmaDistillation:
    """
    Gemma 2 uses traditional knowledge distillation:
    train the small model to match the larger model's output distribution.
    """

    def distill(self, teacher_27b, student_2b, data_loader,
                temperature=4.0, alpha=0.5, num_steps=500000):
        """
        Standard KD with high temperature for softer targets.
        Gemma uses temperature=4.0 (higher than the typical 2.0).
        """
        optimizer = torch.optim.AdamW(student_2b.parameters(), lr=1e-4)
        for step, batch in enumerate(data_loader):
            if step >= num_steps:
                break
            with torch.no_grad():
                teacher_logits = teacher_27b(batch['input_ids']).logits
            student_logits = student_2b(batch['input_ids']).logits
            # KL divergence with temperature scaling; the T*T factor keeps
            # gradient magnitudes comparable across temperatures
            T = temperature
            kl_loss = F.kl_div(
                F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction='batchmean'
            ) * (T * T)
            # Hard-label cross-entropy loss
            ce_loss = F.cross_entropy(
                student_logits.view(-1, student_logits.size(-1)),
                batch['labels'].view(-1),
                ignore_index=-100
            )
            loss = alpha * kl_loss + (1 - alpha) * ce_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
Meta Llama 3.2: Pruning + Distillation
class LlamaPruneAndDistill:
    """
    Llama 3.2 1B/3B are created by pruning Llama 3.1 8B,
    then distilling to recover quality.
    """

    def structured_pruning(self, model_8b, target_params):
        """
        Remove entire layers and reduce dimensions.
        8B -> 3B: reduce layers from 32 to 28, heads from 32 to 24.
        8B -> 1B: reduce layers from 32 to 16, hidden from 4096 to 2048.
        """
        # Importance scoring for layers
        layer_importance = []
        for layer_idx in range(model_8b.config.num_hidden_layers):
            # Measure the layer's contribution via angular distance
            importance = self.compute_layer_importance(model_8b, layer_idx)
            layer_importance.append((layer_idx, importance))
        # Remove the least important layers
        layer_importance.sort(key=lambda x: x[1])
        target_layers = 28 if target_params > 2e9 else 16
        layers_to_remove = [
            idx for idx, _ in layer_importance[:32 - target_layers]
        ]
        # Create the pruned model
        pruned = self.remove_layers(model_8b, layers_to_remove)
        # Reduce the hidden dimension if needed (for the 1B target)
        if target_params < 2e9:
            pruned = self.reduce_dimensions(pruned, target_hidden=2048)
        return pruned

    def post_pruning_distillation(self, teacher_8b, pruned_student, data, tokens=9e12):
        """
        After pruning, distill from the original 8B to recover quality.
        Meta reports this takes 9T tokens of distillation.
        """
        # Standard distillation + continued pretraining.
        # The pruned model retains much of the teacher's knowledge,
        # so it converges faster than training from scratch.
        pass  # sketch only; the KD loop mirrors GemmaDistillation.distill above
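The `compute_layer_importance` hook is left abstract above. One concrete realization, used in recent layer-pruning work, scores a layer by the angular distance between its input and output hidden states: a layer that barely rotates the residual stream is cheap to remove. A minimal sketch on dummy activations (real usage would capture hidden states from a calibration set):

```python
import torch

def layer_importance(h_in, h_out, eps=1e-8):
    """Mean angular distance between a layer's input and output hidden
    states; near zero means the layer is close to a pass-through."""
    cos = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1)
    return torch.arccos(cos.clamp(-1 + eps, 1 - eps)).mean().item()

torch.manual_seed(0)
h = torch.randn(2, 16, 512)                       # (batch, seq, hidden)
near_identity = h + 0.01 * torch.randn_like(h)    # layer that barely acts
big_change = torch.randn_like(h)                  # layer that transforms heavily

# A near-pass-through layer scores far lower than a transformative one
print(layer_importance(h, near_identity), layer_importance(h, big_change))
```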
Architecture Differences
def compare_attention_designs():
    """Each model family uses a different attention design."""
    designs = {
        'Phi-3': {
            'type': 'Standard MHA',
            'qkv_heads': '32Q, 32K, 32V',
            'kv_cache_per_token_per_layer': 32 * 96 * 2 * 2,  # 12,288 B (head_dim 96, K+V, fp16)
            'pro': 'Maximum attention quality',
            'con': 'Largest KV cache',
        },
        'Gemma-2': {
            'type': 'GQA + sliding/global interleaving',
            'qkv_heads': '8Q, 4K, 4V (2B); 16Q, 8K, 8V (9B)',
            'kv_cache_per_token_per_layer': 4 * 256 * 2 * 2,  # 4,096 B (2B, head_dim 256)
            'pro': 'Alternating local/global attention saves memory',
            'con': 'Sliding-window layers miss long-range context for some tokens',
            'detail': 'Layers alternate between sliding-window (4096) and global attention',
        },
        'Llama-3.2': {
            'type': 'GQA',
            'qkv_heads': '24Q, 8K, 8V (3B); 32Q, 8K, 8V (1B)',
            'kv_cache_per_token_per_layer': 8 * 128 * 2 * 2,  # 4,096 B (3B, head_dim 128)
            'pro': '3-4x KV cache reduction vs. MHA, 128K context',
            'con': 'Fewer KV heads may limit quality',
        },
    }
    return designs
KV Cache Memory per Token per Layer
| Model | KV Heads | Head Dim | KV Bytes/Token/Layer (fp16) | 32K-Context Total |
|---|---|---|---|---|
| Phi-3 mini (MHA) | 32 | 96 | 12,288 | 12.9 GB |
| Gemma 2 2B (GQA) | 4 | 256 | 4,096 | 3.5 GB |
| Llama 3.2 3B (GQA) | 8 | 128 | 4,096 | 3.8 GB |
| Llama 3.2 1B (GQA) | 8 | 64 | 2,048 | 1.1 GB |
Phi-3 uses full MHA (32 KV heads), giving it the largest KV cache but potentially the best attention quality. Gemma 2 and Llama 3.2 use GQA to cut the KV cache by 2-4x relative to full MHA, which is critical for edge deployment where memory is limited. Gemma's sliding+global interleaving further reduces the effective KV cache, since the sliding-window layers only keep keys and values for the most recent 4,096 tokens.
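The per-token figures above follow from a one-line formula; a sketch (assuming an fp16 cache and ignoring sliding-window savings):

```python
def kv_cache_bytes(kv_heads, head_dim, layers, context_len, dtype_bytes=2):
    """KV cache size: keys + values for every layer and every cached token."""
    per_token_per_layer = 2 * kv_heads * head_dim * dtype_bytes  # K and V
    return per_token_per_layer, per_token_per_layer * layers * context_len

per_tok, total = kv_cache_bytes(kv_heads=8, head_dim=128, layers=28,
                                context_len=32768)  # Llama 3.2 3B
print(per_tok, total / 1e9)  # → 4096 3.758096384
```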
Quality Comparison
Small Model Quality Comparison
| Model | Params | MMLU | HumanEval | GSM8K | ARC-C | HellaSwag |
|---|---|---|---|---|---|---|
| Llama 3.2 1B | 1.24B | 49.3 | 33.5 | 44.4 | 59.4 | 69.4 |
| Gemma 2 2B | 2.6B | 53.2 | 36.1 | 51.8 | 68.4 | 74.1 |
| Phi-3 mini | 3.8B | 69.7 | 58.5 | 82.5 | 78.7 | 80.1 |
| Llama 3.2 3B | 3.21B | 63.4 | 48.7 | 72.3 | 74.2 | 77.8 |
| Gemma 2 9B | 9.2B | 71.3 | 54.3 | 79.1 | 81.2 | 83.5 |
| Llama 3.1 8B | 8.0B | 68.4 | 62.2 | 79.6 | 79.5 | 82.1 |
Phi-3 mini at 3.8B parameters achieves MMLU 69.7 — competitive with Llama 3.1 8B (68.4) at less than half the size. This demonstrates the power of synthetic data distillation: GPT-4-generated training data allows a 3.8B model to compete with an 8B model trained on 15T tokens of web data. However, Phi-3’s advantage shrinks on coding tasks (HumanEval), suggesting synthetic data is more effective for knowledge tasks than code generation.
Edge Deployment Characteristics
def edge_deployment_analysis():
    """
    These models target edge/mobile deployment.
    Key metrics: memory footprint, tokens/watt, latency on mobile chips.
    """
    devices = {
        'iPhone 16 Pro': {
            'npu_tops': 38,
            'ram_gb': 8,
            'available_for_model': 4,  # iOS reserves ~4GB
        },
        'Pixel 9 Pro': {
            'npu_tops': 45,
            'ram_gb': 16,
            'available_for_model': 8,
        },
        'M4 MacBook': {
            'npu_tops': 38,
            'gpu_tops': 53,
            'ram_gb': 24,
            'available_for_model': 16,
        },
    }
    models = {
        'Llama 3.2 1B INT4': {'memory_gb': 0.7, 'tokens_per_sec_npu': 30},
        'Gemma 2 2B INT4': {'memory_gb': 1.5, 'tokens_per_sec_npu': 18},
        'Phi-3 mini INT4': {'memory_gb': 2.1, 'tokens_per_sec_npu': 12},
        'Llama 3.2 3B INT4': {'memory_gb': 1.8, 'tokens_per_sec_npu': 15},
    }
    for device_name, device in devices.items():
        print(f"\n{device_name} ({device['available_for_model']}GB available):")
        for model_name, model in models.items():
            fits = model['memory_gb'] < device['available_for_model']
            print(f"  {model_name}: {'Fits' if fits else 'Too large'} "
                  f"({model['memory_gb']:.1f}GB)")
INT4 Memory Footprint (model weights only)
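The INT4 footprints quoted above can be approximated from parameter count alone: 4 bits per weight plus quantization overhead (the 0.5 extra bits per weight assumed here is illustrative; actual overhead depends on group size and which layers stay in higher precision):

```python
def int4_footprint_gb(params, extra_bits=0.5):
    """Approximate INT4 model size: 4 bits/weight + quantization-scale overhead."""
    return params * (4 + extra_bits) / 8 / 1e9

for name, params in [('Llama 3.2 1B', 1.24e9), ('Gemma 2 2B', 2.6e9),
                     ('Llama 3.2 3B', 3.21e9), ('Phi-3 mini', 3.8e9)]:
    print(f"{name}: ~{int4_footprint_gb(params):.1f} GB")
# → ~0.7, ~1.5, ~1.8, ~2.1 GB, matching the figures used above
```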
Key Takeaways
Distillation Strategy Comparison
| Lab | Strategy | Tokens | Strength | Weakness |
|---|---|---|---|---|
| Microsoft (Phi) | Synthetic data generation | 3.3T | Highest quality/param at 3-4B | Relies on GPT-4 API |
| Google (Gemma) | Logit distillation + arch innovation | 2-8T | Sliding window, large vocab | Lower quality than Phi at 2B |
| Meta (Llama) | Pruning + distillation | 9T | Most training tokens, 128K context | Architecture less optimized |
Quality Per Compute Dollar
def cost_efficiency_analysis():
    """
    Compare quality-per-dollar across small distilled models.
    For edge deployment, the relevant metric is quality achievable
    within a given memory and power budget.
    """
    # On a $500 edge device (8GB RAM, ~5W NPU)
    deployable_models = {
        'Llama 3.2 1B INT4': {
            'memory_gb': 0.7,
            'fits_8gb': True,
            'mmlu': 49.3,
            'tokens_per_watt': 6.0,
        },
        'Gemma 2 2B INT4': {
            'memory_gb': 1.5,
            'fits_8gb': True,
            'mmlu': 53.2,
            'tokens_per_watt': 3.6,
        },
        'Llama 3.2 3B INT4': {
            'memory_gb': 1.8,
            'fits_8gb': True,
            'mmlu': 63.4,
            'tokens_per_watt': 3.0,
        },
        'Phi-3 mini INT4': {
            'memory_gb': 2.1,
            'fits_8gb': True,
            'mmlu': 69.7,
            'tokens_per_watt': 2.4,
        },
    }
    return deployable_models
MMLU Quality per GB of Model Memory
Llama 3.2 1B has the highest quality-per-gigabyte ratio at 70.4 MMLU points per GB of INT4 model memory. However, Phi-3 mini has the highest absolute quality (69.7 MMLU) among models that fit in 4GB. The optimal choice depends on whether the constraint is total memory (choose Llama 1B) or minimum quality threshold (choose Phi-3 mini if quality above 65 MMLU is required).
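The quality-per-gigabyte figures follow directly from the MMLU scores and INT4 footprints already listed; a sketch:

```python
models = {
    'Llama 3.2 1B INT4': (49.3, 0.7),
    'Gemma 2 2B INT4': (53.2, 1.5),
    'Llama 3.2 3B INT4': (63.4, 1.8),
    'Phi-3 mini INT4': (69.7, 2.1),
}
for name, (mmlu, mem_gb) in models.items():
    print(f"{name}: {mmlu / mem_gb:.1f} MMLU points per GB")
# Llama 3.2 1B leads at ~70.4 points/GB; Phi-3 mini trails at ~33.2
```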
Each lab’s approach to small models reflects their broader philosophy: Microsoft leverages its GPT-4 API access for synthetic data generation, Google innovates on architecture (sliding window attention) and trains a large teacher in-house, and Meta scales training data and open-sources aggressively. All three demonstrate that data quality and distillation technique matter more than raw parameter count at the 1B-7B scale. Phi-3’s synthetic data approach achieves the highest quality per parameter, but Llama 3.2’s 128K context window and Gemma 2’s architectural innovations address deployment needs that quality benchmarks alone do not capture.