Llama 4 claims a 10 million token context window, 80x larger than Llama 3's 128K. If true, this is not incremental progress; it is a complete reinvention of attention mechanisms, KV cache management, and position encoding. Standard RoPE breaks down long before 10M tokens, FlashAttention reduces memory traffic but attention FLOPs still scale quadratically with sequence length, and even with GQA (8 KV heads, head dim 128) a 10M-token KV cache at FP16 takes roughly 40 GB per layer, about 2 TB across all 48 layers. Meta either invented subquadratic attention that preserves quality, or the 10M claim applies only to retrieval-augmented contexts where most tokens are not actively attended.
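The KV-cache figure follows directly from Scout's published head configuration; a back-of-envelope sketch:

```python
# Back-of-envelope KV-cache sizing for Llama 4 Scout's published config.
NUM_KV_HEADS = 8       # GQA KV heads
HEAD_DIM = 128
NUM_LAYERS = 48
BYTES_FP16 = 2
CONTEXT = 10_000_000   # claimed 10M-token context

# K and V each store num_kv_heads * head_dim values per token per layer.
bytes_per_token_per_layer = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16  # 4 KiB
per_layer_gb = CONTEXT * bytes_per_token_per_layer / 1e9
total_tb = per_layer_gb * NUM_LAYERS / 1e3

print(f"{per_layer_gb:.0f} GB per layer, {total_tb:.2f} TB total")
# ~41 GB per layer, ~1.97 TB across 48 layers
```

Without GQA (40 full KV heads) the per-layer figure would be 5x larger, which is why grouped-query attention is table stakes at these context lengths.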
Model Specifications
```python
class Llama4Config:
    """Llama 4 model family configuration."""

    scout = {
        'name': 'Llama 4 Scout',
        'total_params': 109e9,
        'active_params': 17e9,
        'num_layers': 48,
        'hidden_size': 5120,
        'num_attention_heads': 40,
        'num_kv_heads': 8,             # GQA: 5 query heads per KV head
        'head_dim': 128,
        'num_experts': 16,
        'top_k': 1,                    # Top-1 routing (unusual choice)
        'expert_intermediate': 8192,
        'context_length': 10_000_000,  # 10M tokens claimed
        'vocab_size': 202_000,
        'modalities': ['text', 'image'],
        'vision_encoder': 'MetaCLIP-based',
    }

    maverick = {
        'name': 'Llama 4 Maverick',
        'total_params': 400e9,
        'active_params': 17e9,         # Same active params as Scout
        'num_layers': 48,
        'hidden_size': 5120,
        'num_attention_heads': 40,
        'num_kv_heads': 8,
        'head_dim': 128,
        'num_experts': 128,            # 8x more experts than Scout
        'top_k': 1,
        'expert_intermediate': 4096,   # Smaller per-expert FFN
        'context_length': 1_000_000,
        'vocab_size': 202_000,
        'modalities': ['text', 'image'],
    }
```
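The headline 109B-total / 17B-active split can be roughly reproduced from the Scout numbers above. A back-of-envelope sketch, assuming untied input/output embeddings and one shared expert per layer, and ignoring norms and the vision tower:

```python
# Rough parameter accounting for Llama 4 Scout (text tower only).
H, L, V = 5120, 48, 202_000          # hidden size, layers, vocab
N_HEADS, N_KV, D_HEAD = 40, 8, 128   # attention config
N_EXPERTS, FFN = 16, 8192            # routed experts, expert intermediate dim

# Attention: Q and O projections are H x (heads*d); K and V are H x (kv*d).
attn = L * (2 * H * N_HEADS * D_HEAD + 2 * H * N_KV * D_HEAD)

# One SwiGLU expert: gate, up, down projections = 3 * H * FFN params.
expert = 3 * H * FFN
routed = L * N_EXPERTS * expert       # all routed experts
shared = L * expert                   # one shared expert per layer
embed = 2 * V * H                     # untied input + output embeddings

total = attn + routed + shared + embed
routed_active = L * expert            # one routed expert fires per layer
active = attn + shared + routed_active + embed

print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.0f}B")
# lands close to the claimed 109B total / 17B active
```

The routed experts dominate: roughly 97B of the total sits in expert weights that are idle for any given token.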
Llama 4 Model Specifications
| Property | Scout | Maverick | Llama 3.1 405B (predecessor) |
|---|---|---|---|
| Total params | 109B | 400B | 405B |
| Active params/token | 17B | 17B | 405B |
| Experts | 16 | 128 | N/A (dense) |
| Top-k | 1 | 1 | N/A |
| Context length | 10M | 1M | 128K |
| Modalities | Text + Image | Text + Image | Text only |
| Architecture | MoE | MoE | Dense |
Both Scout and Maverick use top-1 routing — each token is processed by exactly 1 expert. This is unusual; DeepSeek V3 uses top-8 and Mixtral uses top-2. Top-1 routing maximizes sparsity (only 1/16 or 1/128 of experts activated) but limits the model’s ability to combine expert knowledge per token. Meta compensates with more layers (48) and a shared expert mechanism.
Top-1 Routing Analysis
```python
from math import comb

class Top1RoutingAnalysis:
    """Llama 4's top-1 routing is a deliberate serving cost optimization."""

    def compute_efficiency(self):
        configs = {
            'Llama 4 Scout (top-1, 16E)': {
                'experts': 16, 'top_k': 1,
                'expert_dim': 8192, 'hidden': 5120,
                'active_expert_flops': 1 * 3 * 5120 * 8192 * 2,
            },
            'Mixtral (top-2, 8E)': {
                'experts': 8, 'top_k': 2,
                'expert_dim': 14336, 'hidden': 4096,
                'active_expert_flops': 2 * 3 * 4096 * 14336 * 2,
            },
            'DeepSeek V3 (top-8, 256E)': {
                'experts': 256, 'top_k': 8,
                'expert_dim': 2048, 'hidden': 7168,
                'active_expert_flops': 8 * 3 * 7168 * 2048 * 2,
            },
        }
        for name, cfg in configs.items():
            combos = comb(cfg['experts'], cfg['top_k'])
            active_gflops = cfg['active_expert_flops'] / 1e9
            print(f"{name}:")
            print(f"  Active expert FLOPs: {active_gflops:.1f} GFLOPs/token/layer")
            print(f"  Routing combinations: {combos}")
            print(f"  Active fraction: {cfg['top_k'] / cfg['experts']:.1%}")

    def top1_vs_topk_quality(self):
        """
        Top-1 routing concerns:
        1. Less routing expressiveness (16 choices vs C(16,2)=120 for top-2)
        2. Each token gets a single expert's perspective
        3. Load balancing is harder (each token must go to exactly one expert)

        Mitigations in Llama 4:
        1. Shared expert: a dense FFN processed by ALL tokens, providing
           a common baseline that experts specialize on top of
        2. More layers (48): each token sees 48 different expert selections
        3. Larger expert FFN (8192 intermediate): each expert has more capacity
        """
        pass
```
Shared Expert Architecture
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Standard SwiGLU FFN: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Llama4MoELayer(nn.Module):
    """
    Llama 4 MoE layer with shared expert.
    The shared expert processes ALL tokens, providing a common
    baseline representation. The routed expert adds specialization.
    """
    def __init__(self, config):
        super().__init__()
        self.num_experts = config['num_experts']
        # Router: a single linear scoring head over the experts
        self.router = nn.Linear(config['hidden_size'], self.num_experts, bias=False)
        # Shared expert: processed by every token
        self.shared_expert = SwiGLUExpert(
            config['hidden_size'],
            config['expert_intermediate'],
        )
        # Routed experts: one selected per token
        self.routed_experts = nn.ModuleList([
            SwiGLUExpert(config['hidden_size'], config['expert_intermediate'])
            for _ in range(self.num_experts)
        ])

    def forward(self, x):
        # x: [num_tokens, hidden_size] (batch and sequence dims flattened)
        num_tokens, hidden_dim = x.shape
        # Shared expert: all tokens
        shared_output = self.shared_expert(x)
        # Router: select one expert per token
        router_logits = self.router(x)
        routing_weights = torch.softmax(router_logits, dim=-1)
        top1_weight, top1_index = routing_weights.max(dim=-1)
        # Routed expert computation: gather each expert's tokens
        routed_output = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = (top1_index == e)
            if mask.any():
                routed_output[mask] = self.routed_experts[e](x[mask])
        # Combine: shared + weighted routed
        return shared_output + top1_weight.unsqueeze(-1) * routed_output
```
The shared expert effectively doubles the active parameters without doubling the routing complexity. Each token processes both the shared expert (always active) and one routed expert (selected by router), giving 2x the FFN compute of pure top-1 routing while maintaining the serving simplicity of a single routing decision.
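The 2x claim is simple arithmetic: with one shared and one routed SwiGLU expert of equal size, each token's FFN compute is exactly double that of pure top-1 routing. A quick check using Scout's dimensions:

```python
# Per-token FFN compute with and without the shared expert (Scout dims).
H, FFN = 5120, 8192                   # hidden size, expert intermediate dim
expert_flops = 3 * H * FFN * 2        # gate, up, down projections; 2 FLOPs per MAC

pure_top1 = expert_flops              # one routed expert only
with_shared = 2 * expert_flops        # shared expert + one routed expert

print(with_shared / pure_top1)        # 2.0
```

Crucially the extra compute needs no extra routing decision, so the dispatch and load-balancing machinery stays as simple as pure top-1.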
Multimodal Architecture
```python
import torch.nn as nn

class Llama4VisionEncoder(nn.Module):
    """
    Llama 4's vision encoder processes images into token embeddings
    that are interleaved with text tokens in the main transformer.
    (Schematic: patch_embed and vision_transformer below stand in
    for the MetaCLIP-based tower.)
    """
    def __init__(self):
        super().__init__()
        # MetaCLIP-based vision encoder
        self.patch_size = 14                 # 14x14 pixel patches
        self.image_size = 560                # 560x560 input images
        self.num_patches = (560 // 14) ** 2  # 1,600 patches
        self.vision_hidden = 1408            # Vision encoder hidden dim
        # Projection from vision to language space
        self.vision_projection = nn.Linear(1408, 5120)  # -> Llama hidden dim

    def encode_image(self, image):
        """
        image: [batch, 3, 560, 560]
        Returns: [batch, 1600, 5120] — image tokens in language space
        """
        # Patch embedding + vision transformer (schematic)
        patches = self.patch_embed(image)                # [batch, 1600, 1408]
        vision_features = self.vision_transformer(patches)
        # Project to language model dimension
        return self.vision_projection(vision_features)   # [batch, 1600, 5120]

    def interleave_with_text(self, text_tokens, image_tokens, image_positions):
        """
        Insert image tokens at specified positions in a single text sequence.
        text_tokens: [text_len, 5120], image_tokens: [1600, 5120]
        """
        combined = []
        for pos, token in enumerate(text_tokens):
            if pos in image_positions:
                combined.extend(image_tokens)  # splice in all 1,600 image tokens
            combined.append(token)
        return combined
```
Serving Performance
Llama 4 Serving Characteristics
| Model | Memory (FP16) | Memory (INT4) | Tokens/s (bs=1) | Min Hardware |
|---|---|---|---|---|
| Scout (109B) | 218 GB | 55 GB | 68 | 1x H100-80G (INT4) |
| Maverick (400B) | 800 GB | 200 GB | 52 | 4x H100-80G (INT4) |
| Llama 3.1 405B (dense) | 810 GB | 203 GB | 18 | 4x H100-80G (INT4) |
| DeepSeek V3 (671B) | 1342 GB | 336 GB | 42 | 8x H100-80G (INT4) |
Llama 4 Scout at INT4 fits on a single H100-80G and generates at 68 tokens/s — compared to Llama 3.1 405B which requires 4 GPUs and generates at 18 tokens/s. The MoE transition gives a 15x improvement in throughput-per-GPU while achieving competitive quality. This is the core motivation for Meta’s shift to MoE.
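The 15x figure follows directly from the serving table: throughput per GPU is tokens/s divided by the number of GPUs each deployment needs. A quick check:

```python
# Throughput-per-GPU from the serving table (bs=1, INT4 deployments).
scout = 68 / 1    # Scout: 68 tok/s on 1x H100
dense = 18 / 4    # Llama 3.1 405B: 18 tok/s on 4x H100

print(f"{scout / dense:.1f}x")   # ~15.1x tokens/s per GPU
```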
Quality Results
Llama 4 vs Competitors
| Model | Active Params | MMLU | HumanEval | MATH | Vision (MMMU) |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 79.6 | 59.4 | 68.2 | 69.4 |
| Llama 4 Maverick | 17B | 82.1 | 68.3 | 73.1 | 73.4 |
| Gemma 2 27B | 27B | 75.2 | 51.8 | 52.1 | N/A |
| DeepSeek V3 | 37B | 82.4 | 71.2 | 89.1 | N/A |
| GPT-4o | ~20B* | 88.7 | 90.2 | 76.6 | 69.1 |
| Llama 3.1 405B | 405B | 87.3 | 89.0 | 73.8 | N/A |

*Estimated active parameters; OpenAI does not disclose GPT-4o's architecture.
Quality per Active Parameter (MMLU / Active Params in B)
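The ratio named in this caption can be computed from the benchmark table above; a quick sketch (GPT-4o omitted since its active-parameter count is only an estimate):

```python
# MMLU per billion active parameters, from the benchmark table above.
models = {
    'Llama 4 Scout': (79.6, 17),
    'Llama 4 Maverick': (82.1, 17),
    'Gemma 2 27B': (75.2, 27),
    'DeepSeek V3': (82.4, 37),
    'Llama 3.1 405B': (87.3, 405),
}
ratios = {name: mmlu / active_b for name, (mmlu, active_b) in models.items()}
for name, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {r:5.2f} MMLU/B active")
# The two Llama 4 models lead by a wide margin; the dense 405B trails at ~0.22.
```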
Training Infrastructure for Llama 4
```python
def llama4_training_details():
    """Meta trained Llama 4 on their Grand Teton clusters."""
    training_config = {
        'gpu_cluster': '24,576 H100 GPUs',
        'interconnect': 'NVLink + InfiniBand NDR 400G',
        'training_precision': 'BF16 with FP8 expert computation',
        'sequence_length': 32768,      # Training context
        'global_batch_tokens': 16_000_000,
        # Parallelism for MoE
        'data_parallel': 256,
        'tensor_parallel': 8,          # Within node
        'expert_parallel': 16,         # Experts distributed across GPUs
        'pipeline_parallel': 1,        # No PP for MoE (complex scheduling)
        # Training data
        'pretraining_tokens': '~30T estimated',
        'multimodal_tokens': '~5T (image-text pairs)',
        'post_training': 'RLHF + DPO',
    }
    return training_config
```
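The batch figures above imply the rough shape of the training run; a sanity check, taking the ~30T token estimate at face value:

```python
# Implied training-step arithmetic from the configuration above.
seq_len = 32_768
global_batch_tokens = 16_000_000
pretraining_tokens = 30e12   # ~30T (estimated, per the config above)

seqs_per_step = global_batch_tokens / seq_len       # sequences per optimizer step
steps = pretraining_tokens / global_batch_tokens    # total optimizer steps

print(f"{seqs_per_step:.0f} sequences/step, {steps / 1e6:.2f}M steps")
# ~488 sequences/step, ~1.88M steps
```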
Llama Generation Evolution: Dense to MoE
| Model | Total Params | Active Params | GPUs to Serve | tok/s (bs=1) |
|---|---|---|---|---|
| Llama 1 65B (2023) | 65B | 65B | 2x A100 | 22 |
| Llama 2 70B (2023) | 70B | 70B | 2x A100 | 20 |
| Llama 3 70B (2024) | 70B | 70B | 1x H100 | 32 |
| Llama 3.1 405B (2024) | 405B | 405B | 8x H100 | 12 |
| Llama 4 Scout (2025) | 109B | 17B | 1x H100 (INT4) | 68 |
| Llama 4 Maverick (2025) | 400B | 17B | 4x H100 (INT4) | 52 |
The trajectory is clear: Llama 3.1 405B was Meta’s peak dense model. Scaling further with dense transformers would have required 800B+ parameters with proportionally more GPUs. MoE allows Llama 4 Maverick to have comparable total parameters (400B) while requiring only 17B active compute per token — a 24x reduction in serving FLOPs compared to the dense predecessor.
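The 24x figure comes from the standard decode approximation of ~2 FLOPs per active parameter per generated token (ignoring attention over the KV cache):

```python
# Serving FLOPs per generated token ~ 2 FLOPs per active parameter.
dense_active = 405e9   # Llama 3.1 405B
moe_active = 17e9      # Llama 4 Maverick

dense_flops = 2 * dense_active   # ~810 GFLOPs/token
moe_flops = 2 * moe_active       # ~34 GFLOPs/token

print(f"{dense_flops / moe_flops:.1f}x")   # ~23.8x fewer FLOPs per token
```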
Competitive Positioning
```python
def llama4_competitive_analysis():
    """Where Llama 4 sits in the frontier model landscape."""
    positioning = {
        'vs_deepseek_v3': {
            'advantage': 'Open-weight, multimodal, simpler routing (top-1)',
            'disadvantage': 'Lower quality on text benchmarks (82.1 vs 82.4 MMLU)',
            'note': 'DeepSeek uses top-8 routing for better quality',
        },
        'vs_gpt4o': {
            'advantage': 'Open-weight, free to deploy, customizable',
            'disadvantage': 'Lower quality across most benchmarks',
            'note': 'Cost advantage for self-hosted deployment',
        },
        'vs_mixtral': {
            'advantage': 'Native multimodal, larger expert pool, longer context',
            'disadvantage': 'More memory for Maverick variant',
            'note': 'Scout is the natural Mixtral successor',
        },
    }
    return positioning
```
Llama 4 demonstrates that Meta’s MoE transition was the right engineering decision. Scout delivers 79.6 MMLU with 17B active parameters on a single GPU, while the dense predecessor Llama 3.1 405B required 4 GPUs for 87.3 MMLU. Maverick pushes to 82.1 MMLU while maintaining the same 17B active footprint by scaling total parameters to 400B across 128 experts. The addition of native vision capabilities positions Llama 4 against multimodal competitors like GPT-4o. The key architectural choice — top-1 routing with shared experts — prioritizes serving simplicity over routing expressiveness, a pragmatic decision for Meta’s goal of widespread open-source deployment.