Llama 4 claims a 10 million token context window, 80x larger than Llama 3's 128K. If true, this is not incremental progress; it is a complete reinvention of attention mechanisms, KV cache management, and position encoding. Standard RoPE breaks down long before 10M tokens, FlashAttention reduces memory traffic but attention FLOPs still scale quadratically with sequence length, and even with GQA (8 KV heads, head dim 128) a 10M-token KV cache at FP16 takes roughly 40 GB per layer, about 2 TB across all 48 layers. Meta either invented subquadratic attention that preserves quality, or the 10M claim applies only to retrieval-augmented contexts where most tokens are not actively attended.
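The KV-cache figure follows directly from Scout's published head configuration; a back-of-envelope sketch:

```python
# Back-of-envelope KV-cache sizing for Llama 4 Scout's published config.
NUM_KV_HEADS = 8       # GQA KV heads
HEAD_DIM = 128
NUM_LAYERS = 48
BYTES_FP16 = 2
CONTEXT = 10_000_000   # claimed 10M-token context

# K and V each store num_kv_heads * head_dim values per token per layer.
bytes_per_token_per_layer = 2 * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16  # 4 KiB
per_layer_gb = CONTEXT * bytes_per_token_per_layer / 1e9
total_tb = per_layer_gb * NUM_LAYERS / 1e3

print(f"{per_layer_gb:.0f} GB per layer, {total_tb:.2f} TB total")
# ~41 GB per layer, ~1.97 TB across 48 layers
```

Without GQA (40 full KV heads) the per-layer figure would be 5x larger, which is why grouped-query attention is table stakes at these context lengths.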
Model Specifications
```python
class Llama4Config:
    """Llama 4 model family configuration."""

    scout = {
        'name': 'Llama 4 Scout',
        'total_params': 109e9,
        'active_params': 17e9,
        'num_layers': 48,
        'hidden_size': 5120,
        'num_attention_heads': 40,
        'num_kv_heads': 8,             # GQA: 5 query heads per KV head
        'head_dim': 128,
        'num_experts': 16,
        'top_k': 1,                    # Top-1 routing (unusual choice)
        'expert_intermediate': 8192,
        'context_length': 10_000_000,  # 10M tokens claimed
        'vocab_size': 202_000,
        'modalities': ['text', 'image'],
        'vision_encoder': 'MetaCLIP-based',
    }

    maverick = {
        'name': 'Llama 4 Maverick',
        'total_params': 400e9,
        'active_params': 17e9,         # Same active params as Scout
        'num_layers': 48,
        'hidden_size': 5120,
        'num_attention_heads': 40,
        'num_kv_heads': 8,
        'head_dim': 128,
        'num_experts': 128,            # 8x more experts than Scout
        'top_k': 1,
        'expert_intermediate': 4096,   # Smaller per-expert FFN
        'context_length': 1_000_000,
        'vocab_size': 202_000,
        'modalities': ['text', 'image'],
    }
```
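The headline 109B-total / 17B-active split can be roughly reproduced from the Scout numbers above. A back-of-envelope sketch, assuming untied input/output embeddings and one shared expert per layer, and ignoring norms and the vision tower:

```python
# Rough parameter accounting for Llama 4 Scout (text tower only).
H, L, V = 5120, 48, 202_000          # hidden size, layers, vocab
N_HEADS, N_KV, D_HEAD = 40, 8, 128   # attention config
N_EXPERTS, FFN = 16, 8192            # routed experts, expert intermediate dim

# Attention: Q and O projections are H x (heads*d); K and V are H x (kv*d).
attn = L * (2 * H * N_HEADS * D_HEAD + 2 * H * N_KV * D_HEAD)

# One SwiGLU expert: gate, up, down projections = 3 * H * FFN params.
expert = 3 * H * FFN
routed = L * N_EXPERTS * expert       # all routed experts
shared = L * expert                   # one shared expert per layer
embed = 2 * V * H                     # untied input + output embeddings

total = attn + routed + shared + embed
routed_active = L * expert            # one routed expert fires per layer
active = attn + shared + routed_active + embed

print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.0f}B")
# lands close to the claimed 109B total / 17B active
```

The routed experts dominate: roughly 97B of the total sits in expert weights that are idle for any given token.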
Llama 4 Model Specifications
| Property | Scout | Maverick | Llama 3.1 405B (predecessor) |
|---|---|---|---|
| Total params | 109B | 400B | 405B |
| Active params/token | 17B | 17B | 405B |
| Experts | 16 | 128 | N/A (dense) |
| Top-k | 1 | 1 | N/A |
| Context length | 10M | 1M | 128K |
| Modalities | Text + Image | Text + Image | Text only |
| Architecture | MoE | MoE | Dense |
Both Scout and Maverick use top-1 routing — each token is processed by exactly 1 expert. This is unusual; DeepSeek V3 uses top-8 and Mixtral uses top-2. Top-1 routing maximizes sparsity (only 1/16 or 1/128 of experts activated) but limits the model’s ability to combine expert knowledge per token. Meta compensates with more layers (48) and a shared expert mechanism.
Top-1 Routing Analysis
```python
from math import comb

class Top1RoutingAnalysis:
    """Llama 4's top-1 routing is a deliberate serving cost optimization."""

    def compute_efficiency(self):
        configs = {
            'Llama 4 Scout (top-1, 16E)': {
                'experts': 16, 'top_k': 1,
                'expert_dim': 8192, 'hidden': 5120,
                'active_expert_flops': 1 * 3 * 5120 * 8192 * 2,
            },
            'Mixtral (top-2, 8E)': {
                'experts': 8, 'top_k': 2,
                'expert_dim': 14336, 'hidden': 4096,
                'active_expert_flops': 2 * 3 * 4096 * 14336 * 2,
            },
            'DeepSeek V3 (top-8, 256E)': {
                'experts': 256, 'top_k': 8,
                'expert_dim': 2048, 'hidden': 7168,
                'active_expert_flops': 8 * 3 * 7168 * 2048 * 2,
            },
        }
        for name, cfg in configs.items():
            combos = comb(cfg['experts'], cfg['top_k'])
            active_gflops = cfg['active_expert_flops'] / 1e9
            print(f"{name}:")
            print(f"  Active expert FLOPs: {active_gflops:.1f} GFLOPs/token/layer")
            print(f"  Routing combinations: {combos}")
            print(f"  Active fraction: {cfg['top_k'] / cfg['experts']:.1%}")

    def top1_vs_topk_quality(self):
        """
        Top-1 routing concerns:
        1. Less routing expressiveness (16 choices vs C(16,2)=120 for top-2)
        2. Each token gets a single expert's perspective
        3. Load balancing is harder (each token must go to exactly one expert)

        Mitigations in Llama 4:
        1. Shared expert: a dense FFN processed by ALL tokens, providing
           a common baseline that experts specialize on top of
        2. More layers (48): each token sees 48 different expert selections
        3. Larger expert FFN (8192 intermediate): each expert has more capacity
        """
        pass
```
Shared Expert Architecture
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Standard SwiGLU FFN: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Llama4MoELayer(nn.Module):
    """
    Llama 4 MoE layer with shared expert.
    The shared expert processes ALL tokens, providing a common
    baseline representation. The routed expert adds specialization.
    """
    def __init__(self, config):
        super().__init__()
        self.num_experts = config['num_experts']
        # Router: a single linear scoring head over the experts
        self.router = nn.Linear(config['hidden_size'], self.num_experts, bias=False)
        # Shared expert: processed by every token
        self.shared_expert = SwiGLUExpert(
            config['hidden_size'],
            config['expert_intermediate'],
        )
        # Routed experts: one selected per token
        self.routed_experts = nn.ModuleList([
            SwiGLUExpert(config['hidden_size'], config['expert_intermediate'])
            for _ in range(self.num_experts)
        ])

    def forward(self, x):
        # x: [num_tokens, hidden_size] (batch and sequence dims flattened)
        num_tokens, hidden_dim = x.shape
        # Shared expert: all tokens
        shared_output = self.shared_expert(x)
        # Router: select one expert per token
        router_logits = self.router(x)
        routing_weights = torch.softmax(router_logits, dim=-1)
        top1_weight, top1_index = routing_weights.max(dim=-1)
        # Routed expert computation: gather each expert's tokens
        routed_output = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = (top1_index == e)
            if mask.any():
                routed_output[mask] = self.routed_experts[e](x[mask])
        # Combine: shared + weighted routed
        return shared_output + top1_weight.unsqueeze(-1) * routed_output
```
The shared expert effectively doubles the active parameters without doubling the routing complexity. Each token processes both the shared expert (always active) and one routed expert (selected by router), giving 2x the FFN compute of pure top-1 routing while maintaining the serving simplicity of a single routing decision.
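The 2x claim is simple arithmetic: with one shared and one routed SwiGLU expert of equal size, each token's FFN compute is exactly double that of pure top-1 routing. A quick check using Scout's dimensions:

```python
# Per-token FFN compute with and without the shared expert (Scout dims).
H, FFN = 5120, 8192                   # hidden size, expert intermediate dim
expert_flops = 3 * H * FFN * 2        # gate, up, down projections; 2 FLOPs per MAC

pure_top1 = expert_flops              # one routed expert only
with_shared = 2 * expert_flops        # shared expert + one routed expert

print(with_shared / pure_top1)        # 2.0
```

Crucially the extra compute needs no extra routing decision, so the dispatch and load-balancing machinery stays as simple as pure top-1.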
Multimodal Architecture
```python
import torch.nn as nn

class Llama4VisionEncoder(nn.Module):
    """
    Llama 4's vision encoder processes images into token embeddings
    that are interleaved with text tokens in the main transformer.
    (Schematic: patch_embed and vision_transformer below stand in
    for the MetaCLIP-based tower.)
    """
    def __init__(self):
        super().__init__()
        # MetaCLIP-based vision encoder
        self.patch_size = 14                 # 14x14 pixel patches
        self.image_size = 560                # 560x560 input images
        self.num_patches = (560 // 14) ** 2  # 1,600 patches
        self.vision_hidden = 1408            # Vision encoder hidden dim
        # Projection from vision to language space
        self.vision_projection = nn.Linear(1408, 5120)  # -> Llama hidden dim

    def encode_image(self, image):
        """
        image: [batch, 3, 560, 560]
        Returns: [batch, 1600, 5120] — image tokens in language space
        """
        # Patch embedding + vision transformer (schematic)
        patches = self.patch_embed(image)                # [batch, 1600, 1408]
        vision_features = self.vision_transformer(patches)
        # Project to language model dimension
        return self.vision_projection(vision_features)   # [batch, 1600, 5120]

    def interleave_with_text(self, text_tokens, image_tokens, image_positions):
        """
        Insert image tokens at specified positions in a single text sequence.
        text_tokens: [text_len, 5120], image_tokens: [1600, 5120]
        """
        combined = []
        for pos, token in enumerate(text_tokens):
            if pos in image_positions:
                combined.extend(image_tokens)  # splice in all 1,600 image tokens
            combined.append(token)
        return combined
```
Serving Performance
Llama 4 Serving Characteristics
| Model | Memory (FP16) | Memory (INT4) | Tokens/s (bs=1) | Min Hardware |
|---|---|---|---|---|
| Scout (109B) | 218 GB | 55 GB | 68 | 1x H100-80G (INT4) |
| Maverick (400B) | 800 GB | 200 GB | 52 | 4x H100-80G (INT4) |
| Llama 3.1 405B (dense) | 810 GB | 203 GB | 18 | 4x H100-80G (INT4) |
| DeepSeek V3 (671B) | 1342 GB | 336 GB | 42 | 8x H100-80G (INT4) |
Llama 4 Scout at INT4 fits on a single H100-80G and generates at 68 tokens/s — compared to Llama 3.1 405B which requires 4 GPUs and generates at 18 tokens/s. The MoE transition gives a 15x improvement in throughput-per-GPU while achieving competitive quality. This is the core motivation for Meta’s shift to MoE.
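The 15x figure follows directly from the serving table: throughput per GPU is tokens/s divided by the number of GPUs each deployment needs. A quick check:

```python
# Throughput-per-GPU from the serving table (bs=1, INT4 deployments).
scout = 68 / 1    # Scout: 68 tok/s on 1x H100
dense = 18 / 4    # Llama 3.1 405B: 18 tok/s on 4x H100

print(f"{scout / dense:.1f}x")   # ~15.1x tokens/s per GPU
```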
Quality Results
Llama 4 vs Competitors
| Model | Active Params | MMLU | HumanEval | MATH | Vision (MMMU) |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 79.6 | 59.4 | 68.2 | 69.4 |
| Llama 4 Maverick | 17B | 82.1 | 68.3 | 73.1 | 73.4 |
| Gemma 2 27B | 27B | 75.2 | 51.8 | 52.1 | N/A |
| DeepSeek V3 | 37B | 82.4 | 71.2 | 89.1 | N/A |
| GPT-4o | ~20B* | 88.7 | 90.2 | 76.6 | 69.1 |
| Llama 3.1 405B | 405B | 87.3 | 89.0 | 73.8 | N/A |

*Estimated active parameters; OpenAI does not disclose GPT-4o's architecture.
Quality per Active Parameter (MMLU / Active Params in B)
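The ratio named in this caption can be computed from the benchmark table above; a quick sketch (GPT-4o omitted since its active-parameter count is only an estimate):

```python
# MMLU per billion active parameters, from the benchmark table above.
models = {
    'Llama 4 Scout': (79.6, 17),
    'Llama 4 Maverick': (82.1, 17),
    'Gemma 2 27B': (75.2, 27),
    'DeepSeek V3': (82.4, 37),
    'Llama 3.1 405B': (87.3, 405),
}
ratios = {name: mmlu / active_b for name, (mmlu, active_b) in models.items()}
for name, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {r:5.2f} MMLU/B active")
# The two Llama 4 models lead by a wide margin; the dense 405B trails at ~0.22.
```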
Training Infrastructure for Llama 4
```python
def llama4_training_details():
    """Meta trained Llama 4 on their Grand Teton clusters."""
    training_config = {
        'gpu_cluster': '24,576 H100 GPUs',
        'interconnect': 'NVLink + InfiniBand NDR 400G',
        'training_precision': 'BF16 with FP8 expert computation',
        'sequence_length': 32768,      # Training context
        'global_batch_tokens': 16_000_000,
        # Parallelism for MoE
        'data_parallel': 256,
        'tensor_parallel': 8,          # Within node
        'expert_parallel': 16,         # Experts distributed across GPUs
        'pipeline_parallel': 1,        # No PP for MoE (complex scheduling)
        # Training data
        'pretraining_tokens': '~30T estimated',
        'multimodal_tokens': '~5T (image-text pairs)',
        'post_training': 'RLHF + DPO',
    }
    return training_config
```
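The batch figures above imply the rough shape of the training run; a sanity check, taking the ~30T token estimate at face value:

```python
# Implied training-step arithmetic from the configuration above.
seq_len = 32_768
global_batch_tokens = 16_000_000
pretraining_tokens = 30e12   # ~30T (estimated, per the config above)

seqs_per_step = global_batch_tokens / seq_len       # sequences per optimizer step
steps = pretraining_tokens / global_batch_tokens    # total optimizer steps

print(f"{seqs_per_step:.0f} sequences/step, {steps / 1e6:.2f}M steps")
# ~488 sequences/step, ~1.88M steps
```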
Llama Generation Evolution: Dense to MoE
| Model | Total Params | Active Params | GPUs to Serve | tok/s (bs=1) |
|---|---|---|---|---|
| Llama 1 65B (2023) | 65B | 65B | 2x A100 | 22 |
| Llama 2 70B (2023) | 70B | 70B | 2x A100 | 20 |
| Llama 3 70B (2024) | 70B | 70B | 1x H100 | 32 |
| Llama 3.1 405B (2024) | 405B | 405B | 8x H100 | 12 |
| Llama 4 Scout (2025) | 109B | 17B | 1x H100 (INT4) | 68 |
| Llama 4 Maverick (2025) | 400B | 17B | 4x H100 (INT4) | 52 |
The trajectory is clear: Llama 3.1 405B was Meta’s peak dense model. Scaling further with dense transformers would have required 800B+ parameters with proportionally more GPUs. MoE allows Llama 4 Maverick to have comparable total parameters (400B) while requiring only 17B active compute per token — a 24x reduction in serving FLOPs compared to the dense predecessor.
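The 24x figure comes from the standard decode approximation of ~2 FLOPs per active parameter per generated token (ignoring attention over the KV cache):

```python
# Serving FLOPs per generated token ~ 2 FLOPs per active parameter.
dense_active = 405e9   # Llama 3.1 405B
moe_active = 17e9      # Llama 4 Maverick

dense_flops = 2 * dense_active   # ~810 GFLOPs/token
moe_flops = 2 * moe_active       # ~34 GFLOPs/token

print(f"{dense_flops / moe_flops:.1f}x")   # ~23.8x fewer FLOPs per token
```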
Competitive Positioning
```python
def llama4_competitive_analysis():
    """Where Llama 4 sits in the frontier model landscape."""
    positioning = {
        'vs_deepseek_v3': {
            'advantage': 'Open-weight, multimodal, simpler routing (top-1)',
            'disadvantage': 'Lower quality on text benchmarks (82.1 vs 82.4 MMLU)',
            'note': 'DeepSeek uses top-8 routing for better quality',
        },
        'vs_gpt4o': {
            'advantage': 'Open-weight, free to deploy, customizable',
            'disadvantage': 'Lower quality across most benchmarks',
            'note': 'Cost advantage for self-hosted deployment',
        },
        'vs_mixtral': {
            'advantage': 'Native multimodal, larger expert pool, longer context',
            'disadvantage': 'More memory for Maverick variant',
            'note': 'Scout is the natural Mixtral successor',
        },
    }
    return positioning
```
Llama 4 demonstrates that Meta’s MoE transition was the right engineering decision. Scout delivers 79.6 MMLU with 17B active parameters on a single GPU, while the dense predecessor Llama 3.1 405B required 4 GPUs for 87.3 MMLU. Maverick pushes to 82.1 MMLU while maintaining the same 17B active footprint by scaling total parameters to 400B across 128 experts. The addition of native vision capabilities positions Llama 4 against multimodal competitors like GPT-4o. The key architectural choice — top-1 routing with shared experts — prioritizes serving simplicity over routing expressiveness, a pragmatic decision for Meta’s goal of widespread open-source deployment.