MoE vs Dense in Production: Serving Cost, Latency, and When Each Wins

Mixture-of-Experts (MoE) models achieve the quality of much larger dense models while activating only a fraction of parameters per token. Mixtral 8x7B has 47B total parameters but activates only 13B per token. DeepSeek-V3 has 671B total but activates 37B per token. This makes MoE models appear cheaper to serve — fewer active parameters means fewer FLOPs. But serving cost is not just FLOPs. Memory, load balancing, batch efficiency, and GPU utilization all differ between MoE and dense architectures. This post provides a quantitative comparison across the metrics that determine production serving cost.
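A quick back-of-the-envelope check, using the parameter counts quoted above, shows just how sparse the activation is:

```python
# Fraction of parameters touched per token (counts in billions, from above)
models = [("Mixtral 8x7B", 47, 13), ("DeepSeek-V3", 671, 37)]
for name, total_b, active_b in models:
    print(f"{name}: {active_b / total_b:.0%} of parameters active per token")
# Mixtral 8x7B: 28%, DeepSeek-V3: 6%
```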

Memory Requirements

The first difference: MoE models must load ALL parameters into memory, not just the active ones.

def compute_model_memory(
    total_params_b: float,
    active_params_b: float,
    dtype_bytes: int,
    num_experts: int,
    expert_params_b: float
) -> dict:
    """Compute memory requirements for MoE vs dense.

    Dense: all parameters are active, so memory = total_params * dtype_bytes.
    MoE: ALL expert weights must be resident in GPU memory, even though
    only the top-K experts activate for any given token.
    """
    total_memory = total_params_b * 1e9 * dtype_bytes
    active_memory = active_params_b * 1e9 * dtype_bytes
    # Memory occupied by expert weights alone (per-expert size * expert count)
    expert_memory = num_experts * expert_params_b * 1e9 * dtype_bytes

    return {
        "total_memory_gb": total_memory / 1e9,
        "active_compute_params_gb": active_memory / 1e9,
        "expert_memory_gb": expert_memory / 1e9,
        "memory_to_compute_ratio": total_memory / active_memory,
    }

Memory vs Active Parameters

| Model | Total Params | Active Params | FP16 Memory (GB) | Memory/Compute Ratio |
|---|---|---|---|---|
| Llama 70B (Dense) | 70B | 70B | 140 | 1.0x |
| Mixtral 8x7B (MoE) | 47B | 13B | 94 | 3.6x |
| DeepSeek-V3 (MoE) | 671B | 37B | 1,342 | 18.1x |
| Llama 405B (Dense) | 405B | 405B | 810 | 1.0x |
| Qwen2.5-MoE-57B | 57B | 14B | 114 | 4.1x |

DeepSeek-V3 activates only 37B parameters per token (comparable to Llama 70B in compute) but requires 1.34 TB of memory to store all 671B parameters. This means:

  • Llama 70B: fits on 2x A100-80GB (TP=2)
  • DeepSeek-V3: requires 17x A100-80GB minimum (even with TP)
⚠️ Warning

The memory-to-compute ratio is the critical MoE disadvantage. DeepSeek-V3 uses 18x more memory than a dense model with equivalent per-token compute (37B active). This means 18x more GPU memory for the same throughput per token, unless expert parallelism is used effectively.
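The GPU counts above follow from a simple calculation. This is a minimal sketch that counts weights only, ignoring KV cache and activation memory (which push real deployments higher):

```python
import math

def min_gpus(total_params_b: float, dtype_bytes: float,
             gpu_mem_gb: int = 80) -> int:
    """Minimum GPUs needed just to hold the weights (no KV cache)."""
    weights_gb = total_params_b * dtype_bytes  # billions of params * bytes each
    return math.ceil(weights_gb / gpu_mem_gb)

print(min_gpus(70, 2))   # Llama 70B FP16 on A100-80GB -> 2
print(min_gpus(671, 2))  # DeepSeek-V3 FP16 -> 17
```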

FLOPs per Token

The compute advantage of MoE is straightforward:

def flops_per_token(hidden_dim: int, num_layers: int,
                    ffn_dim: int, num_active_experts: int,
                    is_moe: bool, num_experts: int = 8) -> float:
    """Estimate FLOPs per token for the forward pass (2 FLOPs per MAC)."""
    # Attention: QKV + output projections ~= 4 * hidden^2 params
    # -> 2 * 4 * hidden^2 = 8 * hidden^2 FLOPs per layer
    attn_flops = 8 * hidden_dim ** 2

    if is_moe:
        # MoE FFN: only the top-K experts run per token
        # Each SwiGLU expert: 3 * hidden * ffn params (gate + up + down)
        ffn_flops = num_active_experts * 3 * hidden_dim * ffn_dim * 2
        # Router: one hidden x num_experts matmul -- negligible
        ffn_flops += 2 * hidden_dim * num_experts
    else:
        # Dense SwiGLU FFN: 3 * hidden * ffn params, x2 for FLOPs
        ffn_flops = 3 * hidden_dim * ffn_dim * 2

    return num_layers * (attn_flops + ffn_flops)

# Comparison
llama_70b_flops = flops_per_token(8192, 80, 28672, 1, False)
# = 80 * (8 * 8192^2 + 3 * 8192 * 28672 * 2)
# = 80 * (536M + 1.41B) ≈ 155.6 GFLOP per token

mixtral_flops = flops_per_token(4096, 32, 14336, 2, True)
# = 32 * (8 * 4096^2 + 2 * 3 * 4096 * 14336 * 2)
# = 32 * (134M + 706M) ≈ 26.9 GFLOP per token

deepseek_v3_flops = flops_per_token(7168, 61, 18432, 8, True)
# = 61 * (8 * 7168^2 + 8 * 3 * 7168 * 18432 * 2)
# = 61 * (411M + 6.34B) ≈ 411.8 GFLOP per token

GFLOPs per Token (Forward Pass)

| Model | GFLOPs/Token |
|---|---|
| Mixtral 8x7B | 26.9 |
| Qwen2.5-MoE-57B | 28.4 |
| Llama 70B | 155.6 |
| DeepSeek-V3 | 411.8 |
| Llama 405B | 892 |

Throughput: Decode Phase

During decode (autoregressive token generation), the operation is memory-bandwidth bound for both MoE and dense models. The key metric is how many bytes must be loaded from GPU memory per token:

def decode_bytes_per_token(total_params_b: float,
                           active_params_b: float,
                           dtype_bytes: int,
                           is_moe: bool) -> float:
    """Bytes loaded from HBM per decode token (batch size 1)."""
    if is_moe:
        # Attention weights (~30% of active params, shared by all tokens)
        # plus the active experts' weights (~70%) are streamed per token.
        # Inactive experts stay resident in HBM but are never read.
        return active_params_b * 1e9 * dtype_bytes
    else:
        # Dense: every weight participates in every token
        return total_params_b * 1e9 * dtype_bytes

For batch size 1 decode (pure memory-bandwidth bound):


Decode Throughput at Batch Size 1 (A100-80GB, 2 TB/s)

| Model | Bytes to Load | Theoretical tok/s | Measured tok/s | Efficiency |
|---|---|---|---|---|
| Mixtral 8x7B (FP16) | 26 GB | 77 | 68 | 88% |
| Llama 70B (FP16, TP=2) | 70 GB | 57* | 48 | 84% |
| Llama 70B (INT4, TP=1) | 17.5 GB | 114 | 95 | 83% |
| DeepSeek-V3 (FP8, TP=8) | 42 GB | 381* | 285 | 75% |
| Llama 405B (FP8, TP=8) | 203 GB | 79* | 62 | 78% |

*For multi-GPU configurations, theoretical throughput uses the aggregate bandwidth across all GPUs.

At batch size 1, MoE models benefit from loading only active expert weights. Mixtral loads 26 GB per token vs Llama 70B’s 140 GB (or 70 GB per GPU with TP=2). But the comparison changes at larger batch sizes where compute, not bandwidth, is the bottleneck.
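The theoretical column above is just bandwidth divided by bytes loaded. A minimal sketch, assuming the A100's 2 TB/s HBM bandwidth and ignoring kernel launch and communication overhead:

```python
def theoretical_decode_tok_s(bytes_per_token_gb: float,
                             bandwidth_tb_s: float = 2.0) -> float:
    """Upper bound on decode tokens/s when purely memory-bandwidth bound."""
    return (bandwidth_tb_s * 1e12) / (bytes_per_token_gb * 1e9)

# Mixtral 8x7B FP16: 13B active params * 2 bytes = 26 GB per token
print(round(theoretical_decode_tok_s(26)))  # -> 77 tok/s on one A100
```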

Throughput: Prefill Phase

During prefill, the operation is compute-bound. Here, active parameters determine throughput:

# Prefill throughput (tokens/s) = GPU FLOP/s / FLOPs per token
# A100: 312 TFLOPS FP16 Tensor Core

# Mixtral 8x7B:
# 312 TFLOPS / 26.9 GFLOP/token = 11.6 tokens/ms
# For a 2048-token prompt: 2048 / 11.6 = 177 ms

# Llama 70B (TP=4, 1248 TFLOPS aggregate):
# 1248 TFLOPS / 155.6 GFLOP/token = 8.0 tokens/ms
# For a 2048-token prompt: 2048 / 8.0 = 256 ms

Prefill Latency for 2048-Token Prompt

| Model | GPU Config | Aggregate TFLOPS | GFLOPs/Token | Prefill Time (ms) |
|---|---|---|---|---|
| Mixtral 8x7B | 2xA100 | 624 | 26.9 | 88 |
| Llama 70B | 4xA100 | 1,248 | 155.6 | 256 |
| DeepSeek-V3 | 8xH100 | 7,920 | 411.8 | 106 |
| Llama 405B | 8xH100 | 7,920 | 892 | 230 |
| Llama 70B INT4 Marlin | 4xA100 | 1,248 | ~80 | 131 |

MoE models have faster prefill per GPU-dollar because they use fewer FLOPs per token. Mixtral achieves 88ms prefill on 2 GPUs while Llama 70B needs 256ms on 4 GPUs.
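The prefill numbers follow from one formula. This sketch assumes 100% FLOP utilization, which real kernels do not reach, so treat it as a lower bound:

```python
def prefill_ms(prompt_tokens: int, gflops_per_token: float,
               aggregate_tflops: float) -> float:
    """Compute-bound prefill latency at perfect FLOP utilization."""
    total_tflop = prompt_tokens * gflops_per_token / 1e3
    return total_tflop / aggregate_tflops * 1e3  # seconds -> ms

# Mixtral 8x7B on 2xA100 (624 TFLOPS aggregate), 2048-token prompt
print(round(prefill_ms(2048, 26.9, 624)))  # -> 88 ms
```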

The Expert Load Balancing Problem

MoE models have a unique serving challenge: expert load imbalance degrades batch throughput.

# In a batch of 64 tokens (8 experts, top-2 routing):
# Ideal: each expert processes 64 * (K/E) = 64 * (2/8) = 16 tokens
# Reality: some experts get 30 tokens, others get 2

# The batch finishes at the speed of the MOST LOADED expert
# If expert 3 gets 30 tokens while the ideal load is 16:
# - Expert 3 takes ~1.9x longer than a balanced expert would
# - All other GPUs sit idle waiting for expert 3
# - Effective throughput drops by ~45%

import torch

def compute_load_imbalance_cost(
    batch_size: int,
    num_experts: int,
    active_experts: int,
    routing_probs: torch.Tensor  # [batch_size, num_experts]
) -> float:
    """Return the throughput ratio (1.0 = perfectly balanced experts)."""
    # Simulate top-K routing
    top_k_indices = torch.topk(routing_probs, active_experts, dim=-1).indices
    expert_loads = torch.zeros(num_experts)
    for i in range(batch_size):
        for k in range(active_experts):
            expert_loads[top_k_indices[i, k]] += 1

    # Imbalance factor: the batch finishes when the most-loaded expert does
    avg_load = batch_size * active_experts / num_experts
    max_load = expert_loads.max().item()
    imbalance = max_load / avg_load

    # Throughput scales inversely with the imbalance factor
    return 1.0 / imbalance

Expert Load Imbalance Impact on Throughput

| Batch Size | Avg Tokens/Expert | Max Tokens/Expert | Imbalance Factor | Throughput Loss |
|---|---|---|---|---|
| 16 | 4.0 | 8 | 2.0x | 50% |
| 64 | 16.0 | 24 | 1.5x | 33% |
| 256 | 64.0 | 82 | 1.28x | 22% |
| 1024 | 256.0 | 290 | 1.13x | 12% |
| 4096 | 1024.0 | 1080 | 1.05x | 5% |

At small batch sizes, expert imbalance is devastating. At batch size 16, the worst-case expert gets 2x the average load, cutting throughput in half. This is a strong argument for dense models in low-batch-size (latency-sensitive) scenarios.
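The trend in the table can be approximated with a quick Monte Carlo sketch. It assumes uniform random routing, which is optimistic: learned routers are typically more skewed, which is why training uses auxiliary load-balancing losses.

```python
import random

def mean_imbalance(batch_size: int, num_experts: int = 8,
                   top_k: int = 2, trials: int = 200, seed: int = 0) -> float:
    """Average (max load / ideal load) over random top-k routings."""
    rng = random.Random(seed)
    avg_load = batch_size * top_k / num_experts
    total = 0.0
    for _ in range(trials):
        loads = [0] * num_experts
        for _ in range(batch_size):
            # each token picks top_k distinct experts uniformly at random
            for e in rng.sample(range(num_experts), top_k):
                loads[e] += 1
        total += max(loads) / avg_load
    return total / trials

# Imbalance shrinks as batch size grows (law of large numbers)
print(mean_imbalance(16), mean_imbalance(1024))
```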

Performance

Expert load balancing improves with batch size due to the law of large numbers. At batch size 1024+, the imbalance factor approaches 1.0 and MoE throughput reaches its theoretical peak. For high-throughput offline processing with large batches, MoE models are more cost-efficient than dense. For interactive serving with small batches (under 32), dense models avoid the imbalance penalty entirely.

Expert Parallelism

MoE models support a unique parallelism strategy: distributing experts across GPUs:

# Expert Parallelism (EP):
# 8 experts across 8 GPUs: each GPU holds 1 expert
# All-to-all communication: each GPU sends tokens to the correct expert

# Tensor Parallelism (TP):
# Same model sharded across GPUs
# All-reduce after each layer

# EP vs TP communication pattern:
# EP: all-to-all (each GPU sends/receives from all others)
# TP: all-reduce (each GPU broadcasts result to all others)

# EP communication volume per layer:
# Each token sends hidden_dim * dtype_bytes to its experts
# For batch=256, hidden=4096, FP16:
# Total bytes = 256 * 4096 * 2 * 2 (K=2 experts) = 4 MB

# TP all-reduce volume per layer:
# Same hidden state reduced across all GPUs
# Total bytes = 256 * 4096 * 2 = 2 MB (but 2 all-reduces per layer)
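Under the assumptions above (batch 256, hidden 4096, FP16, K=2), the per-layer all-to-all dispatch volume works out as:

```python
def ep_all_to_all_mib(batch: int, hidden: int,
                      dtype_bytes: int = 2, top_k: int = 2) -> float:
    """Per-layer EP dispatch: each token's hidden state goes to K experts."""
    return batch * hidden * dtype_bytes * top_k / 2**20

print(ep_all_to_all_mib(256, 4096))  # -> 4.0 MiB
```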

Parallelism Strategy Comparison — Mixtral 8x7B on 8 GPUs

| Strategy | Expert Placement | Communication | Throughput (tok/s) | Memory/GPU |
|---|---|---|---|---|
| EP=8 | 1 expert/GPU | All-to-all | 14,200 | 11.75 GB |
| TP=8 | All experts sharded | All-reduce | 12,800 | 11.75 GB |
| EP=4, TP=2 | 2 experts/GPU pair | Hybrid | 13,500 | 23.5 GB |
| EP=2, TP=4 | 4 experts/GPU quad | Hybrid | 13,100 | 47 GB |
| Replicated (2 GPU) | All on each | None | 8,400 | 47 GB |

EP=8 provides the best throughput because all-to-all communication has lower latency than all-reduce for small message sizes (typical in MoE serving).

Cost-Efficiency Comparison

The ultimate metric: quality-adjusted cost per million tokens.

def cost_per_million_tokens(
    num_gpus: int,
    gpu_cost_per_hour: float,
    throughput_tok_per_sec: float
) -> float:
    total_hourly = num_gpus * gpu_cost_per_hour
    tokens_per_hour = throughput_tok_per_sec * 3600
    return (total_hourly / tokens_per_hour) * 1e6
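For example, applying the same formula to the Llama 70B FP16 configuration (4xA100 at the table's implied $3.00/GPU-hr, an assumed on-demand cloud rate):

```python
def dollars_per_million_tokens(num_gpus: int, gpu_cost_per_hour: float,
                               tok_per_sec: float) -> float:
    # Same formula as cost_per_million_tokens above, inlined for clarity
    return num_gpus * gpu_cost_per_hour / (tok_per_sec * 3600) * 1e6

# Llama 70B FP16 on 4xA100 at $3.00/GPU-hr, 5,120 tok/s
print(round(dollars_per_million_tokens(4, 3.00, 5120), 2))  # -> 0.65
```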

Serving Cost — Quality-Matched Models

| Model | GPUs | Hourly Cost | Throughput (tok/s) | $/M Tokens |
|---|---|---|---|---|
| Llama 70B FP16 | 4xA100 | $12.00 | 5,120 | $0.65 |
| Llama 70B INT4 | 4xA100 | $12.00 | 8,350 | $0.40 |
| Mixtral 8x7B FP16 | 2xA100 | $6.00 | 6,800 | $0.24 |
| Mixtral 8x7B INT4 | 2xA100 | $6.00 | 9,200 | $0.18 |
| DeepSeek-V3 FP8 | 8xH100 | $40.00 | 18,500 | $0.60 |
| Llama 405B FP8 | 8xH100 | $40.00 | 7,200 | $1.54 |


Mixtral 8x7B INT4 at $0.18/M tokens is the cheapest option for Llama-70B-quality output. The key: it fits on 2 GPUs instead of 4, halving the hourly cost, while achieving higher throughput due to fewer active FLOPs.

Latency Comparison

For interactive applications, time to first token (TTFT) and time per output token (TPOT) matter more than throughput:


Latency Comparison — Single Request (No Batching)

| Model | GPUs | TTFT (512 input) | TPOT | Total for 256 Output Tokens |
|---|---|---|---|---|
| Mixtral 8x7B | 2xA100 | 42 ms | 14.7 ms | 3.80 s |
| Llama 70B | 4xA100 | 62 ms | 20.8 ms | 5.39 s |
| DeepSeek-V3 | 8xH100 | 35 ms | 5.4 ms | 1.42 s |
| Llama 405B | 8xH100 | 68 ms | 16.1 ms | 4.18 s |
| Llama 70B INT4 | 2xA100 | 48 ms | 10.5 ms | 2.72 s |

MoE models have lower TPOT because fewer parameter bytes are loaded per decode step. DeepSeek-V3 on 8xH100 achieves 5.4 ms TPOT despite its 671B total size because only 37B parameters are active: at 3.35 TB/s per GPU (26.8 TB/s aggregate), streaming the 37 GB of active FP8 weights takes under 2 ms per token, with the remaining time going to all-to-all communication and kernel overhead.

When Each Architecture Wins

decision_matrix = {
    "MoE wins": [
        "High-throughput batch processing (batch > 128)",
        "Cost-sensitive deployment (minimize $/token)",
        "Quality needs exceed what smaller dense models offer",
        "Sufficient GPU memory for full model",
        "Expert parallelism across many GPUs is available",
    ],
    "Dense wins": [
        "Low-latency interactive serving (batch < 16)",
        "Memory-constrained deployment (fewer GPUs)",
        "Simple infrastructure (no expert routing)",
        "Quantized to fit in fewer GPUs (INT4 on 1-2 GPUs)",
        "Consistent per-request latency required",
    ],
    "Either works": [
        "Medium batch sizes (16-128)",
        "Flexible GPU budget",
        "Quality comparable between available MoE and dense options",
    ]
}
💡 Tip

The simplest decision rule: if you need Llama 70B quality and have 2 GPUs, use Mixtral 8x7B (MoE). If you have 4+ GPUs, quantized Llama 70B (dense) is simpler to operate. If you need 405B quality, DeepSeek-V3 (MoE) is 2-3x cheaper to serve than Llama 405B (dense) but requires 8+ GPUs for the full model.

Operational Complexity

MoE models add serving complexity beyond the raw cost numbers:

operational_overhead = {
    "Dense": {
        "load_balancing": "Simple — all GPUs do same work",
        "debugging": "Straightforward — one execution path",
        "quantization": "Well-supported (GPTQ, AWQ, FP8)",
        "monitoring": "Standard GPU metrics",
        "scaling": "Add TP or PP, well-understood",
    },
    "MoE": {
        "load_balancing": "Complex — expert routing causes uneven load",
        "debugging": "Harder — expert selection varies per token",
        "quantization": "Supported but per-expert calibration needed",
        "monitoring": "Need per-expert utilization tracking",
        "scaling": "EP + TP + PP combinations, more tuning needed",
    }
}

Summary

MoE models are more cost-efficient than dense models at equivalent quality when serving at high batch sizes (128+), achieving 2-3x lower cost per million tokens. The advantage comes from fewer FLOPs per token (Mixtral: 27 GFLOP vs Llama 70B: 156 GFLOP) despite requiring more total GPU memory (all experts loaded). The critical weakness is expert load imbalance at small batch sizes, which can cut throughput by 50% at batch size 16. Dense models win at low batch sizes (interactive serving) due to consistent per-token compute cost and simpler infrastructure. Quantized dense models (Llama 70B INT4 on 2 GPUs) compete directly with MoE on cost while being simpler to operate. The decision reduces to: batch size, GPU budget, and operational complexity tolerance.