Mixture-of-Experts (MoE) models achieve the quality of much larger dense models while activating only a fraction of parameters per token. Mixtral 8x7B has 47B total parameters but activates only 13B per token. DeepSeek-V3 has 671B total but activates 37B per token. This makes MoE models appear cheaper to serve — fewer active parameters means fewer FLOPs. But serving cost is not just FLOPs. Memory, load balancing, batch efficiency, and GPU utilization all differ between MoE and dense architectures. This post provides a quantitative comparison across the metrics that determine production serving cost.
Memory Requirements
The first difference: MoE models must load ALL parameters into memory, not just the active ones.
```python
def compute_model_memory(
    total_params_b: float,
    active_params_b: float,
    dtype_bytes: int,
) -> dict:
    """Compute memory requirements for MoE vs dense models.

    Dense: all parameters are active, so memory = total_params * dtype_bytes.
    MoE: every expert's weights must be resident in GPU memory,
    even though only K experts activate per token.
    """
    total_memory = total_params_b * 1e9 * dtype_bytes
    active_memory = active_params_b * 1e9 * dtype_bytes
    return {
        "total_memory_gb": total_memory / 1e9,
        "active_compute_params_gb": active_memory / 1e9,
        "memory_to_compute_ratio": total_memory / active_memory,
    }
```
Memory vs Active Parameters
| Model | Total Params | Active Params | FP16 Memory (GB) | Memory/Compute Ratio |
|---|---|---|---|---|
| Llama 70B (Dense) | 70B | 70B | 140 | 1.0x |
| Mixtral 8x7B (MoE) | 47B | 13B | 94 | 3.6x |
| DeepSeek-V3 (MoE) | 671B | 37B | 1,342 | 18.1x |
| Llama 405B (Dense) | 405B | 405B | 810 | 1.0x |
| Qwen2-57B-A14B (MoE) | 57B | 14B | 114 | 4.1x |
DeepSeek-V3 activates only 37B parameters per token (comparable to Llama 70B in compute) but requires 1.34 TB of memory to store all 671B parameters. This means:
- Llama 70B: fits on 2x A100-80GB (TP=2)
- DeepSeek-V3: requires 17x A100-80GB minimum (even with TP)
The memory-to-compute ratio is the critical MoE disadvantage. DeepSeek-V3 uses 18x more memory than a dense model with equivalent per-token compute (37B active). This means 18x more GPU memory for the same throughput per token, unless expert parallelism is used effectively.
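The ratio itself is just total over active parameters; the dtype cancels out. A minimal check against the table's values:

```python
def memory_to_compute_ratio(total_params_b: float,
                            active_params_b: float) -> float:
    # dtype cancels: (total * bytes/param) / (active * bytes/param)
    return total_params_b / active_params_b

print(round(memory_to_compute_ratio(47, 13), 1))   # Mixtral 8x7B -> 3.6
print(round(memory_to_compute_ratio(671, 37), 1))  # DeepSeek-V3 -> 18.1
```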
FLOPs per Token
The compute advantage of MoE is straightforward:
```python
def flops_per_token(hidden_dim: int, num_layers: int,
                    ffn_dim: int, num_active_experts: int,
                    is_moe: bool, num_experts: int = 8) -> float:
    """Estimate forward-pass FLOPs per token (projection matmuls only)."""
    # Attention: QKV projections (3 * hidden^2 params) + output projection
    # (hidden^2 params) at 2 FLOPs/param -> 8 * hidden^2 per layer
    attn_flops = 8 * hidden_dim ** 2
    if is_moe:
        # MoE FFN: only the K active experts run. Each SwiGLU expert has
        # 3 * hidden * ffn params (gate + up + down).
        ffn_flops = num_active_experts * 3 * hidden_dim * ffn_dim * 2
        # Router: one (hidden x num_experts) matmul per token -- negligible
        ffn_flops += 2 * hidden_dim * num_experts
    else:
        # Dense SwiGLU FFN: 3 * hidden * ffn params -> 6 * hidden * ffn FLOPs
        ffn_flops = 3 * hidden_dim * ffn_dim * 2
    return num_layers * (attn_flops + ffn_flops)

# Comparison (per-token forward FLOPs)
llama_70b_flops = flops_per_token(8192, 80, 28672, 1, False)
# = 80 * (8 * 8192^2 + 3 * 8192 * 28672 * 2)
# = 80 * (536M + 1.41B) ~= 155.6 GFLOPs
mixtral_flops = flops_per_token(4096, 32, 14336, 2, True)
# = 32 * (8 * 4096^2 + 2 * 3 * 4096 * 14336 * 2)
# = 32 * (134M + 706M) ~= 26.9 GFLOPs
deepseek_v3_flops = flops_per_token(7168, 61, 18432, 8, True)
# = 61 * (8 * 7168^2 + 8 * 3 * 7168 * 18432 * 2)
# = 61 * (411M + 6.34B) ~= 411.8 GFLOPs
```
GFLOPs per Token (Forward Pass)
Throughput: Decode Phase
During decode (autoregressive token generation), the operation is memory-bandwidth bound for both MoE and dense models. The key metric is how many bytes must be loaded from GPU memory per token:
```python
def decode_bytes_per_token(total_params_b: float,
                           active_params_b: float,
                           dtype_bytes: int,
                           is_moe: bool) -> float:
    """Bytes loaded from HBM per decode token."""
    if is_moe:
        # Load the shared attention weights plus only the ACTIVE experts'
        # weights; inactive experts stay in HBM untouched. (In practice the
        # active expert weights are scattered in memory, which costs some
        # bandwidth efficiency.)
        return active_params_b * 1e9 * dtype_bytes
    # Dense: every weight is read for every token.
    return total_params_b * 1e9 * dtype_bytes
```
For batch size 1 decode (pure memory-bandwidth bound):
Decode Throughput at Batch Size 1 (A100-80GB, 2 TB/s per GPU)
| Model | Bytes to Load | Theoretical tok/s | Measured tok/s | Efficiency |
|---|---|---|---|---|
| Mixtral 8x7B (FP16) | 26 GB | 77 | 68 | 88% |
| Llama 70B (FP16, TP=4) | 140 GB | 57* | 48 | 84% |
| Llama 70B (INT4, TP=2) | 35 GB | 114* | 95 | 83% |
| DeepSeek-V3 (FP8, TP=8) | 42 GB | 381* | 285 | 75% |
| Llama 405B (INT4, TP=8) | 203 GB | 79* | 62 | 78% |
*Theoretical throughput for multi-GPU configurations uses aggregate bandwidth across GPUs.
At batch size 1, MoE models benefit from loading only active expert weights. Mixtral loads 26 GB per token vs Llama 70B’s 140 GB (or 70 GB per GPU with TP=2). But the comparison changes at larger batch sizes where compute, not bandwidth, is the bottleneck.
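The theoretical column falls out of a one-line bandwidth model (a sketch; 2.0 TB/s per A100 is the spec-sheet figure, and per the footnote, multi-GPU configurations count aggregate bandwidth):

```python
def decode_tokens_per_sec(bytes_per_token_gb: float,
                          num_gpus: int,
                          bw_per_gpu_tbs: float = 2.0) -> float:
    """Bandwidth-bound decode ceiling at batch size 1."""
    aggregate_bw_bytes = num_gpus * bw_per_gpu_tbs * 1e12
    return aggregate_bw_bytes / (bytes_per_token_gb * 1e9)

print(round(decode_tokens_per_sec(26, 1)))   # Mixtral, 26 GB on 1 GPU -> 77
print(round(decode_tokens_per_sec(140, 4)))  # 140 GB FP16 across 4 GPUs -> 57
```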
Throughput: Prefill Phase
During prefill, the operation is compute-bound. Here, active parameters determine throughput:
```python
# Prefill throughput (tokens/s) = GPU FLOPS / FLOPs per token
# A100: 312 TFLOPS FP16 Tensor Core

# Mixtral 8x7B:
#   312 TFLOPS / 26.9 GFLOPs/token = 11.6 tokens/ms
#   For a 2048-token prompt: 2048 / 11.6 = 177 ms (single GPU)

# Llama 70B (TP=4, 1,248 TFLOPS aggregate):
#   1,248 TFLOPS / 155.6 GFLOPs/token = 8.0 tokens/ms
#   For a 2048-token prompt: 2048 / 8.0 = 256 ms
```
Prefill Latency for 2048-Token Prompt
| Model | GPU Config | Aggregate TFLOPS | FLOPs/Token | Prefill Time (ms) |
|---|---|---|---|---|
| Mixtral 8x7B | 2xA100 | 624 | 26.9 G | 88 |
| Llama 70B | 4xA100 | 1,248 | 155.6 G | 256 |
| DeepSeek-V3 | 8xH100 | 7,920 | 411.8 G | 106 |
| Llama 405B | 8xH100 | 7,920 | 892 G | 230 |
| Llama 70B INT4 Marlin | 4xA100 | 1,248 | ~80 G effective | 131 |
MoE models have faster prefill per GPU-dollar because they use fewer FLOPs per token. Mixtral achieves 88ms prefill on 2 GPUs while Llama 70B needs 256ms on 4 GPUs.
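The prefill times above come from a simple compute-bound model (a sketch assuming perfect tensor-core utilization, so real numbers run somewhat higher):

```python
def prefill_ms(prompt_tokens: int, gflops_per_token: float,
               aggregate_tflops: float) -> float:
    """Compute-bound prefill latency in milliseconds."""
    total_flops = prompt_tokens * gflops_per_token * 1e9
    return total_flops / (aggregate_tflops * 1e12) * 1e3

print(round(prefill_ms(2048, 26.9, 624)))    # Mixtral 8x7B, 2xA100 -> 88
print(round(prefill_ms(2048, 411.8, 7920)))  # DeepSeek-V3, 8xH100 -> 106
```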
The Expert Load Balancing Problem
MoE models have a unique serving challenge: expert load imbalance degrades batch throughput.
```python
# In a batch of 64 tokens with E=8 experts and K=2 active per token:
#   Ideal: each expert processes 64 * (K/E) = 64 * (2/8) = 16 tokens
#   Reality: some experts get 30 tokens, others get 2
#
# The batch proceeds at the speed of the most loaded expert.
# If expert 3 gets 30 tokens while the balanced load is 16:
#   - Expert 3 takes ~1.9x longer than a balanced expert would
#   - Every other GPU waits for expert 3 to finish
#   - Effective throughput drops by ~47%
```
```python
import torch

def compute_load_imbalance_cost(
    batch_size: int,
    num_experts: int,
    active_experts: int,
    routing_probs: torch.Tensor,  # shape: (batch_size, num_experts)
) -> float:
    """Return achieved/ideal throughput ratio (1.0 = perfectly balanced)."""
    # Simulate top-K routing
    top_k_indices = torch.topk(routing_probs, active_experts, dim=-1).indices
    expert_loads = torch.bincount(top_k_indices.flatten(),
                                  minlength=num_experts)
    # Imbalance factor: most loaded expert vs the ideal uniform load
    avg_load = batch_size * active_experts / num_experts
    imbalance = expert_loads.max().item() / avg_load
    # The batch finishes only when the most loaded expert finishes
    return 1.0 / imbalance
```
Expert Load Imbalance Impact on Throughput
| Batch Size | Avg Tokens/Expert | Max Tokens/Expert | Imbalance Factor | Throughput Loss |
|---|---|---|---|---|
| 16 | 4.0 | 8 | 2.0x | 50% |
| 64 | 16.0 | 24 | 1.5x | 33% |
| 256 | 64.0 | 82 | 1.28x | 22% |
| 1024 | 256.0 | 290 | 1.13x | 12% |
| 4096 | 1024.0 | 1080 | 1.05x | 5% |
At small batch sizes, expert imbalance is devastating. At batch size 16, the worst-case expert gets 2x the average load, cutting throughput in half. This is a strong argument for dense models in low-batch-size (latency-sensitive) scenarios.
Expert load balancing improves with batch size due to the law of large numbers. At batch size 1024+, the imbalance factor approaches 1.0 and MoE throughput reaches its theoretical peak. For high-throughput offline processing with large batches, MoE models are more cost-efficient than dense. For interactive serving with small batches (under 32), dense models avoid the imbalance penalty entirely.
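This law-of-large-numbers effect is easy to see with a Monte Carlo sketch that routes each token to K=2 of E=8 experts uniformly at random (real learned routers are typically more skewed than uniform, so treat these factors as optimistic):

```python
import random

def imbalance_factor(batch_size: int, num_experts: int = 8,
                     top_k: int = 2, seed: int = 0) -> float:
    """Max expert load divided by the ideal uniform load."""
    rng = random.Random(seed)
    loads = [0] * num_experts
    for _ in range(batch_size):
        # Each token is routed to top_k distinct experts
        for e in rng.sample(range(num_experts), top_k):
            loads[e] += 1
    avg_load = batch_size * top_k / num_experts
    return max(loads) / avg_load

for b in (16, 64, 256, 1024, 4096):
    print(b, round(imbalance_factor(b), 2))  # factor shrinks toward 1.0
```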
Expert Parallelism
MoE models support a unique parallelism strategy: distributing experts across GPUs:
```python
# Expert Parallelism (EP):
#   8 experts across 8 GPUs: each GPU holds 1 expert (per layer)
#   All-to-all communication routes each token to its experts' GPUs
#
# Tensor Parallelism (TP):
#   Every weight matrix sharded across GPUs
#   All-reduce after each layer
#
# EP vs TP communication pattern:
#   EP: all-to-all (each GPU sends to / receives from all others)
#   TP: all-reduce (partial results summed across all GPUs)
#
# EP communication volume per MoE layer:
#   Each token sends its hidden state to each of its K experts.
#   For batch=256, hidden=4096, FP16, K=2:
#   256 * 4096 * 2 bytes * 2 experts = 4 MB (one way)
#
# TP all-reduce volume per layer:
#   256 * 4096 * 2 bytes = 2 MB (but 2 all-reduces per layer)
```
Parallelism Strategy Comparison — Mixtral 8x7B on 8 GPUs
| Strategy | Expert Placement | Communication | Throughput (tok/s) | Memory/GPU |
|---|---|---|---|---|
| EP=8 | 1 expert/GPU | All-to-all | 14,200 | 11.75 GB |
| TP=8 | All experts sharded | All-reduce | 12,800 | 11.75 GB |
| EP=4, TP=2 | 2 experts/GPU pair | Hybrid | 13,500 | 11.75 GB |
| EP=2, TP=4 | 4 experts/GPU quad | Hybrid | 13,100 | 11.75 GB |
| Replicated (2 GPU, INT8) | All on each | None | 8,400 | 47 GB |
EP=8 provides the best throughput because all-to-all communication has lower latency than all-reduce for small message sizes (typical in MoE serving).
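The per-layer all-to-all volume computed above, as a runnable check (one-way dispatch volume; the return path after expert computation doubles it):

```python
def ep_all_to_all_mb(batch: int, hidden: int,
                     dtype_bytes: int = 2, top_k: int = 2) -> float:
    """One-way token-dispatch volume per MoE layer, in MiB."""
    return batch * hidden * dtype_bytes * top_k / 2**20

print(ep_all_to_all_mb(256, 4096))  # batch=256, hidden=4096, FP16, K=2 -> 4.0
```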
Cost-Efficiency Comparison
The ultimate metric: quality-adjusted cost per million tokens.
```python
def cost_per_million_tokens(
    num_gpus: int,
    gpu_cost_per_hour: float,
    throughput_tok_per_sec: float,
) -> float:
    """Dollar cost per million generated tokens."""
    total_hourly = num_gpus * gpu_cost_per_hour
    tokens_per_hour = throughput_tok_per_sec * 3600
    return (total_hourly / tokens_per_hour) * 1e6
```
Serving Cost — Quality-Matched Models
| Model | GPUs | Hourly Cost | Throughput (tok/s) | $/M Tokens |
|---|---|---|---|---|
| Llama 70B FP16 | 4xA100 | $12.00 | 5,120 | $0.65 |
| Llama 70B INT4 | 4xA100 | $12.00 | 8,350 | $0.40 |
| Mixtral 8x7B FP16 | 2xA100 | $6.00 | 6,800 | $0.24 |
| Mixtral 8x7B INT4 | 2xA100 | $6.00 | 9,200 | $0.18 |
| DeepSeek-V3 FP8 | 8xH100 | $40.00 | 18,500 | $0.60 |
| Llama 405B FP8 | 8xH100 | $40.00 | 7,200 | $1.54 |
Cost per Million Tokens (Lower is Better)
Mixtral 8x7B INT4 at $0.18/M tokens is the cheapest option for Llama-70B-quality output. The key: it fits on 2 GPUs instead of 4, halving the hourly cost, while achieving higher throughput due to fewer active FLOPs.
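Reproducing that row with the cost formula (restated so the snippet runs standalone; the $3.00/hr A100 rate is this table's assumption, not a quoted price):

```python
def cost_per_million_tokens(num_gpus: int, gpu_cost_per_hour: float,
                            throughput_tok_per_sec: float) -> float:
    hourly = num_gpus * gpu_cost_per_hour
    tokens_per_hour = throughput_tok_per_sec * 3600
    return hourly / tokens_per_hour * 1e6

# Mixtral 8x7B INT4: 2xA100 at $3.00/hr each, 9,200 tok/s sustained
print(round(cost_per_million_tokens(2, 3.00, 9200), 2))  # -> 0.18
```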
Latency Comparison
For interactive applications, TTFT and TPOT matter more than throughput:
Latency Comparison — Single Request (No Batching)
| Model | GPUs | TTFT (512 input) | TPOT | Total for 256 Output Tokens |
|---|---|---|---|---|
| Mixtral 8x7B | 2xA100 | 42 ms | 14.7 ms | 3.81 s |
| Llama 70B | 4xA100 | 62 ms | 20.8 ms | 5.39 s |
| DeepSeek-V3 | 8xH100 | 35 ms | 5.4 ms | 1.42 s |
| Llama 405B | 8xH100 | 68 ms | 16.1 ms | 4.18 s |
| Llama 70B INT4 | 2xA100 | 48 ms | 10.5 ms | 2.74 s |
MoE models have lower TPOT because fewer parameter bytes are read per decode step. DeepSeek-V3 on 8xH100 achieves 5.4 ms TPOT despite its 671B total size because only 37B parameters are active per token: with expert parallelism, each decode step reads roughly 37-42 GB of FP8 weights against the GPUs' aggregate bandwidth (8 x 3.35 TB/s), with the remainder of the step going to all-to-all communication and kernel overhead.
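End-to-end request latency is TTFT plus TPOT times output length; a quick check against the Llama 70B row:

```python
def request_latency_s(ttft_ms: float, tpot_ms: float,
                      output_tokens: int) -> float:
    return (ttft_ms + output_tokens * tpot_ms) / 1e3

# Llama 70B on 4xA100: 62 ms TTFT, 20.8 ms TPOT, 256 output tokens
print(round(request_latency_s(62, 20.8, 256), 2))  # -> 5.39
```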
When Each Architecture Wins
```python
decision_matrix = {
    "MoE wins": [
        "High-throughput batch processing (batch > 128)",
        "Cost-sensitive deployment (minimize $/token)",
        "Quality needs exceed what smaller dense models offer",
        "Sufficient GPU memory for full model",
        "Expert parallelism across many GPUs is available",
    ],
    "Dense wins": [
        "Low-latency interactive serving (batch < 16)",
        "Memory-constrained deployment (fewer GPUs)",
        "Simple infrastructure (no expert routing)",
        "Quantized to fit in fewer GPUs (INT4 on 1-2 GPUs)",
        "Consistent per-request latency required",
    ],
    "Either works": [
        "Medium batch sizes (16-128)",
        "Flexible GPU budget",
        "Quality comparable between available MoE and dense options",
    ],
}
```
The simplest decision rule: if you need Llama 70B quality and have 2 GPUs, use Mixtral 8x7B (MoE). If you have 4+ GPUs, quantized Llama 70B (dense) is simpler to operate. If you need 405B quality, DeepSeek-V3 (MoE) is 2-3x cheaper to serve than Llama 405B (dense) but requires 8+ GPUs for the full model.
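That rule of thumb, encoded as a sketch (the thresholds are this post's heuristics, not universal constants, and `pick_architecture` is a hypothetical helper name):

```python
def pick_architecture(quality_tier: str, num_gpus: int) -> str:
    """Heuristic from the decision rule above (post's thresholds, not universal)."""
    if quality_tier == "70b":
        # <4 GPUs: Mixtral fits on 2 and is cheapest;
        # 4+: quantized dense is simpler to operate at similar cost
        return "Mixtral 8x7B (MoE)" if num_gpus < 4 else "Llama 70B INT4 (dense)"
    if quality_tier == "405b":
        return "DeepSeek-V3 (MoE)" if num_gpus >= 8 else "insufficient GPU memory"
    raise ValueError(f"unknown quality tier: {quality_tier}")

print(pick_architecture("70b", 2))  # -> Mixtral 8x7B (MoE)
```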
Operational Complexity
MoE models add serving complexity beyond the raw cost numbers:
```python
operational_overhead = {
    "Dense": {
        "load_balancing": "Simple: all GPUs do the same work",
        "debugging": "Straightforward: one execution path",
        "quantization": "Well-supported (GPTQ, AWQ, FP8)",
        "monitoring": "Standard GPU metrics",
        "scaling": "Add TP or PP, well-understood",
    },
    "MoE": {
        "load_balancing": "Complex: expert routing causes uneven load",
        "debugging": "Harder: expert selection varies per token",
        "quantization": "Supported, but per-expert calibration needed",
        "monitoring": "Need per-expert utilization tracking",
        "scaling": "EP + TP + PP combinations, more tuning needed",
    },
}
```
Summary
MoE models are more cost-efficient than dense models at equivalent quality when serving at high batch sizes (128+), achieving 2-3x lower cost per million tokens. The advantage comes from fewer FLOPs per token (Mixtral: ~27 GFLOPs vs Llama 70B: ~156 GFLOPs) despite requiring more total GPU memory, since all experts must be resident. The critical weakness is expert load imbalance at small batch sizes, which can cut throughput by 50% at batch size 16. Dense models win at low batch sizes (interactive serving) thanks to consistent per-token compute cost and simpler infrastructure. Quantized dense models (Llama 70B INT4 on 2 GPUs) compete directly with MoE on cost while being simpler to operate. The decision reduces to three factors: batch size, GPU budget, and tolerance for operational complexity.