Introduction
In 2019, Neural Architecture Search (NAS) was the frontier of automated model design. Researchers spent thousands of GPU hours searching for optimal convolutional architectures, and the results were impressive — NAS-discovered models outperformed human-designed ones on image classification benchmarks.
Then LLMs arrived, and NAS became irrelevant for the largest models almost overnight.
The reason is simple: NAS works by evaluating many candidate architectures, and evaluating a single LLM architecture costs millions of dollars. You cannot search over thousands of architectures when each evaluation requires training a model for weeks on thousands of GPUs. The field needed a different approach to architecture design, and it found one in scaling laws — mathematical relationships that predict model performance from a few key variables without exhaustive search.
This post traces that evolution: from why NAS was abandoned for LLMs, through the scaling laws that replaced it, to the concrete architecture decisions behind models like Llama 3 and DeepSeek-V3. We cover the tradeoffs that actually matter at scale — width vs depth, attention head count, FFN ratios, and when Mixture of Experts (MoE) beats dense architectures.
Why NAS Was Abandoned for LLMs
The Search Cost Problem
NAS methods vary in efficiency, but all require evaluating multiple architectures. Even the most efficient methods (DARTS, one-shot NAS) need at minimum tens of GPU-days to search a moderately complex space.
NAS Search Cost vs Model Scale
| Model Scale | Training Cost (1 arch) | NAS Candidates | Total Search Cost | Feasible? |
|---|---|---|---|---|
| ResNet-50 (25M params) | ~2 GPU-hours | 1,000 | 2,000 GPU-hours | Yes |
| EfficientNet (66M params) | ~12 GPU-hours | 500 | 6,000 GPU-hours | Yes (expensive) |
| GPT-2 (1.5B params) | ~200 GPU-hours | 100 | 20,000 GPU-hours | Barely |
| Llama 3 8B | ~50,000 GPU-hours | 50 | 2.5M GPU-hours | No |
| Llama 3 70B | ~1.5M GPU-hours | 10 | 15M GPU-hours | Absolutely not |
| Llama 3 405B | ~30M GPU-hours | 5 | 150M GPU-hours | Impossible |
At the 405B scale, even evaluating 5 architecture candidates would cost $150M+ in compute. This is not a practical search strategy. The field needed a way to predict the right architecture without training it first.
The Search Space Mismatch
NAS was designed to search over discrete choices in convolutional networks: which operations to use (3x3 conv, 5x5 conv, pooling, skip connection), how to connect them, and how many layers to stack. The search space was rich but bounded.
LLM architectures, by contrast, are surprisingly uniform. Nearly every competitive LLM since GPT-2 uses the same basic structure: stacked Transformer blocks with multi-head self-attention and feed-forward networks. The meaningful choices are continuous parameters (model width, depth, head dimension, FFN multiplier) rather than discrete operations. This makes the problem better suited to mathematical modeling than combinatorial search.
The irony of modern LLM design is that architecture matters less than it used to. Most performance gains come from scale (more parameters, more data, more compute) rather than architectural innovation. The Transformer is good enough that the main challenge is deciding how big to make it, not what shape it should be.
Scaling Laws: The Replacement for NAS
Kaplan et al. (2020): The First LLM Scaling Laws
The OpenAI scaling laws paper (Kaplan et al., 2020) established that LLM loss follows predictable power-law relationships with three variables: parameter count N, dataset size D, and compute budget C.
The key finding was that model performance improves smoothly and predictably as you scale up, following power laws with consistent exponents. This means you can train small models, fit the scaling curve, and extrapolate to predict the performance of much larger models.
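The fit-and-extrapolate workflow is simple enough to sketch. The loss values below are invented for illustration (real runs would supply them); the fit itself is an ordinary least-squares line in log-log space, which is the usual way power-law exponents are estimated:

```python
import math

# Invented illustrative losses for three small training runs; in practice
# each (N, loss) pair comes from actually training a model of that size.
observed = [
    (1e7, 5.10),
    (1e8, 4.17),
    (1e9, 3.41),
]

# Fit L(N) = a * N^(-alpha): a straight line in log-log space,
# log L = log a - alpha * log N, estimated by least squares.
xs = [math.log(n) for n, _ in observed]
ys = [math.log(loss) for _, loss in observed]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
        sum((x - x_mean) ** 2 for x in xs)
alpha = -slope
a = math.exp(y_mean + alpha * x_mean)

# Extrapolate two orders of magnitude beyond the largest trained model:
predicted_loss = a * (1e11) ** (-alpha)
print(f"alpha = {alpha:.3f}, predicted loss at 100B params: {predicted_loss:.2f}")
```

The extrapolation is only as good as the power-law assumption, but this is exactly the leverage scaling laws provide: three cheap runs stand in for one enormously expensive one.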
Chinchilla: Compute-Optimal Training
The Chinchilla paper (Hoffmann et al., 2022) refined these scaling laws with a critical correction. Kaplan et al. had suggested that for a fixed compute budget, you should train a very large model on relatively little data. Chinchilla showed the opposite: optimal performance comes from a roughly equal scaling of parameters and data.
The Chinchilla-optimal ratio is approximately:

D ≈ 20 × N

meaning you should train on about 20 tokens per parameter (where N is the parameter count and D the number of training tokens).
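A back-of-envelope sketch of how this rule of thumb turns a compute budget into a model/data split, using the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6ND). Note this simplified version will not exactly reproduce the paper's fitted tables, which use the full estimated scaling exponents:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a training compute budget into model and data sizes.

    Assumes C ~ 6 * N * D training FLOPs and D ~ tokens_per_param * N.
    Substituting gives C = 6 * tokens_per_param * N^2, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e24)
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.1f}T tokens")
```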
Chinchilla vs Kaplan Compute Allocation
| Compute Budget | Kaplan (params / tokens) | Chinchilla (params / tokens) | Chinchilla Advantage |
|---|---|---|---|
| 10^21 FLOPs | 6.7B / 30B | 1.4B / 27B | Same loss, 5x smaller model |
| 10^22 FLOPs | 40B / 60B | 7B / 140B | Same loss, 6x smaller model |
| 10^23 FLOPs | 200B / 200B | 33B / 660B | Same loss, 6x smaller model |
| 10^24 FLOPs | 1T / 500B | 175B / 3.5T | Same loss, 6x smaller model |
The practical impact was enormous: Chinchilla (70B parameters, 1.4T tokens) matched or exceeded Gopher (280B parameters, 300B tokens) despite being 4x smaller, purely by training on more data. This meant cheaper inference, smaller memory footprint, and faster serving — all from a better understanding of the scaling relationship.
Training Compute Efficiency: Chinchilla vs Previous Practice

Chinchilla-optimality minimizes training compute for a given loss. But in practice, inference cost often dominates total cost of ownership. A smaller model trained on more data than Chinchilla-optimal (“over-trained”) achieves slightly worse training loss but much cheaper inference. Llama 3 8B was trained on 15T tokens — roughly 1,875x its parameter count, far beyond the Chinchilla ratio of 20x. This is deliberate: the extra training cost is amortized over billions of inference queries.
Using Scaling Laws for Architecture Decisions
Scaling laws do not directly tell you what architecture to use. They tell you the relationship between total compute, model size, and data size. But combined with empirical studies, they inform architecture decisions in several ways:
- Compute budget determines model size: Given a fixed training budget, Chinchilla tells you the optimal parameter count.
- Parameter count constrains architecture: Once you know you are building a 70B model, the design space is much more constrained than if you were choosing between 7B and 700B.
- Small-scale experiments transfer: The scaling laws show that architectural trends at small scale (1B-7B) generally hold at large scale (70B-405B), allowing you to test design choices cheaply.
Architecture Decisions That Matter
Width vs Depth
The two primary axes of Transformer scaling are width (the model dimension d_model, which sets the size of each layer) and depth (the number of layers n_layers). For a given parameter budget, a wider model must be shallower and a deeper model must be narrower.
def compute_transformer_params(d_model, n_layers, n_heads=None, d_ff_multiplier=4):
    """
    Approximate parameter count for a Transformer.

    Each layer has:
    - Self-attention: 4 * d_model^2 (Q, K, V, and output projections)
    - FFN: 2 * d_model * d_ff (up and down projections)
    - Layer norms: negligible

    n_heads does not affect the count: the Q/K/V/O projections total
    4 * d_model^2 regardless of how they are split into heads.
    Embedding and LM-head parameters are ignored.
    """
    d_ff = int(d_model * d_ff_multiplier)
    attention_params = 4 * d_model ** 2
    ffn_params = 2 * d_model * d_ff
    per_layer = attention_params + ffn_params
    return n_layers * per_layer

# Different width/depth allocations (counts per this approximation):
wide_shallow = compute_transformer_params(d_model=6144, n_layers=24)  # ~10.9B
balanced = compute_transformer_params(d_model=4096, n_layers=32)      # ~6.4B
narrow_deep = compute_transformer_params(d_model=3072, n_layers=48)   # ~5.4B
Width vs Depth Tradeoffs (Comparable Parameter Budgets)
| Configuration | d_model | Layers | Training Loss | Inference Speed | Quality |
|---|---|---|---|---|---|
| Wide-shallow | 6144 | 24 | Slightly higher | Faster (fewer layers) | Weaker on reasoning |
| Balanced | 4096 | 32 | Optimal | Moderate | Best overall |
| Narrow-deep | 3072 | 48 | Slightly higher | Slower (more layers) | Better on some tasks |
Why depth matters: Each Transformer layer performs one round of attention and one round of feed-forward processing. Deeper models can express more complex compositional functions — the kind needed for multi-step reasoning, where the output of one reasoning step feeds into the next. Shallow models struggle with tasks requiring many sequential inference steps.
Why width matters: The model dimension determines the capacity of each layer. Wider layers can represent more features simultaneously and have larger feed-forward networks that act as knowledge stores. Width is also more parallelizable than depth — you can shard a wide layer across GPUs (tensor parallelism) more easily than you can parallelize sequential layers.
The practical answer: Most successful LLMs keep width and depth in a roughly d_model ≈ 128 × n_layers relationship. Llama 3 8B uses d_model = 4096 with 32 layers; Llama 3 70B uses d_model = 8192 with 80 layers.
Attention Head Count and Dimension
The number of attention heads n_heads and the head dimension d_head are related by d_model = n_heads × d_head. Increasing the head count while keeping d_model fixed reduces the dimension per head.
Head Dimension Impact (d_model = 4096)
| Heads | Head Dim | Attention Capacity | Efficiency | Used By |
|---|---|---|---|---|
| 16 | 256 | Fewer, richer representations | Lower | Rare (too few heads) |
| 32 | 128 | Good balance | Good | Llama 3 8B, Mistral 7B |
| 64 | 64 | Many diverse patterns | Good | GPT-3 era models |
| 128 | 32 | Very diverse but shallow | Highest | Not used (too small) |
A head dimension of 128 has become the standard for a hardware reason: NVIDIA tensor cores operate on tiles of 16x16 or 32x32 elements, and d_head = 128 tiles cleanly into both. This seemingly minor implementation detail has driven a convergence across architectures.
FFN Ratio and SwiGLU
The feed-forward network (FFN) in each Transformer layer traditionally uses a hidden dimension of d_ff = 4 × d_model. Modern LLMs have converged on a few modifications:

SwiGLU activation: Instead of the original ReLU FFN (W2 · ReLU(W1 x)), most modern LLMs use SwiGLU (W2 · (SiLU(W1 x) ⊙ W3 x)). This adds a third weight matrix W3 but empirically improves quality for the same parameter count.

Reduced multiplier: Because SwiGLU has three weight matrices instead of two, the FFN hidden dimension is reduced to compensate. The standard SwiGLU multiplier is 8/3 ≈ 2.67 (two-thirds of the traditional 4x), rounded to a multiple of 256 for hardware efficiency.
import torch.nn as nn
import torch.nn.functional as F

# Traditional FFN
class TraditionalFFN(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

# Parameters: 2 * d_model * d_ff

# SwiGLU FFN (used in Llama, Mistral, etc.)
class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or int(8 / 3 * d_model)
        d_ff = ((d_ff + 255) // 256) * 256  # round up to a multiple of 256
        self.w1 = nn.Linear(d_model, d_ff)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff)  # up projection
        self.w2 = nn.Linear(d_ff, d_model)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Parameters: 3 * d_model * d_ff
# But d_ff is smaller, so total params ~ same as traditional
SwiGLU outperforms ReLU and GELU activations by 0.5-1% on language modeling benchmarks at equivalent parameter count. The gating mechanism (element-wise multiplication of two projections) provides more expressive nonlinearity. The cost is a 50% increase in FFN computation for the same hidden dimension, which is offset by reducing the hidden dimension.
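The parameter-parity claim is easy to verify with the two sizing rules above (4x hidden for the ReLU FFN, 8/3x rounded up to a multiple of 256 for SwiGLU):

```python
def traditional_ffn_params(d_model):
    d_ff = 4 * d_model                  # classic 4x hidden dimension
    return 2 * d_model * d_ff           # up + down projections

def swiglu_ffn_params(d_model):
    d_ff = int(8 / 3 * d_model)         # reduced multiplier
    d_ff = ((d_ff + 255) // 256) * 256  # round up to a multiple of 256
    return 3 * d_model * d_ff           # gate + up + down projections

d = 4096
trad = traditional_ffn_params(d)
swi = swiglu_ffn_params(d)
# The two totals land within about 1% of each other.
print(f"{trad:,} vs {swi:,} ({swi / trad:.3f}x)")
```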
MoE vs Dense: When to Use Mixture of Experts
Mixture of Experts (MoE) replaces the dense FFN with multiple “expert” FFNs, routing each token to only a subset (typically 2 out of 8 or 16 experts). This increases total parameters without proportionally increasing computation.
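A minimal sketch of that routing step for a single token, using plain Python lists rather than batched tensors; production routers also handle expert capacity limits and load balancing:

```python
import math

def top_k_route(logits, k=2):
    """Select the top-k experts for one token and renormalize their gates.

    `logits` holds the router's score for each expert. Only the selected
    experts run their FFN on this token; their outputs are then mixed
    using the softmax-renormalized gate weights.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exp_scores = [math.exp(logits[i]) for i in top]
    total = sum(exp_scores)
    return [(i, s / total) for i, s in zip(top, exp_scores)]

# One token's router scores over 8 experts:
routing = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2], k=2)
# Experts 1 and 4 win; their two gate weights sum to 1.
```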
Dense vs MoE Architecture Comparison
| Architecture | Total Params | Active Params/Token | Training FLOPs | Inference Cost | Quality |
|---|---|---|---|---|---|
| Dense 7B | 7B | 7B | 1x | 1x | Baseline |
| Dense 13B | 13B | 13B | 1.9x | 1.9x | +3-5% over 7B |
| MoE 8x7B (top-2) | 47B | 13B | ~2x | ~2x | +5-8% over 7B |
| Dense 70B | 70B | 70B | 10x | 10x | +15-20% over 7B |
| MoE 8x22B (top-2) | 141B | 39B | ~5.5x | ~5.5x | +12-16% over 7B |
When MoE wins: MoE is most valuable when you have abundant memory but limited compute budget. It gives you the quality benefits of a larger model at the compute cost of a smaller one. This is why Mixtral 8x7B (47B total, 13B active) can compete with dense models 2-3x its active parameter count.
When dense wins: Dense models are simpler to train (no load balancing issues), require less total memory (no inactive expert weights), and have more predictable performance. For small models (under 13B) where memory is not the bottleneck, dense architectures are generally preferred.
How DeepSeek Chose 671B MoE
DeepSeek-V3 (2024) is one of the most detailed public examples of compute-informed architecture design. They chose a 671B MoE architecture with 37B active parameters per token, routing to 8 out of 256 experts.
The Cost Analysis
DeepSeek’s decision was driven by a specific cost target: achieve GPT-4-class performance at a fraction of the training cost.
DeepSeek-V3 Architecture Decision Analysis
| Alternative | Total Params | Active Params | Est. Training Cost | Expected Quality |
|---|---|---|---|---|
| Dense 70B | 70B | 70B | $4M | Below GPT-4 |
| Dense 405B | 405B | 405B | $60M | Near GPT-4 |
| MoE 671B (chosen) | 671B | 37B | $5.6M | Near GPT-4 |
| Dense 37B (same active) | 37B | 37B | $2M | Well below GPT-4 |
The key insight is in the last two rows: a dense 37B model (same active parameters as DeepSeek-V3) would be significantly weaker, while a dense 405B model (similar quality target) would cost 10x more to train. MoE gives you the knowledge capacity of a much larger model at the compute cost of the active parameter count.
DeepSeek’s Architectural Innovations
DeepSeek-V3 introduced several innovations beyond basic MoE:
Multi-head Latent Attention (MLA): Instead of standard multi-head attention, DeepSeek compresses the KV cache using learned down-projections. This reduces the KV cache by 6-8x compared to standard MHA at the same quality.
Fine-grained experts with shared experts: Instead of 8 large experts, DeepSeek uses 256 small experts (routed top-8) plus 1 shared expert that processes every token. The shared expert handles common patterns while the routed experts specialize.
Auxiliary-loss-free load balancing: Traditional MoE uses an auxiliary loss to encourage balanced expert utilization, which can hurt model quality. DeepSeek uses a bias-based approach that achieves balance without an explicit loss term.
DeepSeek-V3 has 671B total parameters but only 37B active per token. The full model requires ~1.3 TB of memory in FP16, but the compute cost per token is similar to a dense 37B model. This means DeepSeek-V3 needs many GPUs for memory (to hold all expert weights) but uses each GPU’s compute relatively sparingly. It is the opposite of the typical dense model bottleneck.
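The memory-vs-compute asymmetry can be put in rough numbers using the figures from the text (FP16 weights, and the common approximation of ~2 forward FLOPs per active parameter per token):

```python
# DeepSeek-V3 figures from the text: 671B total, 37B active per token.
total_params = 671e9
active_params = 37e9
bytes_per_param = 2  # FP16

# Memory needed just to hold the weights (all experts, active or not):
weight_memory_tb = total_params * bytes_per_param / 1e12

# Forward-pass compute per token, using the rough 2-FLOPs-per-param rule:
flops_per_token = 2 * active_params

# A dense 37B model would need the same compute but ~18x less weight memory.
dense_37b_memory_tb = active_params * bytes_per_param / 1e12
print(f"weights: {weight_memory_tb:.2f} TB vs dense {dense_37b_memory_tb:.3f} TB; "
      f"forward FLOPs/token: {flops_per_token:.1e}")
```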
Llama 3 Architecture Choices Explained
Meta’s Llama 3 family (8B, 70B, 405B) provides a clear case study in modern architecture design because Meta published detailed ablation results.
The Key Decisions
Llama 3 Architecture Details
| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| d_model | 4096 | 8192 | 16384 |
| Attention heads | 32 | 64 | 128 |
| KV heads (GQA) | 8 | 8 | 8 |
| Head dimension | 128 | 128 | 128 |
| FFN hidden dim | 14336 | 28672 | 53248 |
| FFN multiplier | 3.5x | 3.5x | 3.25x |
| Vocabulary size | 128K | 128K | 128K |
| Context length | 128K | 128K | 128K |
| Training tokens | 15T | 15T | 15T |
Several patterns are notable:
Constant KV heads (8): All three model sizes use exactly 8 KV heads. This is a deliberate choice for inference efficiency — the KV cache scales linearly with KV head count, and 8 provides a good quality-efficiency tradeoff at 128K context.
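That KV-cache scaling is easy to quantify. A sketch assuming FP16 cache entries; the full-MHA variant is hypothetical and included only for comparison:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    # Per position, each layer caches K and V: n_kv_heads * d_head values each.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Llama 3 8B (32 layers, 8 KV heads, d_head 128) at full 128K context, FP16:
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=128 * 1024)

# Hypothetical full-MHA variant (one KV head per query head, so 32):
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=128 * 1024)

print(f"GQA: {gqa / 2**30:.0f} GiB per sequence; full MHA would be {mha / 2**30:.0f} GiB")
```

At 16 GiB per full-length sequence, the cache is already a major cost; quadrupling it with full MHA would make long-context serving far more expensive.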
Constant head dimension (128): All models use 128-dimensional heads, regardless of model width. This is driven by hardware efficiency (tensor core tile sizes) and the observation that head dimension has diminishing returns beyond 128.
Large vocabulary (128K): Llama 3 uses a much larger vocabulary than Llama 2 (128K vs 32K). Larger vocabularies improve tokenization efficiency (fewer tokens per word, especially for non-English languages) at a modest parameter cost (the embedding layer grows, but it is a small fraction of total parameters).
Over-training: All three models were trained on 15T tokens, far beyond Chinchilla-optimal. For the 8B model, this is 1,875 tokens per parameter (vs Chinchilla’s 20). This over-training improves the quality of the smaller models significantly, at the cost of “wasted” training compute that would have been better spent on a larger model under Chinchilla rules.
Why Over-Training Makes Economic Sense
Chinchilla-Optimal vs Llama 3 Training Strategy
| Model | Chinchilla Tokens | Llama 3 Tokens | Over-training Factor | Inference Savings |
|---|---|---|---|---|
| 8B | 160B | 15T | 93.75x | Model is 93x smaller than Chinchilla-optimal for same data |
| 70B | 1.4T | 15T | 10.7x | Model is 10x smaller than Chinchilla-optimal |
| 405B | 8.1T | 15T | 1.85x | Near Chinchilla-optimal |
The 8B model is the most aggressively over-trained. The logic: the 8B model will be deployed billions of times. Every token of inference is cheap because the model is small. The extra training compute is a one-time cost that is amortized over the lifetime of deployments. A Chinchilla-optimal model trained on only 160B tokens would be much weaker, requiring deployment of the more expensive 70B model for the same quality.
For models intended for broad deployment, over-training beyond Chinchilla-optimal is almost always worth it. The formula is:
Total cost = Training cost + (Inference cost per query * Expected queries)
When expected queries is large (millions to billions), minimizing inference cost per query (smaller model) dominates, even if it means higher training cost.
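Plugging illustrative numbers into that formula shows the crossover. The costs below are invented for the sketch (not Meta's actual figures); the 70B stand-in is assumed to cost roughly 9x more per query, in line with its larger active compute:

```python
def total_cost(training_cost, cost_per_query, expected_queries):
    return training_cost + cost_per_query * expected_queries

# Invented illustrative costs: an over-trained 8B (expensive to train,
# cheap to serve) vs a Chinchilla-optimal 70B of similar quality
# (cheaper to train, ~9x the per-query serving cost).
queries = 1e10  # lifetime query volume at broad-deployment scale

overtrained_8b = total_cost(training_cost=2e6, cost_per_query=1e-4,
                            expected_queries=queries)
chinchilla_70b = total_cost(training_cost=1e6, cost_per_query=9e-4,
                            expected_queries=queries)

print(f"8B over-trained: ${overtrained_8b / 1e6:.0f}M, 70B: ${chinchilla_70b / 1e6:.0f}M")
```

At low query volumes the cheaper-to-train model wins; at deployment scale the per-query term dominates and the smaller over-trained model is far cheaper overall.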
Modern Architecture Design Process
Putting it all together, here is how a team designing a new LLM in 2025 typically approaches architecture decisions:
Step 1: Define the Compute Budget and Use Case
The compute budget determines the maximum model size. The use case determines the deployment constraints (latency, memory, throughput).
Step 2: Use Scaling Laws to Determine Model Size
Given the compute budget, use Chinchilla-style scaling laws (adjusted for over-training if deploying at scale) to determine the target parameter count and training data size.
Step 3: Choose Dense vs MoE
If memory is abundant but compute is constrained, MoE. If simplicity and predictability are priorities, dense. If the model will be deployed on consumer hardware, dense (MoE memory requirements are too high).
Step 4: Set Architecture Hyperparameters
For a dense model with N target parameters:
def design_llm_architecture(target_params_billions):
    """
    Heuristic architecture design for a dense Transformer LLM.
    Ignores embedding and LM-head parameters.
    """
    N = target_params_billions * 1e9

    # Head dimension is always 128 (tensor core tiling)
    d_head = 128

    # Estimate layers and width from empirical relationships:
    #   d_model ~ 128 * n_layers is a rough guide.
    # Per-layer params: 4 * d_model^2 (attention)
    #                 + 3 * 3.5 * d_model^2 (SwiGLU FFN at 3.5x)
    #                ~= 14.5 * d_model^2
    # So N ~ 14.5 * d_model^2 * (d_model / 128) = 14.5/128 * d_model^3.
    d_model = int((N * 128 / 14.5) ** (1 / 3))
    d_model = ((d_model + 127) // 128) * 128  # round to a multiple of 128
    n_layers = d_model // 128
    n_heads = d_model // d_head
    n_kv_heads = 8  # standard GQA

    # SwiGLU FFN dimension
    d_ff = int(d_model * 3.5)
    d_ff = ((d_ff + 255) // 256) * 256  # round up to a multiple of 256

    return {
        'd_model': d_model,
        'n_layers': n_layers,
        'n_heads': n_heads,
        'n_kv_heads': n_kv_heads,
        'd_head': d_head,
        'd_ff': d_ff,
    }
Step 5: Validate with Small-Scale Experiments
Train 100M-1B parameter versions of the candidate architectures on a small dataset and compare loss curves. The relative ordering of architectures is generally preserved at scale (this is the key insight that makes scaling-law-based design work).
Small-Scale Experiment Transferability

What NAS Got Right (And What Carries Forward)
While NAS is not used directly for LLM architecture search, several ideas from the NAS era remain relevant:
Hardware-aware design: NAS introduced the idea of optimizing architectures for specific hardware. This principle is alive in LLM design — the choice of d_head = 128, FFN dimensions rounded to multiples of 256, and GQA head counts are all hardware-driven.
Automated hyperparameter search: While we do not search over architecture topology, automated search over training hyperparameters (learning rate, batch size, warmup schedule) is standard practice and uses similar Bayesian optimization techniques.
Efficiency frontiers: NAS established the concept of Pareto-optimal architectures that balance accuracy against cost. Scaling laws serve the same purpose for LLMs, defining the frontier of what is achievable at each compute budget.
Transferable patterns: NAS discovered that certain motifs (skip connections, inverted bottlenecks) work well across scales. In LLMs, the analogous transferable patterns are SwiGLU, RMSNorm, RoPE, and GQA.
Architecture Patterns: NAS Era vs LLM Era
| Concept | NAS Era (2017-2020) | LLM Era (2022-2025) |
|---|---|---|
| Search method | RL/evolutionary/gradient search | Scaling laws + ablations |
| Search cost | 1K-10K GPU-hours | 100-1K GPU-hours (small-scale expts) |
| Key decisions | Operation type, connectivity | Width, depth, FFN ratio, MoE |
| Hardware awareness | Latency tables | Tensor core tile sizes, memory hierarchy |
| Validation method | Train and evaluate | Scaling law extrapolation |
| Transferability | Limited across tasks | Strong across scales |
Conclusion
The evolution from NAS to scaling laws reflects a broader maturation of the field. When architectures are diverse and the design space is poorly understood, automated search over many candidates makes sense. When architectures have converged and the key variables are continuous, mathematical modeling is far more efficient.
Modern LLM architecture design is driven by a few key principles:
- Scaling laws determine model size given a compute budget and deployment plan.
- Over-training beyond Chinchilla is standard for models intended for broad deployment, because inference cost dominates total cost.
- The basic Transformer architecture is fixed: self-attention + SwiGLU FFN + RMSNorm + RoPE + GQA. The interesting choices are the continuous parameters.
- MoE is chosen when memory is cheap but compute is expensive, allowing larger knowledge capacity at lower per-token cost.
- Hardware dictates many “architectural” choices: head dimension of 128, FFN dimensions rounded to 256, GQA head count of 8 — these are driven by tensor core efficiency and memory hierarchy, not theoretical optimality.
- Small-scale experiments validate large-scale decisions, because architectural trends transfer reliably across scales.
The field has moved from “search for the best architecture” to “compute the right size and shape.” This is both less exciting and far more effective. The billions of dollars saved by not running NAS at LLM scale have been redirected into what actually drives progress: more data, more compute, and better training recipes.