Part 54 of 60 in the series: Inference Optimization Timeline

Introduction

In 2019, Neural Architecture Search (NAS) was the frontier of automated model design. Researchers spent thousands of GPU hours searching for optimal convolutional architectures, and the results were impressive — NAS-discovered models outperformed human-designed ones on image classification benchmarks.

Then LLMs arrived, and NAS became irrelevant for the largest models almost overnight.

The reason is simple: NAS works by evaluating many candidate architectures, and evaluating a single LLM architecture costs millions of dollars. You cannot search over thousands of architectures when each evaluation requires training a model for weeks on thousands of GPUs. The field needed a different approach to architecture design, and it found one in scaling laws — mathematical relationships that predict model performance from a few key variables without exhaustive search.

This post traces that evolution: from why NAS was abandoned for LLMs, through the scaling laws that replaced it, to the concrete architecture decisions behind models like Llama 3 and DeepSeek-V3. We cover the tradeoffs that actually matter at scale — width vs depth, attention head count, FFN ratios, and when Mixture of Experts (MoE) beats dense architectures.

Why NAS Was Abandoned for LLMs

The Search Cost Problem

NAS methods vary in efficiency, but all require evaluating multiple architectures. Even the most efficient methods (DARTS, one-shot NAS) need at minimum tens of GPU-days to search a moderately complex space.

📊 NAS Search Cost vs Model Scale

| Model Scale | Training Cost (1 arch) | NAS Candidates | Total Search Cost | Feasible? |
|---|---|---|---|---|
| ResNet-50 (25M params) | ~2 GPU-hours | 1,000 | 2,000 GPU-hours | Yes |
| EfficientNet (66M params) | ~12 GPU-hours | 500 | 6,000 GPU-hours | Yes (expensive) |
| GPT-2 (1.5B params) | ~200 GPU-hours | 100 | 20,000 GPU-hours | Barely |
| Llama 3 8B | ~50,000 GPU-hours | 50 | 2.5M GPU-hours | No |
| Llama 3 70B | ~1.5M GPU-hours | 10 | 15M GPU-hours | Absolutely not |
| Llama 3 405B | ~30M GPU-hours | 5 | 150M GPU-hours | Impossible |

Note: GPU-hours estimated for H100-class hardware. NAS candidate counts are generous minimums for meaningful search.

At the 405B scale, even evaluating 5 architecture candidates would cost $150M+ in compute. This is not a practical search strategy. The field needed a way to predict the right architecture without training it first.

The Search Space Mismatch

NAS was designed to search over discrete choices in convolutional networks: which operations to use (3x3 conv, 5x5 conv, pooling, skip connection), how to connect them, and how many layers to stack. The search space was rich but bounded.

LLM architectures, by contrast, are surprisingly uniform. Nearly every competitive LLM since GPT-2 uses the same basic structure: stacked Transformer blocks with multi-head self-attention and feed-forward networks. The meaningful choices are continuous parameters (model width, depth, head dimension, FFN multiplier) rather than discrete operations. This makes the problem better suited to mathematical modeling than combinatorial search.

ℹ️ The Simplification of Architecture Design

The irony of modern LLM design is that architecture matters less than it used to. Most performance gains come from scale (more parameters, more data, more compute) rather than architectural innovation. The Transformer is good enough that the main challenge is deciding how big to make it, not what shape it should be.

Scaling Laws: The Replacement for NAS

Kaplan et al. (2020): The First LLM Scaling Laws

The OpenAI scaling laws paper (Kaplan et al., 2020) established that LLM loss follows predictable power-law relationships with three variables: parameter count N, dataset size D, and compute budget C.

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}

The key finding was that model performance improves smoothly and predictably as you scale up, following power laws with consistent exponents. This means you can train small models, fit the scaling curve, and extrapolate to predict the performance of much larger models.
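This train-small, fit, extrapolate workflow can be sketched in a few lines. The snippet below generates illustrative loss values from the parameters-only law using Kaplan et al.'s reported constants (alpha_N ~ 0.076, N_c ~ 8.8e13), fits the exponent by least squares on small models only, and extrapolates to 70B. It is a toy demonstration of the mechanics, not a reproduction of the paper's fits:

```python
import math

# Kaplan et al. (2020) report roughly alpha_N ~ 0.076 and N_c ~ 8.8e13 for
# the parameters-only law; we use them to generate illustrative loss points.
ALPHA_N, N_C = 0.076, 8.8e13

def loss(n_params):
    return (N_C / n_params) ** ALPHA_N

# "Measure" loss on small models only (100M to 3B)...
small_models = [1e8, 3e8, 1e9, 3e9]
pts = [(math.log(n), math.log(loss(n))) for n in small_models]

# ...fit log L = alpha * (log N_c - log N) by least squares in log-log space...
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
alpha_fit = -slope                      # power-law exponent
nc_fit = math.exp(my / alpha_fit + mx)  # scale constant

# ...and extrapolate well beyond the fitted range.
pred_loss_70b = (nc_fit / 70e9) ** alpha_fit
```

Because loss is a straight line in log-log space, four cheap runs pin down the whole curve; the real papers fit over many runs and all three variables, but the mechanics are exactly this.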

Chinchilla: Compute-Optimal Training

The Chinchilla paper (Hoffmann et al., 2022) refined these scaling laws with a critical correction. Kaplan et al. had suggested that for a fixed compute budget, you should train a very large model on relatively little data. Chinchilla showed the opposite: optimal performance comes from a roughly equal scaling of parameters and data.

The Chinchilla-optimal ratio is approximately:

D_{\text{optimal}} \approx 20 \times N

meaning you should train on about 20 tokens per parameter.
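Combined with the common C ~ 6ND approximation for training FLOPs, the 20-tokens-per-parameter rule determines both model and data size from a compute budget alone. A minimal sketch (the helper name is ours; the simple rule gives slightly smaller numbers than Hoffmann et al.'s fitted tables):

```python
import math

def chinchilla_optimal(flops_budget, tokens_per_param=20.0):
    """Split a training budget under C ~ 6*N*D with D = r*N (compute-optimal)."""
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 1e23 FLOP budget lands near a 29B model trained on ~580B tokens.
n, d = chinchilla_optimal(1e23)
```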

📊 Chinchilla vs Kaplan Compute Allocation

| Compute Budget | Kaplan (params / tokens) | Chinchilla (params / tokens) | Chinchilla Advantage |
|---|---|---|---|
| 10^21 FLOPs | 6.7B / 30B | 1.4B / 27B | Same loss, 5x smaller model |
| 10^22 FLOPs | 40B / 60B | 7B / 140B | Same loss, 6x smaller model |
| 10^23 FLOPs | 200B / 200B | 33B / 660B | Same loss, 6x smaller model |
| 10^24 FLOPs | 1T / 500B | 175B / 3.5T | Same loss, 6x smaller model |

Note: Chinchilla-optimal models achieve the same training loss with far fewer parameters, meaning faster inference.

The practical impact was enormous: Chinchilla (70B parameters, 1.4T tokens) matched or exceeded Gopher (280B parameters, 300B tokens) despite being 4x smaller, purely by training on more data. This meant cheaper inference, smaller memory footprint, and faster serving — all from a better understanding of the scaling relationship.

Training Compute Efficiency: Chinchilla vs Previous Practice

| Strategy | Relative Efficiency | Notes |
|---|---|---|
| Kaplan-optimal (large model, less data) | 1.0 | Baseline |
| Chinchilla-optimal (balanced) | 1.8 | 80% more efficient |
| Over-trained (small model, more data) | 1.5 | Better for inference |
Beyond Chinchilla: Over-Training for Inference

Chinchilla-optimality minimizes training compute for a given loss. But in practice, inference cost often dominates total cost of ownership. A smaller model trained on more data than Chinchilla-optimal (“over-trained”) achieves slightly worse training loss but much cheaper inference. Llama 3 8B was trained on 15T tokens — roughly 1,875x its parameter count, far beyond the Chinchilla ratio of 20x. This is deliberate: the extra training cost is amortized over billions of inference queries.

Using Scaling Laws for Architecture Decisions

Scaling laws do not directly tell you what architecture to use. They tell you the relationship between total compute, model size, and data size. But combined with empirical studies, they inform architecture decisions in several ways:

  1. Compute budget determines model size: Given a fixed training budget, Chinchilla tells you the optimal parameter count.
  2. Parameter count constrains architecture: Once you know you are building a 70B model, the design space is much more constrained than if you were choosing between 7B and 700B.
  3. Small-scale experiments transfer: The scaling laws show that architectural trends at small scale (1B-7B) generally hold at large scale (70B-405B), allowing you to test design choices cheaply.

Architecture Decisions That Matter

Width vs Depth

The two primary axes of Transformer scaling are width (the model dimension d_{\text{model}}, which sets the size of each layer) and depth (the number of layers L). For a given parameter budget the two trade off directly: wider models must be shallower, and deeper models narrower.

def compute_transformer_params(d_model, n_layers, d_ff_multiplier=4):
    """
    Approximate parameter count for a Transformer (embeddings excluded).
    Each layer has:
    - Self-attention: 4 * d_model^2 (Q, K, V, and output projections)
    - FFN: 2 * d_model * d_ff (up and down projections)
    - Layer norms: negligible
    """
    d_ff = int(d_model * d_ff_multiplier)
    attention_params = 4 * d_model ** 2
    ffn_params = 2 * d_model * d_ff
    return n_layers * (attention_params + ffn_params)

# Same ~6.5B parameter budget, different allocations:
wide_shallow = compute_transformer_params(d_model=6144, n_layers=14)   # ~6.3B
balanced     = compute_transformer_params(d_model=4096, n_layers=32)   # ~6.4B
narrow_deep  = compute_transformer_params(d_model=3072, n_layers=56)   # ~6.3B

📊 Width vs Depth Tradeoffs (Same ~6.5B Parameter Budget)

| Configuration | d_model | Layers | Training Loss | Inference Speed | Quality |
|---|---|---|---|---|---|
| Wide-shallow | 6144 | 14 | Slightly higher | Faster (fewer layers) | Weaker on reasoning |
| Balanced | 4096 | 32 | Optimal | Moderate | Best overall |
| Narrow-deep | 3072 | 56 | Slightly higher | Slower (more layers) | Better on some tasks |

Note: Balanced configurations generally win. Going too shallow hurts compositional reasoning; going too deep increases latency and can cause training instability.

Why depth matters: Each Transformer layer performs one round of attention and one round of feed-forward processing. Deeper models can express more complex compositional functions — the kind needed for multi-step reasoning, where the output of one reasoning step feeds into the next. Shallow models struggle with tasks requiring many sequential inference steps.

Why width matters: The model dimension determines the capacity of each layer. Wider layers can represent more features simultaneously and have larger feed-forward networks that act as knowledge stores. Width is also more parallelizable than depth — you can shard a wide layer across GPUs (tensor parallelism) more easily than you can parallelize sequential layers.

The practical answer: most successful LLMs follow a rough d_{\text{model}} \approx 128 \times L relationship. Llama 3 8B uses d_{\text{model}} = 4096 with L = 32 layers; Llama 3 70B uses d_{\text{model}} = 8192 with L = 80 layers.

Attention Head Count and Dimension

The number of attention heads H and the head dimension d_h are related by d_{\text{model}} = H \times d_h. Increasing the head count while keeping d_{\text{model}} fixed reduces the dimension per head.

📊 Head Dimension Impact (d_model = 4096)

| Heads | Head Dim | Attention Capacity | Efficiency | Used By |
|---|---|---|---|---|
| 16 | 256 | Fewer, richer representations | Lower | Rare (too few heads) |
| 32 | 128 | Good balance | Good | Llama 3 8B, Mistral 7B |
| 64 | 64 | Many diverse patterns | Good | GPT-3 era models |
| 128 | 32 | Very diverse but shallow | Highest | Not used (too small) |

Note: Head dimension of 128 has become standard because it maps well to GPU tensor core tile sizes.

A head dimension of 128 has become the standard for a hardware reason: NVIDIA tensor cores operate on tiles of 16x16 or 32x32 elements, and d_h = 128 provides clean tiling. This seemingly minor implementation detail has driven a convergence across architectures.

FFN Ratio and SwiGLU

The feed-forward network (FFN) in each Transformer layer traditionally uses a hidden dimension of 4 \times d_{\text{model}}. Modern LLMs have converged on a few modifications:

SwiGLU activation: Instead of the original ReLU FFN (\text{FFN}(x) = \max(0, xW_1)W_2), most modern LLMs use SwiGLU (\text{FFN}(x) = (\text{Swish}(xW_1) \odot xW_3) W_2). This adds a third weight matrix but empirically improves quality for the same parameter count.

Reduced multiplier: Because SwiGLU has three matrices instead of two, the FFN hidden dimension is reduced to compensate. The standard SwiGLU multiplier is \frac{8}{3} \times d_{\text{model}} \approx 2.67 \times d_{\text{model}}, rounded to a multiple of 256 for hardware efficiency.

import torch.nn as nn
import torch.nn.functional as F

# Traditional FFN
class TraditionalFFN(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))
        # Parameters: 2 * d_model * d_ff

# SwiGLU FFN (used in Llama, Mistral, etc.)
class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or int(8/3 * d_model)
        d_ff = ((d_ff + 255) // 256) * 256  # Round to 256
        self.w1 = nn.Linear(d_model, d_ff)   # Gate projection
        self.w3 = nn.Linear(d_model, d_ff)   # Up projection
        self.w2 = nn.Linear(d_ff, d_model)   # Down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
        # Parameters: 3 * d_model * d_ff
        # But d_ff is smaller, so total params ~ same as traditional
ℹ️ Why SwiGLU Wins

SwiGLU outperforms ReLU and GELU activations by 0.5-1% on language modeling benchmarks at equivalent parameter count. The gating mechanism (element-wise multiplication of two projections) provides more expressive nonlinearity. The cost is a 50% increase in FFN computation for the same hidden dimension, which is offset by reducing the hidden dimension.

MoE vs Dense: When to Use Mixture of Experts

Mixture of Experts (MoE) replaces the dense FFN with multiple “expert” FFNs, routing each token to only a subset (typically 2 out of 8 or 16 experts). This increases total parameters without proportionally increasing computation.
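A minimal sketch of the routing step, assuming a softmax gate over per-expert logits and top-2 selection (the function name and logit values are illustrative, not any particular framework's API):

```python
import math

def top_k_route(gate_logits, k=2):
    """Select the top-k experts for one token and renormalize their gate weights."""
    topk = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax restricted to the chosen experts, so weights sum to 1.
    exps = [math.exp(gate_logits[i]) for i in topk]
    z = sum(exps)
    return [(i, w / z) for i, w in zip(topk, exps)]

# One token's gate scores over 8 experts: only experts 1 and 4 run.
routes = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2)
```

The token's output is the gate-weighted sum of the selected experts' FFN outputs; the other six experts are skipped entirely, which is where the compute savings come from.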

📊 Dense vs MoE Architecture Comparison

| Architecture | Total Params | Active Params/Token | Training FLOPs | Inference Cost | Quality |
|---|---|---|---|---|---|
| Dense 7B | 7B | 7B | 1x | 1x | Baseline |
| Dense 13B | 13B | 13B | 1.9x | 1.9x | +3-5% over 7B |
| MoE 8x7B (top-2) | 47B | 13B | ~2x | ~2x | +5-8% over 7B |
| Dense 70B | 70B | 70B | 10x | 10x | +15-20% over 7B |
| MoE 8x22B (top-2) | 141B | 39B | ~5.5x | ~5.5x | +12-16% over 7B |

Note: Active params determine inference compute. MoE achieves dense-model quality at lower compute cost but higher memory.

When MoE wins: MoE is most valuable when you have abundant memory but limited compute budget. It gives you the quality benefits of a larger model at the compute cost of a smaller one. This is why Mixtral 8x7B (47B total, 13B active) can compete with dense models 2-3x its active parameter count.

When dense wins: Dense models are simpler to train (no load balancing issues), require less total memory (no inactive expert weights), and have more predictable performance. For small models (under 13B) where memory is not the bottleneck, dense architectures are generally preferred.

How DeepSeek Chose 671B MoE

DeepSeek-V3 (2024) is one of the most detailed public examples of compute-informed architecture design. They chose a 671B MoE architecture with 37B active parameters per token, routing to 8 out of 256 experts.

The Cost Analysis

DeepSeek’s decision was driven by a specific cost target: achieve GPT-4-class performance at a fraction of the training cost.

📊 DeepSeek-V3 Architecture Decision Analysis

| Alternative | Total Params | Active Params | Est. Training Cost | Expected Quality |
|---|---|---|---|---|
| Dense 70B | 70B | 70B | $4M | Below GPT-4 |
| Dense 405B | 405B | 405B | $60M | Near GPT-4 |
| MoE 671B (chosen) | 671B | 37B | $5.6M | Near GPT-4 |
| Dense 37B (same active) | 37B | 37B | $2M | Well below GPT-4 |

Note: Training costs estimated for H800 hardware. DeepSeek-V3 achieved near-GPT-4 quality at roughly 10% of the estimated dense-405B cost.

The key insight is in the last two rows: a dense 37B model (same active parameters as DeepSeek-V3) would be significantly weaker, while a dense 405B model (similar quality target) would cost 10x more to train. MoE gives you the knowledge capacity of a much larger model at the compute cost of the active parameter count.

DeepSeek’s Architectural Innovations

DeepSeek-V3 introduced several innovations beyond basic MoE:

Multi-head Latent Attention (MLA): Instead of standard multi-head attention, DeepSeek compresses the KV cache using learned down-projections. This reduces the KV cache by 6-8x compared to standard MHA at the same quality.

Fine-grained experts with shared experts: Instead of 8 large experts, DeepSeek uses 256 small experts (routed top-8) plus 1 shared expert that processes every token. The shared expert handles common patterns while the routed experts specialize.

Auxiliary-loss-free load balancing: Traditional MoE uses an auxiliary loss to encourage balanced expert utilization, which can hurt model quality. DeepSeek uses a bias-based approach that achieves balance without an explicit loss term.

MoE Memory vs Compute Tradeoff

DeepSeek-V3 has 671B total parameters but only 37B active per token. The full model requires ~1.3 TB of memory in FP16, but the compute cost per token is similar to a dense 37B model. This means DeepSeek-V3 needs many GPUs for memory (to hold all expert weights) but uses each GPU’s compute relatively sparingly. It is the opposite of the typical dense model bottleneck.
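The tradeoff in this callout is simple arithmetic. A sketch using DeepSeek-V3's published totals, assuming 2 bytes per parameter for FP16 weights and the usual ~2 FLOPs per active weight per token for a forward pass (the helper name is ours):

```python
def moe_footprint(total_params, active_params, bytes_per_param=2):
    """Weight memory vs per-token forward compute for an MoE model."""
    weight_bytes = total_params * bytes_per_param   # all experts must be resident
    flops_per_token = 2 * active_params             # only routed experts compute
    return weight_bytes, flops_per_token

mem, flops = moe_footprint(671e9, 37e9)
dense_flops = 2 * 671e9  # a dense model of the same total size
# mem ~ 1.34e12 bytes (~1.3 TB); per-token FLOPs ~18x below the dense equivalent
```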

Llama 3 Architecture Choices Explained

Meta’s Llama 3 family (8B, 70B, 405B) provides a clear case study in modern architecture design because Meta published detailed ablation results.

The Key Decisions

📊 Llama 3 Architecture Details

| Parameter | 8B | 70B | 405B |
|---|---|---|---|
| Layers | 32 | 80 | 126 |
| d_model | 4096 | 8192 | 16384 |
| Attention heads | 32 | 64 | 128 |
| KV heads (GQA) | 8 | 8 | 8 |
| Head dimension | 128 | 128 | 128 |
| FFN hidden dim | 14336 | 28672 | 53248 |
| FFN multiplier | 3.5x | 3.5x | 3.25x |
| Vocabulary size | 128K | 128K | 128K |
| Context length | 128K | 128K | 128K |
| Training tokens | 15T | 15T | 15T |

Several patterns are notable:

Constant KV heads (8): All three model sizes use exactly 8 KV heads. This is a deliberate choice for inference efficiency — the KV cache scales linearly with KV head count, and 8 provides a good quality-efficiency tradeoff at 128K context.

Constant head dimension (128): All models use 128-dimensional heads, regardless of model width. This is driven by hardware efficiency (tensor core tile sizes) and the observation that head dimension has diminishing returns beyond 128.

Large vocabulary (128K): Llama 3 uses a much larger vocabulary than Llama 2 (128K vs 32K). Larger vocabularies improve tokenization efficiency (fewer tokens per word, especially for non-English languages) at a modest parameter cost (the embedding layer grows, but it is a small fraction of total parameters).

Over-training: All three models were trained on 15T tokens, far beyond Chinchilla-optimal. For the 8B model, this is 1,875 tokens per parameter (vs Chinchilla’s 20). This over-training improves the quality of the smaller models significantly, at the cost of “wasted” training compute that would have been better spent on a larger model under Chinchilla rules.
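The constant-8-KV-heads choice above can be made concrete with a quick cache-size calculation. A sketch of per-sequence KV cache size, assuming an FP16 cache (the helper is ours, not Llama's code):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache: K and V tensors, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Llama 3 70B shape at 128K context, FP16 cache:
gqa_cache = kv_cache_bytes(80, 8, 128, 128 * 1024)    # 8 KV heads (GQA): 40 GiB
mha_cache = kv_cache_bytes(80, 64, 128, 128 * 1024)   # all 64 heads keep KV: 320 GiB
```

At long context the cache, not the weights, dominates per-request memory, which is why the KV head count is held at 8 even as the query head count scales with model width.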

Why Over-Training Makes Economic Sense

📊 Chinchilla-Optimal vs Llama 3 Training Strategy

| Model | Chinchilla Tokens | Llama 3 Tokens | Over-training Factor | Inference Savings |
|---|---|---|---|---|
| 8B | 160B | 15T | 93.75x | Model is 93x smaller than Chinchilla-optimal for same data |
| 70B | 1.4T | 15T | 10.7x | Model is 10x smaller than Chinchilla-optimal |
| 405B | 8.1T | 15T | 1.85x | Near Chinchilla-optimal |

The 8B model is the most aggressively over-trained. The logic: the 8B model will be deployed billions of times. Every token of inference is cheap because the model is small. The extra training compute is a one-time cost that is amortized over the lifetime of deployments. A Chinchilla-optimal model trained on only 160B tokens would be much weaker, requiring deployment of the more expensive 70B model for the same quality.

💡 The Over-Training Principle

For models intended for broad deployment, over-training beyond Chinchilla-optimal is almost always worth it. The formula is:

Total cost = Training cost + (Inference cost per query * Expected queries)

When the expected query count is large (millions to billions), minimizing the inference cost per query (a smaller model) dominates, even if it means a higher training cost.
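The formula can be turned into a break-even calculation. All dollar figures below are invented for illustration, not actual training or serving costs:

```python
def total_cost(training_cost, cost_per_query, expected_queries):
    return training_cost + cost_per_query * expected_queries

QUERIES = 1e11  # lifetime query volume (illustrative)

# Hypothetical: an over-trained small model vs a larger, cheaper-to-train model.
small = total_cost(training_cost=3e6, cost_per_query=2e-5, expected_queries=QUERIES)
large = total_cost(training_cost=1e6, cost_per_query=1.8e-4, expected_queries=QUERIES)

# Query volume at which the pricier training pays for itself:
break_even = (3e6 - 1e6) / (1.8e-4 - 2e-5)  # ~1.25e10 queries
```

Past the break-even volume, every additional query widens the small model's advantage, which is why high-traffic deployments justify aggressive over-training.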

Modern Architecture Design Process

Putting it all together, here is how a team designing a new LLM in 2025 typically approaches architecture decisions:

Step 1: Define the Compute Budget and Use Case

The compute budget determines the maximum model size. The use case determines the deployment constraints (latency, memory, throughput).

Step 2: Use Scaling Laws to Determine Model Size

Given the compute budget, use Chinchilla-style scaling laws (adjusted for over-training if deploying at scale) to determine the target parameter count and training data size.

Step 3: Choose Dense vs MoE

If memory is abundant but compute is constrained, MoE. If simplicity and predictability are priorities, dense. If the model will be deployed on consumer hardware, dense (MoE memory requirements are too high).

Step 4: Set Architecture Hyperparameters

For a dense model with NN parameters:

def design_llm_architecture(target_params_billions):
    """
    Heuristic architecture design for a dense Transformer LLM.
    A rough sketch: real designs also account for embeddings,
    vocabulary size, and exact GQA configuration.
    """
    N = target_params_billions * 1e9

    # Head dimension is the de facto standard 128
    d_head = 128

    # Empirical relationships:
    # - d_model ~ 128 * L is a rough guide
    # - per-layer params ~ 12 * d_model^2 is a round heuristic
    #   (SwiGLU FFN at 3.5x contributes ~10.5 * d_model^2; attention adds
    #    ~2.5-4 * d_model^2 depending on GQA)
    # Combining: N ~ 12 * d_model^2 * (d_model / 128) = 12/128 * d_model^3
    d_model = int((N * 128 / 12) ** (1 / 3))
    d_model = ((d_model + 127) // 128) * 128  # Round up to a multiple of 128

    n_layers = d_model // 128
    n_heads = d_model // d_head
    n_kv_heads = 8  # Standard GQA

    # SwiGLU FFN dimension
    d_ff = int(d_model * 3.5)
    d_ff = ((d_ff + 255) // 256) * 256  # Round up to a multiple of 256

    return {
        'd_model': d_model,
        'n_layers': n_layers,
        'n_heads': n_heads,
        'n_kv_heads': n_kv_heads,
        'd_head': d_head,
        'd_ff': d_ff,
    }

# design_llm_architecture(7) -> d_model=4224, n_layers=33, n_heads=33, d_ff=14848
# (close to Llama 3 8B's actual 4096 / 32 / 14336)

Step 5: Validate with Small-Scale Experiments

Train 100M-1B parameter versions of the candidate architectures on a small dataset and compare loss curves. The relative ordering of architectures is generally preserved at scale (this is the key insight that makes scaling-law-based design work).
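Rank-order agreement between scales can be quantified with Spearman's rho. A self-contained sketch on hypothetical candidate losses (all numbers invented for illustration):

```python
def ranks(values):
    """Rank positions (0 = lowest loss = best architecture)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman's rho for untied rankings: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Validation losses of 5 candidate shapes at 100M and at 7B (invented numbers):
loss_100m = [3.42, 3.38, 3.45, 3.40, 3.50]
loss_7b   = [2.61, 2.58, 2.66, 2.63, 2.70]
rho = spearman(loss_100m, loss_7b)  # one adjacent pair swaps rank -> rho = 0.9
```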

Small-Scale Experiment Transferability

| Scale Jump | Rank Correlation | Reliability |
|---|---|---|
| 100M to 1B | 0.95 | Very high |
| 1B to 7B | 0.92 | |
| 7B to 70B | 0.88 | |
| 100M to 70B | 0.82 | Still reliable |
| 100M to 405B | 0.75 | Somewhat reliable |

What NAS Got Right (And What Carries Forward)

While NAS is not used directly for LLM architecture search, several ideas from the NAS era remain relevant:

Hardware-aware design: NAS introduced the idea of optimizing architectures for specific hardware. This principle is alive in LLM design — the choice of d_h = 128, FFN dimensions rounded to multiples of 256, and GQA head counts are all hardware-driven.

Automated hyperparameter search: While we do not search over architecture topology, automated search over training hyperparameters (learning rate, batch size, warmup schedule) is standard practice and uses similar Bayesian optimization techniques.

Efficiency frontiers: NAS established the concept of Pareto-optimal architectures that balance accuracy against cost. Scaling laws serve the same purpose for LLMs, defining the frontier of what is achievable at each compute budget.

Transferable patterns: NAS discovered that certain motifs (skip connections, inverted bottlenecks) work well across scales. In LLMs, the analogous transferable patterns are SwiGLU, RMSNorm, RoPE, and GQA.

📊 Architecture Patterns: NAS Era vs LLM Era

| Concept | NAS Era (2017-2020) | LLM Era (2022-2025) |
|---|---|---|
| Search method | RL/evolutionary/gradient search | Scaling laws + ablations |
| Search cost | 1K-10K GPU-hours | 100-1K GPU-hours (small-scale expts) |
| Key decisions | Operation type, connectivity | Width, depth, FFN ratio, MoE |
| Hardware awareness | Latency tables | Tensor core tile sizes, memory hierarchy |
| Validation method | Train and evaluate | Scaling law extrapolation |
| Transferability | Limited across tasks | Strong across scales |

Conclusion

The evolution from NAS to scaling laws reflects a broader maturation of the field. When architectures are diverse and the design space is poorly understood, automated search over many candidates makes sense. When architectures have converged and the key variables are continuous, mathematical modeling is far more efficient.

Modern LLM architecture design is driven by a few key principles:

  1. Scaling laws determine model size given a compute budget and deployment plan.
  2. Over-training beyond Chinchilla is standard for models intended for broad deployment, because inference cost dominates total cost.
  3. The basic Transformer architecture is fixed: self-attention + SwiGLU FFN + RMSNorm + RoPE + GQA. The interesting choices are the continuous parameters.
  4. MoE is chosen when memory is cheap but compute is expensive, allowing larger knowledge capacity at lower per-token cost.
  5. Hardware dictates many “architectural” choices: head dimension of 128, FFN dimensions rounded to 256, GQA head count of 8 — these are driven by tensor core efficiency and memory hierarchy, not theoretical optimality.
  6. Small-scale experiments validate large-scale decisions, because architectural trends transfer reliably across scales.

The field has moved from “search for the best architecture” to “compute the right size and shape.” This is both less exciting and far more effective. The billions of dollars saved by not running NAS at LLM scale have been redirected into what actually drives progress: more data, more compute, and better training recipes.