Part 13 of 23 in the Transformer Anatomy series

Introduction

In 2018, two papers redefined natural language processing. Google released BERT (Bidirectional Encoder Representations from Transformers), and OpenAI released GPT (Generative Pre-trained Transformer). Both were built on the transformer architecture from “Attention Is All You Need,” but they made fundamentally different design choices. BERT used only the encoder half. GPT used only the decoder half. Each came with a different pre-training objective, a different attention pattern, and a different theory about what language models should optimize for.

At the time, the consensus was clear: BERT was better for understanding, GPT was better for generation, and encoder-decoder models like T5 were the best of both worlds. The community expected a diverse ecosystem of architectures, each specialized for its niche.

That is not what happened. By 2024, decoder-only models had won almost every category. GPT-4, Claude, Llama, Gemini, Mistral — all decoder-only. Encoder models like BERT still exist in production for specific tasks, but the frontier of AI research and deployment runs entirely on autoregressive decoders. Encoder-decoder models like T5 and BART are largely historical artifacts at the frontier scale.

This post traces the full arc of that convergence. We start with the original architectural split, examine why each design excels where it does, then follow the scaling law arguments, the inference economics, and the practical engineering pressures that made decoder-only the default. Finally, we look at what remains of the encoder niche and where the field is heading.

The Original Split: Two Philosophies of Language Modeling

BERT: Bidirectional Encoder

BERT’s core insight was that language understanding benefits from seeing context in both directions simultaneously. A word’s meaning depends on what comes before and after it. Consider the word “bank” in “I sat on the river bank” versus “I went to the bank to deposit money.” A left-to-right model must commit to a representation of “bank” before seeing “river” or “deposit.” BERT sees everything at once.

The mechanism is masked language modeling (MLM). During pre-training, BERT randomly masks 15% of input tokens and learns to predict them from the surrounding context. This forces every hidden state to encode information from both directions:

class BERTEncoder:
    """Bidirectional encoder: all tokens attend to all tokens."""
    def __init__(self, vocab_size, num_layers, hidden_size, num_heads):
        self.embeddings = Embedding(vocab_size, hidden_size)  # token embeddings, used in forward()
        self.layers = [
            TransformerLayer(
                attention_type="bidirectional",  # No causal mask
                hidden_size=hidden_size,
                num_heads=num_heads
            ) for _ in range(num_layers)
        ]

    def forward(self, input_ids, attention_mask=None):
        # Every token attends to every other token in the sequence.
        # The attention matrix is fully dense (no triangular mask).
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x, attention_mask=attention_mask)
        return x  # Shape: [batch, seq_len, hidden_size]

The attention pattern is a full n × n matrix. Token 1 attends to token n; token n attends to token 1. There is no notion of “past” or “future.” This is enormously powerful for tasks where you have the entire input available up front — classification, named entity recognition, question answering over a fixed passage.

GPT: Autoregressive Decoder

GPT takes the opposite approach. It uses causal language modeling (CLM): predict the next token given all previous tokens. The attention mask is lower-triangular, so token i can only attend to tokens 1, 2, …, i. This is autoregressive by construction.

class GPTDecoder:
    """Causal decoder: each token attends only to previous tokens."""
    def __init__(self, vocab_size, num_layers, hidden_size, num_heads):
        self.embeddings = Embedding(vocab_size, hidden_size)  # token embeddings, used in forward()
        self.layers = [
            TransformerLayer(
                attention_type="causal",  # Lower-triangular mask
                hidden_size=hidden_size,
                num_heads=num_heads
            ) for _ in range(num_layers)
        ]

    def forward(self, input_ids):
        # Token i attends to tokens [1..i] only.
        # The causal mask prevents information leakage from the future.
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x, causal_mask=True)
        return x  # Shape: [batch, seq_len, hidden_size]

The causal constraint means GPT’s representations at position i encode information only from the left context. This is weaker for understanding — you are literally throwing away half the context. But it is perfectly aligned with generation: to produce the next token, you only need the tokens you have already generated.
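The entire architectural split comes down to one boolean matrix. A minimal numpy sketch (illustrative, not tied to any particular framework) of the two masks:

```python
import numpy as np

def bidirectional_mask(seq_len):
    """BERT-style: every position may attend to every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len):
    """GPT-style: position i may attend only to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

m = causal_mask(4)
# Row i marks the positions token i may attend to, e.g. row 1 is
# [True, True, False, False]: token 1 sees tokens 0 and 1 only.
```

Everything else — the pre-training objective, the generation procedure, the KV cache — follows from which of these two masks you choose.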

The Pre-Training Objective Matters More Than You Think

The difference between MLM and CLM is not just a training trick. It shapes what the model learns to represent.

MLM (BERT) trains the model to be a good fill-in-the-blank predictor. Given “The cat sat on the [MASK],” predict “mat.” This is a discriminative task. The model learns rich bidirectional representations, but it does not learn to generate coherent sequences. You cannot easily sample from BERT — there is no natural left-to-right generation procedure.

CLM (GPT) trains the model to be a good next-word predictor. Given “The cat sat on the,” predict “mat.” This is a generative task. The model learns a proper probability distribution P(x_t | x_1, …, x_{t−1}) that you can sample from autoregressively. Generation falls out naturally.
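The factorization P(x) = ∏ P(x_t | x_1, …, x_{t−1}) is what makes sampling well-defined. A toy illustration (the per-step conditional probabilities below are invented for the example):

```python
import math

# Hypothetical per-step conditionals P(x_t | x_1..x_{t-1}) for a 6-token sequence.
step_probs = [0.20, 0.05, 0.10, 0.30, 0.60, 0.40]

# The joint probability of the sequence is the product of the conditionals...
joint = math.prod(step_probs)

# ...but in practice we sum log-probabilities for numerical stability.
log_joint = sum(math.log(p) for p in step_probs)

assert abs(math.exp(log_joint) - joint) < 1e-12
```

BERT’s MLM objective never learns this chain of conditionals, which is exactly why no comparable sampling procedure exists for it.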

ℹ️ The Generation Problem for Encoders

BERT cannot generate text natively. To use BERT for generation, you need workarounds like iterative masking and replacement, Gibbs sampling, or bolting on a separate decoder. These approaches are slow, produce lower-quality text, and add engineering complexity. The autoregressive formulation of GPT makes generation a first-class operation.


Architectural Comparison: Encoder vs Decoder

| Property | BERT (Encoder) | GPT (Decoder) | Winner For... |
|---|---|---|---|
| Attention pattern | Full bidirectional (n × n) | Causal lower-triangular | Understanding vs generation |
| Pre-training objective | Masked language model (15% of tokens) | Causal language model (all tokens) | Representation richness vs generative quality |
| Context at position i | All n tokens | Tokens 1..i only | Comprehension vs sequential generation |
| Native generation | No (requires workarounds) | Yes (autoregressive sampling) | GPT |
| Training signal density | 15% of tokens per example | 100% of tokens per example | GPT (more efficient use of data) |
| Inference mode | Single forward pass | Sequential token-by-token | BERT for encoding, GPT for generation |

Note the training signal density row. BERT only gets a gradient signal from the 15% of tokens it masks. GPT gets a signal from every single token position. This means GPT extracts more learning per training example, which becomes increasingly important at scale.

Why Decoders Won for Generation

The answer is almost tautological: causal attention is generation. When you train a model to predict P(x_t | x_{1:t−1}) and then sample from that distribution token by token, you are doing exactly what the model was trained for. There is no gap between training and inference.

The Autoregressive Sampling Loop

Generation with a decoder model is straightforward:

def autoregressive_generate(model, prompt_tokens, max_new_tokens, temperature=1.0):
    """Generate text token by token using a causal decoder."""
    tokens = list(prompt_tokens)
    kv_cache = None

    for _ in range(max_new_tokens):
        # Forward pass: compute logits for the next token
        logits, kv_cache = model.forward(
            tokens[-1:] if kv_cache else tokens,
            kv_cache=kv_cache
        )

        # Sample from the probability distribution
        probs = softmax(logits[-1] / temperature)
        next_token = sample(probs)

        tokens.append(next_token)
        if next_token == EOS_TOKEN:
            break

    return tokens

Each step produces exactly one token. The KV cache stores previously computed key-value pairs so we do not recompute attention over the full sequence at every step. The model was trained to predict the next token — and that is exactly what we ask it to do.
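What the cache actually holds can be sketched for a single attention head in plain numpy (hypothetical dimensions; real implementations cache per layer and per head, as tensors rather than lists):

```python
import numpy as np

d = 8  # head dimension (illustrative)

def decode_step(q, k_new, v_new, cache):
    """One attention decode step: append the new K/V, attend over the full cache."""
    cache["K"].append(k_new)        # O(1) append instead of recomputing the prefix
    cache["V"].append(v_new)
    K = np.stack(cache["K"])        # [t, d]
    V = np.stack(cache["V"])        # [t, d]
    scores = K @ q / np.sqrt(d)     # [t] -- one dot product per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax over the t cached positions
    return weights @ V              # [d] -- attention output for the new token

cache = {"K": [], "V": []}
rng = np.random.default_rng(0)
for _ in range(5):                  # five decode steps, each linear in current length
    q, k, v = rng.normal(size=(3, d))
    out = decode_step(q, k, v, cache)
```

Each step touches every cached key once, which is where the O(n · d) per-token cost (and the memory-bandwidth pressure discussed later) comes from.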

Why Encoders Cannot Generate Naturally

BERT has no concept of “the next token.” Its attention is bidirectional, so every position already sees every other position. To generate with BERT, you would need something like:

  1. Start with a sequence of [MASK] tokens.
  2. Predict all masked tokens simultaneously.
  3. Replace some masks with predictions, keep others masked.
  4. Repeat until convergence.

This is essentially Gibbs sampling and has serious problems: it does not define a proper joint distribution over sequences, the order of unmasking is arbitrary, and it requires many forward passes. In practice, BERT-based generation produces incoherent text compared to autoregressive models.
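The procedure can be caricatured in a few lines. Here `predict_masked` is a hypothetical stand-in for a BERT forward pass that returns a prediction for every position; the sketch exists only to show the arbitrary unmasking order and the repeated full passes:

```python
import random

MASK = "[MASK]"

def mask_predict_generate(predict_masked, length, num_rounds=10, reveal_per_round=2):
    """Iterative mask-predict: a rough stand-in for Gibbs-style encoder generation."""
    tokens = [MASK] * length
    for _ in range(num_rounds):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict every masked position at once (one full forward pass)...
        predictions = predict_masked(tokens)
        # ...but commit to only a few; the order of unmasking is arbitrary.
        for i in random.sample(masked, min(reveal_per_round, len(masked))):
            tokens[i] = predictions[i]
    return tokens

# Toy "model" that fills every mask with a fixed token -- a real BERT would
# return its argmax prediction at each masked position.
filled = mask_predict_generate(lambda toks: ["word"] * len(toks), length=6)
```

Contrast this with the single-pass-per-token autoregressive loop above: here every round pays for a full forward pass over the whole sequence.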

The KV Cache Advantage

Decoder models use the KV cache to make generation efficient. After processing the prompt in a single forward pass (the “prefill” phase), each new token requires attending only to the cached keys and values plus the new token’s own query. The per-token cost during generation is O(n · d), where n is the current sequence length and d is the model dimension — linear in sequence length, not quadratic. Encoders have no equivalent optimization because they do not generate sequentially.

The Prompt Engineering Paradigm

Decoder-only models enabled a new paradigm: in-context learning and prompt engineering. Instead of fine-tuning a separate model for each task, you prepend instructions and examples to your input and let the autoregressive model continue. This only works because the model is trained to predict continuations of arbitrary text. BERT cannot do this — it is trained to fill in blanks, not to continue sequences.

Why Encoders Won for Understanding

Despite everything above, encoders genuinely excel at tasks where you have the full input and need a single output. The bidirectional attention means every token’s representation is informed by the entire context, not just the left context.

Classification and Sequence Labeling

For text classification, you feed the entire document through the encoder, take the [CLS] token’s representation (or pool across all positions), and pass it through a classification head. This is a single forward pass with full bidirectional context. The model sees every word in the document when computing the representation of every other word.

For named entity recognition (NER), each token’s representation captures both left and right context, which is critical. Is “Washington” a person, a city, or a state? The answer depends on tokens to the left and right.


Task Performance: Encoder vs Decoder (Comparable Model Sizes, circa 2019-2020)

| Task | BERT-base (110M) | GPT-2 (117M) | Delta | Why Encoder Wins |
|---|---|---|---|---|
| MNLI (NLI) | 84.6 Acc | 78.2 Acc | +6.4 | Bidirectional cross-sentence attention |
| SQuAD v1.1 (QA) | 88.5 F1 | 79.1 F1 | +9.4 | Full passage context for span extraction |
| CoNLL-03 (NER) | 92.8 F1 | 85.3 F1 | +7.5 | Both left and right context for entity boundaries |
| SST-2 (Sentiment) | 93.5 Acc | 88.7 Acc | +4.8 | Full sentence context |
| STS-B (Similarity) | 89.3 Pearson | 82.1 Pearson | +7.2 | Bidirectional sentence representations |

At comparable model sizes, BERT’s advantage on understanding tasks was substantial — 5 to 10 points on most benchmarks. This led to the widespread adoption of BERT, RoBERTa, and later DeBERTa for production NLU systems.

The Embarrassingly Parallel Inference Advantage

Encoder inference is a single forward pass. You feed in the entire input, get out a representation for every token, and you are done. There is no sequential generation loop. This has massive implications for latency:

def encoder_inference(model, input_tokens):
    """Single forward pass -- all tokens processed in parallel."""
    # Every matrix multiply is batched across the sequence -- fully parallel.
    representations = model.forward(input_tokens)
    # Classification: take [CLS] or pool
    logits = classification_head(representations[:, 0, :])
    return logits  # Done. No loop.

def decoder_generation(model, input_tokens, output_length=100):
    """Sequential loop -- one token at a time."""
    # Prefill: process the input in parallel (like an encoder). Returns the
    # logits for the first new token along with the populated KV cache.
    logits, kv_cache = model.prefill(input_tokens)
    # Decode: generate tokens one by one (SEQUENTIAL)
    output = []
    for _ in range(output_length):
        next_token = sample(logits)
        output.append(next_token)
        logits, kv_cache = model.decode_step([next_token], kv_cache)
    return output  # 100 sequential steps

Inference Latency Comparison (Single Input, A100 GPU)

| Operation | Model | Latency | Throughput | Note |
|---|---|---|---|---|
| Classification | BERT-base (110M) | 2.1 ms | 476 inputs/sec | Single forward pass |
| Classification | BERT-large (340M) | 5.8 ms | 172 inputs/sec | Single forward pass |
| Classification (prompt) | GPT-2 (117M) | 3.4 ms | 294 inputs/sec | Prefill only, no generation |
| Generate 100 tokens | GPT-2 (117M) | 48 ms | 2,083 tok/sec | 100 sequential decode steps |
| Generate 100 tokens | GPT-2 Medium (345M) | 112 ms | 893 tok/sec | 100 sequential decode steps |
| Generate 500 tokens | GPT-2 (117M) | 235 ms | 2,128 tok/sec | 500 sequential decode steps |

For classification and retrieval workloads, encoder models are 10-50x faster than decoder models generating an equivalent response. This is why BERT-family models still power most production search engines, recommendation systems, and content classification pipelines. The economics are compelling: if you do not need generation, why pay the cost of autoregressive decoding?

The Encoder Lineage: BERT to DeBERTa

The encoder family continued to improve after BERT:

  • RoBERTa (2019): Trained longer with more data, removed next-sentence prediction, dynamic masking. Significant gains across all benchmarks.
  • ALBERT (2020): Parameter sharing and factorized embeddings for efficiency.
  • ELECTRA (2020): Replaced MLM with a replaced-token-detection objective. Far more sample-efficient.
  • DeBERTa (2021): Disentangled attention for content and position. DeBERTa-v3 remains the strongest encoder model on SuperGLUE.

These models are smaller, faster, and cheaper than frontier decoder models. A DeBERTa-large (304M parameters) still outperforms GPT-3 (175B parameters) on several understanding benchmarks when both are fine-tuned. The per-parameter efficiency of bidirectional attention for understanding tasks is remarkable.

Encoder-Decoder: The Road Not Taken

The encoder-decoder architecture — exemplified by T5 (Text-to-Text Transfer Transformer) and BART — seemed like the obvious best of both worlds. The encoder processes the input bidirectionally (like BERT), and the decoder generates the output autoregressively (like GPT). For sequence-to-sequence tasks like translation, summarization, and question answering, this is architecturally elegant.

How Encoder-Decoder Works

class EncoderDecoderModel:
    """T5/BART style: bidirectional encoder + autoregressive decoder."""
    def __init__(self, encoder_layers, decoder_layers, hidden_size, num_heads):
        self.encoder = TransformerEncoder(encoder_layers, hidden_size, num_heads)
        self.decoder = TransformerDecoder(decoder_layers, hidden_size, num_heads)

    def forward(self, source_ids, target_ids):
        # Encoder: bidirectional attention over input
        encoder_output = self.encoder(source_ids)  # [batch, src_len, hidden]

        # Decoder: causal attention over output + cross-attention to encoder
        decoder_output = self.decoder(
            target_ids,
            encoder_hidden_states=encoder_output,  # Cross-attention
            causal_mask=True
        )
        return decoder_output

The cross-attention mechanism in the decoder allows each generated token to attend to all positions in the encoder output. This gives the decoder full bidirectional context over the input while maintaining autoregressive generation.
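The mechanism can be sketched as single-head cross-attention in numpy (hypothetical dimensions; real models use per-head projections and apply the causal mask only to the decoder’s self-attention, not shown here):

```python
import numpy as np

def cross_attention(decoder_h, encoder_h, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder."""
    Q = decoder_h @ Wq                           # [tgt_len, d]
    K = encoder_h @ Wk                           # [src_len, d]
    V = encoder_h @ Wv                           # [src_len, d]
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                # [tgt_len, src_len]
    scores -= scores.max(axis=-1, keepdims=True) # stabilized softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # No causal mask here: every target position sees the *entire* source.
    return weights @ V                           # [tgt_len, d]

rng = np.random.default_rng(0)
d = 16
out = cross_attention(rng.normal(size=(5, d)),   # 5 decoder positions
                      rng.normal(size=(9, d)),   # 9 encoder positions
                      *rng.normal(size=(3, d, d)))
```

Note the asymmetry: the score matrix is [tgt_len, src_len], not square, because queries and keys come from different sequences.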

Why Encoder-Decoder Lost at Scale

Despite the architectural elegance, encoder-decoder models faced several scaling challenges:

1. Complexity budget allocation. If you have a fixed parameter budget (say, 70B parameters), how do you split between encoder and decoder? A 35B encoder + 35B decoder? Or 50B decoder + 20B encoder? There is no obvious answer, and the optimal split depends on the task mix. A decoder-only model avoids this decision entirely.

2. Training objective complexity. T5 uses a span corruption objective: mask spans of tokens in the input, and the decoder must generate the missing spans. This is more complex to implement and tune than simple next-token prediction. BART uses a denoising objective with multiple noise functions. More hyperparameters, more things to get wrong.

3. Inference complexity. Encoder-decoder models require two separate forward passes during inference — one through the encoder (for the input) and one through the decoder (for generation). The KV cache must store both the encoder’s output and the decoder’s own KV cache. This doubles memory management complexity.

4. The scaling laws do not favor it. Empirically, decoder-only models show smoother and more predictable scaling behavior. The Chinchilla scaling laws (Hoffmann et al., 2022) were derived for decoder-only models. No equivalent analysis has shown encoder-decoder models to be more compute-efficient at frontier scale.


Encoder-Decoder vs Decoder-Only at Scale

| Model | Architecture | Parameters | Training Tokens | Key Observation |
|---|---|---|---|---|
| T5-11B | Encoder-Decoder | 11B | 1T | Strong on benchmarks but complex training |
| GPT-3 | Decoder-Only | 175B | 300B | In-context learning without fine-tuning |
| PaLM | Decoder-Only | 540B | 780B | Dominated benchmarks at scale |
| UL2 | Encoder-Decoder | 20B | 1T | Google's attempt to revive enc-dec; limited adoption |
| Llama 2 | Decoder-Only | 70B | 2T | Open-source standard, decoder-only |
| Llama 3.1 | Decoder-Only | 405B | 15T+ | Frontier open model, decoder-only |

💡 The T5 Lesson

T5 demonstrated that framing every NLP task as text-to-text is powerful. But the insight that survived was the “text-to-text” framing, not the encoder-decoder architecture. Modern decoder-only models use the same text-to-text approach — they just implement it with causal attention and prompt formatting instead of separate encoder and decoder stacks.

The practical outcome: no frontier lab trains encoder-decoder models at the largest scales. Google’s own Gemini models are decoder-only, despite Google having invented T5. The simplicity and predictability of decoder-only scaling won.

The Scaling Law Argument

The most powerful argument for decoder-only models comes from scaling laws. These empirical relationships describe how model performance (measured by loss on a held-out set) improves as you increase model size, dataset size, and compute budget.

Kaplan et al. (2020): The Original Scaling Laws

OpenAI’s scaling laws paper showed that for autoregressive language models, test loss follows a power law in model parameters N, dataset size D, and compute budget C:

L(N) ≈ (N_c / N)^α_N

L(D) ≈ (D_c / D)^α_D

L(C) ≈ (C_c / C)^α_C

where α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.050. These laws were derived for GPT-style decoder-only models and held across six orders of magnitude of compute.
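One convenient property of the power-law form: the relative improvement from scaling up does not depend on the unknown constant N_c, since L(kN)/L(N) = k^(−α_N). A quick check with the parameter exponent:

```python
ALPHA_N = 0.076  # parameter-count exponent from Kaplan et al. (2020)

def loss_ratio(scale_factor, alpha=ALPHA_N):
    """Relative loss after multiplying model size by scale_factor.

    L(kN) / L(N) = (N_c / kN)^a / (N_c / N)^a = k^(-a): N_c cancels out.
    """
    return scale_factor ** (-alpha)

# Each 10x in parameters cuts loss to ~84% of its previous value.
print(f"{loss_ratio(10):.3f}")  # -> 0.839
```

A 16% loss reduction per decade of parameters sounds modest, but compounded over the six orders of magnitude the paper covers, it is the difference between GPT-2 and GPT-4-class models.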

Chinchilla (2022): Compute-Optimal Training

DeepMind’s Chinchilla paper refined the scaling laws and showed that most large language models were significantly undertrained. The compute-optimal relationship is roughly:

N_opt ∝ C^0.5,  D_opt ∝ C^0.5

meaning that model size and training data should scale equally with compute budget. A 70B parameter model should be trained on approximately 1.4T tokens for compute-optimal performance.

Again, these laws were derived for decoder-only models. The smooth, predictable scaling behavior gives labs confidence to invest hundreds of millions of dollars in training runs. You can extrapolate from small experiments to predict large-model performance with reasonable accuracy.
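The compute-optimal split can be worked out on the back of an envelope using the standard approximation C ≈ 6ND and the Chinchilla heuristic D ≈ 20N (a sketch; the paper fits exact coefficients rather than this round ratio):

```python
def chinchilla_optimal(compute_flops):
    """Split a compute budget C ~= 6*N*D under the D ~= 20*N heuristic.

    Substituting D = 20N into C = 6ND gives N = sqrt(C / 120).
    """
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# The 70B / 1.4T pairing from the text implies C = 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOPs.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # -> 70B params, 1.4T tokens
```

Both N_opt and D_opt grow as C^0.5, matching the proportionality above: doubling compute means growing the model and the dataset by √2 each, not doubling either one alone.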

Why Scaling Favors Decoder-Only

Several properties of decoder-only models make them better scaling targets:

1. Training signal density. Every token in every training example provides a gradient signal. BERT-style MLM wastes 85% of tokens. At 15T training tokens (Llama 3.1 scale), this difference is enormous. The decoder model gets 15T token-level predictions; an MLM model gets ~2.25T.

2. Architectural simplicity. One stack of transformer layers, one attention pattern, one loss function. No encoder-decoder split, no cross-attention, no span corruption. Simpler architectures have fewer failure modes at scale.

3. Emergent capabilities. As decoder models scale, they develop capabilities that were not present at smaller scales: in-context learning, chain-of-thought reasoning, instruction following. These emergent capabilities appear to be uniquely strong in autoregressive models, possibly because the next-token prediction objective forces the model to develop general-purpose reasoning.

Scaling Behavior: Decoder-Only Models

[Line chart: cross-entropy loss falling smoothly as decoder-only models scale.]

Scaling Efficiency: Decoder-Only vs Alternatives

| Property | Decoder-Only (GPT) | Encoder (BERT) | Encoder-Decoder (T5) |
|---|---|---|---|
| Training signal per token | 1.0 (every token) | 0.15 (masked tokens only) | ~0.5 (depends on corruption ratio) |
| Scaling law predictability | Excellent (power law) | Good (less studied at scale) | Moderate (split budget complicates) |
| Compute-optimal known? | Yes (Chinchilla) | No | No |
| Emergent capabilities | Strong (ICL, CoT, tool use) | Weak (limited to fine-tuned tasks) | Moderate (some transfer) |
| Largest model trained | 1.8T (GPT-4 class) | ~1.5B (DeBERTa-xxl) | 11B (T5) |
| Investment confidence | High (predictable returns) | Low at frontier scale | Low at frontier scale |

The scaling argument is self-reinforcing. Because decoder-only models scale predictably, labs invest more compute in them. Because labs invest more compute, they produce better results. Because they produce better results, the next generation of scaling laws is again derived from decoder-only models.

Performance Analysis: Inference Economics

Understanding why decoder-only models dominate requires examining the inference cost structure. In production, inference costs typically exceed training costs by 10-100x over a model’s lifetime.

Prefill vs Decode: The Two Phases

Decoder-only inference has two distinct phases with very different computational profiles:

Prefill phase: Process the entire input prompt in a single forward pass. This is compute-bound and highly parallel — similar to encoder inference. All input tokens are processed simultaneously through the attention layers. The output is the KV cache for the input tokens plus the logits for the first generated token.

Decode phase: Generate tokens one at a time. Each step processes a single new token, attending to all previously cached keys and values. This is memory-bandwidth-bound because you are reading the entire KV cache from memory for each token but doing relatively little computation.

def analyze_inference_costs(model_params_B, seq_len, gen_len, batch_size=1):
    """Rough latency model for decoder-only inference on one A100 (FP16)."""
    params = model_params_B * 1e9
    # Back out approximate shapes from the parameter count: params ~= 12 * L * d^2,
    # with the common aspect ratio d ~= 128 * L.
    n_layers = max(1, round((params / (12 * 128 ** 2)) ** (1 / 3)))
    d_model = 128 * n_layers

    # Prefill: compute-bound, ~2 FLOPs per parameter per token.
    prefill_flops = 2 * params * seq_len * batch_size
    prefill_time_ms = prefill_flops / 312e12 * 1000  # A100 FP16 peak ~312 TFLOPS

    # Decode: memory-bandwidth-bound. Each step re-reads the weights once plus
    # the KV cache for every token processed so far.
    kv_bytes_per_token = 2 * n_layers * d_model * 2  # K and V, FP16 (2 bytes each)
    kv_cache_bytes = kv_bytes_per_token * (seq_len + gen_len) * batch_size
    weight_bytes = 2 * params  # FP16 weights
    decode_time_per_token_ms = (weight_bytes + kv_cache_bytes) / 2e12 * 1000  # ~2 TB/s HBM

    return {
        "prefill_ms": prefill_time_ms,
        "per_token_decode_ms": decode_time_per_token_ms,
        "total_decode_ms": decode_time_per_token_ms * gen_len,
        "total_ms": prefill_time_ms + decode_time_per_token_ms * gen_len,
    }

Inference Cost Breakdown (A100 80GB, FP16)

| Model | Input (tokens) | Output (tokens) | Prefill (ms) | Decode (ms) | Total (ms) |
|---|---|---|---|---|---|
| 7B Decoder | 512 | 128 | 3.3 | 38.4 | 41.7 |
| 7B Decoder | 2048 | 128 | 13.1 | 46.1 | 59.2 |
| 7B Decoder | 2048 | 512 | 13.1 | 215.0 | 228.1 |
| 70B Decoder | 512 | 128 | 33.0 | 384.0 | 417.0 |
| 70B Decoder | 2048 | 512 | 131.0 | 2150.0 | 2281.0 |
| 110M Encoder (BERT) | 512 | N/A | 0.6 | N/A | 0.6 |
| 340M Encoder (BERT-L) | 512 | N/A | 1.8 | N/A | 1.8 |

The table reveals the fundamental tradeoff. For pure encoding tasks (classification, embedding, retrieval), encoders are 50-1000x faster than using a decoder to generate an answer. But for any task requiring generation, you need a decoder, and the decode phase dominates total latency.

Batching and Throughput

In production, you batch multiple requests together. This changes the economics significantly:

Encoder batching is simple: stack inputs, run one forward pass, get all outputs. Throughput scales linearly with batch size up to GPU memory limits.

Decoder batching is complex because different requests are at different points in generation. You need continuous batching (also called iteration-level batching) to keep the GPU busy. Frameworks like vLLM, TensorRT-LLM, and SGLang implement this.


Throughput at Scale (A100 80GB, Concurrent Requests)

| Scenario | Model | Batch Size | Throughput | Latency (p50) |
|---|---|---|---|---|
| Classification | DeBERTa-large (304M) | 256 | 48,000 inputs/sec | 5.3 ms |
| Classification | Llama-3 8B (via prompt) | 32 | 890 inputs/sec | 36 ms |
| Generation (128 tok) | Llama-3 8B | 64 | 12,400 tok/sec | 165 ms |
| Generation (512 tok) | Llama-3 8B | 32 | 8,200 tok/sec | 580 ms |
| Embedding | BGE-large (335M) | 512 | 62,000 inputs/sec | 8.2 ms |

For classification workloads, using a 7B decoder model is roughly 50x more expensive than using a 300M encoder model. This is why encoder models persist in production.

The Memory Wall

The KV cache is the dominant memory cost during decoder inference. For a model with L layers, hidden dimension d, and h attention heads of head dimension d_h = d/h (so the heads together span h × d_h = d when every query head keeps its own keys and values), the FP16 KV cache per token is:

KV bytes per token = 2 (K and V) × L × d × 2 bytes

For Llama-3 70B shapes (80 layers, d = 8192) with no key-value sharing across heads, that is 2 × 80 × 8192 × 2 ≈ 2.6 MB per token.

At 8192 tokens of context, the KV cache alone is 2.6 MB × 8192 ≈ 21 GB — a quarter of an A100 80GB’s memory before counting weights or activations. This is the memory wall, and it constrains batch size, throughput, and maximum context length.

Encoders have no KV cache. Their memory usage is constant regardless of how many inputs you process (assuming fixed sequence length). This is another reason encoder models are preferred for high-throughput inference of non-generative tasks.

KV Cache Optimization Techniques

The memory wall has driven enormous innovation in KV cache management: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache by sharing keys/values across heads. KV cache quantization (FP8, INT4) halves or quarters memory. PagedAttention (vLLM) eliminates memory fragmentation. Sliding window attention (Mistral) bounds the cache to a fixed window size. All of these are decoder-specific optimizations — encoders do not need them.
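The savings from MQA/GQA are easy to quantify. A sketch assuming FP16 and Llama-3-70B-like shapes (the published config uses 80 layers, 64 query heads, 8 KV heads, and head dimension 128):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV cache: K and V, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-3-70B-like shapes: d_model = 8192 with 64 query heads -> head_dim = 128.
mha = kv_cache_bytes_per_token(80, n_kv_heads=64, head_dim=128)  # no sharing
gqa = kv_cache_bytes_per_token(80, n_kv_heads=8, head_dim=128)   # 8-way sharing

print(mha // gqa)  # -> 8: GQA shrinks the cache 8x here
# gqa * 8192 is roughly 2.7 GB for an 8192-token context.
```

Quantizing the cache to FP8 or INT4 multiplies on top of this, which is why the techniques are routinely stacked in production serving.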

Modern Convergence: Decoder-Only Does Everything

The most striking development of 2023-2025 is that decoder-only models now match or exceed encoder performance even on understanding tasks, when sufficiently large. The encoder advantage at comparable sizes has been overwhelmed by the sheer scale of modern decoder models.

Understanding Tasks: The Scale Argument

Consider the trajectory:

  • GPT-2 (2019, 1.5B): Mediocre on understanding benchmarks. BERT clearly better.
  • GPT-3 (2020, 175B): Competitive on many understanding tasks via few-shot prompting, without fine-tuning.
  • GPT-4 (2023, rumored ~1.8T MoE): Exceeds fine-tuned BERT/DeBERTa on most understanding benchmarks.
  • Llama 3.1 (2024, 405B): Open-source model exceeding encoder models on understanding tasks.

At 400B+ parameters, the decoder’s left-context-only disadvantage is compensated by the enormous model capacity and training data. The model effectively learns to “look ahead” through patterns in its training data.


Understanding Tasks: Modern Decoder vs Best Encoder

| Benchmark | DeBERTa-v3 Large (304M, fine-tuned) | GPT-4 (zero-shot) | Llama-3.1 70B (fine-tuned) | Winner |
|---|---|---|---|---|
| SuperGLUE | 91.4 | 93.8 | 92.1 | GPT-4 |
| MMLU | N/A | 86.4 | 82.0 | GPT-4 (decoders only) |
| HellaSwag | N/A | 95.3 | 88.0 | GPT-4 (decoders only) |
| SQuAD v2.0 (F1) | 90.7 | ~91 | 89.5 | Comparable |
| Sentiment (SST-2) | 96.8 | 97.1 | 96.5 | Comparable |
| NER (CoNLL-03) | 93.8 | ~88 (zero-shot) | 92.1 (fine-tuned) | DeBERTa (specialized) |

The table shows an important nuance: for highly specialized tasks like NER, fine-tuned encoders still hold an edge. But the gap is small and shrinking. And for the new benchmarks that matter — MMLU, reasoning, multi-step tasks — only decoder models compete.

The Embedding Revolution

One area where encoders seemed irreplaceable was text embedding for retrieval and similarity. Models like Sentence-BERT, BGE, and E5 used encoder architectures to produce fixed-size embeddings. But decoder-based embeddings have caught up:

  • LLM2Vec (2024): Converts decoder-only Llama models into effective text encoders by enabling bidirectional attention and using mean pooling.
  • GritLM (2024): A single model that does both generation and embedding, outperforming dedicated encoder models.
  • Nomic Embed and SFR-Embedding: Decoder-derived models competitive with the best encoder embeddings on MTEB.

The trick is simple: take a pretrained decoder model, fine-tune it with contrastive learning for embedding, and optionally enable bidirectional attention during the embedding forward pass. The decoder’s vast pretraining knowledge transfers directly.
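The pooling step at the heart of this recipe can be sketched with numpy stand-ins (`hidden_states` plays the role of a decoder's last-layer outputs; the cosine score at the end is what contrastive fine-tuning optimizes):

```python
import numpy as np

def embed(hidden_states, attention_mask):
    """Mean-pool last-layer hidden states into one fixed-size vector per input."""
    mask = attention_mask[:, :, None]            # [batch, seq, 1]
    summed = (hidden_states * mask).sum(axis=1)  # ignore padding positions
    counts = mask.sum(axis=1)                    # real tokens per example
    pooled = summed / counts
    # Unit-normalize so that similarity is a plain dot product.
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 6, 32))                  # [batch=2, seq=6, d=32]
m = np.array([[1, 1, 1, 1, 0, 0],                # first input has 2 padding tokens
              [1, 1, 1, 1, 1, 1]], dtype=float)
e = embed(h, m)
score = e[0] @ e[1]                              # cosine similarity, in [-1, 1]
```

Mean pooling (rather than taking the last token) matters most when bidirectional attention is enabled during the embedding pass, since every position then summarizes the whole input.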

What Encoders Still Win

Despite the convergence, encoder models retain clear advantages in specific production scenarios:

1. Latency-critical classification. When you need sub-5ms latency for content moderation, spam detection, or real-time sentiment analysis, a 100M-parameter encoder is the right tool. No 7B decoder can match it.

2. High-throughput embedding. Generating millions of embeddings per hour for search indexing is 50-100x cheaper with a dedicated encoder model than with a decoder.

3. Token-level tasks. NER, POS tagging, and chunking benefit from bidirectional attention at every token position. Fine-tuned encoders are still slightly more accurate per parameter.

4. Edge deployment. Encoder models compress well (DistilBERT is 66M parameters) and run efficiently on CPUs and mobile devices. Decoder models require significantly more resources.
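The cost gap in points 1 and 2 falls directly out of parameter counts. Using the standard rough approximation of ~2 FLOPs per parameter per token for a transformer forward pass (a back-of-envelope estimate, not a benchmark):

```python
def forward_flops(params, tokens):
    """Rough transformer forward cost: ~2 FLOPs per parameter per token."""
    return 2 * params * tokens

# Hypothetical workload: embed one 512-token document
encoder = forward_flops(100e6, 512)   # 100M-parameter encoder
decoder = forward_flops(7e9, 512)     # 7B-parameter decoder
print(f"decoder/encoder cost ratio: {decoder / encoder:.0f}x")  # → 70x
```

A 7B decoder costs ~70x more compute than a 100M encoder on the identical input, which is where the 50-100x figures in this section come from once serving overheads are included.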

💡 The Right Tool for the Right Job

The “decoder-only won” narrative applies to frontier AI research and general-purpose assistants. In production systems, the choice depends on your workload. Classification pipeline processing 10M documents/day? Use an encoder. Building a conversational AI product? Decoder-only is the only option. Generating search embeddings? Encoder is 50x cheaper, but decoder-based embeddings are closing the quality gap.

Hardware Implications and Co-Design

The dominance of decoder-only models has shaped hardware design and vice versa.

GPU Architecture Alignment

Modern GPU features are increasingly optimized for autoregressive decoding:

  • Tensor Cores are designed for large matrix multiplications, which dominate both prefill and decode.
  • HBM bandwidth is the bottleneck for decode (reading KV cache), driving investment in HBM3 and HBM3e.
  • NVLink and NVSwitch enable tensor parallelism across GPUs, critical for large decoder models.
  • FP8 support (H100, H200) directly targets the decode memory-bandwidth bottleneck by halving data movement.

Encoder inference is compute-bound and relatively simple. It does not push hardware boundaries the way autoregressive decoding does. Hardware vendors optimize for the harder, higher-value problem — which is decoding.

📊 Hardware Utilization: Encoder vs Decoder Workloads

| Metric | Encoder (Classification) | Decoder (Prefill) | Decoder (Decode) | Bottleneck |
|---|---|---|---|---|
| Compute Utilization | 85-95% | 70-85% | 5-15% | Decode is memory-bound |
| Memory Bandwidth Util. | 30-40% | 50-70% | 85-95% | Decode saturates bandwidth |
| GPU Power Draw | 250-300W | 300-350W | 200-280W | Decode underutilizes compute |
| Batch Efficiency | Near-linear scaling | Good scaling | Complex (varying seq lens) | Decode needs continuous batching |
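The decode column can be sanity-checked with a back-of-envelope roofline estimate. The sketch below assumes fp16 weights (2 bytes per parameter) and H100-class peak figures (~989 TFLOP/s dense bf16, ~3.35 TB/s HBM bandwidth); these are illustrative numbers, not measurements:

```python
def arithmetic_intensity_decode(batch_size, bytes_per_param=2):
    """FLOPs per byte of weight traffic for one decode step.

    Each step does ~2 FLOPs per parameter per sequence in the batch,
    but the weights are streamed from HBM only once for the whole batch.
    """
    return (2 * batch_size) / bytes_per_param

# Roofline ridge point: below this intensity, the GPU is memory-bound
ridge_point = 989e12 / 3.35e12   # ≈ 295 FLOP/byte

for bs in (1, 32, 256):
    ai = arithmetic_intensity_decode(bs)
    bound = "memory-bound" if ai < ridge_point else "compute-bound"
    print(f"batch={bs:>3}: {ai:.0f} FLOP/byte -> {bound}")
```

Even at batch 256, decode sits below the ridge point, which is why the table shows 5-15% compute utilization alongside 85-95% bandwidth utilization, and why continuous batching is essential.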

Custom Silicon for Decoding

The decode bottleneck has motivated custom hardware designs:

  • Google TPU v5e: Optimized for serving workloads, high memory bandwidth per chip.
  • Groq LPU: Deterministic, low-latency architecture specifically targeting autoregressive decoding.
  • Cerebras CS-3: Wafer-scale chip with massive on-chip SRAM to hold KV caches without HBM latency.
  • AWS Inferentia2/Trainium: Cost-optimized accelerators for decoder inference and training.

None of these chips were designed for encoder workloads. The hardware ecosystem is co-evolving with the decoder-only paradigm.

Speculative Decoding and the Generation Speed Problem

One of the most important recent innovations addresses the fundamental speed disadvantage of autoregressive decoding: speculative decoding.

The Core Idea

Use a small, fast “draft” model to propose multiple tokens at once. Then use the large “target” model to verify all proposed tokens in a single forward pass (which is parallel, like an encoder). Accept the tokens that match; reject and regenerate from the first mismatch.

def speculative_decode(draft_model, target_model, prompt,
                       gamma=5, max_new_tokens=256):
    """Generate faster by using a draft model to speculate ahead."""
    tokens = list(prompt)
    target_length = len(tokens) + max_new_tokens
    kv_cache_target = None

    while len(tokens) < target_length:
        # Draft model: quickly generate gamma candidate tokens (cheap, sequential)
        draft_tokens = draft_model.generate(tokens, num_tokens=gamma)

        # Target model: verify all gamma tokens in ONE parallel forward pass
        # This is essentially an encoder-style parallel computation
        target_logits, kv_cache_target = target_model.forward(
            tokens + draft_tokens,
            kv_cache=kv_cache_target,
        )

        # Accept the longest prefix of draft tokens consistent with the
        # target model's distribution (rejection sampling preserves exactness)
        num_accepted = verify_and_accept(draft_tokens, target_logits)
        tokens.extend(draft_tokens[:num_accepted])

        if num_accepted == gamma:
            # All accepted: sample one bonus token from the target for free
            tokens.append(sample(target_logits[-1]))
        else:
            # Replace the first rejected token with one drawn from the
            # target's (residual) distribution at that position
            tokens.append(sample(target_logits[num_accepted]))

    return tokens

Speculative decoding achieves 2-3x speedup without any quality loss. It is mathematically guaranteed to produce the same distribution as the target model alone. The key insight is that verification is parallel (like an encoder), even though generation is sequential.
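The guarantee comes from the rejection-sampling rule inside verification. Here is a simplified sketch of what a `verify_and_accept` step does, operating on per-position probability vectors rather than raw logits (names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_and_accept(draft_tokens, p_draft, p_target):
    """Accept draft token t at position i with probability
    min(1, p_target[i][t] / p_draft[i][t]).

    p_draft, p_target: per-position probability vectors over the vocabulary.
    Returns the number of tokens accepted before the first rejection.
    """
    for i, t in enumerate(draft_tokens):
        accept_prob = min(1.0, p_target[i][t] / p_draft[i][t])
        if rng.random() >= accept_prob:
            return i  # reject here; caller resamples at this position
    return len(draft_tokens)
```

On rejection at position i, the target resamples from the normalized residual distribution max(0, p_target − p_draft); that correction step is what makes the combined procedure provably match the target model’s output distribution.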

📊 Speculative Decoding Performance

| Target Model | Draft Model | Acceptance Rate | Speedup | Quality |
|---|---|---|---|---|
| Llama-3.1 70B | Llama-3.1 8B | ~75% | 2.3x | Identical distribution |
| Llama-3.1 70B | Llama-3.1 1B | ~55% | 1.8x | Identical distribution |
| GPT-4 | GPT-3.5-turbo (est.) | ~70% | ~2x | Identical distribution |
| Mixtral 8x22B | Mixtral 8x7B | ~65% | 2.0x | Identical distribution |
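The speedups above follow from a simple expected-value argument: if each draft token is accepted independently with probability α, one target forward pass yields (1 − α^(γ+1)) / (1 − α) tokens on average, counting the bonus or correction token. A quick calculation under that simplifying independence assumption, ignoring the draft model’s own cost:

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens per target forward pass: (1 - a^(g+1)) / (1 - a)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Acceptance rates from the table above, with gamma = 5 draft tokens
for alpha in (0.55, 0.65, 0.75):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, gamma=5):.2f} tokens/pass")
```

At α = 0.75 this gives roughly 3.3 tokens per target pass; the end-to-end speedups in the table are lower because the draft model’s generation time is not free.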

Speculative decoding borrows the encoder’s key insight — parallel processing of multiple tokens — while remaining within the decoder-only framework. This is a microcosm of the broader trend: the decoder-only architecture absorbs the useful properties of other architectures.

The Emerging Alternatives

While decoder-only has won the current generation, research continues on architectures that might challenge the paradigm.

State Space Models (Mamba)

Mamba and its successors replace attention entirely with a selective state space model that processes sequences in O(n) time and constant memory. There is no attention matrix, no KV cache, no quadratic scaling. Early results show competitive performance with transformers at moderate scale (up to ~7B parameters).
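To make the contrast concrete, here is a minimal NumPy sketch of a plain (non-selective) linear state-space scan, with toy matrices rather than Mamba’s actual parameterization. Mamba additionally makes the dynamics input-dependent, which this fixed-parameter sketch omits:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A @ h_{t-1} + B @ x_t, y_t = C @ h_t.

    O(n) in sequence length with a constant-size state (d_state floats),
    versus a transformer's O(n^2) attention matrix and O(n) KV cache.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:               # single pass over the sequence
        h = A @ h + B @ x_t     # constant-size state update
        ys.append(C @ h)
    return np.array(ys)

# Toy example: 1-D input, 2-D hidden state
A = np.array([[0.9, 0.0], [0.0, 0.5]])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 1.0]])
y = ssm_scan(np.ones((6, 1)), A, B, C)
```

Whatever the sequence length, the only carried state is `h`; that constant memory footprint is exactly what eliminates the KV cache.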

However, Mamba has not yet demonstrated the scaling behavior that makes transformers so compelling. The jury is still out on whether state space models can reach frontier quality.

Hybrid Architectures (Jamba, Zamba)

Some models combine transformer layers with Mamba layers, attempting to get the best of both. Jamba (AI21) and Zamba (Zyphra) interleave attention and state-space layers. These are still fundamentally decoder-only in their generation paradigm — they just use a different attention mechanism in some layers.

Diffusion Language Models

Diffusion models generate all tokens simultaneously through iterative denoising, more like an encoder than a decoder. Research models like MDLM and SEDD show promise, but they lag autoregressive models in quality and have not yet demonstrated strong scaling.

ℹ️ The Architecture Search Continues

The decoder-only transformer is the current local optimum, not necessarily the global optimum. Future architectures may combine insights from encoders (parallel processing), decoders (autoregressive quality), state space models (linear scaling), and diffusion (parallel generation). But any challenger must demonstrate the smooth scaling laws that make decoder-only transformers so bankable.

A Practical Decision Framework

Given everything above, here is how to choose an architecture for your workload in 2025:

📊 Architecture Selection Guide (2025)

| Workload | Recommended Architecture | Model Examples | Rationale |
|---|---|---|---|
| Text classification (high throughput) | Encoder | DeBERTa-v3, ModernBERT, BGE-reranker | 50-100x cheaper than a decoder for the same task |
| Text embedding / retrieval | Encoder (or decoder-derived) | BGE, E5, Nomic-embed | Latency and cost matter; encoder still cheaper |
| Named Entity Recognition | Encoder | DeBERTa + token classifier | Bidirectional context helps boundary detection |
| Text generation (any kind) | Decoder-only | Llama 3, Mistral, GPT-4 | Only viable option for high-quality generation |
| Conversational AI / Chat | Decoder-only | Claude, GPT-4, Llama-3 | Requires generation + instruction following |
| Summarization | Decoder-only | Llama 3, GPT-4, Claude | Decoder-only models now outperform enc-dec on ROUGE |
| Translation | Decoder-only (or enc-dec for legacy) | GPT-4, NLLB (enc-dec) | Decoder matches enc-dec quality at scale |
| Code generation | Decoder-only | Claude, GPT-4, CodeLlama, DeepSeek-Coder | Autoregressive generation is essential |
| Multi-modal (vision + language) | Decoder-only with vision encoder | GPT-4V, LLaVA, Claude | Vision encoder + language decoder is the standard |
| Edge / mobile deployment | Encoder (or small decoder) | DistilBERT, MobileBERT, Phi-3-mini | Encoder for NLU; small decoder if generation needed |

Conclusion

The encoder vs decoder debate is resolved, but the resolution is nuanced. Decoder-only models won the frontier. They scale more predictably, generate text natively, support in-context learning, and at sufficient scale they match encoders even on understanding tasks. The entire frontier AI ecosystem — hardware, frameworks, serving infrastructure, research — is organized around decoder-only models.

But encoders did not disappear. They retreated to a well-defined niche: high-throughput, low-latency inference for classification, embedding, and token-level tasks. In this niche, they are 50-100x more cost-effective than decoder alternatives. Every major search engine, content moderation system, and recommendation pipeline still runs encoder models.

The encoder-decoder architecture, which once seemed like the elegant middle ground, turned out to be the worst position: too complex to scale like decoder-only, not fast enough for the encoder niche. T5’s legacy is the text-to-text paradigm, not the architecture.

Looking forward, the decoder-only transformer faces challenges from state space models, hybrid architectures, and diffusion approaches. But any successor must match the transformer’s extraordinary property: smooth, predictable scaling across many orders of magnitude of compute. Until something demonstrates that, decoder-only transformers will remain the foundation of frontier AI.

The lesson for practitioners is pragmatic. Use decoder-only models for anything involving generation, reasoning, or general-purpose AI. Use encoder models when you need speed and cost-efficiency for non-generative tasks. Ignore encoder-decoder models for new projects unless you have a very specific sequence-to-sequence workload with constraints that favor them. The architecture debate is over. The engineering optimization has just begun.