Introduction
In 2018, two papers redefined natural language processing. Google released BERT (Bidirectional Encoder Representations from Transformers), and OpenAI released GPT (Generative Pre-trained Transformer). Both were built on the transformer architecture from “Attention Is All You Need,” but they made fundamentally different design choices. BERT used only the encoder half. GPT used only the decoder half. Each came with a different pre-training objective, a different attention pattern, and a different theory about what language models should optimize for.
At the time, the consensus was clear: BERT was better for understanding, GPT was better for generation, and encoder-decoder models like T5 were the best of both worlds. The community expected a diverse ecosystem of architectures, each specialized for its niche.
That is not what happened. By 2024, decoder-only models had won almost every category. GPT-4, Claude, Llama, Gemini, Mistral — all decoder-only. Encoder models like BERT still exist in production for specific tasks, but the frontier of AI research and deployment runs entirely on autoregressive decoders. Encoder-decoder models like T5 and BART are largely historical artifacts at the frontier scale.
This post traces the full arc of that convergence. We start with the original architectural split, examine why each design excels where it does, then follow the scaling law arguments, the inference economics, and the practical engineering pressures that made decoder-only the default. Finally, we look at what remains of the encoder niche and where the field is heading.
The Original Split: Two Philosophies of Language Modeling
BERT: Bidirectional Encoder
BERT’s core insight was that language understanding benefits from seeing context in both directions simultaneously. A word’s meaning depends on what comes before and after it. Consider the word “bank” in “I sat on the river bank” versus “I went to the bank to deposit money.” A left-to-right model must commit to a representation of “bank” before seeing “river” or “deposit.” BERT sees everything at once.
The mechanism is masked language modeling (MLM). During pre-training, BERT randomly masks 15% of input tokens and learns to predict them from the surrounding context. This forces every hidden state to encode information from both directions:
class BERTEncoder:
    """Bidirectional encoder: all tokens attend to all tokens."""

    def __init__(self, num_layers, hidden_size, num_heads):
        self.layers = [
            TransformerLayer(
                attention_type="bidirectional",  # No causal mask
                hidden_size=hidden_size,
                num_heads=num_heads
            ) for _ in range(num_layers)
        ]

    def forward(self, input_ids, attention_mask=None):
        # Every token attends to every other token in the sequence.
        # The attention matrix is fully dense (no triangular mask).
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x, attention_mask=attention_mask)
        return x  # Shape: [batch, seq_len, hidden_size]
The attention pattern is a full n × n matrix. Token 1 attends to token n; token n attends to token 1. There is no notion of “past” or “future.” This is enormously powerful for tasks where you have the entire input available up front — classification, named entity recognition, question answering over a fixed passage.
GPT: Autoregressive Decoder
GPT takes the opposite approach. It uses causal language modeling (CLM): predict the next token given all previous tokens. The attention mask is lower-triangular, so token i can only attend to tokens 1..i. This is autoregressive by construction.
class GPTDecoder:
    """Causal decoder: each token attends only to previous tokens."""

    def __init__(self, num_layers, hidden_size, num_heads):
        self.layers = [
            TransformerLayer(
                attention_type="causal",  # Lower-triangular mask
                hidden_size=hidden_size,
                num_heads=num_heads
            ) for _ in range(num_layers)
        ]

    def forward(self, input_ids):
        # Token i attends to tokens [1..i] only.
        # The causal mask prevents information leakage from the future.
        x = self.embeddings(input_ids)
        for layer in self.layers:
            x = layer(x, causal_mask=True)
        return x  # Shape: [batch, seq_len, hidden_size]
The causal constraint means GPT’s representations at position i encode information only from the left context, tokens 1..i. This is weaker for understanding — you are literally throwing away half the context. But it is perfectly aligned with generation: to produce the next token, you only need the tokens you have already generated.
The Pre-Training Objective Matters More Than You Think
The difference between MLM and CLM is not just a training trick. It shapes what the model learns to represent.
MLM (BERT) trains the model to be a good fill-in-the-blank predictor. Given “The cat sat on the [MASK],” predict “mat.” This is a discriminative task. The model learns rich bidirectional representations, but it does not learn to generate coherent sequences. You cannot easily sample from BERT — there is no natural left-to-right generation procedure.
CLM (GPT) trains the model to be a good next-word predictor. Given “The cat sat on the,” predict “mat.” This is a generative task. The model learns a proper probability distribution that you can sample from autoregressively. Generation falls out naturally.
BERT cannot generate text natively. To use BERT for generation, you need workarounds like iterative masking and replacement, Gibbs sampling, or bolting on a separate decoder. These approaches are slow, produce lower-quality text, and add engineering complexity. The autoregressive formulation of GPT makes generation a first-class operation.
Architectural Comparison: Encoder vs Decoder
| Property | BERT (Encoder) | GPT (Decoder) | Winner For... |
|---|---|---|---|
| Attention Pattern | Full bidirectional (n x n) | Causal lower-triangular | Understanding vs Generation |
| Pre-training Objective | Masked Language Model (15% tokens) | Causal Language Model (all tokens) | Representation richness vs Generative quality |
| Context at Position i | All n tokens | Tokens 1..i only | Comprehension vs Sequential generation |
| Native Generation | No (requires workarounds) | Yes (autoregressive sampling) | GPT |
| Training Signal Density | 15% of tokens per example | 100% of tokens per example | GPT (more efficient use of data) |
| Inference Mode | Single forward pass | Sequential token-by-token | BERT for encoding, GPT for generation |
Note the training signal density row. BERT only gets a gradient signal from the 15% of tokens it masks. GPT gets a signal from every single token position. This means GPT extracts more learning per training example, which becomes increasingly important at scale.
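The density gap is easy to see in the loss masks themselves. A minimal sketch with toy token ids (standalone helpers, not BERT's actual masking code, which also applies an 80/10/10 replacement split):

```python
import random

def clm_loss_positions(tokens):
    # Causal LM: every position (except the first) is a prediction target.
    return list(range(1, len(tokens)))

def mlm_loss_positions(tokens, mask_prob=0.15, seed=0):
    # Masked LM: only the ~15% of positions chosen for masking carry loss.
    rng = random.Random(seed)
    return [i for i in range(len(tokens)) if rng.random() < mask_prob]

tokens = list(range(1000))  # a 1000-token training example
print(len(clm_loss_positions(tokens)))  # 999 supervised positions
print(len(mlm_loss_positions(tokens))) # ~150 of 1000 positions
```

Per example, the causal objective supervises essentially every position, while MLM supervises roughly one in seven.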
Why Decoders Won for Generation
The answer is almost tautological: causal attention is generation. When you train a model to predict p(x_t | x_1, ..., x_{t-1}) and then sample from that distribution token by token, you are doing exactly what the model was trained for. There is no gap between training and inference.
The Autoregressive Sampling Loop
Generation with a decoder model is straightforward:
def autoregressive_generate(model, prompt_tokens, max_new_tokens, temperature=1.0):
    """Generate text token by token using a causal decoder."""
    tokens = list(prompt_tokens)
    kv_cache = None
    for _ in range(max_new_tokens):
        # Forward pass: compute logits for the next token
        logits, kv_cache = model.forward(
            tokens[-1:] if kv_cache else tokens,
            kv_cache=kv_cache
        )
        # Sample from the probability distribution
        probs = softmax(logits[-1] / temperature)
        next_token = sample(probs)
        tokens.append(next_token)
        if next_token == EOS_TOKEN:
            break
    return tokens
Each step produces exactly one token. The KV cache stores previously computed key-value pairs so we do not recompute attention over the full sequence at every step. The model was trained to predict the next token — and that is exactly what we ask it to do.
Why Encoders Cannot Generate Naturally
BERT has no concept of “the next token.” Its attention is bidirectional, so every position already sees every other position. To generate with BERT, you would need something like:
- Start with a sequence of [MASK] tokens.
- Predict all masked tokens simultaneously.
- Replace some masks with predictions, keep others masked.
- Repeat until convergence.
This is essentially Gibbs sampling and has serious problems: it does not define a proper joint distribution over sequences, the order of unmasking is arbitrary, and it requires many forward passes. In practice, BERT-based generation produces incoherent text compared to autoregressive models.
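The mask-predict loop can be sketched as follows (hypothetical `mlm_model.predict_tokens` API; real schemes such as Mask-Predict differ in their unmasking schedule and confidence scoring):

```python
MASK = -1  # hypothetical [MASK] token id

def mask_predict_generate(mlm_model, length, num_iterations=10):
    """Non-autoregressive generation by iterative unmasking (sketch)."""
    tokens = [MASK] * length
    for it in range(num_iterations):
        # Predict ALL positions in one bidirectional forward pass.
        predictions, confidences = mlm_model.predict_tokens(tokens)
        # Keep the most confident predictions, re-mask the rest.
        num_to_keep = length * (it + 1) // num_iterations
        ranked = sorted(range(length), key=lambda i: -confidences[i])
        tokens = [MASK] * length
        for i in ranked[:num_to_keep]:
            tokens[i] = predictions[i]
    return tokens
```

Even in this idealized form, the loop needs `num_iterations` full forward passes and never defines a proper joint distribution over the output sequence.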
Decoder models use the KV cache to make generation efficient. After processing the prompt in a single forward pass (the “prefill” phase), each new token requires attending only to the cached keys and values plus the new token’s own query. The per-token cost during generation is O(n · d), where n is the current sequence length and d is the model dimension — linear in sequence length, not quadratic. Encoders have no equivalent optimization because they do not generate sequentially.
The Prompt Engineering Paradigm
Decoder-only models enabled a new paradigm: in-context learning and prompt engineering. Instead of fine-tuning a separate model for each task, you prepend instructions and examples to your input and let the autoregressive model continue. This only works because the model is trained to predict continuations of arbitrary text. BERT cannot do this — it is trained to fill in blanks, not to continue sequences.
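In practice, “prompting as task specification” is just string construction. A minimal sketch (the label format and examples here are made up for illustration):

```python
def build_few_shot_prompt(examples, query):
    """Turn labeled examples + a query into a text-continuation task."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("Great film, loved it.", "positive"),
     ("A total waste of time.", "negative")],
    "Surprisingly moving and well acted.",
)
# A decoder model simply continues this text; its next predicted
# token is the classification label. No fine-tuning required.
```

The same mechanism scales from zero-shot instructions to many-shot prompts; nothing about the model changes, only the input string.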
Why Encoders Won for Understanding
Despite everything above, encoders genuinely excel at tasks where you have the full input and need a single output. The bidirectional attention means every token’s representation is informed by the entire context, not just the left context.
Classification and Sequence Labeling
For text classification, you feed the entire document through the encoder, take the [CLS] token’s representation (or pool across all positions), and pass it through a classification head. This is a single forward pass with full bidirectional context. The model sees every word in the document when computing the representation of every other word.
For named entity recognition (NER), each token’s representation captures both left and right context, which is critical. Is “Washington” a person, a city, or a state? The answer depends on tokens to the left and right.
Task Performance: Encoder vs Decoder (Comparable Model Sizes, circa 2019-2020)
| Task | BERT-base (110M) | GPT-2 (117M) | Delta | Why Encoder Wins |
|---|---|---|---|---|
| MNLI (NLI) | 84.6 Acc | 78.2 Acc | +6.4 | Bidirectional cross-sentence attention |
| SQuAD v1.1 (QA) | 88.5 F1 | 79.1 F1 | +9.4 | Full passage context for span extraction |
| CoNLL-03 (NER) | 92.8 F1 | 85.3 F1 | +7.5 | Both left and right context for entity boundaries |
| SST-2 (Sentiment) | 93.5 Acc | 88.7 Acc | +4.8 | Full sentence context |
| STS-B (Similarity) | 89.3 Pearson | 82.1 Pearson | +7.2 | Bidirectional sentence representations |
At comparable model sizes, BERT’s advantage on understanding tasks was substantial — 5 to 10 points on most benchmarks. This led to the widespread adoption of BERT, RoBERTa, and later DeBERTa for production NLU systems.
The Embarrassingly Parallel Inference Advantage
Encoder inference is a single forward pass. You feed in the entire input, get out a representation for every token, and you are done. There is no sequential generation loop. This has massive implications for latency:
def encoder_inference(model, input_tokens):
    """Single forward pass -- all tokens processed in parallel."""
    # One matrix multiply per layer, fully parallel across sequence length.
    representations = model.forward(input_tokens)
    # Classification: take [CLS] or pool
    logits = classification_head(representations[:, 0, :])
    return logits  # Done. No loop.

def decoder_generation(model, input_tokens, output_length=100):
    """Sequential loop -- one token at a time."""
    # Prefill: process input (parallel, like encoder); sample the first token
    logits, kv_cache = model.prefill(input_tokens)
    output = [sample(logits)]
    # Decode: generate remaining tokens one by one (SEQUENTIAL)
    for _ in range(output_length - 1):
        logits, kv_cache = model.decode_step(output[-1:], kv_cache)
        output.append(sample(logits))
    return output  # 100 sequential steps
Inference Latency Comparison (Single Input, A100 GPU)
| Operation | Model | Latency | Throughput | Note |
|---|---|---|---|---|
| Classification | BERT-base (110M) | 2.1 ms | 476 inputs/sec | Single forward pass |
| Classification | BERT-large (340M) | 5.8 ms | 172 inputs/sec | Single forward pass |
| Classification (prompt) | GPT-2 (117M) | 3.4 ms | 294 inputs/sec | Prefill only, no generation |
| Generate 100 tokens | GPT-2 (117M) | 48 ms | 2,083 tok/sec | 100 sequential decode steps |
| Generate 100 tokens | GPT-2 Medium (345M) | 112 ms | 893 tok/sec | 100 sequential decode steps |
| Generate 500 tokens | GPT-2 (117M) | 235 ms | 2,128 tok/sec | 500 sequential decode steps |
For classification and retrieval workloads, encoder models are 10-50x faster than decoder models generating an equivalent response. This is why BERT-family models still power most production search engines, recommendation systems, and content classification pipelines. The economics are compelling: if you do not need generation, why pay the cost of autoregressive decoding?
The Encoder Lineage: BERT to DeBERTa
The encoder family continued to improve after BERT:
- RoBERTa (2019): Trained longer with more data, removed next-sentence prediction, dynamic masking. Significant gains across all benchmarks.
- ALBERT (2020): Parameter sharing and factorized embeddings for efficiency.
- ELECTRA (2020): Replaced MLM with a replaced-token-detection objective. Far more sample-efficient.
- DeBERTa (2021): Disentangled attention for content and position. DeBERTa-v3 remains the strongest encoder model on SuperGLUE.
These models are smaller, faster, and cheaper than frontier decoder models. A DeBERTa-large (304M parameters) still outperforms GPT-3 (175B parameters) on several understanding benchmarks when both are fine-tuned. The per-parameter efficiency of bidirectional attention for understanding tasks is remarkable.
Encoder-Decoder: The Road Not Taken
The encoder-decoder architecture — exemplified by T5 (Text-to-Text Transfer Transformer) and BART — seemed like the obvious best of both worlds. The encoder processes the input bidirectionally (like BERT), and the decoder generates the output autoregressively (like GPT). For sequence-to-sequence tasks like translation, summarization, and question answering, this is architecturally elegant.
How Encoder-Decoder Works
class EncoderDecoderModel:
    """T5/BART style: bidirectional encoder + autoregressive decoder."""

    def __init__(self, encoder_layers, decoder_layers, hidden_size, num_heads):
        self.encoder = TransformerEncoder(encoder_layers, hidden_size, num_heads)
        self.decoder = TransformerDecoder(decoder_layers, hidden_size, num_heads)

    def forward(self, source_ids, target_ids):
        # Encoder: bidirectional attention over input
        encoder_output = self.encoder(source_ids)  # [batch, src_len, hidden]
        # Decoder: causal attention over output + cross-attention to encoder
        decoder_output = self.decoder(
            target_ids,
            encoder_hidden_states=encoder_output,  # Cross-attention
            causal_mask=True
        )
        return decoder_output
The cross-attention mechanism in the decoder allows each generated token to attend to all positions in the encoder output. This gives the decoder full bidirectional context over the input while maintaining autoregressive generation.
Why Encoder-Decoder Lost at Scale
Despite the architectural elegance, encoder-decoder models faced several scaling challenges:
1. Complexity budget allocation. If you have a fixed parameter budget (say, 70B parameters), how do you split between encoder and decoder? A 35B encoder + 35B decoder? Or 50B decoder + 20B encoder? There is no obvious answer, and the optimal split depends on the task mix. A decoder-only model avoids this decision entirely.
2. Training objective complexity. T5 uses a span corruption objective: mask spans of tokens in the input, and the decoder must generate the missing spans. This is more complex to implement and tune than simple next-token prediction. BART uses a denoising objective with multiple noise functions. More hyperparameters, more things to get wrong.
3. Inference complexity. Encoder-decoder models require two separate forward passes during inference — one through the encoder (for the input) and one through the decoder (for generation). The KV cache must store both the encoder’s output and the decoder’s own KV cache. This doubles memory management complexity.
4. The scaling laws do not favor it. Empirically, decoder-only models show smoother and more predictable scaling behavior. The Chinchilla scaling laws (Hoffmann et al., 2022) were derived for decoder-only models. No equivalent analysis has shown encoder-decoder models to be more compute-efficient at frontier scale.
Encoder-Decoder vs Decoder-Only at Scale
| Model | Architecture | Parameters | Training Tokens | Key Observation |
|---|---|---|---|---|
| T5-11B | Encoder-Decoder | 11B | 1T | Strong on benchmarks but complex training |
| GPT-3 | Decoder-Only | 175B | 300B | In-context learning without fine-tuning |
| PaLM | Decoder-Only | 540B | 780B | Dominated benchmarks at scale |
| UL2 | Encoder-Decoder | 20B | 1T | Google's attempt to revive enc-dec; limited adoption |
| Llama 2 | Decoder-Only | 70B | 2T | Open-source standard, decoder-only |
| Llama 3.1 | Decoder-Only | 405B | 15T+ | Frontier open model, decoder-only |
T5 demonstrated that framing every NLP task as text-to-text is powerful. But the insight that survived was the “text-to-text” framing, not the encoder-decoder architecture. Modern decoder-only models use the same text-to-text approach — they just implement it with causal attention and prompt formatting instead of separate encoder and decoder stacks.
The practical outcome: no frontier lab trains encoder-decoder models at the largest scales. Google’s own Gemini models are decoder-only, despite Google having invented T5. The simplicity and predictability of decoder-only scaling won.
The Scaling Law Argument
The most powerful argument for decoder-only models comes from scaling laws. These empirical relationships describe how model performance (measured by loss on a held-out set) improves as you increase model size, dataset size, and compute budget.
Kaplan et al. (2020): The Original Scaling Laws
OpenAI’s scaling laws paper showed that for autoregressive language models, test loss follows a power law in model parameters N, dataset size D, and compute budget C:

L(N) ∝ N^(−α_N),   L(D) ∝ D^(−α_D),   L(C) ∝ C^(−α_C)

where α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.050. These laws were derived for GPT-style decoder-only models and held across six orders of magnitude of compute.
Chinchilla (2022): Compute-Optimal Training
DeepMind’s Chinchilla paper refined the scaling laws and showed that most large language models were significantly undertrained. The compute-optimal relationship is roughly:

N_opt ∝ C^(1/2),   D_opt ∝ C^(1/2)   (equivalently, D ≈ 20 · N)

meaning that model size and training data should scale equally with compute budget. A 70B parameter model should be trained on approximately 1.4T tokens for compute-optimal performance.
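Under the common approximation C ≈ 6·N·D training FLOPs, the D ≈ 20·N rule of thumb turns into a tiny calculator (a sketch; the published Chinchilla fit has additional constants and task dependence):

```python
def chinchilla_optimal(compute_flops):
    """Compute-optimal params N and tokens D, assuming C = 6*N*D and D = 20*N."""
    n_params = (compute_flops / (6 * 20)) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# A 70B model on 1.4T tokens costs C = 6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # → 70B params, 1.4T tokens
```

Inverting the same formulas is how labs size a model for a fixed compute budget before committing to a training run.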
Again, these laws were derived for decoder-only models. The smooth, predictable scaling behavior gives labs confidence to invest hundreds of millions of dollars in training runs. You can extrapolate from small experiments to predict large-model performance with reasonable accuracy.
Why Scaling Favors Decoder-Only
Several properties of decoder-only models make them better scaling targets:
1. Training signal density. Every token in every training example provides a gradient signal. BERT-style MLM wastes 85% of tokens. At 15T training tokens (Llama 3.1 scale), this difference is enormous: the decoder model gets 15T token-level prediction targets; an MLM model masking 15% gets roughly 2.25T.
2. Architectural simplicity. One stack of transformer layers, one attention pattern, one loss function. No encoder-decoder split, no cross-attention, no span corruption. Simpler architectures have fewer failure modes at scale.
3. Emergent capabilities. As decoder models scale, they develop capabilities that were not present at smaller scales: in-context learning, chain-of-thought reasoning, instruction following. These emergent capabilities appear to be uniquely strong in autoregressive models, possibly because the next-token prediction objective forces the model to develop general-purpose reasoning.
Scaling Behavior: Decoder-Only Models
(Figure: cross-entropy loss vs. scale, line chart.)
Scaling Efficiency: Decoder-Only vs Alternatives
| Property | Decoder-Only (GPT) | Encoder (BERT) | Encoder-Decoder (T5) |
|---|---|---|---|
| Training signal per token | 1.0 (every token) | 0.15 (masked tokens only) | ~0.5 (depends on corruption ratio) |
| Scaling law predictability | Excellent (power law) | Good (less studied at scale) | Moderate (split budget complicates) |
| Compute-optimal known? | Yes (Chinchilla) | No | No |
| Emergent capabilities | Strong (ICL, CoT, tool use) | Weak (limited to fine-tuned tasks) | Moderate (some transfer) |
| Largest model trained | 1.8T (GPT-4 class) | ~1.5B (DeBERTa-xxl) | 11B (T5) |
| Investment confidence | High (predictable returns) | Low at frontier scale | Low at frontier scale |
The scaling argument is self-reinforcing. Because decoder-only models scale predictably, labs invest more compute in them. Because labs invest more compute, they produce better results. Because they produce better results, the next generation of scaling laws is again derived from decoder-only models.
Performance Analysis: Inference Economics
Understanding why decoder-only models dominate requires examining the inference cost structure. In production, inference costs typically exceed training costs by 10-100x over a model’s lifetime.
Prefill vs Decode: The Two Phases
Decoder-only inference has two distinct phases with very different computational profiles:
Prefill phase: Process the entire input prompt in a single forward pass. This is compute-bound and highly parallel — similar to encoder inference. All input tokens are processed simultaneously through the attention layers. The output is the KV cache for the input tokens plus the logits for the first generated token.
Decode phase: Generate tokens one at a time. Each step processes a single new token, attending to all previously cached keys and values. This is memory-bandwidth-bound because you are reading the entire KV cache from memory for each token but doing relatively little computation.
def analyze_inference_costs(model_params_B, seq_len, gen_len, batch_size=1):
    """Analyze inference costs for a decoder-only model (rough estimates)."""
    # Rough shape heuristic: params ≈ 12 * n_layers * d_model^2 with d_model ≈ 128 * n_layers.
    n_layers = round((model_params_B * 1e9 / (12 * 128 ** 2)) ** (1 / 3))
    d_model = 128 * n_layers

    # Prefill: compute-bound
    prefill_flops = 2 * model_params_B * 1e9 * seq_len  # ~2NP FLOPs per token
    prefill_time_ms = prefill_flops / 312e12 * 1000     # A100 peak FP16 FLOPS

    # Decode: memory-bandwidth-bound. Weight reads are amortized across a large
    # batch and omitted here; the KV cache must be read per request per step.
    kv_cache_bytes_per_token = 2 * n_layers * d_model * 2  # K and V per layer, FP16 (2 bytes)
    total_kv_bytes = kv_cache_bytes_per_token * (seq_len + gen_len)
    decode_time_per_token_ms = total_kv_bytes / 2e12 * 1000  # A100 memory bandwidth

    return {
        "prefill_ms": prefill_time_ms,
        "per_token_decode_ms": decode_time_per_token_ms,
        "total_decode_ms": decode_time_per_token_ms * gen_len,
        "total_ms": prefill_time_ms + decode_time_per_token_ms * gen_len,
    }
Inference Cost Breakdown (A100 80GB, FP16)
| Model | Input (tokens) | Output (tokens) | Prefill (ms) | Decode (ms) | Total (ms) |
|---|---|---|---|---|---|
| 7B Decoder | 512 | 128 | 3.3 | 38.4 | 41.7 |
| 7B Decoder | 2048 | 128 | 13.1 | 46.1 | 59.2 |
| 7B Decoder | 2048 | 512 | 13.1 | 215.0 | 228.1 |
| 70B Decoder | 512 | 128 | 33.0 | 384.0 | 417.0 |
| 70B Decoder | 2048 | 512 | 131.0 | 2150.0 | 2281.0 |
| 110M Encoder (BERT) | 512 | N/A | 0.6 | N/A | 0.6 |
| 340M Encoder (BERT-L) | 512 | N/A | 1.8 | N/A | 1.8 |
The table reveals the fundamental tradeoff. For pure encoding tasks (classification, embedding, retrieval), encoders are 50-1000x faster than using a decoder to generate an answer. But for any task requiring generation, you need a decoder, and the decode phase dominates total latency.
Batching and Throughput
In production, you batch multiple requests together. This changes the economics significantly:
Encoder batching is simple: stack inputs, run one forward pass, get all outputs. Throughput scales linearly with batch size up to GPU memory limits.
Decoder batching is complex because different requests are at different points in generation. You need continuous batching (also called iteration-level batching) to keep the GPU busy. Frameworks like vLLM, TensorRT-LLM, and SGLang implement this.
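The core idea of continuous batching fits in a few lines (hypothetical model and request APIs; production schedulers like vLLM's add paging, preemption, and prefill/decode interleaving):

```python
EOS_TOKEN = 0  # hypothetical end-of-sequence id

class Request:
    def __init__(self, prompt, max_len=32):
        self.prompt = prompt
        self.max_len = max_len
        self.tokens = []

def continuous_batching_loop(model, request_queue, max_batch=64):
    """Iteration-level batching: admit and retire requests every step (sketch)."""
    active, finished = [], []
    while request_queue or active:
        # Admit new requests whenever slots free up -- no waiting for the batch to drain.
        while request_queue and len(active) < max_batch:
            req = request_queue.pop(0)
            req.tokens = model.prefill(req.prompt)
            active.append(req)
        # One decode step for the entire batch.
        next_tokens = model.decode_step_batch([r.tokens for r in active])
        for req, tok in zip(active, next_tokens):
            req.tokens.append(tok)
        # Retire finished requests immediately, freeing their slots.
        still_active = []
        for r in active:
            if r.tokens[-1] == EOS_TOKEN or len(r.tokens) >= r.max_len:
                finished.append(r)
            else:
                still_active.append(r)
        active = still_active
    return finished
```

The key property is that a request finishing early frees its slot on the very next iteration, instead of leaving the GPU partially idle until the whole batch completes.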
Throughput at Scale (A100 80GB, Concurrent Requests)
| Scenario | Model | Batch Size | Throughput | Latency (p50) |
|---|---|---|---|---|
| Classification | DeBERTa-large (304M) | 256 | 48,000 inputs/sec | 5.3 ms |
| Classification | Llama-3 8B (via prompt) | 32 | 890 inputs/sec | 36 ms |
| Generation (128 tok) | Llama-3 8B | 64 | 12,400 tok/sec | 165 ms |
| Generation (512 tok) | Llama-3 8B | 32 | 8,200 tok/sec | 580 ms |
| Embedding | BGE-large (335M) | 512 | 62,000 inputs/sec | 8.2 ms |
For classification workloads, using a 7B decoder model is roughly 50x more expensive than using a 300M encoder model. This is why encoder models persist in production.
The Memory Wall
The KV cache is the dominant memory cost during decoder inference. For a model with L layers and hidden dimension d_model (ignoring GQA-style optimizations), the KV cache per token is:

KV bytes per token = 2 (K and V) × L × d_model × bytes_per_value

For a Llama-3-70B-scale model (80 layers, d_model = 8192, FP16): 2 × 80 × 8192 × 2 bytes ≈ 2.6 MB per token.

At 8192 tokens of context, the KV cache alone is ≈ 21 GB — more than half of a 40 GB A100’s memory. This is the memory wall, and it constrains batch size, throughput, and maximum context length.
Encoders have no KV cache. Their memory usage is constant regardless of how many inputs you process (assuming fixed sequence length). This is another reason encoder models are preferred for high-throughput inference of non-generative tasks.
The memory wall has driven enormous innovation in KV cache management: Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache by sharing keys/values across heads. KV cache quantization (FP8, INT4) halves or quarters memory. PagedAttention (vLLM) eliminates memory fragmentation. Sliding window attention (Mistral) bounds the cache to a fixed window size. All of these are decoder-specific optimizations — encoders do not need them.
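The GQA saving is easy to quantify. A sketch of per-token KV-cache size as a function of the KV-head count (the configuration numbers are Llama-style approximations):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Per-token KV cache: one K and one V vector per layer (FP16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# 70B-class model: 80 layers, head_dim 128.
mha = kv_cache_bytes_per_token(80, 64, 128)  # full multi-head attention, 64 KV heads
gqa = kv_cache_bytes_per_token(80, 8, 128)   # grouped-query attention, 8 KV heads
print(mha // 1024, "KiB vs", gqa // 1024, "KiB")  # 2560 KiB vs 320 KiB: an 8x reduction
```

Shrinking the per-token cache by 8x directly translates into 8x larger batches or 8x longer contexts for the same memory budget.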
Modern Convergence: Decoder-Only Does Everything
The most striking development of 2023-2025 is that decoder-only models now match or exceed encoder performance even on understanding tasks, when sufficiently large. The encoder advantage at comparable sizes has been overwhelmed by the sheer scale of modern decoder models.
Understanding Tasks: The Scale Argument
Consider the trajectory:
- GPT-2 (2019, 1.5B): Mediocre on understanding benchmarks. BERT clearly better.
- GPT-3 (2020, 175B): Competitive on many understanding tasks via few-shot prompting, without fine-tuning.
- GPT-4 (2023, rumored ~1.8T MoE): Exceeds fine-tuned BERT/DeBERTa on most understanding benchmarks.
- Llama 3.1 (2024, 405B): Open-source model exceeding encoder models on understanding tasks.
At 400B+ parameters, the decoder’s left-context-only disadvantage is compensated by the enormous model capacity and training data. The model effectively learns to “look ahead” through patterns in its training data.
Understanding Tasks: Modern Decoder vs Best Encoder
| Benchmark | DeBERTa-v3 Large (304M, fine-tuned) | GPT-4 (zero-shot) | Llama-3.1 70B (fine-tuned) | Winner |
|---|---|---|---|---|
| SuperGLUE | 91.4 | 93.8 | 92.1 | GPT-4 |
| MMLU | N/A | 86.4 | 82.0 | GPT-4 (decoders only) |
| HellaSwag | N/A | 95.3 | 88.0 | GPT-4 (decoders only) |
| SQuAD v2.0 (F1) | 90.7 | ~91 | 89.5 | Comparable |
| Sentiment (SST-2) | 96.8 | 97.1 | 96.5 | Comparable |
| NER (CoNLL-03) | 93.8 | ~88 (zero-shot) | 92.1 (fine-tuned) | DeBERTa (specialized) |
The table shows an important nuance: for highly specialized tasks like NER, fine-tuned encoders still hold an edge. But the gap is small and shrinking. And for the new benchmarks that matter — MMLU, reasoning, multi-step tasks — only decoder models compete.
The Embedding Revolution
One area where encoders seemed irreplaceable was text embedding for retrieval and similarity. Models like Sentence-BERT, BGE, and E5 used encoder architectures to produce fixed-size embeddings. But decoder-based embeddings have caught up:
- LLM2Vec (2024): Converts decoder-only Llama models into effective text encoders by enabling bidirectional attention and using mean pooling.
- GritLM (2024): A single model that does both generation and embedding, outperforming dedicated encoder models.
- Nomic Embed and SFR-Embedding: Decoder-derived models competitive with the best encoder embeddings on MTEB.
The trick is simple: take a pretrained decoder model, fine-tune it with contrastive learning for embedding, and optionally enable bidirectional attention during the embedding forward pass. The decoder’s vast pretraining knowledge transfers directly.
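Mechanically, the embedding path is one forward pass plus pooling. A sketch with plain Python lists (hypothetical model API; LLM2Vec additionally patches the attention mask to be bidirectional during this pass):

```python
def embed_with_decoder(model, token_ids):
    """Mean-pool a decoder's final hidden states into one embedding vector."""
    hidden = model.forward(token_ids)  # [seq_len][hidden_size]
    dim = len(hidden[0])
    # Mean pooling across the sequence dimension.
    pooled = [sum(h[j] for h in hidden) / len(hidden) for j in range(dim)]
    # L2-normalize so cosine similarity reduces to a dot product.
    norm = sum(x * x for x in pooled) ** 0.5
    return [x / norm for x in pooled]
```

The contrastive fine-tuning then only adjusts which directions of this vector encode semantic similarity; the representation itself comes free from pretraining.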
What Encoders Still Win
Despite the convergence, encoder models retain clear advantages in specific production scenarios:
1. Latency-critical classification. When you need sub-5ms latency for content moderation, spam detection, or real-time sentiment analysis, a 100M-parameter encoder is the right tool. No 7B decoder can match it.
2. High-throughput embedding. Generating millions of embeddings per hour for search indexing is 50-100x cheaper with a dedicated encoder model than with a decoder.
3. Token-level tasks. NER, POS tagging, and chunking benefit from bidirectional attention at every token position. Fine-tuned encoders are still slightly more accurate per parameter.
4. Edge deployment. Encoder models compress well (DistilBERT is 66M parameters) and run efficiently on CPUs and mobile devices. Decoder models require significantly more resources.
The “decoder-only won” narrative applies to frontier AI research and general-purpose assistants. In production systems, the choice depends on your workload. Classification pipeline processing 10M documents/day? Use an encoder. Building a conversational AI product? Decoder-only is the only option. Generating search embeddings? Encoder is 50x cheaper, but decoder-based embeddings are closing the quality gap.
Hardware Implications and Co-Design
The dominance of decoder-only models has shaped hardware design and vice versa.
GPU Architecture Alignment
Modern GPU features are increasingly optimized for autoregressive decoding:
- Tensor Cores are designed for large matrix multiplications, which dominate both prefill and decode.
- HBM bandwidth is the bottleneck for decode (reading KV cache), driving investment in HBM3 and HBM3e.
- NVLink and NVSwitch enable tensor parallelism across GPUs, critical for large decoder models.
- FP8 support (H100, H200) directly targets the decode memory-bandwidth bottleneck by halving data movement.
Encoder inference is compute-bound and relatively simple. It does not push hardware boundaries the way autoregressive decoding does. Hardware vendors optimize for the harder, higher-value problem — which is decoding.
Hardware Utilization: Encoder vs Decoder Workloads
| Metric | Encoder (Classification) | Decoder (Prefill) | Decoder (Decode) | Bottleneck |
|---|---|---|---|---|
| Compute Utilization | 85-95% | 70-85% | 5-15% | Decode is memory-bound |
| Memory Bandwidth Util. | 30-40% | 50-70% | 85-95% | Decode saturates bandwidth |
| GPU Power Draw | 250-300W | 300-350W | 200-280W | Decode underutilizes compute |
| Batch Efficiency | Near-linear scaling | Good scaling | Complex (varying seq lens) | Decode needs continuous batching |
Custom Silicon for Decoding
The decode bottleneck has motivated custom hardware designs:
- Google TPU v5e: Optimized for serving workloads, high memory bandwidth per chip.
- Groq LPU: Deterministic, low-latency architecture specifically targeting autoregressive decoding.
- Cerebras CS-3: Wafer-scale chip with massive on-chip SRAM to hold KV caches without HBM latency.
- AWS Inferentia2/Trainium: Cost-optimized accelerators for decoder inference and training.
None of these chips were designed for encoder workloads. The hardware ecosystem is co-evolving with the decoder-only paradigm.
Speculative Decoding and the Generation Speed Problem
One of the most important recent innovations addresses the fundamental speed disadvantage of autoregressive decoding: speculative decoding.
The Core Idea
Use a small, fast “draft” model to propose multiple tokens at once. Then use the large “target” model to verify all proposed tokens in a single forward pass (which is parallel, like an encoder). Accept the tokens that match; reject and regenerate from the first mismatch.
```python
def speculative_decode(draft_model, target_model, prompt, gamma=5,
                       max_new_tokens=256):
    """Generate faster using a small draft model for speculation (sketch)."""
    tokens = list(prompt)
    kv_cache_target = None
    while len(tokens) < len(prompt) + max_new_tokens:
        # Draft model: quickly generate gamma candidate tokens
        draft_tokens = draft_model.generate(tokens, num_tokens=gamma)
        # Target model: verify all gamma tokens in ONE parallel forward pass
        # This is essentially an encoder-style parallel computation
        target_logits, kv_cache_target = target_model.forward(
            tokens + draft_tokens,
            kv_cache=kv_cache_target
        )
        # Accept the longest prefix consistent with the target's distribution
        num_accepted = verify_and_accept(draft_tokens, target_logits)
        tokens.extend(draft_tokens[:num_accepted])
        if num_accepted == gamma:
            # All drafts accepted: sample one bonus token from the target
            tokens.append(sample(target_logits[-1]))
        else:
            # On rejection, discard cached entries for the rejected positions
            kv_cache_target = kv_cache_target.truncate(len(tokens))
    return tokens
```
Speculative decoding achieves 2-3x speedup without any quality loss. It is mathematically guaranteed to produce the same distribution as the target model alone. The key insight is that verification is parallel (like an encoder), even though generation is sequential.
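The losslessness guarantee lives in the verification step. A sketch of the standard acceptance rule from speculative sampling (the function name, argument layout, and use of probabilities rather than raw logits are illustrative, not from any particular library): accept draft token i with probability min(1, p_target/p_draft); on the first rejection, resample from the normalized residual distribution.

```python
import numpy as np

def verify_and_accept(draft_tokens, draft_probs, target_probs, rng=None):
    """Speculative-sampling acceptance rule (as in Leviathan et al., 2023).
    draft_probs[i] / target_probs[i]: the full next-token distribution the
    draft / target model assigned at drafted position i."""
    rng = rng or np.random.default_rng()
    for i, tok in enumerate(draft_tokens):
        p_t = target_probs[i][tok]
        p_d = draft_probs[i][tok]
        # Accept with probability min(1, p_target / p_draft)
        if rng.random() < min(1.0, p_t / max(p_d, 1e-12)):
            continue
        # First rejection: resample from the normalized residual distribution,
        # which corrects for the probability mass the draft over-allocated
        residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
        corrected = rng.choice(len(residual), p=residual / residual.sum())
        return i, corrected  # i drafts kept, plus one corrected token
    return len(draft_tokens), None  # all accepted; caller samples a bonus token
```

Worked through the two cases, this is exactly a rejection sampler for the target distribution, which is why the output distribution is provably unchanged.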
Speculative Decoding Performance
| Target Model | Draft Model | Acceptance Rate | Speedup | Quality |
|---|---|---|---|---|
| Llama-3.1 70B | Llama-3.1 8B | ~75% | 2.3x | Identical distribution |
| Llama-3.1 70B | Llama-3.1 1B | ~55% | 1.8x | Identical distribution |
| GPT-4 | GPT-3.5-turbo (est.) | ~70% | ~2x | Identical distribution |
| Mixtral 8x22B | Mixtral 8x7B | ~65% | 2.0x | Identical distribution |
Speculative decoding borrows the encoder’s key insight — parallel processing of multiple tokens — while remaining within the decoder-only framework. This is a microcosm of the broader trend: the decoder-only architecture absorbs the useful properties of other architectures.
The Emerging Alternatives
While decoder-only has won the current generation, research continues on architectures that might challenge the paradigm.
State Space Models (Mamba)
Mamba and its successors replace attention entirely with a selective state space model that processes sequences in linear time and constant memory. There is no attention matrix, no KV cache, no quadratic scaling. Early results show competitive performance with transformers at moderate scale (up to ~7B parameters).
However, Mamba has not yet demonstrated the scaling behavior that makes transformers so compelling. The jury is still out on whether state space models can reach frontier quality.
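To make the "no KV cache" claim concrete, here is a toy sketch of the plain (non-selective) linear SSM recurrence at decode time. Mamba's actual layers make A, B, C input-dependent and use a careful discretization; the point here is only the fixed-size state:

```python
import numpy as np

def ssm_decode_step(h, x_t, A, B, C):
    """One decode step of a plain linear SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    The state h has a fixed size regardless of context length, so decode memory
    is O(1); an attention KV cache grows linearly with context instead."""
    h = A @ h + B @ x_t
    return h, C @ h

# Toy dimensions (hypothetical): state size 4, input/output size 2.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) * 0.1   # small spectral radius keeps it stable
B = rng.normal(size=(4, 2))
C = rng.normal(size=(2, 4))
h = np.zeros(4)
for _ in range(1000):               # a 1000-token context...
    h, y = ssm_decode_step(h, rng.normal(size=2), A, B, C)
print(h.shape)                      # ...and the state is still shape (4,)
```

The trade-off is that the model must compress all relevant history into that fixed state, which is one candidate explanation for why SSMs have not yet matched transformers at frontier scale.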
Hybrid Architectures (Jamba, Zamba)
Some models combine transformer layers with Mamba layers, attempting to get the best of both. Jamba (AI21) and Zamba (Zyphra) interleave attention and state-space layers. These are still fundamentally decoder-only in their generation paradigm; they simply swap attention for a state-space mixer in some layers.
Diffusion Language Models
Diffusion models generate all tokens simultaneously through iterative denoising, more like an encoder than a decoder. Research models like MDLM and SEDD show promise, but they lag autoregressive models in quality and have not yet demonstrated strong scaling.
The decoder-only transformer is the current local optimum, not necessarily the global optimum. Future architectures may combine insights from encoders (parallel processing), decoders (autoregressive quality), state space models (linear scaling), and diffusion (parallel generation). But any challenger must demonstrate the smooth scaling laws that make decoder-only transformers so bankable.
A Practical Decision Framework
Given everything above, here is how to choose an architecture for your workload in 2025:
Architecture Selection Guide (2025)
| Workload | Recommended Architecture | Model Examples | Rationale |
|---|---|---|---|
| Text classification (high throughput) | Encoder | DeBERTa-v3, ModernBERT, BGE-reranker | 50-100x cheaper than decoder for same task |
| Text embedding / retrieval | Encoder (or decoder-derived) | BGE, E5, Nomic-embed | Latency and cost matter; encoder still cheaper |
| Named Entity Recognition | Encoder | DeBERTa + token classifier | Bidirectional context helps boundary detection |
| Text generation (any kind) | Decoder-only | Llama 3, Mistral, GPT-4 | Only viable option for high-quality generation |
| Conversational AI / Chat | Decoder-only | Claude, GPT-4, Llama-3 | Requires generation + instruction following |
| Summarization | Decoder-only | Llama 3, GPT-4, Claude | Decoder-only models now outperform enc-dec on ROUGE |
| Translation | Decoder-only (or enc-dec for legacy) | GPT-4, NLLB (enc-dec) | Decoder matches enc-dec quality at scale |
| Code generation | Decoder-only | Claude, GPT-4, CodeLlama, DeepSeek-Coder | Autoregressive generation is essential |
| Multi-modal (vision + language) | Decoder-only with vision encoder | GPT-4V, LLaVA, Claude | Vision encoder + language decoder is the standard |
| Edge / mobile deployment | Encoder (or small decoder) | DistilBERT, MobileBERT, Phi-3-mini | Encoder for NLU; small decoder if generation needed |
Conclusion
The encoder vs decoder debate is resolved, but the resolution is nuanced. Decoder-only models won the frontier. They scale more predictably, generate text natively, support in-context learning, and at sufficient scale they match encoders even on understanding tasks. The entire frontier AI ecosystem — hardware, frameworks, serving infrastructure, research — is organized around decoder-only models.
But encoders did not disappear. They retreated to a well-defined niche: high-throughput, low-latency inference for classification, embedding, and token-level tasks. In this niche, they are 50-100x more cost-effective than decoder alternatives. Every major search engine, content moderation system, and recommendation pipeline still runs encoder models.
The encoder-decoder architecture, which once seemed like the elegant middle ground, turned out to be the worst position: too complex to scale like decoder-only, not fast enough for the encoder niche. T5’s legacy is the text-to-text paradigm, not the architecture.
Looking forward, the decoder-only transformer faces challenges from state space models, hybrid architectures, and diffusion approaches. But any successor must match the transformer’s extraordinary property: smooth, predictable scaling across many orders of magnitude of compute. Until something demonstrates that, decoder-only transformers will remain the foundation of frontier AI.
The lesson for practitioners is pragmatic. Use decoder-only models for anything involving generation, reasoning, or general-purpose AI. Use encoder models when you need speed and cost-efficiency for non-generative tasks. Ignore encoder-decoder models for new projects unless you have a very specific sequence-to-sequence workload with constraints that favor them. The architecture debate is over. The engineering optimization has just begun.