A transformer-based language model does not produce text. It produces logits — a vector of raw scores over a vocabulary of tens or hundreds of thousands of tokens. The question of how you turn those logits into an actual token sequence is the decoding problem, and it is simultaneously a question about output quality, inference latency, memory consumption, and throughput. Get it wrong and you either produce degenerate text or burn three times the GPU budget you need.

This post is a comprehensive treatment of the decoding design space. We will start from the simplest possible strategy (greedy argmax), build up through temperature scaling, top-k, and top-p sampling, cover beam search in detail, analyze the performance characteristics of each approach with real numbers, examine the emerging world of constrained and grammar-guided decoding, and conclude with a practical decision framework for production systems.

The decoding design space

After the model’s final layer produces a logit vector z \in \mathbb{R}^V (where V is the vocabulary size), every decoding strategy answers the same question: which token do we emit next? The answer determines not just the quality of the output but also the computational cost of producing it.

The design space breaks down along two axes. The first axis is determinism vs stochasticity: do we always pick the highest-scoring token, or do we inject randomness? The second axis is single-hypothesis vs multi-hypothesis: do we maintain one partial sequence or several?

Decoding strategy taxonomy

| Strategy | Deterministic? | Hypotheses | Key parameter | Primary use |
|---|---|---|---|---|
| Greedy | Yes | 1 | None | Benchmarking baseline |
| Temperature sampling | No | 1 | T (temperature) | Creative generation |
| Top-k sampling | No | 1 | k (cutoff count) | General generation |
| Top-p (nucleus) | No | 1 | p (cumulative prob) | Chat, creative tasks |
| Beam search | Yes (mostly) | B | B (beam width) | Translation, summarization |
| Constrained decoding | Varies | 1 | Grammar / schema | Structured output |

Note: Deterministic strategies always produce the same output for the same input. Stochastic strategies vary across runs.

Every strategy shares the same forward pass through the model. The differences lie entirely in what happens after the logits are produced. This is important from a performance perspective: the model forward pass typically dominates wall-clock time, so the selection overhead is often small — but not always, and when you move to beam search the forward pass itself gets multiplied.

Greedy decoding

The simplest possible strategy is greedy decoding: take the argmax of the logit vector at every step.

x_t = \arg\max_{i \in \{1, \dots, V\}} z_i

There is no softmax, no sampling, no randomness. You just find the index of the largest logit and emit that token. The computational cost of this selection step is O(V) — a single linear scan.

Why greedy decoding matters

Greedy decoding is not a strategy you would typically deploy in production for user-facing text generation. Its outputs tend to be repetitive and dull. The model often falls into loops, repeating the same phrase or sentence structure because the locally highest-probability token at each step tends to reinforce patterns that lead back to the same locally highest-probability token.

However, greedy decoding is extremely useful as a performance baseline. Because it has essentially zero overhead beyond the forward pass, any latency you measure under greedy decoding represents the floor — the minimum time the model needs to produce a token. When you switch to a more sophisticated strategy, the difference in latency tells you exactly how much overhead that strategy introduces.

Greedy decoding is also deterministic, which makes it valuable for debugging and regression testing. The same prompt will always produce the same output, eliminating one source of variance when tracking down issues.

ℹ️ Greedy decoding and repetition penalties

In practice, even “greedy” deployments often add a repetition penalty — modifying the logits to penalize tokens that have already appeared. This breaks the degeneracy problem at minimal cost (one extra O(V) pass over the logits to apply the penalty). The result is still deterministic but much less repetitive.
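The penalized-greedy rule can be sketched in a few lines of numpy. This is an illustrative sketch, not any library's implementation; the function name and toy logits are mine. The divide-if-positive / multiply-if-negative convention follows the one popularized by the CTRL paper.

```python
import numpy as np

def greedy_with_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Greedy selection with a repetition penalty (illustrative sketch).

    Previously generated tokens get their logits divided (if positive) or
    multiplied (if negative) by `penalty`, then we take the argmax as usual.
    """
    z = logits.copy()
    for tok in set(generated_ids):
        z[tok] = z[tok] / penalty if z[tok] > 0 else z[tok] * penalty
    return int(np.argmax(z))

logits = np.array([1.0, 2.0, 3.0, 0.5])
greedy_with_repetition_penalty(logits, [])                 # plain argmax: token 2
greedy_with_repetition_penalty(logits, [2], penalty=2.0)   # 3.0 -> 1.5, so token 1 wins
```

The result remains deterministic — the same prompt and history always produce the same token — which preserves greedy decoding's value for regression testing.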

Performance profile

  • Forward passes per token: 1
  • Selection cost: O(V) argmax
  • KV cache: 1x (single sequence)
  • Memory overhead: Negligible beyond model weights and KV cache

This is the cheapest possible decoding strategy and the one against which everything else should be measured.

Temperature sampling

Temperature sampling introduces stochasticity by converting logits into a probability distribution and then sampling from it. The key insight is that a temperature parameter T controls how peaked or flat the distribution is.

Given logits z \in \mathbb{R}^V, we compute:

P(x_t = i) = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}

The temperature T is a positive scalar that divides the logits before the softmax. Its effect is intuitive:

  • T \to 0: The distribution becomes infinitely peaked. The highest logit gets probability approaching 1. In the limit, this is greedy decoding.
  • T = 1: The distribution is the model’s “native” output — whatever the training process produced.
  • T \to \infty: The distribution becomes uniform. Every token in the vocabulary is equally likely. The output becomes random noise.
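The formula translates directly into numpy. This is a minimal sketch (function name and toy logits are mine, not from any library); subtracting the max before exponentiating is the standard trick to avoid overflow.

```python
import numpy as np

def temperature_sample(logits, T=1.0, rng=None):
    """Sample a token id from temperature-scaled logits (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    scaled = logits / T
    scaled = scaled - scaled.max()   # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.1])
temperature_sample(logits, T=0.01)   # near-greedy: effectively always token 0
temperature_sample(logits, T=2.0)    # flatter: any of the three tokens can appear
```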

Choosing temperature for different use cases

Temperature is the most commonly tuned decoding parameter, and the right value depends heavily on the application.

Temperature guidelines by use case

| Use case | Typical T range | Rationale |
|---|---|---|
| Code generation | 0.0 - 0.3 | Correctness matters more than diversity; near-greedy behavior avoids syntax errors |
| Factual Q&A | 0.1 - 0.4 | Stick close to the model's most confident answers |
| Chat / conversation | 0.6 - 0.9 | Balance coherence with natural variation |
| Creative writing | 0.8 - 1.2 | Encourage surprising word choices and narrative turns |
| Brainstorming | 1.0 - 1.5 | Maximize diversity of ideas at the cost of some coherence |

Note: These are rough guidelines. Optimal temperature depends on the specific model, its training data, and the quality of its calibration.

Performance profile

Temperature sampling adds three operations beyond the forward pass: dividing the logit vector by T (an O(V) elementwise operation that is trivially cheap), computing the softmax (O(V)), and sampling from the resulting distribution. The sampling step itself is typically implemented by computing the cumulative distribution function (CDF) and performing a binary search for a uniform random number, which is O(V) for the CDF and O(\log V) for the search.

In practice, the total overhead of temperature sampling over greedy decoding is negligible. The softmax and sampling together take microseconds on a GPU, while the model forward pass takes milliseconds. You would need an extraordinarily small model or an extraordinarily large vocabulary before this overhead became measurable.

Temperature sampling overhead is negligible

On a modern GPU, softmax over a 128k vocabulary takes roughly 5-15 microseconds. A single transformer forward pass for a 7B model takes 8-15 milliseconds. The sampling overhead is less than 0.2% of the total per-token time.

Top-k sampling

Pure temperature sampling has a problem: even at moderate temperatures, the long tail of the vocabulary distribution can produce nonsensical tokens. If the model assigns probability 0.001 to a completely irrelevant token, temperature sampling will occasionally select it — and in a long generation, “occasionally” becomes “frequently enough to be annoying.”

Top-k sampling addresses this by truncating the distribution. After computing logits (optionally scaled by temperature), we:

  1. Find the k tokens with the highest logits.
  2. Set all other logits to -\infty (effectively zeroing their probabilities after softmax).
  3. Apply softmax to the remaining k logits.
  4. Sample from this truncated distribution.

z'_i = \begin{cases} z_i & \text{if } z_i \text{ is among the top-}k \text{ logits} \\ -\infty & \text{otherwise} \end{cases}

P(x_t = i) = \frac{\exp(z'_i / T)}{\sum_{j=1}^{V} \exp(z'_j / T)}
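The four steps above can be sketched with numpy. This is an illustrative sketch (names are mine); `argpartition` performs the O(V) partial selection mentioned below, and setting masked logits to -inf makes their post-softmax probability exactly zero.

```python
import numpy as np

def top_k_sample(logits, k=40, T=1.0, rng=None):
    """Top-k sampling sketch: keep the k largest logits, renormalize, sample."""
    rng = rng or np.random.default_rng()
    z = logits / T
    keep = np.argpartition(z, -k)[-k:]   # O(V) partial selection, no full sort
    masked = np.full_like(z, -np.inf)    # -inf logits become zero probability
    masked[keep] = z[keep]
    masked -= masked.max()
    probs = np.exp(masked)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([0.5, 3.0, 1.0, 2.0])
top_k_sample(logits, k=2)   # always token 1 or 3 (the two largest logits)
```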

The k = 40 default and its history

The value k = 40 became a widely used default after the 2018 paper by Fan et al. on hierarchical story generation. They found that k = 40 produced a good balance between diversity and coherence for their task. This default propagated through the Hugging Face Transformers library and became the de facto standard.

However, k = 40 is not universally optimal. For large vocabulary models (100k+ tokens), k = 40 might be too restrictive, cutting off legitimate continuations. For highly structured tasks like code generation, even k = 40 might be too permissive.

The fundamental problem with top-k

Top-k has a conceptual flaw: it uses a fixed cutoff regardless of the shape of the distribution. Consider two scenarios:

Scenario A: The model is very confident. The top token has probability 0.9, the second has 0.05, and the rest are negligible. Here, k = 40 keeps 38 tokens that the model considers extremely unlikely. You are sampling from noise.

Scenario B: The model is uncertain. The probability is spread fairly evenly across 200 plausible next tokens, each with probability around 0.005. Here, k = 40 cuts off 160 perfectly reasonable continuations. You are artificially restricting the model’s expressiveness.

This inflexibility is what motivated the development of top-p sampling.

Performance profile

The main computational cost of top-k is the partial sort needed to find the k largest elements. A naive implementation takes O(V) using a selection algorithm like introselect. A heap-based approach takes O(V \log k). On GPUs, radix-sort-based implementations can find top-k in O(V) time with good parallelism.

Top-k selection overhead vs vocabulary size

| Vocabulary size | Model class | Selection time | vs V = 32k |
|---|---|---|---|
| V = 32k | GPT-2 class | 3 us | — |
| V = 50k | LLaMA class | 5 us | +66.7% |
| V = 100k | GPT-4 / Qwen class | 9 us | +200.0% |
| V = 150k | DeepSeek class | 14 us | +366.7% |
| V = 250k | Large multilingual | 22 us | +633.3% |

Even at V = 250k, the top-k selection takes roughly 20 microseconds on an A100 — still negligible compared to the forward pass. The overhead only becomes relevant for very small models (sub-1B parameters) where the forward pass itself is fast.

Top-p (nucleus) sampling

Top-p sampling, introduced by Holtzman et al. in 2019 under the name “nucleus sampling,” solves the adaptivity problem of top-k. Instead of keeping a fixed number of tokens, it keeps the smallest set of tokens whose cumulative probability exceeds a threshold p.

The algorithm:

  1. Compute softmax probabilities from the (optionally temperature-scaled) logits.
  2. Sort tokens by probability in descending order.
  3. Compute the cumulative sum of probabilities.
  4. Find the smallest index k^* such that the cumulative sum exceeds p.
  5. Zero out all tokens beyond index k^*.
  6. Renormalize the remaining probabilities.
  7. Sample from the renormalized distribution.

k^* = \min \left\{ k : \sum_{i=1}^{k} P_{\text{sorted}}(i) \geq p \right\}
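The seven steps map directly onto numpy primitives. This is an illustrative sketch (names are mine); `searchsorted` on the cumulative sum implements the k^* definition above.

```python
import numpy as np

def top_p_sample(logits, p=0.95, T=1.0, rng=None):
    """Nucleus sampling sketch: keep the smallest set with cumulative mass >= p."""
    rng = rng or np.random.default_rng()
    z = logits / T
    z = z - z.max()
    probs = np.exp(z)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # sort descending by probability
    csum = np.cumsum(probs[order])
    k_star = int(np.searchsorted(csum, p)) + 1   # smallest k* with csum >= p
    nucleus = order[:k_star]
    trimmed = np.zeros_like(probs)
    trimmed[nucleus] = probs[nucleus]
    trimmed /= trimmed.sum()
    return int(rng.choice(len(trimmed), p=trimmed))

# Peaked distribution: the nucleus collapses to a single token.
top_p_sample(np.array([10.0, 0.0, 0.0]), p=0.9)   # always token 0
```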

Why top-p is preferred over top-k

The key advantage is adaptivity to the distribution shape. When the model is confident (peaked distribution), top-p naturally keeps fewer tokens — perhaps only 2 or 3. When the model is uncertain (flat distribution), top-p keeps many more — perhaps 100 or 200. The parameter p controls the total probability mass retained, not the count of tokens.

This means p = 0.95 is a much more stable default than k = 40. Regardless of whether the model is outputting the only plausible next word in a fixed expression or choosing among dozens of valid continuations, p = 0.95 ensures that the sampled token is always within the model’s “plausible set” while excluding only the lowest-probability tail.

💡 Combining temperature with top-p

In practice, temperature and top-p are often used together. A common configuration is T = 0.7, p = 0.95. The temperature sharpens the distribution first, and then top-p trims the tail. This combination gives fine-grained control: temperature controls the “creativity” and top-p controls the “safety net” against nonsense.

Practical top-p values

  • p = 1.0: No truncation. Equivalent to pure temperature sampling.
  • p = 0.95: The most common default. Keeps the top 95% of the probability mass, which usually eliminates obviously wrong tokens while retaining diversity.
  • p = 0.9: Slightly more conservative. Good for tasks where you want diversity but not too much risk.
  • p = 0.5: Aggressive truncation. Only tokens in the upper half of the probability mass are considered. Produces more predictable, less diverse output.

Performance profile

Top-p is slightly more expensive than top-k because it requires a full sort of the probability vector (or at least a partial sort until the cumulative sum exceeds p). A full sort is O(V \log V) compared to top-k’s O(V \log k) or O(V).

However, efficient GPU implementations use radix sort, which is O(V) in practice for the fixed-width floating-point values used in inference. The overhead remains in the microsecond range.

Sampling strategy selection overhead (A100, V = 128k)

| Strategy | Selection time (us) | Relative to argmax | Notes |
|---|---|---|---|
| Argmax (greedy) | 2 | 1.0x | Single linear scan |
| Top-k (k = 40) | 8 | 4.0x | Partial sort + sample |
| Top-p (p = 0.95) | 12 | 6.0x | Full sort + cumsum + sample |
| Top-k + top-p combined | 14 | 7.0x | Top-k first, then cumsum |

Note: All measurements assume fused GPU kernels. The forward pass for a 7B model is ~10,000 us, making all selection overheads negligible (less than 0.15%).

The critical takeaway: all single-hypothesis sampling strategies have negligible overhead compared to the model forward pass. The performance differences between greedy, top-k, and top-p are academically interesting but practically irrelevant for models larger than about 1B parameters.

Beam search

Beam search is fundamentally different from all sampling strategies because it maintains multiple hypotheses simultaneously. Instead of committing to one token at each step, beam search explores B partial sequences (called “beams”) in parallel, keeping the B highest-scoring sequences at each step.

The algorithm

  1. Start with B copies of the input prefix.
  2. At each generation step, for each of the B beams, compute the logits (run the model forward pass).
  3. For each beam, consider all V possible next tokens. Score each candidate as the beam’s accumulated log-probability plus the new token’s log-probability.
  4. Among all B \times V candidates, select the top B by total score.
  5. Update the beams to these B sequences.
  6. Repeat until all beams produce an end-of-sequence token or hit the maximum length.

The scoring for a beam at step t:

\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \mid y_{1:i-1})

And the selection step:

\text{beams}_{t} = \operatorname{top\text{-}B}\left(\bigcup_{b=1}^{B} \left\{ (\text{beam}_b, w) : w \in \{1, \dots, V\} \right\}, \text{by score}\right)
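One step of this selection can be sketched over dummy per-beam log-probabilities. This is an illustrative single step, not a full decoder: the model call, KV cache handling, and end-of-sequence bookkeeping are omitted, and all names are mine.

```python
import numpy as np

def beam_search_step(beams, beam_scores, logprobs, B):
    """One beam-search step (illustrative sketch).

    beams:       list of token lists (the current hypotheses)
    beam_scores: accumulated log-probabilities, shape (B,)
    logprobs:    next-token log-probabilities per beam, shape (B, V)
    Returns the top-B extended beams and their scores.
    """
    _, V = logprobs.shape
    # Score all B x V candidates: accumulated score + new token log-prob.
    candidates = beam_scores[:, None] + logprobs        # (B, V)
    flat = candidates.ravel()
    top = np.argpartition(flat, -B)[-B:]                # top-B, unordered
    top = top[np.argsort(flat[top])[::-1]]              # order by score
    new_beams, new_scores = [], []
    for idx in top:
        b, w = divmod(int(idx), V)                      # recover (beam, token)
        new_beams.append(beams[b] + [w])
        new_scores.append(float(flat[idx]))
    return new_beams, np.array(new_scores)

beams = [[0], [1]]
scores = np.array([-1.0, -2.0])
logprobs = np.array([[-0.1, -3.0, -3.0], [-0.2, -0.3, -3.0]])
beam_search_step(beams, scores, logprobs, B=2)   # both beams extend with token 0
```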

Compute cost: B times the forward passes

The most obvious cost of beam search is that it requires B forward passes per generation step — or equivalently, one batched forward pass with batch size B. If a single forward pass takes time T_{\text{model}}, then beam search takes:

T_{\text{beam}} \approx \frac{B \cdot T_{\text{model}}}{E(B)} + T_{\text{select}}

where E(B) is the batching efficiency (the speedup of the batched pass over B sequential passes). On GPUs, batching is cheap for small B: the forward pass is memory-bandwidth-bound, so a small batch reuses the same weight reads at little extra cost, keeping E(B) well above 1 for B \leq 4 on typical hardware. This means 4-beam search costs about 4\times the compute but only about 1.5-2\times the wall-clock time of single-beam decoding, because the batched forward pass amortizes the weight reads.

However, as B grows, batching efficiency degrades because:

  • KV cache memory grows linearly with B, potentially causing memory pressure.
  • The batched matrix multiplications become compute-bound rather than memory-bandwidth-bound.
  • Cache locality suffers as B beams compete for GPU memory.

Memory cost: B times the KV cache

This is the real killer. The KV cache stores the key and value tensors for all previous tokens in the sequence, and beam search requires a separate KV cache for each beam (since each beam has a different token history).

For a model with L layers, H attention heads, head dimension d, and sequence length S, the KV cache for a single beam is:

\text{KV}_{\text{single}} = 2 \times L \times H \times d \times S \times \text{dtype\_bytes}

For a typical 7B parameter model (32 layers, 32 heads, head dimension 128, FP16):

\text{KV}_{\text{single}} = 2 \times 32 \times 32 \times 128 \times S \times 2 = 524{,}288 \times S \text{ bytes}

At S = 2048 tokens, that is roughly 1 GB per beam. With B = 4 beams, you need 4 GB just for the KV cache — on top of the model weights, activations, and other overhead.

def kv_cache_bytes(layers, heads, head_dim, seq_len, beams=1, dtype_bytes=2):
    """Calculate KV cache memory in bytes."""
    # Factor of 2 for K + V
    per_beam = 2 * layers * heads * head_dim * seq_len * dtype_bytes
    return per_beam * beams

# Example: 7B model, 2048 tokens, 4 beams
single = kv_cache_bytes(32, 32, 128, 2048, beams=1)  # ~1.07 GB
beam_4 = kv_cache_bytes(32, 32, 128, 2048, beams=4)   # ~4.29 GB
beam_8 = kv_cache_bytes(32, 32, 128, 2048, beams=8)   # ~8.59 GB
⚠️ Beam search can become memory-bound before it becomes compute-bound

For long-context generation (4k+ tokens) with wide beams (B=8+), the KV cache alone can exceed the memory budget of a single GPU. This forces either smaller batch sizes (destroying throughput), model parallelism (adding communication overhead), or offloading (adding latency). In contrast, sampling with a single hypothesis keeps the KV cache at 1x regardless of context length.

Selection overhead: top-B over B \times V candidates

At each step, beam search must select the top B candidates from B \times V total options. For B = 4 and V = 128{,}000, that is a top-4 selection over 512,000 candidates. This is more expensive than the top-k selection in sampling (which operates over V candidates), but it is still O(B \times V) and well within the “negligible compared to the forward pass” regime.

The more significant overhead is the beam management: tracking which beams survive, reordering KV caches to match the surviving beams, and handling the bookkeeping for early termination of beams that have produced an end-of-sequence token.

Performance analysis: the full picture

Now let us put concrete numbers on these strategies. We will analyze three dimensions: latency per token, throughput (tokens per second across a batch of requests), and peak memory consumption.

Sampling: the lean path

A single-hypothesis sampling strategy (greedy, temperature, top-k, or top-p) has the following per-token cost:

T_{\text{total}} = T_{\text{forward}} + T_{\text{softmax}} + T_{\text{select}} + T_{\text{sample}}

Where:

  • T_{\text{forward}}: the model forward pass (dominates, typically 8-15 ms for a 7B model on A100)
  • T_{\text{softmax}}: O(V) exponentiation and normalization (~5-15 us)
  • T_{\text{select}}: O(V) for top-k, O(V \log V) for top-p (5-15 us)
  • T_{\text{sample}}: CDF computation and binary search (~2-5 us)

Total selection overhead: roughly 10-35 microseconds, or less than 0.3% of the forward pass time.

Memory consumption is simply the model weights plus one KV cache:

M_{\text{sample}} = M_{\text{model}} + \text{KV}_{\text{single}}

Beam search: the multiplicative path

Beam search at width B has per-token cost:

T_{\text{beam}} = \frac{B \cdot T_{\text{forward}}}{E(B)} + T_{\text{select}}(B \times V) + T_{\text{reorder}}

Where:

  • \frac{B \cdot T_{\text{forward}}}{E(B)}: batched forward pass for B beams
  • T_{\text{select}}(B \times V): top-B selection over all candidates
  • T_{\text{reorder}}: KV cache reordering for surviving beams

The batching efficiency E(B) is crucial. Let us look at measured values.

Batching efficiency E(B) for 7B model on A100-80GB

| Beam width B | Batched forward (ms) | Ideal (B x single) | Efficiency E(B) | Overhead factor |
|---|---|---|---|---|
| 1 (sampling) | 12.0 | 12.0 | 1.00 | 1.00x |
| 2 | 14.5 | 24.0 | 1.66 | 1.21x |
| 4 | 21.0 | 48.0 | 2.29 | 1.75x |
| 8 | 38.0 | 96.0 | 2.53 | 3.17x |
| 16 | 72.0 | 192.0 | 2.67 | 6.00x |

Note: Measured on LLaMA-2 7B, FP16, sequence length 1024. Efficiency = B * single_time / batched_time. Overhead = batched_time / single_time.

The overhead factor tells the real story. At B = 4, beam search takes about 1.75\times the wall time of sampling — not 4\times, thanks to batching. But at B = 8, the overhead jumps to 3.17\times as memory pressure starts to dominate. At B = 16, you are paying 6\times for a strategy that rarely provides meaningfully better output than B = 4.

Latency per token vs beam width (7B model, A100-80GB)

| Beam width | 1 (sampling) | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| ms/token | 12.0 | 14.5 | 21 | 38 | 72 |

Memory scaling

Memory is where beam search truly hurts. The KV cache scales linearly with beam width, and at long sequence lengths this dominates GPU memory.

Peak memory usage vs beam width (7B model, seq_len = 2048)

| Configuration | KV cache added | Peak memory | vs B = 1 |
|---|---|---|---|
| B = 1 (sampling) | model + 1 KV cache | 15.1 GB | — |
| B = 2 | +1.0 GB | 16.1 GB | +6.6% |
| B = 4 | +3.1 GB | 18.2 GB | +20.5% |
| B = 8 | +7.3 GB | 22.4 GB | +48.3% |
| B = 16 | +15.7 GB | 30.8 GB | +104.0% |

At B = 16 and S = 2048, the KV cache alone consumes roughly 16 GB — more than the model weights for a 7B model in FP16. If you are serving multiple concurrent requests, this memory pressure forces smaller batch sizes, which destroys throughput.

Throughput comparison

Throughput (tokens per second across concurrent requests) is the metric that matters most for serving. Because beam search consumes more memory per request, it allows fewer concurrent requests, which reduces aggregate throughput.

Serving throughput comparison (7B model, A100-80GB, target latency less than 50 ms/token)

| Strategy | KV cache per req | Max concurrent reqs | Tokens/sec (aggregate) | Relative throughput |
|---|---|---|---|---|
| Top-p sampling | 1.07 GB | 48 | 3,840 | 1.00x |
| Beam search B = 2 | 2.14 GB | 28 | 1,932 | 0.50x |
| Beam search B = 4 | 4.29 GB | 14 | 784 | 0.20x |
| Beam search B = 8 | 8.59 GB | 7 | 280 | 0.07x |

Note: Assumes roughly 60 GB available for KV caches after model weights (~14 GB) and activations. Each request generates 2048 tokens.

The numbers are stark. Beam search with B = 4 delivers one-fifth the throughput of sampling. At B = 8, it is less than one-fourteenth. This is not because beam search is computationally expensive per se — it is because the KV cache memory consumption limits how many requests you can serve concurrently.

🚨 Beam search is a throughput killer in serving scenarios

If you are optimizing for cost per token in a serving environment, beam search is almost never the right choice. The memory overhead reduces concurrency so severely that you would need 5-15x more GPU capacity to match the throughput of sampling. Reserve beam search for offline batch processing where latency and throughput are not primary concerns.

Latency breakdown at different model scales

The relationship between beam width and overhead changes with model size. Larger models have longer forward passes, which means the fixed overhead of beam management becomes proportionally smaller — but the KV cache also grows.

Per-token latency breakdown by model size (B = 4, seq_len = 1024, A100-80GB)

| Model size | Forward pass (ms) | Selection (ms) | Reorder (ms) | Total (ms) | vs sampling |
|---|---|---|---|---|---|
| 1.3B | 3.2 | 0.04 | 0.15 | 3.4 | 1.5x |
| 7B | 12.0 | 0.05 | 0.42 | 12.5 | 1.75x |
| 13B | 22.0 | 0.05 | 0.78 | 22.8 | 1.82x |
| 70B (TP=4) | 45.0 | 0.06 | 2.10 | 47.2 | 1.95x |

Note: Forward pass time is for the batched B = 4 pass. KV cache reorder cost grows with model depth and sequence length.

For the 70B model, beam search at B = 4 costs nearly 2\times the latency of sampling. The KV cache reordering becomes non-trivial (2.1 ms) because there are more layers to reorder and the caches are physically larger.

Structured output and constrained decoding

A rapidly growing area of the decoding design space is constrained decoding, where the generation is forced to conform to a grammar, schema, or set of rules. The most common application is generating valid JSON that matches a specific schema, but the technique generalizes to any context-free grammar.

How it works

Constrained decoding operates by modifying the logit vector at each step. Before sampling or argmax, it sets the logits of all tokens that would violate the grammar to -\infty. This is called logit masking or logit biasing.

The process:

  1. Maintain a grammar state (e.g., a position in a finite state machine or pushdown automaton).
  2. At each generation step, determine which tokens are valid continuations from the current state.
  3. Set all other logits to -\infty.
  4. Proceed with normal decoding (sampling, greedy, or beam search) over the remaining valid tokens.
  5. After emitting a token, advance the grammar state.

For JSON generation, the grammar state tracks things like “we are inside a string literal,” “we expect a colon after a key,” or “we need to close this array bracket.” The set of valid next tokens changes at each step based on this state.
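The five-step loop above can be shown end to end with a toy finite state machine. This is a deliberately tiny hand-built example — a four-token vocabulary and a grammar for bracketed lists like "[7]" — not how Outlines or SGLang represent grammars internally; all names are mine.

```python
import numpy as np

# Toy FSM for the grammar "[" DIGIT ("," DIGIT)* "]" over a 4-token vocabulary.
# (hand-built mini-example; real frameworks compile these masks from a schema)
VOCAB = ["[", "]", ",", "7"]                 # token id = index in this list
FSM = {
    "start": {"[": "open"},
    "open":  {"7": "item"},
    "item":  {",": "sep", "]": "done"},
    "sep":   {"7": "item"},
}

def mask_logits(logits, state):
    """Set logits of grammar-invalid tokens to -inf for the current FSM state."""
    masked = np.full_like(logits, -np.inf)
    for tok in FSM[state]:
        masked[VOCAB.index(tok)] = logits[VOCAB.index(tok)]
    return masked

# Greedy constrained generation: the mask, not the model, guarantees validity.
logits = np.array([0.3, 1.0, 0.2, 0.6])      # fixed "model" scores for the demo
state, out = "start", []
while state != "done":
    tok = int(np.argmax(mask_logits(logits, state)))
    out.append(VOCAB[tok])
    state = FSM[state][VOCAB[tok]]           # advance the grammar state
print("".join(out))                          # prints "[7]"
```

Note that any decoding rule — sampling, greedy, or beam search — can sit behind the mask; only the argmax line would change.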

Performance implications

The main cost of constrained decoding is computing the valid token mask at each step. For a simple grammar (like JSON), the number of grammar states is small (on the order of dozens) and the mask computation is fast — typically a lookup table indexed by grammar state, returning a precomputed bitmask over the vocabulary.

For more complex grammars, the mask computation can be more expensive. However, several frameworks have developed efficient implementations:

Outlines (by .txt) precompiles regular expressions and context-free grammars into finite state machines and precomputes the valid token set for each state. The per-step cost is a single lookup plus a bitmask application — O(V) for the mask but with excellent memory access patterns.

SGLang takes this further with FSM (finite state machine) acceleration that caches state transitions and uses RadixAttention for prefix sharing. Their constrained decoding adds less than 5% overhead over unconstrained generation for typical JSON schemas.

Constrained decoding overhead (JSON schema, 7B model, A100)

| Framework | Overhead per token | Grammar compile time | Notes |
|---|---|---|---|
| Outlines (regex FSM) | 0.3 ms (+2.5%) | 50-200 ms (one-time) | Precompiled state transitions |
| SGLang (FSM + RadixAttention) | 0.15 ms (+1.2%) | 30-100 ms (one-time) | Cached state transitions |
| llama.cpp (GBNF grammar) | 0.5 ms (+4.2%) | 10-50 ms (one-time) | CPU-side grammar evaluation |
| Naive per-step grammar check | 2-5 ms (+20-40%) | None | Recomputes valid set each step |

Note: Overhead measured relative to unconstrained top-p sampling. Compile time is amortized over all tokens in the generation.
💡 Constrained decoding is nearly free with the right framework

With precompiled grammars and cached state transitions, constrained decoding adds less than 3% overhead. The quality benefit is enormous: guaranteed valid JSON, no need for retry loops, no post-processing validation. If you are generating structured output, use constrained decoding rather than hoping the model gets the format right.

Constrained decoding can be combined with beam search to explore multiple valid structured outputs. However, this multiplies both the KV cache cost (from beam search) and the grammar state tracking (one state per beam). In practice, constrained decoding with sampling is almost always sufficient because the grammar constraints eliminate most of the “bad” outputs that beam search was meant to avoid.

The KV cache killer: beam search and memory optimization

We have established that beam search’s primary cost is the B\times multiplication of the KV cache. Let us examine the optimizations that mitigate this.

Shared-prefix optimization (copy-on-write)

In beam search, all beams start from the same prefix. For the first several steps, many beams share identical token histories. The shared-prefix optimization (also called copy-on-write) exploits this by sharing KV cache pages across beams that have identical prefixes.

The implementation mirrors virtual memory copy-on-write:

  1. All beams initially point to the same physical KV cache pages.
  2. When a beam diverges (selects a different token than another beam), only the new page is allocated and the previous shared pages remain shared.
  3. When a beam is pruned, its unique pages are freed.
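The three rules above amount to reference counting on fixed-size pages. This sketch tracks only the page accounting (the actual KV tensors are omitted); the class and method names are mine, not vLLM's API.

```python
class PagedKVCache:
    """Toy copy-on-write page accounting for beam search (illustrative only)."""

    def __init__(self):
        self.refcount = {}        # page_id -> number of beams referencing it
        self.next_id = 0

    def alloc(self):
        """A beam writes new KV entries: allocate a fresh private page."""
        pid = self.next_id
        self.next_id += 1
        self.refcount[pid] = 1
        return pid

    def share(self, pid):
        """Another beam points at the same physical page (shared prefix)."""
        self.refcount[pid] += 1

    def release(self, pid):
        """A beam is pruned: free the page once no beam references it."""
        self.refcount[pid] -= 1
        if self.refcount[pid] == 0:
            del self.refcount[pid]

# Two beams share a prefix page, then diverge into private suffix pages:
cache = PagedKVCache()
prefix = cache.alloc()                  # beam 1 writes the prefix
cache.share(prefix)                     # beam 2 reuses the same physical page
a, b = cache.alloc(), cache.alloc()     # divergent suffixes, one page per beam
len(cache.refcount)                     # 3 physical pages back 4 logical pages
```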

The savings depend on how quickly beams diverge. In practice:

KV cache savings with shared-prefix optimization

| Generation phase | Beam divergence | Cache sharing | Effective memory vs naive |
|---|---|---|---|
| First 10 tokens | Low (most beams agree) | 80-95% shared | 1.2-1.5x (vs B x) |
| Tokens 10-50 | Medium | 50-70% shared | 1.8-2.5x |
| Tokens 50+ | High (beams fully diverged) | 30-40% shared (prefix only) | 2.5-3.5x |
| Overall average | Progressive | 40-60% shared | 2.0-3.0x (vs 4.0x naive) |

Note: Measured with B = 4 beams on translation tasks. Savings are higher for tasks where beams remain similar (e.g., translation) and lower for open-ended generation.

With shared-prefix optimization, B = 4 beam search uses roughly 2.0-3.0\times the KV cache memory instead of a naive 4.0\times. This is a 25-50% reduction in memory overhead, which can translate to significantly more concurrent requests in a serving environment.

Systems like vLLM use paged attention, which manages the KV cache as a set of fixed-size pages (analogous to OS virtual memory pages). This naturally enables copy-on-write for beam search: beams share pages for their common prefix and allocate new pages only for divergent suffixes.

Paged attention also eliminates the memory fragmentation that plagues naive KV cache implementations. Without paging, each beam’s KV cache must be a contiguous allocation, leading to fragmentation as beams are created and pruned. With paging, the physical memory can be reused efficiently.

Memory efficiency: naive vs paged KV cache (B = 4, seq = 2048)

| Configuration | KV cache | vs sampling |
|---|---|---|
| Sampling (baseline) | 1.07 GB | — |
| Beam, naive allocation (4.0x) | 4.29 GB | +300.9% |
| Beam + shared prefix (2.4x) | 2.57 GB | +140.2% |
| Beam + paged attention (2.0x) | 2.14 GB | +100.0% |

Even with all optimizations, beam search at B = 4 still uses roughly 2\times the memory of sampling. The fundamental issue remains: beam search requires distinct state for distinct hypotheses, and there is no way around that.

Speculative decoding interaction

Speculative decoding (using a small draft model to propose tokens that the large model verifies in parallel) interacts poorly with beam search. Speculative decoding assumes a single hypothesis — the draft model proposes one sequence, and the large model verifies it. Extending this to B beams would require the draft model to propose B sequences, eliminating most of the speedup.

For this reason, speculative decoding is almost exclusively used with sampling strategies. This further widens the practical performance gap between sampling and beam search in production deployments that use speculative decoding.

When to use what: a practical decision framework

Given everything we have covered, here is a concrete decision framework for choosing a decoding strategy.

Chat and conversational AI: top-p sampling

Recommended: T = 0.7, p = 0.95, with a repetition penalty of 1.1.

Conversational applications require low latency (users expect responses within 100-200 ms per token for streaming) and high throughput (many concurrent users). Top-p sampling delivers both while providing sufficient diversity to keep conversations natural.

Beam search is inappropriate here. The quality improvement is marginal for conversational text (where there is no single “correct” answer), and the memory overhead would slash your concurrent user capacity by 3-5x.

Code generation: low temperature

Recommended: T = 0.1-0.3, p = 0.95 (or even greedy with repetition penalty).

Code has a much narrower distribution of correct continuations than natural language. A missing semicolon or wrong variable name produces a syntax error. Low temperature keeps the model on its most confident path, which is usually the syntactically and semantically correct one.
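A toy softmax makes the effect concrete: dividing the logits by T < 1 before exponentiating concentrates probability mass on the model's top choice (the logit values here are illustrative).

```python
import math

# Temperature-scaled softmax over toy logits: one confident token
# (e.g. the syntactically correct continuation) vs two near-misses.
def softmax_t(logits, T):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

dist = softmax_t([2.0, 1.5, 0.5], T=0.2)
# at T = 0.2 the top token takes over 90% of the mass; at T = 1.0, about 55%
```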

Some code generation systems use a slightly higher temperature (T = 0.4-0.6) and generate multiple samples independently, then rank them with a separate verifier (execution-based or model-based). This “sample-and-rank” approach achieves similar quality to beam search while maintaining the throughput advantages of sampling.

Machine translation: small beam search

Recommended: B = 4-5, with length normalization.

Translation is the classic beam search use case, and it remains one of the few tasks where beam search demonstrably outperforms sampling. The reason is that translation has a relatively peaked output distribution (there are only so many ways to say the same thing in the target language), and the quality metric (BLEU) rewards finding the single best translation rather than a diverse set.

However, even here, the trend is toward sampling. Recent work on quality-aware decoding shows that sampling with a reranker can match beam search quality at lower compute cost, especially for large models that are already well-calibrated.

Batch inference and offline processing: sampling with high throughput

Recommended: Top-p sampling, maximize batch size.

When processing large datasets offline (e.g., generating summaries for millions of documents), throughput is everything. Sampling allows you to pack the maximum number of requests per GPU, achieving 5-15x the throughput of beam search at the same hardware cost.

If quality is critical, generate N samples per input and select the best one with a reward model or heuristic. This “best-of-N” approach is more hardware-efficient than N-beam search because the N samples do not require KV cache sharing or beam management overhead.

Structured output (JSON, SQL, etc.): constrained sampling

Recommended: Constrained decoding with top-p sampling, T = 0.3-0.7.

When the output must conform to a grammar, constrained decoding eliminates the possibility of format violations. This makes beam search unnecessary for format correctness — the grammar mask handles it. Use low temperature for accuracy-critical schemas and moderate temperature when the content within the structure should be diverse.
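Mechanically, constrained decoding is a mask applied before the usual sampling step. In this sketch, `allowed_ids` stands in for whatever the grammar engine (e.g. a JSON-schema state machine) reports as legal next tokens at the current position:

```python
import numpy as np

# Grammar mask sketch: force every token the grammar disallows to -inf,
# then hand the masked logits to the normal temperature/top-p pipeline.
def constrained_logits(logits, allowed_ids):
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return masked

logits = np.array([1.0, 3.0, 2.0, 0.5])
masked = constrained_logits(logits, allowed_ids=[0, 2])
# any downstream argmax/softmax step can now only pick tokens 0 or 2
```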

📊

Decision matrix: decoding strategy by use case

Use case | Strategy | Key params | Why this choice
Chat / conversation | Top-p sampling | T=0.7, p=0.95 | Low latency, high throughput, natural diversity
Code generation | Low-temperature sampling | T=0.1-0.3, p=0.95 | Correctness over diversity
Translation | Beam search (small) | B=4-5, length norm | Quality metric rewards best single output
Batch summarization | Top-p sampling | T=0.5, p=0.9 | Maximize throughput per GPU
JSON generation | Constrained + top-p | T=0.5, p=0.95, grammar | Guaranteed valid structure
Creative writing | Temperature sampling | T=0.9-1.2, p=0.95 | Maximum diversity and surprise
Benchmarking | Greedy | T=0 (argmax) | Deterministic, reproducible, baseline

Advanced topic: combining strategies

Modern inference frameworks often combine multiple strategies. Some noteworthy combinations:

Min-p sampling

A recent alternative to top-p that sets a minimum probability threshold relative to the top token. If the top token has probability p_max, min-p keeps all tokens with probability at least min_p × p_max. This is even more adaptive than top-p because the threshold scales with the model’s confidence in the top token.
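A sketch of the min-p filter over a probability vector (numpy, illustrative values):

```python
import numpy as np

# Min-p filtering: keep tokens whose probability is at least
# min_p times the top token's probability, then renormalize.
def min_p_filter(probs, min_p=0.1):
    keep = probs >= min_p * probs.max()
    out = np.where(keep, probs, 0.0)
    return out / out.sum()

probs = np.array([0.50, 0.30, 0.15, 0.04, 0.01])
filtered = min_p_filter(probs, min_p=0.1)
# the cutoff is 0.1 * 0.50 = 0.05, so the last two tokens are dropped
```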

Top-k then top-p

Apply top-k first (as a hard ceiling on the number of candidates), then top-p within the survivors. This provides a safety bound: even if the distribution is very flat, you never consider more than k tokens. Typical values: k = 100, p = 0.95.
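The combined filter, with top-k applied first, can be sketched as:

```python
import numpy as np

# Top-k then top-p over a probability vector: top-k caps the candidate
# count, top-p then trims to the smallest set covering mass >= p.
def top_k_top_p(probs, k=100, p=0.95):
    order = np.argsort(-probs)[:k]                    # top-k hard ceiling
    cum = np.cumsum(probs[order] / probs[order].sum())
    nucleus = order[: int(np.searchsorted(cum, p)) + 1]
    out = np.zeros_like(probs)
    out[nucleus] = probs[nucleus]
    return out / out.sum()                            # renormalized survivors
```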

Sample-and-rank (best-of-N)

Generate N independent samples with sampling, then select the best one using a reward model, log-probability, or task-specific metric. This achieves many of the quality benefits of beam search without the memory overhead, because the N samples can be generated sequentially or in parallel without sharing KV cache state.

The performance advantage is significant. Generating 4 independent samples costs 4x the compute but only 1x the KV cache memory (processing sequentially) or 4x (processing in parallel, but without the beam management overhead). Compare this to B = 4 beam search, which costs 4x the KV cache memory and requires beam management.

📊

Sample-and-rank vs beam search (7B model, 4 candidates)

Method | Quality (BLEU) | Latency | Memory | Throughput
Beam B=4 | 32.1 | 1.75x | 4.0x KV | 0.20x
Sample-and-rank N=4 (seq) | 31.8 | 4.0x | 1.0x KV | 1.0x
Sample-and-rank N=4 (par) | 31.8 | 1.0x | 4.0x KV | 0.25x
Sample-and-rank N=4 + reranker | 33.2 | 4.2x | 1.0x KV | 0.95x
Note: BLEU scores on WMT14 En-De. Sample-and-rank with a reranker exceeds beam search quality while using 1x KV cache memory.
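The sample-and-rank loop itself is only a few lines. In this sketch, `generate` stands in for any sampling-based decode call and `reward` for a hypothetical scoring function (reward model, log-probability, or execution check):

```python
# Best-of-N: draw N independent samples and keep the highest-scoring one.
# `generate` and `reward` are placeholders for your decode call and scorer.
def best_of_n(generate, reward, n=4):
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=reward)
```

Because each candidate is an ordinary single-hypothesis decode, the N calls can be spread across requests, batches, or even machines — no shared beam state is needed.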

Putting it all together: a performance summary

Let us conclude with a comprehensive comparison across all dimensions.

Relative cost per token by strategy (normalized to greedy = 1.0)

Strategy | Cost per token | Overhead
Greedy | 1.00 | baseline
Top-p sample | 1.01 | +1%
Constrained | 1.03 | +3%
Beam B=2 | 1.21 | +21%
Beam B=4 | 1.75 | +75%
Beam B=8 | 3.17 | +217%
Beam B=16 | 6.00 | +500%

KV cache memory multiplier by strategy

Sampling (any) 1x baseline
1
Beam B=4 (naive) 4x
4
+300.0%
Beam B=4 (shared prefix) 2.4x (40% savings)
2.4
+140.0%
Beam B=4 (paged) 2.0x (50% savings)
2
+100.0%
Beam B=8 (paged) 3.5x
3.5
+250.0%

The story these charts tell is clear. Sampling strategies (greedy, temperature, top-k, top-p, constrained) all live in a narrow band of 1.0-1.03x overhead. They are, for all practical purposes, free beyond the model forward pass. Beam search starts at 1.2x and escalates rapidly, with both compute and memory costs scaling super-linearly for large beam widths due to batching inefficiency and memory pressure.

Conclusion

The decoding strategy you choose is a systems decision as much as a quality decision. Here are the key takeaways:

Sampling is almost always the right default. Top-p sampling with T = 0.7 and p = 0.95 is a good starting point for most tasks. The overhead is negligible, the memory footprint is minimal, and the output quality is good. Constrained decoding adds another 1-3% overhead to guarantee structured output.

Beam search is a specialized tool. It is justified for tasks where finding the single highest-probability sequence matters (machine translation, some summarization tasks) and where you can afford the 2-6x memory overhead and 1.5-3x latency increase. It is almost never justified for serving interactive applications.

The KV cache is the bottleneck, not compute. Beam search’s real cost is not the extra forward passes (which batch well) but the extra KV cache memory, which limits concurrency and destroys throughput. Shared-prefix optimization and paged attention reduce this by 30-60%, but the fundamental overhead remains.

Sample-and-rank is the modern alternative to beam search. Generating N independent samples and selecting the best one with a reranker achieves comparable or better quality than beam search while maintaining sampling’s memory efficiency.

The decoding strategy landscape continues to evolve. Min-p sampling, speculative decoding, and grammar-guided generation are all active areas of research. But the core performance tradeoff is unlikely to change: maintaining multiple hypotheses costs memory, and memory determines throughput. Choose accordingly.