A large language model has hundreds of billions of parameters, distributed across attention heads, feed-forward networks, embedding matrices, and normalization layers. All of these parameters are shaped by a single objective: minimize the cross-entropy loss between the model’s predicted distribution and the actual next token. Every capability the model exhibits — factual recall, reasoning, code generation, following instructions — emerges from optimizing this one number.
This is a remarkable fact. There is no explicit “reasoning loss” or “factual accuracy reward” during pre-training. The model is simply asked, trillions of times, to predict the next token given all previous tokens. Cross-entropy measures how well it does this, and gradient descent adjusts every parameter to improve it. Understanding the loss function is therefore understanding the only training signal that shapes the model’s behavior.
This post covers the full story: the information-theoretic foundations of cross-entropy, how it works for language modeling, teacher forcing and the parallelism it enables, the exposure bias problem, label smoothing, perplexity as the standard evaluation metric, alternative losses used in post-training, and finally, how the properties of cross-entropy explain many puzzling behaviors of language models.
Cross-Entropy from Information Theory
Entropy: Measuring Uncertainty
Before cross-entropy, we need entropy. Given a discrete random variable with distribution $p$, the entropy is:

$$H(p) = -\sum_x p(x) \log p(x)$$

Entropy measures the average information content of the distribution — how “surprised” you are on average when you observe a sample from $p$. If $p$ is concentrated on a single outcome, entropy is zero (no surprise). If $p$ is uniform over $K$ outcomes, entropy is $\log K$ (maximum surprise).
For natural language, the true distribution has moderate entropy. Given “The capital of France is”, the next token is almost certainly “Paris” — low entropy. Given “I went to the”, many tokens are plausible (“store”, “park”, “doctor”, “beach”) — moderate entropy. The average entropy of English text is estimated at roughly 1–2 bits per character, or 5–10 bits per BPE token.
The entropy of a discrete distribution $p$ over a vocabulary of size $V$ is:

$$H(p) = -\sum_{i=1}^{V} p_i \log p_i$$

It is the minimum expected number of bits (if using $\log_2$) or nats (if using $\ln$) needed to encode a sample from $p$ using an optimal code. Entropy is maximized by the uniform distribution and minimized (at zero) by a point mass.
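A quick numeric check of these limiting cases (a minimal pure-Python sketch; the distributions are illustrative):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, in bits by default."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

uniform = [0.25] * 4               # maximum uncertainty over 4 outcomes
point_mass = [1.0, 0.0, 0.0, 0.0]  # no uncertainty at all
skewed = [0.7, 0.1, 0.1, 0.1]      # somewhere in between

print(entropy(uniform))     # log2(4) = 2 bits
print(entropy(point_mass))  # 0 bits
print(entropy(skewed))      # ~1.36 bits
```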
Cross-Entropy: Measuring Distribution Distance
Now suppose we have the true distribution $p$ (which we do not know explicitly) and a model distribution $q$ (our LLM’s output). The cross-entropy between $p$ and $q$ is:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

Cross-entropy measures the expected number of bits needed to encode samples from $p$ using the code optimized for $q$. If $q = p$, cross-entropy equals entropy (the optimal encoding). If $q$ differs from $p$, cross-entropy is strictly larger — the encoding is suboptimal.
The difference between cross-entropy and entropy is the KL divergence:

$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

KL divergence is always non-negative and is zero if and only if $p = q$. Since $H(p)$ is a constant (it depends only on the true distribution, not on our model), minimizing cross-entropy is equivalent to minimizing KL divergence $D_{\mathrm{KL}}(p \,\|\, q)$.
We minimize cross-entropy rather than KL divergence because they differ by a constant ($H(p)$), which does not affect gradients. More practically, computing cross-entropy requires only a single sample from $p$ (the actual next token), while computing KL divergence requires knowing $p$ itself, which we never have. Cross-entropy with a one-hot target reduces to a log-probability lookup, which is trivially efficient.
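The identity $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ can be verified numerically (a small sketch with made-up distributions):

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]  # "true" distribution
q = [0.4, 0.4, 0.2]  # model distribution

h_p = cross_entropy(p, p)  # entropy of p, since H(p) = H(p, p)
# The gap between H(p, q) and H(p) is exactly the KL divergence
assert abs(cross_entropy(p, q) - (h_p + kl(p, q))) < 1e-12
assert cross_entropy(p, q) > h_p  # any q != p costs extra bits
```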
The Asymmetry of KL Divergence
KL divergence is asymmetric: $D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)$. The standard cross-entropy loss minimizes $D_{\mathrm{KL}}(p \,\|\, q)$, which is called the forward KL divergence. This has a specific consequence for model behavior.
Forward KL penalizes the model heavily for assigning low probability to events that actually occur (because $-\log q(x)$ appears in the loss, and $-\log q(x) \to \infty$ as $q(x) \to 0$). However, it does not penalize the model for assigning probability to events that do not occur. The loss is $-\mathbb{E}_{x \sim p}[\log q(x)]$, so false positives (assigning probability to wrong tokens) are not directly penalized — they are only indirectly penalized because probability is a finite resource that sums to 1.
This asymmetry is fundamental. It means cross-entropy-trained models are mode-covering: they try to assign nonzero probability to everything that could plausibly come next, rather than concentrating probability on a single best answer. This is why language models are generative — they produce diverse outputs — and why they sometimes hallucinate — they assign probability to plausible-sounding but incorrect continuations.
Minimizing $D_{\mathrm{KL}}(p \,\|\, q)$ over $q$ produces a distribution that is zero-avoiding: if $p(x) > 0$, then $q(x) > 0$ at the optimum. The model learns to cover all modes of the true distribution, even at the cost of assigning some probability to regions where $p(x) = 0$. This is because the loss goes to infinity as $q(x) \to 0$ for any $x$ that appears in the training data.
Cross-Entropy for Language Modeling
The Language Modeling Loss
For autoregressive language modeling, the training data consists of sequences $x_1, x_2, \dots, x_T$. The model predicts each token given all previous tokens:

$$q_\theta(x_t \mid x_{<t}) = \mathrm{softmax}(W_E\, h_t)$$

where $h_t = f_\theta(x_{<t})$ is the transformer’s output at position $t$ and $W_E$ is the (tied) embedding matrix.
The cross-entropy loss for a single sequence is:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log q_\theta(x_t \mid x_{<t})$$

For a single position, the “true distribution” is a one-hot vector: probability 1 on the actual next token, probability 0 on everything else. Cross-entropy with a one-hot target simplifies enormously:

$$H(p, q) = -\sum_i p_i \log q_i = -\log q_\theta(x_t \mid x_{<t})$$

The loss at each position is simply the negative log probability assigned to the correct token. If the model assigns probability 0.95 to the correct token, the loss is $-\ln 0.95 \approx 0.05$. If it assigns probability 0.01, the loss is $-\ln 0.01 \approx 4.6$.
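In code, the one-hot simplification means the per-token loss is a single log lookup (the probabilities below are illustrative):

```python
import math

# With a one-hot target, cross-entropy collapses to -log q(correct token)
def token_loss(prob_of_correct_token):
    return -math.log(prob_of_correct_token)

print(token_loss(0.95))  # ~0.051 nats: confident and correct -> tiny loss
print(token_loss(0.01))  # ~4.61 nats: nearly missed -> large loss
print(token_loss(1e-6))  # ~13.8 nats: loss grows without bound as q -> 0
```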
Why Log Probability?
The use of logarithm has several important properties.
Additivity. The log probability of a sequence factorizes into a sum over positions:

$$\log q_\theta(x_1, \dots, x_T) = \sum_{t=1}^{T} \log q_\theta(x_t \mid x_{<t})$$

This means the sequence-level loss is the average of position-level losses, which decomposes nicely for gradient computation.
Dynamic range. Raw probabilities for individual tokens are often tiny (a typical next-token probability for a 128K vocabulary might be 0.001 for a moderately confident prediction). The log transform maps these to a more manageable range: $-\ln 0.001 \approx 6.9$ nats. Without logs, gradient magnitudes would span many orders of magnitude, making optimization numerically unstable.
Information-theoretic meaning. $-\log_2 q_\theta(x_t \mid x_{<t})$ is the number of bits needed to encode token $x_t$ under the model’s distribution. Minimizing the average $-\log q$ is equivalent to finding the most efficient coding of the training data, which is equivalent to finding the closest approximation to the true distribution.
Why Sum Over Positions?
The loss sums over all positions in the sequence. This means every token prediction contributes equally to the gradient signal (before averaging). The model learns to predict function words (“the”, “of”, “is”) with the same intensity as content words (“Paris”, “quantum”, “recursion”).
In practice, function words are far more predictable (high probability), so they contribute small losses and small gradients. Content words are less predictable, so they contribute larger losses and larger gradients. The loss naturally upweights the “hard” predictions without any explicit curriculum.
Typical Per-Token Cross-Entropy Loss by Token Type

(figure: per-token loss in nats, by token type)

Teacher Forcing: The Key to Parallel Training
How Teacher Forcing Works
During training, the model must predict $x_t$ given $x_{<t}$ for every position $t$. There are two ways to provide $x_{<t}$:
- Free running (autoregressive): Feed the model’s own predictions back as input. At position $t$, use the model’s predicted token $\hat{x}_{t-1}$ as input (not the ground truth $x_{t-1}$).
- Teacher forcing: Always feed the ground-truth token $x_{t-1}$ as input, regardless of what the model predicted.
Language model training universally uses teacher forcing. The name comes from the analogy of a teacher who always provides the correct answer for the student to build upon, rather than letting the student’s errors compound.
Teacher forcing is not merely a training trick — it is what makes transformer training computationally feasible. Without it, you cannot parallelize the forward pass across sequence positions, and training a model on trillions of tokens would be impossibly slow.
Why Teacher Forcing Enables Parallelism
This is the critical insight. With teacher forcing, the input to the model at position $t$ is the ground-truth token $x_{t-1}$, which is known before training begins. This means:
- The input to position 1 is $x_0$ (the start token) — known.
- The input to position 2 is $x_1$ — known.
- The input to position $t$ is $x_{t-1}$ — known.
All inputs are known in advance. The entire sequence can be processed in a single forward pass. The causal mask in the attention mechanism ensures that position $t$ only attends to positions $\le t$, so the model cannot “cheat” by looking ahead. But all positions are computed simultaneously through efficient matrix operations.
Without teacher forcing, the input to position $t$ depends on the model’s output at position $t-1$, which depends on the output at position $t-2$, and so on. The forward pass becomes inherently sequential — you must generate one token at a time, exactly like inference. For a sequence of length 2048, this means 2048 sequential forward passes instead of one.
The speedup is enormous:
Training Throughput: Teacher Forcing vs Free Running (Hypothetical)
| Method | Forward Passes per Sequence | GPU Utilization | Tokens/sec (A100) | Time for 1T Tokens |
|---|---|---|---|---|
| Teacher forcing | 1 | ~60% | ~500,000 | ~23 days (1024 GPUs) |
| Free running | 2,048 (seq_len) | ~2% | ~500 | ~63 years (1024 GPUs) |
Teacher forcing converts the $T$ sequential forward passes of free-running training into a single parallel forward pass. For $T = 2048$, this is a 2048x speedup. Without teacher forcing, modern LLM training would be literally impossible at current data scales.
The Causal Mask Makes It Work
Teacher forcing works because the causal attention mask prevents information leakage. At position $t$, the model sees embeddings for tokens $x_1, \dots, x_t$ (the ground-truth inputs) and must predict $x_{t+1}$. The causal mask ensures that the representation at position $t$ is computed using only positions $\le t$ — it cannot attend to positions $> t$, where future ground-truth tokens are present in the input.
The loss at position $t$ is $-\log q_\theta(x_{t+1} \mid x_{\le t})$, computed from the model’s output at position $t$. All losses are computed in a single forward pass, and the total loss is their average.
```python
import torch.nn.functional as F

def compute_lm_loss(model, input_ids, targets):
    """
    Teacher-forced language model training.
    input_ids: [B, T] -- ground-truth tokens (shifted right)
    targets:   [B, T] -- ground-truth tokens to predict
    The causal mask inside the model prevents information leakage.
    All positions are computed in a single parallel forward pass.
    """
    # Single forward pass: all positions computed simultaneously.
    # The causal attention mask ensures position t only sees positions <= t.
    logits = model(input_ids)  # [B, T, V]

    # Compute cross-entropy at every position in parallel.
    # logits[:, t, :] predicts targets[:, t] using only input_ids[:, :t+1].
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # [B*T, V]
        targets.view(-1),                  # [B*T]
        reduction='mean',
    )
    return loss
```
Under teacher forcing with a causal attention mask, the loss computed in a single parallel forward pass is mathematically identical to the loss that would be computed by running the model autoregressively on the ground-truth sequence and computing the loss at each step. The causal mask guarantees that $h_t^{\text{parallel}} = h_t^{\text{sequential}}$, so the parallel and sequential computations produce identical outputs at every position.
Exposure Bias: The Train-Test Mismatch
The Problem
Teacher forcing creates a mismatch between training and inference.
During training: The model always receives the correct previous token as input. If the ground truth is “The cat sat on the mat”, the input at each position is the correct token. The model never encounters its own errors.
During inference: The model feeds its own predicted tokens as input. If it generates “The cat sat on the table” (instead of “mat”), all subsequent predictions are conditioned on “table” — a context the model never saw during training. The model has no training experience in recovering from its own mistakes.
This is exposure bias: the model is exposed only to correct sequences during training but must handle its own (potentially incorrect) sequences during inference. The distribution of input sequences at inference time differs from the distribution at training time.
How Exposure Bias Manifests
Error compounding. When the model generates an incorrect token, the error can cascade. Each subsequent prediction is conditioned on a slightly wrong context, which increases the probability of further errors. This is particularly visible in long-form generation, where early errors can cause the text to drift off topic.
Degenerate repetition. The model sometimes enters loops, repeating the same phrase indefinitely. This happens because the model was never trained on sequences containing its own repetitions, so it does not know how to escape them. The repeated phrase creates a context that reinforces itself.
Distributional shift. More subtly, even when individual tokens are correct, the overall distribution of generated text may differ from training text in ways that compound. Generated text tends to be slightly more “generic” or “safe” than training text because the model gravitates toward high-probability tokens, shifting the context distribution.
Exposure bias was a major concern in early sequence-to-sequence models (machine translation, summarization). For modern LLMs, it is partially mitigated by the massive scale of training data (the model sees enough variety to handle minor context perturbations) and by post-training techniques (RLHF, DPO) that explicitly train on the model’s own outputs. However, it is never fully eliminated and remains a contributor to degenerate generation behaviors.
Mitigation Strategies
Scheduled sampling (Bengio et al., 2015) gradually transitions from teacher forcing to free running during training. At the start, the model always receives ground-truth inputs. As training progresses, it increasingly receives its own predictions. This exposes the model to its own errors. However, scheduled sampling is rarely used for LLMs because it breaks the parallelism advantage of teacher forcing — you must generate tokens sequentially to sample the model’s own predictions.
Sequence-level training (Ranzato et al., 2016) uses reinforcement learning to optimize sequence-level metrics (BLEU, ROUGE) rather than token-level cross-entropy. This trains the model on its own generated sequences. Modern RLHF is a descendant of this approach.
Nucleus sampling and temperature at inference time partially address degenerate behavior by preventing the model from always choosing the highest-probability token, which would amplify exposure bias effects.
Label Smoothing: Softening the Target
The Hard Target Problem
Standard cross-entropy uses a one-hot target: probability 1 on the correct token, probability 0 on everything else. This pushes the model to assign all probability mass to a single token, driving logits toward extreme values. The loss is minimized when $q_\theta(x_t \mid x_{<t}) = 1$, which requires the logit for the correct token to approach $+\infty$ relative to all other logits.
This has two problems:
- Overconfidence. The model becomes poorly calibrated, assigning near-100% probability to its top prediction even when it should be uncertain. This is particularly harmful for downstream applications that rely on the model’s confidence scores.
- Gradient saturation. As logits become extreme, softmax gradients vanish. The model stops learning because the gradient signal becomes negligibly small. Training effectively stalls for “easy” tokens that the model already predicts correctly.
How Label Smoothing Works
Label smoothing (Szegedy et al., 2016) replaces the one-hot target with a mixture:

$$p'_i = (1 - \epsilon)\, \mathbb{1}[i = x_t] + \frac{\epsilon}{V}$$

where $\epsilon$ is the smoothing parameter (typically 0.1). The target probability for the correct token becomes $1 - \epsilon + \epsilon/V$, and every other token gets probability $\epsilon/V$ (for $\epsilon = 0.1$ and $V = 128\text{K}$, roughly $7.8 \times 10^{-7}$).
The cross-entropy loss with label smoothing becomes:

$$\mathcal{L}_{\mathrm{LS}} = -(1 - \epsilon) \log q_\theta(x_t \mid x_{<t}) - \frac{\epsilon}{V} \sum_{i=1}^{V} \log q_\theta(i \mid x_{<t})$$

The second term is the cross-entropy between the uniform distribution and the model’s output, scaled by $\epsilon$. It penalizes the model for being too confident, acting as a regularizer that keeps the output distribution from collapsing to a point mass.
Label smoothing can be rewritten as $\mathcal{L}_{\mathrm{LS}} = (1 - \epsilon)\, \mathcal{L}_{\mathrm{CE}} + \epsilon \left( D_{\mathrm{KL}}(u \,\|\, q_\theta) + \log V \right)$, where $u$ is the uniform distribution and $\mathcal{L}_{\mathrm{CE}}$ is the standard cross-entropy loss. The $D_{\mathrm{KL}}(u \,\|\, q_\theta)$ term penalizes the model’s distribution for deviating from uniform — it literally adds a force that spreads probability mass more evenly.
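A minimal sketch of the smoothed target and its loss (tiny vocabulary and probabilities chosen for illustration):

```python
import math

def smoothed_target(correct_idx, vocab_size, eps=0.1):
    """One-hot target mixed with the uniform distribution."""
    base = eps / vocab_size
    t = [base] * vocab_size
    t[correct_idx] += 1.0 - eps
    return t

def cross_entropy(target, q):
    return -sum(ti * math.log(qi) for ti, qi in zip(target, q))

V = 5
q = [0.9, 0.025, 0.025, 0.025, 0.025]  # a confident model distribution
hard = smoothed_target(0, V, eps=0.0)  # plain one-hot
soft = smoothed_target(0, V, eps=0.1)

print(cross_entropy(hard, q))  # ~0.105: one-hot loss is just -log 0.9
print(cross_entropy(soft, q))  # higher: smoothing charges the model for
                               # the mass it pulled away from other tokens
```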
Empirical Impact
Label Smoothing Effects on Model Quality
| Configuration | Validation PPL | Calibration ECE | Downstream Acc | Notes |
|---|---|---|---|---|
| No smoothing (e=0) | 8.21 | 0.142 | 73.2% | Overconfident |
| e=0.05 | 8.15 | 0.068 | 73.8% | Slight improvement |
| e=0.1 (standard) | 8.18 | 0.041 | 74.1% | Best calibration |
| e=0.2 | 8.35 | 0.032 | 73.5% | Underfitting |
| e=0.3 | 8.72 | 0.028 | 72.1% | Too much smoothing |
The sweet spot is typically $\epsilon = 0.1$. Perplexity may increase slightly (the model cannot achieve zero loss with smoothed targets), but calibration improves dramatically and downstream task performance often improves. The T5 paper uses $\epsilon = 0.1$ by default.
For modern LLM pre-training, label smoothing is sometimes omitted because the sheer scale of data provides sufficient regularization. But for fine-tuning on smaller datasets, it remains a valuable tool.
Perplexity: The Standard Evaluation Metric
Definition
Perplexity is the exponential of the average cross-entropy loss:

$$\mathrm{PPL} = \exp(\mathcal{L})$$

where $\mathcal{L}$ is the average per-token cross-entropy loss (in nats, since we use the natural logarithm).

For a language model with per-token cross-entropy loss $\mathcal{L}$ (in nats), the perplexity is $e^{\mathcal{L}}$. If using bits ($\log_2$), perplexity is $2^{\mathcal{L}}$. Perplexity can be interpreted as the effective number of equally likely tokens the model is choosing between at each position. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely options.
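The definition is a one-liner, and the uniform-over-10 interpretation checks out exactly:

```python
import math

def perplexity(avg_loss_nats):
    return math.exp(avg_loss_nats)

# A model choosing uniformly among 10 tokens has loss ln(10),
# so its perplexity is 10 (up to float rounding)
print(perplexity(math.log(10)))

# ~2.0 nats per token is strong-LLM territory: ~7.4 effective choices
print(perplexity(2.0))
```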
Interpreting Perplexity
PPL = 1: The model predicts every token with 100% confidence and is always correct. This is impossible for real text (natural language has inherent randomness).
PPL = V (vocabulary size): The model assigns uniform probability to all tokens. For a 128K vocabulary, this is the worst possible perplexity for a model that uses the full vocabulary. An untrained model should have perplexity close to $V$.
PPL = 5–15: Typical range for state-of-the-art LLMs on standard benchmarks (WikiText-103, The Pile).
PPL = 20–50: Decent model, but clearly not state-of-the-art.
PPL = 100+: Poor model, or evaluated on a very difficult domain (e.g., code in a rare programming language).
Perplexity of Major Language Models on Comparable Benchmarks

(figure: PPL, lower is better)

Note that perplexity comparisons are only meaningful on the same dataset with the same tokenizer. A model with a 128K vocabulary will naturally have lower perplexity per character than a model with a 32K vocabulary (because each token covers more text), but higher perplexity per token (because there are more tokens to choose from). Always specify the unit: perplexity per token, per word, or per character.
Perplexity and Bits Per Character
To compare models with different tokenizers, convert to bits per character (BPC):

$$\mathrm{BPC} = \frac{\mathcal{L} \cdot N_{\mathrm{tokens}}}{N_{\mathrm{chars}} \cdot \ln 2}$$

where $\mathcal{L}$ is the average per-token cross-entropy loss (in nats), $N_{\mathrm{tokens}}$ is the number of tokens, and $N_{\mathrm{chars}}$ is the number of characters. Dividing by $\ln 2$ converts from nats to bits. BPC is tokenizer-independent and allows fair comparison across models with different vocabularies.
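The conversion can be sketched as follows (the token-per-character ratio here is a made-up illustrative value):

```python
import math

def bits_per_char(avg_loss_nats, n_tokens, n_chars):
    """Convert average per-token cross-entropy (nats) to bits per character."""
    return (avg_loss_nats * n_tokens / n_chars) / math.log(2)

# Hypothetical numbers: 2.0 nats/token, ~4 characters per BPE token
print(bits_per_char(2.0, n_tokens=1000, n_chars=4000))  # ~0.72 BPC
```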
State-of-the-art LLMs achieve approximately 0.7–0.9 BPC on English text. The estimated entropy of English is roughly 0.6–1.3 BPC (Shannon, 1951), suggesting that modern models are approaching the fundamental limit — though the exact entropy of natural language remains debated.
The Scaling Laws Lens
Kaplan et al. (2020) and Hoffmann et al. (2022) showed that cross-entropy loss follows predictable power laws as a function of model size $N$, dataset size $D$, and compute budget $C$:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

where $\alpha_N \approx 0.076$ and $N_c$ is a constant. Loss decreases as a power law in model size with no sign of saturation. This means perplexity also follows a power law:

$$\mathrm{PPL}(N) = \exp\!\left(\left(\frac{N_c}{N}\right)^{\alpha_N}\right)$$
The smoothness and predictability of these scaling laws is what justifies the enormous investments in training larger models. If loss decreased erratically or plateaued, there would be no basis for extrapolating that a 10x larger model would be meaningfully better.
Scaling Laws: Loss vs Model Size (Chinchilla-Optimal Training)
| Parameters | Training Tokens | Validation Loss (nats) | Perplexity | Compute (PF-days) |
|---|---|---|---|---|
| 400M | 8.0B | 2.94 | 18.9 | 24 |
| 1B | 20.0B | 2.68 | 14.6 | 150 |
| 7B | 140B | 2.28 | 9.8 | 6,300 |
| 13B | 260B | 2.18 | 8.8 | 21,000 |
| 70B | 1.4T | 1.99 | 7.3 | 540,000 |
| 405B | 15T | 1.75 | 5.8 | 30,000,000 |
Alternative Loss Functions in Post-Training
Why Cross-Entropy Is Not Enough
Cross-entropy pre-training produces a model that predicts the next token in the training distribution. But we want a model that is helpful, harmless, and honest — properties that are not directly captured by next-token prediction. A model trained only with cross-entropy will:
- Generate text that looks like the internet average, not helpful assistant responses.
- Happily produce harmful content if it appeared in the training data.
- Not distinguish between confident facts and uncertain speculation.
Post-training techniques address this gap by introducing new loss functions that shape the model’s behavior beyond what cross-entropy provides.
RLHF: Reinforcement Learning from Human Feedback
RLHF (Ouyang et al., 2022) introduces a reward model $r_\phi(x, y)$ that scores the quality of a response $y$ given a prompt $x$. The language model is then fine-tuned to maximize expected reward:

$$\max_\theta\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)$$

where $\pi_\theta$ is the current policy (the LLM being trained), $\pi_{\mathrm{ref}}$ is the reference policy (the pre-trained model), and $\beta$ controls the KL penalty that prevents the model from drifting too far from the pre-trained distribution.
The reward model is trained on human preference data: given two responses to the same prompt, a human labels which is better. The reward model learns to assign higher scores to preferred responses.
The key difference from cross-entropy: RLHF optimizes a sequence-level reward rather than token-level log-probabilities. This allows it to capture holistic properties like helpfulness and coherence that emerge at the sequence level but are invisible to per-token loss.
Without the KL penalty, RLHF would quickly degenerate. The model would learn to produce degenerate responses that exploit the reward model’s weaknesses (reward hacking). The KL term keeps the model close to the pre-trained distribution, preserving its language modeling capabilities while shifting its behavior toward higher-reward outputs.
DPO: Direct Preference Optimization
DPO (Rafailov et al., 2023) eliminates the separate reward model by deriving a closed-form loss directly from preference data:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $y_w$ is the preferred response, $y_l$ is the dispreferred response, and $\sigma$ is the sigmoid function. DPO directly increases the relative probability of preferred responses while decreasing the probability of dispreferred ones.
The insight behind DPO is that the optimal policy under the RLHF objective has a closed-form solution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left( \frac{r(x, y)}{\beta} \right)$$

By rearranging, the reward can be expressed in terms of the optimal policy:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$
Substituting this into the preference model (Bradley-Terry) and canceling the partition function yields the DPO loss, which is purely a function of policy log-probabilities — no reward model needed.
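The resulting loss is easy to compute from four sequence log-probabilities (the values and the helper name `dpo_loss` are hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair from sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# Hypothetical log-probs: the policy already prefers y_w more than the
# reference does, so the margin is positive and the loss is modest
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0,
                ref_logp_w=-12.0, ref_logp_l=-13.0, beta=0.1)
print(loss)  # ~0.55, since sigmoid(0.3) is about 0.574
```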
Post-Training Loss Comparison
| Method | Requires Reward Model? | Training Stability | Memory Overhead | Quality (MT-Bench) |
|---|---|---|---|---|
| SFT only (CE loss) | No | High | Baseline | 6.2 |
| RLHF (PPO) | Yes (separate model) | Low (reward hacking) | 2x (reward + value) | 7.8 |
| DPO | No | High | 1.5x (ref model) | 7.6 |
| KTO | No | High | 1.5x (ref model) | 7.5 |
| ORPO | No | High | Baseline | 7.3 |
Contrastive Loss
Contrastive losses train the model to distinguish between correct and incorrect completions. The simplest form is:

$$\mathcal{L} = -\log \frac{\exp\!\left( s(x, y^+)/\tau \right)}{\exp\!\left( s(x, y^+)/\tau \right) + \sum_i \exp\!\left( s(x, y_i^-)/\tau \right)}$$

where $s$ is a similarity score (often the sequence-level log probability), $y^+$ is a positive example, $y_i^-$ are negative examples, and $\tau$ is a temperature.
Contrastive losses are used in embedding models (where the goal is to learn representations, not generate text) and in some RLHF variants. They differ from cross-entropy in that they explicitly model what the output should not be, rather than only what it should be.
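A minimal pure-Python sketch of this loss with made-up scores (`info_nce` is a hypothetical helper name):

```python
import math

def info_nce(pos_score, neg_scores, tau=1.0):
    """Contrastive (InfoNCE) loss: -log softmax of the positive's score."""
    scores = [pos_score] + list(neg_scores)
    z = [s / tau for s in scores]
    m = max(z)  # stabilize the log-sum-exp
    log_denom = m + math.log(sum(math.exp(s - m) for s in z))
    return -(pos_score / tau - log_denom)

# The positive clearly outscores the negatives -> small loss
print(info_nce(5.0, [1.0, 0.5, -2.0]))
# Lower temperature sharpens the softmax, shrinking the loss further
print(info_nce(5.0, [1.0, 0.5, -2.0], tau=0.5))
```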
What the Loss Function Shapes
Why Models Hallucinate
Hallucination — generating plausible-sounding but factually incorrect text — is a direct consequence of the cross-entropy objective.
Cross-entropy trains the model to assign high probability to tokens that appear in the training data in similar contexts. If “The capital of Australia is Sydney” appears in the training data (perhaps in a question about what people commonly misbelieve), the model learns to assign nonzero probability to “Sydney” after “The capital of Australia is”. More importantly, the mode-covering property of forward KL means the model tries to assign some probability to every plausible continuation it has seen, rather than concentrating all mass on the single correct answer.
The model does not distinguish between factual and counterfactual continuations in the training data. It sees both “The capital of Australia is Canberra” (factual) and “Many people believe the capital of Australia is Sydney” (factual text containing a counterfactual claim). Both contribute to the model’s distribution. Cross-entropy does not penalize the model for assigning probability to “Sydney” — it only penalizes the model for not assigning enough probability to the correct continuation in each specific context.
From the perspective of cross-entropy, hallucination is correct behavior. The model is accurately representing the distribution of text it was trained on, which includes both correct and incorrect statements. Reducing hallucination requires either (a) curating the training data to remove incorrect information (impractical at scale), or (b) using post-training objectives that explicitly penalize factual errors (RLHF, DPO with factuality rewards).
Why Models Are Surprisingly Well-Calibrated
Despite the hallucination problem, cross-entropy-trained models are often well-calibrated: when the model assigns 70% probability to a token, that token is correct roughly 70% of the time. This is a direct consequence of the loss function.
Cross-entropy is a proper scoring rule: a prediction mechanism where the expected loss is minimized by reporting the true probability distribution. If the true probability of “Paris” in a given context is 0.7, the model minimizes its expected loss by assigning exactly $q = 0.7$, not by rounding up to 1.0 or down to 0.5.
A scoring rule is strictly proper if the expected score $\mathbb{E}_{x \sim p}[-\log q(x)]$ is uniquely minimized when $q = p$. Cross-entropy is strictly proper: the expected loss is minimized if and only if $q = p$. This means a model with infinite capacity trained with cross-entropy on infinite data will recover the true conditional distribution exactly.
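This is easy to verify for a binary choice: if the true probability is 0.7, reporting 0.7 beats both over- and under-confidence (a small numeric sketch):

```python
import math

def expected_loss(p_true, q_model):
    """Expected cross-entropy when the true P(correct) is p_true
    and the model reports q_model."""
    return -(p_true * math.log(q_model)
             + (1 - p_true) * math.log(1 - q_model))

p = 0.7  # true probability of the token
honest = expected_loss(p, 0.7)
assert honest < expected_loss(p, 0.9)  # overconfident costs more
assert honest < expected_loss(p, 0.5)  # underconfident costs more
print(honest)  # ~0.611 nats: exactly the entropy of the true distribution
```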
This property means that the model is incentivized to be honest about its uncertainty. If two tokens are equally likely, the model should assign them equal probability — and cross-entropy ensures this is the optimal strategy. This is why LLM probabilities are useful for applications like uncertainty quantification and confidence estimation.
Why Models Struggle with Negation
Cross-entropy treats each token independently. The loss at position depends only on the model’s prediction at position , not on the semantic coherence of the full sequence. This creates a systematic weakness with negation.
Consider: “A cat is not a dog.” The model processes “A cat is” and must predict “not”. If the training data contains both “A cat is a pet” and “A cat is not a dog”, the model must assign probability to both “a” and “not” in this context. The crucial token is “not” — a single token that reverses the meaning of the entire sentence.
The problem is that cross-entropy assigns the same magnitude of loss to getting “not” wrong as to getting any other token wrong. But the semantic impact of missing “not” is far greater than missing, say, “the” before “dog”. Cross-entropy is blind to semantic importance — it treats all tokens as equally important prediction targets.
This is not merely a theoretical concern. Empirical studies consistently show that LLMs perform worse on negated statements than on affirmative ones. “Which of the following is NOT true?” reliably degrades model accuracy compared to “Which of the following IS true?” — even when the underlying factual knowledge is identical.
Why Repetition Occurs
Repetition in generated text has a cross-entropy explanation. During training with teacher forcing, the model never encounters its own repeated text as input. At inference, once a phrase is generated, it appears in the context. If the model assigns even slightly elevated probability to repeating a phrase it has just seen (a reasonable learned bias, since natural language does contain repetition), the repeated phrase enters the context and reinforces itself.
Cross-entropy cannot prevent this because it never trains on degenerate inputs. The loss function evaluates predictions on correct sequences, so it never learns that “I think the answer is clear. I think the answer is clear. I think the answer is clear.” is degenerate. Post-hoc fixes like repetition penalty, frequency penalty, and presence penalty at inference time are engineering solutions to a fundamental mismatch between the training objective and the generation process.
Impact of Repetition Penalty on Generated Text Quality (Human Evaluation)

(figure: mean human rating, score 1–5)

Why Models Are Better at Common Knowledge
Cross-entropy loss is proportional to the negative log probability of the correct token. Tokens that appear frequently in many contexts (common knowledge) are predicted correctly more often, contributing small losses. Tokens that represent rare facts contribute larger losses when wrong, but these rare-fact contexts appear infrequently in the training data.
The gradient from a single training example is proportional to the loss magnitude. Common patterns contribute many small, consistent gradients that accumulate over millions of examples. Rare patterns contribute fewer, larger gradients that may conflict with each other (the same rare entity might appear in contradictory contexts).
This means the model learns common knowledge reliably (many consistent gradient updates) and rare knowledge unreliably (few, potentially contradictory updates). The loss function does not explicitly prefer common over rare knowledge — it treats all tokens equally — but the frequency of training examples creates an implicit curriculum that favors frequently occurring patterns.
Numerical Stability
The Log-Sum-Exp Trick
Computing cross-entropy naively involves computing $\log \sum_i e^{z_i}$, where $z_i$ are logits. If any logit is large (say, $z_i = 100$), $e^{z_i}$ overflows to infinity. If all logits are very negative (say, $z_i = -100$), $e^{z_i}$ underflows to zero and the log of zero is undefined.

The log-sum-exp trick subtracts the maximum logit before exponentiating:

$$\log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m}, \qquad m = \max_i z_i$$

After subtracting $m$, all exponents are $\le 0$, so $e^{z_i - m} \in (0, 1]$. The sum is at least 1 (because the maximum element contributes $e^0 = 1$), so the log is non-negative. No overflow, no underflow.
```python
import torch

def stable_cross_entropy(logits, target):
    """
    Numerically stable cross-entropy computation.
    logits: [V] -- raw output-head scores
    target: int -- correct token ID
    """
    # Step 1: log-sum-exp trick (subtract the max before exponentiating)
    max_logit = logits.max()
    shifted = logits - max_logit
    log_sum_exp = max_logit + torch.log(torch.exp(shifted).sum())

    # Step 2: loss is log-sum-exp minus the target logit
    loss = log_sum_exp - logits[target]
    return loss
```
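The same trick can be sanity-checked in plain Python with logits that would overflow naively:

```python
import math

def log_sum_exp(logits):
    """Numerically stable log(sum(exp(z_i))) via the max-subtraction trick."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

big = [1000.0, 999.0, 998.0]
# Naive math.exp(1000.0) raises OverflowError; the shifted version is fine
print(log_sum_exp(big))  # ~1000.41, dominated by the largest logit
```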
Every deep learning framework implements this automatically in their cross-entropy functions. But understanding it is important because custom CUDA kernels for fused cross-entropy (discussed in the previous post on the output head) must implement this trick correctly.
Mixed Precision Considerations
During mixed-precision training, logits are often computed in BF16 or FP16 but the loss should be computed in FP32. The softmax operation is particularly sensitive to precision because it involves exponentials of large numbers followed by normalization.
A common pattern is to cast logits to FP32 before computing the loss:

```python
# logits are BF16 from the output head
logits_fp32 = logits.float()                  # cast to FP32
loss = F.cross_entropy(logits_fp32, targets)  # compute loss in FP32
```
This costs negligible compute (the cast is essentially free) but prevents the precision loss that would occur if softmax were computed in BF16, where the 7-bit mantissa cannot represent the fine-grained probability differences between similar logits.
Computing cross-entropy loss in BF16 can cause training instabilities, especially for large vocabularies where many logits are close in value. The softmax normalization amplifies precision errors, and the log operation further amplifies them. Always compute the loss in FP32, even if everything else runs in lower precision.
The Loss Landscape
What Does the Loss Surface Look Like?
The cross-entropy loss surface for a transformer is astronomically high-dimensional (billions of parameters), making direct visualization impossible. However, research has revealed key properties.
Local minima are rare at scale. For large models, most critical points of the loss function are saddle points, not local minima. The intuition: in $n$ dimensions, a critical point is a local minimum only if all eigenvalues of the Hessian are positive. For random functions in high dimensions, each eigenvalue is positive with probability roughly 0.5, so the probability that all are positive is roughly $2^{-n}$ — vanishingly small for $n$ in the billions. This is why SGD with momentum is sufficient for LLM training; there is no need for sophisticated global optimization.
Loss decreases smoothly with scale. The scaling laws mentioned earlier show that loss decreases as a smooth power law in model size, data size, and compute. There are no “phase transitions” in the loss — a 2x larger model reliably achieves lower loss, and the magnitude of improvement is predictable.
Sharp vs flat minima. Models that converge to “flat” regions of the loss surface (where small perturbations to parameters cause small changes in loss) tend to generalize better than models in “sharp” regions. Techniques like large batch size, learning rate warmup, and weight decay bias the optimizer toward flatter minima.
Summary
The cross-entropy loss function is the single most important equation in language model training. It is the sole signal that shapes every parameter of the model during pre-training. From information theory, we know it measures how well the model approximates the true distribution of text. From teacher forcing, we know it enables the parallelism that makes trillion-token training feasible. From its properties as a proper scoring rule, we understand why models are well-calibrated. From its mode-covering behavior (forward KL), we understand why models hallucinate. And from its token-level decomposition, we understand why models struggle with negation and semantic coherence.
Perplexity — the exponential of cross-entropy — gives us a single number that summarizes model quality, one that follows smooth, predictable power laws as models scale. Alternative losses in post-training (RLHF, DPO) address the limitations of cross-entropy by introducing human preference signals, but they build on the foundation that cross-entropy provides.
Every behavior of a language model — every capability and every failure mode — can ultimately be traced back to this loss function. Understanding cross-entropy is understanding the optimization pressure that created the model. It is the closest thing we have to understanding why these models work the way they do.