Part of series: Transformer Anatomy (12 of 23)

A large language model has hundreds of billions of parameters, distributed across attention heads, feed-forward networks, embedding matrices, and normalization layers. All of these parameters are shaped by a single objective: minimize the cross-entropy loss between the model’s predicted distribution and the actual next token. Every capability the model exhibits — factual recall, reasoning, code generation, following instructions — emerges from optimizing this one number.

This is a remarkable fact. There is no explicit “reasoning loss” or “factual accuracy reward” during pre-training. The model is simply asked, trillions of times, to predict the next token given all previous tokens. Cross-entropy measures how well it does this, and gradient descent adjusts every parameter to improve it. Understanding the loss function is therefore understanding the only training signal that shapes the model’s behavior.

This post covers the full story: the information-theoretic foundations of cross-entropy, how it works for language modeling, teacher forcing and the parallelism it enables, the exposure bias problem, label smoothing, perplexity as the standard evaluation metric, alternative losses used in post-training, and finally, how the properties of cross-entropy explain many puzzling behaviors of language models.


Cross-Entropy from Information Theory

Entropy: Measuring Uncertainty

Before cross-entropy, we need entropy. Given a discrete random variable X with distribution p, the entropy is:

H(p) = -\sum_{x} p(x) \log p(x)

Entropy measures the average information content of the distribution — how “surprised” you are on average when you observe a sample from p. If p is concentrated on a single outcome, entropy is zero (no surprise). If p is uniform over V outcomes, entropy is \log V (maximum surprise).

For natural language, the true distribution p^*(x_t \mid x_{<t}) has moderate entropy. Given “The capital of France is”, the next token is almost certainly “Paris” — low entropy. Given “I went to the”, many tokens are plausible (“store”, “park”, “doctor”, “beach”) — moderate entropy. The average entropy of English text is estimated at roughly 1—2 bits per character, or 5—10 bits per BPE token.

Σ Definition: Entropy

The entropy of a discrete distribution p over vocabulary \mathcal{V} is:

H(p) = -\sum_{v \in \mathcal{V}} p(v) \log p(v)

It is the minimum expected number of bits (if using \log_2) or nats (if using \ln) needed to encode a sample from p using an optimal code. Entropy is maximized by the uniform distribution and minimized (at zero) by a point mass.
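These extremes are easy to check numerically. A minimal sketch in plain Python (the `entropy` helper is illustrative, not from any library):

```python
import math

def entropy(p, base=math.e):
    """Entropy of a discrete distribution, given as a list of probabilities."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A point mass carries no surprise: entropy is zero.
assert entropy([1.0, 0.0, 0.0, 0.0]) == 0.0
# The uniform distribution over V outcomes maximizes entropy at log V.
assert abs(entropy([0.25] * 4) - math.log(4)) < 1e-12   # nats
assert abs(entropy([0.25] * 4, base=2) - 2.0) < 1e-9    # bits
# A skewed distribution sits strictly in between.
print(f"{entropy([0.7, 0.1, 0.1, 0.1]):.3f} nats")
```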

Cross-Entropy: Measuring Distribution Distance

Now suppose we have the true distribution p^* (which we do not know explicitly) and a model distribution q (our LLM’s output). The cross-entropy between p^* and q is:

H(p^*, q) = -\sum_{x} p^*(x) \log q(x)

Cross-entropy measures the expected number of bits needed to encode samples from p^* using the code optimized for q. If q = p^*, cross-entropy equals entropy (the optimal encoding). If q differs from p^*, cross-entropy is strictly larger — the encoding is suboptimal.

The difference between cross-entropy and entropy is the KL divergence:

D_{KL}(p^* \| q) = H(p^*, q) - H(p^*) = \sum_{x} p^*(x) \log \frac{p^*(x)}{q(x)}

KL divergence is always non-negative and is zero if and only if p^* = q. Since H(p^*) is a constant (it depends only on the true distribution, not on our model), minimizing cross-entropy H(p^*, q) is equivalent to minimizing KL divergence D_{KL}(p^* \| q).
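The decomposition H(p*, q) = H(p*) + D_KL(p* ∥ q) can be verified directly on a toy distribution. A sketch in plain Python (the helper names and the two distributions are made up for illustration):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.7, 0.2, 0.1]    # stand-in for the unknown true distribution
q_model = [0.5, 0.3, 0.2]   # stand-in for the model's distribution

# Cross-entropy decomposes into entropy plus KL divergence...
assert abs(cross_entropy(p_true, q_model)
           - (entropy(p_true) + kl_divergence(p_true, q_model))) < 1e-12
# ...and KL is zero exactly when the distributions match.
assert kl_divergence(p_true, p_true) == 0.0
assert kl_divergence(p_true, q_model) > 0.0
```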

ℹ️ Why Cross-Entropy and Not KL Divergence Directly?

We minimize cross-entropy rather than KL divergence because they differ by a constant (H(p^*)), which does not affect gradients. More practically, estimating cross-entropy requires only a single sample from p^* (the actual next token), while computing KL divergence requires knowing p^* itself, which we never have. Cross-entropy with a one-hot target reduces to a log probability lookup, which is trivially efficient.

The Asymmetry of KL Divergence

KL divergence is asymmetric: D_{KL}(p^* \| q) \neq D_{KL}(q \| p^*). The standard cross-entropy loss minimizes D_{KL}(p^* \| q), which is called the forward KL divergence. This has a specific consequence for model behavior.

Forward KL penalizes the model heavily for assigning low probability to events that actually occur (because \log q(x) appears in the loss, and \log q(x) \to -\infty as q(x) \to 0). However, it does not penalize the model for assigning probability to events that do not occur. The loss is -\log q(x_\text{true}), so false positives (assigning probability to wrong tokens) are not directly penalized — they are only indirectly penalized because probability is a finite resource that sums to 1.

This asymmetry is fundamental. It means cross-entropy-trained models are mode-covering: they try to assign nonzero probability to everything that could plausibly come next, rather than concentrating probability on a single best answer. This is why language models are generative — they produce diverse outputs — and why they sometimes hallucinate — they assign probability to plausible-sounding but incorrect continuations.

Σ Theorem: Forward KL Mode-Covering Property

Minimizing D_{KL}(p^* \| q) over q produces a distribution that is zero-avoiding: if p^*(x) > 0, then q(x) > 0 at the optimum. The model learns to cover all modes of the true distribution, even at the cost of assigning some probability to regions where p^*(x) = 0. This is because the loss -\log q(x) goes to infinity as q(x) \to 0 for any x that appears in the training data.
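The mode-covering vs mode-seeking contrast can be demonstrated numerically. The sketch below (entirely illustrative: a bimodal “true” p on a 1-D grid, fit by a single-Gaussian family q via brute-force grid search) shows that minimizing forward KL spreads q wide enough to cover both modes, while reverse KL locks onto one peak:

```python
import math

# Illustrative demo: bimodal "true" p with peaks at -3 and +3 on a grid.
xs = [i * 0.04 - 6.0 for i in range(301)]

def gauss(mu, sigma):
    w = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    s = sum(w)
    return [v / s for v in w]

p = [0.5 * a + 0.5 * b for a, b in zip(gauss(-3, 0.5), gauss(3, 0.5))]

def kl(a, b):
    # Guard against float underflow in b; a > 0 everywhere on this grid.
    return sum(ai * math.log(ai / max(bi, 1e-300)) for ai, bi in zip(a, b) if ai > 0)

def best_fit(divergence):
    grid = [(m * 0.25, 0.3 + k * 0.1) for m in range(-16, 17) for k in range(38)]
    return min(grid, key=lambda ms: divergence(gauss(*ms)))

fwd_mu, fwd_sigma = best_fit(lambda q: kl(p, q))  # forward KL: what CE minimizes
rev_mu, rev_sigma = best_fit(lambda q: kl(q, p))  # reverse KL, for contrast

# Forward KL spreads q wide to cover both modes; reverse KL picks one peak.
print(f"forward KL fit: mu={fwd_mu:+.2f}, sigma={fwd_sigma:.2f}")
print(f"reverse KL fit: mu={rev_mu:+.2f}, sigma={rev_sigma:.2f}")
```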


Cross-Entropy for Language Modeling

The Language Modeling Loss

For autoregressive language modeling, the training data consists of sequences x_1, x_2, \ldots, x_T. The model predicts each token given all previous tokens:

q(x_t \mid x_{<t}) = \text{softmax}(W_E \, f_\theta(x_{<t}))_{x_t}

where f_\theta is the transformer and W_E is the (tied) embedding matrix.

The cross-entropy loss for a single sequence is:

L = -\frac{1}{T} \sum_{t=1}^{T} \log q(x_t \mid x_{<t})

For a single position, the “true distribution” p^* is a one-hot vector: probability 1 on the actual next token, probability 0 on everything else. Cross-entropy with a one-hot target simplifies enormously:

H(p^*, q) = -\sum_{v \in \mathcal{V}} p^*(v) \log q(v) = -1 \cdot \log q(x_\text{true}) - \sum_{v \neq x_\text{true}} 0 \cdot \log q(v) = -\log q(x_\text{true})

The loss at each position is simply the negative log probability assigned to the correct token. If the model assigns probability 0.95 to the correct token, the loss is -\log(0.95) = 0.051. If it assigns probability 0.01, the loss is -\log(0.01) = 4.61.
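The collapse from a full vocabulary sum to a single lookup is easy to verify. A small sketch in plain Python (toy logits; the `softmax` helper is written out for self-containment):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, -1.0, 0.5, 3.0]   # made-up scores over a 4-token vocabulary
true_idx = 3
q = softmax(logits)

# Full cross-entropy against the one-hot target...
one_hot = [1.0 if v == true_idx else 0.0 for v in range(len(logits))]
full = -sum(t * math.log(qv) for t, qv in zip(one_hot, q) if t > 0)

# ...is exactly the negative log probability of the correct token.
assert abs(full - (-math.log(q[true_idx]))) < 1e-12
print(f"loss = {full:.4f}")
```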

Why Log Probability?

The use of the logarithm has several important properties.

Additivity. The log probability of a sequence factorizes into a sum over positions:

\log q(x_1, \ldots, x_T) = \sum_{t=1}^{T} \log q(x_t \mid x_{<t})

This means the sequence-level loss is the average of position-level losses, which decomposes nicely for gradient computation.

Dynamic range. Raw probabilities for individual tokens are often tiny (a typical next-token probability for a 128K vocabulary might be 0.001 for a moderately confident prediction). The log transform maps these to a more manageable range: \log(0.001) = -6.9 nats. Without logs, gradient magnitudes would span many orders of magnitude, making optimization numerically unstable.

Information-theoretic meaning. -\log_2 q(x) is the number of bits needed to encode token x under the model’s distribution. Minimizing average -\log q(x) is equivalent to finding the most efficient coding of the training data, which is equivalent to finding the closest approximation to the true distribution.

Why Sum Over Positions?

The loss sums over all positions in the sequence. This means every token prediction contributes equally to the gradient signal (before averaging). The model learns to predict function words (“the”, “of”, “is”) with the same intensity as content words (“Paris”, “quantum”, “recursion”).

In practice, function words are far more predictable (high probability), so they contribute small losses and small gradients. Content words are less predictable, so they contribute larger losses and larger gradients. The loss naturally upweights the “hard” predictions without any explicit curriculum.

Typical Per-Token Cross-Entropy Loss by Token Type

| Token type | Predictability | Typical loss (nats) |
|---|---|---|
| Punctuation (., ,) | Very predictable | 0.3 |
| Function words (the, is) | High frequency | 0.8 |
| Common nouns | Context-dependent | 2.1 |
| Named entities | Requires world knowledge | 3.5 |
| Code identifiers | Highly variable | 4.8 |
| Random/unpredictable | Near-uniform over vocab | 7.2 |

Teacher Forcing: The Key to Parallel Training

How Teacher Forcing Works

During training, the model must predict x_t given x_{<t} for every position t. There are two ways to provide x_{<t}:

  1. Free running (autoregressive): Feed the model’s own predictions as input. At position t, use the model’s predicted token \hat{x}_{t-1} as input (not the ground truth x_{t-1}).

  2. Teacher forcing: Always feed the ground-truth token x_{t-1} as input, regardless of what the model predicted.

Language model training universally uses teacher forcing. The name comes from the analogy of a teacher who always provides the correct answer for the student to build upon, rather than letting the student’s errors compound.

💡 Teacher Forcing Is Not Optional

Teacher forcing is not merely a training trick — it is what makes transformer training computationally feasible. Without it, you cannot parallelize the forward pass across sequence positions, and training a model on trillions of tokens would be impossibly slow.

Why Teacher Forcing Enables Parallelism

This is the critical insight. With teacher forcing, the input to the model at position t is the ground-truth token x_{t-1}, which is known before training begins. This means:

  • The input to position 1 is x_0 (the start token) — known.
  • The input to position 2 is x_1 — known.
  • The input to position T is x_{T-1} — known.

All inputs are known in advance. The entire sequence can be processed in a single forward pass. The causal mask in the attention mechanism ensures that position t only attends to positions \leq t, so the model cannot “cheat” by looking ahead. But all positions are computed simultaneously through efficient matrix operations.

Without teacher forcing, the input to position t depends on the model’s output at position t-1, which depends on the output at position t-2, and so on. The forward pass becomes inherently sequential — you must generate one token at a time, exactly like inference. For a sequence of length 2048, this means 2048 sequential forward passes instead of one.
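The forward-pass-count difference can be made concrete with a toy stand-in model that simply counts how often it is called. Everything below is illustrative (the `ToyModel` logits are arbitrary, not a real transformer):

```python
import math

VOCAB = 5

class ToyModel:
    """Stand-in 'model' that returns arbitrary logits and counts forward passes."""
    def __init__(self):
        self.forward_passes = 0
    def __call__(self, tokens):
        self.forward_passes += 1
        # One logit vector per input position (a real model would be causal).
        return [[0.1 * ((p + v) % VOCAB) for v in range(VOCAB)]
                for p in range(len(tokens))]

def log_softmax(logits, idx):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[idx] - lse

def free_running_loss(model, start, targets):
    """Free running: one forward pass per generated token -- inherently sequential."""
    tokens, total = [start], 0.0
    for tgt in targets:
        logits = model(tokens)[-1]          # logits at the last position
        total -= log_softmax(logits, tgt)
        tokens.append(max(range(VOCAB), key=lambda v: logits[v]))  # feed own prediction
    return total / len(targets)

def teacher_forced_loss(model, inputs, targets):
    """Teacher forcing: every position scored in a single forward pass."""
    logits = model(inputs)                  # [T, V] in one call
    return -sum(log_softmax(logits[t], targets[t])
                for t in range(len(targets))) / len(targets)

targets = [1, 3, 0, 2, 4, 1, 2, 0]
m1, m2 = ToyModel(), ToyModel()
free_running_loss(m1, 0, targets)
teacher_forced_loss(m2, [0] + targets[:-1], targets)
print(m1.forward_passes, "vs", m2.forward_passes)   # 8 vs 1
```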

The speedup is enormous:

📊

Training Throughput: Teacher Forcing vs Free Running (Hypothetical)

| Method | Forward passes per sequence | GPU utilization | Tokens/sec (A100) | Time for 1T tokens |
|---|---|---|---|---|
| Teacher forcing | 1 | ~60% | ~500,000 | ~23 days (1024 GPUs) |
| Free running | 2,048 (seq_len) | ~2% | ~500 | ~63 years (1024 GPUs) |

Note: Free running requires sequential token generation during training, reducing throughput by roughly a factor of seq_len. Numbers are approximate for a 7B model.

Teacher forcing converts the O(T) sequential forward passes of free-running training into O(1) parallel forward passes. For T = 2048, this is a 2048× speedup. Without teacher forcing, modern LLM training would be literally impossible at current data scales.

The Causal Mask Makes It Work

Teacher forcing works because the causal attention mask prevents information leakage. At position t, the model sees embeddings for tokens x_1, \ldots, x_t (the ground-truth inputs) and must predict x_{t+1}. The causal mask ensures that the representation at position t is computed using only positions 1, \ldots, t — it cannot attend to positions t+1, \ldots, T where future ground-truth tokens are present in the input.

The loss at position t is -\log q(x_{t+1} \mid x_1, \ldots, x_t), computed from the model’s output at position t. All T losses are computed in a single forward pass, and the total loss is their average.

import torch.nn.functional as F

def compute_lm_loss(model, input_ids, targets):
    """
    Teacher-forced language model training.

    input_ids: [B, T] -- ground truth tokens (shifted right)
    targets: [B, T] -- ground truth tokens to predict

    The causal mask inside the model prevents information leakage.
    All positions are computed in a single parallel forward pass.
    """
    # Single forward pass: all positions computed simultaneously
    # The causal attention mask ensures position t only sees positions <= t
    logits = model(input_ids)  # [B, T, V]

    # Compute cross-entropy at every position in parallel
    # logits[:, t, :] predicts targets[:, t] using only input_ids[:, :t+1]
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # [B*T, V]
        targets.view(-1),                   # [B*T]
        reduction='mean'
    )
    return loss
Σ Theorem: Teacher Forcing Equivalence

Under teacher forcing with a causal attention mask, the loss computed in a single parallel forward pass is mathematically identical to the loss that would be computed by running the model autoregressively on the ground-truth sequence and computing the loss at each step. The causal mask guarantees that f_\theta(x_1, \ldots, x_T)_t = f_\theta(x_1, \ldots, x_t), so the parallel and sequential computations produce identical outputs at every position.


Exposure Bias: The Train-Test Mismatch

The Problem

Teacher forcing creates a mismatch between training and inference.

During training: The model always receives the correct previous token as input. If the ground truth is “The cat sat on the mat”, the input at each position is the correct token. The model never encounters its own errors.

During inference: The model feeds its own predicted tokens as input. If it generates “The cat sat on the table” (instead of “mat”), all subsequent predictions are conditioned on “table” — a context the model never saw during training. The model has no training experience in recovering from its own mistakes.

This is exposure bias: the model is exposed only to correct sequences during training but must handle its own (potentially incorrect) sequences during inference. The distribution of input sequences at inference time differs from the distribution at training time.

How Exposure Bias Manifests

Error compounding. When the model generates an incorrect token, the error can cascade. Each subsequent prediction is conditioned on a slightly wrong context, which increases the probability of further errors. This is particularly visible in long-form generation, where early errors can cause the text to drift off topic.

Degenerate repetition. The model sometimes enters loops, repeating the same phrase indefinitely. This happens because the model was never trained on sequences containing its own repetitions, so it does not know how to escape them. The repeated phrase creates a context that reinforces itself.

Distributional shift. More subtly, even when individual tokens are correct, the overall distribution of generated text may differ from training text in ways that compound. Generated text tends to be slightly more “generic” or “safe” than training text because the model gravitates toward high-probability tokens, shifting the context distribution.

⚠️ Exposure Bias Is Real but Manageable

Exposure bias was a major concern in early sequence-to-sequence models (machine translation, summarization). For modern LLMs, it is partially mitigated by the massive scale of training data (the model sees enough variety to handle minor context perturbations) and by post-training techniques (RLHF, DPO) that explicitly train on the model’s own outputs. However, it is never fully eliminated and remains a contributor to degenerate generation behaviors.

Mitigation Strategies

Scheduled sampling (Bengio et al., 2015) gradually transitions from teacher forcing to free running during training. At the start, the model always receives ground-truth inputs. As training progresses, it increasingly receives its own predictions. This exposes the model to its own errors. However, scheduled sampling is rarely used for LLMs because it breaks the parallelism advantage of teacher forcing — you must generate tokens sequentially to sample the model’s own predictions.

Sequence-level training (Ranzato et al., 2016) uses reinforcement learning to optimize sequence-level metrics (BLEU, ROUGE) rather than token-level cross-entropy. This trains the model on its own generated sequences. Modern RLHF is a descendant of this approach.

Nucleus sampling and temperature at inference time partially address degenerate behavior by preventing the model from always choosing the highest-probability token, which would amplify exposure bias effects.


Label Smoothing: Softening the Target

The Hard Target Problem

Standard cross-entropy uses a one-hot target: probability 1 on the correct token, probability 0 on everything else. This pushes the model to assign all probability mass to a single token, driving logits toward extreme values. The loss is minimized when q(x_\text{true}) = 1, which requires the logit for the correct token to approach +\infty relative to all other logits.

This has two problems:

  1. Overconfidence. The model becomes poorly calibrated, assigning near-100% probability to its top prediction even when it should be uncertain. This is particularly harmful for downstream applications that rely on the model’s confidence scores.

  2. Gradient saturation. As logits become extreme, softmax gradients vanish. The model stops learning because the gradient signal becomes negligibly small. Training effectively stalls for “easy” tokens that the model already predicts correctly.

How Label Smoothing Works

Label smoothing (Szegedy et al., 2016) replaces the one-hot target with a mixture:

p_\text{smooth}(v) = (1 - \epsilon) \cdot \mathbb{1}[v = x_\text{true}] + \frac{\epsilon}{V}

where \epsilon is the smoothing parameter (typically 0.1). The target probability for the correct token becomes 1 - \epsilon + \epsilon/V \approx 0.9, and every other token gets probability \epsilon/V \approx 7.8 \times 10^{-7} (for V = 128{,}000, \epsilon = 0.1).

The cross-entropy loss with label smoothing becomes:

L_\text{smooth} = -(1 - \epsilon) \log q(x_\text{true}) - \frac{\epsilon}{V} \sum_{v=1}^{V} \log q(v)

The second term is the cross-entropy between the uniform distribution and the model’s output. It penalizes the model for being too confident, acting as a regularizer that keeps the output distribution from collapsing to a point mass.
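The smoothed loss is a few lines to compute directly. A sketch in plain Python (the logits, ε values, and helper names are made up for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_ce(logits, true_idx, eps):
    """Cross-entropy against a label-smoothed target (illustrative helper)."""
    q = softmax(logits)
    V = len(logits)
    target = [(1 - eps) * (1.0 if v == true_idx else 0.0) + eps / V
              for v in range(V)]
    return -sum(t * math.log(qv) for t, qv in zip(target, q))

logits = [4.0, 1.0, 0.0, -1.0]           # the model is confident in token 0
plain = smoothed_ce(logits, 0, eps=0.0)   # ordinary one-hot cross-entropy
smooth = smoothed_ce(logits, 0, eps=0.1)

# Smoothing puts a floor under the loss: a confident correct prediction
# can no longer drive the loss toward zero.
assert smooth > plain
print(f"eps=0: {plain:.4f}   eps=0.1: {smooth:.4f}")
```

(PyTorch exposes the same computation via the `label_smoothing` argument of `F.cross_entropy`.)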

ℹ️ Label Smoothing as KL Regularization

Label smoothing can be rewritten (up to an additive constant that does not affect gradients) as: L_\text{smooth} = (1 - \epsilon) \cdot L_\text{CE} + \epsilon \cdot D_{KL}(u \| q), where u is the uniform distribution and L_\text{CE} is the standard cross-entropy loss. The label smoothing term penalizes the model’s distribution for deviating from uniform — it literally adds a force that spreads probability mass more evenly.

Empirical Impact

📊

Label Smoothing Effects on Model Quality

| Configuration | Validation PPL | Calibration ECE | Downstream acc | Notes |
|---|---|---|---|---|
| No smoothing (ε = 0) | 8.21 | 0.142 | 73.2% | Overconfident |
| ε = 0.05 | 8.15 | 0.068 | 73.8% | Slight improvement |
| ε = 0.1 (standard) | 8.18 | 0.041 | 74.1% | Best calibration |
| ε = 0.2 | 8.35 | 0.032 | 73.5% | Underfitting |
| ε = 0.3 | 8.72 | 0.028 | 72.1% | Too much smoothing |

Note: ECE = Expected Calibration Error (lower is better). Results from a 1.3B-parameter model on a held-out validation set.

The sweet spot is typically \epsilon = 0.1. Perplexity may increase slightly (the model cannot achieve zero loss with smoothed targets), but calibration improves dramatically and downstream task performance often improves. The T5 paper uses \epsilon = 0.1 by default.

For modern LLM pre-training, label smoothing is sometimes omitted because the sheer scale of data provides sufficient regularization. But for fine-tuning on smaller datasets, it remains a valuable tool.


Perplexity: The Standard Evaluation Metric

Definition

Perplexity is the exponential of the average cross-entropy loss:

\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log q(x_t \mid x_{<t})\right) = \exp(L)

where L is the average per-token cross-entropy loss (in nats, since we use the natural logarithm).

Σ Definition: Perplexity

For a language model with per-token cross-entropy loss L (in nats), the perplexity is \text{PPL} = e^L. If using bits (\log_2), perplexity is \text{PPL} = 2^{L_\text{bits}}. Perplexity can be interpreted as the effective number of equally likely tokens the model is choosing between at each position. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely options.

Interpreting Perplexity

PPL = 1: The model predicts every token with 100% confidence and is always correct. This is impossible for real text (natural language has inherent randomness).

PPL = V (vocabulary size): The model assigns uniform probability to all tokens. For V = 128{,}000, this is the perplexity of random guessing over the full vocabulary, and an untrained model should have perplexity close to V.

PPL = 5—15: Typical range for state-of-the-art LLMs on standard benchmarks (WikiText-103, The Pile).

PPL = 20—50: Decent model, but clearly not state-of-the-art.

PPL = 100+: Poor model, or evaluated on a very difficult domain (e.g., code in a rare programming language).
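Converting between loss and perplexity is a one-liner. A sketch (the helper name is illustrative; the numbers are examples, not measurements):

```python
import math

def perplexity(avg_loss_nats):
    """Perplexity from an average per-token cross-entropy loss in nats."""
    return math.exp(avg_loss_nats)

# A loss of 2.30 nats corresponds to roughly 10 effective choices per token.
print(round(perplexity(2.30), 1))
# The untrained baseline: uniform over V tokens gives loss ln V, i.e. PPL = V.
V = 128_000
assert round(perplexity(math.log(V))) == V
# A perfect (impossible) model: zero loss, perplexity 1.
assert perplexity(0.0) == 1.0
```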

Perplexity of Major Language Models on Comparable Benchmarks

| Model | Evaluation set | PPL (lower is better) |
|---|---|---|
| Random (V=32K) | Untrained baseline | 32,000 |
| GPT-2 124M | WikiText-103 | 29.4 |
| GPT-2 1.5B | WikiText-103 | 17.5 |
| Llama 2 7B | Held-out web text | 5.5 |
| Llama 2 70B | Held-out web text | 3.3 |
| Llama 3 70B | Held-out web text | 2.8 |

Note that perplexity comparisons are only meaningful on the same dataset with the same tokenizer. A model with a 128K vocabulary will naturally have lower perplexity per character than a model with a 32K vocabulary (because each token covers more text), but higher perplexity per token (because there are more tokens to choose from). Always specify the unit: perplexity per token, per word, or per character.

Perplexity and Bits Per Character

To compare models with different tokenizers, convert to bits per character (BPC):

\text{BPC} = \frac{L \times T}{C \times \ln 2}

where L is the average per-token cross-entropy loss (in nats), T is the number of tokens, and C is the number of characters. Dividing by \ln 2 converts from nats to bits. BPC is tokenizer-independent and allows fair comparison across models with different vocabularies.

State-of-the-art LLMs achieve approximately 0.7—0.9 BPC on English text. The estimated entropy of English is roughly 0.6—1.3 BPC (Shannon, 1951), suggesting that modern models are approaching the fundamental limit — though the exact entropy of natural language remains debated.
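The conversion is straightforward to implement. A sketch with hypothetical numbers (1.0 nat/token average loss, a tokenizer averaging ~4 characters per token):

```python
import math

def bits_per_character(avg_loss_nats, n_tokens, n_chars):
    """Tokenizer-independent BPC from an average per-token loss in nats."""
    total_nats = avg_loss_nats * n_tokens
    return total_nats / (n_chars * math.log(2))

# Hypothetical run: 1.0 nat/token, 1M tokens covering 4M characters.
bpc = bits_per_character(avg_loss_nats=1.0, n_tokens=1_000_000, n_chars=4_000_000)
print(f"{bpc:.3f} BPC")   # 1 / (4 ln 2) ≈ 0.361
```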

The Scaling Laws Lens

Kaplan et al. (2020) and Hoffmann et al. (2022) showed that cross-entropy loss follows predictable power laws as a function of model size N, dataset size D, and compute budget C:

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}

where \alpha_N \approx 0.076 and N_c is a constant. Loss decreases as a power law in model size with no sign of saturation. Exponentiating, perplexity improves just as predictably:

\text{PPL}(N) = \exp\left(\left(\frac{N_c}{N}\right)^{\alpha_N}\right)

The smoothness and predictability of these scaling laws is what justifies the enormous investments in training larger models. If loss decreased erratically or plateaued, there would be no basis for extrapolating that a 10x larger model would be meaningfully better.
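The fitted curve is easy to evaluate. A sketch using the constants reported by Kaplan et al. (\alpha_N ≈ 0.076, N_c ≈ 8.8 × 10^13); treat the outputs as ballpark numbers for their setup, not predictions for any particular model:

```python
import math

def loss_from_size(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N)^alpha_N, with constants from the Kaplan et al. fits."""
    return (n_c / n_params) ** alpha_n

for n in (1e9, 1e10, 1e11):
    loss = loss_from_size(n)
    print(f"N = {n:.0e}: loss ~ {loss:.3f} nats, PPL ~ {math.exp(loss):.1f}")
```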

📊

Scaling Laws: Loss vs Model Size (Chinchilla-Optimal Training)

| Parameters | Training tokens | Validation loss (nats) | Perplexity | Compute (PF-days) |
|---|---|---|---|---|
| 400M | 8.0B | 2.94 | 18.9 | 24 |
| 1B | 20.0B | 2.68 | 14.6 | 150 |
| 7B | 140B | 2.28 | 9.8 | 6,300 |
| 13B | 260B | 2.18 | 8.8 | 21,000 |
| 70B | 1.4T | 1.99 | 7.3 | 540,000 |
| 405B | 15T | 1.75 | 5.8 | 30,000,000 |

Note: Approximate values based on published scaling law fits. Compute-optimal ratios from Hoffmann et al. (2022).

Alternative Loss Functions in Post-Training

Why Cross-Entropy Is Not Enough

Cross-entropy pre-training produces a model that predicts the next token in the training distribution. But we want a model that is helpful, harmless, and honest — properties that are not directly captured by next-token prediction. A model trained only with cross-entropy will:

  • Generate text that looks like the internet average, not helpful assistant responses.
  • Happily produce harmful content if it appeared in the training data.
  • Not distinguish between confident facts and uncertain speculation.

Post-training techniques address this gap by introducing new loss functions that shape the model’s behavior beyond what cross-entropy provides.

RLHF: Reinforcement Learning from Human Feedback

RLHF (Ouyang et al., 2022) introduces a reward model R(x,y)R(x, y) that scores the quality of a response yy given a prompt xx. The language model is then fine-tuned to maximize expected reward:

L_\text{RLHF} = -\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} [R(x, y)] + \beta \cdot D_{KL}(\pi_\theta \| \pi_\text{ref})

where \pi_\theta is the current policy (the LLM being trained), \pi_\text{ref} is the reference policy (the pre-trained model), and \beta controls the KL penalty that prevents the model from drifting too far from the pre-trained distribution.

The reward model is trained on human preference data: given two responses to the same prompt, a human labels which is better. The reward model learns to assign higher scores to preferred responses.

The key difference from cross-entropy: RLHF optimizes a sequence-level reward rather than token-level log-probabilities. This allows it to capture holistic properties like helpfulness and coherence that emerge at the sequence level but are invisible to per-token loss.

ℹ️ The KL Penalty Is Critical

Without the KL penalty, RLHF would quickly degenerate. The model would learn to produce degenerate responses that exploit the reward model’s weaknesses (reward hacking). The KL term keeps the model close to the pre-trained distribution, preserving its language modeling capabilities while shifting its behavior toward higher-reward outputs.

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023) eliminates the separate reward model by deriving a closed-form loss directly from preference data:

L_\text{DPO} = -\log \sigma\left(\beta \left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right]\right)

where y_w is the preferred response, y_l is the dispreferred response, and \sigma is the sigmoid function. DPO directly increases the relative probability of preferred responses while decreasing the probability of dispreferred ones.

The insight behind DPO is that the optimal policy under the RLHF objective has a closed-form solution:

\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_\text{ref}(y \mid x) \exp\left(\frac{R(x, y)}{\beta}\right)

By rearranging, the reward can be expressed in terms of the optimal policy:

R(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} + \beta \log Z(x)

Substituting this into the preference model (Bradley-Terry) and canceling the partition function Z(x) yields the DPO loss, which is purely a function of policy log-probabilities — no reward model needed.
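The resulting loss needs only four sequence log-probabilities per preference pair. A minimal sketch (the log-probability values and β are made up for illustration):

```python
import math

def dpo_loss(logp_w_theta, logp_l_theta, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss from sequence log-probs of the chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

# Policy already prefers the chosen response relative to the reference: small loss.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected response: large loss.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
assert low < math.log(2) < high   # log 2 is the loss at zero margin
print(f"{low:.3f} {high:.3f}")
```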

📊

Post-Training Loss Comparison

| Method | Requires reward model? | Training stability | Memory overhead | Quality (MT-Bench) |
|---|---|---|---|---|
| SFT only (CE loss) | No | High | Baseline | 6.2 |
| RLHF (PPO) | Yes (separate model) | Low (reward hacking) | 2× (reward + value) | 7.8 |
| DPO | No | High | 1.5× (ref model) | 7.6 |
| KTO | No | High | 1.5× (ref model) | 7.5 |
| ORPO | No | High | Baseline | 7.3 |

Note: MT-Bench scores are approximate and vary by base model. ORPO combines SFT and preference optimization in a single step.

Contrastive Loss

Contrastive losses train the model to distinguish between correct and incorrect completions. The simplest form is:

L_\text{contrastive} = -\log \frac{\exp(s(x, y^+) / \tau)}{\exp(s(x, y^+) / \tau) + \sum_{k} \exp(s(x, y_k^-) / \tau)}

where s(x, y) is a similarity score (often the sequence-level log probability), y^+ is a positive example, y_k^- are negative examples, and \tau is a temperature.

Contrastive losses are used in embedding models (where the goal is to learn representations, not generate text) and in some RLHF variants. They differ from cross-entropy in that they explicitly model what the output should not be, rather than only what it should be.
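This form can be sketched in a few lines. The scores below are made-up stand-ins for s(x, y); the helper name is illustrative:

```python
import math

def info_nce(pos_score, neg_scores, tau=0.1):
    """InfoNCE-style contrastive loss: softmax over one positive and K negatives."""
    logits = [pos_score / tau] + [s / tau for s in neg_scores]
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return -(logits[0] - lse)   # -log p(positive)

# The positive clearly outscores the negatives: small loss.
print(round(info_nce(0.9, [0.2, 0.1, -0.3]), 4))
# A negative ties the positive: the loss grows.
print(round(info_nce(0.9, [0.9, 0.1, -0.3]), 4))
```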


What the Loss Function Shapes

Why Models Hallucinate

Hallucination — generating plausible-sounding but factually incorrect text — is a direct consequence of the cross-entropy objective.

Cross-entropy trains the model to assign high probability to tokens that appear in the training data in similar contexts. If “The capital of Australia is Sydney” appears in the training data (perhaps in a question about what people commonly misbelieve), the model learns to assign nonzero probability to “Sydney” after “The capital of Australia is”. More importantly, the mode-covering property of forward KL means the model tries to assign some probability to every plausible continuation it has seen, rather than concentrating all mass on the single correct answer.

The model does not distinguish between factual and counterfactual continuations in the training data. It sees both “The capital of Australia is Canberra” (factual) and “Many people believe the capital of Australia is Sydney” (factual text containing a counterfactual claim). Both contribute to the model’s distribution. Cross-entropy does not penalize the model for assigning probability to “Sydney” — it only penalizes the model for not assigning enough probability to the correct continuation in each specific context.

🚨 Hallucination Is Not a Bug in the Loss

From the perspective of cross-entropy, hallucination is correct behavior. The model is accurately representing the distribution of text it was trained on, which includes both correct and incorrect statements. Reducing hallucination requires either (a) curating the training data to remove incorrect information (impractical at scale), or (b) using post-training objectives that explicitly penalize factual errors (RLHF, DPO with factuality rewards).

Why Models Are Surprisingly Well-Calibrated

Despite the hallucination problem, cross-entropy-trained models are often well-calibrated: when the model assigns 70% probability to a token, that token is correct roughly 70% of the time. This is a direct consequence of the loss function.

Cross-entropy is a proper scoring rule: a prediction mechanism where the expected loss is minimized by reporting the true probability distribution. If the true probability of “Paris” in a given context is 0.7, the model minimizes its expected loss by assigning exactly q(\text{Paris}) = 0.7, not by rounding up to 1.0 or down to 0.5.

Σ Theorem: Cross-Entropy as a Proper Scoring Rule

A scoring rule S(q, x) is strictly proper if the expected score \mathbb{E}_{x \sim p}[S(q, x)] is uniquely minimized when q = p. Cross-entropy S(q, x) = -\log q(x) is strictly proper: \mathbb{E}_{x \sim p}[-\log q(x)] is minimized if and only if q = p. This means a model with infinite capacity trained with cross-entropy on infinite data will recover the true conditional distribution p^*(x_t \mid x_{<t}) exactly.

This property means that the model is incentivized to be honest about its uncertainty. If two tokens are equally likely, the model should assign them equal probability — and cross-entropy ensures this is the optimal strategy. This is why LLM probabilities are useful for applications like uncertainty quantification and confidence estimation.
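A quick numerical check of the proper-scoring-rule property, using the 0.7 "Paris" example from above (a demonstration, not a proof): reporting the true distribution minimizes expected loss, while both over- and under-confidence raise it.

```python
import torch

def expected_nll(p, q):
    """E_{x~p}[-log q(x)]: expected cross-entropy when truth is p, report is q."""
    return -(p * q.log()).sum()

p = torch.tensor([0.7, 0.3])  # true probability of "Paris" is 0.7

honest = expected_nll(p, p)                            # report the truth
overconfident = expected_nll(p, torch.tensor([0.99, 0.01]))  # round up toward 1.0
underconfident = expected_nll(p, torch.tensor([0.5, 0.5]))   # round down to 50/50
```

Here honest ≈ 0.611 nats, underconfident ≈ 0.693, and overconfident ≈ 1.389: any deviation from the true probabilities costs expected loss.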

Why Models Struggle with Negation

Cross-entropy treats each token independently. The loss at position t depends only on the model’s prediction at position t, not on the semantic coherence of the full sequence. This creates a systematic weakness with negation.

Consider: “A cat is not a dog.” The model processes “A cat is” and must predict “not”. If the training data contains both “A cat is a pet” and “A cat is not a dog”, the model must assign probability to both “a” and “not” in this context. The crucial token is “not” — a single token that reverses the meaning of the entire sentence.

The problem is that cross-entropy assigns the same magnitude of loss to getting “not” wrong as to getting any other token wrong. But the semantic impact of missing “not” is far greater than missing, say, “the” before “dog”. Cross-entropy is blind to semantic importance — it treats all tokens as equally important prediction targets.

This is not merely a theoretical concern. Empirical studies consistently show that LLMs perform worse on negated statements than on affirmative ones. “Which of the following is NOT true?” reliably degrades model accuracy compared to “Which of the following IS true?” — even when the underlying factual knowledge is identical.

Why Repetition Occurs

Repetition in generated text has a cross-entropy explanation. During training with teacher forcing, the model never encounters its own repeated text as input. At inference, once a phrase is generated, it appears in the context. If the model assigns even slightly elevated probability to repeating a phrase it has just seen (a reasonable learned bias, since natural language does contain repetition), the repeated phrase enters the context and reinforces itself.

Cross-entropy cannot prevent this because it never trains on degenerate inputs. The loss function evaluates predictions on correct sequences, so it never learns that “I think the answer is clear. I think the answer is clear. I think the answer is clear.” is degenerate. Post-hoc fixes like repetition penalty, frequency penalty, and presence penalty at inference time are engineering solutions to a fundamental mismatch between the training objective and the generation process.
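As a sketch of one such fix, here is the divide-positive/multiply-negative form of repetition penalty (popularized by CTRL and common in inference libraries; the exact convention varies by implementation):

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Penalize the logits of tokens already present in the context.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so seen tokens always become less likely.
    """
    logits = logits.clone()
    for tok in set(generated_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```

The asymmetric divide/multiply rule exists because plain division would make already-negative logits *larger* (less negative), accidentally boosting seen tokens.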

Impact of Repetition Penalty on Generated Text Quality (Human Evaluation)

| Setting | Effect | Score (1-5) |
|---|---|---|
| No penalty | Frequent loops | 2.8 |
| rep_penalty=1.1 | Mild reduction | 3.9 |
| rep_penalty=1.2 | Good balance | 4.3 |
| rep_penalty=1.5 | Too aggressive | 3.7 |
| rep_penalty=2.0 | Incoherent | 2.9 |

Why Models Are Better at Common Knowledge

Cross-entropy loss is proportional to the negative log probability of the correct token. Tokens that appear frequently in many contexts (common knowledge) are predicted correctly more often, contributing small losses. Tokens that represent rare facts contribute larger losses when wrong, but these rare-fact contexts appear infrequently in the training data.

The gradient from a single training example is proportional to the loss magnitude. Common patterns contribute many small, consistent gradients that accumulate over millions of examples. Rare patterns contribute fewer, larger gradients that may conflict with each other (the same rare entity might appear in contradictory contexts).

This means the model learns common knowledge reliably (many consistent gradient updates) and rare knowledge unreliably (few, potentially contradictory updates). The loss function does not explicitly prefer common over rare knowledge — it treats all tokens equally — but the frequency of training examples creates an implicit curriculum that favors frequently occurring patterns.
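This can be seen directly in the gradient of cross-entropy with respect to the logits, which works out to softmax(z) minus the one-hot target: a confidently correct prediction yields a tiny gradient, a confidently wrong one a large gradient (the logit values below are illustrative):

```python
import torch
import torch.nn.functional as F

def grad_norm(logits, target):
    """||dL/dz|| for cross-entropy: softmax(z) - one_hot(target)."""
    probs = F.softmax(logits, dim=-1)
    probs[target] -= 1.0  # subtract the one-hot target in place
    return probs.norm()

# Same confident logits; only the target differs
common = grad_norm(torch.tensor([4.0, 0.0, 0.0]), 0)  # confident and correct
rare = grad_norm(torch.tensor([4.0, 0.0, 0.0]), 1)    # confident and wrong
```

Here common ≈ 0.04 and rare ≈ 1.38: rare facts that contradict the model's confident prediction produce gradients over 30x larger, but arrive far less often.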


Numerical Stability

The Log-Sum-Exp Trick

Computing cross-entropy naively involves computing \log(\sum_j \exp(z_j)), where z_j are logits. If any logit is large (say, z_j = 1000), \exp(z_j) overflows to infinity. If all logits are very negative (say, z_j = -1000), \exp(z_j) underflows to zero and the log of zero is undefined.

The log-sum-exp trick subtracts the maximum logit before exponentiating:

\log \sum_j \exp(z_j) = z_\text{max} + \log \sum_j \exp(z_j - z_\text{max})

After subtracting z_\text{max}, all exponents are \leq 0, so \exp(z_j - z_\text{max}) \in (0, 1]. The sum is at least 1 (because the maximum element contributes \exp(0) = 1), so the log is non-negative. No overflow, no underflow.

import torch

def stable_cross_entropy(logits, target):
    """
    Numerically stable cross-entropy computation.
    logits: [V] -- raw output head scores
    target: int -- correct token ID
    """
    # Step 1: Log-sum-exp trick: shift by the max logit before exponentiating
    max_logit = logits.max()
    shifted = logits - max_logit
    log_sum_exp = max_logit + torch.log(torch.exp(shifted).sum())

    # Step 2: Loss is log_sum_exp minus the target logit
    # (this is -log softmax(logits)[target], rearranged)
    loss = log_sum_exp - logits[target]

    return loss

Every deep learning framework implements this automatically in their cross-entropy functions. But understanding it is important because custom CUDA kernels for fused cross-entropy (discussed in the previous post on the output head) must implement this trick correctly.

Mixed Precision Considerations

During mixed-precision training, logits are often computed in BF16 or FP16 but the loss should be computed in FP32. The softmax operation is particularly sensitive to precision because it involves exponentials of large numbers followed by normalization.

A common pattern is to cast logits to FP32 before computing the loss:

# logits are BF16 from the output head
logits_fp32 = logits.float()  # cast to FP32
loss = F.cross_entropy(logits_fp32, targets)  # compute loss in FP32

This costs negligible compute (the cast is essentially free) but prevents the precision loss that would occur if softmax were computed in BF16, where the 8-bit mantissa cannot represent the fine-grained probability differences between similar logits.
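A small demonstration of the mantissa problem (the logit values are chosen to straddle a BF16 rounding boundary; near 10.0 the BF16 spacing is 0.0625): two logits that FP32 distinguishes collapse to the same BF16 value, erasing the probability difference before the loss is ever computed.

```python
import torch
import torch.nn.functional as F

# Two logits that differ by less than one BF16 ulp near 10.0
logits = torch.tensor([10.00, 10.02])
bf16 = logits.to(torch.bfloat16)

# In BF16 the two logits round to the same representable value...
collapsed = bf16[0] == bf16[1]

# ...so a loss computed from the BF16 logits sees a uniform distribution,
# while the FP32 loss still reflects the 0.02 logit gap.
loss_bf16 = F.cross_entropy(bf16.float().unsqueeze(0), torch.tensor([1]))
loss_fp32 = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
```

Here loss_bf16 is exactly log 2 (the logits became equal), while loss_fp32 ≈ 0.683. A 0.01-nat error per token is enormous at the scale of modern training runs, which fight for loss improvements in the third decimal place.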

FP32 Loss Computation Is Non-Negotiable

Computing cross-entropy loss in BF16 can cause training instabilities, especially for large vocabularies where many logits are close in value. The softmax normalization amplifies precision errors, and the log operation further amplifies them. Always compute the loss in FP32, even if everything else runs in lower precision.


The Loss Landscape

What Does the Loss Surface Look Like?

The cross-entropy loss surface for a transformer is astronomically high-dimensional (billions of parameters), making direct visualization impossible. However, research has revealed key properties.

Local minima are rare at scale. For large models, most critical points of the loss function are saddle points, not local minima. The intuition: in d dimensions, a critical point is a local minimum only if all d eigenvalues of the Hessian are positive. For a random function in high dimensions, each eigenvalue is positive with probability roughly 0.5, so the probability that all d are positive is roughly 0.5^d, vanishingly small for d in the billions. This is why simple first-order methods (SGD with momentum, Adam) are sufficient for LLM training; there is no need for sophisticated global optimization.

Loss decreases smoothly with scale. The scaling laws mentioned earlier show that loss decreases as a smooth power law in model size, data size, and compute. There are no “phase transitions” in the loss — a 2x larger model reliably achieves lower loss, and the magnitude of improvement is predictable.

Sharp vs flat minima. Models that converge to “flat” regions of the loss surface (where small perturbations to parameters cause small changes in loss) tend to generalize better than models in “sharp” regions. Techniques like large batch size, learning rate warmup, and weight decay bias the optimizer toward flatter minima.


Summary

The cross-entropy loss function is the single most important equation in language model training. It is the sole signal that shapes every parameter of the model during pre-training. From information theory, we know it measures how well the model approximates the true distribution of text. From teacher forcing, we know it enables the parallelism that makes trillion-token training feasible. From its properties as a proper scoring rule, we understand why models are well-calibrated. From its mode-covering behavior (forward KL), we understand why models hallucinate. And from its token-level decomposition, we understand why models struggle with negation and semantic coherence.

Perplexity — the exponential of cross-entropy — gives us a single number that summarizes model quality, one that follows smooth, predictable power laws as models scale. Alternative losses in post-training (RLHF, DPO) address the limitations of cross-entropy by introducing human preference signals, but they build on the foundation that cross-entropy provides.

Every behavior of a language model — every capability and every failure mode — can ultimately be traced back to this loss function. Understanding cross-entropy is understanding the optimization pressure that created the model. It is the closest thing we have to understanding why these models work the way they do.