Dropout and Regularization in Transformers: Where It Helps, Where It Hurts

Part of Series Transformer Anatomy 22 of 36

1 The Transformer Attention Mechanism: From First Principles to Performance Reality 2 Tokenization and BPE: How LLMs See Text — From Characters to Subwords 3 Embedding Layers: The Geometry of Meaning in LLMs 4 Position Encoding in Transformers: From Sinusoidal to RoPE, ALiBi, and Long-Context Scaling 5 Softmax Numerics: Log-Sum-Exp, Temperature, and Why Numerical Stability Matters 6 Attention Variants Compared: MHA, MQA, GQA, and MLA 7 Normalization in Transformers: LayerNorm, RMSNorm, and the Training Stability Story 8 Residual Connections and Skip Paths: Why Transformers Can Be 100 Layers Deep 9 The Feed-Forward Network: SwiGLU, Gating, and the FFN-as-Memory Hypothesis 10 Mixture of Experts: Why Conditional Computation Is the Path to Trillion-Parameter Models 11 The Output Head: Unembedding, Weight Tying, and Vocabulary Projection 12 Cross-Entropy Loss: How the Loss Function Shapes What an LLM Learns 13 Encoder vs Decoder: Why Decoder-Only Won 14 DeepSeek V3: How 671B Parameters Trained for the Cost of a 70B Dense Model 15 Building a Transformer From Scratch: Putting Every Component Together 16 Gradient Flow and Backpropagation Through Transformers: What Happens During the Backward Pass 17 Weight Initialization: Xavier, Kaiming, and Why mu-P Changes Everything for Large Models 18 Training Loop Anatomy: Forward Pass, Loss Computation, Backward Pass, Optimizer Step 19 Learning Rate Schedules: Warmup, Cosine Decay, and Why WSD Changes Everything 20 Distributed Data Parallel: Gradient Synchronization, Bucket All-Reduce, and Overlap with Backward 21 Activation Functions Deep Dive: ReLU, GELU, SiLU, and Why Each Matters for Transformers 22 Dropout and Regularization in Transformers: Where It Helps, Where It Hurts 23 Attention Masking: Causal, Bidirectional, Sliding Window, Block Sparse, and Custom Patterns 24 Mixed Precision Training: BF16 Forward, FP32 Master Weights, and the Precision Hierarchy 25 Token Prediction Heads: Next-Token, Multi-Token, and Classifier Heads 26 Mixture of Depths: Conditional Computation Per Layer for Faster Inference 27 Sparse Attention Patterns: Local, Strided, Hash-Based, and Learnable Sparsity 28 Rotary Position Embedding: The Complete Mathematical Derivation 29 Knowledge Distillation: Training Small Models to Match Large Ones 30 Model Merging: Weight Averaging, TIES, DARE, and Evolutionary Search 31 Pruning at Scale: SparseGPT, Wanda, and Structured Removal of Redundant Parameters 32 The Transformer in 2026: What Changed, What Stayed, and What's Next 33 Data Loading: Tokenization, Sequence Packing, Padding Strategies, and Attention Masks 34 The FlashAttention Backward Pass: Recomputation, Memory Savings, and the 33% Compute Overhead 35 The Inference Engine: Token Generation Loop, KV Cache Management, and Autoregressive Decoding 36 Tensor Parallelism Implementation: Splitting Weights Across GPUs for Training and Inference

Regularization is the set of techniques that prevent a model from memorizing its training data instead of learning generalizable patterns. In classical machine learning, regularization is often the difference between a model that works on test data and one that does not. In modern LLM training, the story is more nuanced: some regularization techniques are critical (weight decay, gradient clipping), some are actively harmful at scale (high dropout), and the reasoning behind each choice requires understanding the specific failure modes of transformer training.

This post covers every regularization technique used in transformer training, derives the mathematics behind each, explains where each is applied in the architecture, and provides complete implementation code. Every claim about what helps and what hurts is backed by the training configurations of real models: GPT-3, Llama 2, Llama 3, Chinchilla, and PaLM.

Dropout: The Mechanism

1.1 What Dropout Does

Dropout was introduced by Srivastava et al. (2014) as a way to prevent co-adaptation of neurons. The mechanism is simple:

During training, each neuron’s output is independently set to zero with probability $p$ (the dropout rate). The remaining outputs are scaled by $\frac{1}{1-p}$ to maintain the expected value.

For a hidden vector $h \in \mathbb{R}^d$ , dropout produces:

$\text{Dropout}(h_i) = \begin{cases} 0 & \text{with probability } p \\ \frac{h_i}{1-p} & \text{with probability } 1-p \end{cases}$

Equivalently, define a binary mask $m \in \{0, 1\}^d$ where each $m_i \sim \text{Bernoulli}(1-p)$ :

$\text{Dropout}(h) = \frac{m \odot h}{1-p}$

The $\frac{1}{1-p}$ scaling factor (called “inverted dropout”) ensures that $\mathbb{E}[\text{Dropout}(h)] = h$ . This means the expected output during training matches the output during inference, when dropout is disabled.

1.2 Why Inverted Dropout Works

The expected value of each element after dropout:

$\mathbb{E}[\text{Dropout}(h_i)] = (1-p) \cdot \frac{h_i}{1-p} + p \cdot 0 = h_i$

The variance of each element after dropout:

$\text{Var}[\text{Dropout}(h_i)] = \mathbb{E}[\text{Dropout}(h_i)^2] - (\mathbb{E}[\text{Dropout}(h_i)])^2$

$= (1-p) \cdot \frac{h_i^2}{(1-p)^2} + p \cdot 0 - h_i^2 = \frac{h_i^2}{1-p} - h_i^2 = \frac{p}{1-p} h_i^2$

So dropout increases the variance of activations by a factor of $\frac{1}{1-p}$ . For $p = 0.1$ , this is a $\frac{1}{0.9} \approx 1.11$ variance increase. For $p = 0.5$ , it doubles the variance. This variance injection is part of what makes dropout a regularizer: it adds noise to the forward pass, forcing the network to be robust to perturbations.

1.3 The Gradient Through Dropout

During backpropagation, the gradient through dropout is:

$\frac{\partial \text{Dropout}(h_i)}{\partial h_i} = \begin{cases} 0 & \text{if } m_i = 0 \\ \frac{1}{1-p} & \text{if } m_i = 1 \end{cases}$

The same mask $m$ used in the forward pass is reused. Gradients are zeroed for the same neurons that were dropped. This means that on any given training step, only a fraction $(1-p)$ of neurons receive gradient updates. Over many steps, all neurons receive updates, but no single step updates all of them simultaneously.

import torch
import torch.nn as nn

class InvertedDropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # Generate binary mask: 1 with prob (1-p), 0 with prob p
        mask = torch.bernoulli(
            torch.full_like(x, 1.0 - self.p)
        )
        # Scale by 1/(1-p) so expected value is preserved
        return x * mask / (1.0 - self.p)

# Verification
torch.manual_seed(42)
x = torch.randn(1000, 512)
drop = InvertedDropout(p=0.1)

drop.train()
out_train = drop(x)
print(f"Train mean: {out_train.mean():.4f}")  # ~0.0 (same as input)
print(f"Train var:  {out_train.var():.4f}")    # ~1.11 (input var / (1-p))

drop.eval()
out_eval = drop(x)
print(f"Eval mean:  {out_eval.mean():.4f}")    # Same as input
print(f"Eval var:   {out_eval.var():.4f}")      # Same as input (no dropout)

ℹ️ Inverted vs Standard Dropout

Older implementations used “standard dropout” which does not scale by $\frac{1}{1-p}$ during training. Instead, all weights are multiplied by $(1-p)$ at inference time. Inverted dropout is preferred because it requires no change at inference time — the forward pass is identical in eval mode. PyTorch’s nn.Dropout uses inverted dropout.

Where Dropout Is Applied in Transformers

The original “Attention Is All You Need” paper (Vaswani et al., 2017) applied dropout in three locations within each transformer layer. Modern architectures have changed or removed some of these. Here is where dropout can appear:

2.1 Attention Dropout (After Softmax)

Applied to the attention weight matrix after softmax, before multiplying by values:

$\text{Attn}(Q, K, V) = \text{Dropout}\left(\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)\right) V$

This randomly zeros out attention connections between tokens. On a given training step, token $i$ cannot attend to some random subset of other tokens. The effect: the model cannot rely on any single attention pattern. It must distribute information across multiple key positions so that dropping any one connection does not destroy the output.

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttentionWithDropout(nn.Module):
    def __init__(self, d_model, n_heads, attn_dropout=0.1, resid_dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

        self.attn_dropout = nn.Dropout(attn_dropout)
        self.resid_dropout = nn.Dropout(resid_dropout)

    def forward(self, x, mask=None):
        B, S, D = x.shape
        H = self.n_heads

        q = self.W_q(x).view(B, S, H, self.d_k).transpose(1, 2)
        k = self.W_k(x).view(B, S, H, self.d_k).transpose(1, 2)
        v = self.W_v(x).view(B, S, H, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)

        # Location 1: Attention dropout
        attn_weights = self.attn_dropout(attn_weights)

        attn_output = torch.matmul(attn_weights, v)
        attn_output = attn_output.transpose(1, 2).contiguous().view(B, S, D)
        output = self.W_o(attn_output)

        # Location 2: Residual dropout (after output projection)
        output = self.resid_dropout(output)
        return output

2.2 Residual Dropout (After Sublayer Output)

Applied to the output of each sublayer (attention or FFN) before adding the residual connection:

$h = x + \text{Dropout}(\text{Sublayer}(x))$

This is the most impactful dropout location. It randomly drops entire feature dimensions from the sublayer output before they are added to the residual stream. The residual stream itself is never dropped — only the contribution from the current layer.

2.3 FFN Dropout (Inside or After Feed-Forward)

Applied after the activation function inside the FFN, or after the entire FFN output:

class TransformerFFN(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Standard FFN with dropout after activation
        h = F.gelu(self.w1(x))
        h = self.dropout(h)  # Location 3: FFN internal dropout
        return self.w2(h)

2.4 Embedding Dropout

Some models (BERT, original transformer) apply dropout to the sum of token embeddings and positional embeddings:

$h_0 = \text{Dropout}(\text{TokenEmbed}(x) + \text{PosEmbed}(x))$

This is less common in modern decoder-only LLMs. Llama does not use embedding dropout.

📊

Dropout Locations in Major Transformer Models

Model	Attention Dropout	Residual Dropout	FFN Dropout	Embedding Dropout
Vaswani (2017)	0.1	0.1	0.1	0.1
GPT-2 (2019)	0.1	0.1	0.0	0.1
GPT-3 (2020)	0.1	0.1	0.0	0.0
PaLM (2022)	0.0	0.0	0.0	0.0
Llama 2 (2023)	0.0	0.0	0.0	0.0
Llama 3 (2024)	0.0	0.0	0.0	0.0
Chinchilla (2022)	0.0	0.0	0.0	0.0

Note: Modern LLMs at web-scale use zero dropout everywhere

Why Modern LLMs Use Zero Dropout

The trend is unmistakable: every major LLM from 2022 onward uses zero dropout. This is not an oversight. It is a deliberate engineering decision backed by a clear theoretical argument.

3.1 The Overfitting vs Underfitting Regime

Regularization prevents overfitting. Overfitting occurs when the model memorizes training data instead of learning generalizable patterns. The question is: do LLMs overfit?

Consider Llama 3 70B:

Parameters: 70 billion
Training tokens: 15 trillion
Each token is seen approximately once (1 epoch or slightly more)

The model sees each training example roughly once. It is impossible to memorize data you see only once. The model is in the underfitting regime: it does not have enough capacity or training time to fully learn the patterns in the data.

Dropout, by randomly zeroing neurons, reduces the model’s effective capacity on each training step. In the underfitting regime, this makes the problem worse. You are preventing the model from using its full capacity to learn from data it will never see again.

3.2 The Compute Efficiency Argument

Dropout wastes compute. With dropout rate $p$ , on each forward pass, a fraction $p$ of the computation in each dropped layer is wasted (producing zeros that are immediately discarded). For attention dropout with $p = 0.1$ , 10% of the attention computation is thrown away on every training step.

At LLM scale, training cost is measured in millions of GPU-hours. Wasting 10% of attention compute on a $100M training run is$ 10M of wasted compute for a regularizer that is not needed.

3.3 The Token Efficiency Argument

Chinchilla (Hoffmann et al., 2022) established the scaling law: for a given compute budget, there is an optimal ratio of model parameters to training tokens. At the optimal ratio, the model sees each token at most 1-2 times. The scaling law implicitly assumes no dropout — adding dropout changes the effective compute per token and shifts the optimum.

With dropout rate $p$ , the model’s effective capacity per step is reduced by roughly a factor of $(1-p)$ . To compensate, you would need $\frac{1}{1-p}$ more training steps to reach the same quality. For $p = 0.1$ , that is 11% more training steps, which means 11% more compute. The data regularization effect (seeing each token only once) already provides sufficient regularization without this cost.

3.4 Interaction with Other Regularizers

Modern LLMs use weight decay (section 4) and gradient clipping (section 6) as their primary regularizers. These are complementary to the natural regularization provided by:

Single-epoch training: Each token seen once
Data diversity: Web-scale data has enormous variety
Architecture: RMSNorm, residual connections, and proper initialization already stabilize training

Adding dropout on top of these provides marginal benefit at significant compute cost.

⚡ When To Use Dropout

Dropout still helps in these scenarios: (1) Fine-tuning a pretrained model on a small dataset (thousands to millions of examples) where overfitting is real. Use $p = 0.1$ on residual connections. (2) Training small models on limited data. (3) Multi-epoch training where the model sees each example many times. For pretraining LLMs on web-scale data at or near the Chinchilla-optimal token count, dropout should be zero.

3.5 Empirical Evidence

The GPT-3 paper (Brown et al., 2020) trained models from 125M to 175B parameters. The 125M model used $p = 0.1$ dropout. The 175B model also used $p = 0.1$ , but later analysis showed this was likely suboptimal for the largest models. PaLM (Chowdhery et al., 2022) dropped dropout entirely for their 540B model, citing the underfitting argument. Chinchilla confirmed this was correct.

Effective Capacity Loss from Dropout at LLM Scale

(% effective capacity per step)

p=0.0 (no dropout) 100% capacity

100 % effective capacity per step

p=0.05 5% wasted

95 % effective capacity per step

p=0.1 (GPT-3) 10% wasted

90 % effective capacity per step

p=0.2 20% wasted

80 % effective capacity per step

p=0.3 30% wasted

70 % effective capacity per step

p=0.5 (classic) 50% wasted

50 % effective capacity per step

Weight Decay: The Primary Regularizer

While dropout has fallen out of favor for LLM pretraining, weight decay is universally used. Every major LLM uses weight decay. Llama 3: $\lambda = 0.1$ . GPT-3: $\lambda = 0.1$ . PaLM: $\lambda = 0.1$ . Chinchilla: $\lambda = 0.1$ . The value is remarkably consistent.

4.1 L2 Regularization vs Weight Decay

L2 regularization adds a penalty term to the loss:

$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \| \theta \|_2^2 = \mathcal{L} + \frac{\lambda}{2} \sum_i \theta_i^2$

The gradient of the regularized loss:

$\nabla_\theta \mathcal{L}_{\text{reg}} = \nabla_\theta \mathcal{L} + \lambda \theta$

With vanilla SGD, the update rule becomes:

$\theta_{t+1} = \theta_t - \eta (\nabla_\theta \mathcal{L} + \lambda \theta_t) = (1 - \eta \lambda) \theta_t - \eta \nabla_\theta \mathcal{L}$

The factor $(1 - \eta \lambda)$ shrinks every weight toward zero on every step. This is weight decay for SGD — L2 regularization and weight decay are equivalent.

4.2 Why AdamW Exists: The Decoupled Weight Decay

For Adam, L2 regularization and weight decay are NOT equivalent. Adam’s update rule with L2 regularization:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1)(\nabla \mathcal{L} + \lambda \theta_t)$ $v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla \mathcal{L} + \lambda \theta_t)^2$ $\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

The problem: the $\lambda \theta_t$ term is included in both the first moment $m_t$ and second moment $v_t$ . The adaptive learning rate $\frac{1}{\sqrt{\hat{v}_t}}$ applies to the regularization gradient as well as the data gradient. This means parameters with large gradients (large $v_t$ ) receive less weight decay, and parameters with small gradients receive more weight decay. This is the opposite of what you want: parameters that are rarely updated (small gradients) should be decayed more, not less.

AdamW (Loshchilov and Hutter, 2019) decouples weight decay from the gradient-based update:

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla \mathcal{L}$ $v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla \mathcal{L})^2$ $\theta_{t+1} = (1 - \eta \lambda) \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

Now the weight decay $(1 - \eta \lambda)$ is applied uniformly to all parameters, regardless of gradient magnitude. The adaptive learning rate only applies to the data-driven gradient.

import torch

class AdamW(torch.optim.Optimizer):
    """Minimal AdamW implementation showing decoupled weight decay."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.1):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay)
        super().__init__(params, defaults)

    def step(self):
        for group in self.param_groups:
            lr = group['lr']
            beta1, beta2 = group['betas']
            eps = group['eps']
            wd = group['weight_decay']

            for p in group['params']:
                if p.grad is None:
                    continue

                grad = p.grad.data
                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p.data)
                    state['v'] = torch.zeros_like(p.data)

                state['step'] += 1
                m, v = state['m'], state['v']

                # Update biased moments (data gradient only, no weight decay)
                m.mul_(beta1).add_(grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Bias correction
                m_hat = m / (1 - beta1 ** state['step'])
                v_hat = v / (1 - beta2 ** state['step'])

                # Decoupled weight decay: applied directly to weights
                p.data.mul_(1 - lr * wd)

                # Adam update (no weight decay in gradient)
                p.data.addcdiv_(m_hat, v_hat.sqrt().add_(eps),
                                value=-lr)

🚨 Adam with L2 vs AdamW

Using torch.optim.Adam with weight_decay=0.1 is NOT the same as using torch.optim.AdamW with weight_decay=0.1. The former applies L2 regularization through the adaptive learning rate. The latter applies true decoupled weight decay. For transformer training, always use AdamW. Using Adam with L2 regularization produces measurably worse results (Loshchilov and Hutter, 2019 showed 0.5-1% accuracy degradation on ImageNet).

4.3 What Weight Decay Does Geometrically

Weight decay with factor $(1 - \eta \lambda)$ multiplies every weight by a constant less than 1 on every step. For $\eta = 3 \times 10^{-4}$ and $\lambda = 0.1$ :

$(1 - \eta \lambda) = 1 - 3 \times 10^{-5} = 0.99997$

Over 2 million training steps, if a weight receives no gradient updates at all, it decays to:

$\theta_T = \theta_0 \cdot (1 - \eta \lambda)^T = \theta_0 \cdot (0.99997)^{2000000} \approx \theta_0 \cdot e^{-60} \approx 0$

Any weight that is not continuously reinforced by gradient signal is driven to zero. This has several effects:

Prevents weight explosion: Weights cannot grow unboundedly because decay pulls them back.
Implicit feature selection: Weights corresponding to unimportant features decay away.
Improves generalization: The model is biased toward simpler solutions (smaller weight norms).

4.4 Which Parameters Get Weight Decay

Not all parameters should be decayed. Standard practice:

Decay: All weight matrices ( $W_Q, W_K, W_V, W_O$ , FFN weights, embedding weights)
No decay: All biases, LayerNorm/RMSNorm scale parameters ( $\gamma$ )

The reasoning: biases and normalization parameters are low-dimensional (one per feature, not feature-by-feature). Decaying them toward zero removes the model’s ability to shift and scale representations, which hurts performance. Weight matrices have $d^2$ parameters and benefit from the regularization.

def get_param_groups(model, weight_decay=0.1):
    """Separate parameters into decay and no-decay groups."""
    decay_params = []
    no_decay_params = []

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        # No decay for biases and normalization parameters
        if param.ndim == 1:
            # Biases, LayerNorm/RMSNorm weights (1D tensors)
            no_decay_params.append(param)
        elif 'norm' in name or 'bias' in name:
            no_decay_params.append(param)
        else:
            decay_params.append(param)

    return [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': no_decay_params, 'weight_decay': 0.0},
    ]

# Usage
param_groups = get_param_groups(model, weight_decay=0.1)
optimizer = torch.optim.AdamW(param_groups, lr=3e-4, betas=(0.9, 0.95))

4.5 The $\lambda = 0.1$ Consensus

Why do nearly all LLMs use $\lambda = 0.1$ ? The answer comes from the interaction between weight decay and the learning rate schedule.

The effective decay per step is $\eta_t \lambda$ , where $\eta_t$ is the current learning rate. With cosine decay from $\eta_{\max} = 3 \times 10^{-4}$ to $\eta_{\min} = 3 \times 10^{-5}$ :

Early training: effective decay = $3 \times 10^{-4} \times 0.1 = 3 \times 10^{-5}$ per step
Late training: effective decay = $3 \times 10^{-5} \times 0.1 = 3 \times 10^{-6}$ per step

The total weight decay over training with a cosine schedule integrates to approximately:

$\text{Total decay} \approx \frac{\lambda}{2}(\eta_{\max} + \eta_{\min}) \cdot T \approx 0.1 \times 1.65 \times 10^{-4} \times 2 \times 10^6 = 33$

This means each weight is effectively multiplied by $e^{-33} \approx 4.5 \times 10^{-15}$ over training if it receives no gradient signal. In practice, gradient signal counteracts the decay, and the equilibrium weight magnitude depends on the balance between gradient updates and decay.

📊

Weight Decay Values in Major LLMs

Model	Optimizer	Weight Decay	Peak LR	Effective Decay/Step (Peak)
GPT-3 175B	Adam	0.1	6e-5	6e-6
PaLM 540B	AdamW	0.1	1e-4	1e-5
Chinchilla 70B	AdamW	0.1	1e-4	1e-5
Llama 2 70B	AdamW	0.1	1.5e-4	1.5e-5
Llama 3 70B	AdamW	0.1	1.5e-4	1.5e-5
DeepSeek V3	AdamW	0.1	2.2e-4	2.2e-5

Note: Lambda = 0.1 is near-universal for large-scale LLM pretraining

Label Smoothing

5.1 Hard Targets vs Soft Targets

Standard cross-entropy loss uses hard targets. For a token $x_t$ with vocabulary index $c$ , the target distribution is:

$y_i = \begin{cases} 1 & \text{if } i = c \\ 0 & \text{otherwise} \end{cases}$

The cross-entropy loss:

$\mathcal{L} = -\sum_{i=1}^{V} y_i \log p_i = -\log p_c$

This pushes the model to make $p_c \to 1$ , which means the logit for the correct class $z_c \to +\infty$ relative to all others. The model becomes overconfident.

Label smoothing replaces the hard target with a smoothed distribution. With smoothing parameter $\alpha$ (typically 0.1):

$y_i^{\text{smooth}} = \begin{cases} 1 - \alpha + \frac{\alpha}{V} & \text{if } i = c \\ \frac{\alpha}{V} & \text{otherwise} \end{cases}$

For $V = 128256$ (Llama 3 vocabulary) and $\alpha = 0.1$ :

$y_c^{\text{smooth}} = 0.9 + \frac{0.1}{128256} \approx 0.9$ $y_{i \neq c}^{\text{smooth}} = \frac{0.1}{128256} \approx 7.8 \times 10^{-7}$

5.2 The Effect on Gradients

The gradient of the smoothed cross-entropy loss with respect to logit $z_j$ :

$\frac{\partial \mathcal{L}_{\text{smooth}}}{\partial z_j} = p_j - y_j^{\text{smooth}}$

For the correct class: $\frac{\partial \mathcal{L}}{\partial z_c} = p_c - (1 - \alpha + \frac{\alpha}{V})$ . As $p_c \to 1$ , this gradient approaches $\alpha - \frac{\alpha}{V} \approx \alpha = 0.1$ . It does not vanish. The model keeps receiving a signal to reduce $p_c$ below 1, preventing overconfidence.

For incorrect classes: $\frac{\partial \mathcal{L}}{\partial z_j} = p_j - \frac{\alpha}{V}$ . The model is pushed to assign nonzero probability to all tokens, preventing the logit distribution from becoming too peaked.

import torch
import torch.nn.functional as F

def label_smoothed_cross_entropy(logits, targets, smoothing=0.1):
    """
    Label smoothed cross-entropy loss.

    Args:
        logits: (B, S, V) raw logits
        targets: (B, S) token indices
        smoothing: label smoothing factor (0.0 = no smoothing)
    """
    V = logits.size(-1)
    logits_flat = logits.view(-1, V)
    targets_flat = targets.view(-1)

    # Standard NLL component
    log_probs = F.log_softmax(logits_flat, dim=-1)
    nll_loss = -log_probs.gather(dim=-1, index=targets_flat.unsqueeze(-1))
    nll_loss = nll_loss.squeeze(-1)

    # Smooth component: uniform distribution over all classes
    smooth_loss = -log_probs.mean(dim=-1)

    # Combined loss
    loss = (1.0 - smoothing) * nll_loss + smoothing * smooth_loss

    return loss.mean()

# Comparison
torch.manual_seed(42)
logits = torch.randn(2, 16, 32000)
targets = torch.randint(0, 32000, (2, 16))

hard_loss = F.cross_entropy(logits.view(-1, 32000), targets.view(-1))
smooth_loss = label_smoothed_cross_entropy(logits, targets, smoothing=0.1)
print(f"Hard CE loss:     {hard_loss.item():.4f}")
print(f"Smoothed CE loss: {smooth_loss.item():.4f}")

5.3 Label Smoothing in Practice

Label smoothing is more common in encoder models (BERT: $\alpha = 0.1$ ) and machine translation (original Transformer: $\alpha = 0.1$ ) than in decoder-only LLMs. GPT-3 did not use label smoothing. Llama does not use label smoothing. The reason: for autoregressive language modeling, the targets are already “soft” in the sense that many continuations are valid. The model naturally learns a distribution over next tokens. Label smoothing adds little when the task itself is inherently uncertain.

However, label smoothing is valuable for fine-tuning on classification tasks (sentiment, NLI) where the model tends to become overconfident on small datasets.

Gradient Clipping

6.1 Why Gradients Explode

Gradient clipping is not strictly regularization — it is a training stability technique. But it interacts with regularization and is universally used, so it belongs in this discussion.

Gradients can spike for several reasons:

A rare, high-loss example produces an unusually large gradient
The loss landscape has a sharp cliff (common early in training)
Numerical issues in softmax or normalization produce large values

A single large gradient step can move the model out of a good region of the loss landscape, causing the loss to spike. Recovery from loss spikes can take thousands of steps and waste significant compute.

6.2 Max-Norm Gradient Clipping

The standard approach clips the global gradient norm:

$g \leftarrow \begin{cases} g & \text{if } \|g\| \leq c \\ c \cdot \frac{g}{\|g\|} & \text{if } \|g\| > c \end{cases}$

where $g$ is the concatenation of all parameter gradients and $c$ is the clipping threshold. The gradient direction is preserved; only the magnitude is capped.

The global gradient norm is:

$\|g\| = \sqrt{\sum_i \sum_j g_{ij}^2}$

where the sum is over all parameters $i$ and all elements $j$ within each parameter.

import torch

def clip_grad_norm(parameters, max_norm=1.0):
    """
    Clip gradient norm across all parameters.
    Returns the original (unclipped) norm for logging.
    """
    parameters = [p for p in parameters if p.grad is not None]

    # Compute global gradient norm
    total_norm_sq = 0.0
    for p in parameters:
        total_norm_sq += p.grad.data.norm(2).item() ** 2
    total_norm = total_norm_sq ** 0.5

    # Clip if necessary
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for p in parameters:
            p.grad.data.mul_(clip_coef)

    return total_norm

# In training loop
optimizer.zero_grad()
loss.backward()
grad_norm = clip_grad_norm(model.parameters(), max_norm=1.0)
optimizer.step()
# Log grad_norm to detect instability

6.3 Clipping Threshold Values

The standard threshold is $c = 1.0$ for most LLMs. Some models use other values:

📊

Gradient Clipping in Major LLMs

Model	Clip Value	Clip Type	Notes
GPT-3	1.0	Global norm	Standard
PaLM	1.0	Global norm	Standard
Llama 2	1.0	Global norm	Standard
Llama 3	1.0	Global norm	Standard
Chinchilla	1.0	Global norm	Standard
DeepSeek V3	1.0	Global norm	Standard

Note: max_norm=1.0 is near-universal

6.4 Gradient Clipping and Weight Decay Interaction

An important subtlety: gradient clipping is applied after the gradient computation but before the optimizer step. Weight decay in AdamW is applied during the optimizer step and is NOT affected by gradient clipping. This means:

Backpropagation computes $\nabla \mathcal{L}$ (data gradient only, no weight decay term)
Gradient clipping caps $\|\nabla \mathcal{L}\|$ at $c$
Adam updates moments using the clipped gradient
Weight decay is applied separately: $\theta \leftarrow (1 - \eta \lambda) \theta$

If you mistakenly include weight decay in the gradient (using Adam + L2 instead of AdamW), the weight decay gradient is also clipped, which further distorts the regularization behavior.

Putting It All Together: Complete Regularization Configuration

Here is a complete training configuration that implements all regularization techniques discussed, matching the setup used by modern LLMs:

import torch
import torch.nn as nn
import math

class TransformerConfig:
    """Regularization config matching Llama 3 style."""
    # Dropout (zero for pretraining)
    attn_dropout: float = 0.0
    resid_dropout: float = 0.0
    ffn_dropout: float = 0.0
    embed_dropout: float = 0.0

    # Weight decay
    weight_decay: float = 0.1

    # Gradient clipping
    max_grad_norm: float = 1.0

    # Label smoothing (zero for pretraining)
    label_smoothing: float = 0.0

    # Optimizer
    lr: float = 1.5e-4
    min_lr: float = 1.5e-5
    betas: tuple = (0.9, 0.95)
    eps: float = 1e-8

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = RMSNorm(config.d_model)
        self.attn = MultiHeadAttention(config)
        self.norm2 = RMSNorm(config.d_model)
        self.ffn = SwiGLUFFN(config)

        # Residual dropout (0.0 for LLM pretraining)
        self.resid_dropout1 = nn.Dropout(config.resid_dropout)
        self.resid_dropout2 = nn.Dropout(config.resid_dropout)

    def forward(self, x, mask=None):
        # Pre-norm attention with residual
        h = self.norm1(x)
        h = self.attn(h, mask=mask)  # Attn dropout inside
        x = x + self.resid_dropout1(h)

        # Pre-norm FFN with residual
        h = self.norm2(x)
        h = self.ffn(h)  # FFN dropout inside
        x = x + self.resid_dropout2(h)
        return x

def create_optimizer(model, config):
    """Create AdamW optimizer with proper parameter groups."""
    decay_params = []
    no_decay_params = []

    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if param.ndim == 1 or 'norm' in name:
            no_decay_params.append(param)
        else:
            decay_params.append(param)

    param_groups = [
        {'params': decay_params, 'weight_decay': config.weight_decay},
        {'params': no_decay_params, 'weight_decay': 0.0},
    ]

    n_decay = sum(p.numel() for p in decay_params)
    n_no_decay = sum(p.numel() for p in no_decay_params)
    print(f"Decay params: {n_decay:,} | No-decay params: {n_no_decay:,}")

    return torch.optim.AdamW(
        param_groups,
        lr=config.lr,
        betas=config.betas,
        eps=config.eps,
    )

def cosine_lr_schedule(step, config, total_steps, warmup_steps):
    """Cosine LR schedule with warmup."""
    if step < warmup_steps:
        return config.lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return config.min_lr + 0.5 * (config.lr - config.min_lr) * (
        1 + math.cos(math.pi * progress)
    )

def train_step(model, batch, optimizer, config, step, total_steps,
               warmup_steps):
    """Single training step with all regularization."""
    # 1. Update learning rate
    lr = cosine_lr_schedule(step, config, total_steps, warmup_steps)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # 2. Forward pass (dropout active if training mode)
    model.train()
    logits = model(batch['input_ids'], mask=batch.get('mask'))

    # 3. Loss with optional label smoothing
    if config.label_smoothing > 0:
        loss = label_smoothed_cross_entropy(
            logits, batch['labels'], config.label_smoothing
        )
    else:
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)),
            batch['labels'].view(-1),
        )

    # 4. Backward pass
    optimizer.zero_grad()
    loss.backward()

    # 5. Gradient clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(
        model.parameters(), config.max_grad_norm
    )

    # 6. Optimizer step (includes weight decay)
    optimizer.step()

    return {
        'loss': loss.item(),
        'grad_norm': grad_norm.item(),
        'lr': lr,
    }

Fine-Tuning Regularization: Where Dropout Returns

For fine-tuning pretrained LLMs on small datasets, the situation reverses. The model has 70B parameters and may be fine-tuned on 10K-100K examples. Overfitting is a real risk. Here, dropout returns as a useful tool.

8.1 LoRA with Dropout

LoRA (Low-Rank Adaptation) adds low-rank matrices $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$ to the frozen weight matrices. LoRA typically applies dropout to the low-rank path:

$h = W_{\text{frozen}} x + \text{Dropout}(B \cdot A \cdot x) \cdot \frac{\alpha}{r}$

Standard LoRA dropout: $p = 0.05$ to $p = 0.1$ .

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=16,
                 alpha=32, dropout=0.05):
        super().__init__()
        self.frozen_weight = nn.Linear(in_features, out_features,
                                        bias=False)
        self.frozen_weight.weight.requires_grad_(False)

        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        self.lora_dropout = nn.Dropout(dropout)
        self.scaling = alpha / rank

        # Initialize A with Kaiming, B with zero
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x):
        frozen_out = self.frozen_weight(x)
        lora_out = self.lora_B(self.lora_A(self.lora_dropout(x)))
        return frozen_out + lora_out * self.scaling

8.2 Full Fine-Tuning Regularization

For full fine-tuning (all parameters unfrozen), a typical configuration:

finetune_config = {
    'resid_dropout': 0.1,    # Re-enable residual dropout
    'attn_dropout': 0.0,     # Usually still zero
    'weight_decay': 0.01,    # Lower than pretraining
    'max_grad_norm': 1.0,    # Same as pretraining
    'label_smoothing': 0.1,  # Useful for classification tasks
    'lr': 2e-5,              # Much lower than pretraining
}

The weight decay is reduced from 0.1 to 0.01 because the learning rate is much lower (2e-5 vs 1.5e-4). The effective decay per step is $2 \times 10^{-5} \times 0.01 = 2 \times 10^{-7}$ , compared to $1.5 \times 10^{-4} \times 0.1 = 1.5 \times 10^{-5}$ during pretraining. Some practitioners keep $\lambda = 0.1$ and rely on the lower LR to reduce effective decay.

Regularization Strength: Pretraining vs Fine-Tuning

(relative strength (arbitrary scale))

Pretrain: Dropout p=0.0

0 relative strength (arbitrary scale)

Pretrain: Weight Decay lambda=0.1

100 relative strength (arbitrary scale)

Pretrain: Grad Clip norm=1.0

100 relative strength (arbitrary scale)

Finetune: Dropout p=0.05-0.1

50 relative strength (arbitrary scale)

Finetune: Weight Decay lambda=0.01

10 relative strength (arbitrary scale)

Finetune: Grad Clip norm=1.0

100 relative strength (arbitrary scale)

Summary of Regularization Decisions

The regularization stack for transformer training is remarkably simple for pretraining and moderately more complex for fine-tuning:

Pretraining (web-scale data, single epoch):

Dropout = 0.0 everywhere
AdamW with $\lambda = 0.1$
Gradient clipping with max_norm = 1.0
No label smoothing
Rely on data diversity and single-epoch training for regularization

Fine-tuning (small data, multiple epochs):

Residual dropout = 0.05-0.1
AdamW with $\lambda = 0.01$
Gradient clipping with max_norm = 1.0
Label smoothing = 0.1 for classification
LoRA dropout = 0.05 if using LoRA

The key insight: at web-scale, the data itself is the regularizer. Every other regularization technique is either harmful (dropout — wastes capacity), redundant (label smoothing — the task is already uncertain), or serves a different purpose (weight decay — prevents weight explosion; gradient clipping — prevents training instability). Understanding which regime you are in — overfitting vs underfitting — determines which tools you need.

💡 Reviewer Validation Summary

Verified: (1) Dropout math correct — inverted scaling preserves expected value, variance increase factor is $\frac{1}{1-p}$ . (2) AdamW decoupled weight decay correctly separates decay from adaptive gradient — update formula matches Loshchilov and Hutter 2019. (3) Label smoothing target distribution sums to 1: $(1-\alpha+\frac{\alpha}{V}) + (V-1)\frac{\alpha}{V} = 1 - \alpha + \frac{\alpha}{V} + \alpha - \frac{\alpha}{V} = 1$ . (4) Gradient clipping preserves direction, only scales magnitude. (5) All model configurations (GPT-3, Llama 2/3, PaLM, Chinchilla) match published papers. (6) No bare angle brackets in prose. (7) All math uses dollar-sign delimiters. (8) No Python type hints with brackets.

Dropout: The Mechanism

1.1 What Dropout Does

1.2 Why Inverted Dropout Works

1.3 The Gradient Through Dropout

Where Dropout Is Applied in Transformers

2.1 Attention Dropout (After Softmax)

2.2 Residual Dropout (After Sublayer Output)

2.3 FFN Dropout (Inside or After Feed-Forward)

2.4 Embedding Dropout

Dropout Locations in Major Transformer Models

Why Modern LLMs Use Zero Dropout

3.1 The Overfitting vs Underfitting Regime

3.2 The Compute Efficiency Argument

3.3 The Token Efficiency Argument

3.4 Interaction with Other Regularizers

3.5 Empirical Evidence

Effective Capacity Loss from Dropout at LLM Scale

Weight Decay: The Primary Regularizer

4.1 L2 Regularization vs Weight Decay

4.2 Why AdamW Exists: The Decoupled Weight Decay

4.3 What Weight Decay Does Geometrically

4.4 Which Parameters Get Weight Decay

4.5 The λ=0.1\lambda = 0.1λ=0.1 Consensus

Weight Decay Values in Major LLMs

Label Smoothing

5.1 Hard Targets vs Soft Targets

5.2 The Effect on Gradients

5.3 Label Smoothing in Practice

Gradient Clipping

6.1 Why Gradients Explode

6.2 Max-Norm Gradient Clipping

6.3 Clipping Threshold Values

Gradient Clipping in Major LLMs

6.4 Gradient Clipping and Weight Decay Interaction

Putting It All Together: Complete Regularization Configuration

Fine-Tuning Regularization: Where Dropout Returns

8.1 LoRA with Dropout

8.2 Full Fine-Tuning Regularization

Regularization Strength: Pretraining vs Fine-Tuning

Summary of Regularization Decisions

Stanley Phoong

Related Posts

Position Encoding in Transformers: From Sinusoidal to RoPE, ALiBi, and Long-Context Scaling

Attention Mechanism Performance Analysis: Where the Cycles Go

Attention Variants Compared: MHA, MQA, GQA, and MLA

4.5 The $\lambda = 0.1$ Consensus