Part 19 of 23 in the Transformer Anatomy series

Learning Rate Schedules: Warmup, Cosine Decay, and Why WSD Changes Everything

The learning rate is the single most important hyperparameter in LLM training. Too high: training diverges (loss spikes to infinity). Too low: training converges but wastes compute by moving too slowly. The schedule — how the learning rate changes over training — determines stability, final quality, and compute efficiency.

Why Not a Constant Learning Rate?

A constant LR fails for transformers because the loss landscape changes during training:

  • Early training: The model is far from any minimum. Large gradients dominate. A high LR causes overshooting and instability, especially with randomly initialized attention weights that produce near-uniform attention distributions (high-entropy softmax outputs create large gradients through the attention mechanism).
  • Mid training: The model is in a reasonable basin. Moderate LR allows efficient descent.
  • Late training: The model is near a minimum. High LR causes oscillation around the minimum rather than convergence. Low LR is needed for fine-grained optimization.

Linear Warmup

The first 0.1-2% of training uses a linearly increasing LR:

lr(t) = lr_max × (t / T_warmup)

def linear_warmup(step, max_lr, warmup_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr

Why warmup exists: At step 0, all weights are randomly initialized. The Adam optimizer's second-moment estimates (v_t) are zero: they haven't accumulated any gradient history. Without warmup, the first few updates scale as lr / sqrt(v_t + eps) with tiny v_t, producing extremely large effective updates that can destabilize training.

Warmup gives Adam time to estimate the gradient statistics before applying full-strength updates. Typical warmup: 2,000 steps (Llama 3), which is 0.1% of the total 2M training steps.
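A toy calculation makes this concrete. This sketch assumes Adam without bias correction on a single scalar parameter, to match the lr / sqrt(v_t + eps) form above; real implementations apply bias correction, but the same "cold statistics" problem motivates warmup in practice.

```python
import math

def adam_update_no_bias_correction(grads, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of each Adam update for a scalar parameter (no bias correction)."""
    m = v = 0.0
    updates = []
    for g in grads:
        m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (mean of squares)
        updates.append(lr * m / (math.sqrt(v) + eps))
    return updates

# Constant unit gradient: at step 1, m = 0.1 but sqrt(v) = sqrt(0.001) ~ 0.0316,
# so the very first update is ~3.2x the nominal learning rate.
updates = adam_update_no_bias_correction([1.0] * 5)
print(updates[0] / 3e-4)  # ~3.16
```

Warmup scales these early, poorly calibrated updates down until v_t reflects real gradient statistics.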

Warmup Duration Rule of Thumb

For most LLMs: warmup = 0.1-0.5% of total steps. Longer warmup wastes compute (too slow early). Shorter warmup risks instability. Llama 3: 2,000 steps warmup out of ~2M total. DeepSeek V3: similar ratio. The exact value matters less than having warmup at all.

Cosine Decay

After warmup, the standard schedule decays the LR following a cosine curve to a minimum value:

lr(t) = lr_min + 0.5 × (lr_max − lr_min) × (1 + cos(π × (t − T_warmup) / (T_total − T_warmup)))

import math

def cosine_schedule(step, max_lr, min_lr, warmup_steps, total_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

Why cosine? The cosine curve has two useful properties: (1) it starts with a gentle decline (near the maximum for the first ~25% of decay), allowing most training to happen at high LR, and (2) it has a long tail at low LR, allowing fine-grained convergence. Empirically, cosine outperforms linear decay by 0.5-1% on downstream benchmarks.
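Property (1) can be quantified with a quick sketch (assuming the typical min_lr = 0.1 × max_lr ratio): the fraction of the peak LR remaining at each quarter of the decay phase.

```python
import math

def cosine_fraction(progress, min_frac=0.1):
    """LR as a fraction of max, at a given fraction of the decay phase."""
    return min_frac + 0.5 * (1 - min_frac) * (1 + math.cos(math.pi * progress))

for p in (0.25, 0.5, 0.75, 1.0):
    print(f"{p:.0%} through decay: {cosine_fraction(p):.0%} of max LR")
# 25% -> 87%, 50% -> 55%, 75% -> 23%, 100% -> 10%
```

A quarter of the way through decay, the LR is still at 87% of its peak; the steep drop is concentrated in the middle of the curve.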

Typical minimum LR: lr_min = 0.1 × lr_max (Llama 3 uses this ratio).

Learning Rate Over Training (Llama 3 Style)

| Phase | Step | LR |
|---|---|---|
| Warmup (linear) | 0 → 2K | 0 → 3e-4 |
| Peak | 2K | 3e-4 |
| Cosine mid | 500K | ~2e-4 |
| Cosine late | 1.5M | ~8e-5 |
| Final (min) | 2M | 3e-5 |

WSD: Warmup-Stable-Decay

A newer schedule gaining adoption (used in some DeepSeek and Qwen training runs):

  1. Warmup: Linear increase to lrmax\text{lr}_{\max} (same as before)
  2. Stable: Hold at lrmax\text{lr}_{\max} for most of training (60-80% of steps)
  3. Decay: Rapid decay (linear or cosine) in the final 20-40% of steps

def wsd_schedule(step, max_lr, min_lr, warmup_steps, stable_steps, total_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    elif step < warmup_steps + stable_steps:
        return max_lr
    else:
        decay_steps = total_steps - warmup_steps - stable_steps
        decay_progress = (step - warmup_steps - stable_steps) / decay_steps
        return min_lr + (max_lr - min_lr) * (1 - decay_progress)

Why WSD? Cosine decay starts reducing the LR immediately after warmup. This means most training happens at a reduced LR. WSD keeps the LR at maximum for longer, allowing the model to explore more of the loss landscape before settling. The rapid final decay then focuses the model into a sharp minimum.
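A quick numerical check of this argument: the average LR over a run, as a fraction of max, for each schedule. This is an illustrative sketch, assuming min_lr = 0.1 × max_lr, ignoring warmup, and giving WSD a 70% stable phase with linear decay; these are not any specific model's settings.

```python
import math

def avg_lr(schedule, steps=100_000):
    """Mean of the schedule over the whole run (LR as a fraction of max)."""
    return sum(schedule(t / steps) for t in range(steps)) / steps

def cosine(p):
    return 0.1 + 0.5 * 0.9 * (1 + math.cos(math.pi * p))

def wsd(p, stable=0.7):
    if p < stable:
        return 1.0
    return 0.1 + 0.9 * (1 - (p - stable) / (1 - stable))

print(f"cosine average: {avg_lr(cosine):.2f} of max")  # ~0.55
print(f"wsd average:    {avg_lr(wsd):.2f} of max")     # ~0.87
```

Cosine spends the run averaging roughly 55% of peak LR; WSD averages well above 80%, which is the "more time at high LR" advantage in numbers.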


Cosine vs WSD Schedule Comparison

| Property | Cosine Decay | WSD |
|---|---|---|
| Time at peak LR | ~0% | 60-80% of training |
| Avg LR during training | ~55% of max | ~85% of max |
| Final quality (PPL) | Baseline | -0.1 to -0.3 PPL (better) |
| Flexibility | Must know total steps upfront | Can extend stable phase |
| Used by | Llama 3, GPT-4 | DeepSeek V3, some Qwen runs |
WSD's Key Advantage: Training Extensions

With cosine decay, you must decide the total training steps upfront — the schedule is fixed. If you want to train longer (more data became available), you must restart with a new schedule. WSD lets you extend the stable phase arbitrarily — just keep training at max LR, and decay whenever you’re ready to finish. This flexibility is why it’s gaining adoption for frontier models where training duration is often adjusted mid-run.
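A minimal sketch of this decide-at-runtime property (the wsd_lr helper and its parameter names are my own, not from any training codebase): the decay start is an argument you can leave unset, so the stable phase runs until you commit to finishing.

```python
def wsd_lr(step, decay_start, decay_len,
           max_lr=3e-4, min_lr=3e-5, warmup=2_000):
    """WSD where the decay start is decided at runtime, not at launch."""
    if step < warmup:
        return max_lr * step / warmup
    if decay_start is None or step < decay_start:
        return max_lr  # stable phase: can be extended indefinitely
    progress = min(1.0, (step - decay_start) / decay_len)
    return min_lr + (max_lr - min_lr) * (1 - progress)

# Still undecided at 1.5M steps? Keep training at max LR.
print(wsd_lr(1_500_000, decay_start=None, decay_len=200_000))       # 3e-4
# Later, commit to finishing: linear decay over the final 200K steps.
print(wsd_lr(1_900_000, decay_start=1_800_000, decay_len=200_000))  # 1.65e-4
```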

Peak Learning Rate Selection

The maximum LR depends on model size. Larger models need lower LR (their gradients have more variance):


Peak Learning Rate by Model Size

| Model Size | Peak LR | Warmup Steps | Example |
|---|---|---|---|
| 125M | 6e-4 | 500 | GPT-2 small |
| 1.3B | 2e-4 | 1,000 | Llama 1B |
| 7B | 3e-4 | 2,000 | Llama 3 8B |
| 70B | 1.5e-4 | 2,000 | Llama 3 70B |
| 405B | 8e-5 | 2,000 | Llama 3 405B |
| 671B (MoE) | 2.2e-4 | 2,000 | DeepSeek V3 |

Note: MoE models can use a higher LR than dense models of the same total size because only a fraction of parameters update per token.

The rough scaling: lr_max ∝ N^(-0.5), where N is the parameter count. This is an empirical observation, not a derived formula. mu-P (Transformer Anatomy Part 17) provides a principled alternative.
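As a sanity check of how rough this heuristic is, we can anchor the constant on the 7B / 3e-4 row of the table above and predict the others. Expect only order-of-magnitude agreement: published values also reflect batch size, data, and architecture choices.

```python
# Anchor the N^(-0.5) heuristic on the 7B row (peak lr 3e-4).
C = 3e-4 * (7e9) ** 0.5

# Predictions land within a factor of ~2-3x of the table's published values,
# which is about as much as an empirical power law can promise.
for n, name in [(1.3e9, "1.3B"), (7e10, "70B"), (4.05e11, "405B")]:
    print(f"{name}: predicted peak lr ~ {C * n ** -0.5:.1e}")
```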

Complete Implementation

import math

class LRScheduler:
    """Configurable LR scheduler supporting warmup + cosine or WSD."""

    def __init__(self, max_lr, min_lr, warmup_steps, total_steps,
                 schedule="cosine", stable_fraction=0.0):
        self.max_lr = max_lr
        self.min_lr = min_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.schedule = schedule
        self.stable_steps = int(stable_fraction * total_steps)

    def get_lr(self, step):
        # Warmup phase
        if step < self.warmup_steps:
            return self.max_lr * step / self.warmup_steps

        if self.schedule == "wsd":
            # Stable phase
            if step < self.warmup_steps + self.stable_steps:
                return self.max_lr
            # Decay phase (linear)
            decay_start = self.warmup_steps + self.stable_steps
            decay_total = self.total_steps - decay_start
            progress = (step - decay_start) / max(1, decay_total)
            return self.min_lr + (self.max_lr - self.min_lr) * (1 - progress)

        else:  # cosine
            progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
            return self.min_lr + 0.5 * (self.max_lr - self.min_lr) * (
                1 + math.cos(math.pi * progress)
            )

# Usage for Llama 3 70B:
scheduler = LRScheduler(
    max_lr=1.5e-4, min_lr=1.5e-5,
    warmup_steps=2000, total_steps=2_000_000,
    schedule="cosine"
)

Reviewer Agent Validation

Challenge: Compute the learning rate at step 500,000 for a Llama 3 70B training run with cosine schedule (max_lr=1.5e-4, min_lr=1.5e-5, warmup=2000, total=2M steps).

Expected: Progress = (500000 - 2000) / (2000000 - 2000) = 0.2492. LR = 1.5e-5 + 0.5 * (1.5e-4 - 1.5e-5) * (1 + cos(pi * 0.2492)) = 1.5e-5 + 6.75e-5 * (1 + cos(0.7830)) = 1.5e-5 + 6.75e-5 * 1.7088 = 1.5e-5 + 1.153e-4 = 1.303e-4.
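The challenge can be checked numerically with the cosine formula from this article:

```python
import math

# Llama 3 70B cosine settings from the challenge above.
max_lr, min_lr, warmup, total = 1.5e-4, 1.5e-5, 2_000, 2_000_000
step = 500_000

progress = (step - warmup) / (total - warmup)
lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
print(f"progress = {progress:.4f}, lr = {lr:.3e}")  # progress = 0.2492, lr = 1.303e-04
```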