The learning rate is the single most important hyperparameter in LLM training. Too high: training diverges (loss spikes to infinity). Too low: training converges but wastes compute by moving too slowly. The schedule — how the learning rate changes over training — determines stability, final quality, and compute efficiency.
Why Not a Constant Learning Rate?
A constant LR fails for transformers because the loss landscape changes during training:
- Early training: The model is far from any minimum. Large gradients dominate. A high LR causes overshooting and instability, especially with randomly initialized attention weights that produce near-uniform attention distributions (high-entropy softmax outputs create large gradients through the attention mechanism).
- Mid training: The model is in a reasonable basin. Moderate LR allows efficient descent.
- Late training: The model is near a minimum. High LR causes oscillation around the minimum rather than convergence. Low LR is needed for fine-grained optimization.
Linear Warmup
The first 0.1-2% of training uses a linearly increasing LR:
def linear_warmup(step, max_lr, warmup_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr
Why warmup exists: At step 0, all weights are randomly initialized. The Adam optimizer's second-moment estimates (v_t) are zero — they haven't accumulated any gradient history. Without warmup, the first updates apply the full step lr * m_hat / (sqrt(v_hat) + eps) with v_hat estimated from just one or two noisy gradients, so the effective updates are at their maximum magnitude precisely when the gradient statistics are least reliable — which can destabilize training.
Warmup gives Adam time to estimate the gradient statistics before applying full-strength updates. For most LLMs, warmup is 0.1-0.5% of total steps: longer warmup wastes compute by moving too slowly early, while shorter warmup risks instability. Llama 3 uses 2,000 warmup steps out of ~2M total (0.1%); DeepSeek V3 uses a similar ratio. The exact value matters less than having warmup at all.
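The effect is easy to see numerically. The toy sketch below (plain Python on a single scalar parameter, not an actual training loop) applies bias-corrected Adam steps: even at step 1, with no gradient history, the effective update is already a full lr-sized step regardless of how large or small the raw gradient is — warmup is the only mechanism shrinking those earliest, least-trustworthy updates.

```python
import math

def adam_step_sizes(grads, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Effective update magnitude lr * |m_hat| / (sqrt(v_hat) + eps)
    for each of the first few Adam steps on a scalar parameter."""
    m = v = 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        sizes.append(lr * abs(m_hat) / (math.sqrt(v_hat) + eps))
    return sizes

# A huge first gradient (100.0) and a tiny one (0.01) both produce a
# ~lr-sized step: Adam normalizes by sqrt(v_hat), so only warmup can
# scale the earliest updates down.
print(adam_step_sizes([100.0]))  # ~[3e-4]
print(adam_step_sizes([0.01]))   # ~[3e-4]
```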
Cosine Decay
After warmup, the standard schedule decays the LR following a cosine curve to a minimum value:
import math

def cosine_schedule(step, max_lr, min_lr, warmup_steps, total_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
Why cosine? The cosine curve has two useful properties: (1) it starts with a gentle decline (near the maximum for the first ~25% of decay), allowing most training to happen at high LR, and (2) it has a long tail at low LR, allowing fine-grained convergence. Empirically, cosine outperforms linear decay by 0.5-1% on downstream benchmarks.
Typical minimum LR: 10% of the peak (Llama 3 uses this ratio, e.g. 1.5e-5 against a 1.5e-4 peak for the 70B model).
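The two shape claims can be checked with a few evaluations of the cosine decay factor (a small standalone sketch, independent of the schedule code above):

```python
import math

def cosine_factor(progress):
    """Fraction of the (max_lr - min_lr) range remaining at a given
    point in the decay phase (0.0 = decay start, 1.0 = decay end)."""
    return 0.5 * (1 + math.cos(math.pi * progress))

# Gentle early decline, long low-LR tail:
for p in (0.0, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"progress={p:.2f}  factor={cosine_factor(p):.3f}")
# 25% into decay, ~85% of the LR range remains; at 90%, only ~2.4%.
```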
(Figure: learning rate over training, Llama 3 style; y-axis: lr x 1e-5.)
WSD: Warmup-Stable-Decay
A newer schedule gaining adoption (used in some DeepSeek and Qwen training runs):
- Warmup: Linear increase to the peak LR (same as before)
- Stable: Hold at the peak LR for most of training (60-80% of steps)
- Decay: Rapid decay (linear or cosine) in the final 20-40% of steps
def wsd_schedule(step, max_lr, min_lr, warmup_steps, stable_steps, total_steps):
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    elif step < warmup_steps + stable_steps:
        return max_lr
    else:
        decay_steps = total_steps - warmup_steps - stable_steps
        decay_progress = (step - warmup_steps - stable_steps) / decay_steps
        return min_lr + (max_lr - min_lr) * (1 - decay_progress)
Why WSD? Cosine decay starts reducing the LR immediately after warmup. This means most training happens at a reduced LR. WSD keeps the LR at maximum for longer, allowing the model to explore more of the loss landscape before settling. The rapid final decay then focuses the model into a sharp minimum.
Cosine vs WSD Schedule Comparison
| Property | Cosine Decay | WSD |
|---|---|---|
| Time at peak LR | ~0% | 60-80% of training |
| Avg LR during training | ~55% of max | ~85% of max |
| Final quality (PPL) | Baseline | -0.1 to -0.3 PPL (better) |
| Flexibility | Must know total steps upfront | Can extend stable phase |
| Used by | Llama 3, GPT-4 | DeepSeek V3, some Qwen runs |
With cosine decay, you must decide the total training steps upfront — the schedule is fixed. If you want to train longer (more data became available), you must restart with a new schedule. WSD lets you extend the stable phase arbitrarily — just keep training at max LR, and decay whenever you’re ready to finish. This flexibility is why it’s gaining adoption for frontier models where training duration is often adjusted mid-run.
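The average-LR row of the table can be reproduced numerically. The sketch below assumes a 70% stable phase and a minimum LR of 10% of peak — both mid-range illustrative values, not figures from any specific run:

```python
import math

def cosine_lr(step, max_lr, min_lr, warmup, total):
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def wsd_lr(step, max_lr, min_lr, warmup, stable, total):
    if step < warmup:
        return max_lr * step / warmup
    if step < warmup + stable:
        return max_lr
    progress = (step - warmup - stable) / (total - warmup - stable)
    return min_lr + (max_lr - min_lr) * (1 - progress)

total, warmup = 100_000, 1_000
stable = int(0.7 * total)            # assumed 70% stable phase
max_lr, min_lr = 3e-4, 3e-5          # min = 10% of peak

cos_avg = sum(cosine_lr(s, max_lr, min_lr, warmup, total)
              for s in range(total)) / (total * max_lr)
wsd_avg = sum(wsd_lr(s, max_lr, min_lr, warmup, stable, total)
              for s in range(total)) / (total * max_lr)
print(f"cosine average LR: {cos_avg:.0%} of peak")  # ~55%
print(f"wsd average LR:    {wsd_avg:.0%} of peak")  # ~85-86%
```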
Peak Learning Rate Selection
The maximum LR depends on model size. Larger models need lower LR (their gradients have more variance):
Peak Learning Rate by Model Size
| Model Size | Peak LR | Warmup Steps | Example |
|---|---|---|---|
| 125M | 6e-4 | 500 | GPT-2 small |
| 1.3B | 2e-4 | 1000 | Llama 1B |
| 7B | 3e-4 | 2000 | Llama 3 8B |
| 70B | 1.5e-4 | 2000 | Llama 3 70B |
| 405B | 8e-5 | 2000 | Llama 3 405B |
| 671B (MoE) | 2.2e-4 | 2000 | DeepSeek V3 (higher due to MoE) |
The rough scaling: peak LR ∝ N^(-1/4) (fitting the dense-model endpoints of the table above), where N is parameter count. This is an empirical observation, not a derived formula. mu-P (Transformer Anatomy Part 17) provides a principled alternative.
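One way to sanity-check the exponent: fit a power law between the dense-model endpoints of the table (GPT-2 small and Llama 3 405B; the MoE row is excluded since, as noted, it breaks the trend):

```python
import math

# Fit peak_lr ∝ N^(-alpha) between two dense models from the table.
n_small, lr_small = 125e6, 6e-4   # GPT-2 small
n_large, lr_large = 405e9, 8e-5   # Llama 3 405B

alpha = math.log(lr_small / lr_large) / math.log(n_large / n_small)
print(f"alpha ≈ {alpha:.2f}")  # roughly 0.25
```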
Complete Implementation
import math

class LRScheduler:
    """Configurable LR scheduler supporting warmup + cosine or WSD."""

    def __init__(self, max_lr, min_lr, warmup_steps, total_steps,
                 schedule="cosine", stable_fraction=0.0):
        self.max_lr = max_lr
        self.min_lr = min_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.schedule = schedule
        self.stable_steps = int(stable_fraction * total_steps)

    def get_lr(self, step):
        # Warmup phase
        if step < self.warmup_steps:
            return self.max_lr * step / self.warmup_steps
        if self.schedule == "wsd":
            # Stable phase
            if step < self.warmup_steps + self.stable_steps:
                return self.max_lr
            # Decay phase
            decay_start = self.warmup_steps + self.stable_steps
            decay_total = self.total_steps - decay_start
            progress = (step - decay_start) / max(1, decay_total)
            return self.min_lr + (self.max_lr - self.min_lr) * (1 - progress)
        else:  # cosine
            progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
            return self.min_lr + 0.5 * (self.max_lr - self.min_lr) * (
                1 + math.cos(math.pi * progress)
            )

# Usage for Llama 3 70B:
scheduler = LRScheduler(
    max_lr=1.5e-4, min_lr=1.5e-5,
    warmup_steps=2000, total_steps=2_000_000,
    schedule="cosine"
)
Reviewer Agent Validation
Challenge: Compute the learning rate at step 500,000 for a Llama 3 70B training run with cosine schedule (max_lr=1.5e-4, min_lr=1.5e-5, warmup=2000, total=2M steps).
Expected: Progress = (500000 - 2000) / (2000000 - 2000) = 0.2492. LR = 1.5e-5 + 0.5 * (1.5e-4 - 1.5e-5) * (1 + cos(pi * 0.2492)) = 1.5e-5 + 6.75e-5 * (1 + cos(0.783)) = 1.5e-5 + 6.75e-5 * 1.709 = 1.5e-5 + 1.154e-4 = 1.303e-4.
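The arithmetic can be verified with a standalone snippet (same formula as the cosine schedule above):

```python
import math

max_lr, min_lr = 1.5e-4, 1.5e-5
warmup, total, step = 2_000, 2_000_000, 500_000

progress = (step - warmup) / (total - warmup)
lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
print(f"progress = {progress:.4f}, lr = {lr:.4e}")  # ≈ 0.2492, ≈ 1.303e-4
```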