RLHF (Reinforcement Learning from Human Feedback) was the original method for aligning LLMs: train a reward model on human preferences, then use PPO to optimize the policy against the reward model. It works but requires: (1) a separate reward model (same size as the policy = 2x GPU memory), (2) PPO infrastructure (actor, critic, reference model, reward model = 4 copies of similar-sized models), (3) careful hyperparameter tuning to prevent reward hacking.
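For reference, the reward model in that pipeline is typically trained with a Bradley-Terry pairwise objective. A minimal sketch (the function name and signature are mine, not from any particular library):

```python
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry pairwise loss: push r(x, y_w) above r(x, y_l)."""
    # chosen_rewards / rejected_rewards: [B] scalar rewards from the reward head
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```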
DPO (Direct Preference Optimization) showed in 2023 that you can skip the reward model entirely. Since then, the field has produced KTO, ORPO, IPO, and others — each with different tradeoffs. This post implements the three most important: DPO, KTO, and ORPO.
DPO: Direct Preference Optimization
Given preference pairs $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where $\beta$ controls the deviation from the reference policy, $\sigma$ is the sigmoid function, and $\pi_{\text{ref}}$ is the frozen reference model (the SFT checkpoint).
The key insight: the optimal policy under RLHF with a Bradley-Terry reward model has a closed-form solution that depends only on the log-probability ratio between the policy and reference model. No reward model needed. No RL loop needed. Just supervised learning on preference pairs.
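To see why, recall that the KL-constrained RLHF objective is maximized by $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big)$. Inverting this gives the implicit reward

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x),$$

and when this reward is plugged into the Bradley-Terry preference probability, the intractable partition term $\log Z(x)$ cancels in the difference $r(x, y_w) - r(x, y_l)$, leaving exactly the loss above.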
```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps,    # log P(y_w | x) under current policy
    policy_rejected_logps,  # log P(y_l | x) under current policy
    ref_chosen_logps,       # log P(y_w | x) under reference (frozen SFT)
    ref_rejected_logps,     # log P(y_l | x) under reference
    beta=0.1,
):
    """
    Direct Preference Optimization loss.

    Args:
        policy_chosen_logps: [B] log-probs of chosen responses under policy
        policy_rejected_logps: [B] log-probs of rejected responses under policy
        ref_chosen_logps: [B] log-probs of chosen responses under reference
        ref_rejected_logps: [B] log-probs of rejected responses under reference
        beta: temperature parameter (higher = stay closer to reference)

    Returns:
        loss: scalar, the DPO loss to minimize
        metrics: dict with the reward margin and preference accuracy
    """
    # Log-ratio: how much policy diverges from reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO: maximize the gap between chosen and rejected log-ratios
    logits = beta * (chosen_logratios - rejected_logratios)

    # Binary cross-entropy: we want sigmoid(logits) to be close to 1
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics
    with torch.no_grad():
        chosen_rewards = beta * chosen_logratios
        rejected_rewards = beta * rejected_logratios
        reward_margin = (chosen_rewards - rejected_rewards).mean()
        accuracy = (logits > 0).float().mean()

    return loss, {
        "reward_margin": reward_margin.item(),
        "accuracy": accuracy.item(),
    }
```
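The loss takes sequence-level log-probabilities as inputs. One way to compute them from model logits — a sketch under assumptions of my own (HF-style causal-LM logits, labels with `-100` marking prompt and padding tokens, and the helper name `sequence_logps` is mine):

```python
def sequence_logps(logits, labels, ignore_index=-100):
    """Sum of per-token log-probs of `labels` under causal-LM `logits`."""
    logits = logits[:, :-1, :]                  # position t predicts token t+1
    labels = labels[:, 1:]
    mask = labels != ignore_index               # ignore prompt / padding tokens
    safe_labels = labels.masked_fill(~mask, 0)  # gather needs valid indices
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, safe_labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)         # [B]: one log-prob per sequence
```

Each of the four `dpo_loss` inputs is then one forward pass (policy or frozen reference, chosen or rejected response) reduced to a per-sequence scalar.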
Why $\beta$ matters: $\beta = 0.1$ is standard. Lower $\beta$ allows the policy to deviate more from the reference (potentially higher quality but risk of reward hacking). Higher $\beta$ constrains the policy (safer but may under-optimize).
KTO: Kahneman-Tversky Optimization
DPO requires paired preferences: “response A is better than response B for the same prompt.” KTO works with unpaired data: “this response is good” or “this response is bad” — independent labels, not comparisons.
```python
def kto_loss(
    policy_chosen_logps,    # log P(y | x) for good responses
    policy_rejected_logps,  # log P(y | x) for bad responses
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,
):
    """
    Kahneman-Tversky Optimization loss.

    Works with unpaired good/bad labels instead of preference pairs.
    """
    # Per-response log-ratios against the frozen reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # KL divergence from reference (batch estimates)
    chosen_kl = chosen_logratios.mean()
    rejected_kl = rejected_logratios.mean()

    # Reference point: average KL, detached so it acts as a fixed baseline
    kl_ref = 0.5 * (chosen_kl + rejected_kl).detach()

    # KTO: push chosen log-ratios above the reference point and rejected
    # log-ratios below it, with Kahneman-Tversky-style weights
    lambda_d = 1.0  # Weight for desirable (chosen) responses
    lambda_u = 1.0  # Weight for undesirable (rejected) responses; raise
                    # relative to lambda_d to encode loss aversion
    chosen_loss = 1 - torch.sigmoid(beta * (chosen_logratios - kl_ref))
    rejected_loss = 1 - torch.sigmoid(beta * (kl_ref - rejected_logratios))

    loss = (lambda_d * chosen_loss.mean() + lambda_u * rejected_loss.mean()) / 2
    return loss
```
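Because nothing in the loss pairs individual rows, the "good" and "bad" batches do not even have to be the same size. A toy usage sketch with made-up log-probabilities:

```python
# Toy usage: 3 "thumbs-up" and 5 "thumbs-down" responses -- no pairing needed.
good_policy = torch.tensor([-1.2, -0.8, -2.0])
good_ref = torch.tensor([-1.5, -1.0, -2.4])
bad_policy = torch.tensor([-0.9, -1.1, -3.0, -2.2, -1.7])
bad_ref = torch.tensor([-0.7, -1.0, -2.5, -2.0, -1.6])

loss = kto_loss(good_policy, bad_policy, good_ref, bad_ref, beta=0.1)
print(loss.item())  # scalar loss over the unpaired batch
```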
KTO excels when you have abundant thumbs-up/thumbs-down data (easy to collect from users) but lack paired comparisons (expensive to annotate — requires showing annotators two responses to the same prompt). In practice, KTO matches DPO quality with 50% less annotation effort.
ORPO: Odds Ratio Preference Optimization
ORPO goes further: no reference model needed at all. It combines SFT and alignment into a single training phase by adding a preference term to the standard cross-entropy loss.
```python
def orpo_loss(
    policy_chosen_logps,    # log P(y_w | x) under policy
    policy_rejected_logps,  # log P(y_l | x) under policy
    sft_loss,               # Standard cross-entropy loss on chosen responses
    lambda_orpo=0.1,
):
    """
    Odds Ratio Preference Optimization.

    No reference model needed. Combines SFT + alignment.
    """
    # Odds: P / (1 - P); in log space: log_odds = logP - log(1 - exp(logP))
    chosen_log_odds = policy_chosen_logps - torch.log1p(-policy_chosen_logps.exp())
    rejected_log_odds = policy_rejected_logps - torch.log1p(-policy_rejected_logps.exp())

    # Log odds ratio: how much more likely is the chosen vs rejected?
    log_odds_ratio = chosen_log_odds - rejected_log_odds

    # ORPO loss: SFT loss + preference alignment
    preference_loss = -F.logsigmoid(log_odds_ratio).mean()
    loss = sft_loss + lambda_orpo * preference_loss
    return loss
```
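Putting it together, here is a hedged sketch of one combined SFT + alignment step. `model` and `batch` are assumptions (an HF-style causal LM with `.logits` and a collated preference batch using `-100` for prompt/padding labels), it reuses the `sequence_logps` helper sketched earlier, and note that the ORPO paper feeds length-normalized log-probs into the odds-ratio term:

```python
# Hypothetical ORPO training step; `model` and `batch` are assumed to exist.
logits_w = model(batch["chosen_input_ids"]).logits
logits_l = model(batch["rejected_input_ids"]).logits

chosen_logps = sequence_logps(logits_w, batch["chosen_labels"])
rejected_logps = sequence_logps(logits_l, batch["rejected_labels"])

# Standard next-token cross-entropy on the chosen responses (the SFT term)
sft = F.cross_entropy(
    logits_w[:, :-1].reshape(-1, logits_w.size(-1)),
    batch["chosen_labels"][:, 1:].reshape(-1),
    ignore_index=-100,
)

loss = orpo_loss(chosen_logps, rejected_logps, sft, lambda_orpo=0.1)
loss.backward()
```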
Alignment Method Comparison
| Method | Requires Reference Model? | Requires Paired Data? | Separate Alignment Phase? | Memory (70B model) |
|---|---|---|---|---|
| RLHF (PPO) | Yes (+ reward model) | Yes (for reward model) | Yes | 4x model size |
| DPO | Yes (frozen SFT) | Yes (preference pairs) | Yes | 2x model size |
| KTO | Yes (frozen SFT) | No (unpaired labels) | Yes | 2x model size |
| ORPO | No | Yes (preference pairs) | No (combined with SFT) | 1x model size |
| SimPO | No | Yes | Yes | 1x model size |
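To put the last column in absolute terms (weights only, FP16 at 2 bytes per parameter; gradients, optimizer states, activations, and KV cache add substantially more):

```python
# Back-of-envelope FP16 weight memory for a 70B model (weights only).
params = 70e9
copy_gb = params * 2 / 1e9  # ~140 GB per model copy
for method, copies in {"RLHF (PPO)": 4, "DPO": 2, "KTO": 2, "ORPO": 1, "SimPO": 1}.items():
    print(f"{method}: ~{copies * copy_gb:.0f} GB")
```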
(Figure: GPU memory in GB (FP16) per alignment method for a 70B model — same numbers as the table above.)

Reviewer Agent Validation
Challenge: Implement the DPO loss given batch tensors of log-probabilities. Verify that when the policy assigns higher probability to the chosen response (relative to reference) than to the rejected response, the loss is low.
Expected test:
```python
# Chosen response: policy likes it more than reference
policy_chosen = torch.tensor([-1.0])    # log P(y_w) under policy
ref_chosen = torch.tensor([-2.0])       # log P(y_w) under reference

# Rejected response: policy likes it less than reference
policy_rejected = torch.tensor([-3.0])  # log P(y_l) under policy
ref_rejected = torch.tensor([-2.5])     # log P(y_l) under reference

loss, metrics = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)

# chosen_logratio   = -1.0 - (-2.0) =  1.0  (policy prefers chosen more)
# rejected_logratio = -3.0 - (-2.5) = -0.5  (policy prefers rejected less)
# logits = 0.1 * (1.0 - (-0.5)) = 0.1 * 1.5 = 0.15
# loss = -log(sigmoid(0.15)) = -log(0.5374) ≈ 0.621
# accuracy = 1.0 (logits > 0, correct preference)
```
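A minimal set of assertions for these expected values (the tolerances are my own choice, not from the original challenge):

```python
assert metrics["accuracy"] == 1.0
assert abs(metrics["reward_margin"] - 0.15) < 1e-6
assert abs(loss.item() - 0.6209) < 1e-3
```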