RLHF (Reinforcement Learning from Human Feedback) was the original method for aligning LLMs: train a reward model on human preferences, then use PPO to optimize the policy against the reward model. It works but requires: (1) a separate reward model (same size as the policy = 2x GPU memory), (2) PPO infrastructure (actor, critic, reference model, reward model = 4 copies of similar-sized models), (3) careful hyperparameter tuning to prevent reward hacking.

DPO (Direct Preference Optimization) showed in 2023 that you can skip the reward model entirely. Since then, the field has produced KTO, ORPO, IPO, and others — each with different tradeoffs. This post implements the three most important: DPO, KTO, and ORPO.

DPO: Direct Preference Optimization

Theorem: DPO Loss Function

Given preference pairs $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)\right]$$

where $\beta$ controls the deviation from the reference policy, $\sigma$ is the sigmoid function, and $\pi_{\text{ref}}$ is the frozen reference model (the SFT checkpoint).

The key insight: the optimal policy under RLHF with a Bradley-Terry reward model has a closed-form solution that depends only on the log-probability ratio between the policy and reference model. No reward model needed. No RL loop needed. Just supervised learning on preference pairs.
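The reparameterization can be sketched in three lines (the standard derivation from the DPO paper, stated informally):

```latex
% The KL-regularized RLHF objective has a closed-form optimum:
\pi^*(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,\exp\!\big(r(x,y)/\beta\big)

% Invert it to express the reward in terms of the policy:
r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

% Substitute into the Bradley-Terry preference model:
p(y_w \succ y_l \mid x) = \sigma\big(r(x,y_w) - r(x,y_l)\big)
% The intractable \log Z(x) term cancels, leaving the DPO loss above.
```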

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps,    # log P(y_w | x) under current policy
    policy_rejected_logps,  # log P(y_l | x) under current policy
    ref_chosen_logps,       # log P(y_w | x) under reference (frozen SFT)
    ref_rejected_logps,     # log P(y_l | x) under reference
    beta=0.1,
):
    """
    Direct Preference Optimization loss.

    Args:
        policy_chosen_logps: [B] log-probs of chosen responses under policy
        policy_rejected_logps: [B] log-probs of rejected responses under policy
        ref_chosen_logps: [B] log-probs of chosen responses under reference
        ref_rejected_logps: [B] log-probs of rejected responses under reference
        beta: temperature parameter (higher = stay closer to reference)

    Returns:
        loss: scalar, the DPO loss to minimize
        metrics: dict with "reward_margin" and "accuracy" diagnostics
    """
    # Log-ratio: how much policy diverges from reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO: maximize the gap between chosen and rejected log-ratios
    logits = beta * (chosen_logratios - rejected_logratios)

    # Binary cross-entropy: we want sigmoid(logits) to be close to 1
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics
    with torch.no_grad():
        chosen_rewards = beta * chosen_logratios
        rejected_rewards = beta * rejected_logratios
        reward_margin = (chosen_rewards - rejected_rewards).mean()
        accuracy = (logits > 0).float().mean()

    return loss, {
        "reward_margin": reward_margin.item(),
        "accuracy": accuracy.item(),
    }
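The loss above takes per-sequence log-probs as inputs. One common way to compute them from model logits (a sketch, assuming logits are already shifted to align with labels and `mask` marks the response tokens) is to gather per-token log-probs and sum over the mask:

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, mask):
    """
    Sum of per-token log-probs of `labels` under `logits`.

    logits: [B, T, V] model outputs, aligned with labels
    labels: [B, T] token ids of the response
    mask:   [B, T] 1.0 for response tokens, 0.0 for prompt/padding
    Returns: [B] log P(y | x) per sequence
    """
    logps = F.log_softmax(logits, dim=-1)                                   # [B, T, V]
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)  # [B, T]
    return (token_logps * mask).sum(-1)

# Sanity check with a near-deterministic distribution
logits = torch.full((1, 2, 4), -100.0)
logits[0, :, 3] = 0.0          # token 3 has probability ~1 at both positions
labels = torch.tensor([[3, 3]])
mask = torch.ones(1, 2)
print(sequence_logps(logits, labels, mask))  # ~tensor([0.])
```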

Why $\beta$ matters: $\beta = 0.1$ is standard. Lower $\beta$ allows the policy to deviate more from the reference (potentially higher quality, but a greater risk of reward hacking). Higher $\beta$ constrains the policy (safer, but it may under-optimize).
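One way to see the effect: for a fixed positive margin between chosen and rejected log-ratios, a higher $\beta$ saturates the loss sooner, so the policy needs to move less from the reference to drive the loss down. A quick standalone check:

```python
import torch
import torch.nn.functional as F

# Fixed margin: chosen_logratio - rejected_logratio
margin = torch.tensor(1.5)

losses = []
for beta in (0.01, 0.1, 0.5):
    loss = -F.logsigmoid(beta * margin).item()
    losses.append(loss)
    print(f"beta={beta}: loss={loss:.3f}")
```

The same log-ratio gap yields a much smaller loss at $\beta = 0.5$ than at $\beta = 0.01$.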

KTO: Kahneman-Tversky Optimization

DPO requires paired preferences: “response A is better than response B for the same prompt.” KTO works with unpaired data: “this response is good” or “this response is bad” — independent labels, not comparisons.

def kto_loss(
    policy_chosen_logps,    # log P(y | x) for good responses
    policy_rejected_logps,  # log P(y | x) for bad responses
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,
):
    """
    Kahneman-Tversky Optimization loss.
    Works with unpaired good/bad labels instead of preference pairs.
    """
    # Batch estimates of the policy's KL from the reference
    chosen_kl = (policy_chosen_logps - ref_chosen_logps).mean()
    rejected_kl = (policy_rejected_logps - ref_rejected_logps).mean()

    # KTO: maximize chosen log-ratio, minimize rejected log-ratio
    # with asymmetric weighting (loss aversion from Kahneman-Tversky)
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Reference point: average KL, detached so it acts as a fixed baseline.
    # (The KTO paper estimates this KL from mismatched prompt/completion
    # pairs; averaging the two batch KLs here is a simplification.)
    kl_ref = 0.5 * (chosen_kl + rejected_kl).detach()

    # Loss-aversion weights (both 1.0 by default, as in the KTO paper;
    # raise lambda_u to penalize bad outputs more than good ones are rewarded)
    lambda_d = 1.0   # Weight for desirable (chosen) responses
    lambda_u = 1.0   # Weight for undesirable (rejected) responses

    chosen_loss = 1 - torch.sigmoid(beta * (chosen_logratios - kl_ref))
    rejected_loss = 1 - torch.sigmoid(beta * (kl_ref - rejected_logratios))

    loss = (lambda_d * chosen_loss.mean() + lambda_u * rejected_loss.mean()) / 2
    return loss

When KTO Beats DPO

KTO excels when you have abundant thumbs-up/thumbs-down data (easy to collect from users) but lack paired comparisons (expensive to annotate, since annotators must compare two responses to the same prompt). The KTO paper reports quality matching or exceeding DPO despite the weaker, unpaired signal.
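Data preparation reflects this directly. A minimal sketch (the feedback log below is hypothetical): no pairing step is needed, just a split into desirable and undesirable examples, which may be unbalanced.

```python
# Hypothetical feedback log: (prompt, response, thumbs_up) triples
feedback = [
    ("How do I sort a list?", "Use sorted(xs).", True),
    ("How do I sort a list?", "You can't.", False),
    ("What is 2+2?", "4.", True),
]

# KTO needs no pairing: split into independent good / bad examples.
desirable = [(p, r) for p, r, ok in feedback if ok]
undesirable = [(p, r) for p, r, ok in feedback if not ok]
print(len(desirable), len(undesirable))  # 2 1
```

Each subset is then scored under the policy and reference models to produce the chosen/rejected log-prob batches that `kto_loss` consumes.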

ORPO: Odds Ratio Preference Optimization

ORPO goes further: no reference model needed at all. It combines SFT and alignment into a single training phase by adding a preference term to the standard cross-entropy loss.

def orpo_loss(
    policy_chosen_logps,    # length-normalized log P(y_w | x) under policy
    policy_rejected_logps,  # length-normalized log P(y_l | x) under policy
    sft_loss,               # Standard cross-entropy loss on chosen responses
    lambda_orpo=0.1,
):
    """
    Odds Ratio Preference Optimization.
    No reference model needed. Combines SFT + alignment.
    """
    # Odds: P / (1 - P); in log space: log_odds = logP - log(1 - exp(logP)).
    # logps should be per-token averages (as in the ORPO paper): raw sequence
    # sums make exp(logP) underflow to 0 and the correction term vanish.
    # -expm1(logP) computes 1 - exp(logP) stably.
    chosen_odds = policy_chosen_logps - torch.log(-torch.expm1(policy_chosen_logps))
    rejected_odds = policy_rejected_logps - torch.log(-torch.expm1(policy_rejected_logps))

    # Log odds ratio: how much more likely is the chosen vs rejected?
    log_odds_ratio = chosen_odds - rejected_odds

    # ORPO loss: SFT loss + preference alignment
    preference_loss = -F.logsigmoid(log_odds_ratio).mean()
    loss = sft_loss + lambda_orpo * preference_loss

    return loss
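A quick sanity check of the odds arithmetic, assuming length-normalized log-probs (the values below are illustrative):

```python
import torch

def log_odds(logps):
    # log(p / (1 - p)); -expm1(x) = 1 - e^x, computed stably for x near 0
    return logps - torch.log(-torch.expm1(logps))

# Per-token average log-probs for a chosen and a rejected response
chosen_logps = torch.tensor([-0.5])    # p ~ 0.61 per token
rejected_logps = torch.tensor([-2.0])  # p ~ 0.14 per token

ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
print(ratio)  # positive: the chosen response has higher odds
```

A positive log-odds ratio means `-logsigmoid(ratio)` is small, so the preference term only pushes hard when the model still prefers the rejected response.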

Alignment Method Comparison

| Method | Requires Reference Model? | Requires Paired Data? | Separate Alignment Phase? | Memory (70B model) |
|---|---|---|---|---|
| RLHF (PPO) | Yes (+ reward model) | Yes (for reward model) | Yes | 4x model size |
| DPO | Yes (frozen SFT) | Yes (preference pairs) | Yes | 2x model size |
| KTO | Yes (frozen SFT) | No (unpaired labels) | Yes | 2x model size |
| ORPO | No | Yes (preference pairs) | No (combined with SFT) | 1x model size |
| SimPO | No | Yes | Yes | 1x model size |

GPU Memory: Alignment Method Comparison (70B Model)

In FP16, each copy of a 70B model takes roughly 140 GB, so: RLHF (PPO) 4x = 560 GB; DPO 2x = 280 GB; KTO 2x = 280 GB; ORPO 1x = 140 GB.

Reviewer Agent Validation

Challenge: Implement the DPO loss given batch tensors of log-probabilities. Verify that when the policy assigns higher probability to the chosen response (relative to reference) than to the rejected response, the loss is low.

Expected test:

# Chosen response: policy likes it more than reference
policy_chosen = torch.tensor([-1.0])   # log P(y_w) under policy
ref_chosen = torch.tensor([-2.0])      # log P(y_w) under reference
# Rejected response: policy likes it less than reference
policy_rejected = torch.tensor([-3.0]) # log P(y_l) under policy
ref_rejected = torch.tensor([-2.5])    # log P(y_l) under reference

loss, metrics = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
# chosen_logratio = -1.0 - (-2.0) = 1.0 (policy prefers chosen more)
# rejected_logratio = -3.0 - (-2.5) = -0.5 (policy prefers rejected less)
# logits = 0.1 * (1.0 - (-0.5)) = 0.1 * 1.5 = 0.15
# loss = -log(sigmoid(0.15)) = -log(0.537) ≈ 0.621
# accuracy = 1.0 (logits > 0, correct preference)
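The expected numbers can be checked directly, recomputing the loss from the definition without the `dpo_loss` helper:

```python
import torch
import torch.nn.functional as F

beta = 0.1
chosen_logratio = torch.tensor(-1.0) - torch.tensor(-2.0)    # 1.0
rejected_logratio = torch.tensor(-3.0) - torch.tensor(-2.5)  # -0.5
logits = beta * (chosen_logratio - rejected_logratio)        # 0.15
loss = -F.logsigmoid(logits)
print(round(loss.item(), 3))  # 0.621
```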