RLHF (Reinforcement Learning from Human Feedback) was the original method for aligning LLMs: train a reward model on human preferences, then use PPO to optimize the policy against the reward model. It works but requires: (1) a separate reward model (same size as the policy = 2x GPU memory), (2) PPO infrastructure (actor, critic, reference model, reward model = 4 copies of similar-sized models), (3) careful hyperparameter tuning to prevent reward hacking.

DPO (Direct Preference Optimization) showed in 2023 that you can skip the reward model entirely. Since then, the field has produced KTO, ORPO, IPO, and others — each with different tradeoffs. This post implements the three most important: DPO, KTO, and ORPO.

DPO: Direct Preference Optimization

Theorem: DPO Loss Function

Given preference pairs $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \left(\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)\right]$$

where $\beta$ controls the deviation from the reference policy, $\sigma$ is the sigmoid function, and $\pi_{\text{ref}}$ is the frozen reference model (the SFT checkpoint).

The key insight: the optimal policy under RLHF with a Bradley-Terry reward model has a closed-form solution that depends only on the log-probability ratio between the policy and reference model. No reward model needed. No RL loop needed. Just supervised learning on preference pairs.
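The reparameterization can be sketched in three lines (the standard derivation from the DPO paper, stated informally):

```latex
% The KL-regularized RLHF objective has a closed-form optimum:
\pi^*(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,\exp\!\big(r(x,y)/\beta\big)

% Invert it to express the reward in terms of the policy:
r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

% Substitute into the Bradley-Terry preference model:
p(y_w \succ y_l \mid x) = \sigma\big(r(x,y_w) - r(x,y_l)\big)
% The intractable \log Z(x) term cancels, leaving the DPO loss above.
```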

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps,    # log P(y_w | x) under current policy
    policy_rejected_logps,  # log P(y_l | x) under current policy
    ref_chosen_logps,       # log P(y_w | x) under reference (frozen SFT)
    ref_rejected_logps,     # log P(y_l | x) under reference
    beta=0.1,
):
    """
    Direct Preference Optimization loss.

    Args:
        policy_chosen_logps: [B] log-probs of chosen responses under policy
        policy_rejected_logps: [B] log-probs of rejected responses under policy
        ref_chosen_logps: [B] log-probs of chosen responses under reference
        ref_rejected_logps: [B] log-probs of rejected responses under reference
        beta: temperature parameter (higher = stay closer to reference)

    Returns:
        loss: scalar, the DPO loss to minimize
        metrics: dict with "reward_margin" and "accuracy" diagnostics
    """
    # Log-ratio: how much policy diverges from reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # DPO: maximize the gap between chosen and rejected log-ratios
    logits = beta * (chosen_logratios - rejected_logratios)

    # Binary cross-entropy: we want sigmoid(logits) to be close to 1
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics
    with torch.no_grad():
        chosen_rewards = beta * chosen_logratios
        rejected_rewards = beta * rejected_logratios
        reward_margin = (chosen_rewards - rejected_rewards).mean()
        accuracy = (logits > 0).float().mean()

    return loss, {
        "reward_margin": reward_margin.item(),
        "accuracy": accuracy.item(),
    }
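The loss above takes per-sequence log-probs as inputs. One common way to compute them from model logits (a sketch, assuming logits are already shifted to align with labels and `mask` marks the response tokens) is to gather per-token log-probs and sum over the mask:

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, mask):
    """
    Sum of per-token log-probs of `labels` under `logits`.

    logits: [B, T, V] model outputs, aligned with labels
    labels: [B, T] token ids of the response
    mask:   [B, T] 1.0 for response tokens, 0.0 for prompt/padding
    Returns: [B] log P(y | x) per sequence
    """
    logps = F.log_softmax(logits, dim=-1)                                   # [B, T, V]
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)  # [B, T]
    return (token_logps * mask).sum(-1)

# Sanity check with a near-deterministic distribution
logits = torch.full((1, 2, 4), -100.0)
logits[0, :, 3] = 0.0          # token 3 has probability ~1 at both positions
labels = torch.tensor([[3, 3]])
mask = torch.ones(1, 2)
print(sequence_logps(logits, labels, mask))  # ~tensor([0.])
```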

Why $\beta$ matters: $\beta = 0.1$ is standard. Lower $\beta$ allows the policy to deviate more from the reference (potentially higher quality, but a greater risk of reward hacking). Higher $\beta$ constrains the policy (safer, but it may under-optimize).
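One way to see the effect: for a fixed positive margin between chosen and rejected log-ratios, a higher $\beta$ saturates the loss sooner, so the policy needs to move less from the reference to drive the loss down. A quick standalone check:

```python
import torch
import torch.nn.functional as F

# Fixed margin: chosen_logratio - rejected_logratio
margin = torch.tensor(1.5)

losses = []
for beta in (0.01, 0.1, 0.5):
    loss = -F.logsigmoid(beta * margin).item()
    losses.append(loss)
    print(f"beta={beta}: loss={loss:.3f}")
```

The same log-ratio gap yields a much smaller loss at $\beta = 0.5$ than at $\beta = 0.01$.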

KTO: Kahneman-Tversky Optimization

DPO requires paired preferences: “response A is better than response B for the same prompt.” KTO works with unpaired data: “this response is good” or “this response is bad” — independent labels, not comparisons.

def kto_loss(
    policy_chosen_logps,    # log P(y | x) for good responses
    policy_rejected_logps,  # log P(y | x) for bad responses
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,
):
    """
    Kahneman-Tversky Optimization loss.
    Works with unpaired good/bad labels instead of preference pairs.
    """
    # Batch estimates of the policy's KL from the reference
    chosen_kl = (policy_chosen_logps - ref_chosen_logps).mean()
    rejected_kl = (policy_rejected_logps - ref_rejected_logps).mean()

    # KTO: maximize chosen log-ratio, minimize rejected log-ratio
    # with asymmetric weighting (loss aversion from Kahneman-Tversky)
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Reference point: average KL, detached so it acts as a fixed baseline.
    # (The KTO paper estimates this KL from mismatched prompt/completion
    # pairs; averaging the two batch KLs here is a simplification.)
    kl_ref = 0.5 * (chosen_kl + rejected_kl).detach()

    # Loss-aversion weights (both 1.0 by default, as in the KTO paper;
    # raise lambda_u to penalize bad outputs more than good ones are rewarded)
    lambda_d = 1.0   # Weight for desirable (chosen) responses
    lambda_u = 1.0   # Weight for undesirable (rejected) responses

    chosen_loss = 1 - torch.sigmoid(beta * (chosen_logratios - kl_ref))
    rejected_loss = 1 - torch.sigmoid(beta * (kl_ref - rejected_logratios))

    loss = (lambda_d * chosen_loss.mean() + lambda_u * rejected_loss.mean()) / 2
    return loss

When KTO Beats DPO

KTO excels when you have abundant thumbs-up/thumbs-down data (easy to collect from users) but lack paired comparisons (expensive to annotate, since annotators must compare two responses to the same prompt). The KTO paper reports quality matching or exceeding DPO despite the weaker, unpaired signal.
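Data preparation reflects this directly. A minimal sketch (the feedback log below is hypothetical): no pairing step is needed, just a split into desirable and undesirable examples, which may be unbalanced.

```python
# Hypothetical feedback log: (prompt, response, thumbs_up) triples
feedback = [
    ("How do I sort a list?", "Use sorted(xs).", True),
    ("How do I sort a list?", "You can't.", False),
    ("What is 2+2?", "4.", True),
]

# KTO needs no pairing: split into independent good / bad examples.
desirable = [(p, r) for p, r, ok in feedback if ok]
undesirable = [(p, r) for p, r, ok in feedback if not ok]
print(len(desirable), len(undesirable))  # 2 1
```

Each subset is then scored under the policy and reference models to produce the chosen/rejected log-prob batches that `kto_loss` consumes.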

ORPO: Odds Ratio Preference Optimization

ORPO goes further: no reference model needed at all. It combines SFT and alignment into a single training phase by adding a preference term to the standard cross-entropy loss.

def orpo_loss(
    policy_chosen_logps,    # length-normalized log P(y_w | x) under policy
    policy_rejected_logps,  # length-normalized log P(y_l | x) under policy
    sft_loss,               # Standard cross-entropy loss on chosen responses
    lambda_orpo=0.1,
):
    """
    Odds Ratio Preference Optimization.
    No reference model needed. Combines SFT + alignment.
    """
    # Odds: P / (1 - P); in log space: log_odds = logP - log(1 - exp(logP)).
    # logps should be per-token averages (as in the ORPO paper): raw sequence
    # sums make exp(logP) underflow to 0 and the correction term vanish.
    # -expm1(logP) computes 1 - exp(logP) stably.
    chosen_odds = policy_chosen_logps - torch.log(-torch.expm1(policy_chosen_logps))
    rejected_odds = policy_rejected_logps - torch.log(-torch.expm1(policy_rejected_logps))

    # Log odds ratio: how much more likely is the chosen vs rejected?
    log_odds_ratio = chosen_odds - rejected_odds

    # ORPO loss: SFT loss + preference alignment
    preference_loss = -F.logsigmoid(log_odds_ratio).mean()
    loss = sft_loss + lambda_orpo * preference_loss

    return loss
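A quick sanity check of the odds arithmetic, assuming length-normalized log-probs (the values below are illustrative):

```python
import torch

def log_odds(logps):
    # log(p / (1 - p)); -expm1(x) = 1 - e^x, computed stably for x near 0
    return logps - torch.log(-torch.expm1(logps))

# Per-token average log-probs for a chosen and a rejected response
chosen_logps = torch.tensor([-0.5])    # p ~ 0.61 per token
rejected_logps = torch.tensor([-2.0])  # p ~ 0.14 per token

ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
print(ratio)  # positive: the chosen response has higher odds
```

A positive log-odds ratio means `-logsigmoid(ratio)` is small, so the preference term only pushes hard when the model still prefers the rejected response.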

Alignment Method Comparison

| Method | Requires Reference Model? | Requires Paired Data? | Separate Alignment Phase? | Memory (70B model) |
|---|---|---|---|---|
| RLHF (PPO) | Yes (+ reward model) | Yes (for reward model) | Yes | 4x model size |
| DPO | Yes (frozen SFT) | Yes (preference pairs) | Yes | 2x model size |
| KTO | Yes (frozen SFT) | No (unpaired labels) | Yes | 2x model size |
| ORPO | No | Yes (preference pairs) | No (combined with SFT) | 1x model size |
| SimPO | No | Yes | Yes | 1x model size |

GPU Memory: Alignment Method Comparison (70B Model)

In FP16, each copy of a 70B model takes roughly 140 GB, so: RLHF (PPO) 4x = 560 GB; DPO 2x = 280 GB; KTO 2x = 280 GB; ORPO 1x = 140 GB.

Reviewer Agent Validation

Challenge: Implement the DPO loss given batch tensors of log-probabilities. Verify that when the policy assigns higher probability to the chosen response (relative to reference) than to the rejected response, the loss is low.

Expected test:

# Chosen response: policy likes it more than reference
policy_chosen = torch.tensor([-1.0])   # log P(y_w) under policy
ref_chosen = torch.tensor([-2.0])      # log P(y_w) under reference
# Rejected response: policy likes it less than reference
policy_rejected = torch.tensor([-3.0]) # log P(y_l) under policy
ref_rejected = torch.tensor([-2.5])    # log P(y_l) under reference

loss, metrics = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
# chosen_logratio = -1.0 - (-2.0) = 1.0 (policy prefers chosen more)
# rejected_logratio = -3.0 - (-2.5) = -0.5 (policy prefers rejected less)
# logits = 0.1 * (1.0 - (-0.5)) = 0.1 * 1.5 = 0.15
# loss = -log(sigmoid(0.15)) = -log(0.537) ≈ 0.621
# accuracy = 1.0 (logits > 0, correct preference)
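The expected numbers can be checked directly, recomputing the loss from the definition without the `dpo_loss` helper:

```python
import torch
import torch.nn.functional as F

beta = 0.1
chosen_logratio = torch.tensor(-1.0) - torch.tensor(-2.0)    # 1.0
rejected_logratio = torch.tensor(-3.0) - torch.tensor(-2.5)  # -0.5
logits = beta * (chosen_logratio - rejected_logratio)        # 0.15
loss = -F.logsigmoid(logits)
print(round(loss.item(), 3))  # 0.621
```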