RLHF works when human annotators can reliably judge which response is better. On simple tasks (summarization, basic Q&A), annotators agree 75-85% of the time. On complex tasks (advanced mathematics, novel scientific reasoning, long-term strategic planning), annotator agreement drops to 50-60% — barely above random chance. As models become more capable than their human supervisors in specific domains, the supervision signal degrades. This is the scalable oversight problem.
Three research directions address this. Weak-to-strong generalization (Burns et al., 2023) studies whether a strong model can learn good behavior from a weaker supervisor. Constitutional AI (Bai et al., 2022) replaces human preference labels with principles that the model evaluates itself. Debate (Irving et al., 2018) has two model instances argue opposing positions so that a human judge can evaluate the arguments rather than the underlying task. Each approach trades off different failure modes.
This post covers the technical details of each approach, their implementations, known limitations, and open research questions.
## The Scalable Oversight Problem

### Formalizing the Problem
```python
from dataclasses import dataclass

@dataclass
class OversightCapability:
    """Model of human oversight capability in one domain."""
    domain: str
    human_accuracy: float    # Probability a human evaluator judges correctly
    model_capability: float  # Model's actual capability level
    oversight_gap: float     # model_capability - human_accuracy

    @property
    def is_supervisable(self) -> bool:
        """Can humans reliably supervise this capability?"""
        return self.human_accuracy > 0.7  # Threshold for reliable oversight

OVERSIGHT_LANDSCAPE = [
    OversightCapability("basic_qa", 0.95, 0.92, -0.03),
    OversightCapability("summarization", 0.85, 0.88, 0.03),
    OversightCapability("code_review_simple", 0.80, 0.85, 0.05),
    OversightCapability("mathematical_proof", 0.65, 0.78, 0.13),
    OversightCapability("scientific_reasoning", 0.60, 0.75, 0.15),
    OversightCapability("code_review_complex", 0.55, 0.80, 0.25),
    OversightCapability("long_horizon_planning", 0.50, 0.70, 0.20),
    OversightCapability("novel_research", 0.45, 0.65, 0.20),
]
```
The Oversight Gap: Human vs Model Capability
| Metric | Basic QA | Summarization | Simple Code | Math Proof | Science | Complex Code | Planning | Research |
|---|---|---|---|---|---|---|---|---|
| Human Evaluator Accuracy | 0.95 | 0.85 | 0.80 | 0.65 | 0.60 | 0.55 | 0.50 | 0.45 |
| Model Capability | 0.92 | 0.88 | 0.85 | 0.78 | 0.75 | 0.80 | 0.70 | 0.65 |
### Why This Gets Worse Over Time
```python
def project_oversight_gap(
    current_model_capability,
    capability_growth_rate=0.15,  # 15% per year
    human_improvement_rate=0.02,  # 2% per year (tool-assisted)
    years=5,
):
    """
    Project the oversight gap forward in time.

    Models improve through scaling and algorithmic progress.
    Human evaluation improves slowly through better tools.
    """
    projections = []
    model_cap = current_model_capability
    human_cap = 0.60  # Current average human evaluator accuracy
    for year in range(years + 1):
        gap = model_cap - human_cap
        projections.append({
            'year': 2025 + year,
            'model_capability': model_cap,
            'human_capability': human_cap,
            'oversight_gap': gap,
            'supervisable': human_cap > 0.7,
        })
        model_cap = min(0.99, model_cap * (1 + capability_growth_rate))
        human_cap = min(0.95, human_cap * (1 + human_improvement_rate))
    return projections
```
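Under these growth-rate assumptions, the human side of the gap closes very slowly. The helper below (a standalone restatement of the same recurrence, with hypothetical parameter names) counts how many years of 2% tool-assisted improvement it takes a 0.60-accuracy evaluator to cross the 0.7 reliability threshold:

```python
def years_until_supervisable(human_cap=0.60, rate=0.02, threshold=0.7):
    """Iterate the projection's human-improvement recurrence until the
    reliable-oversight threshold is crossed."""
    years = 0
    while human_cap <= threshold:
        human_cap *= 1 + rate
        years += 1
    return years

print(years_until_supervisable())  # 8 years, i.e. around 2033 on the post's timeline
```

Meanwhile, model capability compounding at 15% per year widens the gap far faster over that same window.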
## Weak-to-Strong Generalization

### The Setup
Burns et al. (2023) at OpenAI studied a concrete version of the oversight problem: can a strong model learn correct behavior from a weak supervisor? They used a smaller, less capable model as the “supervisor” and a larger model as the “student.”
```python
import torch
import torch.nn as nn

class WeakToStrongExperiment:
    """
    Weak-to-strong generalization experiment.

    Setup:
    1. Train a weak model (supervisor) on ground truth labels
    2. Use the weak model to label data for the strong model
    3. Train the strong model on weak labels
    4. Measure: does the strong model exceed the weak supervisor?
    """

    def __init__(self, weak_model, strong_model, tokenizer):
        self.weak_model = weak_model
        self.strong_model = strong_model
        self.tokenizer = tokenizer

    def generate_weak_labels(self, examples):
        """
        Step 1: Generate labels from the weak supervisor.

        These labels contain errors proportional to the weak model's capability.
        """
        self.weak_model.eval()
        weak_labels = []
        for example in examples:
            inputs = self.tokenizer(
                example['text'], return_tensors="pt", truncation=True,
            ).to(self.weak_model.device)
            with torch.no_grad():
                outputs = self.weak_model(**inputs)
                logits = outputs.logits[:, -1, :]
                probs = torch.softmax(logits, dim=-1)
            # Weak model's prediction and its confidence in that prediction
            prediction = probs.argmax(dim=-1).item()
            confidence = probs.max(dim=-1).values.item()
            weak_labels.append({
                'text': example['text'],
                'weak_label': prediction,
                'weak_confidence': confidence,
                'true_label': example['label'],
            })
        return weak_labels

    def train_strong_on_weak_labels(self, weak_labeled_data, epochs=5,
                                    lr=1e-5, method="naive"):
        """
        Step 2: Train the strong model using weak labels.

        Methods:
        - naive: train directly on weak labels
        - confidence_weighted: weight the loss by weak model confidence
        - auxiliary_loss: add an auxiliary confidence objective
        """
        optimizer = torch.optim.AdamW(self.strong_model.parameters(), lr=lr)
        self.strong_model.train()
        for epoch in range(epochs):
            total_loss = 0.0
            for item in weak_labeled_data:
                inputs = self.tokenizer(
                    item['text'], return_tensors="pt", truncation=True,
                ).to(self.strong_model.device)
                outputs = self.strong_model(**inputs)
                logits = outputs.logits[:, -1, :]
                target = torch.tensor(
                    [item['weak_label']]
                ).to(self.strong_model.device)
                if method == "naive":
                    loss = nn.CrossEntropyLoss()(logits, target)
                elif method == "confidence_weighted":
                    base_loss = nn.CrossEntropyLoss()(logits, target)
                    loss = base_loss * item['weak_confidence']
                elif method == "auxiliary_loss":
                    supervised_loss = nn.CrossEntropyLoss()(logits, target)
                    # Auxiliary confidence objective: penalize entropy so the
                    # strong model commits to its own confident predictions
                    # instead of mimicking noise in the weak labels
                    probs = torch.softmax(logits, dim=-1)
                    entropy = -(probs * torch.log(probs + 1e-8)).sum()
                    loss = supervised_loss + 0.1 * entropy  # low entropy = high confidence
                else:
                    raise ValueError(f"Unknown method: {method}")
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                total_loss += loss.item()
            print(f"  Epoch {epoch + 1}: loss = {total_loss / len(weak_labeled_data):.4f}")

    def evaluate(self, test_data):
        """Evaluate the strong model against ground truth."""
        self.strong_model.eval()
        correct = 0
        for item in test_data:
            inputs = self.tokenizer(
                item['text'], return_tensors="pt", truncation=True,
            ).to(self.strong_model.device)
            with torch.no_grad():
                outputs = self.strong_model(**inputs)
                logits = outputs.logits[:, -1, :]
            prediction = logits.argmax(dim=-1).item()
            if prediction == item['label']:
                correct += 1
        return correct / len(test_data)

    def compute_performance_gap_recovered(
        self, weak_accuracy, strong_accuracy, ceiling_accuracy
    ):
        """
        Compute the Performance Gap Recovered (PGR) metric.

        PGR = (strong_from_weak - weak) / (strong_ceiling - weak)

        PGR = 0: the strong model only matches its weak supervisor
        PGR = 1: the strong model fully recovers to its own ceiling
        """
        if ceiling_accuracy == weak_accuracy:
            return 0.0
        return (strong_accuracy - weak_accuracy) / (ceiling_accuracy - weak_accuracy)
```
Weak-to-Strong Generalization Results (Burns et al., 2023)
| Task | Weak Model Acc | Strong (Naive) Acc | Strong (Auxiliary) Acc | Strong Ceiling | PGR |
|---|---|---|---|---|---|
| NLP Classification | 78.5% | 82.1% | 85.3% | 90.2% | 0.58 |
| Reward Modeling | 65.3% | 68.7% | 72.1% | 79.8% | 0.47 |
| Chess Puzzles | 55.2% | 61.8% | 68.5% | 82.3% | 0.49 |
| Code Correctness | 72.1% | 76.5% | 80.2% | 88.7% | 0.49 |
The key finding from weak-to-strong generalization research: strong models do generalize beyond their weak supervisors, but only partially. The Performance Gap Recovered is typically 0.4-0.6, meaning roughly half of the strong model’s potential capability is left on the table. This suggests that current alignment techniques, when scaled, will leave significant model capability unaligned — a concerning gap for safety.
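As a sanity check, the PGR column of the table above can be reproduced from its other columns. Here is a minimal standalone version of `compute_performance_gap_recovered`, applied to the auxiliary-loss accuracies:

```python
def pgr(weak, strong_from_weak, ceiling):
    """Performance Gap Recovered: fraction of the weak-to-ceiling gap closed."""
    return (strong_from_weak - weak) / (ceiling - weak)

print(round(pgr(78.5, 85.3, 90.2), 2))  # 0.58  NLP classification
print(round(pgr(65.3, 72.1, 79.8), 2))  # 0.47  reward modeling
print(round(pgr(55.2, 68.5, 82.3), 2))  # 0.49  chess puzzles
```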
## Constitutional AI

### Principles-Based Self-Supervision
Constitutional AI replaces human preference labels with a set of principles (“the constitution”) that the model uses to critique and revise its own outputs.
```python
import random

import torch

class ConstitutionalAI:
    """
    Constitutional AI: align models using principles, not human labels.

    Three stages:
    1. Generate: model produces responses
    2. Critique: model evaluates responses against principles
    3. Revise: model improves responses based on critique
    """

    DEFAULT_CONSTITUTION = [
        "Choose the response that is most helpful, honest, and harmless.",
        "Choose the response that is less likely to cause harm to the user or others.",
        "Choose the response that does not encourage illegal activities.",
        "Choose the response that is more factually accurate.",
        "Choose the response that is less biased and relies less on stereotypes.",
        "Choose the response that is more respectful and considerate.",
        "Choose the response that does not reveal private information.",
        "Choose the response that is more nuanced and less absolutist.",
    ]

    def __init__(self, model, tokenizer, constitution):
        """
        Args:
            model: The LLM to align
            tokenizer: Model's tokenizer
            constitution: List of principles
        """
        self.model = model
        self.tokenizer = tokenizer
        self.constitution = constitution

    def _generate(self, prompt, max_new_tokens, temperature):
        """Shared generation helper: tokenize, sample, decode the completion."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=True,
            )
        return self.tokenizer.decode(
            output[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True,
        )

    def generate_initial_response(self, prompt, temperature=0.8):
        """Stage 1: Generate an initial (potentially harmful) response."""
        return self._generate(prompt, max_new_tokens=512, temperature=temperature)

    def critique_response(self, prompt, response, principle):
        """Stage 2: Model critiques its own response against a principle."""
        critique_prompt = (
            f"Human: {prompt}\n\n"
            f"Assistant: {response}\n\n"
            f"Critique the assistant's response according to this principle:\n"
            f"\"{principle}\"\n\n"
            f"Identify specific problems with the response:\n"
        )
        return self._generate(critique_prompt, max_new_tokens=256, temperature=0.3)

    def revise_response(self, prompt, response, critique):
        """Stage 3: Model revises its response based on the critique."""
        revision_prompt = (
            f"Human: {prompt}\n\n"
            f"Assistant's initial response: {response}\n\n"
            f"Critique: {critique}\n\n"
            f"Please write a revised response that addresses the critique "
            f"while remaining helpful:\n"
        )
        return self._generate(revision_prompt, max_new_tokens=512, temperature=0.3)

    def run_cai_pipeline(self, prompt, num_revisions=3):
        """Run the full CAI pipeline: generate, critique, revise (repeat)."""
        response = self.generate_initial_response(prompt)
        history = [{'response': response, 'critique': None}]
        for i in range(num_revisions):
            # Cycle through the constitution, one principle per revision
            principle = self.constitution[i % len(self.constitution)]
            critique = self.critique_response(prompt, response, principle)
            revised = self.revise_response(prompt, response, critique)
            history.append({
                'response': revised,
                'critique': critique,
                'principle': principle,
            })
            response = revised
        return response, history

    def generate_preference_pairs(self, prompts, num_pairs_per_prompt=4):
        """
        Generate preference pairs using constitutional principles.

        These replace human preference annotations for RLHF.
        """
        preference_pairs = []
        for prompt in prompts:
            for _ in range(num_pairs_per_prompt):
                # Sample two responses at high temperature for diversity
                response_a = self.generate_initial_response(prompt, temperature=0.9)
                response_b = self.generate_initial_response(prompt, temperature=0.9)
                # Use constitutional principles to judge which is better
                preference = self._judge_pair(prompt, response_a, response_b)
                preference_pairs.append({
                    'prompt': prompt,
                    'response_a': response_a,
                    'response_b': response_b,
                    'preference': preference,
                })
        return preference_pairs

    def _judge_pair(self, prompt, response_a, response_b):
        """Judge which response is better under a randomly drawn principle."""
        principle = random.choice(self.constitution)
        judge_prompt = (
            f"Human: {prompt}\n\n"
            f"Response A: {response_a}\n\n"
            f"Response B: {response_b}\n\n"
            f"Principle: \"{principle}\"\n\n"
            f"According to this principle, which response is better? "
            f"Answer with just 'A' or 'B':\n"
        )
        answer = self._generate(judge_prompt, max_new_tokens=10, temperature=0.1).strip()
        # Fall back to 'B' when the judge's answer does not contain 'A'
        return 'A' if 'A' in answer else 'B'
```
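One caveat worth flagging: single-pass A/B judging like `_judge_pair` is known to exhibit position bias, where an LLM judge favors whichever response appears first. A common mitigation, sketched here as a hypothetical wrapper rather than part of the pipeline above, is to query the judge in both presentation orders and keep only order-consistent verdicts:

```python
def debias_judgment(judge_fn, prompt, resp_a, resp_b):
    """Query a judge in both presentation orders; return the verdict only
    if it is order-consistent, else None (discard the pair)."""
    first = judge_fn(prompt, resp_a, resp_b)   # 'A' here means resp_a wins
    second = judge_fn(prompt, resp_b, resp_a)  # 'A' here means resp_b wins
    second = 'B' if second == 'A' else 'A'     # map back to the original order
    return first if first == second else None

# A judge with pure position bias always answers 'A'; the wrapper catches it:
biased_judge = lambda p, a, b: 'A'
print(debias_judgment(biased_judge, "q", "resp1", "resp2"))  # None
```

Discarded pairs cost extra generation budget, but they keep systematically mislabeled pairs out of the AI-feedback training set.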
### RLAIF: RL from AI Feedback
```python
class RLAIFTrainer:
    """
    RLAIF: replace human preference labels with AI-generated labels.

    Pipeline:
    1. Generate preference pairs
    2. Label with constitutional principles (AI feedback)
    3. Train a reward model on the AI labels
    4. Run PPO/DPO with the AI-trained reward model
    """

    def __init__(self, policy_model, constitution_model, tokenizer, reward_model):
        self.policy = policy_model
        self.constitution = constitution_model
        self.tokenizer = tokenizer  # shared by the constitution model
        self.reward = reward_model

    def train_reward_model_from_ai_labels(self, prompts, num_pairs=50000):
        """Train a reward model using AI-generated preference labels."""
        cai = ConstitutionalAI(
            self.constitution,
            self.tokenizer,
            ConstitutionalAI.DEFAULT_CONSTITUTION,
        )
        # Generate labeled preference pairs
        pairs = cai.generate_preference_pairs(
            prompts, num_pairs_per_prompt=num_pairs // len(prompts)
        )
        # Reshape into (chosen, rejected) pairs for reward model training
        training_data = []
        for pair in pairs:
            if pair['preference'] == 'A':
                chosen, rejected = pair['response_a'], pair['response_b']
            else:
                chosen, rejected = pair['response_b'], pair['response_a']
            training_data.append({
                'prompt': pair['prompt'],
                'chosen': chosen,
                'rejected': rejected,
            })
        self._train_reward_model(training_data)
        return training_data

    def _train_reward_model(self, training_data):
        """Train the reward model with a Bradley-Terry loss."""
        # Implementation follows standard reward model training
        # See RLHF data post for details
        pass
```
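The stubbed `_train_reward_model` would minimize the standard Bradley-Terry objective, `-log sigmoid(r_chosen - r_rejected)`. A minimal scalar sketch of that loss (a real implementation would batch this over torch tensors):

```python
import math

def bradley_terry_loss(pairs):
    """Average pairwise loss -log(sigmoid(r_chosen - r_rejected)) over
    (chosen_reward, rejected_reward) pairs."""
    return -sum(
        math.log(1.0 / (1.0 + math.exp(-(c - r)))) for c, r in pairs
    ) / len(pairs)

# A reward model that scores chosen above rejected gets low loss;
# an inverted one is penalized:
print(round(bradley_terry_loss([(2.0, 0.0)]), 3))  # 0.127
print(round(bradley_terry_loss([(0.0, 2.0)]), 3))  # 2.127
```

Minimizing this loss pushes the reward margin between chosen and rejected responses upward, which is all the downstream PPO/DPO step needs.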
## Debate as Alignment

### The Debate Framework
```python
import torch

class DebateAlignment:
    """
    AI Safety via Debate (Irving et al., 2018).

    Two model instances argue opposing positions.
    A human judge evaluates the arguments, not the underlying task.

    Key insight: even if the human cannot solve the task directly,
    they can often evaluate which argument is more convincing
    when presented with opposing views.
    """

    def __init__(self, model, tokenizer, max_turns=4):
        self.model = model
        self.tokenizer = tokenizer
        self.max_turns = max_turns

    def run_debate(self, question, position_a, position_b):
        """
        Run a multi-turn debate between two positions.

        Returns the debate transcript for human evaluation.
        """
        transcript = []
        for turn in range(self.max_turns):
            # Debater A argues for position_a
            a_prompt = self._build_debater_prompt(
                question, position_a, transcript, "A"
            )
            a_argument = self._generate_argument(a_prompt)
            transcript.append({"speaker": "A", "turn": turn, "text": a_argument})
            # Debater B argues for position_b
            b_prompt = self._build_debater_prompt(
                question, position_b, transcript, "B"
            )
            b_argument = self._generate_argument(b_prompt)
            transcript.append({"speaker": "B", "turn": turn, "text": b_argument})
        return transcript

    def _build_debater_prompt(self, question, position, transcript, speaker):
        """Build prompt for a debater, including debate history."""
        history = "\n".join(
            f"Debater {t['speaker']}: {t['text']}"
            for t in transcript
        )
        prompt = (
            f"Question: {question}\n\n"
            f"You are Debater {speaker}. Your position is: {position}\n\n"
        )
        if history:
            prompt += f"Debate so far:\n{history}\n\n"
        prompt += (
            "Make your next argument. Be specific, cite evidence, "
            "and directly address your opponent's points. "
            "Your argument:\n"
        )
        return prompt

    def _generate_argument(self, prompt):
        """Generate a debate argument."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(
            self.model.device
        )
        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                max_new_tokens=300,
                temperature=0.5,
                do_sample=True,
            )
        return self.tokenizer.decode(
            output[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True,
        )

    def judge_debate(self, transcript, question):
        """
        Present the debate transcript to a judge.

        The judge only reads the arguments, not the underlying evidence.
        """
        formatted = "\n\n".join(
            f"Debater {t['speaker']} (Turn {t['turn'] + 1}):\n{t['text']}"
            for t in transcript
        )
        judge_prompt = (
            f"Question: {question}\n\n"
            f"Two debaters have argued their positions:\n\n"
            f"{formatted}\n\n"
            f"Based solely on the quality of arguments presented, "
            f"which debater made a stronger case? "
            f"Answer 'A' or 'B' and explain your reasoning:\n"
        )
        return judge_prompt  # Present to a human or AI judge
```
The theoretical appeal of debate is that it reduces hard evaluation to easy evaluation: instead of asking “Is this scientific claim correct?” (which requires domain expertise), the human judge asks “Which debater presented more convincing evidence?” The open question is whether this reduction works in practice — whether a skilled debater arguing for the wrong position can consistently fool human judges.
## Comparison of Approaches
Alignment Approaches Comparison
| Approach | Supervision Source | Scalability | Main Risk | Maturity |
|---|---|---|---|---|
| RLHF | Human preferences | Limited by annotator capability | Oversight gap | Production |
| Constitutional AI | Self-critique + principles | Scales with model capability | Principle gaming | Production |
| Weak-to-Strong | Weaker model labels | Scales with model gap | Incomplete recovery (PGR 0.5) | Research |
| Debate | Human judge on arguments | Scales with argument quality | Persuasive wrong arguments | Research |
| IDA (Iterated Distillation) | Amplified human + model | Scales with decomposition | Decomposition failure | Theoretical |
## Open Questions
```python
OPEN_QUESTIONS = {
    "reward_hacking_at_scale": {
        "question": (
            "As models get more capable, they become better at "
            "optimizing reward proxies without satisfying the "
            "intended objective. How do we build reward models "
            "that are robust to optimization by models smarter "
            "than the reward model itself?"
        ),
        "approaches": [
            "Ensemble reward models",
            "Uncertainty-aware reward models",
            "Constitutional constraints as hard limits",
        ],
    },
    "deceptive_alignment": {
        "question": (
            "Could a model learn to behave well during training "
            "and evaluation but pursue different objectives during "
            "deployment? How would we detect this?"
        ),
        "approaches": [
            "Interpretability (mechanistic understanding)",
            "Behavioral testing under distribution shift",
            "Monitoring for capability use patterns",
        ],
    },
    "value_extrapolation": {
        "question": (
            "Human values are underspecified and context-dependent. "
            "How do we train models to generalize human values to "
            "novel situations that humans have not considered?"
        ),
        "approaches": [
            "Principle-based generalization (Constitutional AI)",
            "Active querying of humans in novel situations",
            "Conservative behavior under uncertainty",
        ],
    },
    "oversight_bootstrapping": {
        "question": (
            "If we use AI to help supervise AI, how do we ensure "
            "the supervisory AI is itself aligned? This creates "
            "a recursive dependency."
        ),
        "approaches": [
            "Formal verification of supervisory models",
            "Independent parallel development of supervisors",
            "Sandboxed evaluation environments",
        ],
    },
}
```
The alignment problem does not have a known complete solution. Current production systems use RLHF and Constitutional AI because they work well enough for current model capabilities. As models become more capable than their evaluators, these methods will need to be supplemented by approaches like debate, weak-to-strong generalization, or interpretability-based oversight. The research trajectory suggests that no single approach will suffice — practical alignment will require a combination of multiple overlapping methods, each covering failure modes that the others miss.