The pace of LLM progress between 2023 and 2025 was unprecedented: from GPT-4’s launch to open-weight models matching its performance, from 4K context windows to 1M+, from text-only to natively multimodal. But the solved problems are the easy ones. The remaining problems are harder, more fundamental, and in some cases may require architectural or conceptual breakthroughs that scaling alone cannot provide.
This post catalogs eight open problems that define the research frontier as of early 2026. For each problem, we document: what is currently achieved, what specifically fails, why it is hard, and which research directions show promise. This is not a survey — it is a working document for researchers choosing what to work on.
Reliable Reasoning
The Current State
Frontier models (GPT-4o, Claude 3.5 Sonnet, Llama 3.1 405B) achieve 85-92% on standard reasoning benchmarks (GSM8K, MATH, ARC-Challenge). Reasoning-specialized models with chain-of-thought and test-time compute (o1, DeepSeek-R1) push further to 95%+ on many benchmarks. This suggests the problem is nearly solved.
It is not. These benchmarks have been saturated by training data contamination, benchmark-specific tuning, and the fact that they test a narrow slice of reasoning. Simple reasoning tasks that any human can solve still defeat frontier models:
Frontier Model Failures on Simple Reasoning
| Task Type | Example | GPT-4o Accuracy | Human Accuracy |
|---|---|---|---|
| Counting | How many 'r's in 'strawberry'? | ~40% | 99% |
| Spatial | If I face north and turn right, which way am I facing? | ~75% | 99% |
| Temporal | If Tuesday is 2 days after event X, when was X? | ~80% | 99% |
| Negation | What is NOT true about X given these 3 facts? | ~65% | 95% |
| Constraint satisfaction | Assign 5 people to 3 rooms with constraints | ~30% | 90% |
Why It Is Hard
The fundamental issue: transformers process tokens sequentially with fixed computation per token (in standard inference). Reasoning problems require variable-depth computation — some conclusions need 2 steps, others need 20. The fixed-depth transformer architecture does not naturally accommodate this.
Chain-of-thought (CoT) partially addresses this by allowing the model to “think out loud” — each intermediate token adds computation. But CoT is unreliable:
```python
import re

def analyze_cot_failure_modes(cot_trace):
    """
    Classify failure modes in chain-of-thought reasoning.
    cot_trace: string containing the model's reasoning steps
    Returns list of detected failure modes.
    """
    failures = []
    steps = cot_trace.split("\n")
    for i, step in enumerate(steps):
        # Failure 1: Phantom steps (stating a conclusion without justification)
        if any(
            phrase in step.lower()
            for phrase in ["therefore", "so we know", "this means"]
        ) and i == 0:
            failures.append({
                "type": "phantom_step",
                "step": i,
                "description": "Conclusion stated without prior reasoning",
            })
        # Failure 2: Arithmetic errors in intermediate steps
        # (Model writes "3 * 7 = 24" mid-chain)
        arith_matches = re.findall(
            r'(\d+)\s*[*x]\s*(\d+)\s*=\s*(\d+)', step
        )
        for a, b, result in arith_matches:
            if int(a) * int(b) != int(result):
                failures.append({
                    "type": "arithmetic_error",
                    "step": i,
                    "description": (
                        f"{a} * {b} = {result} "
                        f"(correct: {int(a) * int(b)})"
                    ),
                })
        # Failure 3: Contradicting a previous step
        if i > 0:
            for prev_step in steps[:i]:
                # Simple negation check
                if (
                    "not " + step.lower().strip()[:30] in prev_step.lower()
                    or step.lower().strip()[:30] + " not" in prev_step.lower()
                ):
                    failures.append({
                        "type": "self_contradiction",
                        "step": i,
                        "description": "Contradicts earlier reasoning",
                    })
    return failures
```
Promising Directions
- Adaptive computation: Allow the model to spend more FLOPs on harder problems. Early exit for easy tokens, iterative refinement for hard ones.
- Verified reasoning: Pair the LLM with a symbolic verifier that checks each reasoning step (e.g., a SAT solver for constraint problems, a calculator for arithmetic).
- Process reward models: Train reward models that evaluate each step of reasoning, not just the final answer. This enables credit assignment — finding which step was wrong.
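The verified-reasoning direction can be illustrated with the simplest possible tool: a calculator that re-checks every arithmetic claim in a chain of thought. This is a minimal sketch, not a production verifier; the function name and regex are ours, and a real system would cover division, multi-term expressions, and symbolic steps.

```python
import re

def verify_arithmetic_steps(cot_trace):
    """
    Check every "a op b = c" claim in a reasoning trace with a calculator.
    Returns (all_valid, errors): the LLM proposes steps, a trusted tool
    checks each one independently.
    """
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    errors = []
    for a, op, b, claimed in re.findall(
        r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)", cot_trace
    ):
        actual = ops[op](int(a), int(b))
        if actual != int(claimed):
            errors.append(f"{a} {op} {b} = {claimed} (correct: {actual})")
    return len(errors) == 0, errors
```

Running this on a trace containing "3 * 7 = 24" flags that step while letting correct steps pass, giving exactly the step-level credit assignment that answer-only checking cannot.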
Hallucination Reduction
The Problem
Hallucination is the generation of confident, fluent text that is factually wrong. This is not a bug — it is a consequence of the training objective. The model is trained to produce text that is likely, not text that is true. When the training data contains multiple contradictory claims about a topic, the model learns a distribution over claims, not the truth.
Hallucination Rates by Task Type (Frontier Models, 2025)
(Figure: hallucination rate, in %, by task type.)

Citation hallucination is the most stark example: when asked to provide references, models fabricate plausible-looking citations (real author names, realistic journal names, plausible years) that do not exist. The model has learned the format of citations but not the mapping from claims to actual papers.
Why It Is Hard
- No internal fact-checking: The model generates the next token based on likelihood, not veracity. It has no mechanism to look up whether a claim is true.
- Training data is contradictory: The web contains both correct and incorrect information about most topics. The model learns the mixture.
- Confidence is not calibrated: The model’s probability distribution over next tokens does not correlate well with factual accuracy.
```python
import math

def compute_calibration_error(predictions, confidences, labels):
    """
    Compute expected calibration error (ECE) for model predictions.
    Lower ECE = better calibrated.
    predictions: list of predicted answers
    confidences: list of model confidence scores (0-1)
    labels: list of correct answers
    """
    num_bins = 10
    bin_boundaries = [i / num_bins for i in range(num_bins + 1)]
    bin_correct = [0] * num_bins
    bin_confidence = [0.0] * num_bins
    bin_count = [0] * num_bins
    for pred, conf, label in zip(predictions, confidences, labels):
        # Which bin does this confidence fall into?
        bin_idx = min(int(conf * num_bins), num_bins - 1)
        bin_count[bin_idx] += 1
        bin_confidence[bin_idx] += conf
        if pred == label:
            bin_correct[bin_idx] += 1
    # ECE = weighted average of |accuracy - confidence| per bin
    ece = 0.0
    total = sum(bin_count)
    for i in range(num_bins):
        if bin_count[i] == 0:
            continue
        accuracy = bin_correct[i] / bin_count[i]
        avg_conf = bin_confidence[i] / bin_count[i]
        ece += (bin_count[i] / total) * abs(accuracy - avg_conf)
    return ece
```
Promising Directions
- Retrieval-augmented generation (RAG): Ground responses in retrieved documents. Reduces hallucination on factual queries but does not eliminate it (the model can still ignore or misinterpret retrieved context).
- Self-consistency decoding: Generate multiple responses and select claims that appear in the majority. Hallucinations are unlikely to be consistent across independent samples.
- Abstention training: Train the model to say “I don’t know” when its internal confidence is low. This requires calibrated uncertainty estimates, which are an open research problem.
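Self-consistency decoding is simple to sketch. The snippet below assumes a `sample_fn` that calls the model once at nonzero temperature and returns a final answer string; the names are illustrative, not any particular library's API.

```python
from collections import Counter

def self_consistency_vote(sample_fn, prompt, n_samples=5):
    """
    Majority vote over independent samples.
    Hallucinated answers rarely repeat across independent samples,
    so the modal answer is more reliable than any single sample.
    Returns (answer, agreement); low agreement is itself a signal
    that the model may be hallucinating.
    """
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    (best, count), = Counter(answers).most_common(1)
    return best, count / n_samples
```

The cost is linear in `n_samples`, which is why self-consistency is usually reserved for queries where correctness matters more than latency.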
Long-Term Memory
Beyond the Context Window
Even with 1M token context windows, LLMs have no persistent memory across conversations. Every interaction starts from zero. A human assistant that forgets everything between conversations would be useless for most real-world applications.
The requirements for long-term memory:
- Store and retrieve facts from past interactions
- Update stored facts when they change
- Forget information when asked (privacy)
- Prioritize relevant memories (not just retrieve the most recent)
```python
from dataclasses import dataclass
import math
import time

import numpy as np

@dataclass
class MemoryEntry:
    content: str
    embedding: np.ndarray
    timestamp: float
    access_count: int
    importance: float
    source: str

class LongTermMemory:
    """
    External memory system for LLMs.
    Stores, retrieves, updates, and forgets information.
    """
    def __init__(self, embedding_fn, capacity=100000):
        self.embedding_fn = embedding_fn
        self.capacity = capacity
        self.memories = []

    def store(self, content, importance=0.5, source="conversation"):
        """Store a new memory."""
        embedding = self.embedding_fn(content)
        entry = MemoryEntry(
            content=content,
            embedding=embedding,
            timestamp=time.time(),
            access_count=0,
            importance=importance,
            source=source,
        )
        self.memories.append(entry)
        # Evict if over capacity
        if len(self.memories) > self.capacity:
            self._evict_least_important()

    def retrieve(self, query, top_k=5):
        """Retrieve memories most relevant to query."""
        if not self.memories:
            return []
        query_emb = self.embedding_fn(query)
        scored = []
        for mem in self.memories:
            # Relevance = cosine similarity
            relevance = float(np.dot(query_emb, mem.embedding) / (
                np.linalg.norm(query_emb) * np.linalg.norm(mem.embedding)
                + 1e-8
            ))
            # Recency boost (exponential decay, ~30-day time constant)
            age_hours = (time.time() - mem.timestamp) / 3600
            recency = math.exp(-age_hours / 720)
            # Importance and access frequency
            score = (
                0.5 * relevance
                + 0.2 * recency
                + 0.2 * mem.importance
                + 0.1 * min(mem.access_count / 10, 1.0)
            )
            scored.append((score, mem))
        scored.sort(key=lambda x: x[0], reverse=True)
        # Update access counts
        results = []
        for score, mem in scored[:top_k]:
            mem.access_count += 1
            results.append(mem)
        return results

    def forget(self, query, threshold=0.85):
        """Remove memories matching query (for privacy)."""
        query_emb = self.embedding_fn(query)
        kept = []
        removed = 0
        for mem in self.memories:
            similarity = float(np.dot(query_emb, mem.embedding) / (
                np.linalg.norm(query_emb) * np.linalg.norm(mem.embedding)
                + 1e-8
            ))
            if similarity >= threshold:
                removed += 1
            else:
                kept.append(mem)
        self.memories = kept
        return removed

    def _evict_least_important(self):
        """Remove least important memory when over capacity."""
        if not self.memories:
            return
        scored = []
        for i, mem in enumerate(self.memories):
            age_hours = (time.time() - mem.timestamp) / 3600
            recency = math.exp(-age_hours / 720)
            score = (
                0.3 * mem.importance
                + 0.3 * recency
                + 0.4 * min(mem.access_count / 10, 1.0)
            )
            scored.append((score, i))
        scored.sort(key=lambda x: x[0])
        # Remove the lowest-scored memory
        self.memories.pop(scored[0][1])
```
What Is Still Missing
External memory systems work but are brittle. The model must learn to decide what to store, when to retrieve, and how to integrate retrieved memories into its response. These decisions are currently hard-coded or heuristic-based. A true solution requires the model to manage its own memory, which is an unsolved learning problem.
Efficient Training
The Cost Problem
Training a frontier model costs $50-500M in compute. This restricts frontier research to a handful of organizations with massive GPU clusters.
Estimated Training Costs of Frontier Models
| Model | Parameters | Training Tokens | Estimated Cost |
|---|---|---|---|
| Llama 3.1 405B | 405B | 15T | $150-300M |
| GPT-4 (estimated) | ~1.8T MoE | ~13T | $100-200M |
| Claude 3.5 Sonnet | Unknown | Unknown | $50-100M (estimated) |
| Gemini Ultra | Unknown | Unknown | $200-500M (estimated) |
| DeepSeek-V3 | 671B MoE | 14.8T | ~$6M (H800s) |
DeepSeek-V3 is notable: it achieved competitive performance at a fraction of the cost through architectural innovations (MoE, multi-head latent attention) and engineering efficiency.
Promising Directions
- Mixture of Experts (MoE): Activate only a subset of parameters per token. DeepSeek-V3 uses 671B total parameters but only ~37B active per token, reducing compute by 18x.
- Knowledge distillation: Train small models to mimic large ones. The student inherits capabilities without the training cost of the teacher.
- Curriculum learning: Order training data from easy to hard. The model converges faster because it builds on simpler patterns before encountering complex ones.
- Hardware efficiency: Better utilization of existing GPUs through improved parallelism, communication optimization, and kernel fusion.
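Curriculum learning is mostly about the ordering policy. A minimal sketch, where `difficulty_fn` is an assumed heuristic (sequence length, or loss under a small reference model) and each stage replays all earlier data to avoid forgetting:

```python
def build_curriculum(examples, difficulty_fn, n_stages=3):
    """
    Order training data from easy to hard and group it into stages.
    Stage k trains on everything up to difficulty cutoff k, so later
    stages replay the easier examples seen before.
    """
    ranked = sorted(examples, key=difficulty_fn)
    stage_size = -(-len(ranked) // n_stages)  # ceiling division
    return [ranked[: (k + 1) * stage_size] for k in range(n_stages)]
```

The open question is the difficulty signal itself: heuristics like length correlate only loosely with what the model actually finds hard.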
```python
def compute_training_efficiency(
    model_params,
    active_params,
    tokens_trained,
    wall_clock_hours,
    num_gpus,
    gpu_peak_tflops,
):
    """
    Compute training efficiency metrics.
    Model FLOPs Utilization (MFU): fraction of theoretical
    peak compute that is actually used for useful work.
    """
    # Total FLOPs for training (standard 6ND approximation)
    total_flops = 6 * active_params * tokens_trained
    # Theoretical peak
    peak_flops = (
        num_gpus
        * gpu_peak_tflops
        * 1e12  # TFLOPS to FLOPS
        * wall_clock_hours
        * 3600  # hours to seconds
    )
    mfu = total_flops / peak_flops
    # Cost efficiency
    cost_per_gpu_hour = 2.0  # Approximate $/hour for A100/H100
    total_cost = num_gpus * wall_clock_hours * cost_per_gpu_hour
    tokens_per_dollar = tokens_trained / total_cost
    return {
        "mfu": mfu,
        "total_flops": total_flops,
        "total_cost_usd": total_cost,
        "tokens_per_dollar": tokens_per_dollar,
        "flops_per_dollar": total_flops / total_cost,
    }

# DeepSeek-V3 efficiency estimate
deepseek_eff = compute_training_efficiency(
    model_params=671e9,
    active_params=37e9,  # MoE: only 37B active
    tokens_trained=14.8e12,
    wall_clock_hours=55 * 24,  # ~55 days
    num_gpus=2048,
    gpu_peak_tflops=989,  # H800 BF16 peak
)
# MFU ~ 0.34, cost ~ $5.4M
```
Scalable Alignment
The Core Problem
Alignment ensures models are helpful, harmless, and honest. Current methods (RLHF, DPO, Constitutional AI) work when humans can evaluate model outputs. But as models become more capable than humans in specific domains, human evaluation becomes unreliable. How do you align a model whose outputs you cannot fully verify?
This is the scalable oversight problem. The current approach — human raters evaluating model outputs — breaks down when:
- Model outputs are too complex for humans to verify (proofs, code, medical diagnoses)
- Models learn to produce outputs that appear good to evaluators but are subtly wrong (reward hacking)
- The space of harmful outputs is too large to enumerate in advance
Promising Directions
- Debate: Two AI models argue opposing positions while a human judge evaluates. The key insight: it is easier to judge arguments than to generate them. Even if the human cannot independently solve the problem, they can identify which AI's argument is more compelling.
- Recursive reward modeling: Use AI to help humans evaluate AI outputs. The human + AI evaluator system can assess more complex outputs than either alone.
- Interpretability-based alignment: Instead of evaluating outputs, inspect the model's internal representations to verify it is pursuing the intended goal. This requires advances in interpretability (Problem 6).
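The debate protocol reduces to a short control loop. All three callables below are stand-ins (in practice, two model instances and a human or judge model); this is a shape sketch under those assumptions, not a working oversight system.

```python
def run_debate(question, debater_a, debater_b, judge, rounds=2):
    """
    Minimal debate protocol.
    debater_a / debater_b: (question, transcript) -> argument string
    judge: (question, transcript) -> "A" or "B"
    The judge only has to compare arguments, not solve the question.
    """
    transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, transcript)))
        transcript.append(("B", debater_b(question, transcript)))
    return judge(question, transcript)
```

The hard research questions live inside the stubs: whether honest debaters reliably win, and whether judges can resist persuasive-but-wrong arguments.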
Mechanistic Interpretability
What We Want
We want to understand what a neural network computes, the same way we understand what a program computes. For a program, we can trace the execution and explain why each variable has its value. For a neural network, the “program” is 70 billion floating-point parameters, and we cannot explain why any specific output was produced.
The Current State
Mechanistic interpretability has achieved partial successes:
- Sparse autoencoders (SAEs): Decompose model activations into human-interpretable features. A specific direction in activation space might correspond to “French language” or “code syntax.”
- Circuit discovery: For small models (up to ~1B parameters), researchers have identified specific circuits that implement specific behaviors (e.g., induction heads for in-context learning).
- Probing: Train linear classifiers on intermediate representations to test whether the model encodes specific information.
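A probing classifier needs only a few lines. The sketch below fits a least-squares linear probe with NumPy; real probing setups typically use logistic regression with a held-out split, whereas this trains and scores on the same data for brevity.

```python
import numpy as np

def fit_linear_probe(activations, labels):
    """
    Fit a linear probe on intermediate representations.
    activations: [n, d] array of model activations
    labels: [n] binary array for the property being probed
    Returns (weights, train_accuracy). High accuracy suggests the
    property is linearly decodable from this layer.
    """
    X = np.hstack([activations, np.ones((len(activations), 1))])  # add bias
    # Regress to +/-1 targets (least-squares stand-in for logistic regression)
    w, *_ = np.linalg.lstsq(X, 2.0 * labels - 1.0, rcond=None)
    preds = (X @ w > 0).astype(int)
    return w, float((preds == labels).mean())
```

On synthetic activations where the property is one coordinate's sign, the probe recovers that direction almost exactly; the caveat with real models is that a decodable property is not necessarily a property the model uses.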
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """
    Sparse autoencoder for mechanistic interpretability.
    Decomposes model activations into a large dictionary of
    interpretable features.
    """
    def __init__(self, input_dim, dict_size, sparsity_coeff=1e-3):
        """
        input_dim: dimension of model activations to decompose
        dict_size: number of dictionary features (typically 8-64x input_dim)
        sparsity_coeff: L1 penalty to encourage sparse activations
        """
        super().__init__()
        self.encoder = nn.Linear(input_dim, dict_size, bias=True)
        self.decoder = nn.Linear(dict_size, input_dim, bias=True)
        self.sparsity_coeff = sparsity_coeff
        self.dict_size = dict_size

    def forward(self, x):
        """
        Encode activations to sparse features, then reconstruct.
        x: [batch, input_dim] model activations
        Returns (reconstructed, features, loss)
        """
        # Encode to sparse features
        features = torch.relu(self.encoder(x))  # [batch, dict_size]
        # Reconstruct
        reconstructed = self.decoder(features)
        # Losses
        recon_loss = nn.functional.mse_loss(reconstructed, x)
        sparsity_loss = self.sparsity_coeff * features.abs().mean()
        total_loss = recon_loss + sparsity_loss
        return reconstructed, features, total_loss

    def get_top_features(self, x, top_k=10):
        """Get the top-k most active features for an input."""
        with torch.no_grad():
            features = torch.relu(self.encoder(x))
        values, indices = torch.topk(features, top_k, dim=-1)
        return indices, values
```
What Is Missing
Current interpretability techniques work on individual features or small circuits. We cannot yet:
- Explain a model’s reasoning on a specific input end-to-end
- Predict in advance which inputs will cause failures
- Verify that a model has no deceptive behaviors hidden in its weights
- Scale circuit-level analysis to 70B+ parameter models
Edge Inference
The Goal
Run 70B-quality models on consumer devices: phones, laptops, edge servers with limited memory and no datacenter GPUs. This would democratize access and eliminate latency and privacy concerns from cloud inference.
Memory Requirements vs Device Capacity
(Figure: model memory requirements, in GB, vs device capacity.)

The gap between 70B-quality and phone-deployable is enormous. An INT4-quantized 70B model needs 35 GB — far more than any phone. The 8B model fits but does not match 70B quality.
Promising Directions
- Extreme quantization: Push below 4 bits. Recent work on 2-bit and 1.58-bit (ternary) quantization shows promise, though quality degradation is still significant.
- Speculative decoding: Use a small on-device model to draft tokens, and verify with a larger model (cloud or larger local model). Most draft tokens are accepted, so the small model handles the bulk of generation.
- Mixture of Experts on device: MoE models activate only a fraction of parameters per token. A 70B MoE model with 7B active parameters per token could fit in phone memory if only the active experts are loaded.
- Progressive loading: Keep the full model in flash storage and load layers on-demand. Modern NVMe can transfer fast enough to keep the pipeline fed for moderate generation speeds.
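The accept/reject loop behind speculative decoding fits in a few lines. This is a greedy sketch with hypothetical `draft_fn`/`verify_fn` callables: a real implementation scores all draft tokens in one batched forward pass of the large model and uses probabilistic acceptance, not per-token exact-match verification.

```python
def speculative_decode(draft_fn, verify_fn, prefix, n_draft=4, max_tokens=16):
    """
    Greedy speculative decoding sketch.
    draft_fn(tokens, n): small model proposes n next tokens
    verify_fn(tokens): large model's next token for a prefix
    Draft tokens that match the large model are accepted cheaply;
    the first mismatch is replaced and drafting restarts from there.
    """
    tokens = list(prefix)
    while len(tokens) < max_tokens:
        draft = draft_fn(tokens, n_draft)
        for tok in draft:
            expected = verify_fn(tokens)
            if tok == expected:
                tokens.append(tok)       # accepted "for free"
            else:
                tokens.append(expected)  # reject: take the big model's token
                break
            if len(tokens) >= max_tokens:
                break
    return tokens
```

Note that the output is identical whether the draft model is good or useless; what changes is how much of the work the large model must do itself.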
```python
def compute_edge_feasibility(
    model_params,
    bits_per_param,
    device_ram_gb,
    device_bandwidth_gb_s,
    target_tokens_per_sec,
):
    """
    Check if a model can run on an edge device.
    """
    # Memory requirement
    model_size_gb = (model_params * bits_per_param) / (8 * 1e9)
    # KV cache requirement (approximate)
    kv_cache_gb = 0.5  # Rough estimate for 2K context
    total_memory_gb = model_size_gb + kv_cache_gb
    fits_in_ram = total_memory_gb <= device_ram_gb
    # Bandwidth requirement:
    # each generated token reads all parameters once
    bytes_per_token = model_params * bits_per_param / 8
    bandwidth_needed_gb_s = (
        bytes_per_token * target_tokens_per_sec / 1e9
    )
    bandwidth_sufficient = bandwidth_needed_gb_s <= device_bandwidth_gb_s
    # Achievable tokens/sec given bandwidth
    achievable_tps = (
        device_bandwidth_gb_s * 1e9
        / max(bytes_per_token, 1)
    )
    return {
        "model_size_gb": model_size_gb,
        "total_memory_gb": total_memory_gb,
        "fits_in_ram": fits_in_ram,
        "bandwidth_needed_gb_s": bandwidth_needed_gb_s,
        "bandwidth_sufficient": bandwidth_sufficient,
        "achievable_tokens_per_sec": achievable_tps,
    }

# Example: Llama 3 8B at 4-bit on a phone-class device
result = compute_edge_feasibility(
    model_params=8e9,
    bits_per_param=4,
    device_ram_gb=8,
    device_bandwidth_gb_s=100,  # assumed unified-memory bandwidth
    target_tokens_per_sec=15,
)
# fits_in_ram: True (4.5 GB total)
# achievable_tps: 25 tokens/sec
```
Multimodal Grounding
The Problem
LLMs can discuss images and videos, but they do not truly understand the visual content in the way humans do. The model can describe a scene but cannot reliably:
- Count objects in an image
- Determine spatial relationships (which object is behind which)
- Track objects across video frames
- Ground language references to specific visual regions (“the red cup on the left”)
The Gap
Multimodal Grounding Tasks: AI vs Human
| Task | Best Model (2025) | Human | Gap |
|---|---|---|---|
| Object counting | ~72% | 97% | 25 points |
| Spatial relations (left/right/above) | ~78% | 99% | 21 points |
| Object tracking in video | ~65% | 95% | 30 points |
| Fine-grained visual QA | ~70% | 92% | 22 points |
| Referring expression grounding | ~75% | 96% | 21 points |
Why It Is Hard
- Resolution bottleneck: Visual encoders (ViT) process images at 224x224 or 336x336 resolution. Fine details (small text, distant objects) are lost.
- Tokenization loss: Converting a 1080p image into 576 visual tokens loses spatial precision. Each token covers roughly an 80x45 pixel region of the original frame — too coarse for precise grounding.
- Attention dilution: In a sequence of 1000+ tokens (text + visual), the model’s attention to any specific visual region is diluted.
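The tokenization arithmetic is worth making concrete. For a square token grid, each visual token's pixel footprint is just the frame size divided by the grid size (the helper name is ours):

```python
import math

def pixels_per_token(width, height, n_tokens):
    """Pixel region covered by one visual token, assuming a square token grid."""
    grid = math.isqrt(n_tokens)  # e.g. 576 tokens -> 24x24 grid
    return width // grid, height // grid
```

A 1920x1080 frame mapped to 576 tokens gives 80x45 pixels per token, far too coarse to read small text or localize distant objects.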
Promising Directions
- Region-of-interest encoding: Instead of encoding the entire image uniformly, identify regions of interest and encode them at higher resolution. This is the visual equivalent of "focusing attention."
- Coordinate-aware training: Teach the model to output pixel coordinates for referenced objects. This requires training data with bounding box annotations linked to text descriptions.
- Video as structured data: Instead of processing video frame-by-frame, extract object tracks and represent the video as a structured graph of objects with spatial and temporal relationships.
```python
import torch
import torch.nn as nn

class GroundedVisualEncoder(nn.Module):
    """
    Visual encoder with region-of-interest processing.
    Encodes the full image at low resolution, then re-encodes
    detected regions at high resolution.
    """
    def __init__(self, base_encoder, roi_encoder, llm_dim=4096):
        super().__init__()
        self.base_encoder = base_encoder  # Full image, 224x224
        self.roi_encoder = roi_encoder    # ROI crops, 224x224
        self.base_proj = nn.Linear(768, llm_dim)
        self.roi_proj = nn.Linear(768, llm_dim)
        self.region_position = nn.Linear(4, llm_dim)  # bbox coords

    def forward(self, image, roi_crops, roi_bboxes):
        """
        Encode image with region-of-interest detail.
        image: [B, 3, H, W] full image
        roi_crops: [B, N, 3, 224, 224] cropped regions
        roi_bboxes: [B, N, 4] normalized bbox coordinates
        """
        B, N = roi_crops.shape[:2]
        # Global encoding
        global_tokens = self.base_encoder(image)
        global_tokens = self.base_proj(global_tokens)
        # ROI encoding
        roi_flat = roi_crops.reshape(B * N, 3, 224, 224)
        roi_tokens = self.roi_encoder(roi_flat)
        roi_tokens = roi_tokens.reshape(B, N, -1, 768)
        # Add spatial position information
        bbox_emb = self.region_position(roi_bboxes)
        roi_tokens = self.roi_proj(roi_tokens.mean(dim=2)) + bbox_emb
        # Concatenate global and ROI tokens
        return torch.cat([global_tokens, roi_tokens], dim=1)
```
Summary: The Difficulty Landscape
Estimated Years to Solve (Speculative)
(Figure: estimated years to solve each problem; speculative.)

The eight problems are not independent. Reliable reasoning requires interpretability (understanding why the model reasons incorrectly). Hallucination reduction requires both calibrated uncertainty (efficient training) and grounding (multimodal). Scalable alignment requires interpretability and reliable reasoning. Edge inference enables broader deployment, which makes alignment more urgent. The research frontier is a coupled system where progress on each problem enables progress on others. The most impactful work is on the connective tissue between these problems, not on any single problem in isolation.