GPT-4 with a code interpreter solves MATH at 69.7% accuracy; without one it scores 42.5%, a 27-point gap from tool use alone. The capability comes from training data: thousands of (thought, action, observation) trajectories in which the model writes code, executes it, reads the output, and iterates. Standard chat data teaches text generation; agentic data teaches action selection, error recovery, and multi-step tool chaining. The format is not negotiable: if your training data lacks tool-use traces, your model will not learn to use tools reliably.
This post covers the construction of agentic training data: tool use trace formats, trajectory collection from real environments, synthetic trajectory generation, reward shaping for multi-step actions, and error recovery training.
Agentic Data Format
Trace Schema
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
class ActionType(Enum):
TOOL_CALL = "tool_call"
CODE_EXECUTE = "code_execute"
FILE_READ = "file_read"
FILE_WRITE = "file_write"
WEB_SEARCH = "web_search"
API_CALL = "api_call"
SHELL_COMMAND = "shell_command"
THINK = "think"
ANSWER = "answer"
@dataclass
class ToolDefinition:
"""Definition of a tool available to the agent."""
name: str
description: str
parameters: dict
returns: str
examples: list = field(default_factory=list)
@dataclass
class AgentStep:
"""Single step in an agent trajectory."""
step_index: int
thought: str
action_type: ActionType
action_input: dict
observation: str
reward: float = 0.0
is_error: bool = False
error_type: Optional[str] = None
latency_ms: float = 0.0
tokens_used: int = 0
@dataclass
class AgentTrajectory:
"""Complete agent trajectory for a task."""
task_id: str
task_description: str
available_tools: list
steps: list
final_answer: Optional[str] = None
task_completed: bool = False
total_steps: int = 0
total_tokens: int = 0
trajectory_reward: float = 0.0
metadata: dict = field(default_factory=dict)
class AgentTraceFormatter:
"""
Format agent trajectories into training-compatible
text sequences.
The format must encode:
- Available tools (system prompt)
- Thought-action-observation loops
- Error handling and recovery
- Final answer extraction
"""
THOUGHT_TAG = "<think>"
THOUGHT_END = "</think>"
ACTION_TAG = "<action>"
ACTION_END = "</action>"
OBSERVATION_TAG = "<observation>"
OBSERVATION_END = "</observation>"
def format_trajectory(self, trajectory):
"""
Convert a trajectory to a text training sequence.
"""
parts = []
# System prompt with tool definitions
tool_defs = self._format_tool_definitions(
trajectory.available_tools
)
parts.append(f"System: {tool_defs}")
# Task
parts.append(
f"User: {trajectory.task_description}"
)
# Steps
for step in trajectory.steps:
# Thought
parts.append(
f"{self.THOUGHT_TAG}"
f"{step.thought}"
f"{self.THOUGHT_END}"
)
# Action
action_str = self._format_action(step)
parts.append(
f"{self.ACTION_TAG}"
f"{action_str}"
f"{self.ACTION_END}"
)
# Observation
obs = step.observation
if len(obs) > 2000:
obs = obs[:2000] + "\n... (truncated)"
parts.append(
f"{self.OBSERVATION_TAG}"
f"{obs}"
f"{self.OBSERVATION_END}"
)
# Final answer
if trajectory.final_answer:
parts.append(
f"Assistant: {trajectory.final_answer}"
)
return "\n\n".join(parts)
def _format_tool_definitions(self, tools):
"""Format tool definitions for system prompt."""
lines = ["You have access to the following tools:\n"]
for tool in tools:
lines.append(f"### {tool.name}")
lines.append(f"Description: {tool.description}")
lines.append(f"Parameters: {tool.parameters}")
lines.append(f"Returns: {tool.returns}")
lines.append("")
return "\n".join(lines)
def _format_action(self, step):
"""Format an action step."""
if step.action_type == ActionType.TOOL_CALL:
return (
f"tool: {step.action_input.get('tool_name', '')}\n"
f"arguments: {step.action_input.get('arguments', {})}"
)
elif step.action_type == ActionType.CODE_EXECUTE:
return (
f"execute_code:\n"
f"```python\n"
f"{step.action_input.get('code', '')}\n"
f"```"
)
elif step.action_type == ActionType.SHELL_COMMAND:
return (
f"shell: {step.action_input.get('command', '')}"
)
else:
return str(step.action_input)
The thought-action-observation (TAO) loop is the standard format across agentic frameworks: ReAct introduced it, and AutoGPT-style agents and most tool-use stacks run variants of it (Toolformer is the exception, interleaving raw API calls without explicit thoughts). The key training signal is the thought: the model must learn to reason about which tool to use and why before acting. Training on action-observation pairs without thoughts produces agents that act without planning and fail on multi-step tasks.
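To make the target format concrete, here is a minimal standalone sketch of what one rendered TAO step looks like. The rendering helper and the toy thought/action/observation values are invented for illustration; the full formatter above handles tool definitions, truncation, and multi-step loops.

```python
# Standalone sketch of the TAO text format described above; the toy
# thought/action/observation values are invented for illustration.
def render_step(thought: str, action: str, observation: str) -> str:
    return (
        f"<think>{thought}</think>\n\n"
        f"<action>{action}</action>\n\n"
        f"<observation>{observation}</observation>"
    )

sample = render_step(
    thought="The traceback points at parse_config; read that file first.",
    action="tool: read_file\narguments: {'path': 'src/config.py'}",
    observation="def parse_config(path):\n    ...",
)
```

The ordering matters: the thought must precede the action in the token stream so the model learns to plan before it acts.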
Trajectory Collection from Real Environments
SWE-Bench Style Task Collection
import subprocess
import json
import os
class SWEBenchTrajectoryCollector:
"""
Collect agent trajectories from real software
engineering tasks (SWE-bench style).
Pipeline:
1. Set up a Git repository at a specific commit
2. Present the agent with an issue description
3. Record every action the agent takes
4. Evaluate the final patch against test suite
5. Label trajectory as success/failure
"""
def __init__(self, workspace_dir, model):
self.workspace_dir = workspace_dir
self.model = model
def collect_trajectory(self, task):
"""
Run an agent on a task and collect the full
trajectory with environment feedback.
"""
# Set up environment
repo_path = self._setup_repo(
task["repo"], task["base_commit"]
)
# Define available tools
tools = [
ToolDefinition(
name="read_file",
description="Read the contents of a file",
parameters={"path": "string"},
returns="File contents as string",
),
ToolDefinition(
name="write_file",
description="Write content to a file",
parameters={
"path": "string",
"content": "string",
},
returns="Success/failure message",
),
ToolDefinition(
name="search_code",
description="Search for a pattern in the codebase",
parameters={
"pattern": "string",
"file_glob": "string",
},
returns="Matching lines with file paths",
),
ToolDefinition(
name="run_tests",
description="Run the test suite",
parameters={"test_path": "string"},
returns="Test results (pass/fail counts)",
),
ToolDefinition(
name="shell",
description="Execute a shell command",
parameters={"command": "string"},
returns="Command output",
),
]
# Run agent
trajectory = AgentTrajectory(
task_id=task["instance_id"],
task_description=task["problem_statement"],
available_tools=tools,
steps=[],
)
context = (
f"Fix the following issue in the {task['repo']} "
f"repository:\n\n{task['problem_statement']}"
)
for step_idx in range(50): # Max 50 steps
# Get agent action
response = self.model.generate(
context,
temperature=0.0,
max_tokens=2048,
)
# Parse action from response
action = self._parse_action(response)
if action is None or action["type"] == "done":
trajectory.final_answer = response
break
# Execute action in environment
observation = self._execute_action(
action, repo_path
)
step = AgentStep(
step_index=step_idx,
thought=action.get("thought", ""),
action_type=ActionType(action["type"]),
action_input=action.get("input", {}),
observation=observation["output"],
is_error=observation.get("is_error", False),
error_type=observation.get("error_type"),
)
trajectory.steps.append(step)
trajectory.total_steps = step_idx + 1
# Update context with new step
context += (
f"\n\nAction: {action}\n"
f"Observation: {observation['output']}"
)
# Evaluate
test_result = self._run_evaluation(
repo_path, task
)
trajectory.task_completed = test_result["passed"]
trajectory.trajectory_reward = (
1.0 if test_result["passed"] else 0.0
)
return trajectory
def _setup_repo(self, repo_name, commit):
"""Set up repository at specified commit."""
repo_path = os.path.join(
self.workspace_dir, repo_name.replace("/", "_")
)
return repo_path
def _parse_action(self, response):
"""Parse an action from model response."""
return None # Placeholder
def _execute_action(self, action, repo_path):
"""Execute an action in the environment."""
return {"output": "", "is_error": False}
def _run_evaluation(self, repo_path, task):
"""Run tests to evaluate the agent's patch."""
return {"passed": False}
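The `_parse_action` stub above is where most real-world breakage happens. A hedged sketch of one way to implement it, assuming the model emits the `<action>…</action>` tags produced by `AgentTraceFormatter` and a `done` convention for task completion (both are assumptions of this sketch, not a fixed protocol):

```python
import re

# Sketch of an action parser for <action>-tagged model output. Assumes the
# "tool: NAME / arguments: {...}" layout from AgentTraceFormatter above;
# a production parser needs stricter validation and argument decoding.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)

def parse_action(response: str):
    match = ACTION_RE.search(response)
    if match is None:
        return None  # no action tag -> treat the response as a final answer
    body = match.group(1).strip()
    if body.lower().startswith("done"):
        return {"type": "done"}
    if body.startswith("tool:"):
        lines = body.splitlines()
        tool_name = lines[0].split(":", 1)[1].strip()
        args_text = ""
        for line in lines[1:]:
            if line.startswith("arguments:"):
                args_text = line.split(":", 1)[1].strip()
        return {
            "type": "tool_call",
            "input": {"tool_name": tool_name, "arguments": args_text},
        }
    if body.startswith("shell:"):
        return {
            "type": "shell_command",
            "input": {"command": body.split(":", 1)[1].strip()},
        }
    return None  # unrecognized action body
```

Returning `None` on malformed output (rather than raising) lets the collector end the trajectory gracefully, which is itself useful failure data.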
Web Browsing Trajectory Collection
class WebBrowsingTrajectoryCollector:
"""
Collect trajectories from web browsing tasks.
The agent navigates web pages to find information
or complete tasks (form filling, data extraction,
comparison shopping).
"""
def __init__(self, browser_env, model):
self.browser_env = browser_env
self.model = model
def collect_trajectory(self, task):
"""
Run agent on a web browsing task.
"""
tools = [
ToolDefinition(
name="click",
description="Click an element on the page",
parameters={"selector": "string"},
returns="Updated page content",
),
ToolDefinition(
name="type",
description="Type text into an input field",
parameters={
"selector": "string",
"text": "string",
},
returns="Updated page content",
),
ToolDefinition(
name="scroll",
description="Scroll the page",
parameters={"direction": "up|down"},
returns="Updated visible content",
),
ToolDefinition(
name="navigate",
description="Navigate to a URL",
parameters={"url": "string"},
returns="Page content",
),
ToolDefinition(
name="extract_text",
description="Extract text from visible page",
parameters={},
returns="Visible text content",
),
]
trajectory = AgentTrajectory(
task_id=task["id"],
task_description=task["instruction"],
available_tools=tools,
steps=[],
)
# Initialize browser
page_content = self.browser_env.reset(
task.get("start_url", "")
)
context = (
f"Task: {task['instruction']}\n\n"
f"Current page:\n{page_content[:5000]}"
)
for step_idx in range(30):
response = self.model.generate(
context,
temperature=0.0,
max_tokens=1024,
)
action = self._parse_action(response)
if action is None:
trajectory.final_answer = response
break
observation = self.browser_env.step(action)
step = AgentStep(
step_index=step_idx,
thought=action.get("thought", ""),
action_type=ActionType.TOOL_CALL,
action_input=action,
observation=str(observation)[:2000],
is_error=observation.get("error", False),
)
trajectory.steps.append(step)
context += (
f"\n\nAction: {action}\n"
f"Page: {str(observation)[:2000]}"
)
return trajectory
def _parse_action(self, response):
"""Parse browser action from model response."""
return None # Placeholder
Agentic Task Domains and Data Properties
| Task Domain | Avg Steps per Task | Tool Count | Success Rate (GPT-4) | Error Recovery Required | Data Collection Cost |
|---|---|---|---|---|---|
| SWE-bench (code repair) | 15-30 | 5 | 12-20% | High | $2-5 per trajectory |
| Web browsing (WebArena) | 8-15 | 5 | 15-25% | Medium | $1-3 per trajectory |
| Data analysis (SQL) | 5-10 | 3 | 40-60% | Medium | $0.5-1 per trajectory |
| API orchestration | 5-15 | 10-20 | 30-50% | High | $1-2 per trajectory |
| File management | 3-8 | 4 | 60-80% | Low | $0.3-0.5 per trajectory |
| Multi-step math | 5-12 | 2 | 35-55% | Medium | $0.5-1 per trajectory |
Synthetic Trajectory Generation
Scaling Agentic Data with Simulation
class SyntheticTrajectoryGenerator:
"""
Generate synthetic agentic trajectories at scale.
Real trajectory collection is expensive (environment
setup, execution time, evaluation). Synthetic generation
uses a strong model to simulate both agent and
environment, producing trajectories 10-100x cheaper.
Strategies:
1. Expert iteration: strong model acts, weak model
learns from successful trajectories
2. Self-play: model generates task + solution
3. Hindsight relabeling: failed trajectories are
relabeled with corrected actions
"""
def __init__(self, expert_model, student_model=None):
self.expert_model = expert_model
self.student_model = student_model
def generate_expert_trajectories(self, tasks,
environments):
"""
Expert iteration: use a strong model to generate
successful trajectories for training a weaker model.
"""
successful = []
failed = []
for task, env in zip(tasks, environments):
trajectory = self._run_expert(task, env)
if trajectory.task_completed:
successful.append(trajectory)
else:
failed.append(trajectory)
return successful, failed
def hindsight_relabel(self, failed_trajectories):
"""
Hindsight relabeling: take a failed trajectory,
identify the first error, and generate the
correct action sequence from that point.
This salvages failed trajectories by converting
them into training data that shows both the
error and the correction.
"""
relabeled = []
for traj in failed_trajectories:
# Find first error step
error_step = None
for step in traj.steps:
if step.is_error:
error_step = step
break
if error_step is None:
continue
# Generate corrected action from error point
prefix_steps = traj.steps[:error_step.step_index]
context = self._build_context_from_steps(
traj.task_description,
prefix_steps,
)
# Add error observation
context += (
f"\n\nPrevious action failed with error: "
f"{error_step.observation}\n"
f"Generate a corrected approach."
)
corrected_response = self.expert_model.generate(
context,
temperature=0.3,
max_tokens=2048,
)
# Create relabeled trajectory
relabeled_traj = AgentTrajectory(
task_id=traj.task_id + "_relabeled",
task_description=traj.task_description,
available_tools=traj.available_tools,
steps=prefix_steps + [
AgentStep(
step_index=error_step.step_index,
thought=(
f"The previous action failed. "
f"Error: {error_step.observation}. "
f"I need to try a different approach."
),
action_type=ActionType.THINK,
action_input={
"content": corrected_response
},
observation="",
is_error=False,
),
],
metadata={"relabeled": True},
)
relabeled.append(relabeled_traj)
return relabeled
def generate_error_recovery_data(self, n_samples=1000):
"""
Explicitly generate error recovery training data.
For each sample:
1. Generate a partial trajectory
2. Inject a realistic error
3. Generate the recovery sequence
"""
error_types = [
{
"name": "file_not_found",
"observation": "Error: FileNotFoundError: "
"No such file or directory",
"recovery": "search for the correct file path",
},
{
"name": "syntax_error",
"observation": "Error: SyntaxError: "
"invalid syntax at line 15",
"recovery": "fix the syntax error and re-run",
},
{
"name": "test_failure",
"observation": "FAILED: 3 tests failed. "
"AssertionError in test_edge_case",
"recovery": "read the failing test, understand "
"the edge case, fix the code",
},
{
"name": "timeout",
"observation": "Error: Command timed out "
"after 30 seconds",
"recovery": "optimize the approach or break "
"into smaller steps",
},
{
"name": "permission_denied",
"observation": "Error: PermissionError: "
"Permission denied",
"recovery": "check file permissions or use "
"a different path",
},
]
import random  # hoisted: importing inside the loop body is wasteful
samples = []
for _ in range(n_samples):
error = random.choice(error_types)
sample = self._generate_recovery_sample(error)
samples.append(sample)
return samples
def _run_expert(self, task, env):
"""Run expert model on a task."""
return AgentTrajectory(
task_id=task["id"],
task_description=task["description"],
available_tools=[],
steps=[],
)
def _build_context_from_steps(self, task, steps):
"""Build context string from trajectory steps."""
context = f"Task: {task}\n\n"
for step in steps:
context += (
f"Thought: {step.thought}\n"
f"Action: {step.action_input}\n"
f"Observation: {step.observation}\n\n"
)
return context
def _generate_recovery_sample(self, error):
"""Generate a single error recovery sample."""
return {} # Placeholder
Error recovery data is the most underrepresented category in agentic training. Most trajectory datasets contain only successful paths. Models trained exclusively on success trajectories freeze or loop when they encounter errors. Explicitly generating error-recovery training data (error injection + correction) improves task completion rates by 8-15% on benchmarks like SWE-bench and WebArena.
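A sketch of what `_generate_recovery_sample` might produce, assuming a purely template-based construction (no model call; the task descriptions are invented placeholders, and a real implementation would have the expert model write the recovery trajectory):

```python
import random

# Template-based recovery sample, as a stand-in for the model-driven
# _generate_recovery_sample above. Task strings are invented placeholders.
def generate_recovery_sample(error: dict, rng: random.Random) -> dict:
    task = rng.choice([
        "Fix the failing unit test in the parser module",
        "Add input validation to the upload handler",
    ])
    return {
        "task": task,
        "error_observation": error["observation"],
        "recovery_thought": (
            f"The last action failed: {error['observation']}. "
            f"I should {error['recovery']}."
        ),
        "error_type": error["name"],
    }

rng = random.Random(0)
error = {
    "name": "file_not_found",
    "observation": "Error: FileNotFoundError: No such file or directory",
    "recovery": "search for the correct file path",
}
sample = generate_recovery_sample(error, rng)
```

The useful part of the sample is the recovery thought: it pairs the error observation with an explicit strategy change, which is exactly the transition success-only data never shows.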
Reward Shaping for Multi-Step Agents
Intermediate Rewards
class AgentRewardShaper:
"""
Shape rewards for multi-step agent trajectories.
Sparse reward (success/failure at the end) produces
poor training signal for long trajectories. Dense
intermediate rewards at each step provide better
gradient signal.
Reward components:
1. Progress reward: did this step make measurable
progress toward the goal?
2. Efficiency reward: was this the most direct action?
3. Information gain: did this action reveal useful
information?
4. Error penalty: did this action cause an error?
5. Outcome reward: task success/failure
"""
def __init__(self, config):
self.progress_weight = config.get(
"progress_weight", 0.3
)
self.efficiency_weight = config.get(
"efficiency_weight", 0.1
)
self.info_gain_weight = config.get(
"info_gain_weight", 0.2
)
self.error_penalty = config.get(
"error_penalty", -0.5
)
self.outcome_weight = config.get(
"outcome_weight", 1.0
)
def shape_trajectory_rewards(self, trajectory):
"""
Assign dense rewards to each step in a trajectory.
"""
steps = trajectory.steps
n_steps = len(steps)
if n_steps == 0:
return trajectory
shaped_rewards = []
for i, step in enumerate(steps):
reward = 0.0
# Progress reward
progress = self._estimate_progress(
step, trajectory
)
reward += self.progress_weight * progress
# Efficiency reward (penalize redundant actions)
if i > 0:
redundancy = self._check_redundancy(
step, steps[:i]
)
reward -= self.efficiency_weight * redundancy
# Information gain
info_gain = self._estimate_info_gain(
step, steps[:i]
)
reward += self.info_gain_weight * info_gain
# Error penalty
if step.is_error:
reward += self.error_penalty
shaped_rewards.append(reward)
# Add outcome reward to last step
if trajectory.task_completed:
shaped_rewards[-1] += self.outcome_weight
else:
shaped_rewards[-1] -= self.outcome_weight * 0.5
# Apply discount factor (earlier steps get
# discounted reward from future steps)
gamma = 0.95
returns = self._compute_returns(
shaped_rewards, gamma
)
for i, step in enumerate(steps):
step.reward = returns[i]
return trajectory
def _estimate_progress(self, step, trajectory):
"""
Estimate progress made by this step.
Heuristics:
- File read that finds relevant code: +0.3
- Code edit that moves toward solution: +0.5
- Test run that passes more tests: +0.8
- Navigation to irrelevant area: -0.1
"""
if step.action_type == ActionType.CODE_EXECUTE:
if "PASSED" in step.observation:
return 0.8
if "Error" in step.observation:
return -0.2
if step.action_type == ActionType.FILE_READ:
return 0.2
if step.action_type == ActionType.FILE_WRITE:
return 0.4
return 0.0
def _check_redundancy(self, step, previous_steps):
"""Check if this step repeats a previous action."""
for prev in previous_steps:
if (
prev.action_type == step.action_type
and prev.action_input == step.action_input
):
return 1.0
return 0.0
def _estimate_info_gain(self, step, previous_steps):
"""Estimate information gained from this step."""
if step.action_type == ActionType.FILE_READ:
return 0.3
if step.action_type == ActionType.WEB_SEARCH:
return 0.4
return 0.1
def _compute_returns(self, rewards, gamma):
"""Compute discounted returns."""
returns = [0.0] * len(rewards)
running_return = 0.0
for i in reversed(range(len(rewards))):
running_return = rewards[i] + gamma * running_return
returns[i] = running_return
return returns
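The discounted-return computation at the end of the shaper is standard backward recursion; a quick standalone check with toy reward values shows how a terminal outcome reward leaks backward to earlier steps:

```python
# Standalone version of _compute_returns with a worked example.
def compute_returns(rewards, gamma):
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# With gamma=0.5 and a single terminal reward of 1.0:
# returns[2] = 1.0, returns[1] = 0 + 0.5*1.0 = 0.5,
# returns[0] = 0 + 0.5*0.5 = 0.25 -- early steps share
# credit for the final success.
returns = compute_returns([0.0, 0.0, 1.0], gamma=0.5)
```

This is why a higher gamma (0.95 above) suits long trajectories: with 20-30 steps, a small gamma would starve early steps of outcome signal entirely.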
Impact of Reward Shaping on Agent Performance
(Figure: agent performance as a function of training steps, 0 to 200, comparing three conditions: dense shaped rewards, sparse outcome reward only, and no reward shaping (SFT only).)
Data Quality for Agent Training
Trajectory Filtering and Augmentation
class TrajectoryQualityFilter:
"""
Filter agent trajectories for training quality.
Not all successful trajectories are good training data.
A trajectory that succeeds after 45 steps of random
exploration is worse than one that succeeds in 8
focused steps.
"""
def __init__(self, config):
self.max_steps = config.get("max_steps", 30)
self.min_steps = config.get("min_steps", 2)
self.max_error_rate = config.get(
"max_error_rate", 0.3
)
self.max_redundancy_rate = config.get(
"max_redundancy_rate", 0.2
)
def filter_trajectories(self, trajectories):
"""
Filter trajectories by quality criteria.
"""
accepted = []
rejected = []
for traj in trajectories:
reasons = self._check_quality(traj)
if not reasons:
accepted.append(traj)
else:
rejected.append((traj, reasons))
return accepted, rejected
def _check_quality(self, trajectory):
"""Run all quality checks on a trajectory."""
reasons = []
steps = trajectory.steps
# Step count bounds
if len(steps) > self.max_steps:
reasons.append(
f"Too many steps: {len(steps)}"
)
if len(steps) < self.min_steps:
reasons.append(
f"Too few steps: {len(steps)}"
)
# Error rate
if steps:
error_rate = sum(
1 for s in steps if s.is_error
) / len(steps)
if error_rate > self.max_error_rate:
reasons.append(
f"High error rate: {error_rate:.2f}"
)
# Redundancy check
if steps:
actions = [
(s.action_type, str(s.action_input))
for s in steps
]
unique = len(set(actions))
redundancy = 1.0 - unique / len(actions)
if redundancy > self.max_redundancy_rate:
reasons.append(
f"High redundancy: {redundancy:.2f}"
)
# Circular behavior detection
if self._has_circular_behavior(steps):
reasons.append("Circular behavior detected")
return reasons
def _has_circular_behavior(self, steps):
"""
Detect circular behavior: agent repeats a
sequence of actions.
Example: read file A -> edit file B -> run tests
-> read file A -> edit file B -> run tests -> ...
"""
if len(steps) < 6:
return False
action_sequence = [
str(s.action_type) + str(s.action_input)
for s in steps
]
# Check for repeating subsequences
# +1 on both bounds: without it, a length-6 sequence ABCABC
# (period 3) and a trailing repetition are never checked
for period in range(2, len(action_sequence) // 2 + 1):
for start in range(
len(action_sequence) - 2 * period + 1
):
subseq = action_sequence[
start:start + period
]
next_subseq = action_sequence[
start + period:start + 2 * period
]
if subseq == next_subseq:
return True
return False
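The period-detection loop above can be exercised standalone; here the step fingerprints are simplified to plain strings for illustration:

```python
# Standalone repeating-subsequence check, mirroring _has_circular_behavior.
def has_cycle(actions, min_len=6):
    if len(actions) < min_len:
        return False
    # Try every period length and start offset; a cycle exists when a
    # subsequence is immediately followed by an identical copy of itself.
    for period in range(2, len(actions) // 2 + 1):
        for start in range(len(actions) - 2 * period + 1):
            if (actions[start:start + period]
                    == actions[start + period:start + 2 * period]):
                return True
    return False

looping = ["read A", "edit B", "run tests"] * 3
focused = ["read A", "search foo", "edit B", "run tests",
           "write patch", "submit"]
```

The check is O(n^2) in trajectory length, which is fine at 30-50 steps; longer trajectories would want a suffix-based method.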
Complete Agentic Data Pipeline
End-to-End Construction
class AgenticDataPipeline:
"""
End-to-end pipeline for building agentic training data.
Stages:
1. Task generation: create diverse tasks
2. Trajectory collection: run agents on tasks
3. Reward shaping: assign dense intermediate rewards
4. Quality filtering: remove low-quality trajectories
5. Augmentation: hindsight relabeling, error injection
6. Formatting: convert to training-ready sequences
"""
def __init__(self, config):
self.trajectory_collector = (
SWEBenchTrajectoryCollector(
workspace_dir=config["workspace"],
model=config["model"],
)
)
self.synthetic_generator = (
SyntheticTrajectoryGenerator(
expert_model=config["expert_model"],
)
)
self.reward_shaper = AgentRewardShaper(
config.get("reward_config", {})
)
self.quality_filter = TrajectoryQualityFilter(
config.get("filter_config", {})
)
self.formatter = AgentTraceFormatter()
def build_dataset(self, tasks, n_augmented=5000):
"""Build complete agentic training dataset."""
# Stage 1: Collect real trajectories
real_trajectories = []
for task in tasks:
traj = self.trajectory_collector.collect_trajectory(
task
)
real_trajectories.append(traj)
# Stage 2: Generate synthetic trajectories
environments = [None] * len(tasks)  # placeholder envs; _run_expert is stubbed
successful, failed = (
self.synthetic_generator
.generate_expert_trajectories(tasks, environments)
)
# Stage 3: Hindsight relabeling of failures
relabeled = (
self.synthetic_generator
.hindsight_relabel(failed)
)
# Stage 4: Error recovery data
error_recovery = (
self.synthetic_generator
.generate_error_recovery_data(n_augmented)
)
# Combine all trajectories
all_trajectories = (
real_trajectories
+ successful
+ relabeled
)
# Stage 5: Reward shaping
for traj in all_trajectories:
self.reward_shaper.shape_trajectory_rewards(traj)
# Stage 6: Quality filtering
accepted, rejected = (
self.quality_filter.filter_trajectories(
all_trajectories
)
)
# Stage 7: Format for training
formatted = []
for traj in accepted:
text = self.formatter.format_trajectory(traj)
formatted.append({
"text": text,
"reward": traj.trajectory_reward,
"steps": traj.total_steps,
"domain": traj.metadata.get("domain", ""),
"is_synthetic": traj.metadata.get(
"synthetic", False
),
})
return {
"training_samples": formatted,
"total_real": len(real_trajectories),
"total_synthetic": len(successful),
"total_relabeled": len(relabeled),
"filtered_out": len(rejected),
"final_count": len(formatted),
}
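The final `training_samples` list is typically serialized to JSONL for the training job; a minimal sketch (the field names mirror the dict built in `build_dataset` above, and the sample values are placeholders):

```python
import io
import json

# Minimal JSONL export of formatted samples; field names follow the
# dict constructed in build_dataset above.
def write_jsonl(samples, fh):
    for sample in samples:
        fh.write(json.dumps(sample, ensure_ascii=False) + "\n")

buf = io.StringIO()
write_jsonl(
    [{"text": "System: ...", "reward": 1.0, "steps": 8,
      "domain": "swe", "is_synthetic": False}],
    buf,
)
lines = buf.getvalue().splitlines()
```

One record per line keeps the dataset streamable and makes per-sample metadata (reward, step count, provenance) available for mixture weighting at training time.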
Agentic Training Data: Impact on SWE-bench Resolve Rate
(Figure: SWE-bench resolve rate as a function of training data scale, comparing three data mixes: real trajectories only, real + synthetic, and real + synthetic + error recovery.)
Key Takeaways
Agentic training data requires a fundamentally different format from chat data. The thought-action-observation triple structure encodes not just what the model should say but what it should do, how the environment responds, and how to reason about the next action.
The critical engineering decisions:
- Error recovery data is the highest-leverage addition: models trained only on success trajectories fail catastrophically when they encounter errors. Adding explicit error-recovery training data (error injection + correction sequences) improves task completion rates by 8-15% on multi-step benchmarks.
- Dense reward shaping accelerates training: sparse outcome rewards (success/failure) provide weak signal for 15-30 step trajectories. Dense intermediate rewards (progress estimation, information gain, efficiency penalties) improve learning speed by 50-100%, measured in training steps to target performance.
- Hindsight relabeling salvages failed trajectories: 60-80% of collected trajectories fail. Relabeling failures by identifying the error point and generating corrections produces training data that explicitly teaches when and how to change strategy.
- Trajectory quality matters more than quantity: filtering for efficiency (low step count), low redundancy, and no circular behavior removes 30-50% of raw trajectories but produces better downstream performance per training token.
- Synthetic trajectories scale cheaply: expert-model-generated trajectories cost 10-100x less than real environment execution. A mix of real (for grounding) and synthetic (for coverage) produces the best training data.