GPT-4 with a code interpreter solves MATH at 69.7% accuracy; without one it scores 42.5%, a 27-point gap from tool use alone. The capability comes from training data: thousands of (thought, action, observation) trajectories in which the model writes code, executes it, reads the output, and iterates. Standard chat data teaches text generation; agentic data teaches action selection, error recovery, and multi-step tool chaining. The format is not negotiable: if your training data lacks tool-use traces, your model will not learn to use tools reliably.
This post covers the construction of agentic training data: tool use trace formats, trajectory collection from real environments, synthetic trajectory generation, reward shaping for multi-step actions, and error recovery training.
Agentic Data Format
Trace Schema
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
class ActionType(Enum):
TOOL_CALL = "tool_call"
CODE_EXECUTE = "code_execute"
FILE_READ = "file_read"
FILE_WRITE = "file_write"
WEB_SEARCH = "web_search"
API_CALL = "api_call"
SHELL_COMMAND = "shell_command"
THINK = "think"
ANSWER = "answer"
@dataclass
class ToolDefinition:
"""Definition of a tool available to the agent."""
name: str
description: str
parameters: dict
returns: str
examples: list = field(default_factory=list)
@dataclass
class AgentStep:
"""Single step in an agent trajectory."""
step_index: int
thought: str
action_type: ActionType
action_input: dict
observation: str
reward: float = 0.0
is_error: bool = False
error_type: Optional[str] = None
latency_ms: float = 0.0
tokens_used: int = 0
@dataclass
class AgentTrajectory:
"""Complete agent trajectory for a task."""
task_id: str
task_description: str
available_tools: list
steps: list
final_answer: Optional[str] = None
task_completed: bool = False
total_steps: int = 0
total_tokens: int = 0
trajectory_reward: float = 0.0
metadata: dict = field(default_factory=dict)
class AgentTraceFormatter:
"""
Format agent trajectories into training-compatible
text sequences.
The format must encode:
- Available tools (system prompt)
- Thought-action-observation loops
- Error handling and recovery
- Final answer extraction
"""
THOUGHT_TAG = "<think>"
THOUGHT_END = "</think>"
ACTION_TAG = "<action>"
ACTION_END = "</action>"
OBSERVATION_TAG = "<observation>"
OBSERVATION_END = "</observation>"
def format_trajectory(self, trajectory):
"""
Convert a trajectory to a text training sequence.
"""
parts = []
# System prompt with tool definitions
tool_defs = self._format_tool_definitions(
trajectory.available_tools
)
parts.append(f"System: {tool_defs}")
# Task
parts.append(
f"User: {trajectory.task_description}"
)
# Steps
for step in trajectory.steps:
# Thought
parts.append(
f"{self.THOUGHT_TAG}"
f"{step.thought}"
f"{self.THOUGHT_END}"
)
# Action
action_str = self._format_action(step)
parts.append(
f"{self.ACTION_TAG}"
f"{action_str}"
f"{self.ACTION_END}"
)
# Observation
obs = step.observation
if len(obs) > 2000:
obs = obs[:2000] + "\n... (truncated)"
parts.append(
f"{self.OBSERVATION_TAG}"
f"{obs}"
f"{self.OBSERVATION_END}"
)
# Final answer
if trajectory.final_answer:
parts.append(
f"Assistant: {trajectory.final_answer}"
)
return "\n\n".join(parts)
def _format_tool_definitions(self, tools):
"""Format tool definitions for system prompt."""
lines = ["You have access to the following tools:\n"]
for tool in tools:
lines.append(f"### {tool.name}")
lines.append(f"Description: {tool.description}")
lines.append(f"Parameters: {tool.parameters}")
lines.append(f"Returns: {tool.returns}")
lines.append("")
return "\n".join(lines)
def _format_action(self, step):
"""Format an action step."""
if step.action_type == ActionType.TOOL_CALL:
return (
f"tool: {step.action_input.get('tool_name', '')}\n"
f"arguments: {step.action_input.get('arguments', {})}"
)
elif step.action_type == ActionType.CODE_EXECUTE:
return (
f"execute_code:\n"
f"```python\n"
f"{step.action_input.get('code', '')}\n"
f"```"
)
elif step.action_type == ActionType.SHELL_COMMAND:
return (
f"shell: {step.action_input.get('command', '')}"
)
else:
return str(step.action_input)
The thought-action-observation (TAO) loop is the standard format across agentic frameworks: ReAct introduced it, and AutoGPT-style agents and most tool-use stacks run variants of it (Toolformer is the exception, interleaving raw API calls without explicit thoughts). The key training signal is the thought: the model must learn to reason about which tool to use and why before acting. Training on action-observation pairs without thoughts produces agents that act without planning and fail on multi-step tasks.
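To make the target format concrete, here is a minimal standalone sketch of what one rendered TAO step looks like. The rendering helper and the toy thought/action/observation values are invented for illustration; the full formatter above handles tool definitions, truncation, and multi-step loops.

```python
# Standalone sketch of the TAO text format described above; the toy
# thought/action/observation values are invented for illustration.
def render_step(thought: str, action: str, observation: str) -> str:
    return (
        f"<think>{thought}</think>\n\n"
        f"<action>{action}</action>\n\n"
        f"<observation>{observation}</observation>"
    )

sample = render_step(
    thought="The traceback points at parse_config; read that file first.",
    action="tool: read_file\narguments: {'path': 'src/config.py'}",
    observation="def parse_config(path):\n    ...",
)
```

The ordering matters: the thought must precede the action in the token stream so the model learns to plan before it acts.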
Trajectory Collection from Real Environments
SWE-Bench Style Task Collection
import subprocess
import json
import os
class SWEBenchTrajectoryCollector:
"""
Collect agent trajectories from real software
engineering tasks (SWE-bench style).
Pipeline:
1. Set up a Git repository at a specific commit
2. Present the agent with an issue description
3. Record every action the agent takes
4. Evaluate the final patch against test suite
5. Label trajectory as success/failure
"""
def __init__(self, workspace_dir, model):
self.workspace_dir = workspace_dir
self.model = model
def collect_trajectory(self, task):
"""
Run an agent on a task and collect the full
trajectory with environment feedback.
"""
# Set up environment
repo_path = self._setup_repo(
task["repo"], task["base_commit"]
)
# Define available tools
tools = [
ToolDefinition(
name="read_file",
description="Read the contents of a file",
parameters={"path": "string"},
returns="File contents as string",
),
ToolDefinition(
name="write_file",
description="Write content to a file",
parameters={
"path": "string",
"content": "string",
},
returns="Success/failure message",
),
ToolDefinition(
name="search_code",
description="Search for a pattern in the codebase",
parameters={
"pattern": "string",
"file_glob": "string",
},
returns="Matching lines with file paths",
),
ToolDefinition(
name="run_tests",
description="Run the test suite",
parameters={"test_path": "string"},
returns="Test results (pass/fail counts)",
),
ToolDefinition(
name="shell",
description="Execute a shell command",
parameters={"command": "string"},
returns="Command output",
),
]
# Run agent
trajectory = AgentTrajectory(
task_id=task["instance_id"],
task_description=task["problem_statement"],
available_tools=tools,
steps=[],
)
context = (
f"Fix the following issue in the {task['repo']} "
f"repository:\n\n{task['problem_statement']}"
)
for step_idx in range(50): # Max 50 steps
# Get agent action
response = self.model.generate(
context,
temperature=0.0,
max_tokens=2048,
)
# Parse action from response
action = self._parse_action(response)
if action is None or action["type"] == "done":
trajectory.final_answer = response
break
# Execute action in environment
observation = self._execute_action(
action, repo_path
)
step = AgentStep(
step_index=step_idx,
thought=action.get("thought", ""),
action_type=ActionType(action["type"]),
action_input=action.get("input", {}),
observation=observation["output"],
is_error=observation.get("is_error", False),
error_type=observation.get("error_type"),
)
trajectory.steps.append(step)
trajectory.total_steps = step_idx + 1
# Update context with new step
context += (
f"\n\nAction: {action}\n"
f"Observation: {observation['output']}"
)
# Evaluate
test_result = self._run_evaluation(
repo_path, task
)
trajectory.task_completed = test_result["passed"]
trajectory.trajectory_reward = (
1.0 if test_result["passed"] else 0.0
)
return trajectory
def _setup_repo(self, repo_name, commit):
"""Set up repository at specified commit."""
repo_path = os.path.join(
self.workspace_dir, repo_name.replace("/", "_")
)
return repo_path
def _parse_action(self, response):
"""Parse an action from model response."""
return None # Placeholder
def _execute_action(self, action, repo_path):
"""Execute an action in the environment."""
return {"output": "", "is_error": False}
def _run_evaluation(self, repo_path, task):
"""Run tests to evaluate the agent's patch."""
return {"passed": False}
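The `_parse_action` stub above is where most real-world breakage happens. A hedged sketch of one way to implement it, assuming the model emits the `<action>…</action>` tags produced by `AgentTraceFormatter` and a `done` convention for task completion (both are assumptions of this sketch, not a fixed protocol):

```python
import re

# Sketch of an action parser for <action>-tagged model output. Assumes the
# "tool: NAME / arguments: {...}" layout from AgentTraceFormatter above;
# a production parser needs stricter validation and argument decoding.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)

def parse_action(response: str):
    match = ACTION_RE.search(response)
    if match is None:
        return None  # no action tag -> treat the response as a final answer
    body = match.group(1).strip()
    if body.lower().startswith("done"):
        return {"type": "done"}
    if body.startswith("tool:"):
        lines = body.splitlines()
        tool_name = lines[0].split(":", 1)[1].strip()
        args_text = ""
        for line in lines[1:]:
            if line.startswith("arguments:"):
                args_text = line.split(":", 1)[1].strip()
        return {
            "type": "tool_call",
            "input": {"tool_name": tool_name, "arguments": args_text},
        }
    if body.startswith("shell:"):
        return {
            "type": "shell_command",
            "input": {"command": body.split(":", 1)[1].strip()},
        }
    return None  # unrecognized action body
```

Returning `None` on malformed output (rather than raising) lets the collector end the trajectory gracefully, which is itself useful failure data.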
Web Browsing Trajectory Collection
class WebBrowsingTrajectoryCollector:
"""
Collect trajectories from web browsing tasks.
The agent navigates web pages to find information
or complete tasks (form filling, data extraction,
comparison shopping).
"""
def __init__(self, browser_env, model):
self.browser_env = browser_env
self.model = model
def collect_trajectory(self, task):
"""
Run agent on a web browsing task.
"""
tools = [
ToolDefinition(
name="click",
description="Click an element on the page",
parameters={"selector": "string"},
returns="Updated page content",
),
ToolDefinition(
name="type",
description="Type text into an input field",
parameters={
"selector": "string",
"text": "string",
},
returns="Updated page content",
),
ToolDefinition(
name="scroll",
description="Scroll the page",
parameters={"direction": "up|down"},
returns="Updated visible content",
),
ToolDefinition(
name="navigate",
description="Navigate to a URL",
parameters={"url": "string"},
returns="Page content",
),
ToolDefinition(
name="extract_text",
description="Extract text from visible page",
parameters={},
returns="Visible text content",
),
]
trajectory = AgentTrajectory(
task_id=task["id"],
task_description=task["instruction"],
available_tools=tools,
steps=[],
)
# Initialize browser
page_content = self.browser_env.reset(
task.get("start_url", "")
)
context = (
f"Task: {task['instruction']}\n\n"
f"Current page:\n{page_content[:5000]}"
)
for step_idx in range(30):
response = self.model.generate(
context,
temperature=0.0,
max_tokens=1024,
)
action = self._parse_action(response)
if action is None:
trajectory.final_answer = response
break
observation = self.browser_env.step(action)
step = AgentStep(
step_index=step_idx,
thought=action.get("thought", ""),
action_type=ActionType.TOOL_CALL,
action_input=action,
observation=str(observation)[:2000],
is_error=observation.get("error", False),
)
trajectory.steps.append(step)
context += (
f"\n\nAction: {action}\n"
f"Page: {str(observation)[:2000]}"
)
return trajectory
def _parse_action(self, response):
"""Parse browser action from model response."""
return None # Placeholder
Agentic Task Domains and Data Properties
| Task Domain | Avg Steps per Task | Tool Count | Success Rate (GPT-4) | Error Recovery Required | Data Collection Cost |
|---|---|---|---|---|---|
| SWE-bench (code repair) | 15-30 | 5 | 12-20% | High | $2-5 per trajectory |
| Web browsing (WebArena) | 8-15 | 5 | 15-25% | Medium | $1-3 per trajectory |
| Data analysis (SQL) | 5-10 | 3 | 40-60% | Medium | $0.5-1 per trajectory |
| API orchestration | 5-15 | 10-20 | 30-50% | High | $1-2 per trajectory |
| File management | 3-8 | 4 | 60-80% | Low | $0.3-0.5 per trajectory |
| Multi-step math | 5-12 | 2 | 35-55% | Medium | $0.5-1 per trajectory |
Synthetic Trajectory Generation
Scaling Agentic Data with Simulation
class SyntheticTrajectoryGenerator:
"""
Generate synthetic agentic trajectories at scale.
Real trajectory collection is expensive (environment
setup, execution time, evaluation). Synthetic generation
uses a strong model to simulate both agent and
environment, producing trajectories 10-100x cheaper.
Strategies:
1. Expert iteration: strong model acts, weak model
learns from successful trajectories
2. Self-play: model generates task + solution
3. Hindsight relabeling: failed trajectories are
relabeled with corrected actions
"""
def __init__(self, expert_model, student_model=None):
self.expert_model = expert_model
self.student_model = student_model
def generate_expert_trajectories(self, tasks,
environments):
"""
Expert iteration: use a strong model to generate
successful trajectories for training a weaker model.
"""
successful = []
failed = []
for task, env in zip(tasks, environments):
trajectory = self._run_expert(task, env)
if trajectory.task_completed:
successful.append(trajectory)
else:
failed.append(trajectory)
return successful, failed
def hindsight_relabel(self, failed_trajectories):
"""
Hindsight relabeling: take a failed trajectory,
identify the first error, and generate the
correct action sequence from that point.
This salvages failed trajectories by converting
them into training data that shows both the
error and the correction.
"""
relabeled = []
for traj in failed_trajectories:
# Find first error step
error_step = None
for step in traj.steps:
if step.is_error:
error_step = step
break
if error_step is None:
continue
# Generate corrected action from error point
prefix_steps = traj.steps[:error_step.step_index]
context = self._build_context_from_steps(
traj.task_description,
prefix_steps,
)
# Add error observation
context += (
f"\n\nPrevious action failed with error: "
f"{error_step.observation}\n"
f"Generate a corrected approach."
)
corrected_response = self.expert_model.generate(
context,
temperature=0.3,
max_tokens=2048,
)
# Create relabeled trajectory
relabeled_traj = AgentTrajectory(
task_id=traj.task_id + "_relabeled",
task_description=traj.task_description,
available_tools=traj.available_tools,
steps=prefix_steps + [
AgentStep(
step_index=error_step.step_index,
thought=(
f"The previous action failed. "
f"Error: {error_step.observation}. "
f"I need to try a different approach."
),
action_type=ActionType.THINK,
action_input={
"content": corrected_response
},
observation="",
is_error=False,
),
],
metadata={"relabeled": True},
)
relabeled.append(relabeled_traj)
return relabeled
def generate_error_recovery_data(self, n_samples=1000):
"""
Explicitly generate error recovery training data.
For each sample:
1. Generate a partial trajectory
2. Inject a realistic error
3. Generate the recovery sequence
"""
error_types = [
{
"name": "file_not_found",
"observation": "Error: FileNotFoundError: "
"No such file or directory",
"recovery": "search for the correct file path",
},
{
"name": "syntax_error",
"observation": "Error: SyntaxError: "
"invalid syntax at line 15",
"recovery": "fix the syntax error and re-run",
},
{
"name": "test_failure",
"observation": "FAILED: 3 tests failed. "
"AssertionError in test_edge_case",
"recovery": "read the failing test, understand "
"the edge case, fix the code",
},
{
"name": "timeout",
"observation": "Error: Command timed out "
"after 30 seconds",
"recovery": "optimize the approach or break "
"into smaller steps",
},
{
"name": "permission_denied",
"observation": "Error: PermissionError: "
"Permission denied",
"recovery": "check file permissions or use "
"a different path",
},
]
import random  # hoisted: importing inside the loop body is wasteful
samples = []
for _ in range(n_samples):
error = random.choice(error_types)
sample = self._generate_recovery_sample(error)
samples.append(sample)
return samples
def _run_expert(self, task, env):
"""Run expert model on a task."""
return AgentTrajectory(
task_id=task["id"],
task_description=task["description"],
available_tools=[],
steps=[],
)
def _build_context_from_steps(self, task, steps):
"""Build context string from trajectory steps."""
context = f"Task: {task}\n\n"
for step in steps:
context += (
f"Thought: {step.thought}\n"
f"Action: {step.action_input}\n"
f"Observation: {step.observation}\n\n"
)
return context
def _generate_recovery_sample(self, error):
"""Generate a single error recovery sample."""
return {} # Placeholder
Error recovery data is the most underrepresented category in agentic training. Most trajectory datasets contain only successful paths. Models trained exclusively on success trajectories freeze or loop when they encounter errors. Explicitly generating error-recovery training data (error injection + correction) improves task completion rates by 8-15% on benchmarks like SWE-bench and WebArena.
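A sketch of what `_generate_recovery_sample` might produce, assuming a purely template-based construction (no model call; the task descriptions are invented placeholders, and a real implementation would have the expert model write the recovery trajectory):

```python
import random

# Template-based recovery sample, as a stand-in for the model-driven
# _generate_recovery_sample above. Task strings are invented placeholders.
def generate_recovery_sample(error: dict, rng: random.Random) -> dict:
    task = rng.choice([
        "Fix the failing unit test in the parser module",
        "Add input validation to the upload handler",
    ])
    return {
        "task": task,
        "error_observation": error["observation"],
        "recovery_thought": (
            f"The last action failed: {error['observation']}. "
            f"I should {error['recovery']}."
        ),
        "error_type": error["name"],
    }

rng = random.Random(0)
error = {
    "name": "file_not_found",
    "observation": "Error: FileNotFoundError: No such file or directory",
    "recovery": "search for the correct file path",
}
sample = generate_recovery_sample(error, rng)
```

The useful part of the sample is the recovery thought: it pairs the error observation with an explicit strategy change, which is exactly the transition success-only data never shows.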
Reward Shaping for Multi-Step Agents
Intermediate Rewards
class AgentRewardShaper:
"""
Shape rewards for multi-step agent trajectories.
Sparse reward (success/failure at the end) produces
poor training signal for long trajectories. Dense
intermediate rewards at each step provide better
gradient signal.
Reward components:
1. Progress reward: did this step make measurable
progress toward the goal?
2. Efficiency reward: was this the most direct action?
3. Information gain: did this action reveal useful
information?
4. Error penalty: did this action cause an error?
5. Outcome reward: task success/failure
"""
def __init__(self, config):
self.progress_weight = config.get(
"progress_weight", 0.3
)
self.efficiency_weight = config.get(
"efficiency_weight", 0.1
)
self.info_gain_weight = config.get(
"info_gain_weight", 0.2
)
self.error_penalty = config.get(
"error_penalty", -0.5
)
self.outcome_weight = config.get(
"outcome_weight", 1.0
)
def shape_trajectory_rewards(self, trajectory):
"""
Assign dense rewards to each step in a trajectory.
"""
steps = trajectory.steps
n_steps = len(steps)
if n_steps == 0:
return trajectory
shaped_rewards = []
for i, step in enumerate(steps):
reward = 0.0
# Progress reward
progress = self._estimate_progress(
step, trajectory
)
reward += self.progress_weight * progress
# Efficiency reward (penalize redundant actions)
if i > 0:
redundancy = self._check_redundancy(
step, steps[:i]
)
reward -= self.efficiency_weight * redundancy
# Information gain
info_gain = self._estimate_info_gain(
step, steps[:i]
)
reward += self.info_gain_weight * info_gain
# Error penalty
if step.is_error:
reward += self.error_penalty
shaped_rewards.append(reward)
# Add outcome reward to last step
if trajectory.task_completed:
shaped_rewards[-1] += self.outcome_weight
else:
shaped_rewards[-1] -= self.outcome_weight * 0.5
# Apply discount factor (earlier steps get
# discounted reward from future steps)
gamma = 0.95
returns = self._compute_returns(
shaped_rewards, gamma
)
for i, step in enumerate(steps):
step.reward = returns[i]
return trajectory
def _estimate_progress(self, step, trajectory):
"""
Estimate progress made by this step.
Heuristics:
- File read that finds relevant code: +0.3
- Code edit that moves toward solution: +0.5
- Test run that passes more tests: +0.8
- Navigation to irrelevant area: -0.1
"""
if step.action_type == ActionType.CODE_EXECUTE:
if "PASSED" in step.observation:
return 0.8
if "Error" in step.observation:
return -0.2
if step.action_type == ActionType.FILE_READ:
return 0.2
if step.action_type == ActionType.FILE_WRITE:
return 0.4
return 0.0
def _check_redundancy(self, step, previous_steps):
"""Check if this step repeats a previous action."""
for prev in previous_steps:
if (
prev.action_type == step.action_type
and prev.action_input == step.action_input
):
return 1.0
return 0.0
def _estimate_info_gain(self, step, previous_steps):
"""Estimate information gained from this step."""
if step.action_type == ActionType.FILE_READ:
return 0.3
if step.action_type == ActionType.WEB_SEARCH:
return 0.4
return 0.1
def _compute_returns(self, rewards, gamma):
"""Compute discounted returns."""
returns = [0.0] * len(rewards)
running_return = 0.0
for i in reversed(range(len(rewards))):
running_return = rewards[i] + gamma * running_return
returns[i] = running_return
return returns
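The discounted-return computation at the end of the shaper is standard backward recursion; a quick standalone check with toy reward values shows how a terminal outcome reward leaks backward to earlier steps:

```python
# Standalone version of _compute_returns with a worked example.
def compute_returns(rewards, gamma):
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# With gamma=0.5 and a single terminal reward of 1.0:
# returns[2] = 1.0, returns[1] = 0 + 0.5*1.0 = 0.5,
# returns[0] = 0 + 0.5*0.5 = 0.25 -- early steps share
# credit for the final success.
returns = compute_returns([0.0, 0.0, 1.0], gamma=0.5)
```

This is why a higher gamma (0.95 above) suits long trajectories: with 20-30 steps, a small gamma would starve early steps of outcome signal entirely.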
Impact of Reward Shaping on Agent Performance
(Figure: agent performance as a function of training steps, 0 to 200, comparing three conditions: dense shaped rewards, sparse outcome reward only, and no reward shaping (SFT only).)
Data Quality for Agent Training
Trajectory Filtering and Augmentation
class TrajectoryQualityFilter:
"""
Filter agent trajectories for training quality.
Not all successful trajectories are good training data.
A trajectory that succeeds after 45 steps of random
exploration is worse than one that succeeds in 8
focused steps.
"""
def __init__(self, config):
self.max_steps = config.get("max_steps", 30)
self.min_steps = config.get("min_steps", 2)
self.max_error_rate = config.get(
"max_error_rate", 0.3
)
self.max_redundancy_rate = config.get(
"max_redundancy_rate", 0.2
)
def filter_trajectories(self, trajectories):
"""
Filter trajectories by quality criteria.
"""
accepted = []
rejected = []
for traj in trajectories:
reasons = self._check_quality(traj)
if not reasons:
accepted.append(traj)
else:
rejected.append((traj, reasons))
return accepted, rejected
def _check_quality(self, trajectory):
"""Run all quality checks on a trajectory."""
reasons = []
steps = trajectory.steps
# Step count bounds
if len(steps) > self.max_steps:
reasons.append(
f"Too many steps: {len(steps)}"
)
if len(steps) < self.min_steps:
reasons.append(
f"Too few steps: {len(steps)}"
)
# Error rate
if steps:
error_rate = sum(
1 for s in steps if s.is_error
) / len(steps)
if error_rate > self.max_error_rate:
reasons.append(
f"High error rate: {error_rate:.2f}"
)
# Redundancy check
if steps:
actions = [
(s.action_type, str(s.action_input))
for s in steps
]
unique = len(set(actions))
redundancy = 1.0 - unique / len(actions)
if redundancy > self.max_redundancy_rate:
reasons.append(
f"High redundancy: {redundancy:.2f}"
)
# Circular behavior detection
if self._has_circular_behavior(steps):
reasons.append("Circular behavior detected")
return reasons
def _has_circular_behavior(self, steps):
"""
Detect circular behavior: agent repeats a
sequence of actions.
Example: read file A -> edit file B -> run tests
-> read file A -> edit file B -> run tests -> ...
"""
if len(steps) < 6:
return False
action_sequence = [
str(s.action_type) + str(s.action_input)
for s in steps
]
# Check for repeating subsequences
# +1 on both bounds: without it, a length-6 sequence ABCABC
# (period 3) and a trailing repetition are never checked
for period in range(2, len(action_sequence) // 2 + 1):
for start in range(
len(action_sequence) - 2 * period + 1
):
subseq = action_sequence[
start:start + period
]
next_subseq = action_sequence[
start + period:start + 2 * period
]
if subseq == next_subseq:
return True
return False
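The period-detection loop above can be exercised standalone; here the step fingerprints are simplified to plain strings for illustration:

```python
# Standalone repeating-subsequence check, mirroring _has_circular_behavior.
def has_cycle(actions, min_len=6):
    if len(actions) < min_len:
        return False
    # Try every period length and start offset; a cycle exists when a
    # subsequence is immediately followed by an identical copy of itself.
    for period in range(2, len(actions) // 2 + 1):
        for start in range(len(actions) - 2 * period + 1):
            if (actions[start:start + period]
                    == actions[start + period:start + 2 * period]):
                return True
    return False

looping = ["read A", "edit B", "run tests"] * 3
focused = ["read A", "search foo", "edit B", "run tests",
           "write patch", "submit"]
```

The check is O(n^2) in trajectory length, which is fine at 30-50 steps; longer trajectories would want a suffix-based method.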
Complete Agentic Data Pipeline
End-to-End Construction
class AgenticDataPipeline:
"""
End-to-end pipeline for building agentic training data.
Stages:
1. Task generation: create diverse tasks
2. Trajectory collection: run agents on tasks
3. Reward shaping: assign dense intermediate rewards
4. Quality filtering: remove low-quality trajectories
5. Augmentation: hindsight relabeling, error injection
6. Formatting: convert to training-ready sequences
"""
def __init__(self, config):
self.trajectory_collector = (
SWEBenchTrajectoryCollector(
workspace_dir=config["workspace"],
model=config["model"],
)
)
self.synthetic_generator = (
SyntheticTrajectoryGenerator(
expert_model=config["expert_model"],
)
)
self.reward_shaper = AgentRewardShaper(
config.get("reward_config", {})
)
self.quality_filter = TrajectoryQualityFilter(
config.get("filter_config", {})
)
self.formatter = AgentTraceFormatter()
def build_dataset(self, tasks, n_augmented=5000):
"""Build complete agentic training dataset."""
# Stage 1: Collect real trajectories
real_trajectories = []
for task in tasks:
traj = self.trajectory_collector.collect_trajectory(
task
)
real_trajectories.append(traj)
# Stage 2: Generate synthetic trajectories
environments = [None] * len(tasks)  # placeholder envs; _run_expert is stubbed
successful, failed = (
self.synthetic_generator
.generate_expert_trajectories(tasks, environments)
)
# Stage 3: Hindsight relabeling of failures
relabeled = (
self.synthetic_generator
.hindsight_relabel(failed)
)
# Stage 4: Error recovery data
error_recovery = (
self.synthetic_generator
.generate_error_recovery_data(n_augmented)
)
# Combine all trajectories
all_trajectories = (
real_trajectories
+ successful
+ relabeled
)
# Stage 5: Reward shaping
for traj in all_trajectories:
self.reward_shaper.shape_trajectory_rewards(traj)
# Stage 6: Quality filtering
accepted, rejected = (
self.quality_filter.filter_trajectories(
all_trajectories
)
)
# Stage 7: Format for training
formatted = []
for traj in accepted:
text = self.formatter.format_trajectory(traj)
formatted.append({
"text": text,
"reward": traj.trajectory_reward,
"steps": traj.total_steps,
"domain": traj.metadata.get("domain", ""),
"is_synthetic": traj.metadata.get(
"synthetic", False
),
})
return {
"training_samples": formatted,
"total_real": len(real_trajectories),
"total_synthetic": len(successful),
"total_relabeled": len(relabeled),
"filtered_out": len(rejected),
"final_count": len(formatted),
}
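The final `training_samples` list is typically serialized to JSONL for the training job; a minimal sketch (the field names mirror the dict built in `build_dataset` above, and the sample values are placeholders):

```python
import io
import json

# Minimal JSONL export of formatted samples; field names follow the
# dict constructed in build_dataset above.
def write_jsonl(samples, fh):
    for sample in samples:
        fh.write(json.dumps(sample, ensure_ascii=False) + "\n")

buf = io.StringIO()
write_jsonl(
    [{"text": "System: ...", "reward": 1.0, "steps": 8,
      "domain": "swe", "is_synthetic": False}],
    buf,
)
lines = buf.getvalue().splitlines()
```

One record per line keeps the dataset streamable and makes per-sample metadata (reward, step count, provenance) available for mixture weighting at training time.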
Agentic Training Data: Impact on SWE-bench Resolve Rate
(Figure: SWE-bench resolve rate as a function of training data scale, comparing three data mixes: real trajectories only, real + synthetic, and real + synthetic + error recovery.)
Key Takeaways
Agentic training data requires a fundamentally different format from chat data. The thought-action-observation triple structure encodes not just what the model should say but what it should do, how the environment responds, and how to reason about the next action.
The critical engineering decisions:
- Error recovery data is the highest-leverage addition: models trained only on success trajectories fail catastrophically when they encounter errors. Adding explicit error-recovery training data (error injection + correction sequences) improves task completion rates by 8-15% on multi-step benchmarks.
- Dense reward shaping accelerates training: sparse outcome rewards (success/failure) provide weak signal for 15-30 step trajectories. Dense intermediate rewards (progress estimation, information gain, efficiency penalties) improve learning speed by 50-100%, measured in training steps to target performance.
- Hindsight relabeling salvages failed trajectories: 60-80% of collected trajectories fail. Relabeling failures by identifying the error point and generating corrections produces training data that explicitly teaches when and how to change strategy.
- Trajectory quality matters more than quantity: filtering for efficiency (low step count), low redundancy, and no circular behavior removes 30-50% of raw trajectories but produces better downstream performance per training token.
- Synthetic trajectories scale cheaply: expert-model-generated trajectories cost 10-100x less than real environment execution. A mix of real (for grounding) and synthetic (for coverage) produces the best training data.