Synthetic data generation typically follows a simple pattern: prompt a strong model, collect the output, filter for quality. This works for instruction-response pairs, but it produces data that is fundamentally single-turn and context-free. Real user behavior is neither. Users arrive with history, preferences, constraints, and goals that evolve over multiple interactions. A customer who bought a camera last week asks different questions about lenses than a first-time buyer. A budget-conscious shopper negotiates differently than someone optimizing for speed.
Agent-based simulation generates data with this structure. Instead of prompting a model once per example, you build a virtual environment — a marketplace, a help desk, a negotiation table — populate it with thousands of LLM-powered agents that have persistent memory and defined personas, and let them interact. The resulting conversations have context, history, multi-turn dynamics, and behavioral diversity that single-shot generation cannot produce.
This post builds a complete agent simulation system: the environment, the agent architecture, the persona system, the interaction loop, the data extraction pipeline, the quality scoring, and the cost analysis for generating 1M training examples.
1. Why Agent Simulation: The Data Gap
Standard synthetic data generation produces isolated examples. Each example is independent — there is no shared context, no user history, no evolving state. But models deployed in production face contextual interactions:
Contextual Dimensions Missing from Single-Shot Synthetic Data
| Dimension | Single-Shot Generation | Agent Simulation | Impact on Model Quality |
|---|---|---|---|
| User history | None — each example is independent | Agent has purchase history, past questions | Model learns to reference prior context |
| Multi-turn dynamics | Usually single-turn Q&A | Conversations run 3-15 turns | Model learns turn-taking, follow-ups |
| Negotiation patterns | Not representable | Agents haggle, counter-offer, compromise | Model learns persuasion, concession logic |
| Emotional trajectories | Static tone per example | Frustration builds over failed attempts | Model learns empathy, escalation handling |
| Referential coherence | No shared entities | Agents reference same products, prices | Model learns entity tracking across turns |
| Time-dependent behavior | No temporal structure | Agent behavior changes over simulated days | Model learns recency, urgency patterns |
The fundamental insight: a simulation environment provides structure that pure prompting cannot. The environment enforces consistency (prices are real, inventory is tracked, transactions are valid), and the agent personas provide behavioral diversity (a budget shopper and a tech enthusiast generate different conversations about the same product).
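To make the table concrete, here is roughly what one record of each kind looks like (the field names are illustrative, not a fixed schema):

```python
# A single-shot synthetic example: independent, stateless, one turn.
single_shot = {
    "instruction": "What's a good lens for portrait photography?",
    "response": "An 85mm prime is a popular portrait choice because...",
}

# A simulation-derived example: persona, purchase history, multi-turn state.
simulated = {
    "conversation": [
        {"role": "user", "content": "I bought the X100 last week. Which lens pairs well with it for portraits?"},
        {"role": "assistant", "content": "Since you already own the X100 body..."},
        {"role": "user", "content": "That's over my $300 budget. Anything cheaper?"},
        {"role": "assistant", "content": "The 50mm f/2 comes in at $249 and..."},
    ],
    "metadata": {
        "purchase_history": ["camera_x100"],
        "persona_style": "budget",
        "num_turns": 2,
    },
}

# The second record carries the table's missing dimensions: user history,
# multi-turn structure, and entities referenced across turns.
assert simulated["metadata"]["num_turns"] == 2
```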
2. Environment Design: The Simulated Marketplace
The environment is the shared state that all agents interact with. For a marketplace simulation, this includes products, prices, inventory, reviews, and transaction history.
import json
import random
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta
@dataclass
class Product:
product_id: str
name: str
category: str
price: float
original_price: float
description: str
specs: dict
rating: float
num_reviews: int
inventory: int
seller_id: str
@dataclass
class Review:
review_id: str
product_id: str
author_id: str
rating: int # 1-5
title: str
body: str
helpful_votes: int
verified_purchase: bool
timestamp: str
@dataclass
class Transaction:
transaction_id: str
buyer_id: str
product_id: str
price_paid: float
quantity: int
timestamp: str
negotiated: bool # Was there a price negotiation?
discount_pct: float # 0.0 if no discount
class MarketplaceEnvironment:
"""
Shared state for all agents. Tracks products, inventory,
transactions, and reviews. Enforces consistency constraints.
"""
def __init__(self, seed=42):
self.rng = random.Random(seed)
self.products = {}
self.reviews = {} # product_id -> list of Review
self.transactions = []
self.agent_purchase_history = {} # agent_id -> list of Transaction
self.agent_review_history = {} # agent_id -> list of Review
self.current_time = datetime(2025, 1, 1)
def generate_catalog(self, num_products=500):
"""Generate a product catalog with realistic attributes."""
categories = {
"electronics": {
"names": [
"Wireless Noise-Canceling Headphones",
"4K Webcam with Ring Light",
"Mechanical Keyboard (Cherry MX)",
"USB-C Docking Station 12-in-1",
"Portable SSD 2TB",
"Smart Home Hub",
"Bluetooth Speaker Waterproof",
"Laptop Stand Aluminum",
"Wireless Mouse Ergonomic",
"LED Monitor 27-inch 144Hz",
],
"price_range": (29.99, 599.99),
"spec_keys": ["brand", "warranty", "connectivity", "weight"],
},
"home": {
"names": [
"Robot Vacuum with Mapping",
"Air Purifier HEPA H13",
"Instant Pot 8-Quart",
"Memory Foam Mattress Topper",
"Smart Thermostat",
"Cordless Stick Vacuum",
"LED Desk Lamp Dimmable",
"Electric Kettle Temperature Control",
"Weighted Blanket 15lbs",
"Water Filter Pitcher",
],
"price_range": (19.99, 399.99),
"spec_keys": ["brand", "dimensions", "power", "warranty"],
},
"outdoor": {
"names": [
"Hiking Backpack 65L",
"Camping Tent 4-Person",
"Trekking Poles Carbon Fiber",
"Portable Water Filter",
"Solar Charger 28W",
"Sleeping Bag 20F",
"Headlamp 1000 Lumens",
"Camp Stove Ultralight",
"Dry Bag 30L Waterproof",
"GPS Watch with Topo Maps",
],
"price_range": (24.99, 449.99),
"spec_keys": ["brand", "weight", "material", "capacity"],
},
}
for i in range(num_products):
cat = self.rng.choice(list(categories.keys()))
cat_info = categories[cat]
base_name = self.rng.choice(cat_info["names"])
variant = self.rng.choice(["Pro", "Plus", "Lite", "V2", "Max", ""])
name = f"{base_name} {variant}".strip()
lo, hi = cat_info["price_range"]
price = round(self.rng.uniform(lo, hi), 2)
product = Product(
product_id=f"prod_{i:04d}",
name=name,
category=cat,
price=price,
original_price=round(price * self.rng.uniform(1.0, 1.4), 2),
description=f"High-quality {name.lower()} for {cat} use.",
specs={k: f"spec_{k}_{self.rng.randint(1, 5)}" for k in cat_info["spec_keys"]},
rating=round(self.rng.uniform(2.5, 5.0), 1),
num_reviews=self.rng.randint(5, 2000),
inventory=self.rng.randint(0, 500),
seller_id=f"seller_{self.rng.randint(0, 49):03d}",
)
self.products[product.product_id] = product
self.reviews[product.product_id] = []
def execute_purchase(self, buyer_id, product_id, price_paid, quantity=1):
"""Process a purchase. Returns Transaction or None if out of stock."""
product = self.products.get(product_id)
if not product or product.inventory < quantity:
return None
product.inventory -= quantity
discount_pct = round(
1.0 - (price_paid / product.price), 4
) if price_paid < product.price else 0.0
txn = Transaction(
transaction_id=str(uuid.uuid4())[:8],
buyer_id=buyer_id,
product_id=product_id,
price_paid=price_paid,
quantity=quantity,
timestamp=self.current_time.isoformat(),
negotiated=discount_pct > 0,
discount_pct=discount_pct,
)
self.transactions.append(txn)
if buyer_id not in self.agent_purchase_history:
self.agent_purchase_history[buyer_id] = []
self.agent_purchase_history[buyer_id].append(txn)
return txn
def submit_review(self, author_id, product_id, rating, title, body):
"""Submit a product review. Enforces verified purchase check."""
purchases = self.agent_purchase_history.get(author_id, [])
verified = any(t.product_id == product_id for t in purchases)
review = Review(
review_id=str(uuid.uuid4())[:8],
product_id=product_id,
author_id=author_id,
rating=rating,
title=title,
body=body,
helpful_votes=0,
verified_purchase=verified,
timestamp=self.current_time.isoformat(),
)
self.reviews[product_id].append(review)
if author_id not in self.agent_review_history:
self.agent_review_history[author_id] = []
self.agent_review_history[author_id].append(review)
# Update product rating
product = self.products[product_id]
all_ratings = [r.rating for r in self.reviews[product_id]]
product.rating = round(sum(all_ratings) / len(all_ratings), 1)
product.num_reviews = len(all_ratings)
return review
def advance_time(self, hours=1):
"""Advance simulated time."""
self.current_time += timedelta(hours=hours)
def get_product_context(self, product_id):
"""
Build a context string for a product, including recent reviews.
This is what an agent 'sees' when browsing.
"""
product = self.products[product_id]
recent_reviews = self.reviews[product_id][-5:] # Last 5 reviews
review_text = ""
for r in recent_reviews:
stars = "*" * r.rating
review_text += (
f" [{stars}] {r.title} - {r.body[:100]}...\n"
)
return (
f"Product: {product.name}\n"
f"Category: {product.category}\n"
f"Price: ${product.price:.2f} "
f"(was ${product.original_price:.2f})\n"
f"Rating: {product.rating}/5.0 ({product.num_reviews} reviews)\n"
f"In Stock: {'Yes' if product.inventory > 0 else 'No'} "
f"({product.inventory} units)\n"
f"Recent Reviews:\n{review_text}"
)
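The consistency guarantees matter more than the catalog details. A toy version of `execute_purchase` (stripped of prices and dataclasses) isolates the invariant the full environment enforces: shared inventory mutates only on a successful transaction, so no two agents can buy the same last unit.

```python
# Toy sketch of the environment's purchase invariant; TinyEnv is an
# illustrative stand-in, not the MarketplaceEnvironment class above.
class TinyEnv:
    def __init__(self):
        self.inventory = {"prod_0001": 2}
        self.transactions = []

    def execute_purchase(self, buyer_id, product_id, quantity=1):
        stock = self.inventory.get(product_id, 0)
        if stock < quantity:
            return None  # out of stock: no partial state change
        self.inventory[product_id] = stock - quantity
        txn = {"buyer": buyer_id, "product": product_id, "qty": quantity}
        self.transactions.append(txn)
        return txn

env = TinyEnv()
assert env.execute_purchase("agent_a", "prod_0001") is not None
assert env.execute_purchase("agent_b", "prod_0001") is not None
assert env.execute_purchase("agent_c", "prod_0001") is None  # sold out
assert env.inventory["prod_0001"] == 0
```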
3. Agent Architecture: LLM-Powered Agents with Memory and Goals
Each agent is an LLM wrapped with a persistent persona, memory of past actions, a current goal, and tools that interact with the marketplace environment.
The Persona System
Personas define how an agent behaves. Each persona specifies a name, demographics, shopping preferences, communication style, and budget constraints.
@dataclass
class Persona:
persona_id: str
name: str
age: int
occupation: str
personality_traits: list # e.g., ["frugal", "analytical", "impatient"]
shopping_style: str # "budget", "quality", "impulse", "researcher"
budget_range: tuple # (min_per_item, max_per_item)
preferred_categories: list
communication_style: str # "formal", "casual", "terse", "verbose"
expertise_level: str # "novice", "intermediate", "expert"
def generate_personas(num_personas, rng):
"""Generate diverse agent personas."""
first_names = [
"Alex", "Jordan", "Sam", "Casey", "Morgan", "Taylor",
"Riley", "Quinn", "Avery", "Blake", "Cameron", "Drew",
"Emery", "Finley", "Harper", "Jamie", "Kai", "Logan",
]
occupations = [
"software engineer", "teacher", "nurse", "accountant",
"freelance designer", "retired", "student", "small business owner",
"marketing manager", "data analyst", "chef", "lawyer",
]
personality_pool = [
"frugal", "analytical", "impatient", "patient", "detail-oriented",
"spontaneous", "skeptical", "trusting", "tech-savvy", "tech-averse",
"environmentally-conscious", "brand-loyal", "deal-hunter",
]
styles = ["budget", "quality", "impulse", "researcher"]
comm_styles = ["formal", "casual", "terse", "verbose"]
expertise = ["novice", "intermediate", "expert"]
categories = ["electronics", "home", "outdoor"]
personas = []
for i in range(num_personas):
budget_style = rng.choice(styles)
if budget_style == "budget":
budget = (10, 100)
elif budget_style == "quality":
budget = (50, 500)
elif budget_style == "impulse":
budget = (15, 300)
else:
budget = (20, 400)
persona = Persona(
persona_id=f"agent_{i:05d}",
name=rng.choice(first_names),
age=rng.randint(18, 75),
occupation=rng.choice(occupations),
personality_traits=rng.sample(personality_pool, k=3),
shopping_style=budget_style,
budget_range=budget,
preferred_categories=rng.sample(
categories, k=rng.randint(1, 3)
),
communication_style=rng.choice(comm_styles),
expertise_level=rng.choice(expertise),
)
personas.append(persona)
return personas
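Two properties of the generator are worth pinning down: `rng.sample` draws traits without replacement, so no persona ends up with a duplicate trait, and seeding the `Random` instance makes the whole population reproducible across runs. A standalone check of just the trait-sampling step (using a truncated trait pool for brevity):

```python
import random
from collections import Counter

personality_pool = [
    "frugal", "analytical", "impatient", "patient", "detail-oriented",
    "spontaneous", "skeptical", "trusting", "tech-savvy", "tech-averse",
]

def sample_traits(n, seed):
    rng = random.Random(seed)
    return [rng.sample(personality_pool, k=3) for _ in range(n)]

population = sample_traits(1000, seed=42)
assert all(len(set(traits)) == 3 for traits in population)  # no duplicate traits
assert population == sample_traits(1000, seed=42)           # reproducible

# Every trait in the pool shows up somewhere in a population this size.
seen = Counter(t for traits in population for t in traits)
assert set(seen) == set(personality_pool)
```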
The Agent Class
Each agent wraps an LLM call with its persona, memory, and tool access.
class ShoppingAgent:
"""
An LLM-powered agent that interacts with the marketplace.
Has persistent memory, a defined persona, and tools for
browsing, purchasing, reviewing, and negotiating.
"""
def __init__(self, persona, environment, llm_client):
self.persona = persona
self.env = environment
self.llm = llm_client
self.memory = AgentMemory(max_entries=50)
self.current_goal = None
self.conversation_history = []
def build_system_prompt(self):
"""Construct the agent's system prompt from its persona."""
p = self.persona
traits = ", ".join(p.personality_traits)
purchase_history = self.env.agent_purchase_history.get(
p.persona_id, []
)
recent_purchases = purchase_history[-5:]
history_text = ""
for txn in recent_purchases:
prod = self.env.products.get(txn.product_id)
if prod:
history_text += (
f" - {prod.name} (${txn.price_paid:.2f}, "
f"{txn.timestamp})\n"
)
return f"""You are {p.name}, a {p.age}-year-old {p.occupation}.
Personality: {traits}
Shopping style: {p.shopping_style}
Budget: ${p.budget_range[0]}-${p.budget_range[1]} per item
Preferred categories: {', '.join(p.preferred_categories)}
Communication style: {p.communication_style}
Technical expertise: {p.expertise_level}
Recent purchases:
{history_text if history_text else ' (none yet)'}
You are browsing an online marketplace. You can:
1. ASK a question about a product
2. NEGOTIATE a price with the seller
3. PURCHASE a product
4. REVIEW a product you have purchased
5. COMPARE two products
6. DECLINE to buy and end the interaction
Respond in character. Your responses should reflect your personality,
budget constraints, and expertise level. Do not break character.
When you decide to take an action, format it as:
ACTION: [action_type]
PRODUCT: [product_id]
DETAILS: [your message or review text]
"""
async def generate_interaction(self, scenario):
"""
Generate a single interaction based on a scenario.
Returns the full conversation log.
"""
system_prompt = self.build_system_prompt()
self.conversation_history = []
# Build the scenario context
if scenario["type"] == "product_question":
product_context = self.env.get_product_context(
scenario["product_id"]
)
user_msg = (
f"You are looking at this product:\n\n{product_context}\n\n"
f"Based on your persona and needs, ask a question about "
f"this product or decide whether to buy it."
)
elif scenario["type"] == "negotiation":
product_context = self.env.get_product_context(
scenario["product_id"]
)
user_msg = (
f"You want to buy this product but the price is above "
f"your budget:\n\n{product_context}\n\n"
f"Try to negotiate a lower price with the seller."
)
elif scenario["type"] == "review":
product_context = self.env.get_product_context(
scenario["product_id"]
)
user_msg = (
f"You recently purchased this product:\n\n{product_context}\n\n"
f"Write a review based on your experience. Your review "
f"should reflect your expertise level and communication style."
)
elif scenario["type"] == "comparison":
ctx_a = self.env.get_product_context(scenario["product_a"])
ctx_b = self.env.get_product_context(scenario["product_b"])
user_msg = (
f"You are comparing these two products:\n\n"
f"--- Option A ---\n{ctx_a}\n\n"
f"--- Option B ---\n{ctx_b}\n\n"
f"Analyze both options based on your priorities and "
f"decide which to buy (or neither)."
)
else:
raise ValueError(f"Unknown scenario type: {scenario['type']}")
# Multi-turn conversation loop
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_msg},
]
max_turns = scenario.get("max_turns", 6)
for turn in range(max_turns):
response = await self.llm.chat(messages)
assistant_msg = response["content"]
messages.append({"role": "assistant", "content": assistant_msg})
# Parse action if present
action = self._parse_action(assistant_msg)
if action:
# Execute action in environment
result = self._execute_action(action)
messages.append({"role": "user", "content": result})
# Store in memory
self.memory.add(
f"Turn {turn}: {action['type']} on "
f"{action.get('product_id', 'N/A')} -> {result[:100]}"
)
# Check if conversation should end
if action["type"] in ("purchase", "review", "decline"):
break
else:
# No action parsed — continue conversation with follow-up
follow_up = self._generate_follow_up(scenario, turn)
if follow_up:
messages.append({"role": "user", "content": follow_up})
else:
break
return ConversationLog(
agent_id=self.persona.persona_id,
persona=self.persona,
scenario=scenario,
messages=messages,
num_turns=len(
[m for m in messages if m["role"] == "assistant"]
),
actions_taken=[
self._parse_action(m["content"])
for m in messages
if m["role"] == "assistant" and self._parse_action(m["content"])
],
timestamp=self.env.current_time.isoformat(),
)
def _parse_action(self, text):
"""Extract structured action from agent response."""
lines = text.strip().split("\n")
action = {}
for line in lines:
if line.startswith("ACTION:"):
action["type"] = line.split(":", 1)[1].strip().lower()
elif line.startswith("PRODUCT:"):
action["product_id"] = line.split(":", 1)[1].strip()
elif line.startswith("DETAILS:"):
action["details"] = line.split(":", 1)[1].strip()
return action if "type" in action else None
def _execute_action(self, action):
"""Execute an action in the environment, return result string."""
if action["type"] == "purchase":
price = self.env.products[action["product_id"]].price
txn = self.env.execute_purchase(
self.persona.persona_id,
action["product_id"],
price,
)
if txn:
return (
f"Purchase successful. Order #{txn.transaction_id}. "
f"Paid ${txn.price_paid:.2f}. "
f"Estimated delivery: 3-5 business days."
)
return "Purchase failed: item is out of stock."
elif action["type"] == "review":
# Parse rating from details
rating = 4 # default
details = action.get("details", "")
for word in details.split():
if word.isdigit() and 1 <= int(word) <= 5:
rating = int(word)
break
review = self.env.submit_review(
self.persona.persona_id,
action["product_id"],
rating,
action.get("details", "")[:50], # Title from first 50 chars
action.get("details", ""),
)
return (
f"Review submitted. Review #{review.review_id}. "
f"Rating: {review.rating}/5. "
f"{'Verified purchase' if review.verified_purchase else 'Unverified'}."
)
elif action["type"] == "negotiate":
# Seller responds based on inventory and margin
product = self.env.products[action["product_id"]]
if product.inventory > 50:
# High inventory: willing to negotiate
offer = round(product.price * 0.9, 2)
return (
f"Seller response: We can offer ${offer:.2f} "
f"(10% off) for this item. "
f"This is our best price for today."
)
else:
return (
f"Seller response: This item is in high demand with "
f"limited stock ({product.inventory} remaining). "
f"We cannot offer a discount at this time."
)
elif action["type"] == "ask":
product = self.env.products.get(action.get("product_id"))
if product:
return (
f"Customer Service: {product.name} — {product.description} "
f"Specs: {json.dumps(product.specs)}. "
f"Currently rated {product.rating}/5.0 with "
f"{product.num_reviews} reviews."
)
return "Customer Service: Product not found."
return f"Unknown action type: {action['type']}"
def _generate_follow_up(self, scenario, turn):
"""Generate a follow-up prompt for multi-turn conversations."""
if turn >= 4:
return None # End after enough turns
follow_ups = {
"product_question": [
"Does this answer your question? Would you like to know "
"anything else, or are you ready to make a decision?",
"Based on this information, would you like to purchase, "
"or do you have more questions?",
],
"negotiation": [
"That is the seller's response. Would you like to "
"counter-offer, accept, or walk away?",
"The seller has responded. What is your next move?",
],
}
prompts = follow_ups.get(scenario["type"], [])
if turn < len(prompts):
return prompts[turn]
return None
class AgentMemory:
"""Simple FIFO memory for agent's past actions."""
def __init__(self, max_entries=50):
self.entries = []
self.max_entries = max_entries
def add(self, entry):
self.entries.append(entry)
if len(self.entries) > self.max_entries:
self.entries.pop(0)
def get_recent(self, n=10):
return self.entries[-n:]
def to_text(self):
return "\n".join(self.entries)
In this simulation, the seller/customer-service responses are rule-based, not LLM-generated. This is intentional. The training data we want is the buyer’s behavior — their questions, negotiation tactics, review writing, and decision-making. The seller side provides context but does not need to be realistic at the same level. Using an LLM for both sides doubles the API cost with minimal improvement in the buyer-side data quality.
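The same asymmetry makes the agent side easy to dry-run: `ShoppingAgent` only requires that `llm_client` expose an async `chat(messages)` returning a dict with a `content` key. A scripted stand-in lets you exercise the interaction loop (and unit-test `_parse_action` / `_execute_action`) with zero API cost; the `ScriptedLLM` name and canned replies here are illustrative.

```python
import asyncio

class ScriptedLLM:
    """Returns canned replies in order; repeats the last one if exhausted."""

    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = 0

    async def chat(self, messages):
        idx = min(self.calls, len(self.responses) - 1)
        self.calls += 1
        return {"content": self.responses[idx]}

async def demo():
    llm = ScriptedLLM([
        "Does this come with a warranty?",
        "ACTION: purchase\nPRODUCT: prod_0001\nDETAILS: I'll take it.",
    ])
    first = await llm.chat([{"role": "user", "content": "You are browsing..."}])
    second = await llm.chat([])
    return first["content"], second["content"]

first, second = asyncio.run(demo())
assert first.startswith("Does this come")
assert second.splitlines()[0] == "ACTION: purchase"
```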
4. The Simulation Loop: 10,000 Agents, 100 Interactions Each
The simulation runs in batches. Each batch assigns scenarios to agents, runs interactions concurrently (bounded by API rate limits), and collects conversation logs.
import asyncio
from dataclasses import dataclass
@dataclass
class ConversationLog:
agent_id: str
persona: Persona
scenario: dict
messages: list
num_turns: int
actions_taken: list
timestamp: str
class SimulationRunner:
"""
Orchestrates the full simulation:
- Assigns scenarios to agents
- Runs interactions concurrently
- Collects and stores conversation logs
"""
def __init__(
self,
environment,
agents,
llm_client,
max_concurrent=50,
output_path="simulation_output.jsonl",
):
self.env = environment
self.agents = {a.persona.persona_id: a for a in agents}
self.llm = llm_client
self.semaphore = asyncio.Semaphore(max_concurrent)
self.output_path = output_path
self.logs = []
self.stats = {
"total_interactions": 0,
"total_turns": 0,
"total_purchases": 0,
"total_reviews": 0,
"total_negotiations": 0,
"errors": 0,
}
def generate_scenarios(self, agent, num_scenarios=100):
"""
Generate a sequence of scenarios for one agent.
Scenarios are ordered to create a realistic shopping journey:
browse -> question -> negotiate -> purchase -> review.
"""
# random.Random(str) seeds deterministically across runs, unlike the
# built-in hash(), which is salted per process
rng = random.Random(agent.persona.persona_id)
scenarios = []
# Select products based on persona preferences
preferred_products = [
p for p in self.env.products.values()
if p.category in agent.persona.preferred_categories
]
if not preferred_products:
preferred_products = list(self.env.products.values())
for i in range(num_scenarios):
scenario_type_roll = rng.random()
if scenario_type_roll < 0.30:
# Product question (30%)
product = rng.choice(preferred_products)
scenarios.append({
"type": "product_question",
"product_id": product.product_id,
"max_turns": rng.randint(3, 8),
})
elif scenario_type_roll < 0.50:
# Negotiation (20%)
# Pick a product above budget
expensive = [
p for p in preferred_products
if p.price > agent.persona.budget_range[1]
]
if expensive:
product = rng.choice(expensive)
else:
product = rng.choice(preferred_products)
scenarios.append({
"type": "negotiation",
"product_id": product.product_id,
"max_turns": rng.randint(4, 10),
})
elif scenario_type_roll < 0.70:
# Comparison (20%)
if len(preferred_products) >= 2:
products = rng.sample(preferred_products, 2)
scenarios.append({
"type": "comparison",
"product_a": products[0].product_id,
"product_b": products[1].product_id,
"max_turns": rng.randint(2, 6),
})
elif scenario_type_roll < 0.85:
# Purchase (15% — but only if agent has browsed enough)
product = rng.choice(preferred_products)
within_budget = (
product.price <= agent.persona.budget_range[1]
)
if within_budget:
scenarios.append({
"type": "product_question",
"product_id": product.product_id,
"max_turns": rng.randint(2, 4),
"hint": "lean_toward_purchase",
})
else:
# Review (15% — only for products the agent has purchased)
purchases = self.env.agent_purchase_history.get(
agent.persona.persona_id, []
)
if purchases:
txn = rng.choice(purchases)
scenarios.append({
"type": "review",
"product_id": txn.product_id,
"max_turns": 2,
})
return scenarios
async def run_single_interaction(self, agent, scenario):
"""Run one agent interaction with concurrency limiting."""
async with self.semaphore:
try:
log = await agent.generate_interaction(scenario)
self.logs.append(log)
self.stats["total_interactions"] += 1
self.stats["total_turns"] += log.num_turns
for action in log.actions_taken:
if action and action.get("type") == "purchase":
self.stats["total_purchases"] += 1
elif action and action.get("type") == "review":
self.stats["total_reviews"] += 1
elif action and action.get("type") == "negotiate":
self.stats["total_negotiations"] += 1
return log
except Exception as e:
self.stats["errors"] += 1
return None
async def run_simulation(self, interactions_per_agent=100):
"""
Run the full simulation across all agents.
"""
all_tasks = []
for agent_id, agent in self.agents.items():
scenarios = self.generate_scenarios(
agent, num_scenarios=interactions_per_agent
)
for scenario in scenarios:
task = self.run_single_interaction(agent, scenario)
all_tasks.append(task)
# Run with progress reporting
total = len(all_tasks)
completed = 0
batch_size = 500
for i in range(0, total, batch_size):
batch = all_tasks[i:i + batch_size]
results = await asyncio.gather(*batch, return_exceptions=True)
completed += len(batch)
# Advance simulated time
self.env.advance_time(hours=1)
# Progress report
if completed % 5000 == 0:
print(
f"Progress: {completed}/{total} interactions "
f"({completed/total*100:.1f}%), "
f"Errors: {self.stats['errors']}"
)
return self.stats
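The cut points 0.30 / 0.50 / 0.70 / 0.85 in `generate_scenarios` encode a 30/20/20/15/15 scenario mix. A quick simulation of just the roll logic confirms the distribution; note this checks the rolls only, since in the real runner the purchase and review branches can yield nothing when no product fits the budget or no purchase history exists yet.

```python
import random
from collections import Counter

def roll_type(rng):
    # Same threshold structure as generate_scenarios above.
    r = rng.random()
    if r < 0.30:
        return "product_question"
    elif r < 0.50:
        return "negotiation"
    elif r < 0.70:
        return "comparison"
    elif r < 0.85:
        return "purchase_leaning"
    return "review"

rng = random.Random(0)
counts = Counter(roll_type(rng) for _ in range(100_000))
assert abs(counts["product_question"] / 100_000 - 0.30) < 0.01
assert abs(counts["negotiation"] / 100_000 - 0.20) < 0.01
assert abs(counts["review"] / 100_000 - 0.15) < 0.01
```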
5. Data Extraction: From Conversations to Training Examples
Raw conversation logs are not training data. They must be extracted, filtered, and formatted into training examples.
class DataExtractor:
"""
Extracts structured training examples from raw conversation logs.
Produces multiple example types from each conversation.
"""
def __init__(self):
self.examples = []
def extract_all(self, logs):
"""Process all conversation logs into training examples."""
for log in logs:
self.examples.extend(self._extract_from_log(log))
return self.examples
def _extract_from_log(self, log):
"""Extract multiple training example types from one conversation."""
examples = []
# Type 1: Full conversation (for multi-turn training)
examples.append(self._extract_full_conversation(log))
# Type 2: Individual turns (for single-turn training)
examples.extend(self._extract_individual_turns(log))
# Type 3: Action prediction (for tool-use training)
examples.extend(self._extract_action_examples(log))
# Type 4: Review text (for review generation training)
examples.extend(self._extract_review_examples(log))
return [e for e in examples if e is not None]
def _extract_full_conversation(self, log):
"""Full multi-turn conversation as one training example."""
# Filter out system prompt — include only user/assistant turns
conversation_turns = [
m for m in log.messages
if m["role"] in ("user", "assistant")
]
if len(conversation_turns) < 2:
return None
return {
"type": "multi_turn_conversation",
"conversation": conversation_turns,
"metadata": {
"agent_persona": log.persona.shopping_style,
"scenario_type": log.scenario["type"],
"num_turns": log.num_turns,
"actions": [
a["type"] for a in log.actions_taken if a
],
},
}
def _extract_individual_turns(self, log):
"""Each user-assistant pair as a separate single-turn example."""
examples = []
messages = log.messages
for i in range(len(messages) - 1):
if messages[i]["role"] == "user" and messages[i + 1]["role"] == "assistant":
# Build context from all previous turns
context = messages[:i]
context_text = ""
for ctx in context:
if ctx["role"] == "system":
continue
context_text += f"[{ctx['role']}]: {ctx['content'][:200]}\n"
examples.append({
"type": "single_turn",
"context": context_text if context_text else None,
"instruction": messages[i]["content"],
"response": messages[i + 1]["content"],
"metadata": {
"turn_number": i // 2,
"persona_style": log.persona.communication_style,
"expertise": log.persona.expertise_level,
},
})
return examples
def _extract_action_examples(self, log):
"""Extract examples where agent decides to take a structured action."""
examples = []
for i, msg in enumerate(log.messages):
if msg["role"] != "assistant":
continue
action = None
lines = msg["content"].strip().split("\n")
for line in lines:
if line.startswith("ACTION:"):
action = line
break
if action:
# Previous user message is the context
prev_user = None
for j in range(i - 1, -1, -1):
if log.messages[j]["role"] == "user":
prev_user = log.messages[j]["content"]
break
if prev_user:
examples.append({
"type": "action_prediction",
"context": prev_user,
"action_taken": msg["content"],
"metadata": {
"scenario": log.scenario["type"],
"persona_budget": log.persona.budget_range,
},
})
return examples
def _extract_review_examples(self, log):
"""Extract review writing examples."""
examples = []
if log.scenario.get("type") != "review":
return examples
for msg in log.messages:
if msg["role"] == "assistant" and "DETAILS:" in msg["content"]:
examples.append({
"type": "review_generation",
"product_context": log.scenario.get("product_id"),
"review_text": msg["content"],
"metadata": {
"persona_expertise": log.persona.expertise_level,
"persona_style": log.persona.communication_style,
},
})
return examples
def to_jsonl(self, output_path):
"""Write all examples as JSONL."""
with open(output_path, "w") as f:
for example in self.examples:
f.write(json.dumps(example) + "\n")
return len(self.examples)
Each conversation log produces 3-8 training examples (one multi-turn, several single-turn, action predictions, and sometimes a review). With 10,000 agents running 100 interactions each, the raw log count is 1M conversations, but the extracted example count is 3M-8M. This multiplier is important for cost calculations: you pay API costs once per conversation, but extraction yields several training examples from each one.
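Most of that multiplier comes from `_extract_individual_turns`: every adjacent (user, assistant) pair in a log becomes its own single-turn example. A toy version of the pairing logic:

```python
# Illustrative two-turn log; contents are made up for the example.
messages = [
    {"role": "system", "content": "You are Alex, a budget shopper..."},
    {"role": "user", "content": "Is the speaker waterproof?"},
    {"role": "assistant", "content": "ACTION: ask\nPRODUCT: prod_0001\nDETAILS: Is it waterproof?"},
    {"role": "user", "content": "Customer Service: Yes, rated IPX7."},
    {"role": "assistant", "content": "ACTION: purchase\nPRODUCT: prod_0001\nDETAILS: I'll take it."},
]

# Every (user, assistant) adjacency yields one single-turn example.
pairs = [
    (messages[i]["content"], messages[i + 1]["content"])
    for i in range(len(messages) - 1)
    if messages[i]["role"] == "user" and messages[i + 1]["role"] == "assistant"
]
assert len(pairs) == 2  # a 2-turn log already yields 2 single-turn examples
```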
6. Scaling: Cost Analysis for 1M Conversations
Token Consumption per Interaction
Each agent interaction involves a system prompt, user messages, and assistant responses across multiple turns.
Token Consumption per Interaction by Scenario Type
| Scenario | System Prompt | Avg User Tokens | Avg Assistant Tokens | Avg Turns | Total Tokens |
|---|---|---|---|---|---|
| Product Question | 350 | 800 | 600 | 4.2 | 6,230 |
| Negotiation | 350 | 1,100 | 900 | 6.1 | 12,550 |
| Product Comparison | 350 | 1,400 | 1,200 | 3.8 | 10,230 |
| Review Writing | 350 | 500 | 800 | 2.0 | 2,950 |
| Weighted Average | 350 | 900 | 800 | 4.0 | 7,150 |
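The per-row totals follow total = system prompt + (avg user + avg assistant) × avg turns. Note this counts each token once; billed input tokens run higher in practice because the full conversation history is re-sent to the API on every turn.

```python
# Spot-check the table arithmetic against the stated formula.
def total_tokens(system, user, assistant, turns):
    return round(system + (user + assistant) * turns)

assert total_tokens(350, 800, 600, 4.2) == 6_230     # product question row
assert total_tokens(350, 1400, 1200, 3.8) == 10_230  # comparison row
```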
Cost at Scale
Cost for 1M Conversations by Model Tier
| Model | Input $/1M tok | Output $/1M tok | Cost/Conversation | Cost for 1M | Time (50 RPS) |
|---|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $0.030 | $30,000 | 5.6 hours |
| GPT-4o-mini | $0.15 | $0.60 | $0.0020 | $2,000 | 5.6 hours |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.042 | $42,000 | 5.6 hours |
| Claude 3.5 Haiku | $0.80 | $4.00 | $0.008 | $8,000 | 5.6 hours |
| Llama 70B (self-hosted, 8xA100) | $0.00 | $0.00 | $0.001* | $1,000* | ~14 hours |
*Estimated amortized GPU cost; self-hosting has no per-token API fees.
The sweet spot is GPT-4o-mini or Llama 70B self-hosted. At ~$0.002 per conversation and 3-8 extracted training examples per conversation, the effective cost per training example is $0.00025-$0.00067. This is 4-5 orders of magnitude cheaper than expert human annotation.
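The per-example arithmetic at the GPT-4o-mini price point:

```python
# $2,000 for 1M conversations, with 3-8 training examples extracted
# from each conversation.
cost_per_conversation = 2_000 / 1_000_000  # = $0.002
best_case = cost_per_conversation / 8
worst_case = cost_per_conversation / 3
assert round(best_case, 5) == 0.00025
assert round(worst_case, 5) == 0.00067
```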
Cost per Training Example by Method
(Chart omitted: relative cost index per training example by generation method, log scale.)
7. Quality Control: The Judge Model
Raw simulated conversations vary in quality. Some are realistic and useful; others are repetitive, out of character, or contain logical errors. A judge model scores each interaction across multiple dimensions.
import json
import random


class QualityJudge:
    """
    Scores conversation logs for quality using a judge model.
    Filters out low-quality interactions before they enter
    the training dataset.
    """

    def __init__(self, judge_llm_client):
        self.judge = judge_llm_client

    async def score_conversation(self, log):
        """
        Score a conversation log on multiple quality dimensions.
        Returns a dict of dimension -> score (1-5).
        """
        conversation_text = self._format_conversation(log)
        judge_prompt = f"""You are a data quality evaluator. Score the following
simulated marketplace conversation on each dimension from 1 (terrible)
to 5 (excellent). The conversation was generated by an AI agent with
this persona: {log.persona.shopping_style} shopper,
{log.persona.communication_style} communication style,
{log.persona.expertise_level} expertise.

CONVERSATION:
{conversation_text}

Score each dimension. Output ONLY a JSON object:
{{
    "naturalness": <1-5>,
    "persona_consistency": <1-5>,
    "information_density": <1-5>,
    "diversity": <1-5>,
    "helpfulness": <1-5>,
    "logical_coherence": <1-5>
}}

Scoring guide:
- naturalness: Does this read like a real customer interaction?
- persona_consistency: Does the agent stay in character throughout?
- information_density: Is the conversation substantive, not just filler?
- diversity: Does this conversation differ from typical templated exchanges?
- helpfulness: Would this be useful training data for a customer service model?
- logical_coherence: Are the agent's decisions logically consistent with its stated preferences and budget?
"""
        response = await self.judge.chat([
            {"role": "user", "content": judge_prompt}
        ])
        try:
            scores = json.loads(response["content"])
            scores["composite"] = sum(scores.values()) / len(scores)
            return scores
        except json.JSONDecodeError:
            return None

    def _format_conversation(self, log):
        text = ""
        for msg in log.messages:
            if msg["role"] == "system":
                continue
            # In the simulation, the shopping agent speaks as "assistant";
            # everything else is the environment.
            role = "Customer" if msg["role"] == "assistant" else "System"
            text += f"[{role}]: {msg['content']}\n\n"
        return text

    async def filter_dataset(
        self,
        logs,
        min_composite_score=3.5,
        min_dimension_score=2.0,
        sample_rate=0.1,
    ):
        """
        Score a sample of conversations and filter the full dataset.

        Scoring every conversation is expensive. Instead:
        1. Score a random 10% sample
        2. Train a lightweight classifier on scored examples
        3. Apply classifier to the full dataset

        For simplicity here, we score the sample and extrapolate.
        """
        rng = random.Random(42)
        sample_indices = rng.sample(
            range(len(logs)),
            k=int(len(logs) * sample_rate),
        )
        scored = []
        for idx in sample_indices:
            scores = await self.score_conversation(logs[idx])
            if scores:
                scored.append((idx, scores))

        # Filter: keep conversations above threshold
        kept_indices = set()
        rejected_count = 0
        for idx, scores in scored:
            if scores["composite"] < min_composite_score:
                rejected_count += 1
                continue
            # Check no single dimension is below floor
            below_floor = any(
                v < min_dimension_score
                for k, v in scores.items()
                if k != "composite"
            )
            if below_floor:
                rejected_count += 1
                continue
            kept_indices.add(idx)

        # Estimate rejection rate and apply to unscored examples
        scored_keep_rate = len(kept_indices) / len(scored) if scored else 0
        rejection_rate = 1 - scored_keep_rate

        # For unscored examples, keep based on estimated rate
        all_kept = list(kept_indices)
        sample_set = set(sample_indices)
        unscored_indices = [
            i for i in range(len(logs)) if i not in sample_set
        ]
        # Keep proportionally
        num_unscored_to_keep = int(
            len(unscored_indices) * scored_keep_rate
        )
        all_kept.extend(
            rng.sample(unscored_indices, k=num_unscored_to_keep)
        )
        kept_logs = [logs[i] for i in sorted(all_kept)]
        return {
            "kept": kept_logs,
            "original_count": len(logs),
            "kept_count": len(kept_logs),
            "rejection_rate": rejection_rate,
            "sample_scores": scored,
        }
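The docstring in filter_dataset mentions the scalable variant: train a lightweight classifier on the judge-scored sample, then apply it to the unscored 90% instead of keeping a random proportion. Here is a minimal sketch of that idea using a single cheap feature (token count) and a threshold fit on the sample. The names cheap_features and fit_threshold are illustrative, not from the pipeline above; a real implementation would use richer features and a proper classifier.

```python
def cheap_features(text):
    """Cheap proxies for quality: token count and lexical variety."""
    tokens = text.split()
    if not tokens:
        return (0, 0.0)
    return (len(tokens), len(set(tokens)) / len(tokens))


def fit_threshold(samples):
    """
    samples: list of (conversation_text, keep_bool) pairs from the
    judge-scored sample. Picks the token-count threshold that best
    separates keep from reject -- a stand-in for a real classifier.
    """
    scored = sorted((cheap_features(text)[0], keep) for text, keep in samples)
    best_thr, best_acc = 0, 0.0
    for thr in (length for length, _ in scored):
        correct = sum((length >= thr) == keep for length, keep in scored)
        acc = correct / len(scored)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return best_thr


# Usage: fit on the scored sample, then classify unscored conversations
sample = [
    ("short", False),
    ("a much longer and more substantive reply " * 5, True),
]
thr = fit_threshold(sample)
keep = lambda text: cheap_features(text)[0] >= thr
```

The design choice here is the usual precision/cost tradeoff: the judge model provides labels, and a near-free classifier generalizes them, so only 10% of conversations incur a judge call.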
Quality Distribution from Real Runs
Quality Score Distribution from 100K Scored Conversations
| Dimension | Mean | Median | p10 | p90 | % Below 3.0 |
|---|---|---|---|---|---|
| Naturalness | 3.72 | 4.0 | 2.5 | 4.5 | 18% |
| Persona consistency | 4.01 | 4.0 | 3.0 | 5.0 | 8% |
| Information density | 3.45 | 3.5 | 2.0 | 4.5 | 24% |
| Diversity | 3.28 | 3.0 | 2.0 | 4.5 | 30% |
| Helpfulness | 3.61 | 4.0 | 2.5 | 4.5 | 19% |
| Logical coherence | 3.89 | 4.0 | 3.0 | 5.0 | 11% |
| Composite | 3.66 | 3.75 | 2.8 | 4.3 | 15% |
With a composite threshold of 3.5, approximately 65% of conversations pass filtering. From 1M raw conversations you retain ~650K, which yield roughly 2M-5M training examples after extraction.
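As a quick sanity check on those numbers, the yield arithmetic works out as follows (the 3-8x multiplier range is illustrative, taken from the extraction stage's per-conversation example counts):

```python
raw_conversations = 1_000_000
pass_rate = 0.65                      # composite threshold of 3.5
kept = int(raw_conversations * pass_rate)

# Each kept conversation extracts to roughly 3-8 training examples,
# bracketing the ~2M-5M figure quoted above
low_yield = kept * 3
high_yield = kept * 8
print(f"{kept:,} conversations -> {low_yield:,} to {high_yield:,} examples")
```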
8. Complete Pipeline: End to End
async def run_full_pipeline():
    """
    Complete pipeline: environment setup -> agent creation ->
    simulation -> extraction -> quality filtering -> output.
    """
    # 1. Initialize environment
    env = MarketplaceEnvironment(seed=42)
    env.generate_catalog(num_products=500)

    # 2. Generate personas and create agents
    rng = random.Random(42)
    personas = generate_personas(num_personas=10000, rng=rng)
    llm_client = AsyncLLMClient(
        model="gpt-4o-mini",
        max_retries=3,
        timeout=30,
    )
    agents = [
        ShoppingAgent(persona=p, environment=env, llm_client=llm_client)
        for p in personas
    ]

    # 3. Run simulation
    runner = SimulationRunner(
        environment=env,
        agents=agents,
        llm_client=llm_client,
        max_concurrent=50,
        output_path="raw_conversations.jsonl",
    )
    stats = await runner.run_simulation(interactions_per_agent=100)
    print(f"Simulation complete: {json.dumps(stats, indent=2)}")

    # 4. Extract training examples
    extractor = DataExtractor()
    examples = extractor.extract_all(runner.logs)
    raw_count = extractor.to_jsonl("raw_examples.jsonl")
    print(f"Extracted {raw_count} raw training examples")

    # 5. Quality filtering
    judge_client = AsyncLLMClient(model="gpt-4o", max_retries=3)
    judge = QualityJudge(judge_llm_client=judge_client)
    filter_result = await judge.filter_dataset(
        runner.logs,
        min_composite_score=3.5,
        min_dimension_score=2.0,
        sample_rate=0.1,
    )
    print(
        f"Quality filter: {filter_result['kept_count']}/"
        f"{filter_result['original_count']} conversations kept "
        f"({filter_result['rejection_rate']*100:.1f}% rejected)"
    )

    # 6. Re-extract from filtered conversations only
    filtered_extractor = DataExtractor()
    filtered_examples = filtered_extractor.extract_all(filter_result["kept"])
    final_count = filtered_extractor.to_jsonl("filtered_examples.jsonl")
    print(f"Final dataset: {final_count} training examples")

    # 7. Summary statistics
    example_types = {}
    for ex in filtered_examples:
        t = ex["type"]
        example_types[t] = example_types.get(t, 0) + 1
    print("Example type distribution:")
    for t, count in sorted(example_types.items()):
        print(f"  {t}: {count} ({count/final_count*100:.1f}%)")
    return final_count
# Expected output (approximate):
# Simulation complete: {
#   "total_interactions": 985432,
#   "total_turns": 3941728,
#   "total_purchases": 147814,
#   "total_reviews": 89231,
#   "total_negotiations": 196986,
#   "errors": 14568
# }
# Extracted 4,231,856 raw training examples
# Quality filter: 640,531/985,432 conversations kept (35.0% rejected)
# Final dataset: 2,748,281 training examples
# Example type distribution:
#   action_prediction: 412,242 (15.0%)
#   multi_turn_conversation: 640,531 (23.3%)
#   review_generation: 58,021 (2.1%)
#   single_turn: 1,637,487 (59.6%)
9. Output Format: Training-Ready JSONL
The final output is JSONL formatted for direct consumption by training frameworks.
# Multi-turn conversation example
{
  "type": "multi_turn_conversation",
  "conversation": [
    {
      "role": "user",
      "content": "You are looking at this product:\n\nProduct: Wireless Noise-Canceling Headphones Pro\nCategory: electronics\nPrice: $249.99 (was $329.99)\nRating: 4.3/5.0 (847 reviews)\n..."
    },
    {
      "role": "assistant",
      "content": "Hmm, $249.99 is a bit steep for me. I've been looking at headphones under $150. A few questions before I rule these out:\n\n1. What's the battery life like? I commute 2 hours daily and can't deal with charging every day.\n2. How's the noise canceling on public transit specifically? Some headphones are great for offices but terrible with low-frequency rumble.\n3. Is there a wired option as backup?\n\nACTION: ask\nPRODUCT: prod_0042\nDETAILS: Battery life, transit noise canceling performance, wired backup option"
    },
    {
      "role": "user",
      "content": "Customer Service: Wireless Noise-Canceling Headphones Pro - High-quality wireless noise-canceling headphones pro for electronics use. Specs: {\"brand\": \"spec_brand_3\", \"warranty\": \"spec_warranty_2\", \"connectivity\": \"spec_connectivity_1\", \"weight\": \"spec_weight_4\"}. Currently rated 4.3/5.0 with 847 reviews."
    },
    {
      "role": "assistant",
      "content": "The specs are pretty generic - no specific battery life number? That's a red flag for me. And at $249.99, I'd want to see at least 30 hours of battery life to justify the price premium over something like the $89 budget option.\n\nI'm going to pass on these. The price-to-information ratio isn't there. If the listing can't even specify battery life, I don't trust the noise canceling claims either.\n\nACTION: decline\nPRODUCT: prod_0042\nDETAILS: Price too high for unclear specifications. Moving on."
    }
  ],
  "metadata": {
    "agent_persona": "budget",
    "scenario_type": "product_question",
    "num_turns": 2,
    "actions": ["ask", "decline"]
  }
}
The JSONL format is directly compatible with standard training pipelines. For chat-format fine-tuning (OpenAI, Anyscale, Axolotl), the multi-turn examples map to the messages field. For instruction-tuning (Alpaca format), the single-turn examples map to instruction/output pairs. The extraction step can output both formats from the same raw conversations.
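As a sketch of that dual-format output, here is a hypothetical to_alpaca helper (not part of the extraction pipeline above) that maps a single-turn chat example to Alpaca-style instruction/output fields:

```python
def to_alpaca(example):
    """
    Convert a single-turn chat-format example to an Alpaca-style
    record. Assumes exactly one user turn followed by one assistant
    turn, as the single_turn extractor produces.
    """
    user_msg, assistant_msg = example["conversation"][:2]
    return {
        "instruction": user_msg["content"],
        "input": "",
        "output": assistant_msg["content"],
    }


example = {
    "type": "single_turn",
    "conversation": [
        {"role": "user", "content": "Does this model have a wired backup option?"},
        {"role": "assistant", "content": "ACTION: ask\nPRODUCT: prod_0042\nDETAILS: wired backup option"},
    ],
}
record = to_alpaca(example)
```

Multi-turn examples need no conversion for chat-format fine-tuning; the conversation list is already a messages array.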
10. Comparison: Agent Simulation vs. Other Methods
Synthetic Data Method Comparison
| Method | Cost per 1M Examples | Multi-Turn | Contextual | Diverse Personas | Setup Complexity |
|---|---|---|---|---|---|
| Single-shot prompting | $10K-20K | No | No | Partial | Low |
| Evol-Instruct (WizardLM) | $5K-15K | No | No | No | Medium |
| Self-Instruct | $3K-8K | No | No | No | Low |
| Agent simulation (API) | $2K-8K | Yes | Yes | Yes | High |
| Agent simulation (self-hosted) | $0.5K-1K | Yes | Yes | Yes | High |
| Human annotation | $3M-50M | Yes | Yes | Yes | Very High |
Agent simulation costs 2-5x more than single-shot methods per raw conversation, but the extraction multiplier (3-8 examples per conversation) and the contextual richness make the per-example cost comparable or lower. The primary advantage is data type: agent simulation produces multi-turn, persona-consistent, environment-grounded conversations that single-shot methods cannot generate.
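The per-example comparison can be worked out with a back-of-envelope helper, using the table's API cost range, the ~65% pass rate, and the 3-8x extraction multiplier (the bounds here are illustrative):

```python
def cost_per_final_example(run_cost_usd, num_conversations, pass_rate, multiplier):
    """USD cost per final training example after filtering and extraction."""
    final_examples = num_conversations * pass_rate * multiplier
    return run_cost_usd / final_examples


# Agent simulation via API: $2K-$8K per 1M raw conversations
best = cost_per_final_example(2_000, 1_000_000, 0.65, 8)   # cheap run, high multiplier
worst = cost_per_final_example(8_000, 1_000_000, 0.65, 3)  # pricey run, low multiplier
# vs. single-shot prompting at $10K-$20K per 1M examples, i.e. $0.01-$0.02 each
print(f"${best:.5f} to ${worst:.5f} per example")
```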
Summary
The agent simulation pipeline has six stages:
- Environment construction: A marketplace with 500 products, prices, inventory, and review tracking. The environment enforces consistency constraints that keep generated conversations grounded in shared state.
- Persona generation: 10,000 agents with defined shopping styles, budgets, communication patterns, and expertise levels. Persona diversity drives behavioral diversity in the output data.
- Interaction simulation: Each agent runs 100 scenarios (questions, negotiations, comparisons, reviews) across multiple turns. Async execution with 50 concurrent API calls processes 1M conversations in 5-14 hours depending on the model.
- Data extraction: Each conversation produces 3-8 training examples across four types (multi-turn, single-turn, action prediction, review generation). 1M conversations yield 3M-8M raw examples.
- Quality filtering: A judge model scores 10% of conversations on six dimensions. With a composite threshold of 3.5, approximately 65% of conversations pass, yielding 2M-5M final training examples.
- Cost: $2K-$8K for 1M conversations using GPT-4o-mini via API, or roughly $1K self-hosted. Per final training example: under $0.001.
The setup complexity is the tradeoff. Building the environment, persona system, and extraction pipeline requires 500-800 lines of code and 1-2 days of development. But once built, the pipeline generates unlimited training data at marginal cost.